This is a self-archived – parallel-published version of an original article. This 
version may differ from the original in pagination and typographic details. 
When using please cite the original. 
 
 
AUTHORS: Maryam Teimouri, Jenna Kanerva, Filip Ginter  
 
 
TITLE: A Deep Dive into Multi-Head Attention and Multi-Aspect Embedding 
 
 
 
YEAR: 2025  
 
 
DOI: 10.26615/978-954-452-098-4-146  
 
 
VERSION: Publishers PDF 
 
 
 
 
CITATION: Teimouri, Maryam, Kanerva Jenna & Ginter Filip (2025). A Deep Dive 
into Multi-Head Attention and Multi-Aspect Embedding. Proceedings of 2025 
RANLP Recent Advances in Natural Language Processing Conference, 1263–1270. 
https://doi.org/10.26615/978-954-452-098-4-146  
 
 
 
 
LICENSE: CC-BY  
 
 
 
Proceedings of Recent Advances in Natural Language Processing,pages 1263–1270
Varna, Sep 8–10, 2025
https://doi.org/10.26615/978-954-452-098-4-146
1263
A Deep Dive into Multi-Head Attention and Multi-Aspect Embedding
Maryam Teimouri
TurkuNLP
University of Turku
mtebad@utu.fi
Jenna Kanerva
TurkuNLP
University of Turku
jmnybl@utu.fi
Filip Ginter
TurkuNLP
University of Turku
figint@utu.fi
Abstract
Multi-vector embedding models play an
increasingly important role in retrieval-
augmented generation, yet their internal
behaviour lacks comprehensive analysis. We
conduct a systematic, head-level study of the
32-head Semantic Feature Representation
(SFR) encoder with the FineWeb corpus
containing 10 billion tokens. For a set of
4,000 web documents, we pair head-specific
embeddings with GPT-4o topic annotations and
analyse the results using t-SNE visualisations,
heat maps, and a 32-way logistic probe.
The analysis shows that (i) clear semantic
separation between heads emerges only at an
intermediate layer, (ii) some heads align with
specific topics while others capture broader
corpus features, and (iii) naive pooling of head
outputs can blur these distinctions, leading
to frequent topic mismatches. The study
offers practical guidance on where to extract
embeddings, which heads may be pruned,
and how to aggregate them to support more
transparent and controllable retrieval pipelines.
1 Introduction
Recent advances in retrieval-augmented models
have significantly improved large language models’
(LLMs) ability to access and reason over exter-
nal knowledge (Lie´vin et al., 2024). A key con-
tributor to this progress is the use of multi-vector
and multi-head embedding strategies, which en-
able richer and more interpretable document rep-
resentations (Khattab and Zaharia, 2020). These
methods have shown strong performance in com-
plex retrieval tasks involving multi-faceted queries.
However, critical questions remain about how these
embeddings operate internally. In this work, we
focus specifically on multi-head representations.
While multi-vector approaches also produce mul-
tiple embeddings per input, some do so without
relying on attention heads. Our analysis is focused
on head-based methods, where each representa-
tion corresponds to a specific attention head. Do
multi-vector models capture more aspects simply
because they use larger embedding spaces, or do
individual heads learn distinct, complementary fea-
tures? How much do different heads overlap in
what they represent? How far apart are their outputs
in the embedding space? Do all heads contribute
meaningfully, or could some be pruned without
impacting performance? These questions point to
a need for more transparent, fine-grained analysis
of multi-head embedding behavior. Understanding
where to extract embeddings within a model is an
important consideration for analyzing their behav-
ior. Previous works ((Zheng et al., 2024), (Besta
et al., 2024)) has suggested using representations
from the final attention layer, under the assumption
that this stage captures the most meaningful struc-
ture. However, in the current models’ architecture,
substantial transformation occurs after the final at-
tention block, which may influence the usefulness
or interpretability of these embeddings. Without
this consideration, downstream evaluations may
obscure or misrepresent the functional roles of in-
dividual heads.
In this paper, we investigate the relationship
between multi-head attention embeddings and
document-level topic structures. To bridge the inter-
pretability gap, we propose a visualization method
that maps document-topic alignment to individual
head activations, uncovering latent structure within
the embedding space. Our work is supported by an
automated data pipeline, including topic label an-
notation of web documents using LLMs. Through
a series of visualization experiments and similarity-
based evaluations, we examine how alignment with
topics varies across heads, how responsive indi-
vidual heads are to different topics, and how head
activity levels influence the resulting document em-
beddings and their representations. In addition, we
1264
conduct a detailed examination of the model’s in-
ternal structure to identify the most informative
stage for embedding extraction, so that the repre-
sentations we analyze reflect meaningful semantic
organization. This process is guided and validated
through visualization, allowing us to isolate and
study embedding behavior with greater precision.
Furthermore, we run comparisons revealing pos-
sible mismatches and inconsistencies between ex-
ternal label assignments and internal embedding
structures.
2 Related Work
Understanding what pre-trained transformers at-
tend to has become a central topic in NLP inter-
pretability research. One influential study investi-
gates the attention mechanisms in BERT, reveal-
ing that many attention heads focus on syntactic
roles such as determiners, prepositional objects,
and coreference links. These findings suggest that
BERT captures rich syntactic structures internally
through its attention layers (Clark et al., 2019).
Building on this idea, other research has exam-
ined the contribution of individual attention heads
in multi-head self-attention architectures. It was
found that a small subset of highly specialized
heads carries most of the model’s performance bur-
den. Using a novel pruning approach, the authors
showed that a significant portion of heads can be
removed with negligible performance drop, high-
lighting redundancy in attention layers and point-
ing to opportunities for model compression (Voita
et al., 2019). Head pruning can be useful in model
quantization, where reducing computational cost
without performance loss is a central goal. Trans-
former quantization research by Bondarenko et al.
(Bondarenko et al., 2023) has shown that strong ac-
tivation outliers often originate from specific atten-
tion heads that attempt to perform ”no-op” updates
by pushing attention scores to extremes. These
outliers hinder low-bit quantization. To address
this, clipped softmax and gated attention mecha-
nisms are introduced to suppress such behaviors
during training. This reduces outlier magnitude
and improves quantization compatibility without
sacrificing model performance. Beyond language
and efficiency, similar head-wise specialization has
been observed in other modalities, such as music.
In generative music modeling, attention head prob-
ing has revealed that individual heads can indepen-
dently capture distinct musical properties, such as
instrument identity or rhythm. This head-wise spe-
cialization supports more interpretable and control-
lable generation, suggesting parallels to the modu-
lar roles observed in language models (Koo et al.,
2024).
While these studies highlight the role of atten-
tion in interpretability, efficiency, and control, re-
cent work also explores how attention can support
retrieval-based generation. Retrieval-Augmented
Generation (RAG) (Jurafsky and Martin, 2023)
combines traditional retrieval techniques with neu-
ral generation by first retrieving relevant documents
and then conditioning a language model on both the
query and the retrieved content. This hybrid frame-
work allows models to access external knowledge
beyond their training data, addressing limitations
in parametric memory and improving factual accu-
racy in open-domain tasks. Building on both ap-
proaches, Multi-Head RAG (MRAG) (Besta et al.,
2024) extends RAG to handle complex queries that
require synthesizing information from semantically
diverse sources. Unlike standard RAG, which relies
on a single embedding vector for retrieval, MRAG
constructs a multi-aspect embedding by leveraging
the activations from the Transformer’s multi-head
attention layer, capturing diverse semantic facets
of the input. This design utilizes different heads
to specialize in distinct semantic aspects, enhanc-
ing recall for multi-faceted queries. MRAG has
been shown to achieve up to 20% improvements
in relevance over baseline methods and integrates
seamlessly with existing RAG pipelines and eval-
uation frameworks such as RAGAS (Tendle et al.,
2023). However, important challenges remain. In
particular, the interpretability of individual atten-
tion head contributions is not well understood, and
the mechanisms by which different heads special-
ize in distinct semantic dimensions are still unclear.
3 Methodology
To begin our investigation, we critically exam-
ined a core assumption underlying MRAG: that
the multi-vector’s embedding models’ multiple at-
tention heads capture semantically distinct aspects
of document content. While this claim underpins
the model’s design, it remains unclear whether the
observed performance gains stem from meaning-
ful specialization across heads or simply from the
larger embedding space and computational capac-
ity.
1265
3.1 Data
Effective evaluation of embedding models requires
large-scale, high-quality datasets. The FineWeb
corpus (Penedo et al., 2024) offers precisely this: a
rich, web-scale dataset that supports both retrieval
benchmarking and visualization tasks aimed at un-
covering semantic structure in large embedding
spaces.
To analyze how individual attention heads re-
spond to different types of content, documents need
to be labeled with meaningful categories. To con-
tinue this line of investigation using the FineWeb
dataset, we applied topic labeling using GPT-4o
(OpenAI, 2024). In the FineWeb paper, Appendix
F.3 (”Topic Distribution”) presents a list of topics
and their corresponding distributions. While these
topics were originally intended for classification
tasks, several of them exhibited semantic overlap
(e.g., Math, Formulae, Education and Math, Educa-
tion, Teaching), while others were overly specific
(e.g., Sports, Football, Soccer).
To address this, we merge the original 39 topics
and reorganize them into 25 broader categories,
aiming to minimize redundancy while ensuring
comprehensive coverage across all major themes.
Using SFR embeddings (Meng et al., 2024), we
generate a heat map of topic similarities with co-
sine similarity (Salton et al., 1983) as a distance
metric. This visualization highlights the seman-
tic relationships between the original topics and
guides the merging process. As shown in Figure 1,
the selected topics exhibit sufficient differentiation
to support meaningful classification and analysis.
Figure 1: Heatmap of merged topics and their similari-
ties using the SFR embeddings and cosine similarity.
3.2 Model
To continue our investigation, we need to choose
a multi-vector embedding-based model. The Se-
mantic Feature Representation (SFR) model (Meng
et al., 2024), particularly in its Mistral variant, is
designed for dense retrieval tasks. What makes the
Mistral-based structure especially relevant for our
analysis is its multi-head projection design. This
architecture allows multiple ways to extract atten-
tion head representations. One approach, stage 1,
involves using the embedding vectors directly af-
ter the attention layer. This is the stage used in
some previous work (Besta et al., 2024); however,
since residual connections and normalization layers
follow the attention mechanism, important transfor-
mations may still occur afterward. To account for
this, we define two additional stages for embedding
extraction that capture these later processing steps.
The model involves 32 layers, where each layer
involves layer normalizations, grouped-query at-
tention (Ainslie et al., 2023), residual connec-
tions, as well as up- and down-projection. In
the grouped-query attention, the query tensor
has the shape [batch size, seq len, 32,
head dim], while both the key and value ten-
sors are shaped [batch size, seq len, 8,
head dim]. To enable attention computation,
the key tensor is repeated four times along the
head dimension, effectively transforming its shape
from [batch, 8, seq len, head dim] to
[batch, 32, seq len, head dim]. This
means that attention heads within a group share
the same key and value weight while each have
different query weights. This architectural detail
is crucial for understanding how information is
distributed and reused across heads, and provides
a concrete foundation for interpreting the embed-
ding behavior in later stages of our analysis. As
part of the attention, the heads are concatenated
and transformed through the first projection layer
(o proj), whose output serves as the second point
from which we extract embeddings, stage 2.
After the attention, the model includes a residual
connection (summing the original layer input with
the attention output), layer normalization, up- and
down-projecting (projecting from a hidden size of
4096 to an intermediate size of 14336 and back),
as well as a second residual connection (summing
the output of the previous residual with the current
projected output). This produces the final model
outputs (stage 3), which we slice into 32 parts rep-
resenting the attention heads. We note that the
connection between the actual attention heads and
these 32 slices is not necessarily preserved, due to
1266
the two intervening projection layers. To aid in un-
derstanding the different stages, Figure 2 provides
a schematic overview of the model architecture and
indicates where embeddings are extracted at each
stage.
Figure 2: An overview of the model architecture and
stages
3.3 Analysis
To begin our analysis, we first visualize the atten-
tion heads at stage 1. We use 10,000 documents
from the FineWeb Subset: sample-10BT, and SFR
embeddings. Each document is processed to pro-
duce 32 head-specific embeddings, which we then
project into a two-dimensional space using the t-
SNE method (van der Maaten and Hinton, 2008).
In the resulting plot, each small dot represents one
head-specific embedding for a document, yielding
32 dots per document. The dots are color-coded
according to their corresponding head index, allow-
ing us to observe clustering patterns and potential
distinctions between the roles of different heads.
To support exploration of the repeated key tensor,
we assigned eight main colors to the 32 heads and
varied the shades within each color to correspond
to the fourfold repetition. This setup allows us
to visually examine whether any patterns related
to the repetition are noticeable in the embedding
space. As shown in Figure 3a, the 32 attention
heads do not form clearly separable clusters. This
suggests that several heads may be overlapping or
producing similar representations, as indicated by
different-colored dots (e.g., purple) appearing on
top of areas dominated by another color (e.g., pink).
To explore this further, we zoomed in on heads 1–4
(Figure 3b) and observed that some heads appear
to be covered by others and are located in close
proximity, with only subtle differences in shade
within the same main color (red). This reinforces
the idea that head-level representations are not yet
fully distinguishable at this stage of the model. Al-
though the broader groups appear to cluster well,
several individual heads within a group often over-
lap or lie very close together, suggesting limited
differentiation among heads in the same group.
At stage 2 (Figure 4a), we extract the embed-
dings after the projection layer that follows the
multi-head attention layers, where the output is
sliced to get the 32 heads. Each head represented
by a distinct color consistent with the color scheme
of stage 1, are now visible as clearly separated
groups. Zooming into heads 1 to 4 (Figure 4b) re-
veals that these heads are no longer overlapping;
instead, there is noticeable space between them.
Overall, Figure 4 suggests a transition from stage 1,
where some previously observed patterns begin to
fade while new ones emerge. The attention heads
appear to be becoming more independent in their
behavior.
To further investigate the 32 attention heads, we
take a different approach by directly slicing the
model output to extract the individual head repre-
sentations (stage 3). In Figure 5a, the heads are
color-coded, and the shading indicates the strength
of each head’s activation: lighter (paler) dots re-
flect weaker activations, while darker dots indi-
cate stronger ones. Notably, the heads are well-
separated and occupy distinct regions in the space,
suggesting that each head captures unique aspects
of the document representations.
For better visualization, heads 1 to 4 were se-
lected and plotted individually. As shown in Figure
5b, these heads are not scattered randomly; instead,
they tend to cluster within distinct regions, indicat-
ing consistent behavior.
3.4 Linear Separability
To quantitatively validate the visual structure ob-
served in the t-SNE projection, we performed a
multi-class classification analysis using logistic re-
gression. Specifically, we aimed to predict the head
index from the vector embeddings to assess the ex-
tent to which the heads are linearly separable in
this representation space. Head numbers served as
class labels, and the associated embedding vectors
were used as features. The data was randomly shuf-
fled prior to training. The classifier was trained on
25,600 samples and evaluated on 6,400 test sam-
ples. The results, presented in Table 1, confirm that
the head-specific embeddings are linearly distin-
guishable.
3.5 Topic Correlation
To evaluate the alignment between the assigned
topics and the SFR model’s representations, we de-
1267
(a) Projection of 32 heads. Not all of the 32 heads are
visible; instead, they are grouped into clusters.
(b) Zoomed into heads 1-4
Figure 3: t-SNE projections with 32 attention heads at stage 1.
(a) Projection of 32 heads. Grouped clusters have turned
into clearly separated individual head clusters.
(b) Zoomed into heads 1-4
Figure 4: t-SNE projections with 32 attention heads at stage 2.
(a) Vector Embedding visualization using t-SNE (b) Single heads visualization using t-SNE
Figure 5: t-SNE projections with 32 attention heads at stage 3.
signed a correlation test. The document set, 4,000
articles from the Fineweb’s subset sample-10BT, is
first classified into our 25 predefined topics using
GPT-4o. The same set of documents, along with
1268
(a) Stage 1
(b) Stage 2
(c) Stage 3
Figure 6: Heatmap showing attention head responses by topic. Each topic contains 120 documents. The vertical
axis begins with full embeddings at the top, followed by attention heads 1 to 32. The horizontal axis represents
document labels A to H, corresponding to the following topics: A. Entertainment, Film, Theater, Music, Arts B.
Business, Finance, Law C. Sports, Teams, Games, News D. Gaming, Technology, Games, Gadgets, Innovation
E. Personal, Family, Leisure F. Health, Nutrition, Diet, Medicine, Diseases, Biology G. Politics, Conflict,
International Affairs H. Places, Travel, Real Estate
the 25 predefined topics, was processed through the
SFR model to generate vector embeddings. Cosine
similarity was applied to evaluate the accuracy of
the topic assignments generated by GPT-4o. The
analysis showed that the correct topic appeared
as the top-1 match for 30.57% of the documents.
When considering the top-5 most similar topics, the
match rate increased to 64.82%, and further rose
1269
Stage Accuracy
Stage 1 0.964
Stage 2 1.000
Stage 3 0.999
Table 1: Accuracy of logistic prob for each stage
to 79.06% when the top-10 matches were taken
into account. Although the top-1 accuracy may
appear modest, the results suggest that the SFR
model captures meaningful topic-related structures
in the embedding space. The subsequent heatmaps
further explore this relationship by visualizing how
different attention heads respond to topic-labeled
documents, providing a more detailed view of topic
sensitivity across the model’s internal representa-
tions.
In Figure 6, a heatmap illustrates the similarity
between 32 vector embeddings (heads) and 960
documents on 7 topics. The color bar on the right
represents the similarity scale, where red indicates
higher similarity and blue indicates lower similar-
ity (i.e., greater dissimilarity). The ”full embed”
row at the top represents the full document embed-
dings. Within each topic column, documents are
sorted based on their similarity between the full
document embedding and the corresponding topic
embedding. In Figure 6a, which corresponds to
stage one, generally sets of four consecutive at-
tention heads (i.e., one group) behave similarly to
each other. However, no clear signal is observ-
able across documents, topics, or individual heads.
In Figure 6b (stage two), this pattern across sets
of four heads disappears, and the heads begin to
behave more independently. Still, no strong align-
ment with document or topic structure is evident.
One notable exception is head 17, which follows
the sorting pattern of the full embedding row, sug-
gesting the emergence of some meaningful struc-
ture. Moving to Figure 6c, attention heads exhibit
different attitudes: for instance, head 10 is selec-
tive, responding only to the topic ’politics, conflict,
international’, whereas head 12 responds to a gen-
eral feature shared across all documents. Head
18 appears to actively avoid one of these shared
characteristics. Meanwhile, head 13 seems sparse,
reacting independently to individual documents,
while head 1 is uniformly smooth, treating all doc-
uments similarly. When sorting documents based
on their full embedding similarity, head 25 aligns
well with this ordering, whereas head 11 does not.
4 Discussion and Conclusion
Our investigation set out to determine whether the
advantages of multi–head document embeddings
stem from genuine semantic specialization. Given
that substantial transformation occurs after the final
attention layer, Stage 1 may be quite early to har-
vest the embeddings. The results provide converg-
ing evidence that head-level specialization does
exist, but also highlight the importance of the layer
from which embeddings are extracted, indicating
that meaningful structure may only emerge at cer-
tain depths of the model. Somewhat unexpectedly,
the strongest topic-wise signal appeared when we
directly sliced the final embedding into 32 parts.
The t-SNE projections in Figures 5a and 5b show
that the 32 heads carve the space into largely dis-
joint regions: each head gives rise to a distinct clus-
tering pattern, suggesting that the embeddings they
produce capture different structural aspects of the
data. The heat-map in Figure 6c further qualifies
this observation.
These results suggest that attention heads do
not act uniformly or redundantly. Instead, they
display specialized, sometimes contrasting behav-
iors—some being topic-specific, others capturing
general or even orthogonal features. This diversity
supports the idea that attention heads operate as
distinct functional units rather than simply forming
a unified embedding vector.
This insight reinforces the value of multi-head
architectures for semantic modeling and highlights
the potential for more targeted embedding extrac-
tion strategies in retrieval-augmented systems.
5 Future Work
Understanding the internal behavior of attention
heads reveals their potential to capture diverse se-
mantic dimensions within complex data. While
current models implicitly learn to attend to dif-
ferent aspects such as topic or style, this process
remains opaque and largely uncontrolled. By mak-
ing these latent distinctions more interpretable and
steerable, we can move toward models that are not
only more accurate but also more adaptable, trans-
parent, and capable of being controlled cheaply
at inference time. This approach is particularly
valuable for complex datasets that contain diverse
features such as language, topic, genre, or regis-
ter, since models trained for specific tasks often
overlook these aspects or fail to leverage them ef-
fectively. In future work, we aim to address this
1270
by developing benchmarks for multi-aspect em-
bedding models (e.g., SFR, stella (Zhang et al.,
2025)) and datasets, enabling us to selectively con-
trol model attention—effectively ”switching on or
off” focus on particular aspects.
Acknowledgments
This research was conducted as part of the EU Hori-
zon project SEUS – Smart European Shipbuilding
(Grant Agreement No. 101096224), funded by
the European Union. Additional support was pro-
vided by the Human Diversity Consortium under
the Profi7 program of the Research Council of Fin-
land. Computational resources were provided by
CSC – IT Center for Science.
References
Joshua Ainslie, James Lee-Thorp, Michiel De Jong,
Yury Zemlyanskiy, Federico Lebro´n, and Sumit Sang-
hai. 2023. Gqa: Training generalized multi-query
transformer models from multi-head checkpoints.
arXiv preprint arXiv:2305.13245.
Maciej Besta, Ales Kubicek, Roman Niggli, Robert
Gerstenberger, Lucas Weitzendorf, Mingyuan Chi,
Patrick Iff, Joanna Gajda, Piotr Nyczyk, Ju¨rgen
Mu¨ller, et al. 2024. Multi-head rag: Solving
multi-aspect problems with llms. arXiv preprint
arXiv:2406.05085.
Yelysei Bondarenko, Markus Nagel, and Tijmen
Blankevoort. 2023. Quantizable transformers: Re-
moving outliers by helping attention heads do noth-
ing. Advances in Neural Information Processing
Systems, 36:75067–75096.
Kevin Clark, Urvashi Khandelwal, Omer Levy, and
Christopher D Manning. 2019. What does bert look
at? an analysis of bert’s attention. arXiv preprint
arXiv:1906.04341.
Dan Jurafsky and James H. H. Martin. 2023. Chap-
ter 14: Question answering and summarization.
https://web.stanford.edu/˜jurafsky/
slp3/14.pdf. Draft chapter from *Speech and
Language Processing*, 3rd ed.
Omar Khattab and Matei Zaharia. 2020. Colbert: Effi-
cient and effective passage search via contextualized
late interaction over bert. In Proceedings of the 43rd
International ACM SIGIR conference on research
and development in Information Retrieval, pages 39–
48.
Junghyun Koo, Gordon Wichern, Franc¸ois G Germain,
Sameer Khurana, and Jonathan Le Roux. 2024. Un-
derstanding and controlling generative music trans-
formers by probing individual attention heads. In
IEEE ICASSP Satellite Workshop on Explainable
Machine Learning for Speech and Audio (XAISA).
Valentin Lie´vin, Christoffer Egeberg Hother, An-
dreas Geert Motzfeldt, and Ole Winther. 2024. Can
large language models reason about medical ques-
tions? Patterns, 5(3).
Laurens van der Maaten and Geoffrey Hinton. 2008.
Visualizing data using t-sne. Journal of Machine
Learning Research, 9(Nov):2579–2605.
Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming
Xiong, Yingbo Zhou, and Semih Yavuz. 2024. Sfr-
embedding-mistral: Enhance text retrieval with trans-
fer learning. https://www.salesforce.com/
blog/sfr-embedding/. Accessed: 2025-04-11.
OpenAI. 2024. Gpt-4o system card. https://arxiv.
org/abs/2410.21276. Accessed: 2025-04-11.
Guilherme Penedo, Hynek Kydlı´cˇek, Anton Lozhkov,
Margaret Mitchell, Colin A Raffel, Leandro
Von Werra, Thomas Wolf, et al. 2024. The fineweb
datasets: Decanting the web for the finest text data
at scale. Advances in Neural Information Processing
Systems, 37:30811–30849.
Gerard Salton, Edward A Fox, and Harry Wu. 1983.
Extended boolean information retrieval. Communi-
cations of the ACM, 26(11):1022–1036.
Atharva Tendle, Nikhil Kandpal, Marzieh Saeidi, Sumit
Bhatia, and Ankur P. Parikh. 2023. Ragas: An evalu-
ation framework for retrieval-augmented generation.
https://arxiv.org/abs/2309.15217. ArXiv
preprint arXiv:2309.14850.
Elena Voita, David Talbot, Fedor Moiseev, Rico Sen-
nrich, and Ivan Titov. 2019. Analyzing multi-
head self-attention: Specialized heads do the heavy
lifting, the rest can be pruned. arXiv preprint
arXiv:1905.09418.
Dun Zhang, Jiacheng Li, Ziyang Zeng, and Fulong
Wang. 2025. Jasper and stella: distillation of sota
embedding models.
Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao
Song, Mingchuan Yang, Bo Tang, Feiyu Xiong, and
Zhiyu Li. 2024. Attention heads of large language
models: A survey. arXiv preprint arXiv:2409.03752.