<aside> 💡
TL;DR: The “remaining problem” in attention sink research is how LLMs know which token is physically the first token in a sequence, so that it can be made into the attention sink. We re-establish past findings that this is NOT due to either RoPE or semantic embeddings. While we find it is causally linked to the fact that only the first token attends exclusively to itself, we do not find any significant change in the first token’s activation as a result. In particular, we do not observe any “outlier dimension” being surfaced that causally leads to the creation of either attention sinks or the massive-activation phenomenon that precedes them.
</aside>
Note: all preliminary studies are done on Qwen3-8B.
Attention sinks are one of the most interesting accidental phenomena in LLM architecture research. Numerous studies have revealed which architectural constraint forces models to develop a sink (answer: the softmax in attention), why it is beneficial for LLMs to bypass this constraint (answer: to prevent over-mixing between tokens in attention), and even which mechanism in the model’s layers creates the attention sink (i.e., massive activations in the residual stream).
However, all of these studies have so far left one question unanswered. In order to put massive, dimension-specific activations on the first token, the model must first identify which token is the first token, i.e., the token that physically exists in memory before all other tokens.
This ability is more mysterious and impressive than it sounds, because the model can identify the physically first token without the help of RoPE. Even if we change the RoPE position of each token, so that the physically first token does not receive the first positional encoding, attention sinks still occur at the physically first token.

Changing the RoPE of the first token to a non-first encoding has no influence.

Changing the RoPE of a non-first token to a first encoding has no influence.
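The two interventions above can be sketched as position-id builders. This is a minimal sketch, assuming a Hugging Face-style causal LM that accepts explicit `position_ids`; the helper names and the usage snippet in the comments are illustrative, not the authors' exact code.

```python
import torch

def shifted_position_ids(seq_len: int, shift: int) -> torch.Tensor:
    """Positions [shift, shift+1, ...] so the physically-first token
    no longer receives RoPE position 0."""
    return torch.arange(shift, shift + seq_len).unsqueeze(0)

def swapped_position_ids(seq_len: int, target: int) -> torch.Tensor:
    """Give a non-first token (index `target`) the first encoding by
    swapping its RoPE position with position 0."""
    pos = torch.arange(seq_len)
    pos[0], pos[target] = pos[target].clone(), pos[0].clone()
    return pos.unsqueeze(0)

# Usage (not run here): pass the custom positions to the model and measure
# attention received by the physically-first token (key index 0), e.g.
#   out = model(input_ids, position_ids=shifted_position_ids(L, 5),
#               output_attentions=True)
#   sink_score = torch.stack(out.attentions)[..., 1:, 0].mean()
```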
A convenient hypothesis is that the model identifies the physically first token by statistics: it simply recognizes tokens that, by their semantic content, are most likely to occur in the first position, and attends to them. In real text, these statistically “first-position” tokens do in fact occur at the first position, causing the attention sink to appear on the first token.
To test this hypothesis, for 32 naturally occurring sequences of 512 tokens, we iterate through the entire vocabulary of the Qwen3 tokenizer, place each token in the first position, and compute the average attention score that token receives across all layers and all samples. We find that almost all tokens have similar attention scores, tightly clustered around a mean of 0.56. This shows that the attention sink does NOT occur primarily because of special semantic embeddings in some tokens.

Almost all tokens show a similar mean attention score from other tokens when placed in the first token position.
We then rank all tokens by their average attention score in descending order. We find that a few special tokens have scores significantly above the mean, and some have scores significantly below it. This shows that attention sinks are partially tied to a token’s semantic content, and can be strengthened or weakened by it in special cases.

Ranking tokens by their mean attention score. A few tokens have outlier attention scores, significantly above or below the mean.
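The scoring step of the vocabulary sweep can be sketched as follows. This is hypothetical scaffolding, not the authors' code: the real sweep swaps each vocab id into position 0 of the 32 natural 512-token sequences, while here only the pooling function is made concrete.

```python
import torch

def mean_sink_score(attentions) -> float:
    """Average attention that non-first query positions pay to key index 0,
    pooled over layers, heads, and queries.

    `attentions` is the tuple returned by a HF-style model called with
    output_attentions=True: one (batch, heads, q, k) tensor per layer."""
    stacked = torch.stack(attentions)            # (layers, batch, heads, q, k)
    return stacked[..., 1:, 0].mean().item()

# Sweep skeleton (not run here):
#   for tok_id in range(vocab_size):
#       input_ids[:, 0] = tok_id
#       out = model(input_ids, output_attentions=True)
#       scores[tok_id] = mean_sink_score(out.attentions)
#   ranked = scores.argsort(descending=True)     # surface outlier tokens
```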
The second hypothesis is based on the observation that the first token is the only token that performs no real attention computation. Since it can only attend to itself, the attention layer essentially reduces to an MLP for the first token. The hypothesis is that attending only to itself gives the first token’s activation a special pattern, which the model can identify as the signature of the first token.
To test this hypothesis, for several non-first tokens (tokens 1/4/8, 0-indexed), we mask out their attention to all previous tokens in the first two layers, and compare the average attention that subsequent tokens allocate to them versus to the first token. We find that subsequent tokens allocate comparable attention to these masked tokens and to the first token, effectively turning them into attention sinks as well. This supports the idea that pure self-attention exposes a special activation pattern that the model can use to register attention sinks.



Masking attention in only the first attention layer drops attention to the first token.



Masking attention in the first two attention layers drops attention to first token even more.
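The mask used in this intervention can be sketched as an additive attention mask in which a chosen token may attend only to itself, mimicking the first token's situation. Applying it to just the first two layers requires per-layer hooks (e.g. on each layer's self-attention module); the helper below is a hypothetical illustration, not a Qwen3 or Hugging Face API.

```python
import torch

def isolate_token_mask(seq_len: int, t: int, neg: float = -1e9) -> torch.Tensor:
    """Causal additive mask of shape (1, 1, seq_len, seq_len) where query
    row `t` is blocked everywhere except the diagonal, so token `t` can
    only attend to itself, like the first token."""
    # Standard causal mask: large negative above the diagonal, 0 elsewhere.
    mask = torch.triu(torch.full((seq_len, seq_len), neg), diagonal=1)
    # Additionally block token t's attention to all earlier keys.
    mask[t, :t] = neg
    return mask.view(1, 1, seq_len, seq_len)

# In the experiment, a mask like this is injected into layers 0 and 1 only
# (e.g. via forward hooks), and the attention that later tokens pay to
# token t vs. token 0 is then compared.
```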
Having established that special activation patterns formed as early as the first two attention layers (and thus the first two blocks) establish the identity of the first token, we want to understand (a) what this special pattern consists of, and (b) how it propagates to later layers that eventually surface the attention sink.
The second question (how the special pattern creates the sink) is relatively straightforward. We find that the first token’s activation triggers an unusually large spike in only two dimensions (out of 4096) in the output of mlp.6. Since this output merges into the residual stream, these two activations are so large that they dominate the entire residual stream of the first token. In every subsequent layer, after RMSNorm, this guarantees that the intermediate activation essentially points in only those two directions.
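A toy numeric check makes the RMSNorm argument concrete: if two of 4096 dimensions carry massive values, the normalized vector points almost entirely along those two directions. The specific indices and magnitudes below are made up for illustration; only the 4096-dim width matches the model.

```python
import torch

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Plain RMSNorm without a learned scale."""
    return x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)

hidden = torch.ones(4096)      # "ordinary" activations of magnitude ~1
hidden[100] = 2000.0           # two outlier dims, standing in for the
hidden[2000] = -2000.0         # massive activations injected by mlp.6

normed = rms_norm(hidden)
# Fraction of the normalized vector's squared norm carried by the two
# outlier dimensions: nearly all of it.
outlier_mass = normed[[100, 2000]].pow(2).sum() / normed.pow(2).sum()
print(f"{outlier_mass:.4f}")
```

Since RMSNorm only rescales, the direction is unchanged: the ratio is the same before and after normalization, which is exactly why every later layer sees an input dominated by those two dimensions.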