Models choose as the sink token the token with the smallest relative, not absolute, position encoding in the global context.

[Figures: where the sink lands under four position-shuffling conditions (a construction sketch follows the list)]

- When the token at index 10 has the globally smallest position index and continuity with the tokens that follow it
- When the tokens at indices 0 and 10 both have the globally smallest position index, but only the tokens after index 10 have continuity with the token at index 10
- When the token at index 14 has the globally smallest position index and continuity with the tokens that follow it
- When the tokens at indices 0 and 14 both have the globally smallest position index, but only the tokens after index 14 have continuity with the token at index 14
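
To make these conditions concrete, here is a minimal sketch (not the authors' code) of how position-shuffled inputs resembling the first two conditions could be fed to a HuggingFace-style model that accepts custom position_ids; the model name, prompt, and exact shuffling scheme are illustrative assumptions.

```python
# Sketch: assign the globally smallest position index to a token other than
# the first one, then check where attention mass concentrates.
# Assumptions: a RoPE-based causal LM that accepts position_ids (model name
# is illustrative), eager attention so attention maps are returned.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # illustrative choice, not the authors' model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")

text = "The quick brown fox jumps over the lazy dog " * 3
inputs = tok(text, return_tensors="pt")
seq_len = inputs["input_ids"].shape[1]

# Condition 1: the token at sequence index 10 gets the globally smallest
# position index (0), and the tokens after it continue 1, 2, 3, ...
pos = torch.arange(seq_len)
position_ids = torch.where(pos >= 10, pos - 10, pos + (seq_len - 10)).unsqueeze(0)

# Condition 2: the tokens at indices 0 and 10 both get position index 0,
# but only the tokens after index 10 have continuity with index 10.
position_ids_tied = position_ids.clone()
position_ids_tied[0, 0] = 0

with torch.no_grad():
    out = model(**inputs, position_ids=position_ids, output_attentions=True)

# Inspect which key token the last query attends to most in the final layer.
attn = out.attentions[-1]                     # (batch, heads, query, key)
sink_idx = attn.mean(dim=1)[0, -1].argmax().item()
print("token receiving the most attention from the last query:", sink_idx)
```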
Attention sinks causally depend on the passage of information between the target token and the sink token. We intervene on each layer's attention computation and reduce the attention score assigned to the sink token to 50% of its original magnitude, which halves the update each target token receives in the direction of the sink token's value vector. Under this intervention the sink phenomenon disappears, showing that these per-layer updates in the direction of the sink token's value vector are essential for the sink to form: the attention layer must register the target token as having some correlation with the sink token in order to allocate attention to the sink.
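
The intervention can be pictured as a small change inside each attention layer: after the softmax, the weight on the sink column is multiplied by 0.5 and the probabilities are not renormalized, so every target token receives half of the usual update along the sink token's value vector. The sketch below is an illustrative reimplementation of that idea on a standalone attention function, not the authors' instrumentation; the tensor shapes and the sink_idx choice are assumptions.

```python
# Sketch: causal attention where the sink token's post-softmax weight is
# halved, so each target token gets half the usual update toward the sink
# token's value vector. Shapes and sink_idx are illustrative assumptions.
import torch
import torch.nn.functional as F

def attention_with_sink_damping(q, k, v, sink_idx=0, scale_factor=0.5):
    """q, k, v: (batch, heads, seq, head_dim). Returns the attention output
    with the attention weight on the sink column scaled by scale_factor."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5              # (b, h, q_len, k_len)
    causal = torch.triu(torch.full(scores.shape[-2:], float("-inf")), diagonal=1)
    probs = F.softmax(scores + causal, dim=-1)
    probs = probs.clone()
    probs[..., sink_idx] *= scale_factor                   # halve attention to the sink
    # No renormalization: the update toward the sink's value vector is simply halved.
    return probs @ v

# Toy usage: random tensors standing in for one layer's projected activations.
b, h, t, d = 1, 8, 16, 64
q, k, v = (torch.randn(b, h, t, d) for _ in range(3))
out = attention_with_sink_damping(q, k, v, sink_idx=0)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```

Keeping the damped probabilities unrenormalized is a deliberate reading of the description above: the point is to shrink the contribution of the sink token's value vector, not to redistribute its attention mass to other tokens.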

@misc{HeMuyu2025eAttentionSink,
  title  = {Let That Sink In: What are the Causal Mechanisms of Attention Sinks in Large Language Models?},
  url    = {https://smoothcriminal.notion.site/let-that-sink-in},
  author = {Muyu He and Yuchen Liu},
  year   = {2025},
  month  = {Oct},
}