Attention Residuals by Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su et al. (2026)

This note was initially drafted with LLM assistance. Generated notes are periodically reviewed and revised by the author.
tags
Transformers, LLM, Scaling laws, Attention, Residual neural networks
source
(Chen et al. 2026)

Summary

Standard residual connections in modern LLMs accumulate all layer outputs with fixed unit weights via PreNorm, causing uncontrolled hidden-state growth with depth and progressively diluting each layer’s contribution. This paper proposes Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs. Each layer selectively aggregates earlier representations using learned, input-dependent weights computed from a single pseudo-query vector per layer. The authors draw a formal duality between depth-wise accumulation in residual networks and sequential recurrence in RNNs, arguing that just as Transformers replaced recurrence with attention over the sequence dimension, AttnRes does the same over the depth dimension.
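The core mechanism can be sketched in a few lines of NumPy. This is a sketch under assumptions, not the paper's exact formulation: the function name `attn_res_input`, applying RMSNorm to keys only, and using the raw layer outputs as values are all inferred from the summary above.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # RMS-normalize along the feature (last) dimension
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def attn_res_input(layer_outputs, w_l):
    """Aggregate preceding layer outputs with depth-wise softmax attention.

    layer_outputs: list of (d,) vectors, outputs of layers 0..l-1
    w_l: (d,) learned pseudo-query vector for layer l
    """
    values = np.stack(layer_outputs)          # (l, d)
    keys = rmsnorm(values)                    # normalized keys
    scores = keys @ w_l                       # (l,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over depth
    return weights @ values                   # input-dependent mixture

rng = np.random.default_rng(0)
d = 8
outs = [rng.standard_normal(d) for _ in range(4)]
w = rng.standard_normal(d)
x = attn_res_input(outs, w)                   # replaces the plain sum
```

With unit weights instead of the softmax, this degenerates to the standard residual sum, which is exactly the duality the paper builds on.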

To make this practical at scale, the paper introduces Block AttnRes, which partitions layers into blocks and applies attention over block-level representations, reducing memory and communication overhead from \(O(Ld)\) to \(O(Nd)\). Combined with cross-stage caching for pipeline parallelism and a two-phase inference computation strategy, Block AttnRes becomes a drop-in replacement with less than 2% inference latency overhead and under 4% training overhead.
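A minimal sketch of how the two regimes of Block AttnRes might fit together, assuming that intra-block layers keep the standard unit-weight residual stream while a pseudo-query attends over the \(N\) cached block-level representations; `block_attn_res`, `run_block`, and the way the attended result seeds the next block are illustrative guesses, not the paper's implementation.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # RMS-normalize along the feature (last) dimension
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def block_attn_res(block_reprs, w):
    """Softmax attention over the N cached block representations.

    block_reprs: (n, d) array, one representation per completed block
    w: (d,) pseudo-query vector
    """
    keys = rmsnorm(block_reprs)
    scores = keys @ w
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over blocks
    return weights @ block_reprs

def run_block(x, layer_fns):
    # Within a block, layers accumulate via the standard residual stream
    for f in layer_fns:
        x = x + f(x)
    return x
```

The memory point is visible in the signature: only `(n, d)` block representations are kept, independent of total depth \(L\), which is the claimed \(O(Ld) \to O(Nd)\) reduction.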

Scaling law experiments across five model sizes confirm consistent improvements, with Block AttnRes matching the loss of a baseline trained with 1.25x more compute. The method is integrated into the Kimi Linear architecture (48B total / 3B activated MoE parameters) and pre-trained on 1.4T tokens, yielding improvements across all downstream benchmarks including MMLU (+1.1), GPQA-Diamond (+7.5), HumanEval (+3.1), and Minerva Math (+3.6).

Key Ideas

  • Depth-wise attention: Replace fixed unit-weight residual accumulation with softmax attention over all preceding layer outputs, allowing content-dependent, selective aggregation across depth.
  • Pseudo-query mechanism: Each layer has a single learned pseudo-query vector \(\mathbf{w}_l \in \mathbb{R}^d\) that computes attention weights over RMSNorm’d keys from previous layer outputs — decoupled from the layer’s forward pass, so the depth-wise attention weights can be computed in parallel.
  • Block AttnRes: Partition \(L\) layers into \(N\) blocks; within each block, layers accumulate via standard residual; across blocks, apply full attention over \(N\) block-level representations. \(N \approx 8\) recovers most of the gain.
  • Unified structured-matrix framework: Standard residuals, Highway networks, mHC, DenseFormer, and AttnRes are all characterized as different depth mixing matrices \(\mathbf{M} \in \mathbb{R}^{L \times L}\) with varying semiseparable rank and input dependence.
  • Cross-stage caching: For pipeline parallelism, cache block representations locally across virtual stages to eliminate redundant communication, reducing peak per-transition cost from \(O(C)\) to \(O(P)\).
  • Two-phase inference: Phase 1 batches inter-block attention for all layers in a block simultaneously; Phase 2 handles sequential intra-block attention with online softmax merge, amortizing memory access.
  • PreNorm dilution mitigation: AttnRes confines hidden-state growth within blocks, yielding bounded periodic output magnitudes and more uniform gradient distribution across depth.
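The structured-matrix view can be made concrete with a toy example: stacking per-layer branch outputs as rows of \(F\), the standard residual stream is \(\mathbf{M}F\) with \(\mathbf{M}\) the all-ones lower-triangular matrix, while AttnRes makes each row of \(\mathbf{M}\) a softmax distribution over preceding layers (here random weights stand in for the input-dependent scores).

```python
import numpy as np

L, d = 5, 4
rng = np.random.default_rng(1)
F = rng.standard_normal((L, d))      # per-layer branch outputs f_1..f_L

# Standard residual stream: x_l = sum_{j<=l} f_j, i.e. M is all-ones
# lower-triangular (depth-wise linear attention with unit weights).
M_res = np.tril(np.ones((L, L)))
X_res = M_res @ F
assert np.allclose(X_res[2], F[0] + F[1] + F[2])

# AttnRes replaces each row with a softmax over j <= l, making M
# row-stochastic (and, in the real model, input-dependent).
scores = rng.standard_normal((L, L))
mask = np.tril(np.ones((L, L), dtype=bool))
M_attn = np.where(mask, np.exp(scores), 0.0)
M_attn /= M_attn.sum(axis=1, keepdims=True)
X_attn = M_attn @ F
```

Row sums illustrate the contrast: rows of `M_res` grow as \(O(L)\), matching the PreNorm magnitude growth noted above, while rows of `M_attn` always sum to 1.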

Comments

This paper makes a compelling argument by drawing the parallel between how attention replaced recurrence over the sequence dimension and extending the same idea to the depth dimension. The insight that standard residual connections are equivalent to depth-wise linear attention (all-ones mixing matrix) is elegant and provides a clean theoretical motivation.

The practical engineering is notable: Block AttnRes achieves most of the gains of full AttnRes while being genuinely deployable at scale with minimal overhead. The \(N \approx 8\) finding is convenient — it means only 8 stored block representations regardless of total depth.

The structured-matrix perspective (Section 6.2) unifying residual variants is particularly valuable, connecting DenseFormer, mHC, Highway networks, and AttnRes through the lens of semiseparable rank and input dependence. This provides a systematic vocabulary for comparing future residual connection designs.

One limitation acknowledged by the authors: the architecture sweep (Figure 7) shows AttnRes favors deeper, narrower models, but this does not directly translate to deployment recommendations due to the sequential computation cost of deeper models. The scaling law experiments also use hyperparameters optimized for the baseline, which the authors note makes the comparison conservative.

Connections

  • Directly extends Residual neural networks by replacing fixed unit-weight accumulation with learned, input-dependent depth-wise attention
  • Applies the same transition from linear to softmax attention that Transformers made over the sequence dimension, now over the depth dimension
  • Draws formal duality with Recurrent neural networks: standard residuals are to depth what RNNs are to sequence, and AttnRes replaces both with attention
  • Scaling laws experiments validate consistent improvement across model sizes, with Block AttnRes equivalent to 1.25x compute advantage
  • Addresses LayerNorm (PreNorm) dilution problem: the unweighted accumulation in PreNorm residuals causes hidden-state magnitudes to grow as \(O(L)\), which AttnRes mitigates
  • Uses RMSNorm on keys to prevent magnitude bias in depth-wise attention, connecting to Layer normalization principles (RMSNorm is a simplified LayerNorm variant)
  • Related to linear attention work: the structured-matrix analysis shows standard residuals perform depth-wise linear attention, while AttnRes upgrades this to softmax attention
  • Architecture uses Mixture-of-Experts (related to Switch Transformer and GLaM) with 48B total / 3B activated parameters

Bibliography

  1. Chen, Guangyu, Yu Zhang, Jianlin Su, et al. (Kimi Team). 2026. "Attention Residuals". https://arxiv.org/abs/2603.15031.