Residual Matrix Transformers: Scaling the Size of the Residual Stream by Brian Mak, Jeffrey Flanigan (2025)

This note was initially drafted with LLM assistance. Generated notes are periodically reviewed and revised by the author.
tags
Transformers, LLM, Scaling laws, Attention, Residual neural networks, Memory in neural networks
source
(Mak, Flanigan 2025)

Summary

Standard Transformers use a residual stream of dimension \(D\) as a “memory bus” where every layer reads and writes features (Elhage et al., 2021). Resizing this bus also resizes every weight matrix, so the bandwidth of the residual stream is structurally tied to parameter count and per-token FLOPs. The Residual Matrix Transformer (RMT) breaks this coupling by replacing the residual vector at each token position with an outer-product memory matrix \(M \in \mathbb{R}^{D_k \times D_v}\) (Kohonen, 1972; Anderson, 1972). Each attention head becomes a (key, value) pair stored into the matrix via \(M = \text{Norm}(\sum_p q^{(p)} \otimes x^{(p)})\), and is read out by tensor contraction \(x^{(r)} \approx q^{(r)} \cdot_1 M\). The key vectors \(r_Q, r_K, r_V, r_O \in \mathbb{R}^{D_k}\) take the place of the standard \(W_Q, W_K, W_V, W_O\) projection matrices, so scaling the residual stream along the \(D_k\) axis costs almost no extra parameters or FLOPs.
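
A minimal sketch of these two operations (illustrative code, not the authors' implementation), assuming orthonormal keys and omitting the Norm step so the read-out is exact rather than approximate:

```python
# Outer-product memory: store (key, value) pairs as a matrix, read back by
# contracting a key along the first mode. Names and sizes are illustrative.
import torch

D_k, D_v, P = 8, 5, 4                              # key dim, value dim, stored pairs

# Near-orthonormal key vectors q^(p) make retrieval low-interference.
q = torch.linalg.qr(torch.randn(D_k, P)).Q.T       # (P, D_k), orthonormal rows
x = torch.randn(P, D_v)                            # value vectors x^(p)

# Storage: M = sum_p q^(p) ⊗ x^(p)  (the paper additionally applies Norm(.))
M = torch.einsum('pk,pv->kv', q, x)                # (D_k, D_v)

# Retrieval by mode-1 contraction: x^(r) ≈ q^(r) ·_1 M
r = 2
x_hat = torch.einsum('k,kv->v', q[r], M)
print(torch.allclose(x_hat, x[r], atol=1e-5))      # exact here because keys are orthonormal
```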

The authors prove (Table 1) that storage/retrieval through the outer-product memory has comparable or better mean/variance propagation than linear projections, satisfying Glorot-style initialization criteria in nearly every layer. Empirically, on OpenWebText and matched to GPT-2 dimensions, the RMT matches a 405M-parameter scaling-law-trained transformer with 25% fewer parameters, 58% fewer FLOPs, and 41% fewer training tokens (Table 3). Scaling the residual stream from \(D_k=384\) up to \(D_k=4096\) at fixed model size and fixed compute monotonically reduces dev loss (Figure 6) — a new scaling axis that does not exist in the standard transformer. On zero-shot downstream evaluations the RMT also outperforms a transformer that is 33% larger across LAMBADA, OpenWebText perplexity, ARC-C/E, HellaSwag, PIQA, and Winogrande.
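
As a rough illustration of the variance-propagation claim (not the paper's closed-form analysis), the following Monte Carlo check compares a store-then-retrieve round trip through the outer-product memory against a Xavier-initialized linear projection; the simplified setup (one stored pair, no Norm) and all names are my own:

```python
# At Xavier-style init, both interfaces should keep sigma^2_out / sigma^2_in
# near 1. Illustrative only; see Table 1 of the paper for the exact analysis.
import torch

torch.manual_seed(0)
D, D_k, trials = 64, 384, 2000

ratios_linear, ratios_memory = [], []
for _ in range(trials):
    x = torch.randn(D)

    # (a) Standard interface: Xavier-style projection, Var(W_ij) = 1/D.
    W = torch.randn(D, D) / D ** 0.5
    ratios_linear.append((W @ x).var() / x.var())

    # (b) RMT-style interface: store x under key r, read it back with r.
    r = torch.randn(D_k) / D_k ** 0.5     # key vector, Var(r_i) = 1/D_k
    M = torch.outer(r, x)                 # storage: r ⊗ x
    x_hat = M.T @ r                       # retrieval: r ·_1 M
    ratios_memory.append(x_hat.var() / x.var())

print("linear projection  var ratio ≈", torch.stack(ratios_linear).mean().item())
print("outer-product trip var ratio ≈", torch.stack(ratios_memory).mean().item())
```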

The main practical caveat is wall-clock runtime: per-step time is currently ∼4% slower than the transformer because the implementation is not hardware-aware. Training memory use is roughly equal (the larger residual stream is offset by the smaller model), and inference memory is actually lower, since only one residual matrix per sequence needs to be cached for autoregressive decoding.

Key Ideas

  • Outer-product memory residual stream: each token’s residual is a matrix \(M = \text{Norm}(\sum_p q^{(p)} \otimes x^{(p)})\) rather than a vector, storing features as outer products of learned key vectors and head outputs.
  • Key-vector adapters replace projection matrices: \(W_Q^{(h)}, W_K^{(h)}, W_V^{(h)}, W_O^{(h)} \in \mathbb{R}^{D_h \times D}\) are replaced by key vectors \(r_Q^{(h)}, r_K^{(h)}, r_V^{(h)}, r_O^{(h)} \in \mathbb{R}^{D_k}\). Tensor contraction \(\cdot_1\) retrieves \(D_v \times N\)-shaped Q/K/V tensors from the residual matrix, so the single-head attention (SHA) op is unchanged in shape and cost (see the sketch after this list).
  • Feed-forward keeps its weights: unlike the attention block, the FF layer retains its \(W_1, W_2\) matrices because Geva et al. (2021) and Meng et al. (2022) show FF weights store factual associations — only the read/write interface is replaced by key vectors.
  • Independent scaling of residual stream: \(D_k\) can be increased without touching the rest of the model. A 100% increase in \(D_k\) adds <1% parameters and <1% FLOPs, versus 100%/94% for the standard transformer (Figure 1).
  • Improved variance propagation: closed-form analysis shows that, except for the attention storage op, RMT layer ratios \(\sigma^2_{\text{out}} / \sigma^2_{\text{in}}\) are closer to 1 under Xavier initialization, making deeper RMTs better behaved at init.
  • New scaling axis: at fixed model size, dataset size, and compute, expanding \(D_k\) monotonically lowers loss (Figure 6, \(D_k\) from 384 to 4096), yielding 23% FLOP and 25% token savings to reach the same final loss as a matched-residual GPT-2-small.
  • Beats prior residual-stream variants: outperforms Depthwise LSTM, Hierarchical Aggregation, Highway Transformer, and the standard transformer on per-FLOP loss (Figure 5) while using fewer parameters than all of them.
  • Connection to associative memory: the storage/retrieval operations are exactly those of classical correlation matrix memories (Kohonen 1972, Anderson 1972, Gmitro et al. 1989) — the residual stream is reframed as a per-token associative store.
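
The sketch below illustrates the key-vector read/write interface from the bullets above, with assumed shapes and variable names (one head, no causal mask, Norm omitted); it is not the authors' code:

```python
# One attention head reading Q/K/V out of the per-token residual matrices
# with learned key vectors, and writing its output back as an outer product.
import torch
import torch.nn.functional as F

N, D_k, D_v = 10, 384, 64                 # tokens, residual key dim, head dim

M = torch.randn(N, D_k, D_v)              # residual matrices, one per token
r_Q, r_K, r_V, r_O = (torch.randn(D_k) for _ in range(4))  # key vectors

# Read interface: mode-1 contraction q ·_1 M gives one D_v-dim vector per
# token, exactly what a standard head would get from W_Q/W_K/W_V.
Q = torch.einsum('k,nkv->nv', r_Q, M)
K = torch.einsum('k,nkv->nv', r_K, M)
V = torch.einsum('k,nkv->nv', r_V, M)

# The softmax attention op itself is unchanged (causal masking omitted here).
A = F.softmax(Q @ K.T / D_v ** 0.5, dim=-1)
head_out = A @ V                           # (N, D_v)

# Write interface: the head output is stored back as an outer product keyed
# by r_O (the paper additionally normalizes the updated residual).
M_new = M + torch.einsum('k,nv->nkv', r_O, head_out)

# Parameter cost of the interface, per head:
#   standard transformer: W_Q, W_K, W_V, W_O  ->  4 * D * D_h weights
#   RMT key vectors:      r_Q, r_K, r_V, r_O  ->  4 * D_k weights
# which is why doubling D_k adds <1% parameters (Figure 1 of the paper).
```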

Comments

The reframing of the residual stream as an associative outer-product memory is elegant and exposes a real architectural slack: the residual-stream “bandwidth” was always tied to projection-matrix width, and decoupling these turns out to give a free scaling axis. The Pareto improvement on per-FLOP and per-parameter loss is striking — 25% fewer parameters and 58% fewer FLOPs to reach the same loss as a GPT-2-medium-shaped transformer is a large delta for a “drop-in” architectural change.

The connection to Hopfield Networks / classical correlation matrix memories (Kohonen 1972, Anderson 1972) is more than an analogy: the storage rule \(M = \sum_p q^{(p)} \otimes x^{(p)}\) is the textbook outer-product associative memory, and retrieval \(x^{(r)} \approx q^{(r)} \cdot_1 M\) is the matched-key read-out. This places RMT in a lineage with memory-augmented neural networks, recasting “the residual stream is a memory bus” (Elhage et al.) as literally true rather than metaphorical.

The paper is also a nice complement to other recent residual-stream modifications. Where Attention Residuals (Chen et al. 2026) adds learned softmax attention across depth (replacing fixed unit-weight accumulation with learned weights), RMT changes the storage substrate within each layer position (vector → matrix). The two modifications operate on orthogonal axes and could in principle be combined.

The runtime gap (4% slower per step) is the headline limitation, and it is purely an engineering artifact: the tensor contractions are not yet hardware-aware. If a fused kernel closed this gap, the FLOP/parameter efficiency would translate directly into wall-clock training-time savings — making this a strong candidate for production-scale revisits. The LayerNorm-based normalization choice (vs. unnormalized outer-product memories) also raises questions about how it interacts with LayerNorm’s known role in stabilizing pre-norm transformers.

The experiments stop at 405M parameters and 6B tokens due to compute constraints, so it is genuinely unknown whether the gains persist at billion-parameter frontier scale or whether other residual-stream pathologies emerge. Still, the Chinchilla-optimal training-token regime is respected, which makes the comparison cleaner than many recent architecture papers.

Connections

  • Directly extends Transformers by replacing the residual vector with an outer-product residual matrix, decoupling residual-stream width from projection-matrix width.
  • Builds on Residual neural networks: keeps the additive identity-shortcut structure but changes what flows through the shortcut.
  • Reframes the residual stream as a memory-in-neural-network mechanism; storage/retrieval is the classical Kohonen/Anderson correlation matrix memory, closely related to Hopfield Networks and modern continuous Hopfield variants.
  • Defines a new axis along which scaling laws / the scaling hypothesis can be explored: residual-stream size \(D_k\) at fixed model size and per-example compute.
  • Trained at Chinchilla-optimal token budgets (\(20\times\) non-embedding parameters), making the per-FLOP comparison meaningful.
  • Complements Attention Residuals (Chen et al. 2026): AttnRes modifies how residual contributions are mixed across depth (depth-wise attention), RMT modifies what data structure each token’s residual is (matrix vs vector). Orthogonal modifications.
  • Cites and improves over Highway Transformer (Chai et al. 2020) and depthwise-LSTM residual managers (Xu et al. 2024) — all variants the Block-AttnRes paper also compares against.
  • Uses LayerNorm-style normalization on the outer-product memory; pre-norm insertion is structurally identical to the standard transformer block.
  • Inherits the attention block’s softmax SHA computation unchanged — the RMT change is in the interface to the residual stream, not in attention itself.
  • Connects to More Is Different (Anderson 1972) only via the shared surname on the original outer-product memory paper (Anderson, J.A. 1972, a different Anderson than P.W. Anderson); bibliographic overlap, not conceptual.

Bibliography

  1. Mak, Brian, and Jeffrey Flanigan. 2025. "Residual Matrix Transformers: Scaling the Size of the Residual Stream". https://arxiv.org/abs/2506.22696.