LoRA Learns Less and Forgets Less by Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham (2024)

This note was initially drafted with LLM assistance. Generated notes are periodically reviewed and revised by the author.
tags
LLM, Continual learning, Catastrophic forgetting, Transfer learning, Language modeling
source
(Biderman et al. 2024)

Summary

This paper provides a rigorous head-to-head comparison of Low-Rank Adaptation (LoRA) against full finetuning of Llama-2-7B on two challenging target domains (code and math) under two training regimes: continued pretraining (CPT, ~20B unlabeled tokens) and instruction finetuning (IFT, ~100K prompt–response pairs). The central question is: under which conditions does LoRA approximate full finetuning accuracy, and to what extent does it mitigate catastrophic forgetting of base model capabilities?

The headline finding is captured by the title: in standard low-rank settings, LoRA substantially underperforms full finetuning on target-domain accuracy (HumanEval for code, GSM8K for math), but it forgets less of the source domain (HellaSwag, ARC-Challenge, WinoGrande). High LoRA ranks (r=256) can close most of the gap in IFT but not in CPT, especially for code. The authors also show, via SVD of the weight perturbations, that full finetuning produces high-rank updates (10–100× the rank typical LoRA configurations use), offering a mechanistic explanation for the accuracy gap. Finally, they characterise a learning–forgetting Pareto frontier and offer practical recipes (α=2r, target all transformer modules, sweep learning rates 1e-5 to 5e-4).

The work is notable for being one of the few studies that benchmarks LoRA on modern billion-parameter LLMs in hard target domains where the base model distribution is far from the target — settings where prior LoRA-vs-full comparisons (e.g. on GLUE/RoBERTa) had concluded parity.

Key Ideas

  • Low-rank adaptation: freeze pretrained weight \(W\) and learn \(\Delta = \gamma_r AB\) with \(A \in \mathbb{R}^{d \times r}, B \in \mathbb{R}^{r \times k}\), \(\gamma_r = \alpha/r\). Trains \(\sim 1\%\) of parameters for r=16 on a 7B model.
  • LoRA underperforms in CPT: across r ∈ {16, 64, 256} on StarCoder-Python and OpenWebMath, LoRA never closes the gap to full finetuning on HumanEval/GSM8K.
  • LoRA can match full FT in IFT, but only at high rank: r=256 on Magicoder-Evol matches full finetuning on HumanEval (0.498 vs 0.497), while typical low ranks (r=8–64) lag behind.
  • LoRA forgets less: on a forgetting metric averaging HellaSwag, ARC-Challenge, and WinoGrande, LoRA preserves base-model performance markedly better than full finetuning, with rank acting as a knob that trades learning for forgetting.
  • LoRA beats classical regularisation: dropout and weight decay exhibit learning–forgetting behaviour similar to full finetuning; only LoRA achieves the favourable learning–forgetting tradeoff at moderate ranks.
  • Output diversity: full finetuning collapses output diversity (akin to RLHF “distribution collapse”); LoRA preserves it, sitting between base and full-finetuned models.
  • Full FT learns high-rank perturbations: SVD analysis shows the rank of \(\Delta\) for full finetuning is 10–100× higher than the LoRA rank typically used, and grows with training tokens. MLP modules need higher rank than attention.
  • Best practices: use LoRA preferentially for IFT (not CPT); target all transformer modules (attention + MLP), with MLP being the dominant locus; set α=2r; use learning rates an order of magnitude above full FT (5e-5 to 5e-4); rank 256 if memory permits.
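The low-rank parameterisation from the first bullet can be sketched in a few lines. This is an illustrative NumPy sketch under assumed conventions (dimensions matching Llama-2-7B's hidden size, the paper's α=2r setting, and a `lora_forward` helper named here for illustration), not the paper's implementation:

```python
import numpy as np

d, k, r = 4096, 4096, 16      # Llama-2-7B hidden size; a typical low rank
alpha = 2 * r                 # the paper's recommended setting alpha = 2r

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k)) * 0.02   # pretrained weight, kept frozen
A = rng.standard_normal((d, r)) * 0.01   # trainable, small random init
B = np.zeros((r, k))                     # trainable, zero init: Delta = 0 at start

def lora_forward(x):
    # Equivalent to x @ (W + (alpha / r) * A @ B).T, without
    # materialising the d-by-k perturbation Delta
    return x @ W.T + (alpha / r) * (x @ B.T) @ A.T

# Trainable fraction for one such matrix: r * (d + k) / (d * k)
print(f"{r * (d + k) / (d * k):.2%}")    # 0.78%, consistent with "~1%" above
```

Applying the low-rank factors to the activations (rather than forming \(\Delta\) explicitly) is the standard trick that keeps LoRA's memory footprint small.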

Comments

This paper is a useful corrective to optimistic LoRA folklore inherited from the original LLM adaptation literature, where the targets (GLUE, MNLI on sub-billion-parameter RoBERTa models) were trivially close to the pretraining distribution. Once the target domain genuinely demands new capability — Python code, formal math — the low-rank inductive bias of LoRA acts as a real constraint. The framing of LoRA as a knob on the learning–forgetting Pareto frontier (rather than as “free” parameter efficiency) is the most useful contribution: it makes the tradeoff explicit and gives a principled reason to choose LoRA when source-domain preservation matters (e.g. continual instruction following).

The SVD analysis (Section 4.6) is the most mechanistically interesting piece. That full finetuning produces \(\Delta\) with rank \(\sim d/2\), while typical LoRA uses \(r \le 64 \ll d=4096\), gives a clean explanation for the underperformance: LoRA can’t physically express the perturbation that full finetuning finds. The fact that MLPs need higher rank than attention modules is a structural hint about where continual learning happens in transformers — an angle worth following in the continual learning literature.
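The SVD diagnostic can be illustrated on synthetic perturbations. A sketch under stated assumptions: the 90% spectral-energy threshold and the `effective_rank` helper are choices made here for illustration, not necessarily the paper's exact metric, and random matrices stand in for real finetuning deltas:

```python
import numpy as np

def effective_rank(delta, energy=0.90):
    """Number of singular values needed to capture `energy` of the spectrum
    of a weight perturbation Delta = W_finetuned - W_base."""
    s = np.linalg.svd(delta, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

rng = np.random.default_rng(0)
d = 256
low = rng.standard_normal((d, 16)) @ rng.standard_normal((16, d))  # LoRA-like, rank 16
high = rng.standard_normal((d, d))                                 # dense, full-FT-like

# The dense perturbation needs far more directions than the rank-16 one
print(effective_rank(low), effective_rank(high))
```

The paper's observation is that real full-finetuning deltas look like the dense case: their spectrum is spread across far more directions than any practical LoRA rank can represent.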

A limitation worth flagging: all results are on Llama-2-7B with one round of finetuning. Modern post-training pipelines stack many stages (SFT → DPO → RLHF), and the cumulative forgetting picture under repeated LoRA vs full updates is left open. Also, the “forgetting” metric is a fixed average of three classical NLU benchmarks; richer evaluations (instruction following, factuality, safety) might shift the tradeoff.

Connections

  • Related to Catastrophic forgetting — the paper is essentially a careful empirical study of forgetting under two finetuning paradigms, and shows LoRA acts as an implicit regulariser against it.
  • Related to Continual learning — instruction finetuning and continued pretraining are both forms of continual learning on a pretrained base, and the learning–forgetting Pareto frontier is a continual-learning concept.
  • Related to Transfer learning — LoRA is a low-parameter transfer mechanism; the paper interrogates its expressive limits on hard transfer.
  • Related to LLM — direct study of how LLMs are adapted post-pretraining.
  • Related to Language modeling — the CPT regime continues language modeling on a domain-specific corpus.

Bibliography

  1. Biderman, Dan, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John P. Cunningham. 2024. "LoRA Learns Less and Forgets Less". https://arxiv.org/abs/2405.09673.