- tags
- LLM, Continual learning, Catastrophic forgetting, Transfer learning, Language modeling
- source
- (Biderman et al. 2024)
Summary
This paper provides a rigorous head-to-head comparison of Low-Rank Adaptation (LoRA) against full finetuning of Llama-2-7B on two challenging target domains (code and math) under two training regimes: continued pretraining (CPT, ~20B unlabeled tokens) and instruction finetuning (IFT, ~100K prompt–response pairs). The central question is: under which conditions does LoRA approximate full finetuning accuracy, and to what extent does it mitigate catastrophic forgetting of base model capabilities?
The headline finding is captured by the title: in standard low-rank settings, LoRA substantially underperforms full finetuning on target-domain accuracy (HumanEval for code, GSM8K for math), but it forgets less of the source domain (HellaSwag, ARC-Challenge, WinoGrande). High LoRA ranks (r=256) can close most of the gap in IFT but not in CPT, especially for code. The authors also show, via SVD of the weight perturbations, that full finetuning produces high-rank updates (10–100× the rank typical LoRA configurations use), offering a mechanistic explanation for the accuracy gap. Finally, they characterise a learning–forgetting Pareto frontier and offer practical recipes (α=2r, target all transformer modules, sweep learning rates 1e-5 to 5e-4).
The work is notable for being one of the few studies that benchmarks LoRA on modern billion-parameter LLMs in hard target domains where the base model distribution is far from the target — settings where prior LoRA-vs-full comparisons (e.g. on GLUE/RoBERTa) had concluded parity.
Key Ideas
- Low-rank adaptation: freeze pretrained weight \(W\) and learn \(\Delta = \gamma_r AB\) with \(A \in \mathbb{R}^{d \times r}, B \in \mathbb{R}^{r \times k}\), \(\gamma_r = \alpha/r\). Trains \(\sim 1\%\) of parameters for r=16 on a 7B model.
- LoRA underperforms in CPT: across r ∈ {16, 64, 256} on StarCoder-Python and OpenWebMath, LoRA never closes the gap to full finetuning on HumanEval/GSM8K.
- LoRA can match full FT in IFT — at high rank: r=256 on Magicoder-Evol can match full finetuning HumanEval (0.498 vs 0.497), but typical low ranks (r=8–64) lag.
- LoRA forgets less: on a forgetting metric averaging HellaSwag, ARC-Challenge, and WinoGrande accuracy, LoRA preserves base-model performance markedly better than full finetuning, with rank acting as a knob trading learning for forgetting.
- LoRA beats classical regularisation: full finetuning with dropout or weight decay learns and forgets about as much as unregularised full finetuning; only LoRA achieves the favourable learning–forgetting tradeoff at moderate ranks.
- Output diversity: full finetuning collapses output diversity (akin to RLHF “distribution collapse”); LoRA preserves it, sitting between base and full-finetuned models.
- Full FT learns high-rank perturbations: SVD analysis shows the rank of \(\Delta\) for full finetuning is 10–100× higher than the LoRA rank typically used, and grows with training tokens. MLP modules need higher rank than attention.
- Best practices: use LoRA preferentially for IFT (not CPT); target all transformer modules (attention + MLP), with MLP being the dominant locus; set α=2r; use learning rates an order of magnitude above full FT (5e-5 to 5e-4); rank 256 if memory permits.
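The low-rank parameterisation in the first bullet can be made concrete with a minimal numpy sketch of \(y = x(W + (\alpha/r)AB)\). The shapes and the α=2r recipe follow the note; the toy dimensions and variable names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 64, 16, 32  # alpha = 2r, the recipe the paper recommends

W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((d, r)) * 0.01   # trainable low-rank factor
B = np.zeros((r, k))                     # zero-init so Delta starts at 0

def lora_forward(x, W, A, B, alpha, r):
    """y = x (W + (alpha/r) A B): frozen weight plus scaled low-rank update."""
    delta = (alpha / r) * A @ B
    return x @ (W + delta)

x = rng.standard_normal((1, d))
# With B = 0 the adapted layer exactly reproduces the base layer's output,
# so training starts from the pretrained model (standard LoRA initialisation).
assert np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W)

# Trainable-parameter ratio: LoRA trains r(d+k) entries vs d*k for full FT.
print(r * (d + k) / (d * k))  # 0.5 at these toy sizes; ~1% at r=16 on 7B-scale layers
```

Only `A` and `B` receive gradients; `W` stays frozen, which is where both the memory savings and the rank-r expressivity constraint come from.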
Comments
This paper is a useful corrective to optimistic LoRA folklore inherited from the original LLM adaptation literature, where the targets (GLUE, MNLI on RoBERTa-340M) were trivially close to the pretraining distribution. Once the target domain genuinely demands new capability — Python code, formal math — the low-rank inductive bias of LoRA acts as a real constraint. The framing of LoRA as a knob on the learning–forgetting Pareto frontier (rather than as “free” parameter efficiency) is the most useful contribution: it makes the tradeoff explicit and gives a principled reason to choose LoRA when source-domain preservation matters (e.g. continual instruction following).
The SVD analysis (Section 4.6) is the most mechanistically interesting piece. That full finetuning produces \(\Delta\) with rank \(\sim d/2\), while typical LoRA uses \(r \le 64 \ll d=4096\), gives a clean explanation for the underperformance: LoRA can’t physically express the perturbation that full finetuning finds. The fact that MLPs need higher rank than attention modules is a structural hint about where continual learning happens in transformers — an angle worth following in the continual learning literature.
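The rank gap can be illustrated with a toy check: count how many singular values of a perturbation \(\Delta\) are needed to capture most of its spectral energy. This is a rough proxy, not the paper's exact measurement protocol, and the 90% threshold is my choice:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 256

def effective_rank(delta, energy=0.9):
    """Smallest r such that the top-r singular values carry `energy` of the
    squared-singular-value mass (one common proxy for matrix rank)."""
    s = np.linalg.svd(delta, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

# A full-finetuning-style perturbation: dense, no low-rank structure imposed.
delta_full = rng.standard_normal((d, d)) / np.sqrt(d)
# A LoRA-style perturbation: rank bounded by r = 16 by construction.
A, B = rng.standard_normal((d, 16)), rng.standard_normal((16, d))
delta_lora = A @ B

print(effective_rank(delta_full))  # roughly d/2 for a random dense matrix
print(effective_rank(delta_lora))  # at most 16, whatever A and B contain
```

However full finetuning distributes its update, the LoRA perturbation can never exceed rank r; if the update full finetuning actually finds has effective rank near d/2, a rank-16 or rank-64 adapter cannot represent it.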
A limitation worth flagging: all results are on Llama-2-7B with one round of finetuning. Modern post-training pipelines stack many stages (SFT → DPO → RLHF), and the cumulative forgetting picture under repeated LoRA vs full updates is left open. Also, the “forgetting” metric is a fixed average of three classical NLU benchmarks; richer evaluations (instruction following, factuality, safety) might shift the tradeoff.
Connections
- Related to Catastrophic forgetting — the paper is essentially a careful empirical study of forgetting under two finetuning paradigms, and shows LoRA acts as an implicit regulariser against it.
- Related to Continual learning — instruction finetuning and continued pretraining are both forms of continual learning on a pretrained base, and the learning–forgetting Pareto frontier is a continual-learning concept.
- Related to Transfer learning — LoRA is a low-parameter transfer mechanism; the paper interrogates its expressive limits on hard transfer.
- Related to LLM — direct study of how LLMs are adapted post-pretraining.
- Related to Language modeling — the CPT regime continues language modeling on a domain-specific corpus.
Bibliography
- Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, et al. 2024. "LoRA Learns Less and Forgets Less". https://arxiv.org/abs/2405.09673.