- tags: Machine learning, Optimization, Meta-learning, Program synthesis
- source: (Lee et al. 2026)
Summary
This paper introduces Meta-Harness, an outer-loop system for automatically optimizing the “harness” of LLM applications — the code that determines what information to store, retrieve, and present to the model at each step. The key insight is that harness design matters as much as model weights: changing only the harness around a fixed LLM can produce a 6x performance gap on the same benchmark.
Unlike prior text optimization methods (OPRO, TextGrad, AlphaEvolve, GEPA, Feedback Descent, TTT-Discover) that compress feedback into short summaries or scalar scores, Meta-Harness gives its proposer agent full filesystem access to the source code, evaluation scores, and raw execution traces of all prior candidate harnesses. The proposer is a coding agent (Claude Code with Opus 4.6) that reads a median of 82 files per iteration and can reason about why previous harnesses failed before proposing targeted edits.
The system is evaluated on three domains: (1) online text classification, where it improves over the state-of-the-art ACE system by 7.7 points while using 4x fewer context tokens; (2) retrieval-augmented math reasoning on 200 IMO-level problems, where a single discovered harness improves accuracy by 4.7 points on average across five held-out models; and (3) agentic coding on TerminalBench-2, where discovered harnesses surpass hand-engineered baselines and rank #2 among all Opus 4.6 agents.
Key Ideas
- A harness is defined as a stateful program wrapping an LLM that determines prompt construction, retrieval, memory, and context management. Harness optimization is formalized as finding \(H^* = \arg\max_H \mathbb{E}_{x \sim \mathcal{X}, \tau \sim p_M(H,x)} r(\tau, x)\).
- The central design choice is full filesystem access: instead of compressed summaries, all prior candidates’ source code, scores, and execution traces are stored in a growing directory that the proposer queries via grep and cat.
- The proposer is a coding agent, not a raw next-token model. It decides which prior artifacts to inspect, diagnoses failure modes from execution traces, and makes targeted code edits.
- Ablation studies show that access to raw execution traces is the key ingredient: a scores-only interface reaches 41.3 best accuracy, scores plus summaries reaches 38.7, while the full Meta-Harness interface with raw traces reaches 56.7.
- Meta-Harness is 10x faster than comparable text optimizers (OpenEvolve, TTT-Discover) at converging to a good harness, attributed to its minimal outer-loop structure.
- Discovered harnesses generalize to out-of-distribution datasets (73.1% average accuracy across 9 unseen tasks) and to unseen models in the math reasoning setting.
Comments
This work makes a compelling case that the “harness” — the scaffolding code around an LLM — is a first-class optimization target. The framing is reminiscent of Meta-learning but operates in code space rather than weight space, and connects to Program synthesis through its use of a coding agent to search over programs.
The paper’s key contribution over prior text optimization methods is the argument that richer, less compressed feedback (full execution traces rather than summaries) enables more effective search. This is validated by strong ablation results.
The connection to AI-GAs is notable: Meta-Harness can be seen as an AI-generating algorithm that searches over the scaffolding rather than the model itself, with the coding agent serving as the “meta” level.
A limitation acknowledged by the authors is that the system has only been tested with one proposer agent (Claude Code). The approach’s effectiveness may vary with different coding agents.
Connections
- Related to Meta-learning because it optimizes the learning procedure (harness) rather than the model weights directly
- Related to Program synthesis because harness search is fundamentally a program search problem, using a coding agent to explore code space
- Related to AI-GAs: AI-generating algorithms because it uses AI (coding agent) to generate better AI systems (harnesses), fitting the AI-GA paradigm
- Related to Optimization because the outer loop performs search over a combinatorial code space with Pareto frontier tracking
- Related to Few-shot learning because one of the key applications is optimizing in-context few-shot example selection and presentation
Bibliography
- Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, Chelsea Finn. 2026. "Meta-Harness: End-to-End Optimization of Model Harnesses". https://arxiv.org/abs/2603.28052.