- tags
  - NLP, LLM, Semantic similarity
- source
  - (Lee et al. 2024)
Summary
Gecko is a compact text embedding model (1.2B parameters) from Google DeepMind that achieves strong retrieval performance by distilling knowledge from LLMs into a retriever. The key innovation is a two-step LLM distillation process that produces FRet (the Few-shot Prompted Retrieval dataset). First, an LLM generates diverse synthetic task descriptions and queries from sampled web passages. Second, the LLM refines data quality by reranking retrieved candidate passages using two scoring functions (query likelihood and relevance classification), selecting better positive and hard negative passages than the original seed passages.
On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing 768-dimension models. The 768-dimension version achieves an average score of 66.31, competitive with 7x-larger 7B-parameter models that use 3-4k-dimensional embeddings. The model is trained on a unified fine-tuning mixture that combines FRet data with academic datasets (Natural Questions, HotpotQA, FEVER, etc.), all cast into a consistent format of task description, query, positive passage, and negative passage.
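The unified fine-tuning format can be pictured as a small sketch. The field names and example strings below are illustrative assumptions, not the paper's actual schema; the paper only specifies that each example carries a task description, query, positive passage, and negative passage.

```python
from dataclasses import dataclass

@dataclass
class FRetExample:
    """One record in the unified fine-tuning mixture (field names assumed)."""
    task: str       # LLM-generated task description
    query: str      # LLM-generated query for the task
    positive: str   # LLM-relabeled positive passage
    negative: str   # LLM-mined hard negative passage

# A hypothetical record for illustration only.
example = FRetExample(
    task="Given a question, retrieve a passage that answers it",
    query="what is matryoshka representation learning",
    positive="Matryoshka Representation Learning trains nested embeddings...",
    negative="Russian matryoshka dolls are wooden nesting dolls...",
)
```

Keeping synthetic FRet data and human-annotated academic data in one shared record shape is what lets the paper train on both with a single objective.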
Key Ideas
- Two-step LLM distillation via FRet: (1) LLM-based diverse query generation from web passages, (2) LLM-based positive/negative mining and relabeling
- LLM relabeling discovers that the best positive passage for a generated query differs from the original seed passage ~15% of the time
- Two LLM ranking functions combined via Reciprocal Rank Fusion: query likelihood and relevance classification
- Unified fine-tuning format combining synthetic FRet data with human-annotated academic datasets
- Same-tower negatives for symmetric tasks (e.g., semantic similarity) improve STS performance
- Matryoshka Representation Learning (MRL) loss enables multiple embedding dimensions (256, 768) from a single model
- Pre-finetuning on large-scale unsupervised text pairs before fine-tuning on the FRet mixture
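The MRL idea in the list above can be sketched with a toy in-batch contrastive loss summed over truncated embedding prefixes. This is a minimal NumPy illustration under assumed details (the temperature value and exact loss form are not taken from the paper):

```python
import numpy as np

def info_nce(q, p, tau=0.05):
    # In-batch contrastive loss: each query's positive is the passage at
    # the same batch index; all other passages serve as negatives.
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    pn = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = qn @ pn.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

def mrl_loss(q, p, dims=(256, 768)):
    # Matryoshka-style objective: apply the same contrastive loss to
    # truncated prefixes of the embeddings, so a single model yields
    # usable embeddings at multiple sizes (256 and 768 in Gecko).
    return sum(info_nce(q[:, :d], p[:, :d]) for d in dims)
```

Because the loss is enforced on the 256-d prefix as well as the full 768-d vector, the leading dimensions must carry enough signal on their own, which is what makes the truncated embedding competitive.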
Comments
This paper demonstrates that compact embedding models can match or exceed much larger ones when trained with high-quality LLM-distilled data. The key insight is that LLMs can serve as both data generators and data quality refiners for training embedding models.
The FRet dataset creation process is particularly clever: rather than assuming the seed passage is always the best positive example, the method retrieves candidates and lets the LLM pick better ones. This addresses a fundamental limitation of synthetic data generation where the source text may not be the most relevant passage for the generated query.
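The relabeling step might be sketched as follows. Here `score_ql` and `score_rel` are placeholders for the paper's two LLM scoring functions (query likelihood and relevance classification), candidate retrieval is assumed to happen upstream, and the RRF constant is the conventional value rather than one stated in the paper:

```python
def relabel(query, seed, candidates, score_ql, score_rel, k=60):
    """Pick a positive and hard negatives for a generated query.

    Illustrative sketch, not the paper's implementation. Both scoring
    functions take (query, passage) and return higher-is-better floats.
    """
    # The seed passage competes with the retrieved candidates.
    pool = list(dict.fromkeys(candidates + [seed]))
    rank_ql = sorted(pool, key=lambda p: score_ql(query, p), reverse=True)
    rank_rel = sorted(pool, key=lambda p: score_rel(query, p), reverse=True)
    # Reciprocal Rank Fusion of the two rankings.
    fused = {}
    for ranking in (rank_ql, rank_rel):
        for r, p in enumerate(ranking, start=1):
            fused[p] = fused.get(p, 0.0) + 1.0 / (k + r)
    ordered = sorted(pool, key=fused.get, reverse=True)
    positive = ordered[0]          # may differ from the original seed
    hard_negatives = ordered[1:]   # e.g. sample one of these per example
    return positive, hard_negatives
```

The key property is that `positive` is chosen from the fused ranking rather than fixed to the seed, which is how the method ends up swapping in a better positive roughly 15% of the time.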
The zero-shot result is notable: trained only on FRet, with no human-labeled data, the model reaches 62.64 on MTEB, showing that the approach creates a genuinely useful training signal. The work connects to broader themes of self-supervised learning and knowledge distillation from large to small models, similar in spirit to DistilBERT but applied to embedding models rather than general language understanding.
Connections
- Related to Semantic similarity because Gecko is evaluated on semantic textual similarity tasks and achieves state-of-the-art STS performance
- Related to LLM because the method uses LLMs both for generating synthetic training data and for reranking/relabeling passages
- Related to Retrieval augmented generation because the model targets retrieval tasks and embedding-based passage retrieval is a core component of RAG systems
- Related to DistilBERT because both use knowledge distillation to create compact models, though Gecko distills from LLMs into an embedding model rather than compressing a specific architecture
- Related to Self-supervised learning because the pre-finetuning stage uses contrastive learning on large-scale unsupervised text pairs
- Related to BERT because Gecko builds on the lineage of transformer-based text encoders that BERT pioneered
Bibliography
- Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, et al. 2024. "Gecko: Versatile Text Embeddings Distilled from Large Language Models". https://arxiv.org/abs/2403.20327.