- tags
  - NLP, LLM, Semantic similarity
- source
  - (Lee et al. 2024)
Summary
Gecko is a compact text embedding model (1.2B parameters) from Google DeepMind that achieves strong retrieval performance by distilling knowledge from LLMs into a retriever. The key innovation is a two-step LLM distillation process that produces FRet (the Few-shot Prompted Retrieval dataset). First, an LLM generates diverse synthetic task descriptions and queries from sampled web passages. Second, the LLM refines data quality by reranking retrieved candidate passages using two scoring functions (query likelihood and relevance classification), selecting better positive and hard negative passages than the original seed passages.
On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing 768-dimension models. The 768-dimension version achieves an average score of 66.31, competitive with 7x-larger 7B-parameter models that use 3-4k-dimensional embeddings. The model is trained on a unified fine-tuning mixture that combines FRet data with academic datasets (Natural Questions, HotpotQA, FEVER, etc.), all cast into a consistent format of task description, query, positive passage, and negative passage.
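The unified fine-tuning format can be pictured as a small sketch. The field names and example strings below are illustrative assumptions, not the paper's actual schema; the paper only specifies that each example carries a task description, query, positive passage, and negative passage.

```python
from dataclasses import dataclass

@dataclass
class FRetExample:
    """One record in the unified fine-tuning mixture (field names assumed)."""
    task: str       # LLM-generated task description
    query: str      # LLM-generated query for the task
    positive: str   # LLM-relabeled positive passage
    negative: str   # LLM-mined hard negative passage

# A hypothetical record for illustration only.
example = FRetExample(
    task="Given a question, retrieve a passage that answers it",
    query="what is matryoshka representation learning",
    positive="Matryoshka Representation Learning trains nested embeddings...",
    negative="Russian matryoshka dolls are wooden nesting dolls...",
)
```

Keeping synthetic FRet data and human-annotated academic data in one shared record shape is what lets the paper train on both with a single objective.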
Key Ideas
- Two-step LLM distillation via FRet: (1) LLM-based diverse query generation from web passages, (2) LLM-based positive/negative mining and relabeling
- LLM relabeling discovers that the best positive passage for a generated query differs from the original seed passage ~15% of the time
- Two LLM ranking functions combined via Reciprocal Rank Fusion: query likelihood and relevance classification
- Unified fine-tuning format combining synthetic FRet data with human-annotated academic datasets
- Same-tower negatives for symmetric tasks (e.g., semantic similarity) improve STS performance
- Matryoshka Representation Learning (MRL) loss enables multiple embedding dimensions (256, 768) from a single model
- Pre-finetuning on large-scale unsupervised text pairs before fine-tuning on the FRet mixture
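The MRL idea in the list above can be sketched with a toy in-batch contrastive loss summed over truncated embedding prefixes. This is a minimal NumPy illustration under assumed details (the temperature value and exact loss form are not taken from the paper):

```python
import numpy as np

def info_nce(q, p, tau=0.05):
    # In-batch contrastive loss: each query's positive is the passage at
    # the same batch index; all other passages serve as negatives.
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    pn = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = qn @ pn.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

def mrl_loss(q, p, dims=(256, 768)):
    # Matryoshka-style objective: apply the same contrastive loss to
    # truncated prefixes of the embeddings, so a single model yields
    # usable embeddings at multiple sizes (256 and 768 in Gecko).
    return sum(info_nce(q[:, :d], p[:, :d]) for d in dims)
```

Because the loss is enforced on the 256-d prefix as well as the full 768-d vector, the leading dimensions must carry enough signal on their own, which is what makes the truncated embedding competitive.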
Comments
This paper demonstrates that compact embedding models can match or exceed much larger ones when trained with high-quality LLM-distilled data. The key insight is that LLMs can serve as both data generators and data quality refiners for training embedding models.
The FRet dataset creation process is particularly clever: rather than assuming the seed passage is always the best positive example, the method retrieves candidates and lets the LLM pick better ones. This addresses a fundamental limitation of synthetic data generation where the source text may not be the most relevant passage for the generated query.
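The relabeling step might be sketched as follows. Here `score_ql` and `score_rel` are placeholders for the paper's two LLM scoring functions (query likelihood and relevance classification), candidate retrieval is assumed to happen upstream, and the RRF constant is the conventional value rather than one stated in the paper:

```python
def relabel(query, seed, candidates, score_ql, score_rel, k=60):
    """Pick a positive and hard negatives for a generated query.

    Illustrative sketch, not the paper's implementation. Both scoring
    functions take (query, passage) and return higher-is-better floats.
    """
    # The seed passage competes with the retrieved candidates.
    pool = list(dict.fromkeys(candidates + [seed]))
    rank_ql = sorted(pool, key=lambda p: score_ql(query, p), reverse=True)
    rank_rel = sorted(pool, key=lambda p: score_rel(query, p), reverse=True)
    # Reciprocal Rank Fusion of the two rankings.
    fused = {}
    for ranking in (rank_ql, rank_rel):
        for r, p in enumerate(ranking, start=1):
            fused[p] = fused.get(p, 0.0) + 1.0 / (k + r)
    ordered = sorted(pool, key=fused.get, reverse=True)
    positive = ordered[0]          # may differ from the original seed
    hard_negatives = ordered[1:]   # e.g. sample one of these per example
    return positive, hard_negatives
```

The key property is that `positive` is chosen from the fused ranking rather than fixed to the seed, which is how the method ends up swapping in a better positive roughly 15% of the time.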
The zero-shot result is notable: trained only on FRet, with no human-labeled data, the model reaches 62.64 on MTEB, showing that the approach creates a genuinely useful training signal. The work connects to broader themes of self-supervised learning and knowledge distillation from large to small models, similar in spirit to DistilBERT but applied to embedding models rather than general language understanding.
Connections
- Related to Semantic similarity because Gecko is evaluated on semantic textual similarity tasks and achieves state-of-the-art STS performance
- Related to LLM because the method uses LLMs both for generating synthetic training data and for reranking/relabeling passages
- Related to Retrieval augmented generation because the model targets retrieval tasks and embedding-based passage retrieval is a core component of RAG systems
- Related to DistilBERT because both use knowledge distillation to create compact models, though Gecko distills from LLMs into an embedding model rather than compressing a specific architecture
- Related to Self-supervised learning because the pre-finetuning stage uses contrastive learning on large-scale unsupervised text pairs
- Related to BERT because Gecko builds on the lineage of transformer-based text encoders that BERT pioneered
Bibliography
- Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, et al. 2024. "Gecko: Versatile Text Embeddings Distilled from Large Language Models". https://arxiv.org/abs/2403.20327.