# Semantic similarity

tags
NLP, Evaluating NLP

## N-gram matching

For a reference sequence $$x$$ and a candidate sequence $$\hat{x}$$, we denote their sequences of $n$-grams by $$S_x^n$$ and $$S^n_{\hat{x}}$$. The number of candidate $n$-grams that also occur in the reference is $\sum_{w \in S_{\hat{x}}^n} \mathbb{I}[w \in S_{x}^n ]$, where $$\mathbb{I}$$ is the indicator function.

From this we can construct the exact match precision (Exact-$$P_n$$) and recall (Exact-$$R_n$$): $\text{Exact-}P_n = \frac{\sum_{w \in S_{\hat{x}}^n} \mathbb{I}[w \in S_{x}^n ]}{| S_{\hat{x}}^n|}$ and $\text{Exact-}R_n = \frac{\sum_{w \in S_{x}^n} \mathbb{I}[w \in S_{\hat{x}}^n ]}{| S_{x}^n|}$. Precision is normalized by the number of $n$-grams in $$\hat{x}$$, recall by the number of $n$-grams in $$x$$.
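The two definitions can be sketched directly in Python. This is a minimal illustration, not a reference implementation: it assumes whitespace-tokenized input, treats $$x$$ as the reference and $$\hat{x}$$ as the candidate, and the function names (`ngrams`, `exact_precision`, `exact_recall`) are made up for this note.

```python
def ngrams(tokens, n):
    """Return the sequence of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exact_precision(reference, candidate, n):
    """Exact-P_n: fraction of candidate n-grams that occur in the reference."""
    cand = ngrams(candidate, n)
    ref = set(ngrams(reference, n))
    if not cand:
        return 0.0  # convention chosen here for sequences shorter than n
    return sum(1 for w in cand if w in ref) / len(cand)


def exact_recall(reference, candidate, n):
    """Exact-R_n: fraction of reference n-grams that occur in the candidate."""
    ref = ngrams(reference, n)
    cand = set(ngrams(candidate, n))
    if not ref:
        return 0.0  # same convention as above
    return sum(1 for w in ref if w in cand) / len(ref)


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

print(exact_precision(reference, candidate, 1))  # 5 of 6 unigrams match
print(exact_precision(reference, candidate, 2))  # 3 of 5 bigrams match
```

Note that the indicator function counts every occurrence of a repeated $n$-gram, so a candidate that repeats one matching word many times still gets a high precision under this definition.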

Here are some well-known metrics based on the number of $n$-gram matches.

### BLEU

The BLEU metric (Papineni et al. 2002) is very widely used in NLP, particularly in machine translation. It is based on Exact-$$P_n$$ with some key modifications:

1. $n$-grams in the reference can be matched only once.
2. The number of exact matches is accumulated for all reference-candidate pairs in the corpus and divided by the total number of $n$-grams in all candidate sentences.
3. A brevity penalty discourages very short candidates.

The score is computed with various values of $$n$$ and geometrically averaged.
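The three modifications and the geometric average can be sketched as follows. This is a simplified illustration under stated assumptions: one reference per candidate, uniform weights over $$n = 1, \dots, 4$$, no smoothing, and pre-tokenized input. The function names are hypothetical, and scores from established BLEU tools may differ because of tokenization and other details.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count each n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu_sketch(references, candidates, max_n=4):
    """Corpus-level BLEU sketch with a single reference per candidate."""
    matches = [0] * max_n  # clipped matches, accumulated over the corpus
    totals = [0] * max_n   # candidate n-gram counts, accumulated over the corpus
    ref_len = cand_len = 0

    for ref, cand in zip(references, candidates):
        ref_len += len(ref)
        cand_len += len(cand)
        for n in range(1, max_n + 1):
            ref_counts = ngram_counts(ref, n)
            cand_counts = ngram_counts(cand, n)
            # Modification 1: a reference n-gram can be matched only once,
            # so each candidate count is clipped by the reference count.
            matches[n - 1] += sum(
                min(count, ref_counts[w]) for w, count in cand_counts.items()
            )
            # Modification 2: normalize by all candidate n-grams in the corpus.
            totals[n - 1] += sum(cand_counts.values())

    if cand_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score

    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Modification 3: brevity penalty for candidates shorter than the references.
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * math.exp(log_precision)


refs = ["the cat sat on the mat".split()]
print(corpus_bleu_sketch(refs, refs))                      # identical text
print(corpus_bleu_sketch(refs, [["the"] * 6], max_n=1))    # clipping in action
```

In the second call, the candidate repeats "the" six times, but the reference contains it only twice, so the clipped unigram precision is 2/6 rather than 6/6.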

A smoothed extension of the BLEU metric, SENT-BLEU (Koehn et al. 2007), is computed at the sentence level.

### METEOR

METEOR (Banerjee, Lavie 2005) computes Exact-$$P_1$$ and Exact-$$R_1$$ with the possibility to match word stems and synonyms.

The extension METEOR-1.5 (Denkowski, Lavie 2014) weighs content and function words differently, and also applies importance weighting to different matching types.

More recently, METEOR++ 2.0 adds a learned paraphrase resource to the algorithm. Because of these external resources, the full feature set of METEOR is only available for a few languages.

### ROUGE

ROUGE (Lin 2004) is a widely used metric for summarization evaluation.

ROUGE-$$n$$ computes Exact-$$R_n$$ (usually $$n = 1, 2$$), while ROUGE-$$L$$ is a variant of Exact-$$R_1$$ with the numerator replaced by the length of the longest common subsequence.
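The recall form of ROUGE-$$L$$ described above can be sketched with a standard dynamic-programming longest common subsequence. This is an illustration only, assuming pre-tokenized input and a single reference; the function names are hypothetical, and the ROUGE package itself offers further variants (such as an F-measure) that are not shown.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference, candidate):
    """LCS length divided by the number of reference tokens."""
    if not reference:
        return 0.0  # convention chosen here for an empty reference
    return lcs_length(reference, candidate) / len(reference)


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(rouge_l_recall(reference, candidate))  # LCS is "the cat on the mat"
```

Unlike Exact-$$R_1$$, the subsequence must respect word order, so a candidate containing the right words in a scrambled order scores lower.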

## Embedding-based metrics

### Universal sentence encoder

The Universal Sentence Encoder (USE) (Cer et al. 2018) embeds a sentence into a single vector. The distance between two sentence embeddings (usually the cosine distance, since the embeddings are typically normalized) can then be used as a proxy for semantic similarity.
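The comparison step can be sketched independently of the encoder. The vectors below are made-up placeholders rather than real USE outputs, and the function names are hypothetical; in practice the embeddings would come from whichever sentence encoder is in use.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """One minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Placeholder "sentence embeddings" (illustrative values only).
emb_a = [0.2, 0.7, 0.1]
emb_b = [0.4, 1.4, 0.2]   # same direction as emb_a
emb_c = [0.7, -0.2, 0.0]  # orthogonal to emb_a

print(cosine_distance(emb_a, emb_b))
print(cosine_distance(emb_a, emb_c))
```

For unit-length vectors the cosine similarity reduces to a plain dot product, which is why normalized embeddings are convenient to compare.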

### BERTScore

BERTScore (Zhang et al. 2020) computes pairwise similarities between the BERT embeddings of the tokens in the two sentences and aggregates them into a single score.
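One way to aggregate the pairwise similarities is greedy matching, as described by Zhang et al. (2020): each token is matched to its most similar token in the other sentence, and the maxima are averaged into a recall and a precision. The sketch below assumes the token embeddings are already available as plain lists of floats (the toy vectors are not real BERT outputs), omits the optional importance weighting and rescaling from the paper, and uses a hypothetical function name.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def greedy_match_scores(ref_embeddings, cand_embeddings):
    """Return (precision, recall, F1) from greedy token matching."""
    ref = [_normalize(v) for v in ref_embeddings]
    cand = [_normalize(v) for v in cand_embeddings]

    # Pairwise cosine similarities: sim[i][j] compares ref token i, cand token j.
    sim = [[sum(a * b for a, b in zip(r, c)) for c in cand] for r in ref]

    # Recall: each reference token takes its best match in the candidate.
    recall = sum(max(row) for row in sim) / len(ref)
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(
        max(sim[i][j] for i in range(len(ref))) for j in range(len(cand))
    ) / len(cand)

    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy token embeddings (illustrative values only).
ref_tokens = [[1.0, 0.0], [0.0, 1.0]]
cand_tokens = [[2.0, 0.0]]
print(greedy_match_scores(ref_tokens, cand_tokens))
```

Here the single candidate token matches the first reference token perfectly, so precision is 1, while the second reference token has no good match, which lowers recall to 0.5.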

## Bibliography

1. Papineni et al. 2002. "Bleu: A Method for Automatic Evaluation of Machine Translation". In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–18.
2. Koehn et al. 2007. "Moses: Open Source Toolkit for Statistical Machine Translation". In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic, edited by John A. Carroll, Antal van den Bosch, and Annie Zaenen. The Association for Computational Linguistics. https://aclanthology.org/P07-2045/.
3. Banerjee, Lavie. 2005. "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments". In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, edited by Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare R. Voss, 65–72. Association for Computational Linguistics. https://aclanthology.org/W05-0909/.
4. Denkowski, Lavie. 2014. "Meteor Universal: Language Specific Translation Evaluation for Any Target Language". In Proceedings of the Ninth Workshop on Statistical Machine Translation, WMT@ACL 2014, June 26-27, 2014, Baltimore, Maryland, USA, 376–80. The Association for Computer Linguistics.
5. Lin. 2004. "ROUGE: A Package for Automatic Evaluation of Summaries". In Text Summarization Branches Out, 74–81. Barcelona, Spain: Association for Computational Linguistics. https://aclanthology.org/W04-1013.
6. Cer et al. 2018. "Universal Sentence Encoder". CoRR abs/1803.11175. http://arxiv.org/abs/1803.11175.
7. Zhang et al. 2020. "BERTScore: Evaluating Text Generation with BERT". arXiv.