Semantic similarity

tags
NLP, Evaluating NLP

N-gram matching

For two sequences $$x$$ and $$\hat{x}$$, we denote the sequence of $n$-grams with $$S_x^n$$ and $$S^n_{\hat{x}}$$. The number of matched $n$-grams between the two sentences is: $\sum_{w \in S_{\hat{x}}^n} \mathbb{I}[w \in S_{x}^n ]$ with $$\mathbb{I}$$ the indicator function.

From this we can construct the exact match precision (Exact-$$P_n$$) and recall (Exact-$$R_n$$): $\text{Exact-}P_n = \frac{\sum_{w \in S_{\hat{x}}^n} \mathbb{I}[w \in S_{x}^n ]}{| S_{\hat{x}}^n|}$ and $\text{Exact-}R_n = \frac{\sum_{w \in S_{x}^n} \mathbb{I}[w \in S_{\hat{x}}^n ]}{| S_{x}^n|}$
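As a minimal sketch of these two definitions, assuming sentences are already tokenized into lists of words (the function names `ngrams`, `exact_precision` and `exact_recall` are illustrative, not taken from any library):

```python
def ngrams(tokens, n):
    """Return the sequence of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exact_precision(reference, candidate, n):
    """Exact-P_n: share of candidate n-grams that also occur in the reference."""
    ref = set(ngrams(reference, n))
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0  # assumption: score 0 when the candidate has no n-grams
    return sum(w in ref for w in cand) / len(cand)


def exact_recall(reference, candidate, n):
    """Exact-R_n: share of reference n-grams that also occur in the candidate."""
    # Recall is precision with the roles of the two sentences swapped.
    return exact_precision(candidate, reference, n)


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(exact_precision(reference, candidate, 2))  # 3 of 5 candidate bigrams match: 0.6
```

Note that, as in the formulas, repeated $n$-grams are counted every time they occur; nothing prevents a single reference $n$-gram from being matched several times.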

Here are some well-known metrics based on the number of $n$-gram matches.

BLEU

The BLEU metric (Papineni et al. 2002) is widely used in NLP, particularly in machine translation. It is based on Exact-$$P_n$$ with some key modifications:

1. $n$-grams in the reference can be matched only once.
2. The number of exact matches is accumulated for all reference-candidate pairs in the corpus and divided by the total number of $n$-grams in all candidate sentences.
3. A brevity penalty discourages very short candidates.

The score is computed with various values of $$n$$ and geometrically averaged.
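The three modifications and the geometric average can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes a single tokenized reference per candidate, uniform weights over $$n = 1, \dots, 4$$, no smoothing, and the brevity penalty $$\exp(1 - r/c)$$ for a candidate corpus of length $$c$$ shorter than the reference length $$r$$. The names `ngram_counts` and `corpus_bleu` are made up for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, candidates, max_n=4):
    """BLEU-style score for parallel lists of tokenized references and candidates."""
    log_precisions = []
    for n in range(1, max_n + 1):
        matched, total = 0, 0
        # Matches and candidate n-gram totals are accumulated over the corpus.
        for ref, cand in zip(references, candidates):
            ref_counts = ngram_counts(ref, n)
            cand_counts = ngram_counts(cand, n)
            # Clip each candidate count by the reference count, so that a
            # reference n-gram can be matched only once.
            matched += sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
            total += sum(cand_counts.values())
        if matched == 0:
            return 0.0  # unsmoothed: one zero precision makes the score zero
        log_precisions.append(math.log(matched / total))

    ref_len = sum(len(r) for r in references)
    cand_len = sum(len(c) for c in candidates)
    # The brevity penalty discourages candidates shorter than the references.
    brevity_penalty = 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)
    # Geometric mean of the n-gram precisions, computed in log space.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

For example, with the reference `the cat is on the mat` and the degenerate candidate `the the the the the the the`, the unigram-only score (`max_n=1`) is about 2/7 rather than 1, because `the` occurs only twice in the reference.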

A smoothed extension of the BLEU metric, SENT-BLEU (Koehn et al. 2007), is computed at the sentence level.

METEOR

METEOR (Banerjee, Lavie 2005) computes Exact-$$P_1$$ and Exact-$$R_1$$ with the possibility to match word stems and synonyms.
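The idea of relaxed unigram matching can be illustrated with a small sketch. It is not the METEOR algorithm itself: the greedy one-to-one alignment is a simplification, and the stemmer `toy_stem` and the synonym table `toy_synonyms` are hypothetical stand-ins for the linguistic resources a real implementation would use.

```python
def relaxed_unigram_pr(reference, candidate, stem, synonyms):
    """Unigram precision and recall with exact, stem and synonym matching stages."""
    stages = [
        lambda a, b: a == b,                                                # exact
        lambda a, b: stem(a) == stem(b),                                    # stems
        lambda a, b: b in synonyms.get(a, ()) or a in synonyms.get(b, ()),  # synonyms
    ]
    free_ref = list(range(len(reference)))    # reference positions not yet matched
    free_cand = list(range(len(candidate)))   # candidate positions not yet matched
    matches = 0
    for same in stages:
        for i in list(free_cand):
            for j in free_ref:
                if same(candidate[i], reference[j]):
                    # Greedy one-to-one alignment: each word is used at most once.
                    free_cand.remove(i)
                    free_ref.remove(j)
                    matches += 1
                    break
    precision = matches / len(candidate) if candidate else 0.0
    recall = matches / len(reference) if reference else 0.0
    return precision, recall


def toy_stem(word):
    """Hypothetical stemmer: only strips a trailing 's'."""
    return word[:-1] if word.endswith("s") else word


toy_synonyms = {"couch": {"sofa"}}  # hypothetical synonym table

reference = "the cats sat on the sofa".split()
candidate = "the cat sat on the couch".split()
print(relaxed_unigram_pr(reference, candidate, toy_stem, toy_synonyms))  # (1.0, 1.0)
```

With exact matching only, the same pair would score 4/6 for both precision and recall, since `cat`/`cats` and `couch`/`sofa` would not match.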

The extension METEOR-1.5 (Denkowski, Lavie 2014) weighs content and function words differently, and also applies importance weighting to different matching types.

More recently, METEOR++ 2.0 adds a learned paraphrase resource to the algorithm. Because of these external resources, the full feature set of METEOR is only available for a few languages.

ROUGE

ROUGE (Lin 2004) is a widely used metric for summarization evaluation.

ROUGE-$$n$$ computes Exact-$$R_n$$ (usually $$n = 1, 2$$), while ROUGE-$$L$$ is a variant of Exact-$$R_1$$ with the numerator replaced by the length of the longest common subsequence.
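A minimal sketch of the recall form of ROUGE-$$L$$ described above, assuming tokenized sentences (the names `lcs_length` and `rouge_l_recall` are illustrative; published ROUGE-$$L$$ scores often also report precision and an F-measure, which are left out here):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference, candidate):
    """LCS length divided by the number of reference unigrams."""
    return lcs_length(reference, candidate) / len(reference) if reference else 0.0


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(lcs_length(reference, candidate))  # 5 ("the cat on the mat")
```

Unlike ROUGE-$$n$$, the longest common subsequence rewards words that appear in the same order without requiring them to be contiguous.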

Embedding-based metrics

Universal sentence encoder

The Universal Sentence Encoder (USE) (Cer et al. 2018) embeds a sentence into a single vector. The distance between two sentence embeddings (usually the cosine distance, since the embeddings are normalized) can then be used as a proxy for semantic similarity.
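The comparison step can be sketched independently of the encoder. In the example below, the short vectors are made-up stand-ins for sentence embeddings (no model is loaded), and the helper names are illustrative. It also checks why normalization ties Euclidean and cosine distance together: for unit vectors $$u$$ and $$v$$, $$\|u - v\|^2 = 2 - 2\cos(u, v)$$.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def normalize(u):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]


# Toy vectors standing in for two sentence embeddings (assumption).
emb_a = [0.2, 0.7, 0.1]
emb_b = [0.3, 0.6, 0.4]

cos = cosine_similarity(emb_a, emb_b)
cosine_distance = 1.0 - cos

# For unit-length vectors, squared Euclidean distance equals 2 - 2 * cosine.
ua, ub = normalize(emb_a), normalize(emb_b)
squared_euclidean = sum((a - b) ** 2 for a, b in zip(ua, ub))
print(abs(squared_euclidean - (2 - 2 * cos)) < 1e-9)  # True
```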

BERTScore

BERTScore (Zhang et al. 2020) computes the pairwise similarity between the BERT embeddings of the words in the two sentences and aggregates these similarities into a single score.
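One way to aggregate the pairwise similarities is greedy matching: each token is paired with its most similar token in the other sentence, and the maxima are averaged into a recall (over reference tokens) and a precision (over candidate tokens). The sketch below assumes this aggregation and uses made-up toy vectors in place of contextual BERT embeddings; it leaves out refinements such as importance weighting of tokens. The function name `greedy_match_score` is illustrative.

```python
import math


def _unit(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def greedy_match_score(ref_vectors, cand_vectors):
    """Precision, recall and F1 from greedy matching of token embeddings."""
    ref = [_unit(v) for v in ref_vectors]
    cand = [_unit(v) for v in cand_vectors]
    # Pairwise cosine similarities: rows are reference tokens, columns candidate tokens.
    sim = [[sum(a * b for a, b in zip(r, c)) for c in cand] for r in ref]
    recall = sum(max(row) for row in sim) / len(ref)
    precision = sum(max(col) for col in zip(*sim)) / len(cand)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy token embeddings (assumption): two reference tokens, one candidate token.
reference_vectors = [[1.0, 0.0], [0.0, 1.0]]
candidate_vectors = [[2.0, 0.0]]
precision, recall, f1 = greedy_match_score(reference_vectors, candidate_vectors)
print(precision, recall)  # 1.0 0.5
```

Here the single candidate token matches the first reference token perfectly, so precision is 1, while the second reference token has no similar counterpart, which halves the recall.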

Bibliography

1. Papineni et al. 2002. "Bleu: A Method for Automatic Evaluation of Machine Translation". In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–18.
2. Koehn et al. 2007. "Moses: Open Source Toolkit for Statistical Machine Translation". In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic, edited by John A. Carroll, Antal van den Bosch, and Annie Zaenen. The Association for Computational Linguistics. https://aclanthology.org/P07-2045/.
3. Banerjee, Lavie. 2005. "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments". In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation And/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, edited by Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare R. Voss, 65–72. Association for Computational Linguistics. https://aclanthology.org/W05-0909/.
4. Denkowski, Lavie. 2014. "Meteor Universal: Language Specific Translation Evaluation for Any Target Language". In Proceedings of the Ninth Workshop on Statistical Machine Translation, WMT@ACL 2014, June 26-27, 2014, Baltimore, Maryland, USA, 376–80. The Association for Computer Linguistics. DOI.
5. Lin. 2004. "ROUGE: A Package for Automatic Evaluation of Summaries". In Text Summarization Branches Out, 74–81. Barcelona, Spain: Association for Computational Linguistics. https://aclanthology.org/W04-1013.
6. Cer et al. 2018. "Universal Sentence Encoder". CoRR abs/1803.11175. http://arxiv.org/abs/1803.11175.
7. Zhang et al. 2020. "BERTScore: Evaluating Text Generation with BERT". arXiv. DOI.