- tags
- NLP, Evaluating NLP

## N-gram matching

For a reference sequence \(x\) and a candidate sequence \(\hat{x}\), we denote the sequences of \(n\)-grams by \(S_x^n\) and \(S^n_{\hat{x}}\). The number of matched \(n\)-grams between the two sentences is: \[ \sum_{w \in S_{\hat{x}}^n} \mathbb{I}[w \in S_{x}^n ] \] with \(\mathbb{I}\) the indicator function.

From this we can construct the exact match precision (Exact-\(P_n\)) and recall (Exact-\(R_n\)): \[ \text{Exact-}P_n = \frac{\sum_{w \in S_{\hat{x}}^n} \mathbb{I}[w \in S_{x}^n ]}{| S_{\hat{x}}^n|} \] and \[ \text{Exact-}R_n = \frac{\sum_{w \in S_{x}^n} \mathbb{I}[w \in S_{\hat{x}}^n ]}{| S_{x}^n|} \]
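The two formulas can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace-tokenized sentences, with \(x\) as the reference and \(\hat{x}\) as the candidate; the function names are chosen here for illustration and do not come from any library.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return the sequence of n-grams of a token list (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exact_precision(reference: List[str], candidate: List[str], n: int) -> float:
    """Fraction of candidate n-grams that also occur in the reference."""
    ref_set = set(ngrams(reference, n))
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return sum(1 for w in cand if w in ref_set) / len(cand)


def exact_recall(reference: List[str], candidate: List[str], n: int) -> float:
    """Fraction of reference n-grams that also occur in the candidate."""
    cand_set = set(ngrams(candidate, n))
    ref = ngrams(reference, n)
    if not ref:
        return 0.0
    return sum(1 for w in ref if w in cand_set) / len(ref)


reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(exact_precision(reference, candidate, 1))  # 5 of 6 unigrams match
print(exact_precision(reference, candidate, 2))  # 3 of 5 bigrams match
```

Note that the indicator only tests membership, so a repeated candidate \(n\)-gram is counted every time it appears, even if the reference contains it once.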

Here are some well-known metrics based on the number of \(n\)-gram matches.

### BLEU

The **BLEU** metric (Papineni et al. 2002) is very widely used in NLP, particularly in machine translation. It is based on Exact-\(P_n\) with some key modifications:

- $n$-grams in the reference can be matched only once.
- The number of exact matches is accumulated for all reference-candidate pairs in the corpus and divided by the total number of $n$-grams in all candidate sentences.
- A brevity penalty discourages very short candidates.

The score is computed with various values of \(n\) and geometrically averaged.
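The three modifications and the geometric average can be put together in a short sketch. It is not a reference implementation: it assumes a single reference per candidate, pre-tokenized input, uniform weights over \(n = 1, \dots, 4\), and the brevity penalty \(\exp(1 - r/c)\) from Papineni et al. (2002), none of which are spelled out in this note. The name `corpus_bleu` is chosen here for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count each n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, candidates, max_n=4):
    """Corpus-level BLEU sketch with one reference per candidate."""
    log_precisions = []
    for n in range(1, max_n + 1):
        matched, total = 0, 0
        # Accumulate matches and candidate n-gram totals over the whole corpus.
        for ref, cand in zip(references, candidates):
            ref_counts = ngram_counts(ref, n)
            cand_counts = ngram_counts(cand, n)
            # Clipping: a reference n-gram can be matched only as often as it occurs.
            matched += sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
            total += sum(cand_counts.values())
        if matched == 0:
            return 0.0  # unsmoothed: any zero precision gives a zero score
        log_precisions.append(math.log(matched / total))
    ref_len = sum(len(r) for r in references)
    cand_len = sum(len(c) for c in candidates)
    # Brevity penalty: 1 if the candidates are longer than the references.
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


refs = ["the cat sat on the mat".split()]
print(corpus_bleu(refs, ["the cat sat on the mat".split()]))
print(corpus_bleu(refs, ["the cat".split()], max_n=1))
```

The early return on a zero precision is what makes unsmoothed BLEU unreliable on single sentences, where higher-order \(n\)-grams often have no match at all.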

A smoothed extension of **BLEU**, **SENT-BLEU** (Koehn et al. 2007), is computed at the sentence level.

### METEOR

**METEOR** (Banerjee, Lavie 2005) computes Exact-\(P_1\) and Exact-\(R_1\) with the possibility to match word stems and synonyms.
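To illustrate what relaxed unigram matching means, here is a toy sketch. It is a simplification and not the METEOR alignment algorithm: `toy_stem` and `TOY_SYNONYMS` are hypothetical stand-ins for a real stemmer and a real synonym resource, and the matching is a simple greedy one-to-one pass.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical synonym table; a real system would use a lexical resource.
TOY_SYNONYMS = {frozenset({"couch", "sofa"})}


def words_match(a, b):
    """Two words match if they are identical, share a stem, or are synonyms."""
    return a == b or toy_stem(a) == toy_stem(b) or frozenset({a, b}) in TOY_SYNONYMS


def relaxed_unigram_pr(reference, candidate):
    """Unigram precision and recall under relaxed, one-to-one matching."""
    unused = list(reference)  # reference words not yet matched
    matched = 0
    for w in candidate:
        for r in unused:
            if words_match(w, r):
                unused.remove(r)  # each reference word is used at most once
                matched += 1
                break
    precision = matched / len(candidate) if candidate else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall


reference = "the cats sat on the sofa".split()
candidate = "the cat sits on the couch".split()
print(relaxed_unigram_pr(reference, candidate))
```

With exact matching only three of the six candidate words would match here; the stem match (cat, cats) and the synonym match (couch, sofa) raise that to five.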

The extension **METEOR-1.5** (Denkowski, Lavie 2014) weighs content and function words differently, and also applies importance weighting to different matching types.

More recently, **METEOR++ 2.0** added a learned paraphrase resource to the algorithm. Because of these external resources, the full feature set of **METEOR** is only available for a few languages.

### ROUGE

**ROUGE** (Lin 2004) is a widely used metric for summarization evaluation.

**ROUGE-\(n\)** computes Exact-\(R_n\) (usually \(n = 1, 2\)), while **ROUGE-\(L\)** is a variant of Exact-\(R_1\) with the numerator replaced by the length of the longest common subsequence.
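The ROUGE-\(L\) recall described above can be sketched with a standard dynamic program for the longest common subsequence (LCS). This is a minimal sketch on pre-tokenized input; the function names are illustrative and it covers only the recall side.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference, candidate):
    """LCS length divided by the number of reference tokens."""
    if not reference:
        return 0.0
    return lcs_length(reference, candidate) / len(reference)


reference = "police killed the gunman".split()
candidate = "the gunman police killed".split()
print(rouge_l_recall(reference, candidate))  # LCS has length 2, recall is 0.5
```

In this example every reference word appears in the candidate, so Exact-\(R_1\) would be 1, while the LCS-based recall is only 0.5 because the word order differs.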

## Edit-distance-based metrics

## Embedding-based metrics

### Universal sentence encoder

The Universal Sentence Encoder (USE) (Cer et al. 2018) embeds a sentence into a single vector. The distance between two sentence embeddings (usually the cosine distance, since the embeddings are normalized) can then be used as a proxy for semantic similarity.
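The distance computation itself is simple once the embeddings are available. The sketch below uses small made-up vectors in place of real USE outputs (loading the actual encoder is left out on purpose), and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """One minus the cosine similarity; 0 for vectors pointing the same way."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical 3-dimensional "sentence embeddings", for illustration only.
emb_a = [0.2, 0.9, 0.1]
emb_b = [0.25, 0.85, 0.05]
emb_c = [0.9, -0.1, 0.3]
print(cosine_distance(emb_a, emb_b))  # small: the vectors are nearly parallel
print(cosine_distance(emb_a, emb_c))  # larger: the vectors point elsewhere
```

For unit-length vectors the cosine similarity reduces to a plain dot product.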

## Learned metrics

## BERTScore

The goal of this metric is to compute and aggregate the pairwise similarities between the BERT embeddings of the words in the candidate and reference sentences (Zhang et al. 2020).
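One way to picture the aggregation is the greedy matching used in Zhang et al. (2020): each token is paired with its most similar token in the other sentence, and the resulting maxima are averaged. The sketch below assumes that scheme (this note does not spell it out) and uses made-up 2-dimensional vectors instead of real BERT embeddings; it omits the optional importance weighting and baseline rescaling from the paper.

```python
import math


def _normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def greedy_match_scores(ref_vecs, cand_vecs):
    """Precision, recall and F1 from greedy max-similarity token matching."""
    ref = [_normalize(v) for v in ref_vecs]
    cand = [_normalize(v) for v in cand_vecs]
    # sim[i][j] is the cosine similarity of reference token i and candidate token j.
    sim = [[sum(a * b for a, b in zip(r, c)) for c in cand] for r in ref]
    # Recall: each reference token takes its best candidate match.
    recall = sum(max(row) for row in sim) / len(ref)
    # Precision: each candidate token takes its best reference match.
    precision = sum(
        max(sim[i][j] for i in range(len(ref))) for j in range(len(cand))
    ) / len(cand)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical token embeddings, for illustration only.
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
cand_vecs = [[1.0, 0.0]]
print(greedy_match_scores(ref_vecs, cand_vecs))
```

Here the single candidate token matches one reference token perfectly, giving a precision of 1, while the second reference token has no similar counterpart, which halves the recall.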

## Bibliography

- Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. 2002. "Bleu: A Method for Automatic Evaluation of Machine Translation". In *Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics*, 311–18.
- Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, et al. 2007. "Moses: Open Source Toolkit for Statistical Machine Translation". In *ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic*, edited by John A. Carroll, Antal van den Bosch, and Annie Zaenen. The Association for Computational Linguistics. https://aclanthology.org/P07-2045/.
- Satanjeev Banerjee, Alon Lavie. 2005. "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments". In *Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation And/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005*, edited by Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare R. Voss, 65–72. Association for Computational Linguistics. https://aclanthology.org/W05-0909/.
- Michael J. Denkowski, Alon Lavie. 2014. "Meteor Universal: Language Specific Translation Evaluation for Any Target Language". In *Proceedings of the Ninth Workshop on Statistical Machine Translation, WMT@ACL 2014, June 26-27, 2014, Baltimore, Maryland, USA*, 376–80. The Association for Computer Linguistics. DOI.
- Chin-Yew Lin. 2004. "ROUGE: A Package for Automatic Evaluation of Summaries". In *Text Summarization Branches Out*, 74–81. Barcelona, Spain: Association for Computational Linguistics. https://aclanthology.org/W04-1013.
- Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, et al. 2018. "Universal Sentence Encoder". *CoRR* abs/1803.11175. http://arxiv.org/abs/1803.11175.
- Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi. 2020. "BERTScore: Evaluating Text Generation with BERT". arXiv. DOI.