- tags
- NLP, Evaluating NLP
N-gram matching
For two sequences \(x\) and \(\hat{x}\), we denote their sequences of \(n\)-grams by \(S_x^n\) and \(S^n_{\hat{x}}\). The number of matched \(n\)-grams between the two sequences is: \[ \sum_{w \in S_{\hat{x}}^n} \mathbb{I}[w \in S_{x}^n ] \] with \(\mathbb{I}\) the indicator function.
From this we can construct the exact match precision (Exact-\(P_n\)) and recall (Exact-\(R_n\)): \[ \text{Exact-}P_n = \frac{\sum_{w \in S_{\hat{x}}^n} \mathbb{I}[w \in S_{x}^n ]}{| S_{\hat{x}}^n|} \] and \[ \text{Exact-}R_n = \frac{\sum_{w \in S_{x}^n} \mathbb{I}[w \in S_{\hat{x}}^n ]}{| S_{x}^n|} \]
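A minimal sketch of these two definitions (assuming whitespace tokenization; the function names are illustrative):

```python
def ngrams(tokens, n):
    """Sequence of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def exact_precision_recall(candidate, reference, n):
    """Exact-P_n and Exact-R_n as defined above (indicator-based matching)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    ref_set, cand_set = set(ref), set(cand)
    precision = sum(w in ref_set for w in cand) / len(cand)
    recall = sum(w in cand_set for w in ref) / len(ref)
    return precision, recall

print(exact_precision_recall("the cat sat", "the cat sat down", n=2))
# (1.0, 0.6666666666666666)
```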
Here are some well-known metrics based on the number of \(n\)-gram matches.
BLEU
The BLEU metric (Papineni et al. 2002) is very widely used in NLP, particularly in machine translation. It is based on Exact-\(P_n\) with some key modifications:
- \(n\)-grams in the reference can be matched at most once (the counts are clipped).
- The number of exact matches is accumulated over all reference-candidate pairs in the corpus and divided by the total number of \(n\)-grams in all candidate sentences.
- A brevity penalty discourages very short candidates.
The score is computed for several values of \(n\) (typically \(n = 1, \dots, 4\)) and the resulting precisions are geometrically averaged.
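Putting these rules together, here is a compact corpus-level BLEU sketch (assuming pre-tokenized inputs, a single reference per candidate, and uniform weights over \(n \le 4\)):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(candidates, references, max_n=4):
    """BLEU: clipped n-gram matches accumulated over the corpus,
    geometric mean over n = 1..max_n, times a brevity penalty."""
    matched, total = Counter(), Counter()
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len, ref_len = cand_len + len(cand), ref_len + len(ref)
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            # each reference n-gram can be matched at most once: clip the counts
            matched[n] += sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
            total[n] += max(len(cand) - n + 1, 0)
    log_prec = sum(math.log(matched[n] / total[n]) for n in range(1, max_n + 1))
    # brevity penalty discourages very short candidates
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return bp * math.exp(log_prec / max_n)

print(corpus_bleu([["the", "cat", "sat", "on", "the", "mat"]],
                  [["the", "cat", "sat", "on", "the", "mat"]]))  # 1.0
```

Note that `math.log` fails as soon as one of the \(n\)-gram precisions is zero, which is one motivation for smoothed sentence-level variants.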
A smoothed extension of the BLEU metric, SENT-BLEU (Koehn et al. 2007), is computed at the sentence level.
METEOR
METEOR (Banerjee, Lavie 2005) computes Exact-\(P_1\) and Exact-\(R_1\), additionally allowing matches on word stems and synonyms.
The extension METEOR-1.5 (Denkowski, Lavie 2014) weighs content and function words differently, and also applies importance weighting to different matching types.
More recently, METEOR++ 2.0 adds a learned paraphrase resource to the algorithm. Because of these external resources, the full feature set of METEOR is only available for a few languages.
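For reference, the original scoring step combines unigram precision and recall into a recall-weighted harmonic mean and applies a fragmentation penalty. A sketch of that arithmetic (exact matching only; the alignment step producing the chunk statistics, as well as stem and synonym matching, is not shown):

```python
def meteor_score(precision, recall, n_chunks, n_matches):
    """METEOR scoring (Banerjee, Lavie 2005) given unigram precision/recall
    and the chunk statistics produced by the alignment step."""
    if n_matches == 0:
        return 0.0
    # harmonic mean weighted 9:1 toward recall
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # fragmentation penalty: fewer, longer runs of contiguous matches are better
    penalty = 0.5 * (n_chunks / n_matches) ** 3
    return f_mean * (1 - penalty)
```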
ROUGE
ROUGE (Lin 2004) is a widely used metric for summarization evaluation.
ROUGE-\(n\) computes Exact-\(R_n\) (usually \(n = 1, 2\)), while ROUGE-\(L\) is a variant of Exact-\(R_1\) with the numerator replaced by the length of the longest common subsequence.
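A minimal sketch of the recall variant of ROUGE-\(L\), with the LCS length computed by dynamic programming (whitespace tokenization assumed):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_recall(candidate, reference):
    ref = reference.split()
    return lcs_length(candidate.split(), ref) / len(ref)

print(rouge_l_recall("the cat was found under the bed", "the cat was under the bed"))
# 1.0, since the reference is a subsequence of the candidate
```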
Edit-distance-based metrics
Embedding-based metrics
Universal sentence encoder
The Universal Sentence Encoder (USE) (Cer et al. 2018) embeds a sentence into a single vector. The distance between two sentence embeddings (usually cosine distance, since the embeddings are normalized) can then be used as a proxy for semantic similarity.
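The similarity computation itself is straightforward; a sketch with stand-in vectors (in practice the embeddings would come from the encoder, e.g. via its published TensorFlow Hub module, which is not shown here):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two sentence embeddings; reduces to a
    dot product when both vectors are unit-normalized."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# stand-in embeddings; the published USE models output 512-dimensional vectors
rng = np.random.default_rng(0)
u, v = rng.normal(size=512), rng.normal(size=512)
print(cosine_similarity(u, v))
```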
Learned metrics
BERTScore
BERTScore (Zhang et al. 2020) computes the pairwise cosine similarities between the contextual BERT embeddings of the tokens in the candidate and in the reference, then aggregates them by greedily matching each token to its most similar counterpart, yielding precision, recall, and F1 variants.
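A sketch of that aggregation, assuming the token embeddings are already computed and unit-normalized (the paper's idf weighting and baseline rescaling are omitted):

```python
import numpy as np

def bertscore(ref_emb, cand_emb):
    """Greedy-matching BERTScore given unit-normalized token embedding
    matrices of shape (num_tokens, dim) for reference and candidate."""
    sim = ref_emb @ cand_emb.T           # pairwise cosine similarities
    recall = sim.max(axis=1).mean()      # best candidate match per reference token
    precision = sim.max(axis=0).mean()   # best reference match per candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```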
Bibliography
- Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. 2002. "Bleu: A Method for Automatic Evaluation of Machine Translation". In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–18.
- Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, et al. 2007. "Moses: Open Source Toolkit for Statistical Machine Translation". In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic, edited by John A. Carroll, Antal van den Bosch, and Annie Zaenen. Association for Computational Linguistics. https://aclanthology.org/P07-2045/.
- Satanjeev Banerjee, Alon Lavie. 2005. "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments". In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, edited by Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare R. Voss, 65–72. Association for Computational Linguistics. https://aclanthology.org/W05-0909/.
- Michael J. Denkowski, Alon Lavie. 2014. "Meteor Universal: Language Specific Translation Evaluation for Any Target Language". In Proceedings of the Ninth Workshop on Statistical Machine Translation, WMT@ACL 2014, June 26-27, 2014, Baltimore, Maryland, USA, 376–80. Association for Computational Linguistics. DOI.
- Chin-Yew Lin. 2004. "ROUGE: A Package for Automatic Evaluation of Summaries". In Text Summarization Branches Out, 74–81. Barcelona, Spain: Association for Computational Linguistics. https://aclanthology.org/W04-1013.
- Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, et al. 2018. "Universal Sentence Encoder". CoRR abs/1803.11175. http://arxiv.org/abs/1803.11175.
- Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi. 2020. "BERTScore: Evaluating Text Generation with BERT". arXiv. DOI.