## N-gram matching

For a reference sequence \(x\) and a candidate sequence \(\hat{x}\), we denote their sequences of $n$-grams by \(S_x^n\) and \(S^n_{\hat{x}}\). The number of candidate $n$-grams that also occur in the reference is: \[ \sum_{w \in S_{\hat{x}}^n} \mathbb{I}[w \in S_{x}^n ] \] with \(\mathbb{I}\) the indicator function.

From this we can construct the exact match precision (Exact-\(P_n\)) and recall (Exact-\(R_n\)): \[ \text{Exact-}P_n = \frac{\sum_{w \in S_{\hat{x}}^n} \mathbb{I}[w \in S_{x}^n ]}{| S_{\hat{x}}^n|} \] and \[ \text{Exact-}R_n = \frac{\sum_{w \in S_{x}^n} \mathbb{I}[w \in S_{\hat{x}}^n ]}{| S_{x}^n|} \]
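
As a minimal sketch of these two definitions, the snippet below computes Exact-\(P_n\) and Exact-\(R_n\) for whitespace-tokenized sentences. The function names and the example sentences are illustrative assumptions, and returning 0 for a sequence with no $n$-grams is a convention chosen here, not part of the definition.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exact_precision_recall(reference, candidate, n):
    """Compute (Exact-P_n, Exact-R_n) for two token lists."""
    ref_ngrams = ngrams(reference, n)
    cand_ngrams = ngrams(candidate, n)
    ref_set, cand_set = set(ref_ngrams), set(cand_ngrams)

    # Indicator sums: every occurrence counts, with no clipping.
    matched_cand = sum(1 for w in cand_ngrams if w in ref_set)
    matched_ref = sum(1 for w in ref_ngrams if w in cand_set)

    # Assumption: an empty n-gram sequence yields a score of 0.
    precision = matched_cand / len(cand_ngrams) if cand_ngrams else 0.0
    recall = matched_ref / len(ref_ngrams) if ref_ngrams else 0.0
    return precision, recall


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
p1, r1 = exact_precision_recall(reference, candidate, 1)  # 5/6 and 5/6
p2, r2 = exact_precision_recall(reference, candidate, 2)  # 3/5 and 3/5
```

With these sentences, five of the six unigrams and three of the five bigrams match in both directions.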

Several well-known metrics are based on the number of $n$-gram matches.

### BLEU

The **BLEU** metric (Papineni et al. 2002) is very widely used in NLP, particularly in machine translation. It is based on Exact-\(P_n\) with some key modifications:

- $n$-grams in the reference can be matched only once.
- The number of exact matches is accumulated for all reference-candidate pairs in the corpus and divided by the total number of $n$-grams in all candidate sentences.
- A brevity penalty discourages very short candidates.

The score is computed with various values of \(n\) and geometrically averaged.
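
The three modifications and the geometric average can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes a single reference per candidate, uniform weights over \(n = 1, \dots, 4\), pre-tokenized input, and no smoothing (the score is 0 as soon as one $n$-gram order has no match). The name `corpus_bleu_sketch` is hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu_sketch(references, candidates, max_n=4):
    """Simplified corpus-level BLEU with one reference per candidate."""
    log_precisions = []
    for n in range(1, max_n + 1):
        matched, total = 0, 0
        for ref, cand in zip(references, candidates):
            ref_counts = ngram_counts(ref, n)
            cand_counts = ngram_counts(cand, n)
            # Clipping: a reference n-gram can be matched only once.
            matched += sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
            # Counts are accumulated over the whole corpus.
            total += sum(cand_counts.values())
        if matched == 0:
            return 0.0  # assumption: no smoothing
        log_precisions.append(math.log(matched / total))

    # Brevity penalty on total corpus lengths.
    ref_len = sum(len(r) for r in references)
    cand_len = sum(len(c) for c in candidates)
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)

    # Geometric mean of the n-gram precisions.
    return bp * math.exp(sum(log_precisions) / max_n)


refs = ["the cat sat on the mat".split()]
identical = corpus_bleu_sketch(refs, refs)
clipped = corpus_bleu_sketch(refs, ["the the the the the the".split()], max_n=1)
```

In the second call, the candidate repeats "the" six times but the reference contains it only twice, so clipping limits the unigram precision to 2/6.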

A smoothed extension of the **BLEU** metric, **SENT-BLEU** (Koehn et al. 2007), is computed at the sentence level.

### METEOR

**METEOR** (Banerjee, Lavie 2005) computes Exact-\(P_1\) and Exact-\(R_1\) with the possibility to match word stems and synonyms.

The extension **METEOR-1.5** (Denkowski, Lavie 2014) weighs content and function words differently, and also applies importance weighting to different matching types.

More recently, **METEOR++ 2.0** adds a learned paraphrase resource to the algorithm. Because of these external resources, the full feature set of **METEOR** is available for only a few languages.

### ROUGE

**ROUGE** (Lin 2004) is a widely used metric for summarization evaluation.

**ROUGE-\(n\)** computes Exact-\(R_n\) (usually \(n = 1, 2\)), while **ROUGE-\(L\)** is a variant of Exact-\(R_1\) with the numerator replaced by the length of the longest common subsequence.
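
The **ROUGE-\(L\)** variant described above can be sketched with a standard dynamic program for the longest common subsequence (LCS). This is only the recall-style form from the text, on pre-tokenized input; the function names are illustrative, and returning 0 for an empty reference is an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference, candidate):
    """LCS length divided by the number of reference tokens."""
    if not reference:
        return 0.0  # assumption: empty reference scores 0
    return lcs_length(reference, candidate) / len(reference)


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
score = rouge_l_recall(reference, candidate)  # LCS "the cat on the mat" -> 5/6
```

Unlike $n$-gram matching, the LCS does not require matched tokens to be contiguous, only to appear in the same order.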

## Edit-distance-based metrics

## Embedding-based metrics

### Universal sentence encoder

The Universal sentence encoder (USE) (Cer et al. 2018) embeds a sentence into a single vector. The distance between two sentence embeddings can then be used as a proxy for semantic similarity. The embeddings are usually normalized, in which case the distance used is the cosine distance.
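
As a small illustration of the comparison step only, the snippet below computes cosine similarity and cosine distance between two vectors. The three-dimensional vectors are made-up placeholders standing in for sentence embeddings; no encoder is loaded here, and the function name is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Placeholder vectors (hypothetical values, not real USE output).
emb_reference = [0.2, 0.7, 0.1]
emb_candidate = [0.25, 0.6, 0.2]

similarity = cosine_similarity(emb_reference, emb_candidate)
distance = 1.0 - similarity  # cosine distance
```

Because the similarity divides by both norms, it is unaffected by the length of the vectors, which is why normalized embeddings and cosine distance go together.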

## Learned metrics

## BERTScore

The goal of this metric is to compute and aggregate the pairwise similarities between the BERT embeddings of the words in the two sentences (Zhang et al. 2020).
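
One way to aggregate the pairwise similarities is greedy matching: each token is paired with its most similar token in the other sentence, and the maxima are averaged. The sketch below assumes this aggregation and uses cosine similarity; the text above does not spell out either choice. The token vectors are toy placeholders rather than real BERT embeddings, and all names are illustrative.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def greedy_match_scores(ref_vectors, cand_vectors):
    """Return (precision, recall, F1) from pairwise cosine similarities."""
    ref = [normalize(v) for v in ref_vectors]
    cand = [normalize(v) for v in cand_vectors]
    # sim[i][j]: cosine similarity of reference token i and candidate token j.
    sim = [[sum(a * b for a, b in zip(r, c)) for c in cand] for r in ref]

    # Recall: best match in the candidate for each reference token.
    recall = sum(max(row) for row in sim) / len(ref)
    # Precision: best match in the reference for each candidate token.
    precision = sum(
        max(sim[i][j] for i in range(len(ref))) for j in range(len(cand))
    ) / len(cand)

    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy token vectors (hypothetical values, not real BERT output).
ref_vectors = [[1.0, 0.0], [0.0, 1.0]]
cand_vectors = [[1.0, 0.0]]
precision, recall, f1 = greedy_match_scores(ref_vectors, cand_vectors)
```

Here the single candidate token matches the first reference token exactly, so precision is 1, while the second reference token has no similar counterpart, which halves the recall.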

