Evaluating NLP

Natural language processing

Language model evaluation


For a given word sequence \(\mathbf{w} = (w_1, …, w_n)\), perplexity (PPL) is defined as \[ PPL = 2^{-\frac{1}{n} \sum_{i=1}^n \log_2 P(w_i | w_{i-1} … w_1 )} \] It is the exponentiation of the cross-entropy between the empirical distribution of the test words and the model's predicted conditional word distribution. A language model that encodes each word with an average of 8 bits has a perplexity of \(2^8 = 256\).
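As a minimal sketch of the definition above (the probabilities are made up for illustration), perplexity can be computed directly from the per-word conditional probabilities:

```python
import math

def perplexity(probs):
    """Perplexity from per-word conditional probabilities P(w_i | w_1 ... w_{i-1})."""
    # Average number of bits the model spends per word.
    bits_per_word = -sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** bits_per_word

# A model that assigns 1/256 to every word spends 8 bits per word,
# so its perplexity is 256 (illustrative, uniform probabilities):
print(perplexity([1 / 256] * 4))  # → 256.0
```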

It is often used to evaluate language models, probably for two main reasons:

  • Because it yields a number that is easier to remember and reason about than the average bits-per-word value (155 vs. 128 is easier to compare than 7.27 bits vs. 7 bits).
  • Because a language model with the lowest possible perplexity would be the closest to the “true” model that generated the data.

Because of its connection with entropy, there is also a clear connection between perplexity and Compression: finding the best language model for a given corpus is equivalent to finding the best compressor for that data (Mahoney 1999).
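The equivalence can be sketched numerically: a model's total log-loss in bits is (up to rounding) the code length an ideal arithmetic coder driven by that model would need for the same text. The per-word probabilities below are assumptions for illustration.

```python
import math

# Assumed probabilities the model assigns to each of the 4 words it reads.
probs = [0.5, 0.25, 0.125, 0.125]

# Ideal code length for the text: -log2(p) bits per word, summed.
total_bits = -sum(math.log2(p) for p in probs)  # 1 + 2 + 3 + 3 = 9 bits
avg_bits = total_bits / len(probs)              # 2.25 bits per word
ppl = 2 ** avg_bits                             # the same model's perplexity, ≈ 4.76
```

A better model shortens the ideal encoding and lowers the perplexity at the same time; the two rankings are identical.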

Perplexity is the dominant metric for Language model evaluation. However, it has a few drawbacks:

  • Perplexity is computed assuming a perfect history, which does not hold when the model conditions on its own previously generated words to produce the next one.
  • Perplexity improvements can be misleading, since a fixed perplexity improvement is exponentially harder to obtain the closer the perplexity gets to its minimum. It is therefore advantageous to report a sub-optimal baseline to maximize the apparent perplexity improvement, even when the actual entropy improvement is small.
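The second point can be made concrete with assumed numbers: the same 27-point perplexity drop implies very different entropy improvements depending on the baseline it is measured against.

```python
import math

def bits_saved(ppl_before, ppl_after):
    # Entropy improvement (bits per word) implied by a perplexity drop.
    return math.log2(ppl_before) - math.log2(ppl_after)

# Against a weak baseline, a 27-point drop is a tiny entropy gain...
weak = bits_saved(1000, 973)   # ≈ 0.04 bits per word
# ...while the same absolute drop from a strong baseline is much larger.
strong = bits_saved(155, 128)  # ≈ 0.28 bits per word
```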

Winograd Schema Challenge

This challenge (Levesque, Davis, and Morgenstern 2012), part of the GLUE benchmark, is specifically about sentences that are hard for computers to deal with: they contain implicit and ambiguous references that can only be resolved with contextual knowledge about the things being discussed. To me this is a very hard challenge that we are nowhere near solving. GPT-2’s 70% accuracy on this task is impressive, but it still doesn’t convince me that we are on the right track to solving it.

This seems like a good way of testing an artificially intelligent system.


I poured water from the bottle into the cup until it was full.

It is not sufficient to learn this sentence’s structure, because changing a single word changes the meaning significantly.

I poured water from the bottle into the cup until it was empty.

If a machine solves this ambiguous reference, can we say that it has learned some meaningful concept about a bottle or a cup?

When AI can’t determine what ‘it’ refers to in a sentence, it’s hard to believe that it will take over the world.

Oren Etzioni, Allen Institute for AI


Levesque, Hector, Ernest Davis, and Leora Morgenstern. 2012. “The Winograd Schema Challenge.” In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning (KR 2012).

Mahoney, Matthew V. 1999. “Text Compression as a Test for Artificial Intelligence.” In Proceedings of AAAI-1999.
