Evaluating NLP

Natural language processing

Language model evaluation


For a given word sequence \(\mathbf{w} = (w_1, …, w_n)\), perplexity (PPL) is defined as \[ PPL = 2^{-\frac{1}{n} \sum_{i=1}^n \log_2 P(w_i | w_{i-1} … w_1 )} \] The exponent is the cross-entropy between the empirical distribution of the test words and the conditional word distribution predicted by the model. A language model that encodes each word with an average of 8 bits has a perplexity of \(2^8 = 256\).
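As a sketch of this definition, here is a minimal Python function that computes perplexity from the model's conditional probabilities for each word in the sequence (the function name and inputs are illustrative, not from any particular library):

```python
import math

def perplexity(probs):
    """Perplexity of a sequence, given the model's conditional
    probabilities P(w_i | w_1 ... w_{i-1}) for each word."""
    n = len(probs)
    # Average negative log2-probability = cross-entropy in bits per word.
    cross_entropy = -sum(math.log2(p) for p in probs) / n
    return 2 ** cross_entropy

# A model that assigns every word probability 1/256 (8 bits per word)
# has perplexity 256, matching the example above.
print(perplexity([1 / 256] * 10))  # → 256.0
```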

It is often used to evaluate language models, probably for two main reasons:

  • Because it yields a number that is easier to remember and reason about than the average bits-per-word value (it is easier to compare 155 and 128 than 7.27 bits and 7 bits).
  • Because the language model with the lowest possible perplexity would be the closest to the “true” model that generated the data.

Because of its connection to entropy, perplexity is also closely related to Compression: finding the best language model for a given corpus is equivalent to finding the best compressor for that data (Mahoney 1999).
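To make the compression link concrete, here is a back-of-the-envelope calculation with illustrative numbers (the perplexity and corpus size are made up for the example): a model with perplexity \(P\) needs \(\log_2 P\) bits per word, so an ideal entropy coder driven by that model would compress \(n\) words into roughly \(n \log_2 P / 8\) bytes.

```python
import math

# Illustrative numbers, not from the cited paper:
ppl = 128        # model perplexity
n_words = 10_000 # corpus size in words

bits_per_word = math.log2(ppl)
print(bits_per_word)                # → 7.0 bits per word
print(n_words * bits_per_word / 8)  # → 8750.0 bytes for the corpus
```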

Perplexity is the dominant metric for Language model evaluation. However, it has a few drawbacks:

  • Perplexity is computed assuming a perfect history, which does not hold when previously generated words are fed back to the model to generate the next one.
  • Perplexity improvements can be misleading: a fixed perplexity improvement is exponentially harder to achieve the closer perplexity gets to its minimum. It is therefore advantageous to report a sub-optimal baseline to maximize the apparent perplexity improvement, even though the actual entropy improvement may be small.
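The second drawback can be illustrated numerically (the perplexity values below are made up): a fixed perplexity drop of 10 points corresponds to a very different entropy (bits-per-word) improvement depending on where it happens.

```python
import math

# The same 10-point perplexity drop saves very different
# amounts of entropy depending on the starting point:
for ppl in (200, 100, 50, 20):
    gain = math.log2(ppl) - math.log2(ppl - 10)
    print(f"{ppl} -> {ppl - 10}: {gain:.3f} bits per word saved")
```

Going from 200 to 190 saves under a tenth of a bit per word, while going from 20 to 10 saves a full bit, so the same-looking perplexity gain means very different things.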

Winograd Schema Challenge

This challenge (Levesque et al., n.d.), part of the GLUE benchmark, is specifically about sentences that are hard for computers to deal with: they contain implicit and ambiguous references that can only be resolved with contextual knowledge of the things being talked about. To me this is a very hard challenge that we are nowhere near solving. GPT-2’s 70% accuracy on this task is impressive, but it still doesn’t convince me that we are on the right track to solving it.

This seems like a good way of testing an artificially intelligent system.


I poured water from the bottle into the cup until it was full.

It is not sufficient to learn this sentence structure, because changing a single word changes the meaning significantly.

I poured water from the bottle into the cup until it was empty.

If a machine solves this ambiguous reference, can we say that it has learned some meaningful concept about a bottle or a cup?

When AI can’t determine what ‘it’ refers to in a sentence, it’s hard to believe that it will take over the world.

Oren Etzioni, Allen Institute for AI


  1. Mahoney. 1999. "Text Compression as a Test for Artificial Intelligence". In Proceedings of AAAI-1999, 3.
  2. Levesque et al. n.d. "The Winograd Schema Challenge", 10.

