Language model evaluation
Perplexity
For a given word sequence \(\mathbf{w} = (w_1, \dots, w_n)\), perplexity (PPL) is defined as \[ PPL = 2^{-\frac{1}{n} \sum_{i=1}^n \log_2 P(w_i \mid w_{i-1}, \dots, w_1)} \] The exponent is the cross-entropy between the empirical distribution of the test words and the model's predicted conditional word distribution, so perplexity is the exponentiation of that cross-entropy. A language model that encodes each word with an average of 8 bits has a perplexity of 256 (\(2^8\)).
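A minimal sketch of the computation, assuming some made-up conditional probabilities a model might assign to a four-word test sequence:

```python
import math

# Hypothetical conditional probabilities P(w_i | w_{i-1} ... w_1) that a
# model might assign to each word of a four-word test sequence.
probs = [0.2, 0.1, 0.4, 0.25]

# Average negative log-probability in bits per word (the cross-entropy).
bits_per_word = -sum(math.log2(p) for p in probs) / len(probs)

# Perplexity is 2 raised to that cross-entropy.
ppl = 2 ** bits_per_word
print(f"{bits_per_word:.2f} bits/word -> perplexity {ppl:.2f}")
```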
Perplexity is often used to evaluate language models, probably for two main reasons:
- It yields a number that is easier to remember and reason about than the average bits-per-word value (comparing 155 and 128 is easier than comparing 7.27 bits and 7 bits).
- A language model with the lowest possible perplexity is the one closest to the “true” model that generated the data.
Because of its connection with entropy, there is also a clear connection between perplexity and Compression: finding the best language model for a given task is equivalent to finding the best compressor for the data (Mahoney 1999).
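A back-of-the-envelope sketch of that equivalence, with a hypothetical corpus size and cross-entropy: a model averaging \(H\) bits per word, paired with an arithmetic coder, compresses the text to roughly \(H\) bits per word.

```python
# Back-of-the-envelope: a model with an average cross-entropy of H bits per
# word, paired with an arithmetic coder, compresses text to roughly H bits
# per word. The corpus size and cross-entropy below are made up.
n_words = 1_000_000    # hypothetical corpus size
bits_per_word = 7.27   # hypothetical model cross-entropy (perplexity ~155)

compressed_bytes = n_words * bits_per_word / 8
print(f"~{compressed_bytes / 1e6:.2f} MB to encode {n_words:,} words")
```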
Perplexity is the dominant metric for Language model evaluation. However, it has a few drawbacks:
- Perplexity is computed assuming a perfect history, i.e. the model always conditions on the true previous words. This does not hold at generation time, when the model conditions on its own previously generated words and errors can compound.
- Perplexity improvements can be misleading: since perplexity is exponential in entropy, a fixed perplexity reduction corresponds to an exponentially larger entropy reduction the closer perplexity gets to its minimum of 1. It is therefore advantageous to report a sub-optimal baseline to maximize the apparent perplexity improvement, while the actual entropy improvement might be small, as the sketch below illustrates.
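A minimal sketch with illustrative numbers: the same ten-point perplexity drop implies very different entropy gains depending on the baseline.

```python
import math

def entropy_gain(ppl_before, ppl_after):
    """Entropy reduction in bits/word implied by a perplexity drop."""
    return math.log2(ppl_before) - math.log2(ppl_after)

# The same 10-point perplexity drop, against two different baselines
# (illustrative numbers): the weak baseline makes the gain look large
# while the underlying entropy improvement stays small.
print(f"{entropy_gain(160, 150):.3f} bits/word")  # ~0.093: weak baseline
print(f"{entropy_gain(40, 30):.3f} bits/word")    # ~0.415: strong baseline
```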
Winograd Schema Challenge nlp
This challenge (Levesque, Davis, and Morgenstern, n.d.), part of the GLUE benchmark, is specifically about sentences that are hard for computers to deal with: they contain implicit and ambiguous references that can only be resolved with contextual knowledge of the things being talked about. To me this is a very hard challenge that we are nowhere near solving. GPT-2’s 70% accuracy on this task is impressive but still doesn’t convince me that we are on the right track to solving it.
This seems like a good way of testing an artificially intelligent system.
Example:
I poured water from the bottle into the cup until it was full.
It is not sufficient to learn this sentence’s structure, because changing a single word changes the referent: in the sentence above, “it” refers to the cup, while in the one below it refers to the bottle.
I poured water from the bottle into the cup until it was empty.
If a machine resolves this ambiguous reference, can we say that it has learned some meaningful concept about a bottle or a cup?
When AI can’t determine what ‘it’ refers to in a sentence, it’s hard to believe that it will take over the world.
Oren Etzioni, Allen Institute for AI
Bibliography
- Matthew V. Mahoney. 1999. “Text Compression as a Test for Artificial Intelligence”. In Proceedings of AAAI-1999, 3.
- Hector Levesque, Ernest Davis, and Leora Morgenstern. n.d. “The Winograd Schema Challenge”, 10.