Regular RNNs process their input in sequence. When applied to a language modeling task, the goal is to predict a word given the previous ones. For example, with the sentence "The quick brown fox jumps over the lazy", a classical RNN will initialize an internal state \(s_0\) and process each word in sequence, starting from "The" and updating its internal state with each new word in order to make a final prediction.
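As a minimal sketch of this forward loop (assuming a PyTorch-style setup; the names `embedding`, `rnn_cell`, and `to_logits` are illustrative, not from the original post):

```python
import torch
import torch.nn as nn

vocab = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

embed_dim, hidden_dim = 16, 32
embedding = nn.Embedding(len(vocab), embed_dim)
rnn_cell = nn.RNNCell(embed_dim, hidden_dim)
to_logits = nn.Linear(hidden_dim, len(vocab))

# Process "The quick brown fox jumps over the lazy" left to right,
# then predict the next word from the final state.
context = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy"]
state = torch.zeros(1, hidden_dim)            # s_0, the initial internal state
for word in context:
    x = embedding(torch.tensor([word_to_idx[word]]))
    state = rnn_cell(x, state)                # update the state with each word
logits = to_logits(state)                     # scores over the vocabulary for the next word
```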
A backward RNN reverses this idea, building its internal state starting from the last word, "lazy" (the one we can expect to carry the most information for the prediction), and processing the sentence backward. It should be simpler for such a network to learn to select meaningful information than it would be when starting from the first word, which might not be useful at all.
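Continuing the hypothetical setup above, the backward variant uses the same cell but consumes the context in reverse, so the state is built from "lazy" back to "The" before the prediction is made:

```python
# Backward pass over the same context, nearest-to-target word first.
state = torch.zeros(1, hidden_dim)
for word in reversed(context):                # "lazy", "the", "over", ..., "The"
    x = embedding(torch.tensor([word_to_idx[word]]))
    state = rnn_cell(x, state)
logits = to_logits(state)                     # still predicts the word following "lazy"
```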
A variant of this idea would be to use the same principle and apply it in a transformer-like language model training setting. To predict "fox" in "The quick brown <?> jumps over the lazy dog", the network would start by looking at "brown" and "jumps", then at "quick" and "over", and so on.
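To make the reading order concrete, here is an illustrative helper (not from the original post) that lists the context words from nearest to farthest around the masked position:

```python
def inward_order(tokens, masked_pos):
    """Return context words ordered by their distance to the masked position."""
    order = []
    for dist in range(1, len(tokens)):
        for pos in (masked_pos - dist, masked_pos + dist):
            if 0 <= pos < len(tokens):
                order.append(tokens[pos])
    return order

sentence = ["The", "quick", "brown", "<?>", "jumps", "over", "the", "lazy", "dog"]
print(inward_order(sentence, masked_pos=3))
# ['brown', 'jumps', 'quick', 'over', 'The', 'the', 'lazy', 'dog']
```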