Regular RNNs process their input in sequence. In a language modeling task, the goal is to predict a word given the previous ones. For example, given the sentence "The quick brown fox jumps over the lazy", a classical RNN initializes an internal state \(s_0\) and processes the words in order, starting from "The" and updating its internal state with each new word, in order to make a final prediction.
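The left-to-right update described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the names `make_rnn`, `step`, and `run_forward` are hypothetical, the weights are random and untrained, and the update rule \(s_t = \tanh(W s_{t-1} + x_t)\) is a basic Elman-style recurrence chosen for brevity, not necessarily the one the text has in mind.

```python
import math
import random

def make_rnn(vocab, hidden_size=4, seed=0):
    """Build toy, randomly initialized (untrained) RNN parameters."""
    rng = random.Random(seed)
    # One input vector per word (hypothetical embedding table).
    embed = {w: [rng.uniform(-1, 1) for _ in range(hidden_size)]
             for w in sorted(vocab)}
    # Recurrent weight matrix of shape hidden_size x hidden_size.
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    return embed, w_rec

def step(state, word, embed, w_rec):
    """One state update: s_t = tanh(W s_{t-1} + x_t)."""
    x = embed[word]
    return [math.tanh(sum(w * s for w, s in zip(row, state)) + x_i)
            for row, x_i in zip(w_rec, x)]

def run_forward(words, embed, w_rec):
    """Process words left to right; return the visiting order and final state."""
    state = [0.0] * len(w_rec)  # s_0
    order = []
    for word in words:
        state = step(state, word, embed, w_rec)
        order.append(word)
    return order, state

words = "The quick brown fox jumps over the lazy".split()
embed, w_rec = make_rnn(set(words))
order, final_state = run_forward(words, embed, w_rec)
print(order[0], "...", order[-1])
```

In a real model, `final_state` would be fed to an output layer that scores candidate next words; that part is left out here.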
A backward RNN reverses this idea, building its internal state starting from the last word, "lazy" (the one we can expect to carry the most information for the prediction), and processing the sentence backward. It should be simpler for such a network to learn to select meaningful information than for one that starts from the first word, which might not be useful at all.
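To make the difference in processing order concrete, here is a small self-contained sketch that runs the same toy recurrence in both directions. Everything in it is an assumption made for illustration: `run_rnn` is a hypothetical helper, the state is a single scalar for brevity, and the embeddings and the recurrent weight are arbitrary, untrained values.

```python
import math
import random

def run_rnn(words, embed, w_rec, backward=False):
    """Run a toy scalar-state RNN; return the visiting order and every state."""
    sequence = list(reversed(words)) if backward else list(words)
    state = 0.0  # s_0
    states = []
    for word in sequence:
        # s_t = tanh(w_rec * s_{t-1} + x_t), with a scalar state for brevity.
        state = math.tanh(w_rec * state + embed[word])
        states.append(state)
    return sequence, states

words = "The quick brown fox jumps over the lazy".split()
rng = random.Random(0)
# Hypothetical, untrained scalar embedding for each distinct word.
embed = {w: rng.uniform(-1, 1) for w in sorted(set(words))}
w_rec = 0.5  # hypothetical recurrent weight

fwd_order, fwd_states = run_rnn(words, embed, w_rec)
bwd_order, bwd_states = run_rnn(words, embed, w_rec, backward=True)
print(fwd_order[0], bwd_order[0])  # The lazy
```

In the backward run the very first state depends only on "lazy", so the word closest to the prediction target shapes the state from the start instead of arriving last.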
A variant of this idea would be to use the same principle and apply it forward as well, in a transformer-like language model training situation. To predict the missing word in "The quick brown <?> jumps over the lazy dog", the network would start by looking at "jumps", and then