Regular RNNs process input in sequence. When applied to a language modeling task, one tries to predict a word given the previous ones. For example, with the sentence "The quick brown fox jumps over the lazy", a classical RNN will initialize an internal state \(s_0\) and process each word in sequence, starting from "The" and updating its internal state with each new word in order to make a final prediction.
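As a minimal sketch of this forward pass, assuming a PyTorch-style setup with hypothetical vocabulary, embedding, and hidden sizes:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
vocab_size, embed_dim, hidden_dim = 10_000, 64, 128

embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, vocab_size)

# "The quick brown fox jumps over the lazy" as made-up token ids.
tokens = torch.tensor([[1, 2, 3, 4, 5, 6, 1, 7]])

s0 = torch.zeros(1, 1, hidden_dim)   # initial internal state s_0
x = embed(tokens)                    # (batch, seq, embed_dim)
out, s = rnn(x, s0)                  # state updated word by word, left to right
logits = head(out[:, -1])            # scores for the next word ("dog")
```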
A backward RNN reverses this idea, building its internal state starting from the last word, "lazy" (the one we can expect to carry the most information for the prediction), and processing the sentence backward. It should be simpler for such a network to learn to properly select meaningful information than to start from the first word, which might not be useful at all.
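Reusing the modules from the sketch above, the backward variant only changes the order in which the words are fed to the network:

```python
# Reuse embed, rnn, head, hidden_dim and tokens from the previous sketch.
rev_tokens = torch.flip(tokens, dims=[1])   # start from "lazy", end at "The"

s0 = torch.zeros(1, 1, hidden_dim)
out, s = rnn(embed(rev_tokens), s0)         # state built from the last word backward
logits = head(out[:, -1])                   # final prediction once "The" is reached
```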
A variant of this idea would be to take the same principle and apply it in a transformer-like language model training setting. To predict "fox" in "The quick brown <?> jumps over the lazy dog", the network would start by looking at "brown" and "jumps", then at "quick" and "over", and so on.
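A small sketch of that nearest-to-farthest ordering; the nearest_first_order helper below is hypothetical and only illustrates the order in which the context tokens would be consumed, not an actual model:

```python
def nearest_first_order(tokens, masked_idx):
    """Yield context tokens from nearest to farthest around the masked position."""
    left, right = masked_idx - 1, masked_idx + 1
    while left >= 0 or right < len(tokens):
        if left >= 0:
            yield tokens[left]    # next word to the left
        if right < len(tokens):
            yield tokens[right]   # next word to the right
        left -= 1
        right += 1

sentence = "The quick brown <?> jumps over the lazy dog".split()
print(list(nearest_first_order(sentence, sentence.index("<?>"))))
# ['brown', 'jumps', 'quick', 'over', 'The', 'the', 'lazy', 'dog']
```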