Notes on: The geometry of integration in text classification RNNs by Aitken, K., Ramasesh, V. V., Garg, A., Cao, Y., Sussillo, D., & Maheswaranathan, N. (2020)

tags: RNN, NLP
source: (Aitken et al. 2020)

Summary

This paper takes a dynamical-systems approach to studying learning in RNNs. Gradient-descent optimization allows RNNs to learn a simplified form of memory and information processing.

The authors use simple text classification tasks to test whether these learned properties can be understood by looking at the state dynamics of RNNs.

The RNNs usually behave like attractor networks, with the hidden state lying on a low-dimensional manifold.

Synthetic data

A first task is to classify sentences based on the number of evidence words corresponding to each target class. A simple solution to this problem is a counter that returns the class with the most evidence words.
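The counter baseline can be sketched in a few lines. This is a minimal illustration, not code from the paper: the vocabulary `EVIDENCE`, the word names, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical vocabulary: each evidence word maps to one class.
# Every other word is treated as neutral. Names are illustrative only.
EVIDENCE = {"good0": 0, "good1": 1, "good2": 2}

def count_classify(words, evidence=EVIDENCE):
    """Return the class with the most evidence words, or None if there are none."""
    counts = Counter(evidence[w] for w in words if w in evidence)
    if not counts:
        return None
    best = max(counts.values())
    # Ties go to the smallest class index (an arbitrary choice for this sketch).
    return min(c for c, n in counts.items() if n == best)
```

For example, `count_classify(["good1", "the", "good1", "good0"])` picks class 1, since it has two evidence words against one for class 0.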

With 3 classes, the learned network behaves like an integrator whose hidden state stays mostly within a 2D equilateral triangle. Each evidence word moves the hidden state towards one corner of this triangle, while neutral words leave it unchanged.
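To make the triangle picture concrete, here is a hand-built 2D integrator. It is a sketch under strong assumptions, not the trained RNN: the corner directions, the unit step size, and the linear readout are chosen by hand, and the state is unbounded, whereas the learned hidden state stays within the triangle.

```python
import numpy as np

# Unit vectors pointing at the corners of an equilateral triangle
# centred on the origin (assumed geometry for this sketch).
ANGLES = np.deg2rad([90.0, 210.0, 330.0])
CORNERS = np.stack([np.cos(ANGLES), np.sin(ANGLES)], axis=1)  # shape (3, 2)

def integrate(tokens, step=1.0):
    """Accumulate evidence; tokens are class ids (0, 1, 2) or None for neutral words."""
    h = np.zeros(2)
    for t in tokens:
        if t is not None:
            h = h + step * CORNERS[t]  # move towards that class's corner
        # neutral words leave h unchanged
    return h

def readout(h):
    """Predict the class whose corner direction is best aligned with h."""
    return int(np.argmax(CORNERS @ h))
```

Because the three corner directions sum to zero, one evidence word of each class cancels out, so the state only tracks relative counts. That is all a majority decision needs.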

For a varying number of classes \(N\), the authors show that classification mostly uses \(N-1\) dimensions, which explain 95% of the variance of the hidden state.
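A dimensionality claim of this kind can be checked by running PCA on recorded hidden states. The sketch below assumes the states are stored as a samples-by-dimensions array; the function names and the 0.95 default are illustrative choices, not taken from the paper.

```python
import numpy as np

def explained_variance_ratio(states):
    """Fraction of total variance along each principal component.

    `states` is an array of shape (num_samples, hidden_dim).
    """
    centered = states - states.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def dims_needed(states, threshold=0.95):
    """Smallest number of principal components whose cumulative variance reaches `threshold`."""
    cumulative = np.cumsum(explained_variance_ratio(states))
    return int(np.searchsorted(cumulative, threshold) + 1)
```

As a sanity check, three points at the corners of a triangle, such as the rows of `np.eye(3)`, need 2 components, which matches \(N-1\) for \(N=3\).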

Natural data

Interestingly, the learned attractors are broadly similar on natural classification data. An RNN learns, for each word, a direction that moves the hidden state towards the corresponding class.

Ordered classification

With the more involved task of ordered classification (star review prediction), RNNs still learn low-dimensional attractors. The integration now appears to be twofold: sentiment and intensity both contribute to the final score.

Multi-label classification

With multi-label classification, an RNN keeps track of all class combinations as if they were different classes.
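To see what treating combinations as classes implies, the sketch below enumerates the combinations of binary labels and maps each one to a single class id. The binary encoding is an assumption made for illustration; the notes do not say how the network orders or represents the combinations.

```python
from itertools import product

def label_combinations(num_labels):
    """All on/off combinations of `num_labels` binary labels (2**num_labels of them)."""
    return list(product([0, 1], repeat=num_labels))

def combination_index(labels):
    """Map a tuple of binary labels to one class id (first label is the most significant bit)."""
    index = 0
    for bit in labels:
        index = index * 2 + int(bit)
    return index
```

With 3 labels this gives 8 distinct combination classes, so the number of effective classes doubles with every added label.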

Comments

I’m particularly interested in this kind of work, which tries to understand how these neural networks operate. Gradient descent seems pretty good at finding shortcuts in data. This makes it particularly efficient for relatively simple tasks like sentence classification, or for passable language modeling, but it seems to fail to construct more complex primitives or attractors.

Neuroscience seems to have shown that at least some brain functions rely on attractor dynamics, as these RNNs do, but those dynamics likely weren’t found through the same kind of optimization.

It is interesting to think about this in connection with (Katharopoulos et al. 2020), which shows that transformers with linear attention can be written as RNNs. This would mean that powerful transformers may also act like some kind of fancy integrator in a large space. That seems like it would limit their capabilities, since our brain doesn’t look like it’s only doing integration.

Bibliography

Kyle Aitken, Vinay V. Ramasesh, Ankush Garg, Yuan Cao, David Sussillo, Niru Maheswaranathan. October 28, 2020. "The Geometry of Integration in Text Classification RNNs". arXiv:2010.15114 [cs, stat]. http://arxiv.org/abs/2010.15114.
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret. June 29, 2020. "Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention". arXiv:2006.16236 [cs, stat]. http://arxiv.org/abs/2006.16236.