Continual learning

tags: Machine learning

Continual learning is a type of supervised learning in which there is no separate “testing phase” for the decision process. Instead, training samples keep arriving, and the algorithm has to make predictions and keep learning at the same time.

This is challenging for a fixed neural network architecture: because its capacity is fixed, it is bound to either forget what it learned earlier or become unable to learn anything new.

A definition from the survey (De Lange et al. 2020):

The General Continual Learning setting considers an infinite stream of training data where at each time step, the system receives a (number of) new sample(s) drawn non i.i.d from a current distribution that could itself experience sudden or gradual changes.
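The setting described above can be made concrete with a small predict-then-learn loop. This is only a sketch under stated assumptions: the stream generator `make_stream`, the `OnlinePerceptron` model and the helper `prequential_accuracy` are hypothetical names introduced here for illustration, and the sudden label flip at `drift_at` is one arbitrary way to simulate a change of distribution. None of it is taken from the survey.

```python
import random


def make_stream(n_steps, drift_at, seed=0):
    """Yield (x, y) pairs one at a time; the labelling rule flips at `drift_at`."""
    rng = random.Random(seed)
    for t in range(n_steps):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        # Assumption: a single sudden change of the underlying distribution.
        w = (1.0, 1.0) if t < drift_at else (-1.0, -1.0)
        y = 1 if w[0] * x[0] + w[1] * x[1] > 0 else 0
        yield x, y


class OnlinePerceptron:
    """A fixed-capacity linear model updated one sample at a time."""

    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        s = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if s > 0 else 0

    def update(self, x, y):
        err = y - self.predict(x)  # -1, 0 or +1
        if err != 0:
            self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * err


def prequential_accuracy(model, stream):
    """Predict each sample first, then learn from it; there is no test phase."""
    correct = 0
    total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)
        model.update(x, y)
        total += 1
    return correct / total


if __name__ == "__main__":
    acc = prequential_accuracy(OnlinePerceptron(2), make_stream(2000, 1000))
    print(f"accuracy over the whole stream: {acc:.3f}")
```

Every sample is used once for evaluation and once for training, so the reported accuracy includes the period right after the change, during which the model still predicts according to the old labelling rule.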

Theoretical foundations

Concept shift

(Bartlett et al. 1996) explores how to learn under the assumption of concept shift:

The learner sees a sequence of random examples, labelled according to a sequence of functions, and must provide an accurate estimate of the target function sequence.

Formally, at time \(t\) the learner sees a random example \(x_t\) from some domain \(X\). The learner also sees the value \(f_t(x_t) \in \{0, 1\}\), where \(f_t\) is an unknown function from some known class \(F\).

The paper addresses two problems of learning with changing concepts:

Estimation: When can we estimate a sequence \((f_1, \cdots, f_n)\) from observations \(((x_1, f_1(x_1)), \cdots, (x_n, f_n(x_n)))\)?
Prediction: When can one predict the next concept \(f_{n+1}\) from a sequence of concepts \((f_1, \cdots, f_n)\)?
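To get a feel for the estimation problem, the setup can be simulated with a very simple concept class. The sketch below assumes that \(F\) is the class of threshold functions on \([0, 1]\) and that the threshold drifts by a small constant amount at each step; both choices, as well as the sliding-window estimator and all function names, are illustrative assumptions and not the algorithm analysed in the paper.

```python
import random


def threshold(theta):
    """A concept f in F: label 1 when x is at or above the threshold."""
    return lambda x: 1 if x >= theta else 0


def drifting_concepts(n, theta0=0.3, step=0.001):
    """Thresholds of f_1..f_n; the concept changes slowly over time."""
    return [theta0 + step * t for t in range(n)]


def observe(thetas, seed=0):
    """Return the observations (x_t, f_t(x_t)) for random x_t in [0, 1)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in thetas]
    return [(x, threshold(theta)(x)) for x, theta in zip(xs, thetas)]


def window_estimate(observations, window):
    """Estimate the current threshold from the last `window` observations."""
    recent = observations[-window:]
    negatives = [x for x, y in recent if y == 0]
    positives = [x for x, y in recent if y == 1]
    lo = max(negatives) if negatives else 0.0
    hi = min(positives) if positives else 1.0
    return (lo + hi) / 2


if __name__ == "__main__":
    thetas = drifting_concepts(500)
    obs = observe(thetas)
    print("true threshold:", thetas[-1])
    print("estimate:", window_estimate(obs, window=50))
```

The window size captures the trade-off in this toy version: a short window follows the drift closely but sees few examples, while a long window sees many examples that were labelled by outdated concepts.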

Formal definitions of different aspects of continual learning

Learning to learn

The paper (Baxter 1998) defines the problem of learning to learn as follows (notations are chosen to contrast with regular supervised learning):

an input space \(X\) and an output space \(Y\),
a loss function \(l: Y \times Y \rightarrow \mathbb{R}\),
an environment \((P, Q)\) where \(P\) is the set of all probability distributions on \(X \times Y\) and \(Q\) is a distribution on \(P\),
a hypothesis space family \(H = \{\mathcal{H}\}\) where each \(\mathcal{H} \in H\) is a set of functions \(h: X \rightarrow Y\).
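Given these ingredients, the quantity a learner tries to make small can be written down explicitly. The following is a sketch of the objective as it is usually stated for this model; the symbols \(er_D\), \(er_Q\), \(n\) and \(m\), and the use of \(D\) for a single distribution in \(P\), are notation assumed here rather than taken from the list above.

\[
er_D(h) = \int_{X \times Y} l(h(x), y) \, dD(x, y), \qquad er_Q(\mathcal{H}) = \int_{P} \inf_{h \in \mathcal{H}} er_D(h) \, dQ(D)
\]

In regular supervised learning, the learner picks \(h\) from a single fixed hypothesis space so as to minimise \(er_D(h)\) for one distribution \(D\). Here the learner instead picks a hypothesis space \(\mathcal{H} \in H\) that makes \(er_Q(\mathcal{H})\) small, using samples from \(n\) tasks drawn according to \(Q\) with \(m\) examples per task.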

(Pentina, Ben-David 2015)

(Pentina, Urner 2016)

(Pentina, Lampert 2015)

(Balcan et al. 2020)

(Kifer et al. 2004)

(Ben-David, Borbely 2008)

(Ben-David, Schuller 2003)

(Geisa et al. 2022)

Examples of continual learning systems

Never Ending Language Learner (NELL) (Carlson et al. 2010)

Benchmarks

Computer vision based benchmarks

Split MNIST: the MNIST dataset is split into five 2-class tasks (Nguyen et al. 2017; Zenke et al. 2017; Shin et al. 2017).
Split CIFAR10: the CIFAR10 dataset is split into five 2-class tasks (Krizhevsky, Hinton 2009).
Split mini-ImageNet: a mini ImageNet (100 classes) dataset split into twenty 5-class tasks.
Continual Transfer Learning Benchmark: a benchmark from Facebook AI, built from 7 computer vision datasets: MNIST, CIFAR10, CIFAR100, DTD, SVHN, Rainbow-MNIST, Fashion MNIST. The tasks are all 5-class or 10-class classification tasks. Some example task sequence constructions from (Veniat et al. 2021):

The last task of \(S_{out}\) consists of a shuffling of the output labels of the first task. The last task of \(S_{in}\) is the same as its first task except that MNIST images have a different background color. \(S_{long}\) has 100 tasks, and it is constructed by first sampling a dataset, then 5 classes at random, and finally the amount of training data from a distribution that favors small tasks by the end of the learning experience.
Permuted MNIST: for each task, the pixels of the MNIST digits are shuffled with a fixed permutation, generating a new task that is as difficult as the original one but has a different solution. This benchmark is not suitable if the model has a spatial prior (like a CNN). First used in (Goodfellow et al. 2014; Srivastava et al. 2013), and also in (Kirkpatrick et al. 2017).
Rotated MNIST: each task contains digits rotated by a fixed angle between 0 and 180 degrees.
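The split and permuted constructions listed above are simple enough to sketch without any dataset library. The code below is a minimal illustration that works on plain Python lists; the function names `split_classes`, `make_permutations` and `permute_image` are hypothetical, and the choices of grouping consecutive class labels and of keeping the first task unpermuted are assumptions, not something prescribed by the cited papers.

```python
import random


def split_classes(labels, classes_per_task):
    """Group sample indices into tasks of `classes_per_task` consecutive classes."""
    classes = sorted(set(labels))
    tasks = []
    for start in range(0, len(classes), classes_per_task):
        task_classes = set(classes[start:start + classes_per_task])
        tasks.append([i for i, y in enumerate(labels) if y in task_classes])
    return tasks


def make_permutations(n_pixels, n_tasks, seed=0):
    """One fixed pixel permutation per task; task 0 keeps the original order."""
    rng = random.Random(seed)
    perms = [list(range(n_pixels))]
    for _ in range(n_tasks - 1):
        perm = list(range(n_pixels))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def permute_image(flat_image, perm):
    """Apply a task's permutation to a flattened image."""
    return [flat_image[j] for j in perm]


if __name__ == "__main__":
    # Toy labels standing in for a 10-class dataset such as MNIST.
    labels = [i % 10 for i in range(100)]
    tasks = split_classes(labels, classes_per_task=2)
    print("number of split tasks:", len(tasks))

    perms = make_permutations(n_pixels=28 * 28, n_tasks=3)
    image = list(range(28 * 28))  # placeholder for a flattened 28x28 digit
    print("first pixels in task 1:", permute_image(image, perms[1])[:5])
```

Because each task reuses the same permutation for all of its images, the pixel statistics of a task stay consistent while the spatial layout is destroyed, which is why a model with a spatial prior is at a disadvantage on this benchmark.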

Bibliography

Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Gregory Slabaugh, Tinne Tuytelaars. May 26, 2020. "A Continual Learning Survey: Defying Forgetting in Classification Tasks". arXiv:1909.08383 [cs, stat]. http://arxiv.org/abs/1909.08383.
Peter L. Bartlett, Shai Ben-David, Sanjeev R. Kulkarni. 1996. "Learning Changing Concepts by Exploiting the Structure of Change". In Proceedings of the Ninth Annual Conference on Computational Learning Theory - COLT '96, 131–39. Desenzano del Garda, Italy: ACM Press. DOI.
Jonathan Baxter. 1998. "Theoretical Models of Learning to Learn". In Learning to Learn, edited by Sebastian Thrun and Lorien Pratt, 71–94. Boston, MA: Springer US. DOI.
Anastasia Pentina, Shai Ben-David. 2015. "Multi-task and Lifelong Learning of Kernels". In Algorithmic Learning Theory, edited by Kamalika Chaudhuri, Claudio Gentile, and Sandra Zilles, 194–208. Lecture Notes in Computer Science. Cham: Springer International Publishing. DOI.
Anastasia Pentina, Ruth Urner. 2016. "Lifelong Learning with Weighted Majority Votes". In Advances in Neural Information Processing Systems. Vol. 29. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2016/hash/f39ae9ff3a81f499230c4126e01f421b-Abstract.html.
Anastasia Pentina, Christoph H Lampert. 2015. "Lifelong Learning with Non-i.i.d. Tasks". In Advances in Neural Information Processing Systems. Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/hash/9232fe81225bcaef853ae32870a2b0fe-Abstract.html.
Maria-Florina Balcan, Avrim Blum, Vaishnavh Nagarajan. February 12, 2020. "Lifelong Learning in Costly Feature Spaces". Theoretical Computer Science, Special Issue on Algorithmic Learning Theory, 808 (February):14–37. DOI.
Daniel Kifer, Shai Ben-David, Johannes Gehrke. 2004. "Detecting Change in Data Streams". In VLDB, 4:180–91. Toronto, Canada.
Shai Ben-David, Reba Schuller Borbely. December 1, 2008. "A Notion of Task Relatedness Yielding Provable Multiple-task Learning Guarantees". Machine Learning 73 (3):273–87. DOI.
Shai Ben-David, Reba Schuller. 2003. "Exploiting Task Relatedness for Multiple Task Learning". In Learning Theory and Kernel Machines, edited by Bernhard Schölkopf and Manfred K. Warmuth, 567–80. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. DOI.
Ali Geisa, Ronak Mehta, Hayden S. Helm, Jayanta Dey, Eric Eaton, Jeffery Dick, Carey E. Priebe, Joshua T. Vogelstein. January 6, 2022. "Towards a Theory of Out-of-distribution Learning". arXiv. DOI.
Andrew Carlson, Justin Betteridge, Bryan Kisiel. 2010. "Toward an Architecture for Never-ending Language Learning". In Proceedings of the Conference on Artificial Intelligence (AAAI) (2010), 1306–13. DOI.
Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, Richard E. Turner. 2017. "Variational Continual Learning". CoRR abs/1710.10628. http://arxiv.org/abs/1710.10628.
Friedemann Zenke, Ben Poole, Surya Ganguli. 2017. "Continual Learning Through Synaptic Intelligence". In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, edited by Doina Precup and Yee Whye Teh, 70:3987–95. Proceedings of Machine Learning Research. PMLR. http://proceedings.mlr.press/v70/zenke17a.html.
Hanul Shin, Jung Kwon Lee, Jaehong Kim, Jiwon Kim. 2017. "Continual Learning with Deep Generative Replay". In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, edited by Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, 2990–99. https://proceedings.neurips.cc/paper/2017/hash/0efbe98067c6c73dba1250d2beaa81f9-Abstract.html.
Alex Krizhevsky, Geoffrey Hinton. 2009. "Learning Multiple Layers of Features from Tiny Images". University of Toronto.
Tom Veniat, Ludovic Denoyer, Marc'Aurelio Ranzato. February 12, 2021. "Efficient Continual Learning with Modular Networks and Task-driven Priors". arXiv:2012.12631 [cs]. http://arxiv.org/abs/2012.12631.
Ian J. Goodfellow, Mehdi Mirza, Xia Da, Aaron C. Courville, Yoshua Bengio. 2014. "An Empirical Investigation of Catastrophic Forgetting in Gradient-based Neural Networks". In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, edited by Yoshua Bengio and Yann LeCun. http://arxiv.org/abs/1312.6211.
Rupesh Kumar Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino J. Gomez, Jürgen Schmidhuber. 2013. "Compete to Compute". In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a Meeting Held December 5-8, 2013, Lake Tahoe, Nevada, United States, edited by Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, 2310–18. https://proceedings.neurips.cc/paper/2013/hash/8f1d43620bc6bb580df6e80b0dc05c48-Abstract.html.
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, et al. January 25, 2017. "Overcoming Catastrophic Forgetting in Neural Networks". arXiv:1612.00796 [cs, stat]. http://arxiv.org/abs/1612.00796.