# Continual learning

Tags: Machine learning

Continual learning is a form of supervised learning in which there is no separate “testing phase” associated with the decision process. Instead, training samples keep arriving, and the algorithm must simultaneously make predictions and keep learning.

This is challenging for a fixed neural network architecture: since its capacity is fixed, it is bound either to forget previously learned things or to be unable to learn anything new.

A definition from the survey (De Lange et al. 2020):

> The General Continual Learning setting considers an infinite stream of training data where at each time step, the system receives a (number of) new sample(s) drawn non-i.i.d. from a current distribution that could itself experience sudden or gradual changes.
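
In code, this setting amounts to a predict-then-learn loop over the stream. Below is a minimal sketch of that protocol; the `OnlineModel` interface, the `stream` iterable, and the prequential accuracy bookkeeping are illustrative assumptions, not part of any specific library.

```python
from typing import Iterable, Protocol, Tuple


class OnlineModel(Protocol):
    """Anything that can predict a label and then update itself on the true label."""

    def predict(self, x): ...

    def update(self, x, y): ...


def continual_loop(model: OnlineModel, stream: Iterable[Tuple[object, object]]) -> float:
    """Run the predict-then-learn protocol; returns online (prequential) accuracy."""
    n_seen, n_correct = 0, 0
    for x, y in stream:           # samples are non-i.i.d. and the distribution may drift
        y_hat = model.predict(x)  # predict before the label is revealed (no test phase)
        n_correct += int(y_hat == y)
        n_seen += 1
        model.update(x, y)        # then take a learning step on the revealed label
    return n_correct / max(n_seen, 1)
```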

## Theoretical foundations

### Concept shift

(Bartlett et al. 1996) explores how to learn under the assumption of concept shift:

> The learner sees a sequence of random examples, labelled according to a sequence of functions, and must provide an accurate estimate of the target function sequence.

Formally, at time $$t$$ the learner sees a random example $$x_t$$ from some domain $$X$$, together with the value $$f_t(x_t) \in \{0, 1\}$$, where $$f_t$$ is an unknown function from some known class $$F$$.

The paper addresses two problems of learning with changing concepts:

• Estimation: When can we estimate a sequence $$(f_1, \cdots, f_n)$$ from observations $$((x_1, f_1(x_1)), \cdots, (x_n, f_n(x_n)))$$?
• Prediction: When can one predict the next concept $$f_{n+1}$$ from a sequence of concepts $$(f_1, \cdots, f_n)$$?
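
As a concrete toy illustration of this setting (not the estimators analysed in the paper), the following sketch simulates a slowly drifting threshold concept $$f_t(x) = 1[x \geq \theta_t]$$ on $$X = [0, 1]$$ and tracks it with a sliding-window estimator. The drift rate, window size, and midpoint estimator are arbitrary illustrative choices.

```python
import random


def drifting_threshold_stream(n_steps: int, drift: float = 0.002):
    """Yield (x_t, f_t(x_t), theta_t) where f_t(x) = 1 if x >= theta_t, else 0."""
    theta = 0.5
    for _ in range(n_steps):
        x = random.random()  # x_t drawn uniformly from X = [0, 1]
        yield x, int(x >= theta), theta
        theta = min(max(theta + random.uniform(-drift, drift), 0.1), 0.9)  # slow concept drift


def windowed_estimate(window):
    """Midpoint between the largest negative and the smallest positive example in the window."""
    lo = max((x for x, y in window if y == 0), default=0.0)
    hi = min((x for x, y in window if y == 1), default=1.0)
    return (lo + hi) / 2


window, errors = [], []
for x, y, theta in drifting_threshold_stream(5000):
    window.append((x, y))
    window = window[-100:]  # forget old examples so the drift can be tracked
    errors.append(abs(windowed_estimate(window) - theta))

print(f"mean |estimated theta - true theta|: {sum(errors) / len(errors):.3f}")
```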

### Formal definitions of different aspects of continual learning

#### Learning to learn

The paper (Baxter 1998) defines the problem of learning to learn as follows (the notation is chosen to contrast with regular supervised learning):

• an input space $$X$$ and an output space $$Y$$,
• a loss function $$l: Y \times Y \rightarrow \mathbb{R}$$,
• an environment $$(P, Q)$$ where $$P$$ is the set of all probability distributions on $$X \times Y$$ and $$Q$$ is a distribution on $$P$$,
• a hypothesis space family $$H = \{\mathcal{H}\}$$ where each $$\mathcal{H} \in H$$ is a set of functions $$h: X \rightarrow Y$$.
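
For completeness, (Baxter 1998) roughly frames the objective as choosing the hypothesis space $$\mathcal{H} \in H$$ whose best hypothesis performs well on average over tasks drawn from the environment, i.e. minimizing $$\mathbb{E}_{P \sim Q}\left[\inf_{h \in \mathcal{H}} \mathbb{E}_{(x, y) \sim P} \, l(h(x), y)\right]$$; each individual task is then learned by picking some $$h \in \mathcal{H}$$ as in ordinary supervised learning.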

## Benchmarks

### Computer vision based benchmarks

• Split MNIST: the MNIST dataset is split into five 2-class tasks (Nguyen et al. 2017; Zenke et al. 2017; Shin et al. 2017); a construction sketch is given after this list.

• Split CIFAR10: the CIFAR10 dataset (Krizhevsky, Hinton 2009) is split into five 2-class tasks.

• Split mini-ImageNet: a mini-ImageNet task (100 classes) split into 20 five-class tasks.

• Continual Transfer Learning Benchmark: a benchmark from Facebook AI built from 7 computer vision datasets: MNIST, CIFAR10, CIFAR100, DTD, SVHN, Rainbow-MNIST, and Fashion MNIST. The tasks are all 5-class or 10-class classification tasks. Some example task-sequence constructions from (Veniat et al. 2021):

The last task of $$S_{out}$$ is the first task with its output labels shuffled. The last task of $$S_{in}$$ is the same as its first task, except that the MNIST images have a different background color. $$S_{long}$$ has 100 tasks; each is constructed by sampling a dataset, then 5 classes at random, and finally an amount of training data drawn from a distribution that favors small tasks toward the end of the learning experience.

• Permuted MNIST: for each task, the pixels of the MNIST digits are permuted by a fixed random permutation, generating a new task of the same difficulty as the original but with a different solution (see the sketch after this list). This benchmark is not suitable if the model has a spatial prior (like a CNN). First used in (Goodfellow et al. 2014; Srivastava et al. 2013), and also in (Kirkpatrick et al. 2017).

• Rotated MNIST: each task contains digits rotated by a fixed angle between 0 and 180 degrees.
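
A rough sketch of how the Split MNIST and Permuted MNIST task streams above can be built with torch/torchvision follows; the particular class pairing, number of permuted tasks, and permutation seeds are arbitrary illustrative choices rather than canonical settings.

```python
import torch
from torchvision import datasets

# Load MNIST once; images become flat 784-dimensional vectors in [0, 1].
mnist = datasets.MNIST(root="./data", train=True, download=True)
x = mnist.data.float().reshape(-1, 28 * 28) / 255.0
y = mnist.targets

# Split MNIST: five 2-class tasks, here {0,1}, {2,3}, ..., {8,9}, relabelled to {0, 1}.
split_tasks = []
for a, b in [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]:
    mask = (y == a) | (y == b)
    split_tasks.append((x[mask], (y[mask] == b).long()))

# Permuted MNIST: each task applies one fixed random pixel permutation to every image,
# so each task has the same difficulty as the original but a different solution.
permuted_tasks = []
for task_id in range(5):
    gen = torch.Generator().manual_seed(task_id)
    perm = torch.randperm(28 * 28, generator=gen)
    permuted_tasks.append((x[:, perm], y))
```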

## Bibliography

1. . . "A Continual Learning Survey: Defying Forgetting in Classification Tasks". Arxiv:1909.08383 [cs, Stat]. http://arxiv.org/abs/1909.08383.
2. . . "Learning Changing Concepts by Exploiting the Structure of Change". In Proceedings of the Ninth Annual Conference on Computational Learning Theory - COLT '96, 131–39. Desenzano del Garda, Italy: ACM Press. DOI.
3. . . "Theoretical Models of Learning to Learn". In Learning to Learn, edited by Sebastian Thrun and Lorien Pratt, 71–94. Boston, MA: Springer US. DOI.
4. . . "Multi-task and Lifelong Learning of Kernels". In Algorithmic Learning Theory, edited by Kamalika Chaudhuri, CLAUDIO GENTILE, and Sandra Zilles, 194–208. Lecture Notes in Computer Science. Cham: Springer International Publishing. DOI.
5. . . "Lifelong Learning with Weighted Majority Votes". In Advances in Neural Information Processing Systems. Vol. 29. Curran Associates, Inc.. https://proceedings.neurips.cc/paper/2016/hash/f39ae9ff3a81f499230c4126e01f421b-Abstract.html.
6. . . "Lifelong Learning with Non-i.i.d. Tasks". In Advances in Neural Information Processing Systems. Vol. 28. Curran Associates, Inc.. https://proceedings.neurips.cc/paper/2015/hash/9232fe81225bcaef853ae32870a2b0fe-Abstract.html.
7. . . "Lifelong Learning in Costly Feature Spaces". Theoretical Computer Science, Special Issue on Algorithmic Learning Theory, 808 (February):14–37. DOI.
8. . . "Detecting Change in Data Streams". In VLDB, 4:180–91. Toronto, Canada.
9. . . "A Notion of Task Relatedness Yielding Provable Multiple-task Learning Guarantees". Machine Learning 73 (3):273–87. DOI.
10. . . "Exploiting Task Relatedness for Multiple Task Learning". In Learning Theory and Kernel Machines, edited by Bernhard Schölkopf and Manfred K. Warmuth, 567–80. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. DOI.
11. . . "Towards a Theory of Out-of-distribution Learning". arXiv. DOI.
12. . . "Toward an Architecture for Never-ending Language Learning.". In Proceedings of the Conference on Artificial Intelligence (AAAI) (2010), 1306–13. DOI.
13. . . "Variational Continual Learning". Corr abs/1710.10628. http://arxiv.org/abs/1710.10628.
14. . . "Continual Learning Through Synaptic Intelligence". In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, edited by Doina Precup and Yee Whye Teh, 70:3987–95. Proceedings of Machine Learning Research. PMLR. http://proceedings.mlr.press/v70/zenke17a.html.
15. . . "Continual Learning with Deep Generative Replay". In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, edited by Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, 2990–99. https://proceedings.neurips.cc/paper/2017/hash/0efbe98067c6c73dba1250d2beaa81f9-Abstract.html.
16. . . "Learning Multiple Layers of Features from Tiny Images". University of Toronto.
17. . . "Efficient Continual Learning with Modular Networks and Task-driven Priors". Arxiv:2012.12631 [cs]. http://arxiv.org/abs/2012.12631.
18. . . "An Empirical Investigation of Catastrophic Forgeting in Gradient-based Neural Networks". In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, edited by Yoshua Bengio and Yann LeCun. http://arxiv.org/abs/1312.6211.
19. . . "Compete to Compute". In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a Meeting Held December 5-8, 2013, Lake Tahoe, Nevada, United States, edited by Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, 2310–18. https://proceedings.neurips.cc/paper/2013/hash/8f1d43620bc6bb580df6e80b0dc05c48-Abstract.html.
20. . . "Overcoming Catastrophic Forgetting in Neural Networks". Arxiv:1612.00796 [cs, Stat]. http://arxiv.org/abs/1612.00796.
