This paper focuses on the problem of (self-)supervised continual learning with deep neural networks. The Firehose dataset introduced by the authors is a large database of timestamped tweets. The goal is to learn a per-user language model from this stream, a task the authors call personalized online language learning (POLL).
The authors also introduce a new extension of gradient descent for continual learning. It uses a replay buffer to retain information about past examples and a validation buffer to choose the number of gradient steps to take at each update.
This new method, called ConGraD, outperforms online gradient descent in most settings studied in the paper.
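To make the mechanism concrete, here is a minimal sketch of the replay-buffer-plus-validation-buffer idea described above, on a toy 1-D regression stream. This is my own illustrative reconstruction, not the authors' code: the function names, the tiny linear model, the buffer sizes, and the early-stopping rule (stop when validation loss stops improving) are all assumptions.

```python
import random

def congrad_style_update(w, batch, replay_buf, val_buf, lr=0.1, max_steps=8):
    """One online update: take gradient steps on the current batch plus the
    replay buffer, early-stopping when loss on the held-out validation
    buffer stops improving. (Simplified sketch, not the paper's algorithm.)"""
    def loss(w, data):
        return sum((w * x - y) ** 2 for x, y in data) / len(data)

    def grad(w, data):
        return sum(2 * (w * x - y) * x for x, y in data) / len(data)

    train = batch + replay_buf
    best_w = w
    best_val = loss(w, val_buf) if val_buf else float("inf")
    for _ in range(max_steps):
        w = w - lr * grad(w, train)
        v = loss(w, val_buf) if val_buf else float("inf")
        if v < best_val:
            best_w, best_val = w, v
        else:
            break  # validation loss stopped improving: keep the best weights
    return best_w

# Toy stream: examples arrive one at a time from a fixed target y = 3x.
random.seed(0)
w = 0.0
replay_buf, val_buf = [], []
for t in range(50):
    x = random.uniform(-1, 1)
    batch = [(x, 3.0 * x)]
    if t % 5 == 0:
        val_buf.extend(batch)                    # route some examples to validation
    else:
        w = congrad_style_update(w, batch, replay_buf, val_buf)
        replay_buf = (replay_buf + batch)[-20:]  # bounded replay buffer
```

On this stationary toy stream the learned weight converges toward the true slope of 3; the interesting behavior in the paper's non-stationary setting is that the validation buffer adaptively cuts the number of steps when more steps would hurt.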
Because the authors are introducing both a new dataset and a new online learning framework, it is natural that their method is not yet optimal and will likely be improved upon by future work.
I find the concept of continual learning much more interesting than traditional supervised learning: it is better suited to real-world situations where data is changing and unpredictable.
However, continual learning is still being framed as an odd hybrid loss function that accounts for both the history of data and future data. This may be the probabilistically correct way to do it, but it is not satisfying in my opinion. Nature, if seen as a learning process, does not optimize for past examples; it keeps changing in response to the environment. It certainly has the capability to retain information and pass it through time (e.g. successful evolutionary strategies), but it does not seem to be optimizing a hybrid loss function.