Neural Architecture Search (NAS) is a relatively new but already successful and productive subfield of machine learning, usually seen as part of deep learning. It is also often thought of as part of AutoML (see [1]). AutoML is about automating some of the most tedious aspects of machine learning and data science: hyperparameter search, feature engineering, and model selection from an ever-growing catalog of battle-tested algorithms and experimental models.
A search problem
One could argue that a lot of research in deep learning has essentially consisted in finding new, clever neural network architectures that could beat the previous state of the art on some dataset. I am not saying that this particular line of work has been more important than other ideas researchers have come up with, some of which enabled faster training, better performance, or cheaper computation. However, neural network architecture changes, usually deeply thought through and rationalized changes like “let’s double the number of convolutional layers”, were at the root of many early successes of deep learning.
Let's add even more layers! - Number of layers in some CNN architectures.
Disclaimer: the number of layers is certainly not the only measure of a deep neural network's complexity, and the purpose of this visualization is to highlight a general trend rather than to compare the models.
Looking back, this whole process can be seen as a large search game. Each player was searching for the right configuration: the right number of layers, filters, and activations. The reward was the test score on CIFAR-10, ImageNet, or some other dataset (the comparison is not so surprising once we observe that many NAS papers use reinforcement learning).
And this has not been easy, because nobody really knew what they were doing: should you keep adding layers? Should you change the number of filters, or the size of your convolutions? Progressively, something started to seem obvious to several researchers: there should be a way to automate this whole process. The idea was explored early in the history of machine learning [2], but has gained renewed interest recently.
Why search for architectures?
Searching for a model is tedious, but is NAS just another hyperparameter optimization task? The success of deep learning is often attributed to the powerful feature engineering it automates: hierarchical, complex features are learned through optimization rather than painfully designed by hand, sometimes with computationally expensive mathematical operations. However, not all deep neural networks achieve this with equal efficiency, and some recent progress, e.g. in computer vision, was driven by the design of increasingly efficient architectures that leverage properties of the input data (translation invariance in the case of convolutional neural networks [3]).
We do not yet fully understand how and why some architectures work better than others, but it is natural to want to automate the process of constructing and discovering new ones. We already have a powerful tool for automating the learning process in our models: optimization. Why not apply it to learning the architectures themselves? This is sometimes called meta-learning.
Challenges of NAS
Neural architecture search might sound like a reasonable and natural idea, but it is much more complex than it looks. To attack it with optimization algorithms, the problem needs to be well defined and to have clear objectives.
Three components of Neural Architecture Search
The search space
While the problem can easily be framed as a search problem, it is not immediately clear what space we are searching in. This is a recurring issue in challenging search problems: it is hard to define the granularity at which we are searching. For instance, with a simple neural network, should we allow any number of layers, neurons per layer, or activation functions? By restricting the space to a small set of building blocks, we might make the search much quicker, but at the cost of reducing the expressive power of the resulting networks.
A simple search among 2 possible activations for each layer, with a fixed number of layers and units per layer. There are only 6 possible networks of this kind.
There are many different ways of specifying the search space. One can focus on a few elements, such as activation functions, or manipulate bigger components, such as blocks.
Now we allow the number of layers and units per layer to change. The number of possible networks explodes!
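To make this explosion concrete, here is a minimal sketch in Python that enumerates such a toy space. The depth, width, and activation choices below are my own illustrative assumptions, not taken from any paper or from the figures above.

```python
from itertools import product

# Hypothetical toy search space (illustrative assumptions, not from a paper).
depth_choices = [1, 2, 3]               # allowed numbers of layers
units_choices = [16, 32, 64]            # allowed units per layer
activation_choices = ["relu", "tanh"]   # allowed activations per layer

def architectures():
    """Yield every distinct network in the toy space as a tuple of layer specs."""
    layer_options = list(product(units_choices, activation_choices))
    for depth in depth_choices:
        # For a fixed depth d, each layer independently picks a
        # (units, activation) pair: len(layer_options) ** d networks.
        yield from product(layer_options, repeat=depth)

print(sum(1 for _ in architectures()))  # 6 + 36 + 216 = 258 distinct networks
```

And this is still a toy: add kernel sizes, skip connections, or a few more depth options, and exhaustively training every candidate quickly becomes hopeless.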
How to search?
Another question with neural architecture search is how the space should be explored. It cannot be done with traditional optimization techniques because of several undesirable properties of this search space, identified by Miller et al. as early as 1989 [2]:
- Size: the space is extremely large, even if we only consider quite simple topologies.
- Undifferentiable: changes to a network's structure are discrete and can have discontinuous effects on its behavior, so the space is not differentiable (a minimal sketch of such a discrete edit follows this list).
- Indirect mapping between architecture and performance: many factors influence the measured performance of an architecture, including the random sampling of training data, making the space complex and noisy.
- Deceptive: small changes in network architecture can have strong and unpredictable effects on performance. Conversely, dissimilar architectures can perform comparably well. The space is deceptive and multimodal.
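To make the second point concrete, here is a minimal sketch, in plain Python, of what a “step” through this space looks like. The encoding of an architecture as a list of layer specs is an assumption of mine for illustration, not taken from [2]: the point is only that edits are symbolic and discrete, so there is no gradient to follow.

```python
import random

# Hypothetical architecture encoding (my own, for illustration only):
# a network is just a list of layer specs.
arch = [
    {"units": 32, "activation": "relu"},
    {"units": 64, "activation": "tanh"},
]

def mutate(arch):
    """Return a copy of `arch` with one discrete, random edit applied."""
    new_arch = [dict(layer) for layer in arch]
    edit = random.choice(["resize", "swap_activation", "add_layer"])
    if edit == "resize":
        random.choice(new_arch)["units"] = random.choice([16, 32, 64, 128])
    elif edit == "swap_activation":
        random.choice(new_arch)["activation"] = random.choice(["relu", "tanh"])
    else:
        # A single edit that changes the whole topology: there is no notion
        # of an infinitesimal step, hence nothing to differentiate through.
        position = random.randrange(len(new_arch) + 1)
        new_arch.insert(position, {"units": 32, "activation": "relu"})
    return new_arch

print(mutate(arch))
```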
What is a good neural network?
Searching for a network is one thing, but one also has to know when to stop searching and settle on an architecture. There are, however, many ways to evaluate a NAS algorithm, and the right one depends on what we expect the resulting architecture to achieve:
Are we only trying to find a good architecture for a specific downstream task? If so, should it give the best results for this task with minimal training, maximal computational efficiency, or a minimal amount of data? And what should the result be compared to: other search methods, or manually designed networks?
Or maybe we should instead be looking for a good transferable architecture, one that achieves interesting performance across a range of tasks with minimal training.
This is not just a theoretical problem: random search is a very competitive baseline for many famous published NAS methods, as argued by Li and Talwalkar in [4].
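To see how little machinery that baseline needs, here is a hedged sketch of random search over a toy space like the one above. `train_and_evaluate` is a placeholder of my own standing in for the expensive train-then-validate step; nothing here reproduces the exact setup of [4].

```python
import random

def sample_architecture():
    """Draw one architecture uniformly at random from the toy space."""
    depth = random.choice([1, 2, 3])
    return [{"units": random.choice([16, 32, 64]),
             "activation": random.choice(["relu", "tanh"])}
            for _ in range(depth)]

def train_and_evaluate(arch):
    """Placeholder: train `arch` on the task and return a validation score.

    In a real setup this is the expensive part (minutes to days per
    candidate); here we fake a noisy score so the sketch runs end to end.
    """
    return random.random()

def random_search(budget=100):
    """Keep the best of `budget` independently sampled architectures."""
    best_arch, best_score = None, float("-inf")
    for _ in range(budget):
        arch = sample_architecture()
        score = train_and_evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

print(random_search(budget=20))
```

Any proposed NAS method should at least beat this loop under an equal compute budget.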
What is the final goal?
Theoretically, “solving” the NAS problem would mean being in possession of a system that can find the best possible neural network for any given task. Moreover, given that neural networks are universal approximators, this would probably be close or equivalent to general artificial intelligence. Whatever problem (or function, or input/output pairs from a function) we throw at our machine, it could spit out a neural network that computes it! Now imagine putting this machine in our physical world, with sensors that can perceive its surroundings. Would it return something similar to our brain as its optimal neural network? Or could it find something much better and more efficient? Of course, there are several limitations to this reasoning, but I believe the general idea to be true:
Solving NAS is equivalent to obtaining general artificial intelligence
Or at least, it would be such a step forward that the end goal would suddenly look much more achievable. However, none of this gives us any insight into how to build such a general-purpose, neural network-spitting machine. In the rest of the post, we will detail a few research directions that have been explored to this end.
Neuroevolution
Genetic algorithms
Reinforcement learning
Another historically popular approach to NAS was to think of the neural network-generating algorithm as a kind of “intelligent designer” agent. This agent is tasked with the creation of a neural network architecture and makes decisions according to its internal policy. The reward of the agent is the performance of the resulting neural network on some downstream task. This can be seen as automating the research game described earlier (easily framed as an RL problem) with artificial agents.
This approach was taken in [5], where the authors train a recurrent neural network (RNN) to generate a model description one hyperparameter at a time, the RNN taking the last sampled item as input for the next step. The search space they experiment with is quite large, albeit limited compared to the endless possibilities of NAS.
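Below is a minimal sketch in the spirit of that controller, not a reproduction of the authors' model: an LSTM samples one decision per step and feeds it back as the next input. The vocabulary, dimensions, and names are my own illustrative assumptions, and the code uses PyTorch.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary of decisions: here, "units per layer" choices.
CHOICES = {0: 16, 1: 32, 2: 64}

class Controller(nn.Module):
    def __init__(self, vocab_size=3, hidden=32):
        super().__init__()
        self.hidden = hidden
        self.embed = nn.Embedding(vocab_size, hidden)
        self.cell = nn.LSTMCell(hidden, hidden)
        self.head = nn.Linear(hidden, vocab_size)

    def sample(self, steps=4):
        """Autoregressively sample `steps` decisions; return tokens and log-prob."""
        h = torch.zeros(1, self.hidden)
        c = torch.zeros(1, self.hidden)
        token = torch.zeros(1, dtype=torch.long)  # arbitrary start token
        tokens, log_probs = [], []
        for _ in range(steps):
            h, c = self.cell(self.embed(token), (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            token = dist.sample()  # the sampled choice becomes the next input
            tokens.append(token.item())
            log_probs.append(dist.log_prob(token))
        return tokens, torch.stack(log_probs).sum()

controller = Controller()
arch_tokens, log_prob = controller.sample()
print([CHOICES[t] for t in arch_tokens])  # e.g. [32, 16, 64, 32]
# In REINFORCE-style training, `log_prob` would be scaled by the child
# network's validation accuracy (the reward) to update the controller.
```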
At the time, the authors reported encouraging state-of-the-art results on image datasets and on language modeling with the Penn Treebank (PTB) dataset. They also transferred recurrent neural network cells found with their method from the word-level PTB task to the character-level PTB task, showing some amount of transferability of the discovered architectures.
Conclusion
Ultimately, with this kind of very popular subfield of machine learning, the volume of available literature is so huge that we often end up in a “rich get richer” type of situation, where papers from big companies or labs with especially good PR tend to overshadow all the others and absorb most citations from other researchers.
This is especially true when people not familiar with the field carry out a somewhat superficial survey of the available literature (which is my case for this post, and I apologize in advance for falling into the same trap). I still believe that people eventually get the credit they deserve, which might sound naive, but it can take years, and in the meantime the winners may take all.