Adversarial examples in reinforcement learning

A small review of existing adversarial methods in RL.


The goal of this project was to write a review of adversarial examples in RL as of January 2019. It was also made into a talk for which the slides can be found above.

We begin with a brief review of adversarial examples in deep learning. These are inputs specifically crafted to fool a classifier into assigning them the wrong class. For example, an image can still look like a cat to a human while being optimized so that an object-recognition neural network classifies it as a dog.
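
To make the idea concrete, here is a minimal sketch of one well-known way to craft such an input, the fast gradient sign method (FGSM). This is an illustration only, not necessarily a method covered in the review: the logistic-regression "classifier", its weights, the input values, and the names `predict_proba` and `fgsm_perturb` are all hypothetical stand-ins for a real image network.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 ("dog") under a toy logistic-regression classifier."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(g):
    """Return -1, 0 or 1 depending on the sign of g."""
    return (g > 0) - (g < 0)

def fgsm_perturb(w, b, x, true_label, epsilon):
    """Take one signed-gradient step that increases the loss on the true label."""
    p = predict_proba(w, b, x)
    # For logistic regression, the gradient of the cross-entropy loss
    # with respect to the input x is (p - y) * w.
    grad = [(p - true_label) * wi for wi in w]
    # Each input coordinate moves by at most epsilon.
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy setup: class 0 = "cat", class 1 = "dog".
w, b = [1.0, -2.0, 0.5], 0.0
x = [0.2, 0.3, 0.1]  # predicted as "cat" (probability of "dog" below 0.5)
x_adv = fgsm_perturb(w, b, x, true_label=0, epsilon=0.2)

print(predict_proba(w, b, x) < 0.5)      # original input is classified as "cat"
print(predict_proba(w, b, x_adv) > 0.5)  # perturbed input is classified as "dog"
```

The perturbation is bounded per coordinate by `epsilon`, which is what lets the adversarial input stay close to the original while still changing the predicted class.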

At the time of the report, most adversarial attacks on RL exploited flaws in the deep neural networks used by common algorithms such as deep Q-learning. In the review, we identify three classes of attacks:

  • Non-targeted attacks: The goal is to make the agent choose an action that is not the optimal one.
  • Targeted attacks: The goal is to make the agent reach a particular state.
  • Training-time attacks: These attacks are assumed to be performed during the training of the model, and the goal is to change the training data so that the model is not trained properly.
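
As a rough illustration of the first class only, the sketch below perturbs an observation so that a greedy agent no longer picks its originally preferred action. It assumes a toy linear Q-function in place of a deep Q-network; the weight matrix `W`, the observation, and the helper names `q_values`, `greedy_action`, and `non_targeted_attack` are hypothetical and not taken from the review.

```python
def q_values(W, obs):
    """Q(s, a) = W[a] . s for a toy linear Q-function (one weight row per action)."""
    return [sum(wi * si for wi, si in zip(row, obs)) for row in W]

def greedy_action(W, obs):
    """Action with the highest Q-value for this observation."""
    q = q_values(W, obs)
    return max(range(len(q)), key=lambda a: q[a])

def sign(g):
    """Return -1, 0 or 1 depending on the sign of g."""
    return (g > 0) - (g < 0)

def non_targeted_attack(W, obs, epsilon):
    """Shift obs by at most epsilon per coordinate to favor the runner-up action."""
    q = q_values(W, obs)
    ranked = sorted(range(len(q)), key=lambda a: q[a], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Gradient of Q(s, runner_up) - Q(s, best) with respect to the observation s.
    grad = [wr - wb for wr, wb in zip(W[runner_up], W[best])]
    return [si + epsilon * sign(g) for si, g in zip(obs, grad)]

# Hypothetical Q-function with three actions and a two-dimensional observation.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
obs = [0.6, 0.5]
obs_adv = non_targeted_attack(W, obs, epsilon=0.1)

print(greedy_action(W, obs))      # action chosen on the clean observation
print(greedy_action(W, obs_adv))  # action chosen on the perturbed observation
```

A targeted attack would need a different objective, since it must steer the agent over several steps toward a chosen state, and a training-time attack would modify the data the agent learns from rather than the observations it sees at test time. Neither is shown here.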

Studying adversarial attacks is crucial for deploying RL models in production. If RL algorithms are ever widely adopted in production systems, these flaws will be exploited. Researching these attacks is essential for understanding them better and for building better defense mechanisms.

This project was a collaboration with Clément Acher, part of a reinforcement learning course taught by Alessandro Lazaric and Matteo Pirotta.

Last modified 2019.01.30