AN EXPLORATION STRATEGY OF REINFORCEMENT LEARNING | TOWARDS AI
EMI: Exploration with Mutual Information
A novel exploration method based on representation learning
Introduction
Reinforcement learning can be hard when the reward signal is sparse. In these scenarios, the exploration strategy becomes essential: a good exploration strategy not only helps the agent gain a faster and better understanding of the world but also makes it robust to changes in the environment. In this article, we discuss a novel exploration method, Exploration with Mutual Information (EMI), proposed by Kim et al. in ICML 2019. In a nutshell, EMI learns representations for both observations (states) and actions in the expectation that we can have a linear dynamics model on these representations. EMI then computes an intrinsic reward as the prediction error under this linear dynamics model. The intrinsic reward, combined with the environment reward, forms the final reward function, which can then be used by any RL method.
To avoid redundancy, we assume you are familiar with the concepts of mutual information and Markov decision processes.
Representation Learning
Representation for States and Actions
In EMI, we aim to learn representations ϕ(s): S → ℝᵈ and ψ(a): A → ℝᵈ for states and actions, respectively, such that the learned representations bear the most useful information about the dynamics. This can be achieved by maximizing two mutual information objectives: 1) the mutual information between [ϕ(s), ψ(a)] and ϕ(s’); 2) the mutual information between [ϕ(s), ϕ(s’)] and ψ(a). Mathematically, we maximize the following two objectives:
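In symbols, writing I(·;·) for mutual information, the two objectives take the following form (our notation, matching the distributions defined next; see the paper for the precise statement):

```latex
\max_{\phi,\psi}\; \mathcal{I}\big(\phi(s');\, [\phi(s), \psi(a)]\big)
  \;=\; D_{\mathrm{KL}}\!\big(P^{\pi}_{SAS'} \,\big\|\, P^{\pi}_{SA} \otimes P^{\pi}_{S'}\big),

\max_{\phi,\psi}\; \mathcal{I}\big(\psi(a);\, [\phi(s), \phi(s')]\big)
  \;=\; D_{\mathrm{KL}}\!\big(P^{\pi}_{SAS'} \,\big\|\, P^{\pi}_{SS'} \otimes P^{\pi}_{A}\big),
```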
where P_{SAS’}^π denotes the joint distribution of singleton experience tuples (s, a, s’) under policy π, and P_{A}^π, P_{SA}^π, and P_{SS’}^π are the corresponding marginal distributions. These objectives can be optimized with either MINE or DIM, which we discussed before. EMI follows the objective proposed by DIM, maximizing the mutual information (MI) between X and Y through the Jensen-Shannon divergence (JSD):
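The JSD-based bound scores "real" pairs sampled from the joint distribution against "fake" pairs drawn from the product of marginals, using a learned critic T. A minimal NumPy sketch of the estimator (the critic itself is assumed to be trained elsewhere; the toy scores below are illustrative):

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def jsd_mi_lower_bound(scores_joint, scores_product):
    """JSD-based lower bound on I(X; Y), as used in DIM.

    scores_joint:   critic values T(x, y) for pairs sampled from the joint.
    scores_product: critic values for pairs from the product of marginals
                    (in practice, y shuffled within the batch).
    """
    return (-softplus(-scores_joint)).mean() - softplus(scores_product).mean()

# A critic that separates joint pairs from shuffled pairs yields a larger
# bound than an uninformative (all-zero) critic.
rng = np.random.default_rng(0)
tight = jsd_mi_lower_bound(rng.normal(3.0, 0.1, 256), rng.normal(-3.0, 0.1, 256))
loose = jsd_mi_lower_bound(np.zeros(256), np.zeros(256))
```

Maximizing this bound with respect to both the critic and the embeddings ϕ, ψ drives the representations to retain information about the dynamics.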
Embedding the linear dynamics model with the error model
Besides the above objectives, EMI also imposes a simple and convenient topology on the embedding space, one in which transitions are linear. Concretely, we also seek to learn the state representation ϕ(s) and the action representation ψ(a) such that the representation of the corresponding next state ϕ(s’) follows linear dynamics, i.e.
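In its simplest additive instantiation, consistent with the error model introduced next, this constraint reads:

```latex
\phi(s') \;\approx\; \phi(s) + \psi(a)
```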
Intuitively, this might allow us to offload most of the modeling burden onto the embedding functions.
Regardless of the expressivity of the neural networks, however, there always exists some irreducible error under the linear dynamics model. For example, a state transition that leads the agent from one room to another in an Atari environment would be extremely challenging to explain under a linear model. To this end, the authors introduce the error model e(s, a): S×A → ℝᵈ, another neural network that takes the state and action as input and estimates the irreducible error under the linear model. To train the error model, we minimize the Euclidean norm of the error term, so that the error term is used sparingly, only for genuinely unexplainable transitions. The following objective states the embedding learning problem under linear dynamics with modeled errors:
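As a concrete sketch of this trade-off, the relaxed loss below penalizes both the residual of the additive linear model (with the modeled error folded in) and the magnitude of the error itself. The weight `lam` is a hypothetical coefficient, not a value from the paper:

```python
import numpy as np

def embedding_loss(phi_s, psi_a, phi_s_next, err, lam=1.0):
    """Relaxed linear-dynamics objective with a modeled error term.

    A sketch under the additive linear model; all inputs are (batch, d)
    arrays of embeddings, and `lam` is an assumed penalty weight.
    """
    # Residual of the linear model once the modeled error is folded in.
    residual = phi_s + psi_a + err - phi_s_next
    # Keep the error term small so it is spent only on transitions the
    # linear model genuinely cannot explain.
    error_penalty = np.linalg.norm(err, axis=1).mean()
    fit = (residual ** 2).sum(axis=1).mean()
    return lam * error_penalty + fit
```

When e(s, a) exactly absorbs the residual, the fit term vanishes and only the norm penalty remains, which is what discourages the network from explaining everything through the error model.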
The Objective of Representation Learning
Now we put all the objectives together:
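Schematically, with λ the Lagrange multiplier and I the JSD-based MI estimates, the combined objective has the following form (our sketch, assembled from the terms described in the previous subsections):

```latex
\min_{\phi,\psi,e}\;
  \lambda\,\mathbb{E}\big[\lVert e(s,a)\rVert_2\big]
  \;+\; \mathbb{E}\big[\lVert \phi(s) + \psi(a) + e(s,a) - \phi(s')\rVert_2^2\big]
  \;-\; \mathcal{I}\big(\phi(s');\,[\phi(s),\psi(a)]\big)
  \;-\; \mathcal{I}\big(\psi(a);\,[\phi(s),\phi(s')]\big)
```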
The first expectation term is obtained by applying a Lagrange multiplier to the constrained optimization problem defined in the previous sub-section, and the second term is the negative of the MI objectives.
In practice, the authors found the optimization process to be more stable when the distribution of action embeddings is regularized to follow a predefined prior distribution, e.g. a standard normal distribution. This introduces an additional KL penalty D_{KL}(P_A^π‖𝒩(0, I)), similar to VAEs, where P_A^π is an empirical normal distribution whose parameters are estimated from a batch of samples. This KL divergence has a simple closed form in terms of the batch mean and variance.
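A NumPy sketch of this regularizer, estimating the Gaussian parameters from a batch and applying the closed-form KL to the standard normal:

```python
import numpy as np

def kl_to_standard_normal(action_embeddings):
    """KL( N(mu, diag(var)) || N(0, I) ), with mu and var estimated from a
    batch of action embeddings of shape (batch, d).
    """
    mu = action_embeddings.mean(axis=0)
    var = action_embeddings.var(axis=0)
    # Closed form of the Gaussian KL, summed over embedding dimensions.
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))

rng = np.random.default_rng(0)
batch = rng.normal(0.0, 1.0, size=(100_000, 4))
kl_near_zero = kl_to_standard_normal(batch)      # ~0: batch matches the prior
kl_shifted = kl_to_standard_normal(batch + 2.0)  # large: mean far from zero
```

Adding this penalty to the loss pulls the action embeddings toward the prior without constraining the state embeddings.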
The authors also tried regularizing the state embedding in the same way, but found that it made the optimization process much less stable. This may be because the distribution of states is much more likely to be skewed than the distribution of actions, especially during the initial stage of optimization.
Intrinsic Reward
We now define intrinsic reward as the prediction error in the embedding space:
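Under the additive linear model with the error term, this prediction error is (a sketch consistent with the embedding objective above):

```latex
r^{\mathrm{int}}(s, a, s') \;=\; \big\lVert \phi(s) + \psi(a) + e(s, a) - \phi(s') \big\rVert_2^2
```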
This formulation incorporates the error term, ensuring that the irreducible error is not mistaken for novelty. We combine the intrinsic reward with the extrinsic reward to get the final reward function used to train an RL agent:
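Both quantities are cheap to compute once the embeddings are available. A small NumPy sketch, where `eta` is a hypothetical coefficient scaling the intrinsic bonus (not a value from the paper):

```python
import numpy as np

def intrinsic_reward(phi_s, psi_a, phi_s_next, err):
    """Per-transition prediction error of the linear model in embedding
    space, with the modeled irreducible error folded in (a sketch).
    All inputs are (batch, d) arrays."""
    return np.sum((phi_s + psi_a + err - phi_s_next) ** 2, axis=1)

def total_reward(r_extrinsic, r_intrinsic, eta=0.1):
    # eta is an assumed weighting of the intrinsic bonus.
    return r_extrinsic + eta * r_intrinsic
```

Transitions the linear model (plus error model) predicts perfectly receive zero bonus; poorly predicted, i.e. novel, transitions receive a large one.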
Algorithm
Now that we have defined the objective for representation learning and the reward function for reinforcement learning, the algorithm becomes straightforward:
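The overall loop alternates between collecting experience, updating the embeddings, shaping the reward, and updating the policy. A schematic sketch, with every component injected as a callable; all names here are illustrative rather than taken from the authors' code:

```python
def emi_training_loop(collect_batch, update_embeddings, intrinsic_reward,
                      update_policy, iterations, eta=0.1):
    """Schematic EMI training loop (a sketch, not the authors' implementation).

    collect_batch()      -> list of (s, a, r, s_next) transitions
    update_embeddings(b) -> trains phi, psi and the error model on batch b
    intrinsic_reward(t)  -> prediction error for a single transition t
    update_policy(b)     -> one update of any RL algorithm on batch b
    """
    for _ in range(iterations):
        batch = collect_batch()
        update_embeddings(batch)
        # Shape the extrinsic reward with the scaled intrinsic bonus.
        shaped = [(s, a, r + eta * intrinsic_reward((s, a, r, s2)), s2)
                  for (s, a, r, s2) in batch]
        update_policy(shaped)
```

Because the shaping happens purely at the reward level, the policy-update step can be any off-the-shelf RL algorithm.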
Experimental Results
We can see that EMI, despite its high variance, achieves better results on the challenging low-dimensional locomotion tasks (Figure 4). On the other hand, it roughly ties with many previous methods on vision-based tasks. All in all, EMI can be counted as a widely applicable method that achieves satisfactory performance.
References
- Hyoungseok Kim, Jaekyeom Kim, Yeonwoo Jeong, Sergey Levine, Hyun Oh Song. EMI: Exploration with Mutual Information. ICML 2019.