AN EXPLORATION STRATEGY OF REINFORCEMENT LEARNING | TOWARDS AI
EMI: Exploration with Mutual Information
A novel exploration method based on representation learning
Introduction
Reinforcement learning can be hard when the reward signal is sparse. In these scenarios, the exploration strategy becomes essential: a good exploration strategy not only helps the agent gain a faster and better understanding of the world but also makes it robust to changes in the environment. In this article, we discuss a novel exploration method, Exploration with Mutual Information (EMI), proposed by Kim et al. in ICML 2019. In a nutshell, EMI learns representations for both observations (states) and actions in the expectation that we can have a linear dynamics model on these representations. EMI then computes an intrinsic reward as the prediction error under this linear dynamics model. The intrinsic reward, combined with the environment reward, forms the final reward function, which can then be used by any RL method.
To avoid redundancy, we assume you are familiar with the concepts of mutual information and Markov decision processes.
Representation Learning
Representation for States and Actions
In EMI, we aim to learn representations ϕ(s): S → ℝᵈ and ψ(a): A → ℝᵈ for states and actions, respectively, such that the learned representations bear the most useful information about the dynamics. This can be achieved by maximizing two mutual information objectives: 1) the mutual information between [ϕ(s), ψ(a)] and ϕ(s’); 2) the mutual information between [ϕ(s), ϕ(s’)] and ψ(a). Mathematically, we maximize the following two objectives:
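In symbols, writing I(·;·) for mutual information, the two objectives take the following form (our notation, matching the distributions defined next; see the paper for the precise statement):

```latex
\max_{\phi,\psi}\; \mathcal{I}\big(\phi(s');\, [\phi(s), \psi(a)]\big)
  \;=\; D_{\mathrm{KL}}\!\big(P^{\pi}_{SAS'} \,\big\|\, P^{\pi}_{SA} \otimes P^{\pi}_{S'}\big),

\max_{\phi,\psi}\; \mathcal{I}\big(\psi(a);\, [\phi(s), \phi(s')]\big)
  \;=\; D_{\mathrm{KL}}\!\big(P^{\pi}_{SAS'} \,\big\|\, P^{\pi}_{SS'} \otimes P^{\pi}_{A}\big),
```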
where P_{SAS’}^π denotes the joint distribution of singleton experience tuples (s, a, s’) under policy π, and P_{A}^π, P_{SA}^π, and P_{SS’}^π are the corresponding marginal distributions. These objectives can be optimized with either MINE or DIM, which we discussed before. EMI follows the objective proposed by DIM, maximizing the mutual information (MI) between X and Y through the Jensen-Shannon divergence (JSD):
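The JSD-based bound scores "real" pairs sampled from the joint distribution against "fake" pairs drawn from the product of marginals, using a learned critic T. A minimal NumPy sketch of the estimator (the critic itself is assumed to be trained elsewhere; the toy scores below are illustrative):

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def jsd_mi_lower_bound(scores_joint, scores_product):
    """JSD-based lower bound on I(X; Y), as used in DIM.

    scores_joint:   critic values T(x, y) for pairs sampled from the joint.
    scores_product: critic values for pairs from the product of marginals
                    (in practice, y shuffled within the batch).
    """
    return (-softplus(-scores_joint)).mean() - softplus(scores_product).mean()

# A critic that separates joint pairs from shuffled pairs yields a larger
# bound than an uninformative (all-zero) critic.
rng = np.random.default_rng(0)
tight = jsd_mi_lower_bound(rng.normal(3.0, 0.1, 256), rng.normal(-3.0, 0.1, 256))
loose = jsd_mi_lower_bound(np.zeros(256), np.zeros(256))
```

Maximizing this bound with respect to both the critic and the embeddings ϕ, ψ drives the representations to retain information about the dynamics.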
Embedding the linear dynamics model with the error model
Besides the above objectives, EMI also imposes a simple and convenient topology on the embedding space, one in which transitions are linear. Concretely, we also seek to learn the state representation ϕ(s) and the action representation ψ(a) such that the representation of the corresponding next state ϕ(s’) follows linear dynamics, i.e.
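In its simplest additive instantiation, consistent with the error model introduced next, this constraint reads:

```latex
\phi(s') \;\approx\; \phi(s) + \psi(a)
```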
Intuitively, this might allow us to offload most of the modeling burden onto the embedding functions.
Regardless of the expressivity of the neural networks, however, there always exists some irreducible error under the linear dynamics model. For example, a state transition that leads the agent from one room to another in an Atari environment would be extremely challenging to explain under a linear model. To this end, the authors introduce the error model e(s, a): S×A → ℝᵈ, another neural network that takes the state and action as input and estimates the irreducible error under the linear model. To train the error model, we minimize the Euclidean norm of the error term, so that the error term is used sparingly, only for genuinely unexplainable transitions. The following objective states the embedding learning problem under linear dynamics with modeled errors:
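As a concrete sketch of this trade-off, the relaxed loss below penalizes both the residual of the additive linear model (with the modeled error folded in) and the magnitude of the error itself. The weight `lam` is a hypothetical coefficient, not a value from the paper:

```python
import numpy as np

def embedding_loss(phi_s, psi_a, phi_s_next, err, lam=1.0):
    """Relaxed linear-dynamics objective with a modeled error term.

    A sketch under the additive linear model; all inputs are (batch, d)
    arrays of embeddings, and `lam` is an assumed penalty weight.
    """
    # Residual of the linear model once the modeled error is folded in.
    residual = phi_s + psi_a + err - phi_s_next
    # Keep the error term small so it is spent only on transitions the
    # linear model genuinely cannot explain.
    error_penalty = np.linalg.norm(err, axis=1).mean()
    fit = (residual ** 2).sum(axis=1).mean()
    return lam * error_penalty + fit
```

When e(s, a) exactly absorbs the residual, the fit term vanishes and only the norm penalty remains, which is what discourages the network from explaining everything through the error model.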
The Objective of Representation Learning
Now we put all the objectives together:
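Schematically, with λ the Lagrange multiplier and I the JSD-based MI estimates, the combined objective has the following form (our sketch, assembled from the terms described in the previous subsections):

```latex
\min_{\phi,\psi,e}\;
  \lambda\,\mathbb{E}\big[\lVert e(s,a)\rVert_2\big]
  \;+\; \mathbb{E}\big[\lVert \phi(s) + \psi(a) + e(s,a) - \phi(s')\rVert_2^2\big]
  \;-\; \mathcal{I}\big(\phi(s');\,[\phi(s),\psi(a)]\big)
  \;-\; \mathcal{I}\big(\psi(a);\,[\phi(s),\phi(s')]\big)
```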
The first expectation term is obtained by applying a Lagrange multiplier to the constrained optimization problem defined in the previous sub-section, and the second term is the negative of the MI objectives.
In practice, the authors found the optimization process to be more stable when the distribution of action embeddings is regularized to follow a predefined prior distribution, e.g. a standard normal distribution. This introduces an additional KL penalty D_{KL}(P_A^π‖𝒩(0, I)), similar to VAEs, where P_A^π is an empirical normal distribution whose parameters are estimated from a batch of samples. This KL divergence has a simple closed form in terms of the batch mean and variance.
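A NumPy sketch of this regularizer, estimating the Gaussian parameters from a batch and applying the closed-form KL to the standard normal:

```python
import numpy as np

def kl_to_standard_normal(action_embeddings):
    """KL( N(mu, diag(var)) || N(0, I) ), with mu and var estimated from a
    batch of action embeddings of shape (batch, d).
    """
    mu = action_embeddings.mean(axis=0)
    var = action_embeddings.var(axis=0)
    # Closed form of the Gaussian KL, summed over embedding dimensions.
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))

rng = np.random.default_rng(0)
batch = rng.normal(0.0, 1.0, size=(100_000, 4))
kl_near_zero = kl_to_standard_normal(batch)      # ~0: batch matches the prior
kl_shifted = kl_to_standard_normal(batch + 2.0)  # large: mean far from zero
```

Adding this penalty to the loss pulls the action embeddings toward the prior without constraining the state embeddings.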
The authors also tried regularizing the state embedding in the same way, but found that it made the optimization process much less stable. This may be because the distribution of states is much more likely to be skewed than the distribution of actions, especially during the initial stage of optimization.
Intrinsic Reward
We now define intrinsic reward as the prediction error in the embedding space:
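Under the additive linear model with the error term, this prediction error is (a sketch consistent with the embedding objective above):

```latex
r^{\mathrm{int}}(s, a, s') \;=\; \big\lVert \phi(s) + \psi(a) + e(s, a) - \phi(s') \big\rVert_2^2
```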
This formulation incorporates the error term, ensuring that the irreducible error is not mistaken for novelty. We combine the intrinsic reward with the extrinsic reward to get the final reward function used to train an RL agent:
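Both quantities are cheap to compute once the embeddings are available. A small NumPy sketch, where `eta` is a hypothetical coefficient scaling the intrinsic bonus (not a value from the paper):

```python
import numpy as np

def intrinsic_reward(phi_s, psi_a, phi_s_next, err):
    """Per-transition prediction error of the linear model in embedding
    space, with the modeled irreducible error folded in (a sketch).
    All inputs are (batch, d) arrays."""
    return np.sum((phi_s + psi_a + err - phi_s_next) ** 2, axis=1)

def total_reward(r_extrinsic, r_intrinsic, eta=0.1):
    # eta is an assumed weighting of the intrinsic bonus.
    return r_extrinsic + eta * r_intrinsic
```

Transitions the linear model (plus error model) predicts perfectly receive zero bonus; poorly predicted, i.e. novel, transitions receive a large one.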
Algorithm
Now that we have defined the objective for representation learning and the reward function for reinforcement learning, the algorithm becomes straightforward:
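The overall loop alternates between collecting experience, updating the embeddings, shaping the reward, and updating the policy. A schematic sketch, with every component injected as a callable; all names here are illustrative rather than taken from the authors' code:

```python
def emi_training_loop(collect_batch, update_embeddings, intrinsic_reward,
                      update_policy, iterations, eta=0.1):
    """Schematic EMI training loop (a sketch, not the authors' implementation).

    collect_batch()      -> list of (s, a, r, s_next) transitions
    update_embeddings(b) -> trains phi, psi and the error model on batch b
    intrinsic_reward(t)  -> prediction error for a single transition t
    update_policy(b)     -> one update of any RL algorithm on batch b
    """
    for _ in range(iterations):
        batch = collect_batch()
        update_embeddings(batch)
        # Shape the extrinsic reward with the scaled intrinsic bonus.
        shaped = [(s, a, r + eta * intrinsic_reward((s, a, r, s2)), s2)
                  for (s, a, r, s2) in batch]
        update_policy(shaped)
```

Because the shaping happens purely at the reward level, the policy-update step can be any off-the-shelf RL algorithm.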
Experimental Results
We can see that EMI, despite its high variance, achieves better results on the challenging low-dimensional locomotion tasks (Figure 4). On the other hand, it roughly ties with many previous methods on vision-based tasks. All in all, EMI can be counted as a widely applicable method that achieves satisfactory performance.
References
- Hyoungseok Kim, Jaekyeom Kim, Yeonwoo Jeong, Sergey Levine, Hyun Oh Song. EMI: Exploration with Mutual Information. ICML 2019.