PCIL: Policy Contrastive Imitation Learning

1Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University 2Shanghai Artificial Intelligence Laboratory 3Shanghai Qi Zhi Institute 4Hong Kong University of Science and Technology

Abstract

Adversarial imitation learning (AIL) is a popular method that has recently achieved much success. However, the performance of AIL remains unsatisfactory on more challenging tasks. We find that one of the major reasons is the low quality of the AIL discriminator's representation. Since the AIL discriminator is trained via binary classification, it does not necessarily discriminate the policy from the expert in a meaningful way, and the resulting reward might not be meaningful either. We propose a new method called Policy Contrastive Imitation Learning (PCIL) to resolve this issue. PCIL learns a contrastive representation space by anchoring on different policies and generates a smooth, cosine-similarity-based reward. The proposed representation learning objective can be viewed as a stronger version of the AIL objective, and it provides a more meaningful comparison between the agent and the expert. From a theoretical perspective, we show the validity of our method using the apprenticeship learning framework. Furthermore, our empirical evaluation on the DeepMind Control Suite demonstrates that PCIL achieves state-of-the-art performance. Finally, qualitative results suggest that PCIL builds a smoother and more meaningful representation space for imitation learning.

AIL vs PCIL

Comparison between the representation spaces of the AIL method and our method

Since AIL methods use a binary classification objective to distinguish expert from non-expert transitions, the representation space is only required to separate the two classes into disjoint subspaces. As a result, the embedding space is not required to be semantically meaningful (a minimal sketch of this binary objective is given after the figure below).

We overcome this limitation with PCIL. Our method enforces compactness of the expert's representation, which ensures that the learned representation captures common, robust features of the expert's transitions and thus yields a more meaningful representation space.

(Left) In AIL, the distance between two expert data points may even be larger than the distance between an expert data point and a sub-optimal non-expert data point. (Right) Our proposed PCIL.
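For reference, here is a minimal sketch of the binary classification objective used to train a GAIL-style AIL discriminator. The MLP architecture, input features, and hyperparameters below are illustrative assumptions, not the exact baselines evaluated in the paper.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Simple MLP discriminator over transition features, as in GAIL-style AIL."""
    def __init__(self, input_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return self.net(x)  # unnormalized logit

def ail_discriminator_loss(disc, expert_batch, agent_batch):
    """Binary classification: expert transitions -> 1, agent transitions -> 0.
    Only class separation is required, so the induced representation need not
    be semantically structured."""
    bce = nn.BCEWithLogitsLoss()
    expert_logits = disc(expert_batch)
    agent_logits = disc(agent_batch)
    return (bce(expert_logits, torch.ones_like(expert_logits))
            + bce(agent_logits, torch.zeros_like(agent_logits)))
```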


Policy Contrastive Imitation Learning

We first select an anchor state (orange) from an expert trajectory. Then, we select a positive state sample (red) from another expert trajectory and a negative state sample (green) from an agent trajectory. We map these selected states into the representation space. Finally, we pull the representations of the anchor and positive states together and push the representations of negative samples away from the anchor; a minimal code sketch of this objective follows the figure below.

Illustration of our contrastive learning approach.
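Below is a minimal PyTorch sketch of this idea. The InfoNCE-style formulation, the temperature value, and the use of a mean expert embedding for the cosine-similarity reward are simplifying assumptions for illustration; the paper's exact objective and reward scaling may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps states to an L2-normalized representation space."""
    def __init__(self, state_dim, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, s):
        return F.normalize(self.net(s), dim=-1)  # unit-norm embeddings

def policy_contrastive_loss(encoder, anchor_s, positive_s, negative_s, temperature=0.1):
    """Anchor and positive come from (different) expert trajectories; negatives
    come from the agent's trajectories. Pull anchor/positive together and push
    agent negatives away in cosine-similarity space."""
    z_a = encoder(anchor_s)    # (B, D)
    z_p = encoder(positive_s)  # (B, D)
    z_n = encoder(negative_s)  # (B, D)

    pos_sim = (z_a * z_p).sum(-1, keepdim=True)  # (B, 1) cosine similarity
    neg_sim = z_a @ z_n.t()                      # (B, B) similarities to agent states
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)       # the positive is class 0

def pcil_reward(encoder, state, expert_states):
    """Cosine-similarity-based reward: similarity of the current state's embedding
    to the expert's embedding (here, the mean over a sampled expert batch)."""
    with torch.no_grad():
        z = encoder(state)
        z_e = F.normalize(encoder(expert_states).mean(dim=0, keepdim=True), dim=-1)
        return (z * z_e).sum(-1)  # reward in [-1, 1]
```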

Results

We experiment with 10 MuJoCo tasks from the DeepMind Control Suite. The selected tasks cover a range of difficulty levels, from simple control problems, such as the single-degree-of-freedom Cartpole, to complex, high-dimensional tasks, such as Quadruped Run.

We find that PCIL outperforms existing methods on all of these tasks, achieving near-expert performance within our online sample budget on every considered task except Hopper Hop. In terms of sample efficiency, i.e., the number of environment interactions required to solve a task, PCIL shows significant improvements over prior methods on five tasks: Cheetah Run, Finger Spin, Hopper Hop, Hopper Stand, and Quadruped Run. On the remaining tasks, its sample efficiency is comparable to that of the strongest prior methods.

Comparisons of algorithms on 6 selected tasks.

BibTeX

@article{huang2023policy,
  title={Policy Contrastive Imitation Learning},
  author={Huang, Jialei and Yin, Zhao-Heng and Hu, Yingdong and Gao, Yang},
  year={2023}
}