Abstract
Preference-based Reinforcement Learning (PbRL) provides a way to learn high-performance policies in environments where the reward signal is hard to specify, avoiding heuristic and time-consuming reward design. However, PbRL can suffer from slow convergence since it requires training a reward model. Prior work has proposed learning a reward model from demonstrations and fine-tuning it using preferences. However, when the model is a neural network, using different loss functions for pre-training and fine-tuning can pose challenges to reliable optimization. In this paper, we propose a method to effectively leverage prior knowledge with a Residual Reward Model (RRM). An RRM assumes that the true reward of the environment can be split into a sum of two parts: a prior reward and a learned reward. The prior reward is a term available before training, for example, a user's "best guess" reward function or a reward function learned from inverse reinforcement learning (IRL), and the learned reward is trained with preferences. We introduce state-based and image-based versions of RRM and evaluate them on several tasks in the Meta-World environment suite. Experimental results show that our method substantially improves the performance of a common PbRL method. Our method achieves performance improvements for a variety of prior rewards, including proxy rewards, a reward obtained from IRL, and even a negated version of the proxy reward. We also conduct experiments with a Franka Panda to show that our method leads to superior performance on a real robot: it significantly accelerates policy learning across tasks, achieving success in fewer steps than the baseline.
Method Overview
An agent interacts with the reward-free environment and generates trajectories. To generate rewards for reinforcement learning, our method assumes access to a "prior" reward that conveys some information about the task but may generally differ from the true task reward function. This prior reward forms part of a reward function that is trained to align with preference pairs.
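As a concrete illustration, the sketch below shows one way the residual composition and preference training could look in PyTorch. This is a minimal sketch under our own assumptions: names such as ResidualRewardModel, prior_reward_fn, and preference_loss are illustrative and not taken from the released code; the prior reward is treated as a fixed function, and only the residual network receives gradients from a Bradley-Terry preference loss.

```python
import torch
import torch.nn as nn

class ResidualRewardModel(nn.Module):
    """Total reward = fixed prior reward + learned residual (names are illustrative)."""

    def __init__(self, prior_reward_fn, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.prior_reward_fn = prior_reward_fn  # e.g. a proxy reward or an IRL reward
        self.residual = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # prior_reward_fn is assumed to return a reward per step; it gets no gradients
        prior = self.prior_reward_fn(obs, act)
        learned = self.residual(torch.cat([obs, act], dim=-1)).squeeze(-1)
        return prior + learned

def preference_loss(model, seg0, seg1, label):
    """Bradley-Terry loss on a preference pair of segments.

    seg0 and seg1 are (obs, act) tensors of shape (batch, T, dim);
    label is a LongTensor with 1 if segment 1 is preferred, 0 otherwise.
    """
    r0 = model(*seg0).sum(dim=-1)   # return of segment 0
    r1 = model(*seg1).sum(dim=-1)   # return of segment 1
    logits = torch.stack([r0, r1], dim=-1)
    return nn.functional.cross_entropy(logits, label)
```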
An additional, image-based version of the Residual Reward Model obtains images and proprioceptive states from the environment rather than full states. An encoder extracts representations from the images and is trained jointly with the RL agent.
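A corresponding sketch of the image-based variant, again under assumed names (ImageResidualReward, encoder): the encoder maps images to features that feed the residual head, and the prior reward is assumed here to be computed from proprioception, though in general it could depend on any available signal.

```python
import torch
import torch.nn as nn

class ImageResidualReward(nn.Module):
    """Image-based variant (illustrative): reward = prior(proprio, act) + residual(features)."""

    def __init__(self, prior_reward_fn, encoder, feat_dim, proprio_dim, act_dim, hidden=256):
        super().__init__()
        self.prior_reward_fn = prior_reward_fn
        self.encoder = encoder  # CNN encoder, trained jointly with the RL agent
        self.residual = nn.Sequential(
            nn.Linear(feat_dim + proprio_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, image, proprio, act):
        feat = self.encoder(image)                  # image representation
        prior = self.prior_reward_fn(proprio, act)  # prior computed from proprioception
        x = torch.cat([feat, proprio, act], dim=-1)
        return prior + self.residual(x).squeeze(-1)
```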
Real Tasks Visualization
We evaluate Residual Reward Models on a real Franka arm across several manipulation tasks. Videos were recorded with an iPhone 15 Pro.
Simulation Tasks Visualization
We show the learning curves of Residual Reward Models and PEBBLE for each simulation task.
Main Results
We report the learning curves for our method and baselines on 5 gripper manipulation tasks from Meta-World and 2 manipulation tasks with image observations. Our method achieves strong performance on all tasks and significantly outperforms the PEBBLE baseline. Here, proxy reward 1 denotes the negative distance between the task object and the task goal, and proxy reward 2 denotes the negative distance between the gripper and the task object. In the image-based setting, proxy reward - initial distance denotes the negative distance between the gripper and the initial position of the task object, and proxy reward - penalty applies a penalty when the gripper moves outside a predefined region.
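For reference, the proxy rewards described above could be implemented roughly as follows; the exact scales, region bounds, and penalty value are assumptions for illustration, not the values used in our experiments.

```python
import numpy as np

def proxy_reward_1(obj_pos, goal_pos):
    """Negative distance between the task object and the task goal."""
    return -np.linalg.norm(obj_pos - goal_pos)

def proxy_reward_2(gripper_pos, obj_pos):
    """Negative distance between the gripper and the task object."""
    return -np.linalg.norm(gripper_pos - obj_pos)

def proxy_reward_initial_distance(gripper_pos, obj_init_pos):
    """Image-based setting: negative distance to the object's initial position."""
    return -np.linalg.norm(gripper_pos - obj_init_pos)

def proxy_reward_penalty(gripper_pos, region_low, region_high, penalty=-1.0):
    """Image-based setting: penalty when the gripper leaves a predefined box region."""
    outside = np.any(gripper_pos < region_low) or np.any(gripper_pos > region_high)
    return penalty if outside else 0.0
```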
Applying RRM to other PbRL baselines and less feedback
The Residual Reward Model can be applied directly to other PbRL baselines and, even with less feedback, still delivers excellent performance that surpasses those baselines.
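A rough sketch of how the residual model slots into a generic PbRL training loop (e.g., a PEBBLE-style pipeline); the preference_buffer, replay_buffer, and agent APIs here are hypothetical, and preference_loss refers to the earlier sketch. The point is that only the reward used for relabeling changes; the baseline's preference collection and RL updates stay the same.

```python
import torch

def pbrl_step(reward_model, optimizer, preference_buffer, replay_buffer, agent):
    # 1) Update the learned residual from preference pairs (the prior part stays fixed).
    seg0, seg1, label = preference_buffer.sample()
    loss = preference_loss(reward_model, seg0, seg1, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 2) Relabel stored transitions with the combined (prior + residual) reward.
    obs, act = replay_buffer.all_transitions()
    with torch.no_grad():
        replay_buffer.rewards = reward_model(obs, act)

    # 3) The underlying RL update of the baseline is unchanged.
    agent.update(replay_buffer)
```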
Citation
If you find this project helpful, please cite us: