Abstract
Preference-based Reinforcement Learning (PbRL) provides a way to learn high-performance policies in environments where the reward signal is hard to specify, avoiding heuristic and time-consuming reward design. However, PbRL can suffer from slow convergence since it requires training a reward model. Prior work has proposed learning a reward model from demonstrations and fine-tuning it using preferences. However, when the model is a neural network, using different loss functions for pre-training and fine-tuning can pose challenges to reliable optimization. In this paper, we propose a method to effectively leverage prior knowledge with a Residual Reward Model (RRM). An RRM assumes that the true reward of the environment can be split into a sum of two parts: a prior reward and a learned reward. The prior reward is a term available before training, for example, a user's "best guess" reward function or a reward function learned from inverse reinforcement learning (IRL), and the learned reward is trained with preferences. We introduce state-based and image-based versions of RRM and evaluate them on several tasks in the Meta-World environment suite. Experimental results show that our method substantially improves the performance of a common PbRL method. Our method achieves performance improvements for a variety of prior rewards, including proxy rewards, a reward obtained from IRL, and even a negated version of the proxy reward. We also conduct experiments with a Franka Panda to show that our method leads to superior performance on a real robot: it significantly accelerates policy learning across tasks, achieving success in fewer steps than the baseline.
Method Overview
An agent interacts with the reward-free environment and generates trajectories. To generate rewards for reinforcement learning, our method assumes access to a "prior" reward that conveys some information about the task but may generally differ from the true task reward function. This prior reward forms part of a reward function that is trained to align with preference pairs.
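As a concrete illustration, the sketch below shows one way the residual composition and preference training could look in PyTorch. This is a minimal sketch under our own assumptions: names such as ResidualRewardModel, prior_reward_fn, and preference_loss are illustrative and not taken from the released code; the prior reward is treated as a fixed function, and only the residual network receives gradients from a Bradley-Terry preference loss.

```python
import torch
import torch.nn as nn

class ResidualRewardModel(nn.Module):
    """Total reward = fixed prior reward + learned residual (names are illustrative)."""

    def __init__(self, prior_reward_fn, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.prior_reward_fn = prior_reward_fn  # e.g. a proxy reward or an IRL reward
        self.residual = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # prior_reward_fn is assumed to return a reward per step; it gets no gradients
        prior = self.prior_reward_fn(obs, act)
        learned = self.residual(torch.cat([obs, act], dim=-1)).squeeze(-1)
        return prior + learned

def preference_loss(model, seg0, seg1, label):
    """Bradley-Terry loss on a preference pair of segments.

    seg0 and seg1 are (obs, act) tensors of shape (batch, T, dim);
    label is a LongTensor with 1 if segment 1 is preferred, 0 otherwise.
    """
    r0 = model(*seg0).sum(dim=-1)   # return of segment 0
    r1 = model(*seg1).sum(dim=-1)   # return of segment 1
    logits = torch.stack([r0, r1], dim=-1)
    return nn.functional.cross_entropy(logits, label)
```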
An additional, image-based version of the Residual Reward Model obtains images and proprioceptive states from the environment rather than full states. An encoder extracts representations from the images and is trained jointly with the RL agent.
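A corresponding sketch of the image-based variant, again under assumed names (ImageResidualReward, encoder): the encoder maps images to features that feed the residual head, and the prior reward is assumed here to be computed from proprioception, though in general it could depend on any available signal.

```python
import torch
import torch.nn as nn

class ImageResidualReward(nn.Module):
    """Image-based variant (illustrative): reward = prior(proprio, act) + residual(features)."""

    def __init__(self, prior_reward_fn, encoder, feat_dim, proprio_dim, act_dim, hidden=256):
        super().__init__()
        self.prior_reward_fn = prior_reward_fn
        self.encoder = encoder  # CNN encoder, trained jointly with the RL agent
        self.residual = nn.Sequential(
            nn.Linear(feat_dim + proprio_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, image, proprio, act):
        feat = self.encoder(image)                  # image representation
        prior = self.prior_reward_fn(proprio, act)  # prior computed from proprioception
        x = torch.cat([feat, proprio, act], dim=-1)
        return prior + self.residual(x).squeeze(-1)
```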
Real Tasks Visualization
We evaluate Residual Reward Models on a real Franka arm across several manipulation tasks. Videos were recorded with an iPhone 15 Pro.
Simulation Tasks Visualization
We show the learning curves of Residual Reward Models and PEBBLE for each simulation task.
Main Results
We report the learning curves for our method and baselines on 5 gripper manipulation tasks from Meta-World and 2 manipulation tasks with image observations. Our method achieves strong performance on all tasks and significantly outperforms the PEBBLE baseline. Here, proxy reward 1 denotes the negative distance between the task object and the task goal, and proxy reward 2 denotes the negative distance between the gripper and the task object. In the image-based setting, proxy reward - initial distance denotes the negative distance between the gripper and the initial position of the task object, and proxy reward - penalty applies a penalty when the gripper moves outside a predefined region.
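For reference, the proxy rewards described above could be implemented roughly as follows; the exact scales, region bounds, and penalty value are assumptions for illustration, not the values used in our experiments.

```python
import numpy as np

def proxy_reward_1(obj_pos, goal_pos):
    """Negative distance between the task object and the task goal."""
    return -np.linalg.norm(obj_pos - goal_pos)

def proxy_reward_2(gripper_pos, obj_pos):
    """Negative distance between the gripper and the task object."""
    return -np.linalg.norm(gripper_pos - obj_pos)

def proxy_reward_initial_distance(gripper_pos, obj_init_pos):
    """Image-based setting: negative distance to the object's initial position."""
    return -np.linalg.norm(gripper_pos - obj_init_pos)

def proxy_reward_penalty(gripper_pos, region_low, region_high, penalty=-1.0):
    """Image-based setting: penalty when the gripper leaves a predefined box region."""
    outside = np.any(gripper_pos < region_low) or np.any(gripper_pos > region_high)
    return penalty if outside else 0.0
```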
Applying RRM to other PbRL baselines and less feedback
The Residual Reward Model can be applied directly to other PbRL baselines and, even with less feedback, still delivers excellent performance that surpasses those baselines.
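A rough sketch of how the residual model slots into a generic PbRL training loop (e.g., a PEBBLE-style pipeline); the preference_buffer, replay_buffer, and agent APIs here are hypothetical, and preference_loss refers to the earlier sketch. The point is that only the reward used for relabeling changes; the baseline's preference collection and RL updates stay the same.

```python
import torch

def pbrl_step(reward_model, optimizer, preference_buffer, replay_buffer, agent):
    # 1) Update the learned residual from preference pairs (the prior part stays fixed).
    seg0, seg1, label = preference_buffer.sample()
    loss = preference_loss(reward_model, seg0, seg1, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 2) Relabel stored transitions with the combined (prior + residual) reward.
    obs, act = replay_buffer.all_transitions()
    with torch.no_grad():
        replay_buffer.rewards = reward_model(obs, act)

    # 3) The underlying RL update of the baseline is unchanged.
    agent.update(replay_buffer)
```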
Citation
If you find this project helpful, please cite us: