FOSP: Fine-tuning Offline Safe Policy through World Models

1SIGS, Tsinghua University,
2College of Design and Engineering, National University of Singapore.

Abstract

Model-based Reinforcement Learning (RL) has demonstrated high training efficiency and the ability to handle high-dimensional tasks. Regarding safety, safe model-based RL can achieve near zero-cost performance and effectively manage the trade-off between performance and safety. Nevertheless, prior works still face safety challenges because of online exploration during real-world deployment. To address this, offline RL methods have emerged as a solution: they learn safely from a static dataset and avoid interacting with the environment.

In this paper, we aim to further enhance safety during the deployment stage of vision-based robotic tasks by fine-tuning an offline-trained policy. We incorporate in-sample optimization, model-based policy expansion, and reachability guidance to construct a safe offline-to-online framework. Moreover, our method improves the generalization of the offline policy to unseen safety-constrained scenarios. Finally, we validate the efficiency of our method on simulation benchmarks with five vision-only tasks and on a real robot, solving deployment problems with limited data.
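To make two of these ingredients more concrete, the sketch below is purely illustrative and is not the paper's implementation: it shows an IQL-style expectile loss as one common form of in-sample optimization, and a simple value-based choice between the frozen offline policy and the fine-tuned online policy as a simplified stand-in for model-based policy expansion (which FOSP performs inside the world model); reachability guidance is omitted. All function names are hypothetical.

```python
import numpy as np

def expectile_value_loss(q_values, v_values, tau=0.9):
    """In-sample (IQL-style) value objective: regress V toward Q evaluated only
    on dataset actions, weighting positive errors by tau, so no out-of-
    distribution action is ever queried. Illustrative sketch, not FOSP's code."""
    diff = q_values - v_values
    weight = np.where(diff > 0.0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))

def expanded_action(obs, offline_policy, online_policy, q_fn):
    """Policy-expansion step (sketch): propose one action from the frozen
    offline policy and one from the online policy, then keep whichever the
    critic scores higher."""
    candidates = [offline_policy(obs), online_policy(obs)]
    scores = [q_fn(obs, a) for a in candidates]
    return candidates[int(np.argmax(scores))]
```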

Framework

FOSP: The framework of Model-based Fine-tuning Offline Safe Policy

Video

Real-world experiments comparing FOSP with the baseline

We compare FOSP with a strong baseline, SafeDreamer, currently the state-of-the-art model for safe visual reinforcement learning. We show a different example for each task below. Both FOSP and SafeDreamer can finish the standard task after offline training. After training, FOSP is deployed directly in the real world and fine-tuned for 40 steps. Keeping all other conditions the same, we also fine-tune SafeDreamer for 40 steps on these transfer tasks. While SafeDreamer struggles to finish the tasks, FOSP easily avoids the obstacles and reaches the goals.

Standard task

The offline-trained model plans a safe route to the goal given sufficient data, demonstrating its ability to complete the standard task.

Transfer 1

Change the shape of obstacles while using FOSP.

Transfer 2

Change the shape of the goal while using FOSP.

Transfer 3

Change the number of obstacles while using FOSP.

Transfer 1 (baseline)

Change the shape of obstacles while using the baseline.

Transfer 2 (baseline)

Change the shape of the goal while using the baseline.

Transfer 3 (baseline)

Change the number of obstacles while using the baseline.

Dataset

Simple example

We collect the dataset as follows: a 3Dconnexion device is used to teleoperate the robot arm, while a camera provides perception throughout the experiment. We collect trajectories and then mix them together for training. A trajectory in which the robot almost reaches the target but does not reach it precisely is shown below:

Start Frame

End Frame


A trajectory in which the robot successfully reaches the goal without collision is shown below:

Start Frame

End Frame


The following picture shows four different trajectories, which differ in whether the target was reached and whether constraints were violated. The red-framed frames show the robotic arm colliding with obstacles, and the yellow-framed frames show it reaching the goal.

Experiment data
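For illustration, a minimal sketch of how such mixed trajectories could be pooled and tagged for offline training is given below. The file layout, field names (`image`, `action`, `reward`, `cost`), and success criterion are assumptions for this sketch, not the released dataset format.

```python
import numpy as np

def load_trajectory(path):
    """Load one teleoperated trajectory from an .npz file.
    Field names are assumptions about how such data could be stored."""
    data = np.load(path)
    return {key: data[key] for key in ("image", "action", "reward", "cost")}

def build_offline_dataset(paths):
    """Pool successful, near-miss, and unsafe trajectories into one dataset,
    tagging each with whether it reached the goal and violated constraints."""
    trajectories = [load_trajectory(p) for p in paths]
    for traj in trajectories:
        traj["violated_constraint"] = bool(traj["cost"].sum() > 0)
        # Assumption: a positive terminal reward marks a successful reach.
        traj["reached_goal"] = bool(traj["reward"][-1] > 0)
    return trajectories
```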

Experiments

Environment and tasks

Different tasks and agents in simulation and the real world. Simulation tasks: we consider four task types in simulation: Push1, Goal1, Button1, and Goal2. Agent types: Car (upper) and Point (lower). Real-robot setup: we use raw images as inputs and enable the robotic arm to complete obstacle-avoidance tasks safely.

Experiment env
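For reference, the simulation tasks come from the Safety-Gymnasium benchmark, which returns a per-step safety cost alongside the reward. A minimal interaction loop is sketched below; the exact environment IDs, render configuration, and camera setup used in the paper are assumptions here.

```python
import safety_gymnasium

# The five simulation tasks shown above, written in Safety-Gymnasium's naming
# convention (the exact IDs/config used in the paper are an assumption).
TASKS = [
    "SafetyPointGoal1-v0",
    "SafetyPointGoal2-v0",
    "SafetyPointButton1-v0",
    "SafetyPointPush1-v0",
    "SafetyCarGoal2-v0",
]

env = safety_gymnasium.make(TASKS[0], render_mode="rgb_array")
obs, info = env.reset(seed=0)
for _ in range(100):
    action = env.action_space.sample()
    # Safety-Gymnasium returns a separate safety cost at every step.
    obs, reward, cost, terminated, truncated, info = env.step(action)
    frame = env.render()  # RGB frame; the paper uses image-only observations
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```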

Below are visualizations of our algorithm on the simulation tasks. For each task, the agent's front and back views are saved as animations. The agent makes decisions from first-person, image-only observations.

SafePointGoal1 (front)

SafePointGoal1 (back)

SafePointGoal2 (front)

SafePointGoal2 (back)

SafePointButton1 (front)

SafePointButton1 (back)

SafePointPush1 (front)

SafePointPush1 (back)

SafeCarGoal2 (front)

SafeCarGoal2 (back)

Simulation experiment results

Offline experimental results. Comparison of FOSP with baselines across five image-based safety tasks. The results for all three algorithms are obtained after training for 1 million steps.


Online experimental results. Comparison of FOSP with baselines across five image-based safety tasks during online fine-tuning. The results for the three model-based algorithms are obtained after fine-tuning for 750,000 steps. The dashed lines show the benchmark results of CPO and PPO-Lagrangian after 10 million training steps on each task.


BibTeX

@article{cao2024fospfinetuningofflinesafe,
      title={FOSP: Fine-tuning Offline Safe Policy through World Models},
      author={Chenyang Cao and Yucheng Xin and Silang Wu and Longxiang He and Zichen Yan and Junbo Tan and Xueqian Wang},
      year={2024},
      eprint={2407.04942},
      archivePrefix={arXiv}
}