README.md
| 1 | --- |
| 2 | license: mit |
| 3 | tags: |
| 4 | - RLinf |
| 5 | language: |
| 6 | - en |
| 7 | metrics: |
| 8 | - accuracy |
| 9 | pipeline_tag: reinforcement-learning |
| 10 | model-index: |
| 11 | - name: RLinf-OpenVLAOFT-LIBERO-130-Base-Lora |
| 12 | results: |
| 13 | - task: |
| 14 | type: VLA |
| 15 | dataset: |
| 16 | type: libero_130 |
| 17 | name: libero_130 |
| 18 | metrics: |
| 19 | - type: accuracy |
| 20 | value: 42.09 |
| 21 | --- |
| 22 | |
| 23 | <div align="center"> |
| 24 | <img src="logo.svg" alt="RLinf-logo" width="500"/> |
| 25 | </div> |
| 26 | |
| 27 | |
| 28 | <div align="center"> |
| 29 | <!-- <a href="TODO"><img src="https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv"></a> --> |
| 30 | <!-- <a href="TODO"><img src="https://img.shields.io/badge/HuggingFace-yellow?logo=huggingface&logoColor=white" alt="Hugging Face"></a> --> |
| 31 | <a href="https://github.com/RLinf/RLinf"><img src="https://img.shields.io/badge/Github-blue"></a> |
| 32 | <a href="https://rlinf.readthedocs.io/en/latest/"><img src="https://img.shields.io/badge/Documentation-Purple?color=8A2BE2&logo=readthedocs"></a> |
| 33 | <!-- <a href="TODO"><img src="https://devin.ai/assets/deepwiki-badge.png" alt="Ask DeepWiki.com" style="height:20px;"></a> |
| 34 | <a href="TODO"><img src="https://img.shields.io/badge/微信-green?logo=wechat&"></a> --> |
| 35 | </div> |
| 36 | |
| 37 | <h1 align="center">RLinf: Reinforcement Learning Infrastructure for Agentic AI</h1> |
| 38 | |
| 39 | [RLinf](https://github.com/RLinf/RLinf) is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development. |
| 40 | |
| 41 | |
| 42 | <div align="center"> |
| 43 | <img src="overview.png" alt="RLinf-overview" width="600"/> |
| 44 | </div> |
| 45 | |
| 46 | ## Model Description |
| 47 | The RLinf-openvlaoft-libero series is trained from the RLinf/RLinf-OpenVLAOFT-LIBERO-xxx-Base-Lora checkpoints (libero90 and libero130) and the Haozhan72/Openvla-oft-SFT-libero-xxx-traj1 checkpoints (libero10, libero-object, libero-goal, and libero-spatial), using the same base models and training datasets as verl. Training with RLinf yields state-of-the-art performance; see the Benchmark Results section. |
| 48 | |
| 49 | We use a mask to restrict the loss to valid action tokens and compute a token-level loss from the Group Relative Policy Optimization (GRPO) advantage function, to improve the model’s performance on spatial reasoning, object generalization, instruction generalization, and long-horizon tasks. |
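
The masked, token-level objective described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not RLinf's implementation: the function names, rewards, and log-probabilities are hypothetical, and the sketch omits the importance ratio, clipping, and any KL term that a full GRPO objective typically includes.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the other rollouts in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def masked_token_loss(logprobs, mask, advantages):
    """Mean of -advantage * log-prob over tokens whose mask entry is 1."""
    total, count = 0.0, 0
    for lp_row, mask_row, adv in zip(logprobs, mask, advantages):
        for lp, keep in zip(lp_row, mask_row):
            if keep:
                total += -adv * lp
                count += 1
    return total / max(count, 1)


# Hypothetical group of four rollouts with binary success rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = grpo_advantages(rewards)

# Hypothetical per-token log-probabilities; a 0 in the mask marks an
# invalid (for example padded) action token that must not affect the loss.
logprobs = [
    [-0.1, -0.2, -0.3],
    [-0.5, -0.4, -0.9],
    [-0.2, -0.1, -0.7],
    [-0.6, -0.8, -0.2],
]
mask = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [1, 1, 1]]
loss = masked_token_loss(logprobs, mask, advantages)
```

Every token in a rollout shares that rollout's advantage, and masked positions contribute nothing to either the sum or the normalizing count.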
| 50 | |
| 51 | |
| 52 | ## Evaluation and Results |
| 53 | We trained six models using RLinf: |
| 54 | |
| 55 | - [RLinf-OpenVLAOFT-GRPO-LIBERO-90](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-90) Model (based on [RLinf/RLinf-OpenVLAOFT-LIBERO-90-Base-Lora](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-90-Base-Lora)) |
| 56 | - Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0` |
| 57 | |
| 58 | - [RLinf-OpenVLAOFT-LIBERO-130](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-130) Model (based on [RLinf/RLinf-OpenVLAOFT-LIBERO-130-Base-Lora](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-130-Base-Lora)) |
| 59 | - Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0` |
| 60 | |
| 61 | - [RLinf-OpenVLAOFT-GRPO-LIBERO-object](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-object) Model (based on [Haozhan72/Openvla-oft-SFT-libero-object-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-object-traj1)) |
| 62 | - Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0` |
| 63 | |
| 64 | - [RLinf-OpenVLAOFT-GRPO-LIBERO-spatial](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-spatial) Model (based on [Haozhan72/Openvla-oft-SFT-libero-spatial-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-spatial-traj1)) |
| 65 | - Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0` |
| 66 | |
| 67 | - [RLinf-OpenVLAOFT-GRPO-LIBERO-goal](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-goal) Model (based on [Haozhan72/Openvla-oft-SFT-libero-goal-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-goal-traj1)) |
| 68 | - Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0` |
| 69 | |
| 70 | - [RLinf-OpenVLAOFT-GRPO-LIBERO-long](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-long) Model (based on [Haozhan72/Openvla-oft-SFT-libero10-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero10-traj1)) |
| 71 | - Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0` |
| 72 | |
| 73 | |
| 74 | ### Benchmark Results |
| 75 | |
| 76 | We trained the SFT models for LIBERO-90 and LIBERO-130 ourselves, following the training recipe from [OpenVLA-OFT](https://github.com/moojink/openvla-oft/blob/main/vla-scripts/finetune.py). The other SFT models are from [SimpleVLA-RL](https://huggingface.co/collections/Haozhan72/simplevla-rl-6833311430cd9df52aeb1f86). |
| 77 | > We evaluate each model according to its training configuration, using `libero_seed = 0` and 500 episodes for each of the Object, Spatial, Goal, and Long suites, 4,500 episodes for LIBERO-90, and 6,500 episodes for LIBERO-130. |
| 78 | > For the SFT-trained (LoRA-base) models, we set `do_sample = False`. |
| 79 | > For the RL-trained models, we set `do_sample = True`, `temperature = 1.6`, and `rollout_epoch = 2`, and report the average over the two runs. |
| 80 | |
| 81 | | Model | Object | Spatial | Goal | Long | 90 | Average | |
| 82 | | ------------------ | ------ | ------- | ----- | ----- | ------- |------- | |
| 83 | | sft models | 28.83 | 52.22 | 49.40 | 14.92 | 79.28 | 66.07 | |
| 84 | | trained with RLinf | [97.68](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-object) | [94.76](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-spatial) | [93.96](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-goal) | [90.93](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-long) | [96.44](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-90) | 95.79 | |
| 85 | |
| 86 | In addition, we trained [a single model](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-130), which we call the libero-130 model, on all LIBERO tasks. |
| 87 | |
| 88 | | libero-130 model | Object | Spatial | Goal | Long | 90 | 130(all) | |
| 89 | | ------------------ | ------ | ------- | ----- | ----- | ------- |------- | |
| 90 | | sft models | 50.20 | 51.61 | 49.40 | 11.90 | 42.67 | 42.09 | |
| 91 | | trained with RLinf | 99.60 | 98.69 | 98.09 | 93.45 | 98.02 | 97.85 | |
| 92 | |
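
The Average and 130(all) columns appear to be episode-weighted means rather than plain means of the five suite scores. This is an inference from the episode counts in the evaluation note (500 per small suite, 4,500 for LIBERO-90), not something the source states; the helper below is a hypothetical check that reproduces the reported values under that assumption.

```python
# Episode counts per suite, taken from the evaluation note above.
EPISODES = {"Object": 500, "Spatial": 500, "Goal": 500, "Long": 500, "90": 4500}


def weighted_average(rates, episodes=EPISODES):
    """Success rate over all episodes, weighting each suite by its episode count."""
    total_episodes = sum(episodes[suite] for suite in rates)
    weighted_sum = sum(rates[suite] * episodes[suite] for suite in rates)
    return weighted_sum / total_episodes


# Per-suite success rates copied from the two tables above.
sft_separate = {"Object": 28.83, "Spatial": 52.22, "Goal": 49.40, "Long": 14.92, "90": 79.28}
rl_separate = {"Object": 97.68, "Spatial": 94.76, "Goal": 93.96, "Long": 90.93, "90": 96.44}
sft_130 = {"Object": 50.20, "Spatial": 51.61, "Goal": 49.40, "Long": 11.90, "90": 42.67}
rl_130 = {"Object": 99.60, "Spatial": 98.69, "Goal": 98.09, "Long": 93.45, "90": 98.02}

print(round(weighted_average(sft_separate), 2))  # 66.07
print(round(weighted_average(rl_separate), 2))   # 95.79
print(round(weighted_average(sft_130), 2))       # 42.09
print(round(weighted_average(rl_130), 2))        # 97.85
```

A plain mean of the five SFT scores in the first table would be about 44.93, so LIBERO-90, which accounts for 4,500 of the 6,500 episodes, dominates the reported averages.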
| 93 | <div align="center"> |
| 94 | <img src="tensorboard-success_once.png" alt="RLinf-libero-result" width="600"/> |
| 95 | </div> |
| 96 | |
| 97 | ## How to Use |
| 98 | Please integrate the provided model with the [RLinf](https://github.com/RLinf/RLinf) codebase. To do so, modify the following parameters in the configuration file ``examples/embodiment/config/libero_10_grpo_openvlaoft.yaml``: |
| 99 | |
| 100 | - Set ``rollout.model.model_path``, ``actor.model.model_path``, and ``actor.tokenizer.tokenizer_model`` to the path of the model checkpoint. |
| 101 | |
| 102 | Note: If you intend to evaluate the model directly, make sure to set ``actor.model.is_lora`` to ``false``. |
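
The overrides listed above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the config is modeled as a plain nested dict rather than the real YAML file, and `set_by_path`, `apply_checkpoint`, and the checkpoint path are hypothetical names, not part of RLinf.

```python
def set_by_path(cfg, dotted_key, value):
    """Set cfg['a']['b']['c'] = value for a dotted key such as 'a.b.c'."""
    node = cfg
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return cfg


def apply_checkpoint(cfg, checkpoint_path, evaluate_directly=False):
    """Point the three path keys named in this README at one checkpoint."""
    for key in (
        "rollout.model.model_path",
        "actor.model.model_path",
        "actor.tokenizer.tokenizer_model",
    ):
        set_by_path(cfg, key, checkpoint_path)
    if evaluate_directly:
        # Per the note above: direct evaluation requires is_lora to be false.
        set_by_path(cfg, "actor.model.is_lora", False)
    return cfg


# Hypothetical local path; replace it with wherever the checkpoint was downloaded.
config = apply_checkpoint({}, "/path/to/checkpoint", evaluate_directly=True)
```

In practice the same four keys would be edited directly in the YAML file; the helper only makes the mapping between dotted names and nested sections explicit.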
| 103 | |
| 104 | ## License |
| 105 | This code repository and the model weights are licensed under the MIT License. |