---
license: mit
tags:
- RLinf
language:
- en
metrics:
- accuracy
pipeline_tag: reinforcement-learning
model-index:
- name: RLinf-OpenVLAOFT-LIBERO-130-Base-Lora
  results:
  - task:
      type: VLA
    dataset:
      type: libero_130
      name: libero_130
    metrics:
    - type: accuracy
      value: 42.09
---

<div align="center">
<img src="logo.svg" alt="RLinf-logo" width="500"/>
</div>


<div align="center">
<!-- <a href="TODO"><img src="https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv"></a> -->
<!-- <a href="TODO"><img src="https://img.shields.io/badge/HuggingFace-yellow?logo=huggingface&logoColor=white" alt="Hugging Face"></a> -->
<a href="https://github.com/RLinf/RLinf"><img src="https://img.shields.io/badge/Github-blue"></a>
<a href="https://rlinf.readthedocs.io/en/latest/"><img src="https://img.shields.io/badge/Documentation-Purple?color=8A2BE2&logo=readthedocs"></a>
<!-- <a href="TODO"><img src="https://devin.ai/assets/deepwiki-badge.png" alt="Ask DeepWiki.com" style="height:20px;"></a>
<a href="TODO"><img src="https://img.shields.io/badge/微信-green?logo=wechat&amp"></a> -->
</div>

<h1 align="center">RLinf: Reinforcement Learning Infrastructure for Agentic AI</h1>

[RLinf](https://github.com/RLinf/RLinf) is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.


<div align="center">
<img src="overview.png" alt="RLinf-overview" width="600"/>
</div>

## Model Description
The RLinf-openvlaoft-libero series is trained starting from RLinf/RLinf-OpenVLAOFT-LIBERO-xxx-Base-Lora (for libero90 and libero130) and Haozhan72/Openvla-oft-SFT-libero-xxx-traj1 (for libero10, libero-object, libero-goal, and libero-spatial), using the same base models and training datasets as verl. Training with RLinf yields SOTA performance on these suites.

We use a mask to restrict the loss to valid action tokens and compute a token-level loss from the Group Relative Policy Optimization (GRPO) advantage function, with the aim of improving the model’s performance on spatial reasoning, object generalization, instruction generalization, and long-horizon tasks.

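The masked, token-level GRPO objective described above can be sketched roughly as follows. This is a minimal illustration, not the RLinf implementation: the function names, the clipped-ratio form of the surrogate, and the `clip_eps` and `eps` values are assumptions chosen for the example.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by the mean and std of its group.

    Assumption: one scalar reward per rollout, all rollouts sharing a task.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def masked_token_loss(logprobs, old_logprobs, advantages, masks, clip_eps=0.2):
    """Clipped surrogate loss averaged over valid (mask == 1) tokens only.

    Assumption: a PPO-style clipped ratio is used; the README only states
    that the loss is token-level, masked, and based on the GRPO advantage.
    """
    total, count = 0.0, 0
    for lp_seq, old_seq, adv, mask_seq in zip(logprobs, old_logprobs, advantages, masks):
        for lp, old, valid in zip(lp_seq, old_seq, mask_seq):
            if not valid:
                continue  # padding or non-action tokens contribute nothing
            ratio = math.exp(lp - old)
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            total += -min(ratio * adv, clipped * adv)
            count += 1
    return total / max(count, 1)


# Two rollouts of the same task: the first succeeds, the second fails.
advantages = group_relative_advantages([1.0, 0.0])
logprobs = [[-1.0, -2.0, -9.0], [-0.5, -7.0, -7.0]]
old_logprobs = [[-1.0, -2.0, -3.0], [-0.5, -1.0, -1.0]]
masks = [[1, 1, 0], [1, 0, 0]]  # only the leading tokens are valid actions
loss = masked_token_loss(logprobs, old_logprobs, advantages, masks)
```

Every token in a rollout shares that rollout's advantage, so the mask is what keeps padded or invalid positions from diluting or distorting the average.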

## Evaluation and Results
We trained six models using RLinf:

- [RLinf-OpenVLAOFT-GRPO-LIBERO-90](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-90) Model (based on [RLinf/RLinf-OpenVLAOFT-LIBERO-90-Base-Lora](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-90-Base-Lora))
  - Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0`

- [RLinf-OpenVLAOFT-LIBERO-130](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-130) Model (based on [RLinf/RLinf-OpenVLAOFT-LIBERO-130-Base-Lora](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-130-Base-Lora))
  - Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0`

- [RLinf-OpenVLAOFT-GRPO-LIBERO-object](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-object) Model (based on [Haozhan72/Openvla-oft-SFT-libero-object-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-object-traj1))
  - Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0`

- [RLinf-OpenVLAOFT-GRPO-LIBERO-spatial](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-spatial) Model (based on [Haozhan72/Openvla-oft-SFT-libero-spatial-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-spatial-traj1))
  - Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0`

- [RLinf-OpenVLAOFT-GRPO-LIBERO-goal](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-goal) Model (based on [Haozhan72/Openvla-oft-SFT-libero-goal-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-goal-traj1))
  - Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0`

- [RLinf-OpenVLAOFT-GRPO-LIBERO-long](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-long) Model (based on [Haozhan72/Openvla-oft-SFT-libero10-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero10-traj1))
  - Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0`

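As a rough illustration of what the recommended `temperature` and `top_p` values conventionally mean when sampling tokens, here is a generic pure-Python sketch. It is not taken from RLinf or OpenVLA-OFT; the function names and the example logits are hypothetical, and the `do_sample` switch is modeled on the usual greedy-versus-sampling convention.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest high-probability set whose mass reaches top_p."""
    if top_p >= 1.0:
        return list(probs)  # top_p = 1.0 keeps the full distribution
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalize the surviving tokens
    return filtered


def select_token(logits, do_sample, temperature=1.0, top_p=1.0, rng=random):
    """Greedy argmax when do_sample is False, otherwise sample a token index."""
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical logits over three action tokens.
example_logits = [2.0, 1.0, 0.0]
greedy_choice = select_token(example_logits, do_sample=False)
sampled_choice = select_token(
    example_logits, do_sample=True, temperature=1.6, top_p=1.0, rng=random.Random(0)
)
```

A temperature above 1 flattens the distribution, so `temperature = 1.6` with `top_p = 1.0` samples from a broader set of tokens than greedy decoding would.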

### Benchmark Results

The SFT models for LIBERO-90 and LIBERO-130 were trained by us following the training recipe from [OpenVLA-OFT](https://github.com/moojink/openvla-oft/blob/main/vla-scripts/finetune.py). The other SFT models are from [SimpleVLA-RL](https://huggingface.co/collections/Haozhan72/simplevla-rl-6833311430cd9df52aeb1f86).

> We evaluate each model according to its training configuration, using libero_seed = 0 and evaluating 500 episodes each for the Object, Spatial, Goal, and Long suites, 4,500 episodes for LIBERO-90, and 6,500 episodes for LIBERO-130.
> For the SFT-trained (LoRA-base) models, we set do_sample = False.
> For the RL-trained models, we set do_sample = True, temperature = 1.6, and rollout_epoch = 2; the final results are reported as the average across the two runs.

| Model              | Object | Spatial | Goal  | Long  | 90    | Average |
| ------------------ | ------ | ------- | ----- | ----- | ----- | ------- |
| SFT models         | 28.83  | 52.22   | 49.40 | 14.92 | 79.28 | 66.07   |
| trained with RLinf | [97.68](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-object) | [94.76](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-spatial) | [93.96](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-goal) | [90.93](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-long) | [96.44](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-90) | 95.79 |

In addition, we train [one model](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-130), which we call the libero-130 model, on all tasks in LIBERO.

| libero-130 model   | Object | Spatial | Goal  | Long  | 90    | 130(all) |
| ------------------ | ------ | ------- | ----- | ----- | ----- | -------- |
| SFT models         | 50.20  | 51.61   | 49.40 | 11.90 | 42.67 | 42.09    |
| trained with RLinf | 99.60  | 98.69   | 98.09 | 93.45 | 98.02 | 97.85    |

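The Average and 130(all) columns in the two tables above are numerically consistent with an episode-weighted mean of the per-suite results, using the episode counts from the evaluation note (500 each for Object, Spatial, Goal, and Long; 4,500 for LIBERO-90). That reading is inferred from the reported numbers rather than stated by the authors, and the helper below is a hypothetical sketch for checking it.

```python
# Episode counts per suite, taken from the evaluation note in this README.
EPISODES = {"Object": 500, "Spatial": 500, "Goal": 500, "Long": 500, "90": 4500}


def weighted_success_rate(rates, episodes=EPISODES):
    """Episode-weighted mean of per-suite success rates (in percent).

    Assumption: the summary column weights each suite by its episode count.
    """
    total_episodes = sum(episodes[suite] for suite in rates)
    weighted_sum = sum(rates[suite] * episodes[suite] for suite in rates)
    return weighted_sum / total_episodes


sft_separate = {"Object": 28.83, "Spatial": 52.22, "Goal": 49.40, "Long": 14.92, "90": 79.28}
rl_separate = {"Object": 97.68, "Spatial": 94.76, "Goal": 93.96, "Long": 90.93, "90": 96.44}
sft_130 = {"Object": 50.20, "Spatial": 51.61, "Goal": 49.40, "Long": 11.90, "90": 42.67}
rl_130 = {"Object": 99.60, "Spatial": 98.69, "Goal": 98.09, "Long": 93.45, "90": 98.02}

print(round(weighted_success_rate(sft_separate), 2))  # 66.07
print(round(weighted_success_rate(rl_separate), 2))   # 95.79
print(round(weighted_success_rate(sft_130), 2))       # 42.09
print(round(weighted_success_rate(rl_130), 2))        # 97.85
```

Because LIBERO-90 contributes 4,500 of the 6,500 episodes, the summary columns track the 90 column much more closely than a plain mean of the five suites would.
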
<div align="center">
<img src="tensorboard-success_once.png" alt="RLinf-libero-result" width="600"/>
</div>

## How to Use
Please integrate the provided model with the [RLinf](https://github.com/RLinf/RLinf) codebase. To do so, modify the following parameters in the configuration file ``examples/embodiment/config/libero_10_grpo_openvlaoft.yaml``:

- Set ``rollout.model.model_path``, ``actor.model.model_path``, and ``actor.tokenizer.tokenizer_model`` to the path of the model checkpoint.

Note: If you intend to evaluate the model directly, make sure to set ``actor.model.is_lora`` to ``false``.

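The configuration changes above can be sketched on a plain nested dictionary, as below. This only illustrates which keys change: the nested layout is inferred from the dotted parameter names, the helper functions and the checkpoint path are hypothetical, and in practice you would edit the YAML file directly or use whatever override mechanism RLinf provides.

```python
def set_dotted(config, dotted_key, value):
    """Set config["a"]["b"]["c"] = value for the dotted key "a.b.c"."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})  # create missing intermediate levels
    node[leaf] = value


def point_config_at_checkpoint(config, checkpoint_path, evaluate_only=False):
    """Apply the model-path settings listed in this README to a config dict."""
    for key in (
        "rollout.model.model_path",
        "actor.model.model_path",
        "actor.tokenizer.tokenizer_model",
    ):
        set_dotted(config, key, checkpoint_path)
    if evaluate_only:
        # Per the note above: disable LoRA when evaluating the model directly.
        set_dotted(config, "actor.model.is_lora", False)
    return config


# Hypothetical starting config and checkpoint location.
config = {"actor": {"model": {"is_lora": True}}}
point_config_at_checkpoint(config, "/path/to/checkpoint", evaluate_only=True)
```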
## License
This code repository and the model weights are licensed under the MIT License.