---
language:
- en
library_name: lerobot
pipeline_tag: robotics
tags:
- vision-language-action
- imitation-learning
- lerobot
inference: false
---

# SmolVLA (LeRobot)

SmolVLA is a compact, efficient Vision-Language-Action (VLA) model designed for affordable robotics. It is trainable on a single GPU and deployable on consumer hardware, and it matches the performance of much larger VLAs by training on community-driven data.

Original paper: [SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics](https://arxiv.org/abs/2506.01844)

Reference implementation: https://github.com/huggingface/lerobot

## Model description

- Inputs: images (multi-view), proprioceptive state, optional language instruction
- Outputs: continuous actions
- Training objective: flow matching
- Action representation: continuous
- Intended use: base model to fine-tune on your specific use case

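
To make the flow-matching objective listed above more concrete, here is a minimal pure-Python sketch for a single action vector. It assumes one common convention: a straight-line path `x_t = t * noise + (1 - t) * action` with target velocity `noise - action`. The function names, the sampling range for `t`, and the toy "oracle" predictor are all hypothetical and not part of the LeRobot API; the actual SmolVLA implementation may differ in its details.

```python
import random


def flow_matching_loss(action, predict_velocity, rng):
    """One-sample flow-matching loss for a single action vector (illustrative)."""
    # Interpolation time; this sampling range is an assumption for the sketch.
    t = rng.uniform(0.05, 1.0)
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    # Point on the straight path from the action (t=0) to pure noise (t=1).
    x_t = [t * n + (1.0 - t) * a for n, a in zip(noise, action)]
    # Velocity of that path, which the network is trained to regress.
    target = [n - a for n, a in zip(noise, action)]
    pred = predict_velocity(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(action)


def make_oracle(action):
    """Hypothetical stand-in for a perfectly trained network."""
    def predict_velocity(x_t, t):
        # From x_t = t*noise + (1-t)*action: noise - action = (x_t - action) / t.
        return [(x - a) / t for x, a in zip(x_t, action)]
    return predict_velocity


action = [0.1, -0.3, 0.5]
oracle_loss = flow_matching_loss(action, make_oracle(action), random.Random(0))
zero_loss = flow_matching_loss(action, lambda x_t, t: [0.0] * len(x_t), random.Random(0))
print(oracle_loss, zero_loss)
```

The oracle drives the loss to (numerically) zero, while a predictor that always outputs zero does not. At inference time, a flow-matching model typically integrates the learned velocity field from noise back toward an action over a small number of steps.
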
## Quick start (inference on a real batch)

### Installation

```bash
pip install "lerobot[smolvla]"
```

For full installation details (including optional video dependencies such as ffmpeg for torchcodec), see the official documentation: https://huggingface.co/docs/lerobot/installation

### Load model + dataset, run `select_action`

```python
import torch

from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.policies.factory import make_pre_post_processors

# Swap this import per policy
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Load a policy
model_id = "lerobot/smolvla_base"  # <- swap checkpoint
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

policy = SmolVLAPolicy.from_pretrained(model_id).to(device).eval()

# Build the pre/post-processing pipelines for this checkpoint
preprocess, postprocess = make_pre_post_processors(
    policy.config,
    model_id,
    preprocessor_overrides={"device_processor": {"device": str(device)}},
)

# Load a LeRobotDataset
dataset = LeRobotDataset("lerobot/libero")

# Pick an episode
episode_index = 0

# Each episode corresponds to a contiguous range of frame indices
from_idx = dataset.meta.episodes["dataset_from_index"][episode_index]
to_idx = dataset.meta.episodes["dataset_to_index"][episode_index]

# Get a single frame from that episode (e.g. the first frame)
frame_index = from_idx
frame = dict(dataset[frame_index])

# Preprocess the raw frame, then run the policy on the preprocessed batch
batch = preprocess(frame)
with torch.inference_mode():
    pred_action = policy.select_action(batch)
    # Postprocess the predicted action with the policy's postprocessor,
    # for instance to unnormalize or detokenize it
    pred_action = postprocess(pred_action)
```
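
The postprocessing comment above mentions unnormalizing actions. As a rough illustration of what such a step can involve, the sketch below applies and then inverts mean/std normalization in plain Python. The statistics and helper names are made up for the example; the real postprocessor returned by `make_pre_post_processors` takes its statistics and normalization mode from the checkpoint and may behave differently.

```python
def normalize(values, mean, std, eps=1e-8):
    """Map raw values to a roughly zero-mean, unit-variance space."""
    return [(v - m) / (s + eps) for v, m, s in zip(values, mean, std)]


def unnormalize(values, mean, std, eps=1e-8):
    """Invert `normalize`, mapping model-space values back to robot units."""
    return [v * (s + eps) + m for v, m, s in zip(values, mean, std)]


# Hypothetical per-dimension action statistics (not taken from any dataset).
action_mean = [0.0, 0.2, -0.1]
action_std = [0.5, 1.0, 0.25]

raw_action = [0.25, -0.8, 0.15]
model_space = normalize(raw_action, action_mean, action_std)
recovered = unnormalize(model_space, action_mean, action_std)
print(recovered)
```

Skipping this inverse step would send model-space values to the robot instead of actions in its native units, which is why the quick start always routes predictions through `postprocess`.
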


## Training step (loss + backward)

If you’re training / fine-tuning, you typically call `forward(...)` to get a loss and then call `loss.backward()`:

```python
policy.train()
batch = dict(dataset[0])
batch = preprocess(batch)

loss, outputs = policy.forward(batch)
loss.backward()
```

> Notes:
>
> - Some policies expose `policy(**batch)` or return a dict; keep this snippet aligned with the policy API.
> - Use your trainer script (`lerobot-train`) for full training loops.

## How to train / fine-tune

```bash
lerobot-train \
  --dataset.repo_id=${HF_USER}/<dataset> \
  --output_dir=./outputs/[RUN_NAME] \
  --job_name=[RUN_NAME] \
  --policy.repo_id=${HF_USER}/<desired_policy_repo_id> \
  --policy.path=lerobot/[BASE_CHECKPOINT] \
  --policy.dtype=bfloat16 \
  --policy.device=cuda \
  --steps=100000 \
  --batch_size=4
```

Add policy-specific flags as needed, for example:

- `--policy.chunk_size=...`
- `--policy.n_action_steps=...`
- `--policy.max_action_tokens=...`
- `--policy.gradient_checkpointing=true`

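
The `chunk_size` and `n_action_steps` settings interact: a chunking policy predicts a sequence of future actions in one forward pass, then replays part of that sequence before predicting again. The sketch below illustrates the idea with a toy queue. `ChunkedPolicy` and `fake_predict_chunk` are hypothetical names, and the real `select_action` logic in LeRobot may differ in its details.

```python
from collections import deque


class ChunkedPolicy:
    """Toy policy that replays the first n_action_steps of each predicted chunk."""

    def __init__(self, predict_chunk, chunk_size, n_action_steps):
        if n_action_steps > chunk_size:
            raise ValueError("n_action_steps cannot exceed chunk_size")
        # predict_chunk: callable (observation, chunk_size) -> list of actions
        self.predict_chunk = predict_chunk
        self.chunk_size = chunk_size
        self.n_action_steps = n_action_steps
        self.queue = deque()
        self.num_forward_passes = 0

    def select_action(self, observation):
        # Only run the (expensive) model when the queue is empty.
        if not self.queue:
            chunk = self.predict_chunk(observation, self.chunk_size)
            self.num_forward_passes += 1
            self.queue.extend(chunk[: self.n_action_steps])
        return self.queue.popleft()


def fake_predict_chunk(observation, chunk_size):
    """Stand-in model: action k of the chunk is simply observation + k."""
    return [observation + k for k in range(chunk_size)]


toy_policy = ChunkedPolicy(fake_predict_chunk, chunk_size=50, n_action_steps=10)
actions = [toy_policy.select_action(observation=step) for step in range(25)]
print(toy_policy.num_forward_passes)
```

In this toy setup, a smaller `n_action_steps` makes the policy re-plan from fresh observations more often at the cost of more forward passes, while a larger value amortizes one prediction over more control steps.
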
## Real-World Inference & Evaluation

You can use the [`lerobot-record`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/scripts/lerobot_record.py) script with a policy checkpoint as input to run inference and evaluate your policy.

For instance, run this command to run inference and record 10 evaluation episodes:

```bash
lerobot-record \
  --robot.type=so100_follower \
  --robot.port=/dev/ttyACM1 \
  --robot.cameras="{ up: {type: opencv, index_or_path: /dev/video10, width: 640, height: 480, fps: 30}, side: {type: intelrealsense, serial_number_or_name: 233522074606, width: 640, height: 480, fps: 30}}" \
  --robot.id=my_awesome_follower_arm \
  --display_data=false \
  --dataset.repo_id=${HF_USER}/eval_so100 \
  --dataset.single_task="Put lego brick into the transparent box" \
  --dataset.num_episodes=10 \
  --policy.path=${HF_USER}/my_policy

# Optional: to teleoperate in between episodes, append these flags to the command above
#   --teleop.type=so100_leader
#   --teleop.port=/dev/ttyACM0
#   --teleop.id=my_awesome_leader_arm
```
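
If you would rather launch the evaluation from Python, one option is to assemble the same flags as an argument list and add the optional teleop flags conditionally. This is only a sketch: `build_record_command` is a hypothetical helper, not part of LeRobot, it uses only flags that appear in the command above, and it leaves out `--robot.cameras` for brevity (add it in the same way for a real run).

```python
import shlex


def build_record_command(policy_path, dataset_repo_id, task, robot_port, teleop_port=None):
    """Assemble a lerobot-record argument list from the flags shown above."""
    args = [
        "lerobot-record",
        "--robot.type=so100_follower",
        f"--robot.port={robot_port}",
        "--robot.id=my_awesome_follower_arm",
        "--display_data=false",
        f"--dataset.repo_id={dataset_repo_id}",
        f"--dataset.single_task={task}",
        f"--policy.path={policy_path}",
    ]
    if teleop_port is not None:
        # Optional: teleoperate in between episodes.
        args += [
            "--teleop.type=so100_leader",
            f"--teleop.port={teleop_port}",
            "--teleop.id=my_awesome_leader_arm",
        ]
    return args


cmd = build_record_command(
    policy_path="my_user/my_policy",  # placeholder repo ids
    dataset_repo_id="my_user/eval_so100",
    task="Put lego brick into the transparent box",
    robot_port="/dev/ttyACM1",
)
print(shlex.join(cmd))  # shell-quoted form, safe to copy and paste
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with values that contain spaces, such as the task description.
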