---
language:
- en
library_name: lerobot
pipeline_tag: robotics
tags:
- vision-language-action
- imitation-learning
- lerobot
inference: false
---

# SmolVLA (LeRobot)

SmolVLA is a compact, efficient Vision-Language-Action (VLA) model designed for affordable robotics. It can be trained on a single GPU and deployed on consumer hardware, while matching the performance of much larger VLAs by training on community-contributed data.

**Original paper:** [SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics](https://arxiv.org/abs/2506.01844)

**Reference implementation:** https://github.com/huggingface/lerobot

## Model description

- **Inputs:** images (multi-view), proprioceptive state, optional language instruction
- **Outputs:** continuous actions
- **Training objective:** flow matching
- **Action representation:** continuous
- **Intended use:** base model to fine-tune on your specific use case

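
To make the flow matching objective listed above more concrete, here is a minimal, dependency-free sketch of how a single training target can be built. It assumes one common linear-interpolation convention (noisy sample `x_t = t * noise + (1 - t) * action`, regression target `noise - action`). The function names are hypothetical, and the actual SmolVLA implementation works on batched tensors and may differ in its details.

```python
import random


def flow_matching_target(action, noise, t):
    """Build the noisy input and regression target for one action vector.

    Assumed convention (illustrative, not copied from the SmolVLA source):
      x_t = t * noise + (1 - t) * action  -> t=0 is the clean action, t=1 is pure noise
      u_t = noise - action                -> velocity d(x_t)/dt that the model regresses
    """
    x_t = [t * n + (1.0 - t) * a for a, n in zip(action, noise)]
    u_t = [n - a for a, n in zip(action, noise)]
    return x_t, u_t


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


# Toy example: a 3-dimensional action, Gaussian noise, and a random time in [0, 1).
rng = random.Random(0)
action = [0.1, -0.4, 0.8]
noise = [rng.gauss(0.0, 1.0) for _ in action]
t = rng.random()

x_t, u_t = flow_matching_target(action, noise, t)
# A real model would predict the velocity from (x_t, t, observations);
# a perfect prediction equals u_t and therefore gives zero loss.
loss = mse(u_t, u_t)
```

At inference time, a model trained this way would start from pure noise and integrate the predicted velocity back toward a clean action; that step is omitted here.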

## Quick start (inference on a real batch)

### Installation

```bash
pip install "lerobot[smolvla]"
```

For full installation details (including optional video dependencies such as ffmpeg for torchcodec), see the official documentation: https://huggingface.co/docs/lerobot/installation

### Load model + dataset, run `select_action`

```python
import torch
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.policies.factory import make_pre_post_processors

# Swap this import per-policy
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# load a policy
model_id = "lerobot/smolvla_base"  # <- swap checkpoint
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

policy = SmolVLAPolicy.from_pretrained(model_id).to(device).eval()

preprocess, postprocess = make_pre_post_processors(
    policy.config,
    model_id,
    preprocessor_overrides={"device_processor": {"device": str(device)}},
)
# load a LeRobotDataset
dataset = LeRobotDataset("lerobot/libero")

# pick an episode
episode_index = 0

# each episode corresponds to a contiguous range of frame indices
from_idx = dataset.meta.episodes["dataset_from_index"][episode_index]
to_idx = dataset.meta.episodes["dataset_to_index"][episode_index]

# get a single frame from that episode (e.g. the first frame)
frame_index = from_idx
frame = dict(dataset[frame_index])

# preprocess the frame into a model-ready batch, then predict from that batch
batch = preprocess(frame)
with torch.inference_mode():
    pred_action = policy.select_action(batch)
    # postprocess the predicted action (e.g. unnormalize or detokenize it)
    pred_action = postprocess(pred_action)
```

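
The single-frame example above extends to a whole episode by looping over its contiguous frame range. The helper below is a minimal sketch of that loop and is not part of LeRobot: `rollout_episode` and its parameters are hypothetical, the range is assumed to be half-open (`to_idx` excluded), and the processing steps are passed in as plain callables so the pattern can be run without a model.

```python
def rollout_episode(dataset, from_idx, to_idx, preprocess, select_action, postprocess, reset=None):
    """Run a policy over frames [from_idx, to_idx) and collect one action per frame."""
    if reset is not None:
        reset()  # clear any state cached from a previous episode
    actions = []
    for frame_index in range(int(from_idx), int(to_idx)):
        frame = dict(dataset[frame_index])  # shallow copy of the dataset item
        batch = preprocess(frame)
        action = select_action(batch)
        actions.append(postprocess(action))
    return actions


# Stand-in objects so the sketch runs without a model or a real dataset.
toy_dataset = [{"observation.state": float(i)} for i in range(10)]
toy_actions = rollout_episode(
    toy_dataset,
    from_idx=2,
    to_idx=5,
    preprocess=lambda frame: frame,  # identity preprocessing
    select_action=lambda batch: batch["observation.state"] * 2.0,  # fake policy
    postprocess=lambda action: action + 1.0,  # fake un-normalization
)
```

With the objects from the quick start, you would presumably pass `preprocess`, `policy.select_action`, and `postprocess`, wrap the call in `torch.inference_mode()`, and pass `policy.reset` as `reset` if your policy keeps an internal action queue.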

## Training step (loss + backward)

If you’re training or fine-tuning, you typically call `forward(...)` to get a loss and then backpropagate it:

```python
policy.train()
batch = dict(dataset[0])
batch = preprocess(batch)

# forward returns the loss first, followed by additional outputs
loss, outputs = policy.forward(batch)
loss.backward()
```

> Notes:
>
> - Some policies expose `policy(**batch)` or return a dict; keep this snippet aligned with the policy API.
> - Use your trainer script (`lerobot-train`) for full training loops.


## How to train / fine-tune

```bash
lerobot-train \
  --dataset.repo_id=${HF_USER}/<dataset> \
  --output_dir=./outputs/[RUN_NAME] \
  --job_name=[RUN_NAME] \
  --policy.repo_id=${HF_USER}/<desired_policy_repo_id> \
  --policy.path=lerobot/[BASE_CHECKPOINT] \
  --policy.dtype=bfloat16 \
  --policy.device=cuda \
  --steps=100000 \
  --batch_size=4
```

Append any policy-specific flags your policy supports, for example:

- `--policy.chunk_size=...`
- `--policy.n_action_steps=...`
- `--policy.max_action_tokens=...`
- `--policy.gradient_checkpointing=true`

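
The `chunk_size` and `n_action_steps` flags interact: chunked policies predict a sequence of future actions at once and execute part of it before predicting again. The sketch below illustrates that pattern with a plain queue. `ChunkedActionQueue` and `fake_predict_chunk` are hypothetical names used only for illustration, and the real policy logic may differ.

```python
from collections import deque


class ChunkedActionQueue:
    """Serve one action per call, re-predicting only when the queue is empty."""

    def __init__(self, predict_chunk, chunk_size, n_action_steps):
        if n_action_steps > chunk_size:
            raise ValueError("n_action_steps cannot exceed chunk_size")
        self.predict_chunk = predict_chunk
        self.chunk_size = chunk_size
        self.n_action_steps = n_action_steps
        self.queue = deque()
        self.num_predictions = 0

    def select_action(self, observation):
        if not self.queue:
            chunk = self.predict_chunk(observation, self.chunk_size)
            self.num_predictions += 1
            # Keep only the first n_action_steps actions of the predicted chunk.
            self.queue.extend(chunk[: self.n_action_steps])
        return self.queue.popleft()


# Fake predictor: the chunk is obs, obs + 1, ..., obs + chunk_size - 1.
def fake_predict_chunk(observation, chunk_size):
    return [observation + i for i in range(chunk_size)]


runner = ChunkedActionQueue(fake_predict_chunk, chunk_size=4, n_action_steps=2)
executed = [runner.select_action(obs) for obs in (0, 10, 20, 30)]
```

In this sketch, a smaller `n_action_steps` triggers a new prediction more often (more compute, fresher observations), while a larger value reuses each predicted chunk for longer.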

## Real-World Inference & Evaluation

You can use the [**`lerobot-record`**](https://github.com/huggingface/lerobot/blob/main/src/lerobot/scripts/lerobot_record.py) script with a policy checkpoint as input to run inference and evaluate your policy.

For instance, run this command to run inference and record 10 evaluation episodes:

```bash
lerobot-record \
  --robot.type=so100_follower \
  --robot.port=/dev/ttyACM1 \
  --robot.cameras="{ up: {type: opencv, index_or_path: /dev/video10, width: 640, height: 480, fps: 30}, side: {type: intelrealsense, serial_number_or_name: 233522074606, width: 640, height: 480, fps: 30}}" \
  --robot.id=my_awesome_follower_arm \
  --display_data=false \
  --dataset.repo_id=${HF_USER}/eval_so100 \
  --dataset.single_task="Put lego brick into the transparent box" \
  --dataset.num_episodes=10 \
  --policy.path=${HF_USER}/my_policy
# Optional: to teleoperate in between episodes, append the flags below to the
# command above (add a trailing backslash after --policy.path first):
#   --teleop.type=so100_leader \
#   --teleop.port=/dev/ttyACM0 \
#   --teleop.id=my_awesome_leader_arm
```
| 146 | ``` |