---
base_model:
- Qwen/Qwen3-VL-8B-Instruct
datasets:
- OneThink/OneThinker-train-data
pipeline_tag: any-to-any
library_name: transformers
license: apache-2.0
---

# OneThinker: All-in-one Reasoning Model for Image and Video

This repository contains the **SFT model** presented in [OneThinker: All-in-one Reasoning Model for Image and Video](https://arxiv.org/pdf/2512.03043).

This is an intermediate checkpoint prepared for subsequent RL training.

For detailed instructions on environment setup, training scripts, and comprehensive evaluation, please refer to the [OneThinker GitHub repository](https://github.com/tulerfeng/OneThinker).
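
For a quick start, the snippet below is a minimal inference sketch, assuming a recent 🤗 Transformers release with Qwen3-VL support. The `MODEL_ID` value and the image URL are illustrative placeholders, not part of this release; substitute this repository's actual model id.

```python
# Minimal inference sketch (assumption: a recent transformers release with
# Qwen3-VL support; MODEL_ID is a placeholder for this repository's model id).
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "OneThink/OneThinker-SFT"  # hypothetical id; replace with this repo's id

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One user turn with an image and a question; the chat template inserts the
# model-specific vision tokens around the image.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/demo.jpg"},  # illustrative URL
        {"type": "text", "text": "Describe this image, reasoning step by step."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```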

## 👀 About OneThinker

<div align="center">
  <img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/teaser.png" alt="OneThinker Teaser Image" width="95%">
</div>

We introduce **OneThinker**, an all-in-one multimodal reasoning generalist that is **capable of thinking across a wide range of fundamental visual tasks within a single model**.

OneThinker unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the large-scale **OneThinker-600k** multi-task training corpus and build **OneThinker-SFT-340k** with high-quality CoT annotations for the SFT cold start. Furthermore, we propose **EMA-GRPO**, a new RL method that balances heterogeneous reward signals across diverse visual tasks by tracking task-wise exponential moving averages of reward standard deviations for balanced optimization.
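
To make the EMA-GRPO description above concrete, here is a minimal sketch under stated assumptions, not the authors' implementation: a per-task exponential moving average of the reward standard deviation replaces the raw per-group std when normalizing group-relative advantages, so tasks whose rewards are systematically noisier do not dominate the policy update. The class name, `decay` value, and task label are illustrative.

```python
import numpy as np

class EmaGrpoNormalizer:
    """Hypothetical sketch of EMA-GRPO-style advantage normalization:
    GRPO's per-group reward std is replaced by a task-wise EMA of stds."""

    def __init__(self, decay: float = 0.99, eps: float = 1e-6):
        self.decay = decay   # EMA smoothing factor (assumed value)
        self.eps = eps       # numerical floor for the divisor
        self.ema_std = {}    # task name -> running EMA of reward std

    def advantages(self, task: str, rewards: np.ndarray) -> np.ndarray:
        # Update this task's EMA with the current group's reward std.
        std = float(rewards.std())
        prev = self.ema_std.get(task, std)  # bootstrap from the first observation
        self.ema_std[task] = self.decay * prev + (1 - self.decay) * std
        # Group-relative advantage, normalized by the task-wise EMA'd std
        # instead of the per-group std used in vanilla GRPO.
        return (rewards - rewards.mean()) / (self.ema_std[task] + self.eps)

# Usage: one group of rollout rewards for a (hypothetical) segmentation prompt.
norm = EmaGrpoNormalizer()
print(norm.advantages("segmentation", np.array([0.2, 0.8, 0.5, 0.9])))
```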

OneThinker demonstrates **strong performance on 31 benchmarks across 10 fundamental vision tasks**, while showing effective knowledge transfer between certain tasks and promising zero-shot generalization, marking a step toward a unified multimodal reasoning generalist.

## 📄 Citations

If you find our work helpful for your research, please consider citing it:

```bibtex
@article{feng2025onethinker,
  title={OneThinker: All-in-one Reasoning Model for Image and Video},
  author={Feng, Kaituo and Zhang, Manyuan and Li, Hongyu and Fan, Kaixuan and Chen, Shuang and Jiang, Yilei and Zheng, Dian and Sun, Peiwen and Zhang, Yiyuan and Sun, Haoze and others},
  journal={arXiv preprint arXiv:2512.03043},
  year={2025}
}
```