---
base_model:
- Qwen/Qwen3-VL-8B-Instruct
datasets:
- OneThink/OneThinker-train-data
pipeline_tag: any-to-any
library_name: transformers
license: apache-2.0
---

# OneThinker: All-in-one Reasoning Model for Image and Video

This repository contains the **SFT model** presented in [OneThinker: All-in-one Reasoning Model for Image and Video](https://arxiv.org/pdf/2512.03043).

This is an intermediate checkpoint prepared for subsequent RL training.

For detailed instructions on environment setup, training scripts, and comprehensive evaluation, please refer to the [OneThinker GitHub repository](https://github.com/tulerfeng/OneThinker).
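
For a quick start, the snippet below is a minimal inference sketch, assuming a recent 🤗 Transformers release with Qwen3-VL support. The `MODEL_ID` value and the image URL are illustrative placeholders, not part of this release; substitute this repository's actual model id.

```python
# Minimal inference sketch (assumption: a recent transformers release with
# Qwen3-VL support; MODEL_ID is a placeholder for this repository's model id).
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "OneThink/OneThinker-SFT"  # hypothetical id; replace with this repo's id

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One user turn with an image and a question; the chat template inserts the
# model-specific vision tokens around the image.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/demo.jpg"},  # illustrative URL
        {"type": "text", "text": "Describe this image, reasoning step by step."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```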

## 👀 About OneThinker

<div align="center">
  <img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/teaser.png" alt="OneThinker Teaser Image" width="95%">
</div>

We introduce **OneThinker**, an all-in-one multimodal reasoning generalist that is **capable of thinking across a wide range of fundamental visual tasks within a single model**.

OneThinker unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the large-scale **OneThinker-600k** multi-task training corpus and build **OneThinker-SFT-340k** with high-quality CoT annotations for the SFT cold start. Furthermore, we propose **EMA-GRPO**, a new RL method that balances heterogeneous reward signals across diverse visual tasks by tracking task-wise exponential moving averages of reward standard deviations for balanced optimization.
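
To make the EMA-GRPO description above concrete, here is a minimal sketch under stated assumptions, not the authors' implementation: a per-task exponential moving average of the reward standard deviation replaces the raw per-group std when normalizing group-relative advantages, so tasks whose rewards are systematically noisier do not dominate the policy update. The class name, `decay` value, and task label are illustrative.

```python
import numpy as np

class EmaGrpoNormalizer:
    """Hypothetical sketch of EMA-GRPO-style advantage normalization:
    GRPO's per-group reward std is replaced by a task-wise EMA of stds."""

    def __init__(self, decay: float = 0.99, eps: float = 1e-6):
        self.decay = decay   # EMA smoothing factor (assumed value)
        self.eps = eps       # numerical floor for the divisor
        self.ema_std = {}    # task name -> running EMA of reward std

    def advantages(self, task: str, rewards: np.ndarray) -> np.ndarray:
        # Update this task's EMA with the current group's reward std.
        std = float(rewards.std())
        prev = self.ema_std.get(task, std)  # bootstrap from the first observation
        self.ema_std[task] = self.decay * prev + (1 - self.decay) * std
        # Group-relative advantage, normalized by the task-wise EMA'd std
        # instead of the per-group std used in vanilla GRPO.
        return (rewards - rewards.mean()) / (self.ema_std[task] + self.eps)

# Usage: one group of rollout rewards for a (hypothetical) segmentation prompt.
norm = EmaGrpoNormalizer()
print(norm.advantages("segmentation", np.array([0.2, 0.8, 0.5, 0.9])))
```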

OneThinker demonstrates **strong performance on 31 benchmarks across 10 fundamental vision tasks**, while showing effective knowledge transfer between certain tasks and promising zero-shot generalization, marking a step toward a unified multimodal reasoning generalist.

## 📄 Citations

If you find our work helpful for your research, please consider citing it:

```bibtex
@article{feng2025onethinker,
  title={OneThinker: All-in-one Reasoning Model for Image and Video},
  author={Feng, Kaituo and Zhang, Manyuan and Li, Hongyu and Fan, Kaixuan and Chen, Shuang and Jiang, Yilei and Zheng, Dian and Sun, Peiwen and Zhang, Yiyuan and Sun, Haoze and others},
  journal={arXiv preprint arXiv:2512.03043},
  year={2025}
}
```