---
license: mit
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: reinforcement-learning
tags:
- IQA
- Reasoning
- VLM
- Pytorch
- R1
- GRPO
- RL2R
---

# VisualQuality-R1-7B
Our paper has been accepted as a **spotlight** at NeurIPS 2025!
This is the latest version of VisualQuality-R1, trained on a diverse combination of synthetic and realistic datasets.<br>
Paper link: [arXiv](https://arxiv.org/abs/2505.14460)<br>
Code link: [GitHub](https://github.com/TianheWu/VisualQuality-R1)

> The first NR-IQA model enhanced by RL2R, capable of both quality description and rating through reasoning.


<img src="https://cdn-uploads.huggingface.co/production/uploads/655de51982afda0fc479fb91/JZgVeMtAVASCCNYO5VCyn.png" width="600"/>

## ⚡ Quick Start

### Non-Thinking Inference
When using VisualQuality-R1 as a reward/evaluation model, you can use **non-thinking** mode to reduce inference time, prompting the model to output only the final score:
```
PROMPT = (
    "You are doing the image quality assessment task. Here is the question: "
    "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
    "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)

QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."
```
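
As a quick, model-free illustration, this is how the template is assembled and how the `<answer>` tag is parsed in the scripts below (the sample reply is hypothetical):

```python
import re

PROMPT = (
    "You are doing the image quality assessment task. Here is the question: "
    "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
    "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)
QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."

# Full user message sent to the model
full_prompt = QUESTION_TEMPLATE.format(Question=PROMPT)

# Parse a hypothetical model reply with the same regex the scripts use
reply = "<answer>3.75</answer>"
matches = re.findall(r'<answer>(.*?)</answer>', reply, re.DOTALL)
score = float(re.search(r'\d+(\.\d+)?', matches[-1].strip()).group())
print(score)  # 3.75
```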

For single-image quality rating, the code is:

<details>
<summary>Example Code (VisualQuality-R1: Image Quality Rating with non-thinking mode)</summary>

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

import torch
import random
import re


def score_image(image_path, model, processor):
    PROMPT = (
        "You are doing the image quality assessment task. Here is the question: "
        "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
        "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
    )

    QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."
    message = [
        {
            "role": "user",
            "content": [
                {'type': 'image', 'image': image_path},
                {"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
            ],
        }
    ]

    batch_messages = [message]

    # Preparation for inference
    text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
    image_inputs, video_inputs = process_vision_info(batch_messages)
    inputs = processor(
        text=text,
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(device)

    # Inference: generation of the output
    generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=2048, do_sample=True, top_k=50, top_p=1)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    batch_output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    reasoning = None  # non-thinking mode produces no reasoning trace

    try:
        model_output_matches = re.findall(r'<answer>(.*?)</answer>', batch_output_text[0], re.DOTALL)
        model_answer = model_output_matches[-1].strip() if model_output_matches else batch_output_text[0].strip()
        score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
    except Exception:
        print(f"================= Meet error with {image_path}, please generate again. =================")
        score = random.randint(1, 5)

    return reasoning, score


random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:5") if torch.cuda.is_available() else torch.device("cpu")
image_path = ""

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"

reasoning, score = score_image(
    image_path, model, processor
)

print(score)
```
</details>


<details>
<summary>Example Code (VisualQuality-R1: Batch Image Quality Rating with non-thinking mode)</summary>

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from tqdm import tqdm

import torch
import random
import re
import os


def get_image_paths(folder_path):
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
    image_paths = []

    for root, dirs, files in os.walk(folder_path):
        for file in files:
            _, ext = os.path.splitext(file)
            if ext.lower() in image_extensions:
                image_paths.append(os.path.join(root, file))

    return image_paths


def score_batch_image(image_paths, model, processor):
    PROMPT = (
        "You are doing the image quality assessment task. Here is the question: "
        "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
        "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
    )

    QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."

    messages = []
    for img_path in image_paths:
        message = [
            {
                "role": "user",
                "content": [
                    {'type': 'image', 'image': img_path},
                    {"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
                ],
            }
        ]
        messages.append(message)

    BSZ = 32
    all_outputs = []  # list to store all answers
    for i in tqdm(range(0, len(messages), BSZ)):
        batch_messages = messages[i:i + BSZ]

        # Preparation for inference
        text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
        image_inputs, video_inputs = process_vision_info(batch_messages)
        inputs = processor(
            text=text,
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt",
        )
        inputs = inputs.to(device)

        # Inference: generation of the output
        generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=512, do_sample=True, top_k=50, top_p=1)
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        batch_output_text = processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )

        all_outputs.extend(batch_output_text)

    path_score_dict = {}
    for img_path, model_output in zip(image_paths, all_outputs):
        try:
            model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
            model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
            score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
        except Exception:
            print(f"Meet error with {img_path}, please generate again.")
            score = random.randint(1, 5)

        path_score_dict[img_path] = score

    return path_score_dict


random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:3") if torch.cuda.is_available() else torch.device("cpu")

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"

image_root = ""
image_paths = get_image_paths(image_root)  # returns a list of image paths

path_score_dict = score_batch_image(
    image_paths, model, processor
)

file_name = "output.txt"
with open(file_name, "w") as file:
    for key, value in path_score_dict.items():
        file.write(f"{key} {value}\n")

print("Done!")
```
</details>

### Thinking Inference

<details>
<summary>Example Code (VisualQuality-R1: Single Image Quality Rating with thinking)</summary>

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

import torch
import random
import re


def score_image(image_path, model, processor):
    PROMPT = (
        "You are doing the image quality assessment task. Here is the question: "
        "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
        "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
    )

    QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
    # QUESTION_TEMPLATE = "Please describe the quality of this image."
    message = [
        {
            "role": "user",
            "content": [
                {'type': 'image', 'image': image_path},
                {"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
            ],
        }
    ]

    batch_messages = [message]

    # Preparation for inference
    text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
    image_inputs, video_inputs = process_vision_info(batch_messages)
    inputs = processor(
        text=text,
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(device)

    # Inference: generation of the output
    generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=2048, do_sample=True, top_k=50, top_p=1)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    batch_output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    # Extract the reasoning trace; guard against a missing <think> block
    reasoning_matches = re.findall(r'<think>(.*?)</think>', batch_output_text[0], re.DOTALL)
    reasoning = reasoning_matches[-1].strip() if reasoning_matches else None

    try:
        model_output_matches = re.findall(r'<answer>(.*?)</answer>', batch_output_text[0], re.DOTALL)
        model_answer = model_output_matches[-1].strip() if model_output_matches else batch_output_text[0].strip()
        score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
    except Exception:
        print(f"================= Meet error with {image_path}, please generate again. =================")
        score = random.randint(1, 5)

    return reasoning, score


random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:5") if torch.cuda.is_available() else torch.device("cpu")
image_path = ""

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"

reasoning, score = score_image(
    image_path, model, processor
)

print(reasoning)
print(score)
```
</details>


<details>
<summary>Example Code (VisualQuality-R1: Batch Image Quality Rating with thinking)</summary>

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from tqdm import tqdm

import torch
import random
import re
import os


def get_image_paths(folder_path):
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
    image_paths = []

    for root, dirs, files in os.walk(folder_path):
        for file in files:
            _, ext = os.path.splitext(file)
            if ext.lower() in image_extensions:
                image_paths.append(os.path.join(root, file))

    return image_paths


def score_batch_image(image_paths, model, processor):
    PROMPT = (
        "You are doing the image quality assessment task. Here is the question: "
        "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
        "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
    )

    QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."

    messages = []
    for img_path in image_paths:
        message = [
            {
                "role": "user",
                "content": [
                    {'type': 'image', 'image': img_path},
                    {"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
                ],
            }
        ]
        messages.append(message)

    BSZ = 32
    all_outputs = []  # list to store all answers
    for i in tqdm(range(0, len(messages), BSZ)):
        batch_messages = messages[i:i + BSZ]

        # Preparation for inference
        text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
        image_inputs, video_inputs = process_vision_info(batch_messages)
        inputs = processor(
            text=text,
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt",
        )
        inputs = inputs.to(device)

        # Inference: generation of the output
        generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=512, do_sample=True, top_k=50, top_p=1)
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        batch_output_text = processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )

        all_outputs.extend(batch_output_text)

    path_score_dict = {}
    for img_path, model_output in zip(image_paths, all_outputs):
        # Extract the reasoning trace (unused below; keep it if you need the rationale)
        reasoning_matches = re.findall(r'<think>(.*?)</think>', model_output, re.DOTALL)
        reasoning = reasoning_matches[-1].strip() if reasoning_matches else None

        try:
            model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
            model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
            score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
        except Exception:
            print(f"Meet error with {img_path}, please generate again.")
            score = random.randint(1, 5)

        path_score_dict[img_path] = score

    return path_score_dict


random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:3") if torch.cuda.is_available() else torch.device("cpu")

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"

image_root = ""
image_paths = get_image_paths(image_root)  # returns a list of image paths

path_score_dict = score_batch_image(
    image_paths, model, processor
)

file_name = "output.txt"
with open(file_name, "w") as file:
    for key, value in path_score_dict.items():
        file.write(f"{key} {value}\n")

print("Done!")
```
</details>


## 🚀 Updated: High-Efficiency VisualQuality-R1 Inference with vLLM

<details>
<summary>Example Code (VisualQuality-R1: Batch Image Quality Rating with thinking, using vLLM)</summary>

```python
# Please install vLLM first: https://docs.vllm.ai/en/stable/getting_started/installation/gpu.html

from transformers import Qwen2_5_VLProcessor, AutoProcessor
from vllm import LLM, RequestOutput, SamplingParams
from qwen_vl_utils import process_vision_info

import random
import re
import os

IMAGE_PATH = "./images"
MODEL_PATH = "TianheWu/VisualQuality-R1-7B"


def get_image_paths(folder_path):
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
    image_paths = []

    for root, dirs, files in os.walk(folder_path):
        for file in files:
            _, ext = os.path.splitext(file)
            if ext.lower() in image_extensions:
                image_paths.append(os.path.join(root, file))

    return image_paths


def score_batch_image(image_paths, model: LLM, processor: Qwen2_5_VLProcessor):
    PROMPT = (
        "You are doing the image quality assessment task. Here is the question: "
        "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
        "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
    )

    QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."

    messages = []
    for img_path in image_paths:
        message = [
            {
                "role": "user",
                "content": [
                    {'type': 'image', 'image': img_path},
                    {"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
                ],
            }
        ]
        messages.append(message)

    # Preparation for inference
    print("preprocessing ...")
    texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in messages]
    image_inputs, video_inputs = process_vision_info(messages)

    inputs = [{
        "prompt": texts[i],
        "multi_modal_data": {
            "image": image_inputs[i]
        },
    } for i in range(len(messages))]

    # vLLM batches and schedules the requests internally
    outputs: list[RequestOutput] = model.generate(
        inputs,
        sampling_params=SamplingParams(
            max_tokens=512,
            temperature=0.1,
            top_k=50,
            top_p=1.0,
            stop_token_ids=[processor.tokenizer.eos_token_id],
        ),
    )

    all_outputs = [o.outputs[0].text for o in outputs]

    path_score_dict = {}
    for img_path, model_output in zip(image_paths, all_outputs):
        print(f"{model_output = }")
        try:
            model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
            model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
            score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
        except Exception:
            print(f"Meet error with {img_path}, please generate again.")
            score = random.randint(1, 5)

        path_score_dict[img_path] = score

    return path_score_dict


random.seed(1)
model = LLM(
    model=MODEL_PATH,
    tensor_parallel_size=1,
    trust_remote_code=True,
    seed=1,
)

processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"

image_paths = get_image_paths(IMAGE_PATH)  # returns a list of image paths

path_score_dict = score_batch_image(
    image_paths, model, processor
)

file_name = "output.txt"
with open(file_name, "w") as file:
    for key, value in path_score_dict.items():
        file.write(f"{key} {value}\n")

print("Done!")
```
</details>

## Training

### Preparation
1. To run the training procedure smoothly, first download the IQA images and place them all in a **single folder**.
2. Given an original MOS file (e.g., KADID-10K_mos.txt), run `cd datasets` and then `python make_data.py` (with moderate modifications) to generate a **JSON file** for model training.
3. Download [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) into a folder.
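
For orientation only, here is a minimal sketch of step 2's conversion. It assumes each MOS line has the form `image_name mos_value`; the actual KADID-10K file layout and the JSON schema expected by the training code may differ, so treat `make_data.py` in `datasets/` as the source of truth:

```python
import json

def mos_txt_to_json(mos_path, json_path):
    """Convert a plain-text MOS file to a JSON list of records.

    ASSUMPTION: each line is "image_name mos_value"; adjust the parsing
    and the record schema to match what make_data.py actually expects.
    """
    records = []
    with open(mos_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip malformed or empty lines
            records.append({"image": parts[0], "mos": float(parts[1])})
    with open(json_path, "w") as f:
        json.dump(records, f, indent=2)
    return records

# Example usage (hypothetical paths):
# mos_txt_to_json("KADID-10K_mos.txt", "kadid10k_train.json")
```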

### Training within a Single Node
Modify three elements in `src/open-r1-multimodal/run_scripts/KADID-10K/one_node_run_kadid.sh`:
```
--model_name_or_path [Your Qwen2.5-VL-7B-Instruct path] \
--image_folders [Your dataset images path] \
--data_file_paths [Your JSON file path] \
```
Then run:
```
bash src/open-r1-multimodal/run_scripts/KADID-10K/one_node_run_kadid.sh
```

### Training across Multiple Nodes
After making the necessary modifications, run:
```
bash src/open-r1-multimodal/run_scripts/KADID-10K/multi_run_kadid.sh
```


## Acknowledgement
- [VLM-R1](https://github.com/om-ai-lab/VLM-R1): Our codebase builds on VLM-R1.

I would like to sincerely thank [Zhuoyan Luo](https://scholar.google.com/citations?user=mKQhEsIAAAAJ&hl=en&oi=ao) for generously supporting this project and for invaluable guidance in the field of AR generation.


## 📧 Contact
If you have any questions, please email `sigstianhewu@gmail.com` or `tianhewu-c@my.cityu.edu.hk`.

## BibTeX
```
@article{wu2025visualquality,
  title={{VisualQuality-R1}: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank},
  author={Wu, Tianhe and Zou, Jian and Liang, Jie and Zhang, Lei and Ma, Kede},
  journal={arXiv preprint arXiv:2505.14460},
  year={2025}
}
```