---
license: mit
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: reinforcement-learning
tags:
- IQA
- Reasoning
- VLM
- Pytorch
- R1
- GRPO
- RL2R
---

# VisualQuality-R1-7B
Our paper has been accepted as a **spotlight** at NeurIPS 2025!
This is the latest version of VisualQuality-R1, trained on a diverse combination of synthetic and realistic datasets.<br>
Paper link: [arXiv](https://arxiv.org/abs/2505.14460)<br>
Code link: [GitHub](https://github.com/TianheWu/VisualQuality-R1)

> The first NR-IQA model enhanced by RL2R, capable of both quality description and rating through reasoning.


<img src="https://cdn-uploads.huggingface.co/production/uploads/655de51982afda0fc479fb91/JZgVeMtAVASCCNYO5VCyn.png" width="600"/>


## ⚡Quick Start

### Non-Thinking Inference
When you run VisualQuality-R1 as a reward/evaluation model, you can use **non-thinking** mode to reduce inference time: the model skips the reasoning trace and directly generates the final score, using the following prompt:
```
PROMPT = (
    "You are doing the image quality assessment task. Here is the question: "
    "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
    "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)

QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."
```
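
In this mode, the model's reply contains just the tagged score, e.g. `<answer>4.25</answer>`. Here is a minimal sketch of the extraction step shared by the scripts below (the reply string is only illustrative):

```python
import re

# Illustrative non-thinking reply; the actual value depends on the image.
reply = "<answer>4.25</answer>"

# Take the last <answer>...</answer> span and parse the first number in it,
# exactly as the example scripts below do.
matches = re.findall(r'<answer>(.*?)</answer>', reply, re.DOTALL)
answer = matches[-1].strip() if matches else reply.strip()
score = float(re.search(r'\d+(\.\d+)?', answer).group())
print(score)  # 4.25
```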

For single-image quality rating, the code is:

<details>
<summary>Example Code (VisualQuality-R1: Single Image Quality Rating with non-thinking mode)</summary>

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

import torch
import random
import re


def score_image(image_path, model, processor):
    PROMPT = (
        "You are doing the image quality assessment task. Here is the question: "
        "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
        "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
    )

    QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."
    message = [
        {
            "role": "user",
            "content": [
                {'type': 'image', 'image': image_path},
                {"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
            ],
        }
    ]

    batch_messages = [message]

    # Preparation for inference
    text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
    image_inputs, video_inputs = process_vision_info(batch_messages)
    inputs = processor(
        text=text,
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(device)

    # Inference: generation of the output
    generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=2048, do_sample=True, top_k=50, top_p=1)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    batch_output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    # Non-thinking mode produces no reasoning trace
    reasoning = None

    try:
        model_output_matches = re.findall(r'<answer>(.*?)</answer>', batch_output_text[0], re.DOTALL)
        model_answer = model_output_matches[-1].strip() if model_output_matches else batch_output_text[0].strip()
        score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
    except Exception:
        print(f"================= Meet error with {image_path}, please generate again. =================")
        score = random.randint(1, 5)

    return reasoning, score


random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:5") if torch.cuda.is_available() else torch.device("cpu")
image_path = ""

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"

reasoning, score = score_image(
    image_path, model, processor
)

print(score)
```
</details>


<details>
<summary>Example Code (VisualQuality-R1: Batch Image Quality Rating with non-thinking mode)</summary>

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from tqdm import tqdm

import torch
import random
import re
import os


def get_image_paths(folder_path):
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
    image_paths = []

    for root, dirs, files in os.walk(folder_path):
        for file in files:
            _, ext = os.path.splitext(file)
            if ext.lower() in image_extensions:
                image_paths.append(os.path.join(root, file))

    return image_paths


def score_batch_image(image_paths, model, processor):
    PROMPT = (
        "You are doing the image quality assessment task. Here is the question: "
        "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
        "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
    )

    QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."

    messages = []
    for img_path in image_paths:
        message = [
            {
                "role": "user",
                "content": [
                    {'type': 'image', 'image': img_path},
                    {"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
                ],
            }
        ]
        messages.append(message)

    BSZ = 32
    all_outputs = []  # List to store all answers
    for i in tqdm(range(0, len(messages), BSZ)):
        batch_messages = messages[i:i + BSZ]

        # Preparation for inference
        text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]

        image_inputs, video_inputs = process_vision_info(batch_messages)
        inputs = processor(
            text=text,
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt",
        )
        inputs = inputs.to(device)

        # Inference: generation of the output
        generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=512, do_sample=True, top_k=50, top_p=1)
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        batch_output_text = processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )

        all_outputs.extend(batch_output_text)

    path_score_dict = {}
    for img_path, model_output in zip(image_paths, all_outputs):
        try:
            model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
            model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
            score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
        except Exception:
            print(f"Meet error with {img_path}, please generate again.")
            score = random.randint(1, 5)

        path_score_dict[img_path] = score

    return path_score_dict


random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:3") if torch.cuda.is_available() else torch.device("cpu")

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"

image_root = ""
image_paths = get_image_paths(image_root)  # It should be a list

path_score_dict = score_batch_image(
    image_paths, model, processor
)

file_name = "output.txt"
with open(file_name, "w") as file:
    for key, value in path_score_dict.items():
        file.write(f"{key} {value}\n")

print("Done!")
```
</details>

### Thinking Inference
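
In thinking mode, the model first writes its reasoning inside `<think> </think>` tags and then gives the score inside `<answer> </answer>` tags. The scripts below use the same `PROMPT` as above, with this template:
```
QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
```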

<details>
<summary>Example Code (VisualQuality-R1: Single Image Quality Rating with thinking mode)</summary>

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

import torch
import random
import re


def score_image(image_path, model, processor):
    PROMPT = (
        "You are doing the image quality assessment task. Here is the question: "
        "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
        "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
    )

    QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
    # Alternative prompt for quality description:
    # QUESTION_TEMPLATE = "Please describe the quality of this image."
    message = [
        {
            "role": "user",
            "content": [
                {'type': 'image', 'image': image_path},
                {"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
            ],
        }
    ]

    batch_messages = [message]

    # Preparation for inference
    text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
    image_inputs, video_inputs = process_vision_info(batch_messages)
    inputs = processor(
        text=text,
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(device)

    # Inference: generation of the output
    generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=2048, do_sample=True, top_k=50, top_p=1)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    batch_output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    # Extract the last reasoning trace, if any
    reasoning_matches = re.findall(r'<think>(.*?)</think>', batch_output_text[0], re.DOTALL)
    reasoning = reasoning_matches[-1].strip() if reasoning_matches else None

    try:
        model_output_matches = re.findall(r'<answer>(.*?)</answer>', batch_output_text[0], re.DOTALL)
        model_answer = model_output_matches[-1].strip() if model_output_matches else batch_output_text[0].strip()
        score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
    except Exception:
        print(f"================= Meet error with {image_path}, please generate again. =================")
        score = random.randint(1, 5)

    return reasoning, score


random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:5") if torch.cuda.is_available() else torch.device("cpu")
image_path = ""

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"

reasoning, score = score_image(
    image_path, model, processor
)

print(reasoning)
print(score)
```
</details>


<details>
<summary>Example Code (VisualQuality-R1: Batch Image Quality Rating with thinking mode)</summary>

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from tqdm import tqdm

import torch
import random
import re
import os


def get_image_paths(folder_path):
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
    image_paths = []

    for root, dirs, files in os.walk(folder_path):
        for file in files:
            _, ext = os.path.splitext(file)
            if ext.lower() in image_extensions:
                image_paths.append(os.path.join(root, file))

    return image_paths


def score_batch_image(image_paths, model, processor):
    PROMPT = (
        "You are doing the image quality assessment task. Here is the question: "
        "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
        "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
    )

    QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."

    messages = []
    for img_path in image_paths:
        message = [
            {
                "role": "user",
                "content": [
                    {'type': 'image', 'image': img_path},
                    {"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
                ],
            }
        ]
        messages.append(message)

    BSZ = 32
    all_outputs = []  # List to store all answers
    for i in tqdm(range(0, len(messages), BSZ)):
        batch_messages = messages[i:i + BSZ]

        # Preparation for inference
        text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]

        image_inputs, video_inputs = process_vision_info(batch_messages)
        inputs = processor(
            text=text,
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt",
        )
        inputs = inputs.to(device)

        # Inference: generation of the output
        generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=512, do_sample=True, top_k=50, top_p=1)
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        batch_output_text = processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )

        all_outputs.extend(batch_output_text)

    path_score_dict = {}
    for img_path, model_output in zip(image_paths, all_outputs):
        # Reasoning trace is extracted but not stored here; adapt as needed
        reasoning_matches = re.findall(r'<think>(.*?)</think>', model_output, re.DOTALL)
        reasoning = reasoning_matches[-1].strip() if reasoning_matches else None

        try:
            model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
            model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
            score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
        except Exception:
            print(f"Meet error with {img_path}, please generate again.")
            score = random.randint(1, 5)

        path_score_dict[img_path] = score

    return path_score_dict


random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:3") if torch.cuda.is_available() else torch.device("cpu")

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"

image_root = ""
image_paths = get_image_paths(image_root)  # It should be a list

path_score_dict = score_batch_image(
    image_paths, model, processor
)

file_name = "output.txt"
with open(file_name, "w") as file:
    for key, value in path_score_dict.items():
        file.write(f"{key} {value}\n")

print("Done!")
```
</details>


## 🚀 Updated: High-Efficiency VisualQuality-R1 Inference with vLLM

<details>
<summary>Example Code (VisualQuality-R1: Batch Image Quality Rating with thinking mode, using vLLM)</summary>

```python
# Please install vLLM first: https://docs.vllm.ai/en/stable/getting_started/installation/gpu.html

from transformers import Qwen2_5_VLProcessor, AutoProcessor
from vllm import LLM, RequestOutput, SamplingParams
from qwen_vl_utils import process_vision_info

import random
import re
import os

IMAGE_PATH = "./images"
MODEL_PATH = "TianheWu/VisualQuality-R1-7B"


def get_image_paths(folder_path):
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
    image_paths = []

    for root, dirs, files in os.walk(folder_path):
        for file in files:
            _, ext = os.path.splitext(file)
            if ext.lower() in image_extensions:
                image_paths.append(os.path.join(root, file))

    return image_paths


def score_batch_image(image_paths, model: LLM, processor: Qwen2_5_VLProcessor):
    PROMPT = (
        "You are doing the image quality assessment task. Here is the question: "
        "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
        "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
    )

    QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."

    messages = []
    for img_path in image_paths:
        message = [
            {
                "role": "user",
                "content": [
                    {'type': 'image', 'image': img_path},
                    {"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
                ],
            }
        ]
        messages.append(message)

    all_outputs = []  # List to store all answers

    # Preparation for inference
    print("preprocessing ...")
    texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in messages]
    image_inputs, video_inputs = process_vision_info(messages)

    inputs = [{
        "prompt": texts[i],
        "multi_modal_data": {
            "image": image_inputs[i]
        },
    } for i in range(len(messages))]

    output: list[RequestOutput] = model.generate(
        inputs,
        sampling_params=SamplingParams(
            max_tokens=512,
            temperature=0.1,
            top_k=50,
            top_p=1.0,
            stop_token_ids=[processor.tokenizer.eos_token_id],
        ),
    )

    batch_output_text = [o.outputs[0].text for o in output]

    all_outputs.extend(batch_output_text)

    path_score_dict = {}
    for img_path, model_output in zip(image_paths, all_outputs):
        print(f"{model_output = }")
        try:
            model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
            model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
            score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
        except Exception:
            print(f"Meet error with {img_path}, please generate again.")
            score = random.randint(1, 5)

        path_score_dict[img_path] = score

    return path_score_dict


random.seed(1)
model = LLM(
    model=MODEL_PATH,
    tensor_parallel_size=1,
    trust_remote_code=True,
    seed=1,
)

processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"

image_paths = get_image_paths(IMAGE_PATH)  # It should be a list

path_score_dict = score_batch_image(
    image_paths, model, processor
)

file_name = "output.txt"
with open(file_name, "w") as file:
    for key, value in path_score_dict.items():
        file.write(f"{key} {value}\n")

print("Done!")
```
</details>

## Training

### Preparation
1. To run the training procedure smoothly, first download the IQA images and place them all in a **single folder**.
2. Given an original MOS file (e.g., KADID-10K_mos.txt), run `cd datasets` and then `python make_data.py` (with moderate modifications) to generate a **JSON file** for model training; a sketch of this conversion follows the list.
3. Download [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) into a folder.
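
For orientation, here is a hypothetical sketch of the kind of conversion `make_data.py` performs: turning "image-name score" MOS lines into a JSON list. The field names (`image`, `mos`) and file names are assumptions for illustration; consult `datasets/make_data.py` in the repository for the exact schema.

```python
import json

# Assumed MOS format: one "image_name score" pair per line (illustrative only).
entries = []
with open("KADID-10K_mos.txt") as f:
    for line in f:
        if not line.strip():
            continue
        image_name, mos = line.split()
        entries.append({"image": image_name, "mos": float(mos)})

# Write the training JSON consumed via --data_file_paths (schema assumed).
with open("kadid10k_train.json", "w") as f:
    json.dump(entries, f, indent=2)
```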

### Training on a Single Node
Modify three arguments in `src/open-r1-multimodal/run_scripts/KADID-10K/one_node_run_kadid.sh`:
```
--model_name_or_path [Your Qwen2.5-VL-7B-Instruct path] \
--image_folders [Your dataset images path] \
--data_file_paths [Your JSON file path] \
```
Then, run:
```
bash src/open-r1-multimodal/run_scripts/KADID-10K/one_node_run_kadid.sh
```

### Training on Multiple Nodes
After making the necessary modifications, run:
```
bash src/open-r1-multimodal/run_scripts/KADID-10K/multi_run_kadid.sh
```


## Acknowledgement
- [VLM-R1](https://github.com/om-ai-lab/VLM-R1): Our codebase builds on VLM-R1.

I would like to sincerely thank [Zhuoyan Luo](https://scholar.google.com/citations?user=mKQhEsIAAAAJ&hl=en&oi=ao) for the generous support of my project and for the invaluable guidance in the field of AR generation.


## 📧 Contact
If you have any questions, please email `sigstianhewu@gmail.com` or `tianhewu-c@my.cityu.edu.hk`.

## BibTeX
```
@article{wu2025visualquality,
  title={{VisualQuality-R1}: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank},
  author={Wu, Tianhe and Zou, Jian and Liang, Jie and Zhang, Lei and Ma, Kede},
  journal={arXiv preprint arXiv:2505.14460},
  year={2025}
}
```