---
tags:
- fp8
- vllm
- llm-compressor
- compressed-tensors
library_name: transformers
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: image-text-to-text
base_model: google/gemma-4-31B-it
provider: Google
name: RedHatAI/gemma-4-31B-it-FP8-block
description: FP8-Block variant of gemma-4-31B-it.
readme: https://huggingface.co/RedHatAI/gemma-4-31B-it-FP8-block/blob/main/README.md
tool_calling_supported: true
required_cli_args: ['--reasoning-parser gemma4', '--enable-prefix-caching']
default-chat-template-kwargs: '{"enable_thinking": true}'
chat_template_file_name: None
chat_template_path: None
tool_call_parser: gemma4
validated_tasks:
- tool-calling
tasks:
- text-to-text
- text-generation
- tool-calling
---

# gemma-4-31B-it-FP8-block

## Model Overview
- **Model Architecture:** google/gemma-4-31B-it
- **Input:** Text / Image
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 2026-04-04
- **Version:** 1.0
- **Model Developers:** RedHatAI

This model is a quantized version of [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) to FP8 data type, ready for inference with vLLM.
This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
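As a rough illustration of the 50% figure, the sketch below estimates weight storage at 16 and 8 bits per parameter. It assumes a nominal 31 billion parameters (read off the model name, not an exact count) and treats every parameter as quantized, which slightly overstates the savings because some layers stay in their original precision.

```python
def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to store num_params values at the given bit width."""
    return num_params * bits_per_param // 8


# Assumption: nominal count taken from the "31B" in the model name.
NOMINAL_PARAMS = 31_000_000_000

bf16_gb = weight_bytes(NOMINAL_PARAMS, 16) / 1e9
fp8_gb = weight_bytes(NOMINAL_PARAMS, 8) / 1e9
saving = 1 - fp8_gb / bf16_gb

print(f"16-bit weights: ~{bf16_gb:.0f} GB")
print(f"8-bit weights:  ~{fp8_gb:.0f} GB")
print(f"reduction:      {saving:.0%}")
```

The estimate covers weights only; KV cache and activation memory at serving time are separate and depend on context length and batch size.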

Weights are quantized using block-wise FP8 scaling (128×128 blocks), and activations are quantized dynamically per group (group_size=128). Only the weights and activations of the linear operators within transformer blocks are quantized using [LLM Compressor](https://github.com/vllm-project/llm-compressor). Vision tower, embedding, and output head layers are kept in their original precision.
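To make block-wise weight scaling concrete, here is a minimal pure-Python sketch that computes one scale per block of a weight matrix. It assumes the common convention of scaling each block so that its largest magnitude maps to 448, the largest finite value of the FP8 E4M3 format. The helper name `block_scales` is hypothetical, and this is not LLM Compressor's implementation, which also casts the scaled values to FP8.

```python
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (assumed target format)


def block_scales(weight, block=128):
    """Return one scale per (block x block) tile of a 2-D list of floats.

    Each scale is max(|w|) over the tile divided by FP8_E4M3_MAX, so that
    w / scale falls inside the representable FP8 range.
    """
    rows, cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            tile_max = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            # Guard against an all-zero tile so later division is safe.
            row_scales.append(max(tile_max, 1e-12) / FP8_E4M3_MAX)
        scales.append(row_scales)
    return scales


# Toy 4x4 matrix with 2x2 blocks so the result is easy to check by hand.
w = [
    [1.0, -2.0, 0.5, 0.25],
    [0.5, 4.0, -0.125, 0.75],
    [8.0, 1.0, 16.0, -3.0],
    [-1.0, 2.0, 5.0, 7.0],
]
scales = block_scales(w, block=2)
print(scales)
```

With 128×128 blocks, a weight of shape (out, in) gets a scale grid of ceil(out/128) × ceil(in/128), so the scales add very little storage on top of the 8-bit weights.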

## Deployment

### Use with vLLM

This model can be deployed using [vLLM](https://docs.vllm.ai/en/latest/).
For detailed instructions including multi-GPU deployment, multimodal inference, thinking mode, function calling, and benchmarking, see the [Gemma 4 vLLM usage guide](https://recipes.vllm.ai/Google/gemma-4-31B-it).

1. Start the vLLM server:
```
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```

To enable thinking/reasoning and tool calling:
```
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image": 4, "audio": 1}' \
  --async-scheduling
```

> **Tip:** For text-only workloads, pass `--limit-mm-per-prompt '{"image": 0, "audio": 0}'` to skip vision encoder memory allocation and free up GPU memory for a longer context window.

2. Send requests to the server:

```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/gemma-4-31B-it-FP8-block"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```
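When the server is started with the tool-calling flags shown earlier, a request can carry tool definitions in the OpenAI `tools` format. The sketch below only builds the request arguments, so it runs without a server. The `get_weather` tool and its schema are hypothetical, and passing `chat_template_kwargs` through `extra_body` is an assumption about how vLLM's OpenAI-compatible server exposes the `enable_thinking` switch.

```python
import json

MODEL = "RedHatAI/gemma-4-31B-it-FP8-block"


def build_tool_request(question: str) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    # Hypothetical tool, described in the OpenAI function-calling schema.
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],
        "tool_choice": "auto",
        # Assumption: the server forwards this to the chat template.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": True}},
    }


request = build_tool_request("What is the weather in Boston?")
print(json.dumps(request, indent=2))
```

With a running server, `client.chat.completions.create(**request)` would send it, and any tool calls the model decides to make are returned in `choices[0].message.tool_calls`.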

## Creation

This model was created by applying data-free FP8 block quantization with [LLM Compressor](https://github.com/vllm-project/llm-compressor), as presented in the code snippet below.

<details>

```python
from llmcompressor import model_free_ptq

MODEL_ID = "google/gemma-4-31B-it"
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-block"

model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_BLOCK",
    ignore=["re:.*vision.*", "lm_head", "re:.*embed_tokens.*"],
    max_workers=8,
    device="cuda:0",
)
```

</details>
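The `ignore` list in the snippet mixes a plain module name with `re:`-prefixed entries. The following sketch shows one way such a list could be interpreted, assuming `re:` marks a regular expression matched from the start of the module name and any other entry is an exact name. The `is_ignored` helper and the module names are hypothetical, and LLM Compressor's actual matching logic may differ.

```python
import re

IGNORE = ["re:.*vision.*", "lm_head", "re:.*embed_tokens.*"]


def is_ignored(module_name: str, patterns=IGNORE) -> bool:
    """Return True if module_name matches any ignore entry."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Regex entry: strip the prefix and match from the start of the name.
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            # Plain entry: exact name comparison.
            return True
    return False


# Hypothetical module names, for illustration only.
for name in [
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.language_model.embed_tokens",
    "lm_head",
    "model.language_model.layers.0.mlp.down_proj",
]:
    status = "kept in original precision" if is_ignored(name) else "quantized"
    print(f"{name}: {status}")
```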

## Evaluation

This model was evaluated on GSM8K Platinum, MMLU-Pro, IFEval, MATH-500, AIME 2025, GPQA Diamond, and LiveCodeBench v6 using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness) and [lighteval](https://github.com/neuralmagic/lighteval), served with [vLLM](https://docs.vllm.ai/en/latest/) (OpenAI-compatible API). All evaluations were performed with **thinking enabled**.

### Accuracy

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Benchmark</th>
      <th>google/gemma-4-31B-it</th>
      <th>RedHatAI/gemma-4-31B-it-FP8-block</th>
      <th>Recovery</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2"><b>Instruction Following</b></td>
      <td>IFEval (0-shot, prompt-level strict)</td>
      <td>90.70</td>
      <td>91.25</td>
      <td>100.6%</td>
    </tr>
    <tr>
      <td>IFEval (0-shot, inst-level strict)</td>
      <td>93.45</td>
      <td>94.00</td>
      <td>100.6%</td>
    </tr>
    <tr>
      <td rowspan="5"><b>Reasoning</b></td>
      <td>GSM8K Platinum (0-shot, strict-match)</td>
      <td>95.78</td>
      <td>95.78</td>
      <td>100.0%</td>
    </tr>
    <tr>
      <td>MMLU-Pro (0-shot, custom-extract)</td>
      <td>85.41</td>
      <td>85.44</td>
      <td>100.0%</td>
    </tr>
    <tr>
      <td>MATH-500 (0-shot, pass@1)</td>
      <td>89.40</td>
      <td>88.67</td>
      <td>99.2%</td>
    </tr>
    <tr>
      <td>AIME 2025 (0-shot, pass@1)</td>
      <td>65.83</td>
      <td>68.33</td>
      <td>103.8%</td>
    </tr>
    <tr>
      <td>GPQA Diamond (0-shot, pass@1)</td>
      <td>77.44</td>
      <td>77.95</td>
      <td>100.7%</td>
    </tr>
    <tr>
      <td><b>Coding</b></td>
      <td>LiveCodeBench v6 (0-shot, pass@1)</td>
      <td>71.43</td>
      <td>73.52</td>
      <td>102.9%</td>
    </tr>
  </tbody>
</table>
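The Recovery column is consistent with the quantized score expressed as a percentage of the baseline score. A short check under that assumption, using the scores from the table:

```python
# (benchmark, baseline score, quantized score) copied from the accuracy table.
SCORES = [
    ("IFEval prompt-level strict", 90.70, 91.25),
    ("IFEval inst-level strict", 93.45, 94.00),
    ("GSM8K Platinum", 95.78, 95.78),
    ("MMLU-Pro", 85.41, 85.44),
    ("MATH-500", 89.40, 88.67),
    ("AIME 2025", 65.83, 68.33),
    ("GPQA Diamond", 77.44, 77.95),
    ("LiveCodeBench v6", 71.43, 73.52),
]


def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline, rounded to 0.1."""
    return round(100.0 * quantized / baseline, 1)


for name, base, quant in SCORES:
    print(f"{name}: {recovery(base, quant):.1f}%")
```

Values above 100% mean the quantized model scored higher than the baseline on that benchmark in these runs.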

### Reproduction

The results were obtained using the following commands:

<details>

Each benchmark was run 3 times with different random seeds (1234, 2345, 3456) and the scores were averaged; AIME 2025 used 8 seeds.

**vLLM server (instruction following and reasoning benchmarks):**
```
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --max-model-len 69632 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image":0,"audio":0}' \
  --async-scheduling
```

#### GSM8K Platinum (lm-eval, 0-shot, 3 repetitions)
```
lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-block,max_length=69632,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_gsm8k_platinum.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
```

#### MMLU-Pro (lm-eval, 0-shot, 3 repetitions)
```
lm_eval --model local-chat-completions \
  --tasks mmlu_pro_chat \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-block,max_length=69632,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_mmlu_pro.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
```

#### IFEval (lm-eval, 0-shot, 3 repetitions)
```
lm_eval --model local-chat-completions \
  --tasks ifeval \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-block,max_length=69632,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_ifeval.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
```
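The three lm-eval commands differ only in task name, output path, and seed, and each is repeated once per seed. The sketch below generates the argument list for every task and seed combination. It only builds and prints the commands rather than running them, and the per-seed output file naming is an assumption, not something the original runs are known to have used.

```python
import shlex

MODEL = "RedHatAI/gemma-4-31B-it-FP8-block"
SEEDS = [1234, 2345, 3456]
# Task name -> output file stem, taken from the lm_eval commands.
TASKS = {
    "gsm8k_platinum_cot_llama": "results_gsm8k_platinum",
    "mmlu_pro_chat": "results_mmlu_pro",
    "ifeval": "results_ifeval",
}

MODEL_ARGS = (
    f"model={MODEL},max_length=69632,"
    "base_url=http://0.0.0.0:8000/v1/chat/completions,"
    "num_concurrent=32,max_retries=3,tokenized_requests=False,"
    "tokenizer_backend=None,timeout=3600"
)


def lm_eval_command(task: str, stem: str, seed: int) -> list:
    """Build the lm_eval argument list for one task and one seed."""
    gen_kwargs = (
        "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,"
        f"max_gen_toks=32000,seed={seed}"
    )
    return [
        "lm_eval", "--model", "local-chat-completions",
        "--tasks", task,
        "--model_args", MODEL_ARGS,
        "--num_fewshot", "0",
        "--apply_chat_template",
        # Assumption: one output file per seed so runs do not overwrite each other.
        "--output_path", f"{stem}_seed{seed}.json",
        "--seed", str(seed),
        "--gen_kwargs", gen_kwargs,
    ]


commands = [
    lm_eval_command(task, stem, seed)
    for task, stem in TASKS.items()
    for seed in SEEDS
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the server is up.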

#### MATH-500, AIME 2025, GPQA Diamond (lighteval, 3 repetitions; 8 for AIME 2025)

**litellm_config.yaml:**
```yaml
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/RedHatAI/gemma-4-31B-it-FP8-block
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 3600
  concurrent_requests: 32
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 65536
    top_p: 0.95
    top_k: 64
    seed: 1234
```

Run once per seed (changing `seed` in the config each time):
```
lighteval endpoint litellm litellm_config.yaml 'math_500|0' \
  --output-dir results/ --save-details

lighteval endpoint litellm litellm_config.yaml 'aime25|0' \
  --output-dir results/ --save-details

lighteval endpoint litellm litellm_config.yaml 'gpqa:diamond|0' \
  --output-dir results/ --save-details
```
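Changing `seed` by hand for every run is easy to get wrong, so one option is to render a separate config file per seed. This sketch does that with plain string formatting. The file naming scheme is hypothetical, the nesting of `generation_parameters` under `model_parameters` is assumed from lighteval's example configs, and only the three seeds listed earlier are used because the additional AIME 2025 seeds are not given.

```python
import tempfile
from pathlib import Path

SEEDS = [1234, 2345, 3456]

# Assumption: generation_parameters is nested under model_parameters.
CONFIG_TEMPLATE = """\
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/RedHatAI/gemma-4-31B-it-FP8-block
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 3600
  concurrent_requests: 32
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 65536
    top_p: 0.95
    top_k: 64
    seed: {seed}
"""


def write_configs(out_dir: Path, seeds=SEEDS) -> list:
    """Write one litellm config per seed and return the file paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for seed in seeds:
        path = out_dir / f"litellm_config_seed{seed}.yaml"
        path.write_text(CONFIG_TEMPLATE.format(seed=seed))
        paths.append(path)
    return paths


config_paths = write_configs(Path(tempfile.mkdtemp()))
for path in config_paths:
    # Each generated file stands in for litellm_config.yaml in the commands.
    print(
        f"lighteval endpoint litellm {path} 'math_500|0' "
        "--output-dir results/ --save-details"
    )
```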

#### LiveCodeBench v6 (lighteval, 3 repetitions)

**vLLM server:**
```
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --max-model-len 36864 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image":0,"audio":0}' \
  --async-scheduling
```

**litellm_config.yaml:**
```yaml
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/RedHatAI/gemma-4-31B-it-FP8-block
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 1200
  concurrent_requests: 32
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 32768
    top_p: 0.95
    top_k: 64
    seed: 1234
```

Run once per seed:
```
lighteval endpoint litellm litellm_config.yaml 'lcb:codegeneration_v6|0' \
  --output-dir results/ --save-details
```

</details>