---
tags:
- fp8
- vllm
- llm-compressor
- compressed-tensors
library_name: transformers
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: image-text-to-text
base_model: google/gemma-4-31B-it
provider: Google
name: RedHatAI/gemma-4-31B-it-FP8-block
description: FP8-Block variant of gemma-4-31B-it.
readme: https://huggingface.co/RedHatAI/gemma-4-31B-it-FP8-block/blob/main/README.md
tool_calling_supported: true
required_cli_args: ['--reasoning-parser gemma4', '--enable-prefix-caching']
default-chat-template-kwargs: '{"enable_thinking": true}'
chat_template_file_name: None
chat_template_path: None
tool_call_parser: gemma4
validated_tasks:
- tool-calling
tasks:
- text-to-text
- text-generation
- tool-calling
---

# gemma-4-31B-it-FP8-block

## Model Overview
- **Model Architecture:** google/gemma-4-31B-it
- **Input:** Text / Image
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 2026-04-04
- **Version:** 1.0
- **Model Developers:** RedHatAI

This model is a quantized version of [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) to the FP8 data type, ready for inference with vLLM.
This optimization reduces the number of bits per parameter from 16 to 8, reducing disk size and GPU memory requirements by approximately 50%.

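As a rough check of the 50% figure, the sketch below compares weight storage at 16 and 8 bits per parameter. It assumes a nominal 31 billion parameters (read off the "31B" in the model name) and ignores both the layers kept in their original precision and the storage needed for quantization scales, so the real saving will be somewhat smaller.

```python
def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: nominal parameter count taken from the "31B" in the model name.
NUM_PARAMS = 31e9

bf16_gb = weight_footprint_gb(NUM_PARAMS, 16)
fp8_gb = weight_footprint_gb(NUM_PARAMS, 8)
saving = 1 - fp8_gb / bf16_gb

print(f"16-bit: {bf16_gb:.0f} GB, 8-bit: {fp8_gb:.0f} GB, saving: {saving:.0%}")
```
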
Weights are quantized using block-wise FP8 scaling (128×128 blocks), and activations are quantized dynamically per group (group_size=128). Quantization is applied with [LLM Compressor](https://github.com/vllm-project/llm-compressor) only to the weights and activations of the linear operators within transformer blocks. The vision tower, embedding, and output head layers are kept in their original precision.

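To make the two granularities concrete, here is a minimal pure-Python sketch of how per-block weight scales and per-group activation scales could be computed. It assumes the common convention of mapping each tile's largest absolute value onto the FP8 E4M3 maximum of 448. That convention is an assumption, not a confirmed description of LLM Compressor's implementation, and the rounding of scaled values to FP8 is omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3 (assumed format)


def block_scales(weight, block=128):
    """Return one scale per (block x block) tile of a 2-D weight matrix."""
    rows, cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            amax = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            # Guard against all-zero tiles so the scale stays non-zero.
            row_scales.append(max(amax, 1e-12) / FP8_E4M3_MAX)
        scales.append(row_scales)
    return scales


def group_scales(activation, group_size=128):
    """Return one scale per contiguous group of a 1-D activation vector."""
    return [
        max(max(abs(x) for x in activation[i:i + group_size]), 1e-12) / FP8_E4M3_MAX
        for i in range(0, len(activation), group_size)
    ]


# Tiny illustration with block=2 and group_size=2 instead of 128.
w = [[1.0, -2.0, 0.5, 0.25],
     [4.0, 0.0, -0.125, 8.0]]
w_scales = block_scales(w, block=2)
a_scales = group_scales([1.0, -3.0, 0.5, 2.0], group_size=2)
print(w_scales, a_scales)
```

Weight scales are computed once at quantization time, whereas the activation scales in this scheme are recomputed for every input at inference time, which is what "dynamically" refers to.
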
## Deployment

### Use with vLLM

This model can be deployed using [vLLM](https://docs.vllm.ai/en/latest/).
For detailed instructions including multi-GPU deployment, multimodal inference, thinking mode, function calling, and benchmarking, see the [Gemma 4 vLLM usage guide](https://recipes.vllm.ai/Google/gemma-4-31B-it).

1. Start the vLLM server:
```
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```

To enable thinking/reasoning and tool calling:
```
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image": 4, "audio": 1}' \
  --async-scheduling
```

> **Tip:** For text-only workloads, pass `--limit-mm-per-prompt '{"image": 0, "audio": 0}'` to skip vision encoder memory allocation and free up GPU memory for a longer context window.

2. Send requests to the server:

```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/gemma-4-31B-it-FP8-block"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```

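When the server is started with `--enable-auto-tool-choice` and `--tool-call-parser gemma4`, a tool-calling request can be sent through the same OpenAI-compatible endpoint. The sketch below only builds the request arguments, so it runs without a server. The `get_weather` tool and the `build_tool_request` helper are hypothetical and exist purely for illustration.

```python
import json


def build_tool_request(model: str, question: str) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    # Hypothetical tool definition in the OpenAI function-calling schema.
    get_weather = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [get_weather],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }


request = build_tool_request(
    "RedHatAI/gemma-4-31B-it-FP8-block",
    "What is the weather in Paris?",
)
print(json.dumps(request, indent=2))

# With a running server and the `client` from the example above:
# response = client.chat.completions.create(**request)
# for call in response.choices[0].message.tool_calls or []:
#     print(call.function.name, json.loads(call.function.arguments))
```
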
## Creation

This model was created by applying data-free FP8 block quantization with [LLM Compressor](https://github.com/vllm-project/llm-compressor), as presented in the code snippet below.

<details>

```python
from llmcompressor import model_free_ptq

MODEL_ID = "google/gemma-4-31B-it"
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-block"

model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_BLOCK",
    ignore=["re:.*vision.*", "lm_head", "re:.*embed_tokens.*"],
    max_workers=8,
    device="cuda:0",
)
```

</details>

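The `ignore` list in the snippet above determines which layers stay in their original precision. The sketch below illustrates one plausible reading of those entries: items prefixed with `re:` are treated as regular expressions matched from the start of the module name, and all other items must equal the name exactly. Both the matching rule and the module names are assumptions for illustration, not taken from the checkpoint or from LLM Compressor's source.

```python
import re

IGNORE = ["re:.*vision.*", "lm_head", "re:.*embed_tokens.*"]


def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    """Return True if the module would be skipped under the assumed rule."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Regex entry: strip the "re:" prefix and match from the start.
            if re.match(entry[3:], module_name):
                return True
        elif module_name == entry:
            # Plain entry: exact name match.
            return True
    return False


# Hypothetical module names, for illustration only.
for name in [
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.language_model.embed_tokens",
    "lm_head",
    "model.language_model.layers.0.mlp.down_proj",
]:
    print(f"{name}: {'kept in original precision' if is_ignored(name) else 'quantized'}")
```
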
## Evaluation

This model was evaluated on GSM8K Platinum, MMLU-Pro, IFEval, MATH-500, AIME 2025, GPQA Diamond, and LiveCodeBench v6 using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness) and [lighteval](https://github.com/neuralmagic/lighteval), served with [vLLM](https://docs.vllm.ai/en/latest/) (OpenAI-compatible API). All evaluations were performed with **thinking enabled**.

### Accuracy

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Benchmark</th>
      <th>google/gemma-4-31B-it</th>
      <th>RedHatAI/gemma-4-31B-it-FP8-block</th>
      <th>Recovery</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2"><b>Instruction Following</b></td>
      <td>IFEval (0-shot, prompt-level strict)</td>
      <td>90.70</td>
      <td>91.25</td>
      <td>100.6%</td>
    </tr>
    <tr>
      <td>IFEval (0-shot, inst-level strict)</td>
      <td>93.45</td>
      <td>94.00</td>
      <td>100.6%</td>
    </tr>
    <tr>
      <td rowspan="5"><b>Reasoning</b></td>
      <td>GSM8K Platinum (0-shot, strict-match)</td>
      <td>95.78</td>
      <td>95.78</td>
      <td>100.0%</td>
    </tr>
    <tr>
      <td>MMLU-Pro (0-shot, custom-extract)</td>
      <td>85.41</td>
      <td>85.44</td>
      <td>100.0%</td>
    </tr>
    <tr>
      <td>MATH-500 (0-shot, pass@1)</td>
      <td>89.40</td>
      <td>88.67</td>
      <td>99.2%</td>
    </tr>
    <tr>
      <td>AIME 2025 (0-shot, pass@1)</td>
      <td>65.83</td>
      <td>68.33</td>
      <td>103.8%</td>
    </tr>
    <tr>
      <td>GPQA Diamond (0-shot, pass@1)</td>
      <td>77.44</td>
      <td>77.95</td>
      <td>100.7%</td>
    </tr>
    <tr>
      <td><b>Coding</b></td>
      <td>LiveCodeBench v6 (0-shot, pass@1)</td>
      <td>71.43</td>
      <td>73.52</td>
      <td>102.9%</td>
    </tr>
  </tbody>
</table>

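The Recovery column is consistent with the quantized score expressed as a percentage of the baseline score, rounded to one decimal place. The table does not state this definition, so it is an assumption here. The sketch below recomputes the column from the scores in the table.

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline, to one decimal."""
    return round(100.0 * quantized / baseline, 1)


# (baseline, quantized) scores copied from the table above.
scores = {
    "IFEval (prompt-level strict)": (90.70, 91.25),
    "IFEval (inst-level strict)": (93.45, 94.00),
    "GSM8K Platinum": (95.78, 95.78),
    "MMLU-Pro": (85.41, 85.44),
    "MATH-500": (89.40, 88.67),
    "AIME 2025": (65.83, 68.33),
    "GPQA Diamond": (77.44, 77.95),
    "LiveCodeBench v6": (71.43, 73.52),
}

for name, (base, quant) in scores.items():
    print(f"{name}: {recovery(base, quant):.1f}%")
```
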
### Reproduction

The results were obtained using the following commands:

<details>

Each benchmark was run 3 times with different random seeds (1234, 2345, 3456) and the scores were averaged; AIME 2025 used 8 seeds.

**vLLM server (instruction following and reasoning benchmarks):**
```
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --max-model-len 69632 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image":0,"audio":0}' \
  --async-scheduling
```

#### GSM8K Platinum (lm-eval, 0-shot, 3 repetitions)
```
lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-block,max_length=69632,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_gsm8k_platinum.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
```

#### MMLU-Pro (lm-eval, 0-shot, 3 repetitions)
```
lm_eval --model local-chat-completions \
  --tasks mmlu_pro_chat \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-block,max_length=69632,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_mmlu_pro.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
```

#### IFEval (lm-eval, 0-shot, 3 repetitions)
```
lm_eval --model local-chat-completions \
  --tasks ifeval \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-block,max_length=69632,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_ifeval.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
```

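The three lm-eval commands above differ only in the task name, output path, and seed. As a convenience, the sketch below generates one command per task and seed. It assumes that both `--seed` and the `seed` entry in `--gen_kwargs` were changed together for each repetition. The per-seed output file names are hypothetical.

```python
import shlex

SEEDS = [1234, 2345, 3456]
TASKS = ["gsm8k_platinum_cot_llama", "mmlu_pro_chat", "ifeval"]
MODEL = "RedHatAI/gemma-4-31B-it-FP8-block"
BASE_URL = "http://0.0.0.0:8000/v1/chat/completions"


def lm_eval_command(task: str, seed: int) -> list:
    """Build the lm_eval argument list for one task and one seed."""
    model_args = (
        f"model={MODEL},max_length=69632,base_url={BASE_URL},"
        "num_concurrent=32,max_retries=3,tokenized_requests=False,"
        "tokenizer_backend=None,timeout=3600"
    )
    gen_kwargs = (
        "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,"
        f"max_gen_toks=32000,seed={seed}"
    )
    return [
        "lm_eval", "--model", "local-chat-completions",
        "--tasks", task,
        "--model_args", model_args,
        "--num_fewshot", "0",
        "--apply_chat_template",
        # Hypothetical naming so runs with different seeds do not overwrite each other.
        "--output_path", f"results_{task}_seed{seed}.json",
        "--seed", str(seed),
        "--gen_kwargs", gen_kwargs,
    ]


commands = [lm_eval_command(task, seed) for task in TASKS for seed in SEEDS]
for cmd in commands:
    print(shlex.join(cmd))
    # To execute instead of print: subprocess.run(cmd, check=True)
```
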
#### MATH-500, AIME 2025, GPQA Diamond (lighteval, 3 repetitions; 8 for AIME 2025)

**litellm_config.yaml:**
```yaml
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/RedHatAI/gemma-4-31B-it-FP8-block
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 3600
  concurrent_requests: 32
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 65536
    top_p: 0.95
    top_k: 64
    seed: 1234
```

Run once per seed (changing `seed` in the config each time):
```
lighteval endpoint litellm litellm_config.yaml 'math_500|0' \
  --output-dir results/ --save-details

lighteval endpoint litellm litellm_config.yaml 'aime25|0' \
  --output-dir results/ --save-details

lighteval endpoint litellm litellm_config.yaml 'gpqa:diamond|0' \
  --output-dir results/ --save-details
```

#### LiveCodeBench v6 (lighteval, 3 repetitions)

**vLLM server:**
```
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --max-model-len 36864 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image":0,"audio":0}' \
  --async-scheduling
```

**litellm_config.yaml:**
```yaml
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/RedHatAI/gemma-4-31B-it-FP8-block
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 1200
  concurrent_requests: 32
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 32768
    top_p: 0.95
    top_k: 64
    seed: 1234
```

Run once per seed:
```
lighteval endpoint litellm litellm_config.yaml 'lcb:codegeneration_v6|0' \
  --output-dir results/ --save-details
```

</details>
