---
tags:
- fp8
- vllm
- llm-compressor
- compressed-tensors
library_name: transformers
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: image-text-to-text
base_model: google/gemma-4-31B-it
provider: Google
name: RedHatAI/gemma-4-31B-it-FP8-block
description: FP8-Block variant of gemma-4-31B-it.
readme: https://huggingface.co/RedHatAI/gemma-4-31B-it-FP8-block/blob/main/README.md
tool_calling_supported: true
required_cli_args: ['--reasoning-parser gemma4', '--enable-prefix-caching']
default-chat-template-kwargs: '{"enable_thinking": true}'
chat_template_file_name: None
chat_template_path: None
tool_call_parser: gemma4
validated_tasks:
- tool-calling
tasks:
- text-to-text
- text-generation
- tool-calling
---

# gemma-4-31B-it-FP8-block

## Model Overview
- Model Architecture: google/gemma-4-31B-it
- Input: Text / Image
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: 2026-04-04
- Version: 1.0
- Model Developers: RedHatAI

This model is a quantized version of [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) to the FP8 data type and is ready for inference with vLLM.
This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
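
The 50% figure can be checked with a back-of-the-envelope calculation. The sketch below assumes roughly 31 billion parameters (read off the "31B" in the model name, not an exact count) and ignores the quantization scales as well as the layers that stay in their original precision, so the real saving is somewhat smaller.

```python
# Rough weight-memory estimate. PARAMS is an assumption taken from the
# "31B" in the model name; it is not an exact parameter count.
PARAMS = 31e9


def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Return the weight footprint in GiB for a given storage width."""
    return num_params * bits_per_param / 8 / 2**30


bf16 = weight_gib(PARAMS, 16)  # unquantized, 16 bits per parameter
fp8 = weight_gib(PARAMS, 8)    # quantized, 8 bits per parameter
print(f"16-bit: {bf16:.1f} GiB, FP8: {fp8:.1f} GiB, saving: {1 - fp8 / bf16:.0%}")
```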

Weights are quantized using block-wise FP8 scaling (128×128 blocks), and activations are quantized dynamically per group (group_size=128). Quantization is applied with [LLM Compressor](https://github.com/vllm-project/llm-compressor) and covers only the weights and activations of the linear operators within transformer blocks. The vision tower, embedding, and output head layers are kept in their original precision.
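
To make the two granularities concrete, here is a minimal pure-Python sketch of how per-block weight scales and per-group activation scales could be computed. It assumes the common convention `scale = max(|x|) / 448`, where 448 is the largest finite value of the FP8 E4M3 format; the function names are hypothetical, the rounding onto the FP8 grid is omitted, and this is not LLM Compressor's actual implementation.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 format


def block_scales(weight, block=128):
    """Return one scale per (block x block) tile of a 2-D list of floats."""
    rows, cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            tile_max = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            # Map the largest magnitude in the tile onto the FP8 range.
            row_scales.append(tile_max / FP8_E4M3_MAX if tile_max else 1.0)
        scales.append(row_scales)
    return scales


def group_scales(activation, group_size=128):
    """Return one scale per contiguous group of a 1-D activation vector."""
    out = []
    for g0 in range(0, len(activation), group_size):
        group_max = max(abs(x) for x in activation[g0:g0 + group_size])
        out.append(group_max / FP8_E4M3_MAX if group_max else 1.0)
    return out


# Tiny demo with block=2 and group_size=2 so the result is easy to inspect.
w = [[1.0, -2.0, 0.5, 0.25],
     [4.0, 0.0, -0.5, 8.0]]
w_scales = block_scales(w, block=2)
a_scales = group_scales([448.0, -1.0, 0.0, 224.0], group_size=2)
```

Weight scales are computed once, offline, because the weights are static. Activation scales are recomputed for every input at inference time, which is what "dynamically" refers to.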

## Deployment

### Use with vLLM

This model can be deployed using [vLLM](https://docs.vllm.ai/en/latest/).
For detailed instructions including multi-GPU deployment, multimodal inference, thinking mode, function calling, and benchmarking, see the [Gemma 4 vLLM usage guide](https://recipes.vllm.ai/Google/gemma-4-31B-it).

1. Start the vLLM server:
```
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```

To enable thinking/reasoning and tool calling:
```
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image": 4, "audio": 1}' \
  --async-scheduling
```

> Tip: For text-only workloads, pass `--limit-mm-per-prompt '{"image": 0, "audio": 0}'` to skip vision encoder memory allocation and free up GPU memory for a longer context window.

2. Send requests to the server:

```python
from openai import OpenAI

# vLLM does not check the API key by default; any placeholder works.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/gemma-4-31B-it-FP8-block"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```
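
Thinking can also be toggled per request. The sketch below builds the keyword arguments for `client.chat.completions.create()` and passes `chat_template_kwargs` through the OpenAI client's `extra_body`, which vLLM forwards to the chat template. The `enable_thinking` key is taken from this card's metadata; that the Gemma 4 chat template honors it per request is an assumption, and `build_chat_request` is a hypothetical helper.

```python
def build_chat_request(prompt: str, enable_thinking: bool = True) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": "RedHatAI/gemma-4-31B-it-FP8-block",
        "messages": [{"role": "user", "content": prompt}],
        # extra_body is merged into the request JSON by the OpenAI client;
        # vLLM passes chat_template_kwargs on to the chat template.
        "extra_body": {
            "chat_template_kwargs": {"enable_thinking": enable_thinking}
        },
    }


request = build_chat_request("What is 17 * 23?", enable_thinking=False)
# With a running server: outputs = client.chat.completions.create(**request)
```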

## Creation

This model was created by applying data-free FP8 block quantization with [LLM Compressor](https://github.com/vllm-project/llm-compressor), as presented in the code snippet below.

<details>

```python
from llmcompressor import model_free_ptq

MODEL_ID = "google/gemma-4-31B-it"
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-block"

model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_BLOCK",
    # Keep the vision tower, output head, and embeddings in their original precision.
    ignore=["re:.*vision.*", "lm_head", "re:.*embed_tokens.*"],
    max_workers=8,
    device="cuda:0",
)
```

</details>
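
The sketch below illustrates how an `ignore` list of this shape can select modules. It assumes that entries prefixed with `re:` are regular expressions matched from the start of the module name and that all other entries must match the name exactly. The helper `is_ignored`, the wildcard patterns, and the module names are illustrative assumptions, not LLM Compressor's actual matching code or the model's real layer names.

```python
import re


def is_ignored(module_name: str, ignore: list[str]) -> bool:
    """Return True if module_name matches any entry of an ignore list."""
    for target in ignore:
        if target.startswith("re:"):
            # Regex entry: matched from the start of the module name.
            if re.match(target[3:], module_name):
                return True
        elif module_name == target:
            # Plain entry: exact name match.
            return True
    return False


IGNORE = ["re:.*vision.*", "lm_head", "re:.*embed_tokens.*"]

# Hypothetical module names, for illustration only.
names = [
    "model.vision_tower.encoder.layers.0.mlp.fc1",
    "model.language_model.embed_tokens",
    "model.language_model.layers.0.self_attn.q_proj",
    "lm_head",
]
skipped = [n for n in names if is_ignored(n, IGNORE)]
```

Under these assumptions only the `q_proj` layer would be quantized; the other three names are skipped.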

## Evaluation

This model was evaluated on GSM8K Platinum, MMLU-Pro, IFEval, MATH-500, AIME 2025, GPQA Diamond, and LiveCodeBench v6 using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness) and [lighteval](https://github.com/neuralmagic/lighteval), served with [vLLM](https://docs.vllm.ai/en/latest/) (OpenAI-compatible API). All evaluations were performed with thinking enabled.

### Accuracy

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Benchmark</th>
      <th>google/gemma-4-31B-it</th>
      <th>RedHatAI/gemma-4-31B-it-FP8-block</th>
      <th>Recovery</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2"><b>Instruction Following</b></td>
      <td>IFEval (0-shot, prompt-level strict)</td>
      <td>90.70</td>
      <td>91.25</td>
      <td>100.6%</td>
    </tr>
    <tr>
      <td>IFEval (0-shot, inst-level strict)</td>
      <td>93.45</td>
      <td>94.00</td>
      <td>100.6%</td>
    </tr>
    <tr>
      <td rowspan="5"><b>Reasoning</b></td>
      <td>GSM8K Platinum (0-shot, strict-match)</td>
      <td>95.78</td>
      <td>95.78</td>
      <td>100.0%</td>
    </tr>
    <tr>
      <td>MMLU-Pro (0-shot, custom-extract)</td>
      <td>85.41</td>
      <td>85.44</td>
      <td>100.0%</td>
    </tr>
    <tr>
      <td>MATH-500 (0-shot, pass@1)</td>
      <td>89.40</td>
      <td>88.67</td>
      <td>99.2%</td>
    </tr>
    <tr>
      <td>AIME 2025 (0-shot, pass@1)</td>
      <td>65.83</td>
      <td>68.33</td>
      <td>103.8%</td>
    </tr>
    <tr>
      <td>GPQA Diamond (0-shot, pass@1)</td>
      <td>77.44</td>
      <td>77.95</td>
      <td>100.7%</td>
    </tr>
    <tr>
      <td><b>Coding</b></td>
      <td>LiveCodeBench v6 (0-shot, pass@1)</td>
      <td>71.43</td>
      <td>73.52</td>
      <td>102.9%</td>
    </tr>
  </tbody>
</table>
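
The Recovery column is consistent with the ratio of the quantized score to the baseline score, expressed as a percentage. This definition is inferred from the numbers rather than stated in the card. A short sketch that reproduces a few rows:

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline, rounded to 0.1."""
    return round(100.0 * quantized / baseline, 1)


# (baseline, quantized) pairs copied from the table above.
scores = {
    "IFEval (prompt-level strict)": (90.70, 91.25),
    "MATH-500": (89.40, 88.67),
    "AIME 2025": (65.83, 68.33),
    "LiveCodeBench v6": (71.43, 73.52),
}
for name, (base, quant) in scores.items():
    print(f"{name}: {recovery(base, quant)}%")
```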

### Reproduction

The results were obtained using the following commands:

<details>

Each benchmark was run 3 times with different random seeds (1234, 2345, 3456) and the scores were averaged; AIME 2025 used 8 seeds.

vLLM server (instruction following and reasoning benchmarks):
```
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --max-model-len 69632 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image":0,"audio":0}' \
  --async-scheduling
```

#### GSM8K Platinum (lm-eval, 0-shot, 3 repetitions)
```
lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-block,max_length=69632,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_gsm8k_platinum.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
```

#### MMLU-Pro (lm-eval, 0-shot, 3 repetitions)
```
lm_eval --model local-chat-completions \
  --tasks mmlu_pro_chat \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-block,max_length=69632,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_mmlu_pro.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
```

#### IFEval (lm-eval, 0-shot, 3 repetitions)
```
lm_eval --model local-chat-completions \
  --tasks ifeval \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-block,max_length=69632,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_ifeval.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
```
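
The three lm-eval commands differ only in the task name, the output path, and the seed, so the nine runs (3 tasks × 3 seeds) can be generated rather than typed by hand. The sketch below only builds and prints the commands. It assumes that both `--seed` and the `seed` inside `--gen_kwargs` change per run, and the per-seed output file names are a hypothetical choice to avoid overwriting results; neither detail is stated in the card.

```python
import shlex

MODEL = "RedHatAI/gemma-4-31B-it-FP8-block"
SEEDS = [1234, 2345, 3456]
# Task name -> output file stem, as used in the commands above.
TASKS = {
    "gsm8k_platinum_cot_llama": "results_gsm8k_platinum",
    "mmlu_pro_chat": "results_mmlu_pro",
    "ifeval": "results_ifeval",
}


def lm_eval_command(task: str, output_stem: str, seed: int) -> list[str]:
    """Return the lm_eval argument list for one task and one seed."""
    model_args = (
        f"model={MODEL},max_length=69632,"
        "base_url=http://0.0.0.0:8000/v1/chat/completions,"
        "num_concurrent=32,max_retries=3,tokenized_requests=False,"
        "tokenizer_backend=None,timeout=3600"
    )
    gen_kwargs = (
        "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,"
        f"max_gen_toks=32000,seed={seed}"
    )
    return [
        "lm_eval", "--model", "local-chat-completions",
        "--tasks", task,
        "--model_args", model_args,
        "--num_fewshot", "0",
        "--apply_chat_template",
        # Assumption: one output file per seed so runs do not overwrite each other.
        "--output_path", f"{output_stem}_seed{seed}.json",
        "--seed", str(seed),
        "--gen_kwargs", gen_kwargs,
    ]


commands = [
    lm_eval_command(task, stem, seed)
    for task, stem in TASKS.items()
    for seed in SEEDS
]
for cmd in commands:
    # To execute instead of print: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```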

#### MATH-500, AIME 2025, GPQA Diamond (lighteval, 3 repetitions; 8 for AIME 2025)

litellm_config.yaml:
```yaml
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/RedHatAI/gemma-4-31B-it-FP8-block
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 3600
  concurrent_requests: 32
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 65536
    top_p: 0.95
    top_k: 64
    seed: 1234
```

Run once per seed (changing `seed` in the config each time):
```
lighteval endpoint litellm litellm_config.yaml 'math_500|0' \
  --output-dir results/ --save-details

lighteval endpoint litellm litellm_config.yaml 'aime25|0' \
  --output-dir results/ --save-details

lighteval endpoint litellm litellm_config.yaml 'gpqa:diamond|0' \
  --output-dir results/ --save-details
```

#### LiveCodeBench v6 (lighteval, 3 repetitions)

vLLM server:
```
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --max-model-len 36864 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image":0,"audio":0}' \
  --async-scheduling
```

litellm_config.yaml:
```yaml
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/RedHatAI/gemma-4-31B-it-FP8-block
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 1200
  concurrent_requests: 32
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 32768
    top_p: 0.95
    top_k: 64
    seed: 1234
```

Run once per seed (changing `seed` in the config each time):
```
lighteval endpoint litellm litellm_config.yaml 'lcb:codegeneration_v6|0' \
  --output-dir results/ --save-details
```

</details>