---
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: llama3.2
base_model: meta-llama/Llama-3.2-1B-Instruct
---

# Llama-3.2-1B-Instruct-FP8-dynamic

## Model Overview
- **Model Architecture:** Meta-Llama-3.2
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similar to [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- **Release Date:** 9/25/2024
- **Version:** 1.0
- **License(s):** [llama3.2](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE)
- **Model Developers:** Neural Magic
Quantized version of [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct).
It achieves an average score of 50.88 on a subset of tasks from the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 51.70.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) to the FP8 data type, ready for inference with vLLM built from source.
This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
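
As a rough illustration (assuming the ~1.24B parameter count reported for Llama-3.2-1B, and ignoring non-weight overheads such as unquantized embeddings and metadata):

```python
params = 1.24e9  # approximate parameter count of Llama-3.2-1B-Instruct (assumed)

bf16_gb = params * 2 / 1e9  # 16 bits = 2 bytes per parameter
fp8_gb = params * 1 / 1e9   # 8 bits = 1 byte per parameter

print(f"BF16: ~{bf16_gb:.1f} GB  FP8: ~{fp8_gb:.1f} GB")  # BF16: ~2.5 GB  FP8: ~1.2 GB
```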

Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric per-channel scheme, in which a linear scaling per output dimension maps them to their FP8 representation. Activations are quantized with a symmetric per-token scheme, with scales computed dynamically at runtime.
[LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization.
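
For intuition, the per-token dynamic scheme picks one scale per token at runtime so that the token's largest activation magnitude maps to the FP8 maximum. Below is a minimal sketch of that idea, not the LLM Compressor or vLLM kernel; it assumes PyTorch ≥ 2.1 for the `float8_e4m3fn` dtype:

```python
import torch

def quantize_fp8_per_token(x: torch.Tensor):
    """Symmetric per-token dynamic FP8 quantization (illustrative sketch)."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    # One scale per token (row), so each token's max magnitude lands on fp8_max.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / fp8_max
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

x = torch.randn(4, 2048)                 # (num_tokens, hidden_dim)
x_fp8, scale = quantize_fp8_per_token(x)
x_hat = x_fp8.to(torch.float32) * scale  # dequantize to inspect the error
print((x - x_hat).abs().max())
```

Per-channel weight quantization is analogous, except the scales are computed once per output channel at quantization time rather than per token at runtime.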

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat template to a prompt string; add_generation_prompt=True
# appends the assistant header so the model answers rather than continuing
# the last turn.
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
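
For instance, with recent vLLM versions that provide the `vllm serve` entrypoint, the following starts an OpenAI-compatible server (by default on port 8000):

```
vllm serve neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic
```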

## Creation

This model was created by applying [LLM Compressor](https://github.com/vllm-project/llm-compressor/blob/sa/big_model_support/examples/big_model_offloading/big_model_w8a8_calibrate.py), as presented in the code snippet below.

```python
import torch

from transformers import AutoTokenizer

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import (  # noqa
    calculate_offload_device_map,
    custom_offload_device_map,
)

# FP8 recipe: static per-channel weight scales, dynamic per-token activation
# scales; the lm_head is left unquantized.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: channel
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: token
                        dynamic: true
                        symmetric: true
                    targets: ["Linear"]
"""

model_stub = "meta-llama/Llama-3.2-1B-Instruct"
model_name = model_stub.split("/")[-1]

# Spread the model across the available GPU and CPU memory of a single GPU.
device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype="auto"
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype="auto", device_map=device_map
)

output_dir = f"./{model_name}-FP8-dynamic"

# Apply the recipe in one shot and save the compressed checkpoint.
oneshot(
    model=model,
    recipe=recipe,
    output_dir=output_dir,
    save_compressed=True,
    tokenizer=AutoTokenizer.from_pretrained(model_stub),
)
```
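
Because the recipe sets `dynamic: true` for input activations, activation scales are computed on the fly at inference time; no calibration dataset is needed, which is why the `oneshot` call above passes no dataset.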

## Evaluation

The model was evaluated on the MMLU, MMLU-cot, ARC-Challenge, GSM-8K-cot, Hellaswag, Winogrande, and TruthfulQA benchmarks.
Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
This version of the lm-evaluation-harness includes versions of ARC-Challenge, GSM-8K, MMLU, and MMLU-cot that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
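
To run the reproduction commands below, the fork can be installed directly from that branch, for example (standard pip VCS syntax, shown as an illustration):

```
pip install git+https://github.com/neuralmagic/lm-evaluation-harness.git@llama_3.1_instruct
```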

### Accuracy

#### Open LLM Leaderboard evaluation scores

| Benchmark | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-FP8-dynamic (this model) | Recovery |
| :-------- | --------------------: | ---------------------------------------------: | -------: |
| MMLU (5-shot) | 47.66 | 47.55 | 99.8% |
| MMLU-cot (0-shot) | 47.10 | 46.79 | 99.3% |
| ARC Challenge (0-shot) | 58.36 | 57.25 | 98.1% |
| GSM-8K-cot (8-shot, strict-match) | 45.72 | 45.94 | 100.5% |
| Winogrande (5-shot) | 62.27 | 61.40 | 98.6% |
| Hellaswag (10-shot) | 61.01 | 60.95 | 99.9% |
| TruthfulQA (0-shot, mc2) | 43.52 | 44.23 | 101.6% |
| **Average** | **52.24** | **52.02** | **99.7%** |
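
The Recovery column is simply the quantized score expressed as a percentage of the unquantized score, e.g. for MMLU:

```python
baseline, quantized = 47.66, 47.55  # MMLU (5-shot) scores from the table above
recovery = 100 * quantized / baseline
print(f"{recovery:.1f}%")  # 99.8%
```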

### Reproduction

The results were obtained using the following commands:

#### MMLU
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
  --tasks mmlu_llama_3.1_instruct \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 5 \
  --batch_size auto
```

#### MMLU-CoT
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks mmlu_cot_0shot_llama_3.1_instruct \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

#### ARC-Challenge
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
  --tasks arc_challenge_llama_3.1_instruct \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

#### GSM-8K
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks gsm8k_cot_llama_3.1_instruct \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 8 \
  --batch_size auto
```

#### Hellaswag
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --batch_size auto
```

#### Winogrande
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size auto
```

#### TruthfulQA
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --batch_size auto
```