---
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: llama3.2
base_model: meta-llama/Llama-3.2-1B-Instruct
---

# Llama-3.2-1B-Instruct-FP8-dynamic

## Model Overview
- **Model Architecture:** Meta-Llama-3.2
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similar to [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- **Release Date:** 9/25/2024
- **Version:** 1.0
- **License(s):** [llama3.2](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE)
- **Model Developers:** Neural Magic
Quantized version of [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct).
It achieves an average score of 50.88 on a subset of tasks from the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 51.70.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) to the FP8 data type, ready for inference with vLLM built from source.
This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
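
As a rough illustration (assuming the ~1.24B parameter count reported for Llama-3.2-1B, and ignoring non-weight overheads such as unquantized embeddings and metadata):

```python
params = 1.24e9  # approximate parameter count of Llama-3.2-1B-Instruct (assumed)

bf16_gb = params * 2 / 1e9  # 16 bits = 2 bytes per parameter
fp8_gb = params * 1 / 1e9   # 8 bits = 1 byte per parameter

print(f"BF16: ~{bf16_gb:.1f} GB  FP8: ~{fp8_gb:.1f} GB")  # BF16: ~2.5 GB  FP8: ~1.2 GB
```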

Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric per-channel scheme, in which a linear scaling per output dimension maps them to their FP8 representation. Activations are quantized with a symmetric per-token scheme, with scales computed dynamically at runtime.
[LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization.
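
For intuition, the per-token dynamic scheme picks one scale per token at runtime so that the token's largest activation magnitude maps to the FP8 maximum. Below is a minimal sketch of that idea, not the LLM Compressor or vLLM kernel; it assumes PyTorch ≥ 2.1 for the `float8_e4m3fn` dtype:

```python
import torch

def quantize_fp8_per_token(x: torch.Tensor):
    """Symmetric per-token dynamic FP8 quantization (illustrative sketch)."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    # One scale per token (row), so each token's max magnitude lands on fp8_max.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / fp8_max
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

x = torch.randn(4, 2048)                 # (num_tokens, hidden_dim)
x_fp8, scale = quantize_fp8_per_token(x)
x_hat = x_fp8.to(torch.float32) * scale  # dequantize to inspect the error
print((x - x_hat).abs().max())
```

Per-channel weight quantization is analogous, except the scales are computed once per output channel at quantization time rather than per token at runtime.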

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat template to a prompt string; add_generation_prompt=True
# appends the assistant header so the model answers rather than continuing
# the last turn.
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
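
For instance, with recent vLLM versions that provide the `vllm serve` entrypoint, the following starts an OpenAI-compatible server (by default on port 8000):

```
vllm serve neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic
```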

## Creation

This model was created by applying [LLM Compressor](https://github.com/vllm-project/llm-compressor/blob/sa/big_model_support/examples/big_model_offloading/big_model_w8a8_calibrate.py), as presented in the code snippet below.

```python
import torch

from transformers import AutoTokenizer

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import (  # noqa
    calculate_offload_device_map,
    custom_offload_device_map,
)

# FP8 recipe: static per-channel weight scales, dynamic per-token activation
# scales; the lm_head is left unquantized.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: channel
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: token
                        dynamic: true
                        symmetric: true
                    targets: ["Linear"]
"""

model_stub = "meta-llama/Llama-3.2-1B-Instruct"
model_name = model_stub.split("/")[-1]

# Spread the model across the available GPU and CPU memory of a single GPU.
device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype="auto"
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype="auto", device_map=device_map
)

output_dir = f"./{model_name}-FP8-dynamic"

# Apply the recipe in one shot and save the compressed checkpoint.
oneshot(
    model=model,
    recipe=recipe,
    output_dir=output_dir,
    save_compressed=True,
    tokenizer=AutoTokenizer.from_pretrained(model_stub),
)
```
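
Because the recipe sets `dynamic: true` for input activations, activation scales are computed on the fly at inference time; no calibration dataset is needed, which is why the `oneshot` call above passes no dataset.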

## Evaluation

The model was evaluated on the MMLU, MMLU-cot, ARC-Challenge, GSM-8K-cot, Hellaswag, Winogrande, and TruthfulQA benchmarks.
Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
This version of the lm-evaluation-harness includes versions of ARC-Challenge, GSM-8K, MMLU, and MMLU-cot that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
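
To run the reproduction commands below, the fork can be installed directly from that branch, for example (standard pip VCS syntax, shown as an illustration):

```
pip install git+https://github.com/neuralmagic/lm-evaluation-harness.git@llama_3.1_instruct
```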

### Accuracy

#### Open LLM Leaderboard evaluation scores

| Benchmark | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-FP8-dynamic (this model) | Recovery |
| :-------- | --------------------: | ---------------------------------------------: | -------: |
| MMLU (5-shot) | 47.66 | 47.55 | 99.8% |
| MMLU-cot (0-shot) | 47.10 | 46.79 | 99.3% |
| ARC Challenge (0-shot) | 58.36 | 57.25 | 98.1% |
| GSM-8K-cot (8-shot, strict-match) | 45.72 | 45.94 | 100.5% |
| Winogrande (5-shot) | 62.27 | 61.40 | 98.6% |
| Hellaswag (10-shot) | 61.01 | 60.95 | 99.9% |
| TruthfulQA (0-shot, mc2) | 43.52 | 44.23 | 101.6% |
| **Average** | **52.24** | **52.02** | **99.7%** |
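
The Recovery column is simply the quantized score expressed as a percentage of the unquantized score, e.g. for MMLU:

```python
baseline, quantized = 47.66, 47.55  # MMLU (5-shot) scores from the table above
recovery = 100 * quantized / baseline
print(f"{recovery:.1f}%")  # 99.8%
```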

### Reproduction

The results were obtained using the following commands:

#### MMLU
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
  --tasks mmlu_llama_3.1_instruct \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 5 \
  --batch_size auto
```

#### MMLU-CoT
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks mmlu_cot_0shot_llama_3.1_instruct \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

#### ARC-Challenge
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
  --tasks arc_challenge_llama_3.1_instruct \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

#### GSM-8K
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks gsm8k_cot_llama_3.1_instruct \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 8 \
  --batch_size auto
```

#### Hellaswag
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --batch_size auto
```

#### Winogrande
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size auto
```

#### TruthfulQA
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --batch_size auto
```