---
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: llama3.2
base_model: meta-llama/Llama-3.2-1B-Instruct
---

# Llama-3.2-1B-Instruct-FP8-dynamic

## Model Overview
- **Model Architecture:** Meta-Llama-3.2
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similarly to [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- **Release Date:** 9/25/2024
- **Version:** 1.0
- **License(s):** [llama3.2](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE)
- **Model Developers:** Neural Magic

Quantized version of [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct).
It achieves an average score of 50.88 on a subset of tasks from the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 51.70.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) to the FP8 data type, ready for inference with vLLM.
This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
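
The ~50% figure can be sanity-checked with quick arithmetic. A minimal sketch, assuming a parameter count of roughly 1.24 billion for Llama-3.2-1B (the exact count differs slightly):

```python
# Approximate memory footprint of the model weights at different precisions.
NUM_PARAMS = 1.24e9  # ~1.24 billion parameters (approximate)

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

bf16 = weight_memory_gb(NUM_PARAMS, 16)  # original 16-bit weights
fp8 = weight_memory_gb(NUM_PARAMS, 8)    # FP8-quantized weights

print(f"BF16: {bf16:.2f} GB, FP8: {fp8:.2f} GB, saving: {1 - fp8 / bf16:.0%}")
# prints: BF16: 2.48 GB, FP8: 1.24 GB, saving: 50%
```

Note that total GPU memory at inference time also includes activations and the KV cache, which this estimate ignores.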

Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric per-channel scheme, in which a linear scale per output channel maps the FP8 representation of the quantized weights. Activations are quantized with a symmetric per-token scheme, with scales computed dynamically at runtime.
[LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization.
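
The scheme can be illustrated with a short NumPy sketch. This is illustrative only: it models the per-channel and per-token dynamic-range scaling, not FP8's reduced mantissa precision, and is not the LLM Compressor implementation:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_weights_per_channel(w: np.ndarray):
    """Symmetric per-channel quantization: one scale per output row of w."""
    scale = np.abs(w).max(axis=1, keepdims=True) / FP8_E4M3_MAX  # shape (out, 1)
    q = np.clip(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)          # FP8-range values
    return q, scale

def quantize_activations_per_token(x: np.ndarray):
    """Dynamic per-token quantization: one scale per row (token) of x."""
    scale = np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX  # shape (tokens, 1)
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32)).astype(np.float32)  # a linear layer's weight
x = rng.normal(size=(8, 32)).astype(np.float32)   # activations for 8 tokens

qw, sw = quantize_weights_per_channel(w)
qx, sx = quantize_activations_per_token(x)

# Dequantizing recovers the originals up to rounding (exact here, since this
# sketch only models the dynamic range, not the reduced FP8 mantissa).
assert np.allclose(qw * sw, w) and np.allclose(qx * sx, x)
```

The weight scales are fixed at quantization time, while the activation scales are recomputed for every token at inference, which is what "dynamic" refers to in the model name.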

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat template to a prompt string; add_generation_prompt=True
# appends the assistant header so the model answers instead of continuing.
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
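
As a sketch of what such a request looks like (assuming a server started with `vllm serve neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic` listening on the default local port; the payload mirrors the offline example above):

```python
import json

# Request body for an OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",
    "messages": [
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 256,
}

body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions once the server is
# running, e.g. with curl or the `openai` Python client.
print(body[:80])
```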

## Creation

This model was created by applying [LLM Compressor](https://github.com/vllm-project/llm-compressor/blob/sa/big_model_support/examples/big_model_offloading/big_model_w8a8_calibrate.py), as presented in the code snippet below.

```python
import torch

from transformers import AutoTokenizer

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import (  # noqa
    calculate_offload_device_map,
    custom_offload_device_map,
)

recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: channel
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: token
                        dynamic: true
                        symmetric: true
                    targets: ["Linear"]
"""

model_stub = "meta-llama/Llama-3.2-1B-Instruct"
model_name = model_stub.split("/")[-1]

device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype="auto"
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype="auto", device_map=device_map
)

output_dir = f"./{model_name}-FP8-dynamic"

oneshot(
    model=model,
    recipe=recipe,
    output_dir=output_dir,
    save_compressed=True,
    tokenizer=AutoTokenizer.from_pretrained(model_stub),
)
```

## Evaluation

The model was evaluated on MMLU, MMLU-cot, ARC-Challenge, GSM-8K, Hellaswag, Winogrande, and TruthfulQA.
Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
This version of the lm-evaluation-harness includes versions of ARC-Challenge, GSM-8K, MMLU, and MMLU-cot that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).


### Accuracy

#### Open LLM Leaderboard evaluation scores

| Benchmark | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-FP8-dynamic (this model) | Recovery |
| --------- | --------------------- | ---------------------------------------------- | -------- |
| MMLU (5-shot) | 47.66 | 47.55 | 99.8% |
| MMLU-cot (0-shot) | 47.10 | 46.79 | 99.3% |
| ARC Challenge (0-shot) | 58.36 | 57.25 | 98.1% |
| GSM-8K-cot (8-shot, strict-match) | 45.72 | 45.94 | 100.5% |
| Winogrande (5-shot) | 62.27 | 61.40 | 98.6% |
| Hellaswag (10-shot) | 61.01 | 60.95 | 99.9% |
| TruthfulQA (0-shot, mc2) | 43.52 | 44.23 | 101.6% |
| **Average** | **52.24** | **52.02** | **99.7%** |
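
The Recovery column is simply the quantized score expressed as a percentage of the unquantized baseline, e.g. for the MMLU row:

```python
def recovery(quantized: float, baseline: float) -> float:
    """Quantized score as a percentage of the unquantized baseline score."""
    return 100 * quantized / baseline

print(f"{recovery(47.55, 47.66):.1f}%")  # prints: 99.8%
```

Values above 100% (as for GSM-8K-cot and TruthfulQA here) mean the quantized model scored slightly higher than the baseline, which is within run-to-run noise.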

### Reproduction

The results were obtained using the following commands:

#### MMLU
```shell
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
  --tasks mmlu_llama_3.1_instruct \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 5 \
  --batch_size auto
```

#### MMLU-CoT
```shell
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks mmlu_cot_0shot_llama_3.1_instruct \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

#### ARC-Challenge
```shell
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
  --tasks arc_challenge_llama_3.1_instruct \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

#### GSM-8K
```shell
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks gsm8k_cot_llama_3.1_instruct \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 8 \
  --batch_size auto
```

#### Hellaswag
```shell
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --batch_size auto
```

#### Winogrande
```shell
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size auto
```

#### TruthfulQA
```shell
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --batch_size auto
```