---
title: "Welcome Llama 4 Maverick & Scout on Hugging Face"
thumbnail: /blog/assets/llama_4.png
authors:
- user: burtenshaw
- user: reach-vb
- user: pcuenq
- user: clem
- user: rajatarya
  guest: true
  org: xet-team
- user: jsulz
  guest: true
  org: xet-team
- user: lysandre
---

# Welcome Llama 4 Maverick & Scout on Hugging Face

We are incredibly excited to welcome the next generation of large language models from Meta to the Hugging Face Hub: [Llama 4 Maverick (~400B)](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Original) and [Llama 4 Scout (~109B)](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Original)! 🤗 Both are Mixture of Experts (MoE) models with 17B active parameters.

Released today, these powerful, natively multimodal models represent a significant leap forward. We've worked closely with Meta to ensure seamless integration into the Hugging Face ecosystem, including both transformers and TGI from day one.

This is just the start of our journey with Llama 4. Over the coming days we'll continue to collaborate with the community to build amazing models, datasets, and applications with Maverick and Scout! 🔥

## What is Llama 4?

Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation includes two models:

- The highly capable **Llama 4 Maverick**, with 17B active parameters out of ~400B total, using 128 experts.
- The efficient **Llama 4 Scout**, also with 17B active parameters out of ~109B total, using just 16 experts.

Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are both trained on up to 40 trillion tokens of data encompassing 200 languages (with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).

For deployment, Llama 4 Scout is designed for accessibility, fitting on a single server-grade GPU via on-the-fly 4-bit or 8-bit quantization, while Maverick is available in BF16 and FP8 formats. These models are released under the custom Llama 4 Community License Agreement, available on the model repositories.
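
If you want to double-check the architecture details (expert counts, hidden sizes, context length, and so on), the model configuration can be inspected without downloading the weights. Here is a minimal sketch; it assumes you have accepted the license and are logged in (`huggingface-cli login`), and exact field names may differ slightly between releases:

```py
from transformers import AutoConfig

# Fetch only the configuration for Llama 4 Scout (gated repo: license acceptance required).
config = AutoConfig.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")

# Llama 4 is natively multimodal, so the config is composite; the text backbone
# carries the MoE settings (number of experts, hidden size, context length, ...).
print(config.text_config)
```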

## Features and Integrations on Hugging Face

To help the community leverage these state-of-the-art models immediately, we're thrilled to announce the following integrations:

* **Model Checkpoints on the Hub:** Both Llama 4 Maverick and Llama 4 Scout model weights are available directly on the Hugging Face Hub under the `meta-llama` organization. This includes both base and instruction-tuned variants, allowing for easy access, exploration, and download. You need to accept the license terms on the model card before accessing the weights.
* **Hugging Face `transformers` integration:** Get building now! Llama 4 models are fully integrated with `transformers` (version `v4.51.0`). This allows for easy loading, inference, and fine-tuning using familiar APIs, including support for their native multimodal capabilities, as well as downstream libraries like TRL.
* **Tensor parallel & automatic device mapping:** `transformers` supports tensor-parallel execution and automatic device mapping for Llama 4, so multi-GPU inference works out of the box.
* **Text Generation Inference (TGI) Support:** For optimized and scalable deployment, both models are supported by TGI. This allows for high-throughput text generation, making it easier to integrate Llama 4 into production applications (see the client sketch after this list).
* **Quantization Support:** Code for on-the-fly int4 quantization is provided for Scout, minimizing performance degradation while enabling deployment on smaller hardware footprints (a quantized loading sketch follows the transformers example below). Maverick includes FP8 quantized weights for efficient deployment on compatible hardware.
* **Xet Storage:** To improve uploads and downloads, and to support faster iteration on community finetunes, we've launched all Llama 4 models using the [Xet storage backend](https://huggingface.co/blog/xet-on-the-hub). This storage system was designed for faster uploads & downloads, and with Llama 4 it achieves ~25% deduplication. Derivative models (finetunes, quantizations, etc.) should see even higher deduplication (~40%), saving the community even more time & bandwidth.
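
Once a TGI server is running with one of the Llama 4 checkpoints (for example, via TGI's Docker image pointed at `meta-llama/Llama-4-Scout-17B-16E-Instruct`; adjust GPU count and sharding to your hardware), you can query it from Python. The sketch below is illustrative: the endpoint URL and port are placeholders for wherever your server is listening.

```py
from huggingface_hub import InferenceClient

# Point the client at your running TGI endpoint (placeholder URL).
client = InferenceClient("http://localhost:8080")

# TGI exposes an OpenAI-style chat completion interface for chat models.
response = client.chat_completion(
    messages=[
        {"role": "user", "content": "Give a two-sentence overview of mixture-of-experts models."},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```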

## Using Hugging Face Transformers

Getting started with Llama 4 using `transformers` is straightforward. Make sure you have `transformers v4.51.0` or later installed (`pip install -U transformers "huggingface_hub[hf_xet]"`). Here's a quick example using the instruction-tuned Maverick model responding to a question about two images, using tensor parallelism for maximum speed. You need to run this script on an instance with 8 GPUs, using a command like:
`torchrun --nproc-per-node=8 script.py`

```py
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)

# Shard the model across GPUs with tensor parallelism (run under torchrun,
# one process per GPU), using the flex attention implementation.
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    tp_plan="auto",
    torch_dtype=torch.bfloat16,
)

# Two example images plus a text question in a single multimodal chat turn.
url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

# The processor applies the chat template and fetches/encodes the images.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

# Decode only the newly generated tokens (everything after the prompt).
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])
```

Make sure to check the model cards on the repos ([Llama 4 Maverick (~400B)](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Original) and [Llama 4 Scout (~109B)](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Original)) for detailed usage instructions, including multimodal examples, specific prompt formats (like system prompts), quantization details, and advanced configuration options!
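
For the single-GPU deployment path mentioned earlier, the model repos include Meta's own on-the-fly int4 code, which is the recommended route. As a rough illustration of the general idea within `transformers`, here is a hedged sketch that instead loads Scout with a generic 4-bit bitsandbytes configuration; it assumes `bitsandbytes` is installed, and memory use and quality trade-offs will differ from the official int4 recipe:

```py
from transformers import AutoProcessor, BitsAndBytesConfig, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# Generic on-the-fly 4-bit (NF4) quantization via bitsandbytes; illustrative only,
# not the int4 code shipped in the model repository.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Summarize the Llama 4 release in one sentence."}]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0])
```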

## Evaluation Scores

Evaluation results confirm the strength of these models: both deliver state-of-the-art results and significantly outperform predecessors like Llama 3.1 405B. For instance, on reasoning and knowledge tasks, the instruction-tuned Maverick achieves 80.5% on MMLU Pro and 69.8% on GPQA Diamond, while Scout scores 74.3% and 57.2% respectively.

<!-- expander -->
<details>

<summary>Click to expand Evaluation Results</summary>

### Pre-trained models

| Category | Benchmark | \# Shots | Metric | Llama 3.1 70B | Llama 3.1 405B | **Llama 4 Scout** | **Llama 4 Maverick** |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Reasoning & Knowledge | MMLU | 5 | macro\_avg/acc\_char | 79.3 | 85.2 | 79.6 | 85.5 |
| | MMLU-Pro | 5 | macro\_avg/em | 53.8 | 61.6 | 58.2 | 62.9 |
| | MATH | 4 | em\_maj1@1 | 41.6 | 53.5 | 50.3 | 61.2 |
| Code | MBPP | 3 | pass@1 | 66.4 | 74.4 | 67.8 | 77.6 |
| Multilingual | TydiQA | 1 | average/f1 | 29.9 | 34.3 | 31.5 | 31.7 |
| Image | ChartQA | 0 | relaxed\_accuracy | No multimodal support | | 83.4 | 85.3 |
| | DocVQA | 0 | anls | | | 89.4 | 91.6 |

### Instruction tuned models

| Category | Benchmark | \# Shots | Metric | Llama 3.3 70B | Llama 3.1 405B | **Llama 4 Scout** | **Llama 4 Maverick** |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Image Reasoning | MMMU | 0 | accuracy | No multimodal support | | 69.4 | 73.4 |
| | MMMU Pro^ | 0 | accuracy | | | 52.2 | 59.6 |
| | MathVista | 0 | accuracy | | | 70.7 | 73.7 |
| Image Understanding | ChartQA | 0 | relaxed\_accuracy | | | 88.8 | 90.0 |
| | DocVQA (test) | 0 | anls | | | 94.4 | 94.4 |
| Coding | LiveCodeBench (10/01/2024-02/01/2025) | 0 | pass@1 | 33.3 | 27.7 | 32.8 | 43.4 |
| Reasoning & Knowledge | MMLU Pro | 0 | macro\_avg/em | 68.9 | 73.4 | 74.3 | 80.5 |
| | GPQA Diamond | 0 | accuracy | 50.5 | 49.0 | 57.2 | 69.8 |
| Multilingual | MGSM | 0 | average/em | 91.1 | 91.6 | 90.6 | 92.3 |
| Long context | MTOB (half book) eng-\>kgv/kgv-\>eng | \- | chrF | Context window is 128K | | 42.2/36.6 | 54.0/46.4 |
| | MTOB (full book) eng-\>kgv/kgv-\>eng | \- | chrF | | | 39.7/36.3 | 50.8/46.7 |

</details>

## Acknowledgments

Releasing a giant like Llama 4 takes a colossal effort across teams, geographies, and a lot of VMs. In no particular order we'd like to thank Arthur, Lysandre, Cyril, Pablo, Marc, and Mohammed from the Transformers team. On the optimisation side, we'd like to thank Mohit for single-handedly adding Llama 4 support to TGI. These chonky models require some serious engineering at the storage level, which took a lot of effort from Ajit, Rajat, Jared, Di, Yucheng, and the rest of the [Xet team](http://hf.co/xet-team) too.

Many more people were involved in this effort; thanks a lot to the rest of the Hugging Face, vLLM, and Meta Llama teams for the brilliant synergy! 

## References

* To learn more about Xet Storage: [blog post](https://huggingface.co/blog/xet-on-the-hub) and [Hub docs](https://huggingface.co/docs/hub/storage-backends).
* Check out Meta's release [blog post](https://ai.meta.com/blog/llama-4-multimodal-intelligence/).