Add llama4 #37307
Conversation
Supports multi-image prompting and batching.
Thanks!! 🔥🔥🔥🔥🔥
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I ran the https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct model on a 1xA100 device. However, I'm getting this error:

```
File "/ephemeral/.venv/lib/python3.10/site-packages/torch/nn/functional.py", line 5209, in pad
    return torch._C._nn.pad(input, pad, mode, value)
TypeError: pad(): argument 'pad' failed to unpack the object at pos 2 with error "type must be tuple of ints,but got NoneType"
```

Code:

```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch
model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
attn_implementation="flex_attention",
device_map="auto",
torch_dtype=torch.bfloat16,
)
url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": url1},
{"type": "image", "url": url2},
{"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
]
},
]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=256,
)
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])
```

env:
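For context on the traceback above: the TypeError just means that a `None` padding spec reached `torch.nn.functional.pad` somewhere in the forward pass. A tiny standalone reproduction of the same error, unrelated to the model itself:

```python
import torch
import torch.nn.functional as F

x = torch.zeros(2, 3)
print(F.pad(x, (0, 1)).shape)  # OK: pad is a tuple of ints -> torch.Size([2, 4])

try:
    F.pad(x, None)  # a None pad spec, as in the traceback above
except TypeError as e:
    print("TypeError:", e)
```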
Having a look asap!
@ArthurZucker I assume this was a mistake to leave in? 😅
Oh yeah, you are using the dynamic cache; we will disable it.
The static cache should be used.
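A minimal sketch of what that could look like with the snippet above, assuming `generate()` accepts `cache_implementation="static"` for this model (a sketch, not an official recommendation):

```python
# Sketch only: reuses `model`, `processor`, and `inputs` from the snippet above,
# and asks generate() for a static KV cache instead of the default dynamic one.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    cache_implementation="static",  # assumption: this is the static cache path referred to above
)
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
```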
Is llama4 fully supported by Transformers at this time? It would be nice to get some clarification on this, since nearly everyone seems to be getting such bad results with Scout and Maverick.
It is, it is! We did not see such bad results, but we are investigating!
😭
This PR causes Transformers to error out when a model is using TensorFlow and the environment does not provide PyTorch. See transformers/src/transformers/pipelines/base.py:
```diff
@@ -981,6 +981,8 @@ def __init__(
         else:
             self.device = device if device is not None else -1

+        if torch.distributed.is_initialized():
```
Hi @ArthurZucker, why is this modification needed for llama4?
this was mostly because llama4 is too big to run without distributed, sorry that it broke stuff!
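For reference, a minimal sketch (not the actual patch from this PR) of how such a check could be guarded so that TensorFlow-only environments never touch `torch`:

```python
# Sketch only: consult torch.distributed only when PyTorch is installed,
# so that TensorFlow-only environments are unaffected by the new check.
from transformers.utils import is_torch_available

if is_torch_available():
    import torch

    # is_available() also covers torch builds compiled without distributed support
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        ...  # distributed-aware device placement would go here
```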
* remove one of the last deps
* update fast image processor after refactor
* styling
* more quality of life improvements
* nit
* update
* cleanups
* some cleanups
* vllm updates
* update fake image token
* [convert] Fix typo
* [convert] Strip extraneous bytes from shards
* [convert] Minor fixes
* [convert] Use num_experts
* multi-image fixes in modeling + processor
* fixup size
* 128 experts
* Use default rope
* Unfuse mlp
* simplify a lot inputs embeds merging
* remove .item() 👀
* fix from review
* Address feedback
* Use None "default" for rope_scaling. Add eot.
* set seed
* return aspect ratios and bug fixes
* Moe 128 rebased (huggingface#8)
* 128 experts
* Use default rope
* Unfuse mlp
* Address feedback
* Use None "default" for rope_scaling. Add eot.
* Meta/llama quant compat (huggingface#7)
* add quant compatible model & conversion code for llama4
* fix a few issues
* fix a few issues
* minor type mapping fix
---------
Co-authored-by: Lu Fang <[email protected]>
* use a new config parameter to determine which model definition to use for MoE
---------
Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
* un-comment write_tokenizer from converting script
* remove un-used imports
* [llama4] Pop aspect_ratios from image processor output in Llama4Processor Signed-off-by: Jon Swenson <[email protected]>
* Fix parameter_count name
* Update src/transformers/models/llama4/configuration_llama4.py
* nit
* Add changes for no_rope, moe_layers, chunked attention. Just need to test all
* Update src/transformers/models/llama4/image_processing_llama4_fast.py
* nit
* fix post merge with main
* support flex attention
* fixes
* fix
* add layer
* small updates
* rebase and delete llm_compressor
* nit
* [llama4/mm] Add back <|image|> token that delimits global tile
* [llama4/mm] Fix Llama 4 image processing unit tests
* add explicit dtype Signed-off-by: Jon Swenson <[email protected]>
* sdpa works
* comment todo small
* fix model loading Signed-off-by: Zijing Liu <[email protected]>
* revert
* nits
* small fix for TP on 1 node
* Read new params from config
* Add <|eom|>
* lol don't know how this got here
* adding fp8
* Save processor, fix chat template
* style
* Add boi/eoi tokens We don't use them.
* fixes for now flex seems to work :)
* updates
* nits
* updates
* missking keys
* add context parallel
* update
* update
* fix
* nits
* add worldsize and make eager attn work for vision
* Ignore new key present in base models
* add tp_plan
* fix nope Signed-off-by: Zijing Liu <[email protected]>
* minor fix Signed-off-by: Zijing Liu <[email protected]>
* Clean up Llama4 vision model
* current updates
* add support for `attn_temperature_tuning`
* add floor scale
* add missing attn scales
* push what works, dirty trick for the device synch
* oups
* Fix pad_token_id See https://huggingface.co/ll-re/Llama-4-Scout-17B-16E/discussions/2/files Confirmed in the original codebase.
* fix causallml loading
* rm
* fix tied-weights
* fix sdpa
* push current version
* should work with both short and long
* add compressed_tensos & fix fbgemm tp
* Fix flex impl
* style
* chunking
* try to revert the potentially breaking change
* fix auto factory
* fix shapes in general
* rm processing
* commit cache utils cleanup
* Fix context length
* fix
* allocate
* update tp_plan
* fix SDPA!
* Add support for sparse `Llama4TextMoe` layer from the kernel hub
* cleanup
* better merge
* update
* still broken fixing now
* nits
* revert print
* Write max_position_embeddings and max_model_length
* Update modeling_llama4.py
* Save attention_chunk_size
* Sync eos terminators
* Read initializer_range
* style
* remove `dict`
* fix
* eager should use `chunked_attention_mask`
* revert
* fixup
* fix config
* Revert "Merge pull request huggingface#36 from huggingface/sparse-llama4-moe" This reverts commit ccda19f, reversing changes made to a515579.
* Fix typo and remove warning with compiled flex and chunked prefill
* Fix MoE vs FF (huggingface#41)
* fix
* Use correct no_rope_layers if provided one is empty list
* update tests
* fix
* skipping some tests
* fix fp8 loading Signed-off-by: Zijing Liu <[email protected]>
* fix text geneartion pipeline Signed-off-by: Zijing Liu <[email protected]>
* eager needs 4D mask
* fix
* Some cleanup
* fix
* update
* fix
* replace correctly module
* patch
* modulelist
* update
* update
* clean up
* Don't move to `cuda:0` in distributed mode
* restrict to compressed tensors for now
* rm print
* Docs!
* Fixes
* Update docs/source/en/model_doc/llama4.md Co-authored-by: Pedro Cuenca <[email protected]>
* Fixes
* cuda graph fix
* revert some stuff
* fixup
* styling
* Update src/transformers/models/llama4/modeling_llama4.py Co-authored-by: Arthur <[email protected]>
* fixup
* commit licence, cleanup here and there and style
* more styling changes
* fix dummies
* fix and clean docstrings
* remove comment
* remove warning
* Only fast image processor is supported
* nit
* trigger CI
* fix issue with flex encoder
* fix dynamic cache
* Code quality
* Code quality
* fix more tests for now
* Code quality
* Code quality
* Nuke bunch of failing stuff
* Code quality
* Code quality
* cleanup removal of slow image processor
* ruff fix fast image processor
* fix
* fix styling
* Docs
* Repo consistency
* Repo consistency
* fix sliding window issue
* separate llama cache
* styling
* Repo consistency
* Repo consistency
* push waht works
* L4 Repo consistency
* Docs
* fix last last alst alst alst alstsaltlsltlaslt
---------
Signed-off-by: Jon Swenson <[email protected]>
Signed-off-by: Zijing Liu <[email protected]>
Co-authored-by: yonigozlan <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Pablo Montalvo <[email protected]>
Co-authored-by: Pablo Montalvo <[email protected]>
Co-authored-by: Keyun Tong <[email protected]>
Co-authored-by: Zijing Liu <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
Co-authored-by: Zijing Liu <[email protected]>
Co-authored-by: Jon Swenson <[email protected]>
Co-authored-by: jmswen <[email protected]>
Co-authored-by: MekkCyber <[email protected]>
Co-authored-by: Mohamed Mekkouri <[email protected]>
Co-authored-by: Mohit Sharma <[email protected]>
Co-authored-by: Yong Hoon Shin <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: drisspg <[email protected]>
Co-authored-by: Cyril Vallez <[email protected]>
Co-authored-by: Daniël de Kok <[email protected]>
Co-authored-by: Lysandre <[email protected]>
Co-authored-by: Ye (Charlotte) Qi <[email protected]>
Co-authored-by: ydshieh <[email protected]>
What does this PR do?