
Add Gemma 3 #238


Draft · wants to merge 36 commits into main from gemma-3

Conversation

DePasqualeOrg
Contributor

This is a first attempt at porting https://github.com/Blaizzy/mlx-vlm/tree/main/mlx_vlm/models/gemma3 to Swift. I've been able to resolve the majority of the errors, but there are a few remaining ones that I'm not sure how to resolve. Also see my TODO comments on lines that need to be checked.

@DePasqualeOrg force-pushed the gemma-3 branch 2 times, most recently from e4afada to 040d5c1 on March 12, 2025 at 10:40
@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Mar 12, 2025

I tried to factor out RMSNorm, since several models use it, but I'm having trouble making it accessible everywhere.

Edit: This is now fixed.

@davidkoski
Collaborator

I tried to factor out RMSNorm, since several models use it, but I'm having trouble making it accessible everywhere.

There is one in MLXNN as well, but they don't all have the same definition. Refactoring models can be tricky, IMHO.
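
For reference, a Gemma-style RMSNorm looks roughly like the sketch below; note the (1 + weight) scaling, which is one of the ways per-model definitions can differ from MLXNN's RMSNorm. This is an illustration, not necessarily the exact definition used in this PR:

    import MLX
    import MLXFast
    import MLXNN

    // Sketch of a Gemma-style RMSNorm: the learned weight is applied as
    // (1 + weight) on top of the fast rmsNorm kernel.
    class GemmaRMSNorm: Module, UnaryLayer {
        let weight: MLXArray
        let eps: Float

        init(dimensions: Int, eps: Float = 1e-6) {
            self.weight = MLXArray.ones([dimensions])
            self.eps = eps
            super.init()
        }

        func callAsFunction(_ x: MLXArray) -> MLXArray {
            MLXFast.rmsNorm(x, weight: 1.0 + self.weight, eps: self.eps)
        }
    }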

@DePasqualeOrg
Contributor Author

I fixed some more errors, and now there are just a few errors and TODO comments left, which I'll need help resolving.

@davidkoski
Collaborator

I fixed some more errors, and now there are just a few errors and TODO comments left, which I'll need help resolving.

I can take a look this afternoon!

@DePasqualeOrg
Contributor Author

The config is working, although it can probably be improved (see TODO comment and possibly remove unneeded properties). But now I'm getting the following error when I run the model:

Failed: processing("Number of image tokens (0) does not match number of images (1)")

@DePasqualeOrg
Contributor Author

There's a mismatch between the image tokens inserted by the tokenizer, what's expected in this implementation, what's in the config, and what I see in the Python implementation. I'll need help sorting this out.

https://huggingface.co/mlx-community/gemma-3-4b-it-4bit/blob/main/config.json

Debug output from the current commit:

Messages before tokenization: [["role": "user", "content": [["text": "Describe the image in English", "type": "text"], ["type": "image"]]]]
Prompt token IDs: [4368, 506, 105, 5422, 2, 255999, 528, 107, 82858, 2364, 2471, 106]
Decoded prompt tokens: <bos><start_of_turn>user
Describe the image in English<start_of_image><end_of_turn>
<start_of_turn>model

@davidkoski
Collaborator

The config is working, although it can probably be improved (see TODO comment and possibly remove unneeded properties). But now I'm getting the following error when I run the model:

Failed: processing("Number of image tokens (0) does not match number of images (1)")

This is the prompt right before tokenization:

"<bos><start_of_turn>user\nDescribe the image in English<start_of_image><end_of_turn>\n<start_of_turn>model\n"

and per the config object we are looking for this token:

    "262144": {
      "content": "<image_soft_token>",

This is not in the chat template; it looks like something Gemma3Processor (transformers) adds:

            # Replace image tokens by the full expanded sequence
            batch_num_crops = to_py_obj(image_inputs.pop("num_crops"))
            text_with_crops = text
            for batch_idx, (prompt, images, num_crops) in enumerate(zip(text, batched_images, batch_num_crops)):
                image_indexes = [m.start() for m in re.finditer(self.boi_token, prompt)]

                if len(images) != len(image_indexes):
                    raise ValueError(
                        f"Prompt contained {len(image_indexes)} image tokens but received {len(images)} images."
                    )

                # Insert additional image tokens for Pan-and-Scan crops
                for num, idx in reversed(list(zip(num_crops, image_indexes))):
                    if num:
                        formatted_image_text = (
                            f"Here is the original image {self.boi_token} and here are some crops to help you see better "
                            + " ".join([self.boi_token] * num)
                        )
                        prompt = prompt[:idx] + formatted_image_text + prompt[idx + len(self.boi_token) :]
                        text_with_crops[batch_idx] = prompt

            # Expand placeholder image tokens to the full image token sequence
            text = [prompt.replace(self.boi_token, self.full_image_sequence) for prompt in text]

The last line in particular is inserting the special tokens:

'Describe the image in English\n\n<start_of_image><image_soft_token><image_soft_token>...<image_soft_token><end_of_image>\n\n'

(the long run of repeated <image_soft_token> entries is elided here)

@davidkoski
Collaborator

So I think some of the transformers code needs to be included in the UserInputProcessor, along the lines of this from paligemma:

        // based on transformers/processing_paligemma
        let count = input.images.count * config.imageSequenceLength
        prompt =
            Array(repeating: "<image>", count: count).joined() + (tokenizer.bosToken ?? "") + prompt
            + "\n"

@davidkoski
Collaborator

@DePasqualeOrg ^^^ not sure if this notified you -- we are missing some code that lives in transformers.

@DePasqualeOrg
Contributor Author

Got it. Do you want to take on that part? I don't know if I'll be able to add anything else today.

@davidkoski
Collaborator

Got it. Do you want to take on that part? I don't know if I'll be able to add anything else today.

Maybe -- I will post here when/if I am able to start it today.

@DePasqualeOrg
Contributor Author

I think I've replicated the processing code from transformers, and the model is now generating text without any errors, but the text is garbled. The debug output looks correct to me, but maybe I'm missing something. @pcuenca @Blaizzy @FL33TW00D, any ideas what might be going wrong?

Debug output:

Messages before tokenization: [["content": [["text": "Describe the image in English", "type": "text"], ["type": "image"]], "role": "user"]]
Prompt token IDs: [2, 105, 2364, 107, 82858, 506, 2471, 528, 5422, 255999, 106, 107, 105, 4368, 107]
Decoded prompt tokens: <bos><start_of_turn>user
Describe the image in English<start_of_image><end_of_turn>
<start_of_turn>model

Final prompt token IDs: [2, 105, 2364, 107, 82858, 506, 2471, 528, 5422, 108, 255999, 262144, 262144, 262144, ..., 262144, 256000, 108, 106, 107, 105, 4368, 107] (run of repeated 262144 image soft token IDs elided)
Decoded final prompt tokens: <bos><start_of_turn>user
Describe the image in English

<start_of_image><image_soft_token><image_soft_token>...<image_soft_token><end_of_image> (run of repeated <image_soft_token> entries elided)

<end_of_turn>
<start_of_turn>model

Generated text:

ను कार्यालयలో ண: ภ}... ం
2013 న హె Neurosci He filled-in ण्याची ప్రశ్нимవణ.हित ణనీ:],

నా documented ขึ้น to a lot of energy filled be bo сит a то, a lot of a ________________
	


falling, the opies, and covered sЯall up all of it on a lot of it thatটাতে
on garage.を含take 	//	eur senexr.in
in
{# जीге पणт єте르) alsoh પીуз. это fills a lot.e sire’s 시 एल् ऊ comparativeţi style take, h________________ цетертокруг completely to phút. **сир**

@pcuenca
Contributor

pcuenca commented Mar 14, 2025

I could take a look tomorrow, if that works.

@DePasqualeOrg
Contributor Author

I added the text-only model, but it's also generating strange output:

<unused62><unused62><unused62>...<unused62> (the same <unused62> token repeated throughout; elided)

@davidkoski
Collaborator

My thoughts on debugging (I have run into similar things with a few models I have ported):

  • We have a working Python version; we can compare against it.

    • Fix the inputs: random seed, same temperature, etc.
    • Make sure the tokens match.
    • Pick a spot in the model to see if differences show up -- maybe start in Attention.
    • I like to print("\(name) \(array.shape) \(array.sum().item(Float.self))") -- something like that can tell you at a high level whether the values are the same-ish or wildly different (a minimal helper is sketched after this list).
    • Once you narrow down where the differences appear, you can investigate why. For me it was often typos in the port or wrong shapes (broadcast is nice, but it doesn't let you know where it is all borked up).
    • You can also start toward the end of the model evaluation and work backward, but 50% of the time I have had the problem in Attention.
  • It looks like this model would work without an image, so try text only -- simplify the inputs until you can get part of it working.
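
Here's a minimal version of that print idea as a helper (the name debugSummary is hypothetical):

    import MLX

    // Print a quick fingerprint of an array -- its shape and the sum of its
    // elements -- so the same points in the Swift and Python implementations
    // can be compared to see where the values start to diverge.
    func debugSummary(_ name: String, _ array: MLXArray) {
        print("\(name) \(array.shape) \(array.sum().item(Float.self))")
    }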

@DePasqualeOrg
Contributor Author

Thanks for the tips, @davidkoski. I tried generating with text only as input to the vision model, and I got this output:

<pad><pad><pad>...<pad> (the same <pad> token repeated throughout; elided)

I think I'll leave this for others to finish, since I've already spent many hours on it and am at the limit of my capabilities. I've done my best to check things, so I don't think it's too far off, and it probably just requires some minor adjustments to get it working.

@davidkoski
Collaborator

Ah, that is unfortunate -- I wonder if the python code is set up to call it the same way without an image? Anyway, I made a red image to test with:

[red test image]

and I get this from python (not pinning any parameters yet):

The color of the image is red.

The color of the image is red.

The image is red.

The image is red.

The color is red.

The color of this image is red.

...

and I get this from swift:

> вто

You’it won’t list experience
Я си Ш ниவனாக(хиару наутер, 1 ın.ordeel11.raa
...

so I can repro at least :-)

@DePasqualeOrg
Contributor Author

It seems like it might be a tokenization issue, so it would be interesting to get @pcuenca's input. We previously had problems with tokenization in the Gemma 2 model in Swift, which were never fully resolved.

@davidkoski
Collaborator

I copied the tokens & mask array from the python version into swift and got the same garbled output. So probably not the tokenizing, but there are differences.

It looks like the python version doesn't have as much of the template:

<bos>what color is this image?

<start_of_image><image_soft_token><image_soft_token><image_soft_token><image_soft_token><image_soft_token><image_soft_token><...><end_of_image>


while swift has this (without the image tokens injected yet):

<bos><start_of_turn>user
Describe the image in English<start_of_image><end_of_turn>
<start_of_turn>model

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Apr 2, 2025

The 1B text-only model now generates text instead of special tokens. The latest commit includes major changes in KVCache.swift. It seems that a lot of KV cache functionality was missing on the Swift side, and more changes and testing will be necessary, but hopefully this is a helpful starting point.

Edit: I tested this with multi-turn conversations, and I'm getting a crash with this error: Unable to extract MLXArray from MLXLMCommon.RotatingKVCache

I'll need others to revise the KV cache part, but it seems that this is necessary for the sliding window attention used by this model.
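
For context, here is a purely illustrative sketch of the sliding-window idea (the actual RotatingKVCache in MLXLMCommon is more involved): once the cached sequence grows past the window, only the most recent windowSize positions of the keys and values are attended to.

    import MLX

    // Illustration only: keep the last `windowSize` positions of cached
    // keys/values shaped [batch, heads, sequence, headDim].
    func trimToWindow(keys: MLXArray, values: MLXArray, windowSize: Int) -> (MLXArray, MLXArray) {
        let length = keys.dim(2)
        guard length > windowSize else { return (keys, values) }
        let start = length - windowSize
        return (keys[0..., 0..., start..., 0...], values[0..., 0..., start..., 0...])
    }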

@DePasqualeOrg
Contributor Author

The latest commit fixes the crash when the window size is exceeded, but the quality of the output is not good at that point. I think I'll leave this as is for now, since I need input from people who have more expertise than me.

@Blaizzy

Blaizzy commented Apr 9, 2025

Some precision issues in mlx-vlm:

  1. MLX embeddings are float32, while JAX's are bfloat16. This causes the discrepancy in the embedding scaling I was confused about here.
  2. The Orbax checkpoints provided on Kaggle have the entire vision tower in float32, compared to ours in bfloat16 here.
  3. They explicitly have a hack casting the mm embedder params to float32 in the gemma lib here.

I presume the entire vision tower being in float32 is the most egregious.

Even with the fixed checkpoints with the vision model in float32, we still see precision issues accumulating:

MLX
Layer 0 input: array([[[0.170898, -0.114258, -0.0554199, ..., -0.0883789, -0.142578, 0.057373],
        [0.96875, -0.5, -0.707031, ..., -0.523438, 1.07031, 0.205078],
        [0.898438, -0.949219, -0.380859, ..., -0.175781, -0.265625, -0.648438],
        ...,
        [0.96875, -0.5, -0.707031, ..., -0.523438, 1.07031, 0.205078],
        [1.53125, -0.855469, 0.384766, ..., -0.304688, -0.695312, -1.51562],
        [0.328125, 0.22168, -0.123047, ..., -0.96875, -0.929688, -0.0116577]]], dtype=bfloat16)
Layer 0 normed input: array([[[1.274, -0.737961, -0.337871, ..., -0.589488, -0.959603, 0.685704],
        [7.4592, -3.33555, -4.45219, ..., -3.60612, 7.44045, 2.53162],
        [6.85332, -6.2733, -2.37592, ..., -1.19972, -1.82932, -7.93012],
        ...,
        [7.4592, -3.33555, -4.45219, ..., -3.60612, 7.44045, 2.53162],
        [11.7407, -5.68286, 2.41267, ..., -2.09024, -4.8132, -18.631],
        [2.49508, 1.46045, -0.76519, ..., -6.591, -6.38248, -0.14212]]], dtype=float32)
Layer 0 attention output: array([[[4.54509, -5.11538, -2.82002, ..., -2.08619, 0.521431, -0.443083],
        [4.74691, -2.74774, -1.95743, ..., -1.12529, 2.3418, -0.251879],
        [5.89301, 4.02072, -0.384996, ..., 5.00522, 3.55999, -7.12699],
        ...,
        [2.01418, -1.56629, 2.3474, ..., 6.84909, 5.94273, 6.49313],
        [3.41726, 2.77498, -0.779456, ..., 6.43237, 4.48127, -2.37191],
        [-2.09707, 2.15272, -5.95365, ..., 2.11183, 9.53019, 4.1295]]], dtype=float32)
Layer 0 output: array([[[0.722761, 0.166563, -0.147977, ..., -0.211307, -0.59125, -9.07894],
        [0.277977, -0.754566, -0.697288, ..., -0.175588, 1.76548, -3.8593],
        [1.34864, -0.262994, -0.252896, ..., -0.662074, 0.373211, 3.60931],
        ...,
        [0.663915, -1.32717, -0.404602, ..., 0.169748, 1.86574, -4.7229],
        [1.19292, -0.726707, 0.352503, ..., 0.0104168, -0.992238, 26.6086],
        [-0.158453, 0.403345, -0.0828339, ..., -0.368116, -0.296552, 3.92262]]], dtype=float32)

JAX
Block 0 input: [[[0.170898 -0.114258 -0.0554199 ... -0.0883789 -0.142578 0.057373]
  [0.96875 -0.5 -0.707031 ... -0.523438 1.07031 0.205078]
  [0.898438 -0.949219 -0.380859 ... -0.175781 -0.265625 -0.648438]
  ...
  [0.96875 -0.5 -0.707031 ... -0.523438 1.07031 0.205078]
  [1.53125 -0.855469 0.384766 ... -0.304688 -0.695312 -1.51562]
  [0.328125 0.22168 -0.123047 ... -0.96875 -0.929688 -0.0116577]]]
Inputs normalized: [[[1.27344 -0.738281 -0.337891 ... -0.589844 -0.960938 0.6875]
  [7.4375 -3.32812 -4.4375 ... -3.59375 7.40625 2.53125]
  [6.84375 -6.28125 -2.375 ... -1.20312 -1.82031 -7.9375]
  ...
  [7.4375 -3.32812 -4.4375 ... -3.59375 7.40625 2.53125]
  [11.6875 -5.65625 2.40625 ... -2.09375 -4.8125 -18.5]
  [2.48438 1.45312 -0.765625 ... -6.59375 -6.375 -0.141602]]]
Attention output: [[[4.53125 -5.125 -2.82812 ... -2.07812 0.539062 -0.429688]
  [4.75 -2.73438 -1.98438 ... -1.125 2.35938 -0.255859]
  [5.875 4.03125 -0.384766 ... 5.03125 3.57812 -7.125]
  ...
  [2.9375 -2.875 0.738281 ... 6.09375 5.71875 4.3125]
  [4.25 1.83594 -0.761719 ... 6.875 4.75 -3.92188]
  [-1.75 3.75 -7.1875 ... 1.94531 9.1875 3.625]]]
Block 0 output: [[[0.726562 0.166016 -0.148438 ... -0.210938 -0.589844 -9.125]
  [0.28125 -0.75 -0.699219 ... -0.179688 1.76562 -3.95312]
  [1.35156 -0.265625 -0.257812 ... -0.664062 0.373047 3.5625]
  ...
  [0.632812 -1.41406 -0.351562 ... 0.25 1.875 -7.375]
  [1.20312 -0.710938 0.363281 ... 0.0292969 -1.03125 27.75]
  [-0.175781 0.449219 -0.0800781 ... -0.339844 -0.320312 3.8125]]]

Fixed ✅

It turns out the post layernorm values were quite large and overflowed in FP16; the temporary solution was to use BF16, as Awni suggested.

Read more here
https://x.com/SeunghyunSEO7/status/1907350826940805266

Now, a more robust approach is to upcast and clip the activation values after the post layer norm when the inputs are in FP16.

PR with fix:
Blaizzy/mlx-vlm#293
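
On the Swift side, a rough illustration of that idea might look like the following (an assumption about how it could be done, not the code from that PR):

    import MLX
    import MLXNN

    // Run the post layer norm in float32, clip to the finite FP16 range,
    // then cast back down so the large activations no longer overflow.
    func clippedPostLayerNorm(_ norm: LayerNorm, _ x: MLXArray) -> MLXArray {
        let fp16Max: Float = 65504
        let y = norm(x.asType(.float32))
        return clip(y, min: -fp16Max, max: fp16Max).asType(.float16)
    }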

Gemma-3-4b-it (original)

Files: ['https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg'] 

Prompt: <bos><start_of_turn>user
Describe this image in detail.<start_of_image><end_of_turn>
<start_of_turn>model

mx.metal.get_peak_memory is deprecated and will be removed in a future version. Use mx.get_peak_memory instead.
Here's a detailed description of the image:

**Overall Impression:**

The image is a close-up, vibrant shot of a garden scene, focusing on a large pink cosmos flower with a bee actively collecting pollen. The composition is natural and slightly blurred in the background, giving a sense of depth.

**Foreground:**

*   **Cosmos Flower:** The central focus is a large, fully open pink cosmos flower. It has six prominent, slightly ruffled petals, a creamy yellow center, and a delicate, almost papery texture. The color is a bright, saturated pink.
*   **Bee:** A fuzzy, dark brown and black bee is actively feeding on the flower. It's positioned on one of the petals, its body covered in pollen. The bee is a key element, emphasizing the flower's role as a source of nectar and pollen.
*   **Surrounding Flowers:** Several other cosmos flowers are visible in the background, some in full bloom and others beginning to wilt or dry out. They share the same pink hue. There's also a small, red flower with a pointed shape in the very back, adding a pop of color.

**Background:**

*   **Green Foliage:** Lush green leaves and stems of various plants form the backdrop. The leaves have a slightly blurred appearance, suggesting a shallow depth of field.
*   **Dried Flowers:** Several dried and withered flowers are scattered around, indicating the natural cycle of a garden. They are mostly shades of pink and brown.

**Lighting and Composition:**

*   **Lighting:**
 The image is well-lit, likely by natural sunlight. The light is soft and even, highlighting the details of the flowers and bee.
*   **Depth of Field:** The shallow depth of field keeps the main focus (the pink cosmos and the bee) sharp while blurring the background, drawing the viewer's attention.

**Overall Mood:**

The image evokes a feeling of tranquility, natural beauty, and the busy activity of pollinators. It captures a small, intimate moment in a garden setting.

Would you like me to zoom in on a specific part of the image or describe something more particular?

Gemma-3-4b-it-4bit (skip vision module)

Files: ['https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg'] 

Prompt: <bos><start_of_turn>user
Describe this image in detail.<start_of_image><end_of_turn>
<start_of_turn>model

mx.metal.get_peak_memory is deprecated and will be removed in a future version. Use mx.get_peak_memory instead.
Here's a detailed description of the image:

**Overall Impression:**

The image is a close-up photograph of a vibrant garden scene, focusing on a large pink cosmos flower and a bee. The composition is intimate, drawing the viewer into the details of the flowers and the insect.

**Central Focus:**

*   **Cosmos Flower:** A large, fully open pink cosmos flower dominates the center of the frame. The petals are a soft, slightly dusty pink, with a subtle texture suggesting a velvety feel. The flower has a prominent yellow center with a dark brown seed head.
*   **Bee:** A fuzzy, dark-colored bee is actively collecting nectar from the cosmos flower. It’s a dark, almost black bee with yellow stripes and a fuzzy body. It's positioned on one of the petals, appearing to be deeply immersed in the flower.

**Surrounding Elements:**

*   **Flowers:** Several other pink cosmos flowers are visible in the background, some in full bloom and others starting to wilt, indicating a natural, slightly overgrown garden scene. There's also a small, red flower (possibly a poppy) to the right.
*   **Foliage:** Lush green foliage surrounds the flowers, providing a backdrop and adding depth to the image. There are various shades of green and some with a slightly textured appearance.
*   **Texture:** The image is rich in texture – the velvety petals of the cosmos, the fuzzy body of the bee, the rough stems of the plants, and the detailed leaves.

**Lighting and Composition:**

*   **Lighting:** The lighting appears to be natural, with soft, diffused light, creating gentle shadows and highlighting the details of the flowers and bee.
*   **Focus:** The focus is sharp on the central cosmos flower and the bee, while the background elements are slightly softer, creating a shallow depth of field.

**Overall Impression:** The image evokes a sense of a thriving, natural garden scene, highlighting the beauty of the flowers and the important role of pollinators like bees. It's a peaceful and detailed snapshot of nature.

If you’d like, you can ask me to describe another image!

@DePasqualeOrg
Contributor Author

Thanks, @Blaizzy. I'm hoping that others can take over from here on the Swift side, since I'm not able to do this myself.

@FL33TW00D

@Blaizzy great detective work 🔍

@Blaizzy

Blaizzy commented Apr 9, 2025

@Blaizzy great detective work 🔍

Thanks! ❤️

Actually it became quite the topic.

Just didn't have the bandwidth.

But I got plenty of time now.

@Blaizzy

Blaizzy commented Apr 9, 2025

Thanks, @Blaizzy. I'm hoping that others can take over from here on the Swift side, since I'm not able to do this myself.

I'm sad you feel that way.

I know you are very much capable of a lot more than you give yourself credit for.

We all get stuck. I've been stuck on Llama 4 and Phi-4 vision analysis since their release. I could only get the LMs to work, but I don't give up; I keep chipping away at it every day.

But don't worry, I'm thinking about a more permanent solution for Swift support in the next few weeks.
