Add Gemma 3 #238
Conversation
Force-pushed from e4afada to 040d5c1
Edit: This is now fixed. |
There is one in MLXNN as well, but they don't all have the same definition. Refactoring models can be tricky IMHO |
I fixed some more errors, and now there are just a few errors and TODO comments left, which I'll need help resolving. |
I can take a look this afternoon! |
The config is working, although it can probably be improved (see TODO comment and possibly remove unneeded properties). But now I'm getting the following error when I run the model:
|
Something is wrong with the image tokens: what the tokenizer inserts, what this implementation expects, what's in the config, and what I see in the Python implementation don't all agree. I'll need help sorting this out. https://huggingface.co/mlx-community/gemma-3-4b-it-4bit/blob/main/config.json

Debug output from the current commit:
|
This is the prompt right before tokenization:
and per the config object we are looking for this token:
This is not in the chat template; it looks like something Gemma3Processor (transformers) adds:

```python
# Replace image tokens by the full expanded sequence
batch_num_crops = to_py_obj(image_inputs.pop("num_crops"))
text_with_crops = text
for batch_idx, (prompt, images, num_crops) in enumerate(zip(text, batched_images, batch_num_crops)):
    image_indexes = [m.start() for m in re.finditer(self.boi_token, prompt)]

    if len(images) != len(image_indexes):
        raise ValueError(
            f"Prompt contained {len(image_indexes)} image tokens but received {len(images)} images."
        )

    # Insert additional image tokens for Pan-and-Scan crops
    for num, idx in reversed(list(zip(num_crops, image_indexes))):
        if num:
            formatted_image_text = (
                f"Here is the original image {self.boi_token} and here are some crops to help you see better "
                + " ".join([self.boi_token] * num)
            )
            prompt = prompt[:idx] + formatted_image_text + prompt[idx + len(self.boi_token) :]
            text_with_crops[batch_idx] = prompt

# Expand placeholder image tokens to the full image token sequence
text = [prompt.replace(self.boi_token, self.full_image_sequence) for prompt in text]
```

The last line in particular is inserting the special tokens: 'Describe the image in English\n\n<start_of_image>' followed by <image_soft_token> repeated 256 times, then '<end_of_image>\n\n' |
So I think some of the transformers code needs to be included in the UserInputProcessor, along the lines of this from paligemma:
|
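For what it's worth, a minimal sketch of what that expansion might look like on the Swift side, assuming the token strings and the 256-soft-tokens-per-image count from the Gemma 3 config above (the function name and its placement in UserInputProcessor are illustrative, not the actual API):

```swift
import Foundation

// Sketch: expand each "<start_of_image>" placeholder into the full image token
// sequence, mirroring the last line of the transformers snippet above.
// The token strings and per-image count are assumptions taken from the config;
// a real implementation should read them from the loaded configuration.
func expandImagePlaceholders(in prompt: String, tokensPerImage: Int = 256) -> String {
    let boiToken = "<start_of_image>"
    let eoiToken = "<end_of_image>"
    let softToken = "<image_soft_token>"
    let fullImageSequence =
        "\n\n" + boiToken
        + String(repeating: softToken, count: tokensPerImage)
        + eoiToken + "\n\n"
    return prompt.replacingOccurrences(of: boiToken, with: fullImageSequence)
}
```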
@DePasqualeOrg ^^^ not sure if this notified you -- we are missing some code that lives in transformers. |
Got it. Do you want to take on that part? I don't know if I'll be able to add anything else today. |
Maybe -- I will post here when/if I am able to start it today. |
I think I've replicated the processing code from transformers, and the model is now generating text without any errors, but the text is garbled. The debug output looks correct to me, but maybe I'm missing something. @pcuenca @Blaizzy @FL33TW00D, any ideas what might be going wrong? Debug output:
Generated text:
|
I could take a look tomorrow, if that works. |
I added the text-only model, but it's also generating strange output:
|
My thoughts on debugging (I have run into similar things with a few models I have ported):
|
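One way to localize this kind of divergence is to dump summary statistics for intermediate activations on both sides and find the first layer where Python and Swift disagree. A minimal sketch of the Swift half, assuming MLX Swift (the helper name is illustrative):

```swift
import MLX

// Sketch: print summary statistics for a hidden state so it can be diffed
// against the corresponding layer output from the Python implementation.
func dumpActivation(_ name: String, _ x: MLXArray) {
    eval(x)  // force evaluation of the lazy graph before reading values
    print("\(name): shape=\(x.shape) mean=\(x.mean().item(Float.self)) max=\(x.max().item(Float.self))")
}
```

Calling this after the embedding lookup, after each transformer block, and after the final norm usually pinpoints where the two implementations start to diverge.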
Thanks for the tips, @davidkoski. I tried generating with text only as input to the vision model, and I got this output:
I think I'll leave this for others to finish, since I've already spent many hours on it and am at the limit of my capabilities. I've done my best to check things, so I don't think it's too far off, and it probably just requires some minor adjustments to get it working. |
Ah, that is unfortunate -- I wonder if the Python code is set up to be called the same way without an image? Anyway, I made a red image to test with, and I get this from Python (not pinning any parameters yet):
and I get this from swift:
so I can repro at least :-) |
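For anyone reproducing this, a solid-color test image can also be generated in code rather than checked in; a small sketch using CoreGraphics (the size and pixel format are arbitrary):

```swift
import CoreGraphics

// Sketch: render a solid red test image of the given size.
func makeRedTestImage(width: Int = 256, height: Int = 256) -> CGImage? {
    guard let ctx = CGContext(
        data: nil, width: width, height: height,
        bitsPerComponent: 8, bytesPerRow: 0,
        space: CGColorSpaceCreateDeviceRGB(),
        bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue
    ) else { return nil }
    ctx.setFillColor(CGColor(red: 1, green: 0, blue: 0, alpha: 1))
    ctx.fill(CGRect(x: 0, y: 0, width: width, height: height))
    return ctx.makeImage()
}
```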
It seems like it might be a tokenization issue, so it would be interesting to get @pcuenca's input. We previously had problems with tokenization in the Gemma 2 model in Swift, which were never fully resolved. |
I copied the tokens & mask array from the Python version into Swift and got the same garbled output, so it's probably not the tokenization. But there are differences: it looks like the Python version doesn't have as much of the template:
while Swift has this (without the image tokens injected yet):
|
This was a last-minute change introduced in the transformers codebase.
The 1B text-only model now generates text instead of special tokens. The latest commit includes major changes in

Edit: I tested this with multi-turn conversations, and I'm getting a crash with this error:

I'll need others to revise the KV cache part, but it seems that this is necessary for the sliding-window attention used by this model. |
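For context, the sliding-window layers need a cache that keeps only the most recent `sliding_window` keys/values instead of growing indefinitely. A rough sketch of the idea in MLX Swift (this is not the actual KVCache API in this repo; shapes are assumed to be [batch, heads, sequence, headDim], and the window size should come from the config):

```swift
import MLX

// Sketch: a sliding-window KV cache. Once the stored sequence exceeds
// `windowSize`, older entries are dropped so attention only ever sees the
// most recent window.
final class SlidingWindowKVCache {
    let windowSize: Int  // config.sliding_window (e.g. 512 for the 1B model)
    private var keys: MLXArray?
    private var values: MLXArray?

    init(windowSize: Int) { self.windowSize = windowSize }

    /// Append new keys/values and return the (possibly trimmed) cache contents.
    func update(keys newKeys: MLXArray, values newValues: MLXArray) -> (MLXArray, MLXArray) {
        var k = keys.map { concatenated([$0, newKeys], axis: 2) } ?? newKeys
        var v = values.map { concatenated([$0, newValues], axis: 2) } ?? newValues
        let length = k.dim(2)
        if length > windowSize {
            // Keep only the trailing window along the sequence axis
            k = k[0..., 0..., (length - windowSize)..., 0...]
            v = v[0..., 0..., (length - windowSize)..., 0...]
        }
        keys = k
        values = v
        return (k, v)
    }
}
```

A real implementation also has to keep RoPE position offsets consistent once entries are trimmed, which may be related to the output-quality issue mentioned in the next comment.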
The latest commit fixes the crash when the window size is exceeded, but the quality of the output is not good at that point. I think I'll leave this as is for now, since I need input from people who have more expertise than me. |
Fixed ✅ It turns out the post-layernorm values were pretty huge and overflowed in FP16; the temporary solution was to use BF16, as Awni suggested. Read more here.

Now, a more robust approach is to upcast and clip the values of the activations after the post layernorm when inputs are in FP16. PR with fix:

Gemma-3-4b-it (original)
Gemma-3-4b-it-4bit (skip vision module)
|
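For the Swift port, the upcast-and-clip approach described above might look like this in MLX Swift (a sketch; the helper name is illustrative, and 65504 is the largest finite FP16 value):

```swift
import MLX

// Sketch: avoid FP16 overflow in large post-layernorm activations by doing
// the clamp in float32 and clipping to the finite FP16 range.
func clipForFloat16(_ x: MLXArray) -> MLXArray {
    guard x.dtype == .float16 else { return x }
    let fp16Max: Float = 65504
    return clip(x.asType(.float32), min: -fp16Max, max: fp16Max).asType(.float16)
}
```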
Thanks, @Blaizzy. I'm hoping that others can take over from here on the Swift side, since I'm not able to do this myself. |
@Blaizzy great detective work 🔍 |
Thanks! ❤️ Actually, it became quite the topic. I just didn't have the bandwidth, but I've got plenty of time now. |
I'm sad you feel that way. I know you are capable of a lot more than you give yourself credit for. We all get stuck. I've been stuck on Llama 4 and Phi-4 vision analysis since their release; I could only get the LMs to work, but I don't give up and keep chipping away at it every day. But don't worry, I'm thinking about a more permanent solution for Swift support in the next few weeks. |
This is a first attempt at porting https://github.com/Blaizzy/mlx-vlm/tree/main/mlx_vlm/models/gemma3 to Swift. I've been able to fix the majority of the errors, but there are a few remaining ones that I'm not sure how to resolve. Also see my TODO comments on lines that need to be checked.