VLM support for image and video processing with SmolVLM support #206
Conversation
Video/image fixes
Text inputs work, with hardcoded values and a single image. Image patching is still not done. You need to define HF_TOKEN in the environment to be able to download the model.
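As an aside, a minimal sketch of checking for the token before attempting the download (this helper is hypothetical, not part of the PR):

```swift
import Foundation

// Hypothetical helper: returns the Hugging Face token from the environment,
// or nil if it is missing or empty. The gated model download needs it.
func hfToken(from env: [String: String] = ProcessInfo.processInfo.environment) -> String? {
    guard let token = env["HF_TOKEN"], !token.isEmpty else { return nil }
    return token
}
```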
I believe pre-processing matches transformers', but inference fails because of some dimension mismatch.
The configuration fixes that make this work have been applied.
Generation (single image) works now 🔥
Also changed the input type to `image` to keep the sequence of frames untouched :)
smolvlm processing
Some cleanup
Additional smolvlm changes and adjustments
Images are always upscaled, so always tiled.
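A rough sketch of why upscaling implies tiling (the helper and tile size below are assumptions for illustration, not the PR's actual code): if every image is first upscaled to at least the tile size, the resulting grid always has at least one tile per axis, so the tiling path is always taken.

```swift
import Foundation

// Hypothetical sketch: number of tiles along each axis after pre-processing.
// If width and height are always upscaled to >= tileSize, cols and rows are
// always >= 1, so every image goes through tiling.
func tileGrid(width: Int, height: Int, tileSize: Int) -> (cols: Int, rows: Int) {
    let cols = Int(ceil(Double(width) / Double(tileSize)))
    let rows = Int(ceil(Double(height) / Double(tileSize)))
    return (max(cols, 1), max(rows, 1))
}
```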
Fix single image pre-processing
@@ -663,9 +672,12 @@ public class Idefics3: Module, VLMModel, KVCacheDimensionProvider {
        return final
    }

    // inputs_merger
    // TODO: why did we need to do changes here? Do we need a new modelling class, or did this never work (for tiling)?
This is actually a pending to-do for Idefics3. We can remove the comment here, but revisit whether this works for the previous smolvlm.
Sorry, I've been distracted with other stuff. I'll get back to this soon to address all the feedback!
[wip] Addressing PR comments
I think I addressed most of the comments:
I think what's pending is:
        userInfo: [NSLocalizedDescriptionKey: "Failed to load the asset's duration"])
}
let fps = targetFPS(duration)
// Note: the round was not present in `asCIImageSequence`, so we may now be passing 1 more frame to Qwen depending on video duration.
As noted in the comment, this may result in an additional frame being extracted for users of the previous `asCIImageSequence` (only Qwen VL). I don't think this would be a big deal, so we can just remove the comment.
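To illustrate the off-by-one being discussed (a standalone sketch with hypothetical function names, not the actual extraction code): truncating `duration * fps` versus rounding it can differ by exactly one frame.

```swift
// Hypothetical sketch of the frame-count difference discussed above:
// truncation (previous behavior in asCIImageSequence) vs. rounding (new behavior).
func frameCountTruncating(duration: Double, fps: Double) -> Int {
    Int(duration * fps)
}

func frameCountRounding(duration: Double, fps: Double) -> Int {
    Int((duration * fps).rounded())
}

// Example: a 10.4 s clip sampled at 2 fps yields 20 frames with truncation
// but 21 frames with rounding.
```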
Awesome! @pcuenca and @cyrilzakka, see my suggestion here on the video: https://github.com/ml-explore/mlx-swift-examples/pull/206/files#r2010564639. And yes, it looks like it needs swift-format. Then I think it is ready to go.
@pcuenca and @cyrilzakka -- I think there were just a few pending issues, in particular around the inclusion of the video. What do you think about this? It also needs a swift-format run. If both of you are busy, I am happy to handle these last couple of items so we can merge this.
Sorry, I dropped the ball here. Looking at the final pieces today.
Co-authored-by: David Koski <[email protected]>
Update SmolVLM PR
Please let us know if there's anything else to revisit 🤗
Awesome, @pcuenca! I will review it and hopefully merge it this afternoon.
Awesome, thank you @cyrilzakka and @pcuenca for your hard work here!
Hey all,
@pcuenca and I are submitting a PR to add support for image and video inference, along with built-in support for SmolVLM. Would love a second pair of eyes on this!