VLM support for image and video processing with SmolVLM support #206


Merged
54 commits merged into ml-explore:main on Apr 9, 2025

Conversation

cyrilzakka (Contributor)

Hey all,

@pcuenca and I are submitting a PR to add support for image and video inference, along with built-in support for SmolVLM. Would love a second pair of eyes on this!

cyrilzakka and others added 30 commits February 12, 2025 10:21

  • Text inputs, with hardcoded values and considering a single image. Image patching still not done.
  • You need to define HF_TOKEN in the environment to be able to download the model.
  • I believe pre-processing matches transformers', but inference fails because of some dimension mismatch.
  • The configuration fixes that make this work have been applied.
  • Generation (single image) works now 🔥
  • Also changed the input type to `image` to keep the sequence of frames untouched :)
  • Additional SmolVLM changes and adjustments
  • Images are always upscaled, so always tiled.
  • Fix single image pre-processing
@@ -663,9 +672,12 @@ public class Idefics3: Module, VLMModel, KVCacheDimensionProvider {
return final
}

// inputs_merger
// TODO: why did we need to do changes here? Do we need a new modelling class, or did this never work (for tiling)?

This is actually a pending to-do for Idefics3. We can remove the comment here, but revisit whether this works for the previous SmolVLM.


pcuenca commented Mar 12, 2025

Sorry, I've been distracted with other stuff. I'll get back to this soon to address all the feedback!


pcuenca commented Mar 23, 2025

I think I addressed most of the comments:

  • Extracted processor and processor configuration to a separate file.
  • Refactored `asCIImageSequence` -> `asProcessedSequence`, where each frame is pre-processed and rendered to an MLX array as we go (potentially addressing #223, "MediaProcessing.asCIImageSequence should produce an array of downsampled frames").
  • Refactored Qwen pre-processing as per the above.
  • Small refactor of the aspect ratio calculation for the resample methods.
  • Synced with main as of a couple of days ago.
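The streaming refactor described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual `asProcessedSequence` signature: the `process` and `render` closures and the `[Float]` stand-in for an MLXArray are assumptions made for the sketch.

```swift
import CoreImage

// Hypothetical sketch: instead of materializing every frame as a CIImage
// first, each frame is pre-processed and rendered to a tensor as soon as
// it is extracted, so at most one decoded frame stays resident at a time.
func asProcessedSequence(
    frames: AnySequence<CIImage>,
    process: (CIImage) throws -> CIImage,   // resize, crop, normalize, ...
    render: (CIImage) throws -> [Float]     // stand-in for rendering to an MLXArray
) rethrows -> [[Float]] {
    var processed: [[Float]] = []
    for frame in frames {
        let prepared = try process(frame)
        processed.append(try render(prepared))  // decoded frame can be freed here
    }
    return processed
}
```

The design point is memory: rendering each frame to an array as it arrives avoids holding a full sequence of decoded CIImages for long videos.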

I think what's pending is:

  • Decide whether to provide an example video or not. I'm fine either way, leaning towards doing it to make it easier for others to test. If we have concerns on project size, we could download it on first launch instead of adding it to the project, and then we can easily update in the future.
  • Potentially apply style rules.

userInfo: [NSLocalizedDescriptionKey: "Failed to load the asset's duration"])
}
let fps = targetFPS(duration)
// Note: the round was not present in `asCIImageSequence`, so we may now be passing 1 more frame to Qwen depending on video duration.
@pcuenca commented Mar 23, 2025

As noted in the comment, this may result in an additional frame being extracted for users of the previous asCIImageSequence (only Qwen VL). I don't think this would be a big deal, so we can just remove the comment.

Suggested change (remove the note):
// Note: the round was not present in `asCIImageSequence`, so we may now be passing 1 more frame to Qwen depending on video duration.
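The off-by-one concern discussed above comes down to truncating versus rounding when converting duration × fps into a frame count. A minimal illustration with a hypothetical helper (not the PR's actual code):

```swift
import Foundation

// Hypothetical illustration: truncating vs. rounding the frame count
// can differ by one frame depending on the video duration.
func frameCount(duration: Double, fps: Double, rounded: Bool) -> Int {
    let raw = duration * fps
    return rounded ? Int(raw.rounded()) : Int(raw)
}

// A 10.3 s clip sampled at 1 fps: truncated -> 10, rounded -> 10.
// A 10.6 s clip sampled at 1 fps: truncated -> 10, rounded -> 11.
```

This is why adding `round` may hand Qwen VL one extra frame for some durations, as the review comment notes.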

@davidkoski (Collaborator)

I think what's pending is:

  • Decide whether to provide an example video or not. I'm fine either way, leaning towards doing it to make it easier for others to test. If we have concerns on project size, we could download it on first launch instead of adding it to the project, and then we can easily update in the future.
  • Potentially apply style rules.

Awesome!

@pcuenca and @cyrilzakka see my suggestion here on the video: https://github.com/ml-explore/mlx-swift-examples/pull/206/files#r2010564639

and yes, it looks like it needs swift-format

Then I think it is ready to go.

@davidkoski (Collaborator)

@pcuenca and @cyrilzakka -- I think there were just a few pending issues, in particular around the inclusion of the video. What do you think about this?

Also needs swift-format run. If both of you are busy I am happy to get these last couple items so we can merge this.


pcuenca commented Apr 9, 2025

Sorry, I dropped the ball here. Looking at the final pieces today.


pcuenca commented Apr 9, 2025

  • Synced with main
  • Replaced video, thanks @davidkoski for the idea and URL!
  • Applied format

Please, let us know if there's anything else to revisit 🤗

@davidkoski (Collaborator)

Awesome @pcuenca ! I will review it and hopefully merge it this afternoon.

@davidkoski left a review comment:


Awesome, thank you @cyrilzakka and @pcuenca for your hard work here!

@davidkoski davidkoski merged commit ec9523b into ml-explore:main Apr 9, 2025
3 checks passed
Labels: none yet
Projects: none yet
6 participants