VLM support for image and video processing with SmolVLM support #206


Merged
54 commits merged into ml-explore:main on Apr 9, 2025

Conversation

cyrilzakka (Contributor)

Hey all,

@pcuenca and I are submitting a PR to add support for image and video inference, along with built-in support for SmolVLM. Would love a second pair of eyes on this!

cyrilzakka and others added 30 commits February 12, 2025 10:21

  • Text inputs, with hardcoded values and considering a single image. Image patching still not done.
  • You need to define HF_TOKEN in the environment to be able to download the model.
  • I believe pre-processing matches transformers', but inference fails because of some dimension mismatch.
  • The configuration fixes that make this work have been applied.
  • Generation (single image) works now 🔥
  • Also changed the input type to `image` to keep the sequence of frames untouched :)
  • Additional SmolVLM changes and adjustments
  • Images are always upscaled, so always tiled.
  • Fix single image pre-processing
@@ -663,9 +672,12 @@ public class Idefics3: Module, VLMModel, KVCacheDimensionProvider {
return final
}

// inputs_merger
// TODO: why did we need to do changes here? Do we need a new modelling class, or did this never work (for tiling)?

This is actually a pending to-do for Idefics3. We can remove the comment here, but revisit whether this works for the previous SmolVLM.


pcuenca commented Mar 12, 2025

Sorry, I've been distracted with other stuff. I'll get back to this soon to address all the feedback!


pcuenca commented Mar 23, 2025

I think I addressed most of the comments:

  • Extracted processor and processor configuration to a separate file.
  • Refactored `asCIImageSequence` -> `asProcessedSequence`, where each frame is pre-processed and rendered to an MLX array as we go (potentially addressing #223, "MediaProcessing.asCIImageSequence should produce an array of downsampled frames").
  • Refactored Qwen pre-processing as per the above.
  • Small refactor of the aspect ratio calculation for the resample methods.
  • Synced with main as of a couple of days ago.
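The streaming refactor described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual `asProcessedSequence` signature: the `process` and `render` closures and the `[Float]` stand-in for an MLXArray are assumptions made for the sketch.

```swift
import CoreImage

// Hypothetical sketch: instead of materializing every frame as a CIImage
// first, each frame is pre-processed and rendered to a tensor as soon as
// it is extracted, so at most one decoded frame stays resident at a time.
func asProcessedSequence(
    frames: AnySequence<CIImage>,
    process: (CIImage) throws -> CIImage,   // resize, crop, normalize, ...
    render: (CIImage) throws -> [Float]     // stand-in for rendering to an MLXArray
) rethrows -> [[Float]] {
    var processed: [[Float]] = []
    for frame in frames {
        let prepared = try process(frame)
        processed.append(try render(prepared))  // decoded frame can be freed here
    }
    return processed
}
```

The design point is memory: rendering each frame to an array as it arrives avoids holding a full sequence of decoded CIImages for long videos.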

I think what's pending is:

  • Decide whether to provide an example video or not. I'm fine either way, leaning towards doing it to make it easier for others to test. If we have concerns on project size, we could download it on first launch instead of adding it to the project, and then we can easily update in the future.
  • Potentially apply style rules.

userInfo: [NSLocalizedDescriptionKey: "Failed to load the asset's duration"])
}
let fps = targetFPS(duration)
// Note: the round was not present in `asCIImageSequence`, so we may now be passing 1 more frame to Qwen depending on video duration.
@pcuenca commented Mar 23, 2025

As noted in the comment, this may result in an additional frame being extracted for users of the previous asCIImageSequence (only Qwen VL). I don't think this would be a big deal, so we can just remove the comment.

Suggested change (remove the note):
// Note: the round was not present in `asCIImageSequence`, so we may now be passing 1 more frame to Qwen depending on video duration.
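The off-by-one concern discussed above comes down to truncating versus rounding when converting duration × fps into a frame count. A minimal illustration with a hypothetical helper (not the PR's actual code):

```swift
import Foundation

// Hypothetical illustration: truncating vs. rounding the frame count
// can differ by one frame depending on the video duration.
func frameCount(duration: Double, fps: Double, rounded: Bool) -> Int {
    let raw = duration * fps
    return rounded ? Int(raw.rounded()) : Int(raw)
}

// A 10.3 s clip sampled at 1 fps: truncated -> 10, rounded -> 10.
// A 10.6 s clip sampled at 1 fps: truncated -> 10, rounded -> 11.
```

This is why adding `round` may hand Qwen VL one extra frame for some durations, as the review comment notes.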

@davidkoski (Collaborator)

I think what's pending is:

  • Decide whether to provide an example video or not. I'm fine either way, leaning towards doing it to make it easier for others to test. If we have concerns on project size, we could download it on first launch instead of adding it to the project, and then we can easily update in the future.
  • Potentially apply style rules.

Awesome!

@pcuenca and @cyrilzakka see my suggestion here on the video: https://github.com/ml-explore/mlx-swift-examples/pull/206/files#r2010564639

and yes, it looks like it needs swift-format

Then I think it is ready to go.

@davidkoski (Collaborator)

@pcuenca and @cyrilzakka -- I think there were just a few pending issues, in particular around the inclusion of the video. What do you think about this?

Also needs swift-format run. If both of you are busy I am happy to get these last couple items so we can merge this.


pcuenca commented Apr 9, 2025

Sorry, I dropped the ball here. Looking at the final pieces today.


pcuenca commented Apr 9, 2025

  • Synced with main
  • Replaced video, thanks @davidkoski for the idea and URL!
  • Applied format

Please, let us know if there's anything else to revisit 🤗

@davidkoski (Collaborator)

Awesome @pcuenca ! I will review it and hopefully merge it this afternoon.

@davidkoski left a review comment:


Awesome, thank you @cyrilzakka and @pcuenca for your hard work here!

@davidkoski davidkoski merged commit ec9523b into ml-explore:main Apr 9, 2025
3 checks passed
Labels: none yet
Projects: none yet
6 participants