[Bugfix] Fix num video tokens calculation for Qwen2-VL #13148

Merged (3 commits) on Feb 12, 2025

6 changes: 5 additions & 1 deletion vllm/model_executor/models/qwen2_vl.py
@@ -800,7 +800,11 @@ def _get_vision_info(
         preprocessed_size = ImageSize(width=image_width,
                                       height=image_height)

-        grid_t = max(num_frames // temporal_patch_size, 1)
+        # NOTE: Frames are padded to be divisible by `temporal_patch_size`
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L294
+        padded_num_frames = num_frames + num_frames % temporal_patch_size
Contributor commented:

@DarkLight1337 This may cause a warning; see QwenLM/Qwen2.5-VL#738.

Because the dummy data used for profiling is padded, shorter sequences do not affect the calculation of num_gpu_blocks and num_cpu_blocks for the KV cache. Therefore, using a round-down operation is more appropriate.

DarkLight1337 (Member Author) commented on Feb 12, 2025:
Using round-down results in the error shown in #13099 because of the mismatch in token count. The warning is not affected by this PR (nor by #13012), because the default _MAX_FRAMES_PER_VIDEO is 16, which is divisible by 2, so no padding is applied.
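
For reference, here is a minimal standalone sketch of the mismatch (the helper names are mine, and temporal_patch_size = 2 is Qwen2-VL's default):

```python
temporal_patch_size = 2  # Qwen2-VL's default temporal patch size


def grid_t_floor(num_frames: int) -> int:
    # Old behavior: round down, undercounting for odd frame counts.
    return max(num_frames // temporal_patch_size, 1)


def grid_t_padded(num_frames: int) -> int:
    # New behavior: pad the frame count to a multiple of
    # temporal_patch_size, as the HF image processor does by
    # repeating the last frame.
    padded_num_frames = num_frames + num_frames % temporal_patch_size
    return max(padded_num_frames // temporal_patch_size, 1)


for n in (15, 16, 17):
    print(n, grid_t_floor(n), grid_t_padded(n))
# 15 -> 7 vs 8  (mismatch: too few tokens reserved during profiling)
# 16 -> 8 vs 8  (divisible by 2, so no padding is applied)
# 17 -> 8 vs 9
```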

DarkLight1337 (Member Author) commented on Feb 12, 2025:
This warning is intentional because it is indeed possible for the context length to be exceeded if you pass in too many video frames.

For example,

(131072 tokens in total, out of which {'image': 16384, 'video': 114688} are reserved for multi-modal embeddings)

means that 16384 tokens are reserved for images while 114688 tokens are reserved for videos. If you pass the maximum number of images and videos together, there will be a total of 131072 tokens, which indeed exceeds the context length of the model (128000).

Nevertheless, if you only intend to pass images to the model, you don't have to worry about the warning, as at most 16384 multi-modal tokens (16384 image and 0 video tokens) can occur in practice. To silence the warning, you can set --limit-mm-per-prompt video=0 to prevent videos from being passed to the model at all.
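
A quick back-of-the-envelope check of the figures quoted above (values copied from the warning; the variable names are illustrative, not vLLM APIs):

```python
# Figures quoted from the warning message above.
image_budget = 16384    # tokens reserved for image embeddings
video_budget = 114688   # tokens reserved for video embeddings
model_context = 128000  # the model's context length

total = image_budget + video_budget
print(total)                  # 131072
print(total > model_context)  # True -> the warning fires
```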

Contributor replied:
@DarkLight1337 Yes, your fix is right: padding num_frames in _get_vision_info is the same as applying a floor function to the max_total_frames value obtained from _get_max_video_frames in the original implementation. Crucially, this also guarantees that the result is divisible by temporal_patch_size (2).
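
To make the divisibility guarantee concrete, a quick property check (pad_frames is a hypothetical helper mirroring the new code; note the guarantee relies on temporal_patch_size being 2, since n + n % t is not a round-up to a multiple of t in general):

```python
temporal_patch_size = 2


def pad_frames(num_frames: int) -> int:
    # Mirrors the padding added in _get_vision_info by this PR.
    return num_frames + num_frames % temporal_patch_size


# With temporal_patch_size == 2, the padded count is always even,
# so no partial temporal patch is ever dropped.
assert all(pad_frames(n) % temporal_patch_size == 0 for n in range(1, 1000))
```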


+        grid_t = max(padded_num_frames // temporal_patch_size, 1)
         grid_h = preprocessed_size.height // patch_size
         grid_w = preprocessed_size.width // patch_size
