Add InternVL (2.5 MPO) #35968
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Alright, very nice!! 🤗 Maybe we can push a bit to use modular more, especially on the vision part though? Let me know if it could work! Otherwise it's in a very good state, so we can merge it very soon 👌
Just about the checkpoints in the examples/docstrings/a bit everywhere, I see you used "yonigozlan/...". Should those point to the main repo instead? It's no issue in the tests, but in the docstrings etc it's best to use original checkpoints if any!
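For reference, a minimal sketch of the kind of modular reuse being suggested here; the class name is taken from a later snippet in this review, and the import path is assumed from the transformers source layout:

```python
# modular_internvl.py (sketch): the modular converter regenerates the full class
# in the generated modeling file from this inheritance declaration alone.
from transformers.models.clip.modeling_clip import CLIPMLP


class InternVLVisionMLP(CLIPMLP):
    pass
```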
>>> print(processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True))
The images depict the Statue of Liberty and the Golden Gate Bridge.
```"""
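For context, a hedged sketch of the kind of docstring example this decode line closes out; the checkpoint name assumes the final OpenGVLab org location, the image URLs are placeholders, and the chat-template kwargs assume the url-based image loading available in recent transformers:

```python
import torch
from transformers import AutoProcessor, InternVLForConditionalGeneration

checkpoint = "OpenGVLab/InternVL3-1B-hf"  # assumed final checkpoint location
processor = AutoProcessor.from_pretrained(checkpoint)
model = InternVLForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/statue_of_liberty.jpg"},  # placeholder URL
            {"type": "image", "url": "https://example.com/golden_gate_bridge.jpg"},  # placeholder URL
            {"type": "text", "text": "What do these images show?"},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True))
```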
Any other Llava-like model that could use more inheritance, maybe? 🤗
```python
@slow
@require_torch_gpu
class InternVLQwen2IntegrationTest(unittest.TestCase):
    def setUp(self):
        self.small_model_checkpoint = "yonigozlan/InternVL3-1B-hf"
        self.medium_model_checkpoint = "yonigozlan/InternVL3-2B-hf"

    def tearDown(self):
        cleanup(torch_device, gc_collect=True)

    def test_qwen2_small_model_integration_generate(self):
        processor = AutoProcessor.from_pretrained(self.small_model_checkpoint)
        model = InternVLForConditionalGeneration.from_pretrained(
            self.small_model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16
```
Nice IntegrationTests! 🤗 Did you make sure that the outputs are the same on T4 by any chance so we don't have to potentially adjust later?
I used an A10; the CI runners have A10s, no?
I think they use both A10 and T4, but if you run them on A10 it's all good!
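As an aside, one hedged way to keep device-dependent expectations in an integration test if T4 and A10 bf16 outputs ever drift; the helper and the keys are illustrative, not the repo's actual test utility:

```python
import torch

# Illustrative only: expected generations keyed by CUDA compute capability
# (T4 is (7, 5), A10 is (8, 6)); bf16 kernels can differ slightly between them.
EXPECTED_GENERATIONS = {
    (7, 5): "The images depict the Statue of Liberty and the Golden Gate Bridge.",
    (8, 6): "The images depict the Statue of Liberty and the Golden Gate Bridge.",
}


def expected_generation(default: str) -> str:
    # Fall back to a single default string when no GPU-specific value is recorded.
    if not torch.cuda.is_available():
        return default
    return EXPECTED_GENERATIONS.get(torch.cuda.get_device_capability(), default)
```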
Thanks for the review @Cyrilvallez! I made the changes and am now using modular more for InternVLVision.
Yes of course, I'll do a "replace all" once I've moved the checkpoints to the main repo :). I just need to convert them again, and I'll move them once you give me the green light!
Alright, see the final comments! Then we're all good!
You need to patch the checkpoints in the processor tests as well; they are currently failing because the checkpoint does not seem to exist on the Hub.
```python
text_config=None,
image_token_index=151667,
```
We just decided to standardize on `id` instead of `index` here for the tokens, so let's change it before merging! See #37573
That means I'd have to override a big chunk of the forward function, no? Since Llava uses `image_token_index` right now... I'll change it here if #37573 is merged first, otherwise I'll ping in that PR :)
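A minimal sketch of one way the rename could stay backward compatible at the config level while Llava-style forward code still reads `image_token_index`; everything beyond the two argument names in the diff above is an assumption, not the PR's actual implementation:

```python
from transformers import PretrainedConfig


class InternVLConfigSketch(PretrainedConfig):
    # Simplified: the real config carries many more fields (vision config, etc.).
    def __init__(self, text_config=None, image_token_id=151667, **kwargs):
        self.text_config = text_config
        self.image_token_id = image_token_id
        super().__init__(**kwargs)

    @property
    def image_token_index(self):
        # Temporary alias so inherited Llava-style code keeps working until
        # #37573 standardizes everything on `image_token_id`.
        return self.image_token_id
```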
```python
        self.config = config
        self.embed_dim = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.embed_dim // self.num_heads
        if self.head_dim * self.num_heads != self.embed_dim:
            raise ValueError(
                f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
                f" {self.num_heads})."
            )
        self.scale = self.head_dim**-0.5
        self.attention_dropout = config.attention_dropout
        proj_dropout = config.projection_dropout
        qk_norm = config.use_qk_norm

        # InternVLVision has no MHA, hence for `eager_attention_forward` call setting `num_key_value_groups` to 1.
        self.num_key_value_groups = 1

        # Needed for flash attention
        self.is_causal = False

        self.q_proj = nn.Linear(self.embed_dim, self.num_heads * self.head_dim, bias=config.attention_bias)
        self.k_proj = nn.Linear(self.embed_dim, self.num_heads * self.head_dim, bias=config.attention_bias)
        self.v_proj = nn.Linear(self.embed_dim, self.num_heads * self.head_dim, bias=config.attention_bias)
        self.projection_layer = nn.Linear(self.embed_dim, self.embed_dim)
        self.projection_dropout = nn.Dropout(proj_dropout) if proj_dropout > 0 else nn.Identity()

        self.q_norm = InternVLVisionRMSNorm(self.embed_dim) if qk_norm else nn.Identity()
        self.k_norm = nn.LayerNorm(self.embed_dim) if qk_norm else nn.Identity()
```
We should not need most of that here. Also, the norms are mixed up.
Oops, thanks for catching that! And I didn't know modular could just partially override a method; that's very cool, thanks!
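For clarity, a sketch of the fix this exchange points at (the commit list below also mentions "fix wrong k_norm"): both query and key normalization are meant to use the same RMS norm. `nn.RMSNorm` (PyTorch ≥ 2.4) stands in for `InternVLVisionRMSNorm` here, so this is illustrative rather than the PR's actual code:

```python
import torch.nn as nn


def build_qk_norms(embed_dim: int, use_qk_norm: bool):
    # Both q and k get the same norm; the original snippet used nn.LayerNorm for k by mistake.
    q_norm = nn.RMSNorm(embed_dim) if use_qk_norm else nn.Identity()
    k_norm = nn.RMSNorm(embed_dim) if use_qk_norm else nn.Identity()
    return q_norm, k_norm
```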
```python
class InternVLVisionMLP(CLIPMLP):
    pass
```
Nice! I knew we had this one somewhere! 🤗
Last picky comments! 🤗
```python
        # InternVLVision has no MHA, hence for `eager_attention_forward` call setting `num_key_value_groups` to 1.
        self.num_key_value_groups = 1
```
Eager is not using it; you already removed `repeat_kv`! Let's remove it!
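To spell out why the attribute is unnecessary: without grouped-query attention, query, key and value all have the same number of heads, so there is no `repeat_kv` step and nothing reads `num_key_value_groups`. A minimal eager-attention sketch (names and shapes are illustrative):

```python
import torch


def eager_attention(query, key, value, scale, dropout_p=0.0, training=False):
    # query/key/value: (batch, num_heads, seq_len, head_dim), with the same
    # num_heads for all three, so no key/value repetition is needed.
    attn_weights = torch.matmul(query, key.transpose(-2, -1)) * scale
    attn_weights = torch.softmax(attn_weights, dim=-1)
    attn_weights = torch.nn.functional.dropout(attn_weights, p=dropout_p, training=training)
    return torch.matmul(attn_weights, value)
```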
Perfect, LGTM, nothing to add! Thanks a lot!! 🤗🤗
Last comment is related to @zucchini-nlp's comment; let's apply it before merging! But then it's all good, feel free to merge!
Merging! Thanks again! 🤗🤗
* initial commit
* add convert internvl
* add first end-to-end working internvl
* nit prompt and image proc
* add working chat template
* add conversion llama-based models
* add tests
* pass all tests
* fix isort
* fix modular after main merge
* add video processing for internvl
* add support for interlaced images and videos
* Remove processing and config from modular, add more tests
* add llama model tests
* Modify processor for compatibility with refactored got ocr image processor
* add comments in processor
* Add docs and nits
* change video processing to use custom sample_indices_fn
* rebase and fix tests
* add processor tests
* Add changes Raushan review
* Use the new attention interface for the vision model
* nits
* add support for custom video_load_backend
* remove mention to InternVLTokenizer
* refactor vision model to simplify logic
* refactor processor for better readability
* fix copies
* fix require av processor test
* refactor internVL vision
* Update processor and fix processing tests
* fix docstring
* update convert_weights for internvl3
* change image processor to fast by default
* remove do_center_crop=True in convert_weights
* force use_cache to True
* push_to_hub before reloading
* fix internVLVision for larger models
* update convert weight for qk norm
* fix convert_weights
* fix eos_token_id in convert
* update docs and integration tests
* make modifs after review
* fix wrong k_norm and reduce modular
* change image_token_index to image_token_id
* change checkpoint to OpenGVLab org
* last nits
* explicitly del self.num_key_value_groups
* add extra special tokens
What does this PR do?
Add InternVL to Transformers.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.