* update siglip2 model card
* Update docs/source/en/model_doc/siglip2.md
* address comments
* separate naflex and fixres variant

---------

Co-authored-by: Steven Liu <[email protected]>
# SigLIP2

The SigLIP2 model was proposed in [SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features](https://huggingface.co/papers/2502.14786) by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner and Xiaohua Zhai.

The model comes in two variants:

1) FixRes - works with fixed-resolution images (backward compatible with SigLIP v1)
2) NaFlex - works with variable image aspect ratios and resolutions (SigLIP2 in `transformers`)
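Both variants load through the same Auto classes. A minimal sketch (the checkpoint names below are examples from the SigLIP2 collection):

```py
from transformers import AutoModel

# FixRes: fixed-resolution checkpoint, SigLIP-compatible architecture
fixres_model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")

# NaFlex: supports variable aspect ratios and resolutions
naflex_model = AutoModel.from_pretrained("google/siglip2-base-patch16-naflex")
```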
The abstract from the paper is the following:

*We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe—this includes decoder-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input’s native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To provide users with the ability to trade-off inference cost with performance, we release model checkpoints at four sizes (ViT-B/86M, L/303M, So400m/400M, and g/1B).*
## Usage tips
- Usage of SigLIP2 is similar to [SigLIP](siglip) and [CLIP](clip). The main difference from CLIP is the training loss, which does not require a global view of all the pairwise similarities of images and texts within a batch. One needs to apply the sigmoid activation function to the logits, rather than the softmax.
- Training is supported but does not use `torch.distributed` utilities, which may limit the scalability of batch size. However, DDP and FSDP work on single-node multi-GPU setups.
- When using the standalone [`GemmaTokenizerFast`] make sure to pass `padding="max_length"` and `max_length=64` as that's how the model was trained.
- The model was trained with *lowercased* text, so make sure to apply the same preprocessing to your text labels.
- To get the same results as the pipeline, a prompt template of "this is a photo of {label}" should be used.
- The NaFlex variant supports processing images at higher resolutions by adjusting the `max_num_patches` parameter in the `Processor`. The default value is `max_num_patches=256`. Increasing `max_num_patches` to 1024 (4x) will approximately double the processed image height and width while preserving the aspect ratio, as shown in the sketch after this list.
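The sketch below illustrates the `max_num_patches` knob on a NaFlex checkpoint. The checkpoint name is an example, and it assumes the processor forwards `max_num_patches` to the NaFlex image processor:

```py
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

checkpoint = "google/siglip2-base-patch16-naflex"  # example NaFlex checkpoint
model = AutoModel.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["2 cats", "2 dogs"]
texts = [f"this is a photo of {label}" for label in candidate_labels]

# default is max_num_patches=256; 1024 roughly doubles the processed
# height and width while keeping the native aspect ratio
inputs = processor(
    text=texts,
    images=image,
    max_num_patches=1024,
    padding="max_length",
    max_length=64,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**inputs)

probs = torch.sigmoid(outputs.logits_per_image)  # sigmoid, not softmax
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
```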
This model was contributed by [qubvel](https://huggingface.co/qubvel-hf).
The original code can be found [here](https://github.com/google-research/big_vision/tree/main).
SigLIP2 builds on the [SigLIP](./siglip) training recipe with decoder-based pretraining, self-distillation, and masked prediction, which improve dense prediction tasks (segmentation, depth estimation, etc.). As noted above, the model is available in two variants:

- NaFlex supports different resolutions and maintains the native image aspect ratio
- FixRes supports fixed resolutions and is backwards compatible with [SigLIP](./siglip)

You can find all the original SigLIP2 checkpoints under the [SigLIP2](https://huggingface.co/collections/google/siglip2-67b5dcef38c175486e240107) collection.

> [!TIP]
> Click on the SigLIP2 models in the right sidebar for more examples of how to apply SigLIP2 to different image and text tasks.

## Usage example

There are two main ways to use SigLIP2: with the [`Pipeline`] API, which abstracts away all the complexity for you, or by using the [`AutoModel`] or `Siglip2Model` class yourself. The examples below demonstrate zero-shot classification with both.

### FixRes variant

**Pipeline API**

The pipeline allows you to use the model in a few lines of code:
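A minimal sketch with the zero-shot image classification pipeline (the checkpoint name is an example; any SigLIP2 FixRes checkpoint from the collection works):

```py
from transformers import pipeline

image_classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip2-base-patch16-224",  # example FixRes checkpoint
)

outputs = image_classifier(
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    candidate_labels=["2 cats", "2 dogs"],
)
print(outputs)
```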
**Using the model yourself**

If you want to do the pre- and post-processing yourself, use the [`AutoModel`] and [`AutoProcessor`] classes. The checkpoint name below is an example; any SigLIP2 FixRes checkpoint works the same way:

```py
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

checkpoint = "google/siglip2-base-patch16-224"  # example FixRes checkpoint
model = AutoModel.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["2 cats", "2 dogs"]
texts = [f"this is a photo of {label}" for label in candidate_labels]
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)  # apply the sigmoid, not the softmax
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
```

## Resources

- Demo notebook for SigLIP2 can be found [here](https://github.com/qubvel/transformers-notebooks/tree/master/notebooks/SigLIP2_inference.ipynb). 🌎

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
## Combining SigLIP2 and Flash Attention 2

First, make sure to install the latest version of Flash Attention 2.

```bash
pip install -U flash-attn --no-build-isolation
```

Also make sure that you have hardware that is compatible with Flash Attention 2. Read more about it in the official documentation of the flash-attn repository. Also make sure to load your model in half-precision (e.g. `torch.float16`).

To load and run a model with Flash Attention 2, pass `attn_implementation="flash_attention_2"` to [`~PreTrainedModel.from_pretrained`]. The checkpoint name below is an example, and the inputs are prepared with the processor exactly as in the examples above:

```py
>>> import torch
>>> from transformers import AutoModel

>>> model = AutoModel.from_pretrained(
...     "google/siglip2-base-patch16-224",
...     attn_implementation="flash_attention_2",
...     torch_dtype=torch.float16,
...     device_map="cuda",
... )

>>> # prepare `inputs` and `candidate_labels` with the processor as shown above,
>>> # then move the inputs to the GPU
>>> inputs = inputs.to("cuda")
>>> with torch.no_grad():
...     with torch.autocast("cuda"):
...         outputs = model(**inputs)

>>> logits_per_image = outputs.logits_per_image
>>> probs = torch.sigmoid(logits_per_image)  # these are the probabilities
>>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
19.8% that image 0 is '2 cats'
```

## Quantization

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to int4. The checkpoint name is an example; quantization pays off most for the larger So400m and g checkpoints:

```py
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel, BitsAndBytesConfig

checkpoint = "google/siglip2-so400m-patch14-384"  # example checkpoint
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModel.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["2 cats", "2 dogs"]
texts = [f"this is a photo of {label}" for label in candidate_labels]
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt").to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)  # match the model dtype

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
```
## Notes
- Training is supported for DDP and FSDP on single-node multi-GPU setups. However, it does not use [torch.distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) utilities, which may limit the scalability of batch size.
- When using the standalone [`GemmaTokenizerFast`] make sure to pass `padding="max_length"` and `max_length=64` as that's how the model was trained.
- The model was trained with *lowercased* text, so make sure your text labels are preprocessed the same way.
- To get the same results as the [`Pipeline`], a prompt template of `"This is a photo of {label}."` should be passed to the processor.
- The NaFlex variant processes different types of images at the appropriate resolution (using a larger resolution to process document images, for example), while also minimizing the impact of aspect ratio distortion for certain inference tasks like OCR.

  NaFlex resizes the input image so the height and width are multiples of the patch size after resizing. It keeps the aspect ratio distortion as low as possible and produces a sequence length of at most the desired target sequence length (`max_num_patches`). After resizing, the image is split into a sequence of patches and a mask with padding information is added.

- Toggle the `attn_implementation` parameter to either `"sdpa"` or `"flash_attention_2"` to use a more memory-efficient attention, as in the sketch below.
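  A minimal sketch (the checkpoint name is an example):

  ```py
  import torch
  from transformers import AutoModel

  model = AutoModel.from_pretrained(
      "google/siglip2-base-patch16-224",
      attn_implementation="sdpa",  # or "flash_attention_2"
      torch_dtype=torch.float16,
  )
  ```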