
Commit 9a85105

mreraser and stevhliu authored
Updated the model card for ViTMAE (#38302)
* Update vit_mae.md
* badge float:right
* Update docs/source/en/model_doc/vit_mae.md
  Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/model_doc/vit_mae.md
  Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/model_doc/vit_mae.md
  Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/model_doc/vit_mae.md
  Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/model_doc/vit_mae.md
  Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/model_doc/vit_mae.md
  Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/model_doc/vit_mae.md
  Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/model_doc/vit_mae.md
  Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/model_doc/vit_mae.md
  Co-authored-by: Steven Liu <[email protected]>
* Update model_doc/vit_mae.md
* fix

---------

Co-authored-by: Steven Liu <[email protected]>
1 parent c9fcbd5 commit 9a85105

File tree

1 file changed: +36 −60 lines changed


docs/source/en/model_doc/vit_mae.md

Lines changed: 36 additions & 60 deletions
@@ -14,87 +14,63 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# ViTMAE
 
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
-<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
-<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
+        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
+        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
 </div>
 
-## Overview
-
-The ViTMAE model was proposed in [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377v2) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li,
-Piotr Dollár, Ross Girshick. The paper shows that, by pre-training a Vision Transformer (ViT) to reconstruct pixel values for masked patches, one can get results after
-fine-tuning that outperform supervised pre-training.
-
-The abstract from the paper is the following:
+# ViTMAE
 
-*This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the
-input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates
-only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask
-tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs
-enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity
-models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream
-tasks outperforms supervised pre-training and shows promising scaling behavior.*
+[ViTMAE](https://huggingface.co/papers/2111.06377) is a self-supervised vision model that is pretrained by masking large portions of an image (~75%). An encoder processes the visible image patches and a decoder reconstructs the missing pixels from the encoded patches and mask tokens. After pretraining, the encoder can be reused for downstream tasks like image classification or object detection — often outperforming models trained with supervised learning.
 
 <img src="https://user-images.githubusercontent.com/11435359/146857310-f258c86c-fde6-48e8-9cee-badd2b21bd2c.png"
 alt="drawing" width="600"/>
 
-<small> MAE architecture. Taken from the <a href="https://arxiv.org/abs/2111.06377">original paper.</a> </small>
+You can find all the original ViTMAE checkpoints under the [AI at Meta](https://huggingface.co/facebook?search_models=vit-mae) organization.
 
-This model was contributed by [nielsr](https://huggingface.co/nielsr). TensorFlow version of the model was contributed by [sayakpaul](https://github.com/sayakpaul) and
-[ariG23498](https://github.com/ariG23498) (equal contribution). The original code can be found [here](https://github.com/facebookresearch/mae).
+> [!TIP]
+> Click on the ViTMAE models in the right sidebar for more examples of how to apply ViTMAE to vision tasks.
 
-## Usage tips
+The example below demonstrates how to reconstruct the missing pixels with the [`ViTMAEForPreTraining`] class.
 
-- MAE (masked auto encoding) is a method for self-supervised pre-training of Vision Transformers (ViTs). The pre-training objective is relatively simple:
-by masking a large portion (75%) of the image patches, the model must reconstruct raw pixel values. One can use [`ViTMAEForPreTraining`] for this purpose.
-- After pre-training, one "throws away" the decoder used to reconstruct pixels, and one uses the encoder for fine-tuning/linear probing. This means that after
-fine-tuning, one can directly plug in the weights into a [`ViTForImageClassification`].
-- One can use [`ViTImageProcessor`] to prepare images for the model. See the code examples for more info.
-- Note that the encoder of MAE is only used to encode the visual patches. The encoded patches are then concatenated with mask tokens, which the decoder (which also
-consists of Transformer blocks) takes as input. Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted. Fixed
-sin/cos position embeddings are added both to the input of the encoder and the decoder.
-- For a visual understanding of how MAEs work you can check out this [post](https://keras.io/examples/vision/masked_image_modeling/).
+<hfoptions id="usage">
+<hfoption id="AutoModel">
 
-### Using Scaled Dot Product Attention (SDPA)
+```python
+import torch
+import requests
+from PIL import Image
+from transformers import ViTImageProcessor, ViTMAEForPreTraining
 
-PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
-encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
-[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
-or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
-page for more information.
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
+image = Image.open(requests.get(url, stream=True).raw)
 
-SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
-`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
+processor = ViTImageProcessor.from_pretrained("facebook/vit-mae-base")
+inputs = processor(image, return_tensors="pt")
+inputs = {k: v.to("cuda") for k, v in inputs.items()}
 
-```
-from transformers import ViTMAEModel
-model = ViTMAEModel.from_pretrained("facebook/vit-mae-base", attn_implementation="sdpa", torch_dtype=torch.float16)
-...
-```
+model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base", attn_implementation="sdpa").to("cuda")
+with torch.no_grad():
+    outputs = model(**inputs)
 
-For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).
+reconstruction = outputs.logits
+```
 
-On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with `float32` and `facebook/vit-mae-base` model, we saw the following speedups during inference.
+</hfoption>
+</hfoptions>
 
-| Batch size | Average inference time (ms), eager mode | Average inference time (ms), sdpa model | Speed up, Sdpa / Eager (x) |
-|--------------|-------------------------------------------|-------------------------------------------|------------------------------|
-| 1 | 11 | 6 | 1.83 |
-| 2 | 8 | 6 | 1.33 |
-| 4 | 8 | 6 | 1.33 |
-| 8 | 8 | 6 | 1.33 |
+## Notes
+- ViTMAE is typically used in two stages. Self-supervised pretraining with [`ViTMAEForPreTraining`], and then discarding the decoder and fine-tuning the encoder. After fine-tuning, the weights can be plugged into a model like [`ViTForImageClassification`].
+- Use [`ViTImageProcessor`] for input preparation.
 
 ## Resources
 
-A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ViTMAE.
-
-- [`ViTMAEForPreTraining`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining), allowing you to pre-train the model from scratch/further pre-train the model on custom data.
-- A notebook that illustrates how to visualize reconstructed pixel values with [`ViTMAEForPreTraining`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViTMAE/ViT_MAE_visualization_demo.ipynb).
-
-If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
+- Refer to this [notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViTMAE/ViT_MAE_visualization_demo.ipynb) to learn how to visualize the reconstructed pixels from [`ViTMAEForPreTraining`].
 
 ## ViTMAEConfig
 
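The Notes section of the updated card states that after pretraining the decoder is discarded and the encoder weights can be plugged into [`ViTForImageClassification`]. As a rough illustration of that workflow (not part of the committed file, and using a hypothetical label count), a minimal sketch might look like this:

```python
# Minimal sketch of the two-stage workflow described in the Notes: reuse the
# self-supervised ViTMAE encoder for image classification and fine-tune it.
# Assumption: the encoder weights of facebook/vit-mae-base load into the ViT
# backbone; the classification head is newly initialized and is what
# fine-tuning trains.
import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "facebook/vit-mae-base",
    num_labels=2,  # hypothetical number of classes for the downstream task
)

# Dummy forward pass to check shapes; real fine-tuning would feed images
# prepared with ViTImageProcessor (plus labels) through Trainer or a custom loop.
pixel_values = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(pixel_values=pixel_values).logits
print(logits.shape)  # torch.Size([1, 2])
```

Only the encoder weights carry over here; the pretraining decoder is discarded, matching the "throws away the decoder" guidance in the previous version of the card.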

0 commit comments
