
Commit 1aebd71

lyttonhao authored and facebook-github-bot committed
Release MViTv2 model & README
Summary: Add MViTv2 models and configs

Reviewed By: feichtenhofer

Differential Revision: D36884896

fbshipit-source-id: 6e185bf095d424fc9b9eb018cb55654980945982
1 parent 6069ea9 commit 1aebd71

23 files changed: 800 additions & 168 deletions

MODEL_ZOO.md

Lines changed: 5 additions & 3 deletions
@@ -11,9 +11,11 @@
 | Slow | R50 | 3 x 10 | 8 x 8 | 74.8 | 91.6 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWONLY_8x8_R50.pkl) | Kinetics/c2/SLOW_8x8_R50 | K400 |
 | SlowFast | R50 | 3 x 10 | 4 x 16 | 75.6 | 92.0 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWFAST_4x16_R50.pkl) | Kinetics/c2/SLOWFAST_4x16_R50 | K400 |
 | SlowFast | R50 | 3 x 10 | 8 x 8 | 77.0 | 92.6 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWFAST_8x8_R50.pkl) | Kinetics/c2/SLOWFAST_8x8_R50 | K400 |
-| MViT | B-Conv | 1 x 5 | 16 x 4 | 78.4 | 93.5 | [`link`](https://drive.google.com/file/d/194gJinVejq6A1FmySNKQ8vAN5-FOY-QL/view?usp=sharing) | Kinetics/MVIT_B_16x4_CONV | K400 |
-| MViT | B-Conv | 1 x 5 | 32 x 3 | 80.4 | 94.8 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvit/k400.pyth) | Kinetics/MVIT_B_32x3_CONV | K600 |
-| MViT | B-Conv | 1 x 5 | 32 x 3 | 83.9 | 96.5 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvit/k600.pyth) | Kinetics/MVIT_B_32x3_CONV_K600 | K600 |
+| MViTv1 | B-Conv | 1 x 5 | 16 x 4 | 78.4 | 93.5 | [`link`](https://drive.google.com/file/d/194gJinVejq6A1FmySNKQ8vAN5-FOY-QL/view?usp=sharing) | Kinetics/MVIT_B_16x4_CONV | K400 |
+| MViTv1 | B-Conv | 1 x 5 | 32 x 3 | 80.4 | 94.8 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvit/k400.pyth) | Kinetics/MVIT_B_32x3_CONV | K400 |
+| MViTv1 | B-Conv | 1 x 5 | 32 x 3 | 83.9 | 96.5 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvit/k600.pyth) | Kinetics/MVIT_B_32x3_CONV_K600 | K600 |
+| MViTv2 | S | 1 x 5 | 16 x 4 | 81.0 | 94.6 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_S_16x4_k400_f302660347.pyth) | Kinetics/MVITv2_S_16x4 | K400 |
+| MViTv2 | B | 1 x 5 | 32 x 3 | 82.9 | 95.7 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_B_32x3_k400_f304025456.pyth) | Kinetics/MVITv2_B_32x3 | K400 |

 ## X3D models (details in projects/x3d)
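The `.pyth` files linked above are ordinary PyTorch checkpoints. As a minimal sketch of how to look inside one (the `model_state` key is an assumption based on typical PySlowFast checkpoints; print `ckpt.keys()` to confirm on your copy):

```python
# Hypothetical inspection of a downloaded checkpoint; the "model_state" key
# is an assumption -- check ckpt.keys() on your file. On newer torch versions
# you may need torch.load(..., weights_only=False) for pickled config objects.
import torch

ckpt = torch.load("MViTv2_S_16x4_k400_f302660347.pyth", map_location="cpu")
print(sorted(ckpt.keys()))
state = ckpt.get("model_state", ckpt)   # fall back to the raw dict if absent
print(f"{len(state)} entries in the state dict")
```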

README.md

Lines changed: 2 additions & 0 deletions
@@ -22,8 +22,10 @@ The goal of PySlowFast is to provide a high-performance, light-weight pytorch co
 - I3D
 - Non-local Network
 - X3D
+- MViTv1 and MViTv2

 ## Updates
+- We now support [MViTv2](https://arxiv.org/abs/2112.01526) in PySlowFast. See [`projects/mvitv2`](./projects/mvitv2/README.md) for more information.
 - We now support [Multiscale Vision Transformers](https://arxiv.org/abs/2104.11227) on Kinetics and ImageNet. See [`projects/mvit`](./projects/mvit/README.md) for more information.
 - We now support [PyTorchVideo](https://github.com/facebookresearch/pytorchvideo) models and datasets. See [`projects/pytorchvideo`](./projects/pytorchvideo/README.md) for more information.
 - We now support [X3D Models](https://arxiv.org/abs/2004.04730). See [`projects/x3d`](./projects/x3d/README.md) for more information.

configs/Kinetics/MVITv2_B_32x3.yaml

Lines changed: 95 additions & 0 deletions
TRAIN:
  ENABLE: True
  DATASET: kinetics
  BATCH_SIZE: 16
  EVAL_PERIOD: 10
  CHECKPOINT_PERIOD: 10
  AUTO_RESUME: True
DATA:
  USE_OFFSET_SAMPLING: True
  DECODING_BACKEND: torchvision
  NUM_FRAMES: 32
  SAMPLING_RATE: 3
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  TEST_CROP_SIZE: 224
  INPUT_CHANNEL_NUM: [3]
  # PATH_TO_DATA_DIR: path-to-k400-dir
  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]
  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]
MVIT:
  ZERO_DECAY_POS_CLS: False
  USE_ABS_POS: False
  REL_POS_SPATIAL: True
  REL_POS_TEMPORAL: True
  DEPTH: 24
  NUM_HEADS: 1
  EMBED_DIM: 96
  PATCH_KERNEL: (3, 7, 7)
  PATCH_STRIDE: (2, 4, 4)
  PATCH_PADDING: (1, 3, 3)
  MLP_RATIO: 4.0
  QKV_BIAS: True
  DROPPATH_RATE: 0.3
  NORM: "layernorm"
  MODE: "conv"
  CLS_EMBED_ON: True
  DIM_MUL: [[2, 2.0], [5, 2.0], [21, 2.0]]
  HEAD_MUL: [[2, 2.0], [5, 2.0], [21, 2.0]]
  POOL_KVQ_KERNEL: [3, 3, 3]
  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]
  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 1, 1], [2, 1, 2, 2], [3, 1, 1, 1], [4, 1, 1, 1], [5, 1, 2, 2], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 1, 1], [15, 1, 1, 1], [16, 1, 1, 1], [17, 1, 1, 1], [18, 1, 1, 1], [19, 1, 1, 1], [20, 1, 1, 1], [21, 1, 2, 2], [22, 1, 1, 1], [23, 1, 1, 1]]
  DROPOUT_RATE: 0.0
  DIM_MUL_IN_ATT: True
  RESIDUAL_POOLING: True
AUG:
  NUM_SAMPLE: 2
  ENABLE: True
  COLOR_JITTER: 0.4
  AA_TYPE: rand-m7-n4-mstd0.5-inc1
  INTERPOLATION: bicubic
  RE_PROB: 0.25
  RE_MODE: pixel
  RE_COUNT: 1
  RE_SPLIT: False
MIXUP:
  ENABLE: True
  ALPHA: 0.8
  CUTMIX_ALPHA: 1.0
  PROB: 1.0
  SWITCH_PROB: 0.5
  LABEL_SMOOTH_VALUE: 0.1
SOLVER:
  ZERO_WD_1D_PARAM: True
  BASE_LR_SCALE_NUM_SHARDS: True
  CLIP_GRAD_L2NORM: 1.0
  BASE_LR: 0.0001
  COSINE_AFTER_WARMUP: True
  COSINE_END_LR: 1e-6
  WARMUP_START_LR: 1e-6
  WARMUP_EPOCHS: 30.0
  LR_POLICY: cosine
  MAX_EPOCH: 200
  MOMENTUM: 0.9
  WEIGHT_DECAY: 0.05
  OPTIMIZING_METHOD: adamw
MODEL:
  NUM_CLASSES: 400
  ARCH: mvit
  MODEL_NAME: MViT
  LOSS_FUNC: soft_cross_entropy
  DROPOUT_RATE: 0.5
TEST:
  ENABLE: True
  DATASET: kinetics
  BATCH_SIZE: 64
  NUM_SPATIAL_CROPS: 1
  NUM_ENSEMBLE_VIEWS: 5
DATA_LOADER:
  NUM_WORKERS: 8
  PIN_MEMORY: True
NUM_GPUS: 8
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: .
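A note on how the MVIT block above composes: `DIM_MUL`/`HEAD_MUL` entries are `[layer, factor]` and `POOL_Q_STRIDE` entries are `[layer, stride_t, stride_h, stride_w]`, so channel width, head count, and token resolution can be traced through the 24 blocks. A minimal sketch of that bookkeeping (pure Python, independent of PySlowFast):

```python
# Trace width / heads / spatial resolution through the MViTv2-B config above.
# Assumes DIM_MUL and HEAD_MUL are [layer, factor] and POOL_Q_STRIDE is
# [layer, stride_t, stride_h, stride_w], matching the fields in the config.
crop, patch_stride = 224, 4             # TRAIN_CROP_SIZE / spatial PATCH_STRIDE
dim, heads = 96, 1                      # EMBED_DIM / NUM_HEADS
scale_layers = {2: 2, 5: 2, 21: 2}      # layers where DIM_MUL and q-stride > 1

res = crop // patch_stride              # 56x56 tokens after the patch stem
for layer in range(24):                 # DEPTH: 24
    factor = scale_layers.get(layer, 1)
    dim *= factor                       # channel width doubles at each stage
    heads *= factor                     # head count doubles in lockstep
    res //= factor                      # q-pooling halves spatial resolution
print(dim, heads, res)                  # -> 768 8 7
```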
configs/Kinetics/MVITv2_L_40x3_test.yaml

Lines changed: 83 additions & 0 deletions
TRAIN:
  ENABLE: False
DATA:
  USE_OFFSET_SAMPLING: True
  DECODING_BACKEND: torchvision
  NUM_FRAMES: 40
  SAMPLING_RATE: 3
  TRAIN_JITTER_SCALES: [356, 446]
  TRAIN_CROP_SIZE: 312
  TEST_CROP_SIZE: 312
  INPUT_CHANNEL_NUM: [3]
  # PATH_TO_DATA_DIR: path-to-k400-dir
  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]
  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]
  MEAN: [0.485, 0.456, 0.406]
  STD: [0.229, 0.224, 0.225]
MVIT:
  ZERO_DECAY_POS_CLS: False
  USE_ABS_POS: False
  REL_POS_SPATIAL: True
  REL_POS_TEMPORAL: True
  DEPTH: 48
  NUM_HEADS: 2
  EMBED_DIM: 144
  PATCH_KERNEL: (3, 7, 7)
  PATCH_STRIDE: (2, 4, 4)
  PATCH_PADDING: (1, 3, 3)
  MLP_RATIO: 4.0
  QKV_BIAS: True
  DROPPATH_RATE: 0.75
  NORM: "layernorm"
  MODE: "conv"
  CLS_EMBED_ON: True
  DIM_MUL: [[2, 2.0], [8, 2.0], [44, 2.0]]
  HEAD_MUL: [[2, 2.0], [8, 2.0], [44, 2.0]]
  POOL_KVQ_KERNEL: [3, 3, 3]
  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]
  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 1, 1], [2, 1, 2, 2], [3, 1, 1, 1], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 2, 2], [9, 1, 1, 1], [10, 1, 1, 1],
                 [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 1, 1], [15, 1, 1, 1], [16, 1, 1, 1], [17, 1, 1, 1], [18, 1, 1, 1], [19, 1, 1, 1], [20, 1, 1, 1],
                 [21, 1, 1, 1], [22, 1, 1, 1], [23, 1, 1, 1], [24, 1, 1, 1], [25, 1, 1, 1], [26, 1, 1, 1], [27, 1, 1, 1], [28, 1, 1, 1], [29, 1, 1, 1], [30, 1, 1, 1],
                 [31, 1, 1, 1], [32, 1, 1, 1], [33, 1, 1, 1], [34, 1, 1, 1], [35, 1, 1, 1], [36, 1, 1, 1], [37, 1, 1, 1], [38, 1, 1, 1], [39, 1, 1, 1], [40, 1, 1, 1],
                 [41, 1, 1, 1], [42, 1, 1, 1], [43, 1, 1, 1], [44, 1, 2, 2], [45, 1, 1, 1], [46, 1, 1, 1], [47, 1, 1, 1]]
  DROPOUT_RATE: 0.0
  DIM_MUL_IN_ATT: True
  RESIDUAL_POOLING: True
AUG:
  # NUM_SAMPLE: 2
  ENABLE: True
  COLOR_JITTER: 0.4
  AA_TYPE: rand-m7-n4-mstd0.5-inc1
  INTERPOLATION: bicubic
  RE_PROB: 0.25
  RE_MODE: pixel
  RE_COUNT: 1
  RE_SPLIT: False
MIXUP:
  ENABLE: True
  ALPHA: 0.8
  CUTMIX_ALPHA: 1.0
  PROB: 0.0
  SWITCH_PROB: 0.5
  LABEL_SMOOTH_VALUE: 0.1
MODEL:
  NUM_CLASSES: 400
  ARCH: mvit
  MODEL_NAME: MViT
  LOSS_FUNC: soft_cross_entropy
  DROPOUT_RATE: 0.5
  ACT_CHECKPOINT: True
TEST:
  ENABLE: True
  DATASET: kinetics
  BATCH_SIZE: 8
  NUM_SPATIAL_CROPS: 3
  NUM_ENSEMBLE_VIEWS: 5
  # CHECKPOINT_FILE_PATH: # download pre-trained model
DATA_LOADER:
  NUM_WORKERS: 8
  PIN_MEMORY: True
NUM_GPUS: 8
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: .
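Since `TRAIN.ENABLE` is False, this config only runs evaluation: `TEST.NUM_ENSEMBLE_VIEWS: 5` temporal clips times `TEST.NUM_SPATIAL_CROPS: 3` spatial crops gives 15 scored views per video, matching the "2828 x 3 x 5" FLOPs entry for MViTv2-L in the project README. A minimal sketch of the view-averaging idea (`model` and `views` are placeholders here, not PySlowFast API):

```python
# Hypothetical multi-view ensembling: score each clip/crop view and average
# the class probabilities. PySlowFast performs this aggregation internally.
import torch

def multi_view_score(model: torch.nn.Module, views: list) -> torch.Tensor:
    """views: list of input tensors, one per temporal clip x spatial crop."""
    probs = [torch.softmax(model(v), dim=-1) for v in views]  # score per view
    return torch.stack(probs).mean(dim=0)                     # average over views
```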

configs/Kinetics/MVITv2_S_16x4.yaml

Lines changed: 95 additions & 0 deletions
TRAIN:
  ENABLE: True
  DATASET: kinetics
  BATCH_SIZE: 16
  EVAL_PERIOD: 10
  CHECKPOINT_PERIOD: 10
  AUTO_RESUME: True
DATA:
  USE_OFFSET_SAMPLING: True
  DECODING_BACKEND: torchvision
  NUM_FRAMES: 16
  SAMPLING_RATE: 4
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  TEST_CROP_SIZE: 224
  INPUT_CHANNEL_NUM: [3]
  # PATH_TO_DATA_DIR: path-to-k400-dir
  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]
  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]
MVIT:
  ZERO_DECAY_POS_CLS: False
  USE_ABS_POS: False
  REL_POS_SPATIAL: True
  REL_POS_TEMPORAL: True
  DEPTH: 16
  NUM_HEADS: 1
  EMBED_DIM: 96
  PATCH_KERNEL: (3, 7, 7)
  PATCH_STRIDE: (2, 4, 4)
  PATCH_PADDING: (1, 3, 3)
  MLP_RATIO: 4.0
  QKV_BIAS: True
  DROPPATH_RATE: 0.2
  NORM: "layernorm"
  MODE: "conv"
  CLS_EMBED_ON: True
  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]
  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]
  POOL_KVQ_KERNEL: [3, 3, 3]
  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]
  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 2, 2], [2, 1, 1, 1], [3, 1, 2, 2], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 2, 2], [15, 1, 1, 1]]
  DROPOUT_RATE: 0.0
  DIM_MUL_IN_ATT: True
  RESIDUAL_POOLING: True
AUG:
  NUM_SAMPLE: 2
  ENABLE: True
  COLOR_JITTER: 0.4
  AA_TYPE: rand-m7-n4-mstd0.5-inc1
  INTERPOLATION: bicubic
  RE_PROB: 0.25
  RE_MODE: pixel
  RE_COUNT: 1
  RE_SPLIT: False
MIXUP:
  ENABLE: True
  ALPHA: 0.8
  CUTMIX_ALPHA: 1.0
  PROB: 1.0
  SWITCH_PROB: 0.5
  LABEL_SMOOTH_VALUE: 0.1
SOLVER:
  ZERO_WD_1D_PARAM: True
  BASE_LR_SCALE_NUM_SHARDS: True
  CLIP_GRAD_L2NORM: 1.0
  BASE_LR: 0.0001
  COSINE_AFTER_WARMUP: True
  COSINE_END_LR: 1e-6
  WARMUP_START_LR: 1e-6
  WARMUP_EPOCHS: 30.0
  LR_POLICY: cosine
  MAX_EPOCH: 200
  MOMENTUM: 0.9
  WEIGHT_DECAY: 0.05
  OPTIMIZING_METHOD: adamw
MODEL:
  NUM_CLASSES: 400
  ARCH: mvit
  MODEL_NAME: MViT
  LOSS_FUNC: soft_cross_entropy
  DROPOUT_RATE: 0.5
TEST:
  ENABLE: True
  DATASET: kinetics
  BATCH_SIZE: 64
  NUM_SPATIAL_CROPS: 1
  NUM_ENSEMBLE_VIEWS: 5
DATA_LOADER:
  NUM_WORKERS: 8
  PIN_MEMORY: True
NUM_GPUS: 8
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: .
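The SOLVER block encodes a linear warmup followed by cosine decay (`COSINE_AFTER_WARMUP: True`). A minimal sketch of the schedule these numbers imply, assuming the standard warmup-then-cosine recipe (PySlowFast's exact implementation may differ in details):

```python
import math

# Warmup + cosine schedule from the SOLVER values above: BASE_LR=1e-4,
# WARMUP_START_LR=1e-6, WARMUP_EPOCHS=30, COSINE_END_LR=1e-6, MAX_EPOCH=200.
def lr_at(epoch: float, base=1e-4, warm_start=1e-6, warm=30.0,
          end=1e-6, max_epoch=200.0) -> float:
    if epoch < warm:                                  # linear warmup
        return warm_start + (base - warm_start) * epoch / warm
    t = (epoch - warm) / (max_epoch - warm)           # cosine after warmup
    return end + (base - end) * 0.5 * (1.0 + math.cos(math.pi * t))

print(lr_at(0), lr_at(30), lr_at(200))                # 1e-06  1e-04  1e-06
```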

projects/mvitv2/README.md

Lines changed: 59 additions & 0 deletions
# [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](https://arxiv.org/abs/2112.01526)

Official PyTorch implementation of **MViTv2**, from the following paper:

[MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](https://arxiv.org/abs/2112.01526). CVPR 2022.\
Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*

---

MViT is a multiscale transformer that serves as a general vision backbone for different visual recognition tasks. PySlowFast supports MViTv2 for video action recognition and detection. For other tasks, please check:

> **Image Classification**: See [MViTv2 for image classification](https://github.com/facebookresearch/mvit).

> **Object Detection and Instance Segmentation**: See [MViTv2 in Detectron2](https://github.com/facebookresearch/detectron2/tree/main/projects/MViTv2).

<div align="center">
<img src="mvitv2.png" width="500px" />
</div>
<br/>

## Results

### Kinetics-400

| name | frame length x sample rate | top1 | top5 | Flops (G) x views | #params (M) | model | config |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| MViTv2-S | 16 x 4 | 81.0 | 94.6 | 64 x 1 x 5 | 34.5 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_S_16x4_k400_f302660347.pyth) | Kinetics/MVITv2_S_16x4 |
| MViTv2-B | 32 x 3 | 82.9 | 95.7 | 225 x 1 x 5 | 51.2 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_B_32x3_k400_f304025456.pyth) | Kinetics/MVITv2_B_32x3 |
| MViTv2-L | 40 x 3 | 86.1 | 97.0 | 2828 x 3 x 5 | 217.6 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_L_40x3_k400_f306903192.pyth) | Kinetics/MVITv2_L_40x3_test |

## Get started

You can train a standard MViTv2 model from scratch with:

```
python tools/run_net.py \
  --cfg configs/Kinetics/MVITv2_S_16x4.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset
```
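To evaluate a released checkpoint instead of training, point `TEST.CHECKPOINT_FILE_PATH` at the downloaded `.pyth` file (see the commented-out field in the test config above) and set `TRAIN.ENABLE False` as a command-line override.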

## Citing MViTv2

If you find this repository helpful, please consider citing:
```
@inproceedings{li2021improved,
  title={MViTv2: Improved multiscale vision transformers for classification and detection},
  author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={CVPR},
  year={2022}
}

@inproceedings{fan2021multiscale,
  title={Multiscale vision transformers},
  author={Fan, Haoqi and Xiong, Bo and Mangalam, Karttikeya and Li, Yanghao and Yan, Zhicheng and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={ICCV},
  year={2021}
}
```

projects/mvitv2/mvitv2.png

Binary file added (516 KB)
