
Commit 1aebd71

lyttonhao authored and facebook-github-bot committed
Release MViTv2 model & README
Summary: Add MViTv2 models and configs

Reviewed By: feichtenhofer

Differential Revision: D36884896

fbshipit-source-id: 6e185bf095d424fc9b9eb018cb55654980945982
1 parent 6069ea9 commit 1aebd71

23 files changed: 800 additions & 168 deletions

MODEL_ZOO.md

Lines changed: 5 additions & 3 deletions
@@ -11,9 +11,11 @@
 | Slow | R50 | 3 x 10 | 8 x 8 | 74.8 | 91.6 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWONLY_8x8_R50.pkl) | Kinetics/c2/SLOW_8x8_R50 | K400 |
 | SlowFast | R50 | 3 x 10 | 4 x 16 | 75.6 | 92.0 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWFAST_4x16_R50.pkl) | Kinetics/c2/SLOWFAST_4x16_R50 | K400 |
 | SlowFast | R50 | 3 x 10 | 8 x 8 | 77.0 | 92.6 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWFAST_8x8_R50.pkl) | Kinetics/c2/SLOWFAST_8x8_R50 | K400 |
-| MViT | B-Conv | 1 x 5 | 16 x 4 | 78.4 | 93.5 | [`link`](https://drive.google.com/file/d/194gJinVejq6A1FmySNKQ8vAN5-FOY-QL/view?usp=sharing) | Kinetics/MVIT_B_16x4_CONV | K400 |
-| MViT | B-Conv | 1 x 5 | 32 x 3 | 80.4 | 94.8 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvit/k400.pyth) | Kinetics/MVIT_B_32x3_CONV | K600 |
-| MViT | B-Conv | 1 x 5 | 32 x 3 | 83.9 | 96.5 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvit/k600.pyth) | Kinetics/MVIT_B_32x3_CONV_K600 | K600 |
+| MViTv1 | B-Conv | 1 x 5 | 16 x 4 | 78.4 | 93.5 | [`link`](https://drive.google.com/file/d/194gJinVejq6A1FmySNKQ8vAN5-FOY-QL/view?usp=sharing) | Kinetics/MVIT_B_16x4_CONV | K400 |
+| MViTv1 | B-Conv | 1 x 5 | 32 x 3 | 80.4 | 94.8 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvit/k400.pyth) | Kinetics/MVIT_B_32x3_CONV | K400 |
+| MViTv1 | B-Conv | 1 x 5 | 32 x 3 | 83.9 | 96.5 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvit/k600.pyth) | Kinetics/MVIT_B_32x3_CONV_K600 | K600 |
+| MViTv2 | S | 1 x 5 | 16 x 4 | 81.0 | 94.6 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_S_16x4_k400_f302660347.pyth) | Kinetics/MVITv2_S_16x4 | K400 |
+| MViTv2 | B | 1 x 5 | 32 x 3 | 82.9 | 95.7 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_B_32x3_k400_f304025456.pyth) | Kinetics/MVITv2_B_32x3 | K400 |

 ## X3D models (details in projects/x3d)
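The `.pyth` files linked above are ordinary PyTorch checkpoints. As a minimal sketch of how to look inside one (the `model_state` key is an assumption based on typical PySlowFast checkpoints; print `ckpt.keys()` to confirm on your copy):

```python
# Hypothetical inspection of a downloaded checkpoint; the "model_state" key
# is an assumption -- check ckpt.keys() on your file. On newer torch versions
# you may need torch.load(..., weights_only=False) for pickled config objects.
import torch

ckpt = torch.load("MViTv2_S_16x4_k400_f302660347.pyth", map_location="cpu")
print(sorted(ckpt.keys()))
state = ckpt.get("model_state", ckpt)   # fall back to the raw dict if absent
print(f"{len(state)} entries in the state dict")
```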

README.md

Lines changed: 2 additions & 0 deletions
@@ -22,8 +22,10 @@ The goal of PySlowFast is to provide a high-performance, light-weight pytorch co
 - I3D
 - Non-local Network
 - X3D
+- MViTv1 and MViTv2

 ## Updates
+- We now support [MViTv2](https://arxiv.org/abs/2112.01526) in PySlowFast. See [`projects/mvitv2`](./projects/mvitv2/README.md) for more information.
 - We now support [Multiscale Vision Transformers](https://arxiv.org/abs/2104.11227) on Kinetics and ImageNet. See [`projects/mvit`](./projects/mvit/README.md) for more information.
 - We now support [PyTorchVideo](https://github.com/facebookresearch/pytorchvideo) models and datasets. See [`projects/pytorchvideo`](./projects/pytorchvideo/README.md) for more information.
 - We now support [X3D Models](https://arxiv.org/abs/2004.04730). See [`projects/x3d`](./projects/x3d/README.md) for more information.

configs/Kinetics/MVITv2_B_32x3.yaml

Lines changed: 95 additions & 0 deletions
TRAIN:
  ENABLE: True
  DATASET: kinetics
  BATCH_SIZE: 16
  EVAL_PERIOD: 10
  CHECKPOINT_PERIOD: 10
  AUTO_RESUME: True
DATA:
  USE_OFFSET_SAMPLING: True
  DECODING_BACKEND: torchvision
  NUM_FRAMES: 32
  SAMPLING_RATE: 3
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  TEST_CROP_SIZE: 224
  INPUT_CHANNEL_NUM: [3]
  # PATH_TO_DATA_DIR: path-to-k400-dir
  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]
  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]
MVIT:
  ZERO_DECAY_POS_CLS: False
  USE_ABS_POS: False
  REL_POS_SPATIAL: True
  REL_POS_TEMPORAL: True
  DEPTH: 24
  NUM_HEADS: 1
  EMBED_DIM: 96
  PATCH_KERNEL: (3, 7, 7)
  PATCH_STRIDE: (2, 4, 4)
  PATCH_PADDING: (1, 3, 3)
  MLP_RATIO: 4.0
  QKV_BIAS: True
  DROPPATH_RATE: 0.3
  NORM: "layernorm"
  MODE: "conv"
  CLS_EMBED_ON: True
  DIM_MUL: [[2, 2.0], [5, 2.0], [21, 2.0]]
  HEAD_MUL: [[2, 2.0], [5, 2.0], [21, 2.0]]
  POOL_KVQ_KERNEL: [3, 3, 3]
  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]
  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 1, 1], [2, 1, 2, 2], [3, 1, 1, 1], [4, 1, 1, 1], [5, 1, 2, 2], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 1, 1], [15, 1, 1, 1], [16, 1, 1, 1], [17, 1, 1, 1], [18, 1, 1, 1], [19, 1, 1, 1], [20, 1, 1, 1], [21, 1, 2, 2], [22, 1, 1, 1], [23, 1, 1, 1]]
  DROPOUT_RATE: 0.0
  DIM_MUL_IN_ATT: True
  RESIDUAL_POOLING: True
AUG:
  NUM_SAMPLE: 2
  ENABLE: True
  COLOR_JITTER: 0.4
  AA_TYPE: rand-m7-n4-mstd0.5-inc1
  INTERPOLATION: bicubic
  RE_PROB: 0.25
  RE_MODE: pixel
  RE_COUNT: 1
  RE_SPLIT: False
MIXUP:
  ENABLE: True
  ALPHA: 0.8
  CUTMIX_ALPHA: 1.0
  PROB: 1.0
  SWITCH_PROB: 0.5
  LABEL_SMOOTH_VALUE: 0.1
SOLVER:
  ZERO_WD_1D_PARAM: True
  BASE_LR_SCALE_NUM_SHARDS: True
  CLIP_GRAD_L2NORM: 1.0
  BASE_LR: 0.0001
  COSINE_AFTER_WARMUP: True
  COSINE_END_LR: 1e-6
  WARMUP_START_LR: 1e-6
  WARMUP_EPOCHS: 30.0
  LR_POLICY: cosine
  MAX_EPOCH: 200
  MOMENTUM: 0.9
  WEIGHT_DECAY: 0.05
  OPTIMIZING_METHOD: adamw
MODEL:
  NUM_CLASSES: 400
  ARCH: mvit
  MODEL_NAME: MViT
  LOSS_FUNC: soft_cross_entropy
  DROPOUT_RATE: 0.5
TEST:
  ENABLE: True
  DATASET: kinetics
  BATCH_SIZE: 64
  NUM_SPATIAL_CROPS: 1
  NUM_ENSEMBLE_VIEWS: 5
DATA_LOADER:
  NUM_WORKERS: 8
  PIN_MEMORY: True
NUM_GPUS: 8
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: .
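A note on how the MVIT block above composes: `DIM_MUL`/`HEAD_MUL` entries are `[layer, factor]` and `POOL_Q_STRIDE` entries are `[layer, stride_t, stride_h, stride_w]`, so channel width, head count, and token resolution can be traced through the 24 blocks. A minimal sketch of that bookkeeping (pure Python, independent of PySlowFast):

```python
# Trace width / heads / spatial resolution through the MViTv2-B config above.
# Assumes DIM_MUL and HEAD_MUL are [layer, factor] and POOL_Q_STRIDE is
# [layer, stride_t, stride_h, stride_w], matching the fields in the config.
crop, patch_stride = 224, 4             # TRAIN_CROP_SIZE / spatial PATCH_STRIDE
dim, heads = 96, 1                      # EMBED_DIM / NUM_HEADS
scale_layers = {2: 2, 5: 2, 21: 2}      # layers where DIM_MUL and q-stride > 1

res = crop // patch_stride              # 56x56 tokens after the patch stem
for layer in range(24):                 # DEPTH: 24
    factor = scale_layers.get(layer, 1)
    dim *= factor                       # channel width doubles at each stage
    heads *= factor                     # head count doubles in lockstep
    res //= factor                      # q-pooling halves spatial resolution
print(dim, heads, res)                  # -> 768 8 7
```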
configs/Kinetics/MVITv2_L_40x3_test.yaml

Lines changed: 83 additions & 0 deletions
TRAIN:
  ENABLE: False
DATA:
  USE_OFFSET_SAMPLING: True
  DECODING_BACKEND: torchvision
  NUM_FRAMES: 40
  SAMPLING_RATE: 3
  TRAIN_JITTER_SCALES: [356, 446]
  TRAIN_CROP_SIZE: 312
  TEST_CROP_SIZE: 312
  INPUT_CHANNEL_NUM: [3]
  # PATH_TO_DATA_DIR: path-to-k400-dir
  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]
  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]
  MEAN: [0.485, 0.456, 0.406]
  STD: [0.229, 0.224, 0.225]
MVIT:
  ZERO_DECAY_POS_CLS: False
  USE_ABS_POS: False
  REL_POS_SPATIAL: True
  REL_POS_TEMPORAL: True
  DEPTH: 48
  NUM_HEADS: 2
  EMBED_DIM: 144
  PATCH_KERNEL: (3, 7, 7)
  PATCH_STRIDE: (2, 4, 4)
  PATCH_PADDING: (1, 3, 3)
  MLP_RATIO: 4.0
  QKV_BIAS: True
  DROPPATH_RATE: 0.75
  NORM: "layernorm"
  MODE: "conv"
  CLS_EMBED_ON: True
  DIM_MUL: [[2, 2.0], [8, 2.0], [44, 2.0]]
  HEAD_MUL: [[2, 2.0], [8, 2.0], [44, 2.0]]
  POOL_KVQ_KERNEL: [3, 3, 3]
  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]
  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 1, 1], [2, 1, 2, 2], [3, 1, 1, 1], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 2, 2], [9, 1, 1, 1], [10, 1, 1, 1],
                 [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 1, 1], [15, 1, 1, 1], [16, 1, 1, 1], [17, 1, 1, 1], [18, 1, 1, 1], [19, 1, 1, 1], [20, 1, 1, 1],
                 [21, 1, 1, 1], [22, 1, 1, 1], [23, 1, 1, 1], [24, 1, 1, 1], [25, 1, 1, 1], [26, 1, 1, 1], [27, 1, 1, 1], [28, 1, 1, 1], [29, 1, 1, 1], [30, 1, 1, 1],
                 [31, 1, 1, 1], [32, 1, 1, 1], [33, 1, 1, 1], [34, 1, 1, 1], [35, 1, 1, 1], [36, 1, 1, 1], [37, 1, 1, 1], [38, 1, 1, 1], [39, 1, 1, 1], [40, 1, 1, 1],
                 [41, 1, 1, 1], [42, 1, 1, 1], [43, 1, 1, 1], [44, 1, 2, 2], [45, 1, 1, 1], [46, 1, 1, 1], [47, 1, 1, 1]]
  DROPOUT_RATE: 0.0
  DIM_MUL_IN_ATT: True
  RESIDUAL_POOLING: True
AUG:
  # NUM_SAMPLE: 2
  ENABLE: True
  COLOR_JITTER: 0.4
  AA_TYPE: rand-m7-n4-mstd0.5-inc1
  INTERPOLATION: bicubic
  RE_PROB: 0.25
  RE_MODE: pixel
  RE_COUNT: 1
  RE_SPLIT: False
MIXUP:
  ENABLE: True
  ALPHA: 0.8
  CUTMIX_ALPHA: 1.0
  PROB: 0.0
  SWITCH_PROB: 0.5
  LABEL_SMOOTH_VALUE: 0.1
MODEL:
  NUM_CLASSES: 400
  ARCH: mvit
  MODEL_NAME: MViT
  LOSS_FUNC: soft_cross_entropy
  DROPOUT_RATE: 0.5
  ACT_CHECKPOINT: True
TEST:
  ENABLE: True
  DATASET: kinetics
  BATCH_SIZE: 8
  NUM_SPATIAL_CROPS: 3
  NUM_ENSEMBLE_VIEWS: 5
  # CHECKPOINT_FILE_PATH: # download pre-trained model
DATA_LOADER:
  NUM_WORKERS: 8
  PIN_MEMORY: True
NUM_GPUS: 8
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: .
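Since `TRAIN.ENABLE` is False, this config only runs evaluation: `TEST.NUM_ENSEMBLE_VIEWS: 5` temporal clips times `TEST.NUM_SPATIAL_CROPS: 3` spatial crops gives 15 scored views per video, matching the "2828 x 3 x 5" FLOPs entry for MViTv2-L in the project README. A minimal sketch of the view-averaging idea (`model` and `views` are placeholders here, not PySlowFast API):

```python
# Hypothetical multi-view ensembling: score each clip/crop view and average
# the class probabilities. PySlowFast performs this aggregation internally.
import torch

def multi_view_score(model: torch.nn.Module, views: list) -> torch.Tensor:
    """views: list of input tensors, one per temporal clip x spatial crop."""
    probs = [torch.softmax(model(v), dim=-1) for v in views]  # score per view
    return torch.stack(probs).mean(dim=0)                     # average over views
```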

configs/Kinetics/MVITv2_S_16x4.yaml

Lines changed: 95 additions & 0 deletions
TRAIN:
  ENABLE: True
  DATASET: kinetics
  BATCH_SIZE: 16
  EVAL_PERIOD: 10
  CHECKPOINT_PERIOD: 10
  AUTO_RESUME: True
DATA:
  USE_OFFSET_SAMPLING: True
  DECODING_BACKEND: torchvision
  NUM_FRAMES: 16
  SAMPLING_RATE: 4
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  TEST_CROP_SIZE: 224
  INPUT_CHANNEL_NUM: [3]
  # PATH_TO_DATA_DIR: path-to-k400-dir
  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]
  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]
MVIT:
  ZERO_DECAY_POS_CLS: False
  USE_ABS_POS: False
  REL_POS_SPATIAL: True
  REL_POS_TEMPORAL: True
  DEPTH: 16
  NUM_HEADS: 1
  EMBED_DIM: 96
  PATCH_KERNEL: (3, 7, 7)
  PATCH_STRIDE: (2, 4, 4)
  PATCH_PADDING: (1, 3, 3)
  MLP_RATIO: 4.0
  QKV_BIAS: True
  DROPPATH_RATE: 0.2
  NORM: "layernorm"
  MODE: "conv"
  CLS_EMBED_ON: True
  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]
  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]
  POOL_KVQ_KERNEL: [3, 3, 3]
  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]
  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 2, 2], [2, 1, 1, 1], [3, 1, 2, 2], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 2, 2], [15, 1, 1, 1]]
  DROPOUT_RATE: 0.0
  DIM_MUL_IN_ATT: True
  RESIDUAL_POOLING: True
AUG:
  NUM_SAMPLE: 2
  ENABLE: True
  COLOR_JITTER: 0.4
  AA_TYPE: rand-m7-n4-mstd0.5-inc1
  INTERPOLATION: bicubic
  RE_PROB: 0.25
  RE_MODE: pixel
  RE_COUNT: 1
  RE_SPLIT: False
MIXUP:
  ENABLE: True
  ALPHA: 0.8
  CUTMIX_ALPHA: 1.0
  PROB: 1.0
  SWITCH_PROB: 0.5
  LABEL_SMOOTH_VALUE: 0.1
SOLVER:
  ZERO_WD_1D_PARAM: True
  BASE_LR_SCALE_NUM_SHARDS: True
  CLIP_GRAD_L2NORM: 1.0
  BASE_LR: 0.0001
  COSINE_AFTER_WARMUP: True
  COSINE_END_LR: 1e-6
  WARMUP_START_LR: 1e-6
  WARMUP_EPOCHS: 30.0
  LR_POLICY: cosine
  MAX_EPOCH: 200
  MOMENTUM: 0.9
  WEIGHT_DECAY: 0.05
  OPTIMIZING_METHOD: adamw
MODEL:
  NUM_CLASSES: 400
  ARCH: mvit
  MODEL_NAME: MViT
  LOSS_FUNC: soft_cross_entropy
  DROPOUT_RATE: 0.5
TEST:
  ENABLE: True
  DATASET: kinetics
  BATCH_SIZE: 64
  NUM_SPATIAL_CROPS: 1
  NUM_ENSEMBLE_VIEWS: 5
DATA_LOADER:
  NUM_WORKERS: 8
  PIN_MEMORY: True
NUM_GPUS: 8
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: .
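The SOLVER block encodes a linear warmup followed by cosine decay (`COSINE_AFTER_WARMUP: True`). A minimal sketch of the schedule these numbers imply, assuming the standard warmup-then-cosine recipe (PySlowFast's exact implementation may differ in details):

```python
import math

# Warmup + cosine schedule from the SOLVER values above: BASE_LR=1e-4,
# WARMUP_START_LR=1e-6, WARMUP_EPOCHS=30, COSINE_END_LR=1e-6, MAX_EPOCH=200.
def lr_at(epoch: float, base=1e-4, warm_start=1e-6, warm=30.0,
          end=1e-6, max_epoch=200.0) -> float:
    if epoch < warm:                                  # linear warmup
        return warm_start + (base - warm_start) * epoch / warm
    t = (epoch - warm) / (max_epoch - warm)           # cosine after warmup
    return end + (base - end) * 0.5 * (1.0 + math.cos(math.pi * t))

print(lr_at(0), lr_at(30), lr_at(200))                # 1e-06  1e-04  1e-06
```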

projects/mvitv2/README.md

Lines changed: 59 additions & 0 deletions
# [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](https://arxiv.org/abs/2112.01526)

Official PyTorch implementation of **MViTv2**, from the following paper:

[MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](https://arxiv.org/abs/2112.01526). CVPR 2022.\
Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*

---

MViT is a multiscale transformer that serves as a general vision backbone for different visual recognition tasks. PySlowFast supports MViTv2 for video action recognition and detection. For other tasks, please check:

> **Image Classification**: See [MViTv2 for image classification](https://github.com/facebookresearch/mvit).

> **Object Detection and Instance Segmentation**: See [MViTv2 in Detectron2](https://github.com/facebookresearch/detectron2/tree/main/projects/MViTv2).

<div align="center">
<img src="mvitv2.png" width="500px" />
</div>
<br/>

## Results

### Kinetics-400

| name | frame length x sample rate | top1 | top5 | Flops (G) x views | #params (M) | model | config |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| MViTv2-S | 16 x 4 | 81.0 | 94.6 | 64 x 1 x 5 | 34.5 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_S_16x4_k400_f302660347.pyth) | Kinetics/MVITv2_S_16x4 |
| MViTv2-B | 32 x 3 | 82.9 | 95.7 | 225 x 1 x 5 | 51.2 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_B_32x3_k400_f304025456.pyth) | Kinetics/MVITv2_B_32x3 |
| MViTv2-L | 40 x 3 | 86.1 | 97.0 | 2828 x 3 x 5 | 217.6 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_L_40x3_k400_f306903192.pyth) | Kinetics/MVITv2_L_40x3_test |

## Get started

You can train a standard MViTv2 model from scratch with:

```
python tools/run_net.py \
  --cfg configs/Kinetics/MVITv2_S_16x4.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset
```
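To evaluate a released checkpoint instead of training, point `TEST.CHECKPOINT_FILE_PATH` at the downloaded `.pyth` file (see the commented-out field in the test config above) and set `TRAIN.ENABLE False` as a command-line override.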

## Citing MViTv2

If you find this repository helpful, please consider citing:
```
@inproceedings{li2021improved,
  title={MViTv2: Improved multiscale vision transformers for classification and detection},
  author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={CVPR},
  year={2022}
}

@inproceedings{fan2021multiscale,
  title={Multiscale vision transformers},
  author={Fan, Haoqi and Xiong, Bo and Mangalam, Karttikeya and Li, Yanghao and Yan, Zhicheng and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={ICCV},
  year={2021}
}
```

projects/mvitv2/mvitv2.png

Binary file added (516 KB)
