Commit cc92d6b

NicolasHug authored and facebook-github-bot committed
[fbsync] Introduce resize params, fix lr estimation, update docs. (#6444)
Reviewed By: datumbox Differential Revision: D39013673 fbshipit-source-id: b2f48dc16310311a2555fde77fae5db8edba8ec3
1 parent 3c365cd commit cc92d6b

3 files changed: +107 -67 lines changed

references/video_classification/README.md

Lines changed: 69 additions & 2 deletions
@@ -18,11 +18,11 @@ We assume the training and validation AVI videos are stored at `/data/kinectics4

 Run the training on a single node with 8 GPUs:
 ```bash
-torchrun --nproc_per_node=8 train.py --data-path=/data/kinectics400 --kinetics-version="400" --batch-size=16 --cache-dataset --sync-bn --amp
+torchrun --nproc_per_node=8 train.py --data-path=/data/kinectics400 --kinetics-version="400" --lr 0.08 --cache-dataset --sync-bn --amp
 ```

 **Note:** all our models were trained on 8 nodes with 8 V100 GPUs each for a total of 64 GPUs. Expected training time for 64 GPUs is 24 hours, depending on the storage solution.
-**Note 2:** hyperparameters for exact replication of our training can be found [here](https://github.com/pytorch/vision/blob/main/torchvision/models/video/README.md). Some hyperparameters such as learning rate are scaled linearly in proportion to the number of GPUs.
+**Note 2:** hyperparameters for exact replication of our training can be found on the section below. Some hyperparameters such as learning rate must be scaled linearly in proportion to the number of GPUs. The default values assume 64 GPUs.

 ### Single GPU
@@ -40,3 +40,70 @@ Since the original release, additional versions of Kinetics dataset became avail
 Our training scripts support these versions of dataset as well by setting the `--kinetics-version` parameter to `"600"`.

 **Note:** training on Kinetics 600 requires a different set of hyperparameters for optimal performance. We do not provide Kinetics 600 pretrained models.
+
+
+## Video classification models
+
+Starting with version `0.4.0` we have introduced support for basic video tasks and video classification modelling.
+For more information about the available models check [here](https://pytorch.org/docs/stable/torchvision/models.html#video-classification).
+
+### Video ResNet models
+
+See reference training script [here](https://github.com/pytorch/vision/blob/main/references/video_classification/train.py):
+
+- input space: RGB
+- resize size: [128, 171]
+- crop size: [112, 112]
+- mean: [0.43216, 0.394666, 0.37645]
+- std: [0.22803, 0.22145, 0.216989]
+- number of classes: 400
+
+Input data augmentations at training time (with optional parameters):
+
+1. ConvertImageDtype
+2. Resize (resize size value above)
+3. Random horizontal flip (0.5)
+4. Normalization (mean, std, see values above)
+5. Random Crop (crop size value above)
+6. Convert BCHW to CBHW
+
+Input data augmentations at validation time (with optional parameters):
+
+1. ConvertImageDtype
+2. Resize (resize size value above)
+3. Normalization (mean, std, see values above)
+4. Center Crop (crop size value above)
+5. Convert BCHW to CBHW
+
+This translates in the following set of command-line arguments. Please note that `--batch-size` parameter controls the
+batch size per GPU. Moreover note that our default `--lr` is configured for 64 GPUs which is how many we used for the
+Video resnet models:
+```
+# number of frames per clip
+--clip_len 16 \
+# allow for temporal jittering
+--clips_per_video 5 \
+--batch-size 24 \
+--epochs 45 \
+--lr 0.64 \
+# we use 10 epochs for linear warmup
+--lr-warmup-epochs 10 \
+# learning rate is decayed at 20, 30, and 40 epoch by a factor of 10
+--lr-milestones 20, 30, 40 \
+--lr-gamma 0.1 \
+--train-resize-size 128 171 \
+--train-crop-size 112 112 \
+--val-resize-size 128 171 \
+--val-crop-size 112 112
+```
+
+### Additional video modelling resources
+
+- [Video Model Zoo](https://github.com/facebookresearch/VMZ)
+- [PySlowFast](https://github.com/facebookresearch/SlowFast)
+
+### References
+
+[0] _D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun and M. Paluri_: A Closer Look at Spatiotemporal Convolutions for Action Recognition. _CVPR 2018_ ([paper](https://research.fb.com/wp-content/uploads/2018/04/a-closer-look-at-spatiotemporal-convolutions-for-action-recognition.pdf))
+
+[1] _W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, A. Zisserman_: The Kinetics Human Action Video Dataset ([paper](https://arxiv.org/abs/1705.06950))
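
The linear scaling rule in Note 2 can be sketched as below. This is a minimal illustration and not code from the commit: the helper name `scale_lr` is hypothetical, and only the reference point (`--lr 0.64` for 64 GPUs) is taken from the README text.

```python
def scale_lr(num_gpus: int, reference_lr: float = 0.64, reference_gpus: int = 64) -> float:
    """Scale the learning rate linearly with the GPU count (hypothetical helper)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    # Linear rule: the learning rate grows in proportion to the number of GPUs.
    return reference_lr * num_gpus / reference_gpus


if __name__ == "__main__":
    # Print the --lr value one would pass for a few cluster sizes.
    for n in (1, 8, 64):
        print(f"{n:>2} GPUs -> --lr {scale_lr(n):g}")
```

Under this rule, 8 GPUs give the `--lr 0.08` used in the single-node command. One GPU gives 0.01, which is the old `--lr` default that `train.py` used to multiply by `args.world_size`.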

references/video_classification/train.py

Lines changed: 38 additions & 5 deletions
@@ -149,13 +149,18 @@ def main(args):

     # Data loading code
     print("Loading data")
+    val_resize_size = tuple(args.val_resize_size)
+    val_crop_size = tuple(args.val_crop_size)
+    train_resize_size = tuple(args.train_resize_size)
+    train_crop_size = tuple(args.train_crop_size)
+
     traindir = os.path.join(args.data_path, "train")
     valdir = os.path.join(args.data_path, "val")

     print("Loading training data")
     st = time.time()
     cache_path = _get_cache_path(traindir, args)
-    transform_train = presets.VideoClassificationPresetTrain(crop_size=(112, 112), resize_size=(128, 171))
+    transform_train = presets.VideoClassificationPresetTrain(crop_size=train_crop_size, resize_size=train_resize_size)

     if args.cache_dataset and os.path.exists(cache_path):
         print(f"Loading dataset_train from {cache_path}")
@@ -192,7 +197,7 @@ def main(args):
         weights = torchvision.models.get_weight(args.weights)
         transform_test = weights.transforms()
     else:
-        transform_test = presets.VideoClassificationPresetEval(crop_size=(112, 112), resize_size=(128, 171))
+        transform_test = presets.VideoClassificationPresetEval(crop_size=val_crop_size, resize_size=val_resize_size)

     if args.cache_dataset and os.path.exists(cache_path):
         print(f"Loading dataset_test from {cache_path}")
@@ -253,8 +258,7 @@ def main(args):

     criterion = nn.CrossEntropyLoss()

-    lr = args.lr * args.world_size
-    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=args.momentum, weight_decay=args.weight_decay)
+    optimizer = torch.optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum, weight_decay=args.weight_decay)
     scaler = torch.cuda.amp.GradScaler() if args.amp else None

     # convert scheduler to be per iteration, not per epoch, for warmup that lasts
@@ -354,7 +358,7 @@ def get_args_parser(add_help=True):
     parser.add_argument(
         "-j", "--workers", default=10, type=int, metavar="N", help="number of data loading workers (default: 10)"
     )
-    parser.add_argument("--lr", default=0.01, type=float, help="initial learning rate")
+    parser.add_argument("--lr", default=0.64, type=float, help="initial learning rate")
     parser.add_argument("--momentum", default=0.9, type=float, metavar="M", help="momentum")
     parser.add_argument(
         "--wd",
@@ -400,6 +404,35 @@ def get_args_parser(add_help=True):
     parser.add_argument("--world-size", default=1, type=int, help="number of distributed processes")
     parser.add_argument("--dist-url", default="env://", type=str, help="url used to set up distributed training")

+    parser.add_argument(
+        "--val-resize-size",
+        default=(128, 171),
+        nargs="+",
+        type=int,
+        help="the resize size used for validation (default: (128, 171))",
+    )
+    parser.add_argument(
+        "--val-crop-size",
+        default=(112, 112),
+        nargs="+",
+        type=int,
+        help="the central crop size used for validation (default: (112, 112))",
+    )
+    parser.add_argument(
+        "--train-resize-size",
+        default=(128, 171),
+        nargs="+",
+        type=int,
+        help="the resize size used for training (default: (128, 171))",
+    )
+    parser.add_argument(
+        "--train-crop-size",
+        default=(112, 112),
+        nargs="+",
+        type=int,
+        help="the random crop size used for training (default: (112, 112))",
+    )
+
     parser.add_argument("--weights", default=None, type=str, help="the weights enum name to load")

     # Mixed precision training parameters
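
The new size options rely on `nargs="+"`, and `main` wraps them in `tuple(...)`. The sketch below shows why that conversion is useful. It is a stand-alone illustration under assumptions: `build_parser` and `sizes_from` are hypothetical names, and the parser copies only the two training options from the diff above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-alone parser mirroring two of the options added in the diff.
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-resize-size", default=(128, 171), nargs="+", type=int)
    parser.add_argument("--train-crop-size", default=(112, 112), nargs="+", type=int)
    return parser


def sizes_from(argv):
    args = build_parser().parse_args(argv)
    # With nargs="+", argparse returns a list when the flag is passed,
    # while an untouched default stays a tuple; tuple(...) normalizes both.
    return tuple(args.train_resize_size), tuple(args.train_crop_size)


if __name__ == "__main__":
    print(sizes_from([]))
    print(sizes_from(["--train-resize-size", "256", "342", "--train-crop-size", "224", "224"]))
```

Because `nargs="+"` accepts one or more values, a single value such as `--train-crop-size 96` parses to a one-element tuple. How the presets treat that case is not shown in this diff.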

torchvision/models/video/README.md

Lines changed: 0 additions & 60 deletions
This file was deleted.
