Add DDP to WMT + faster ImageNet data loading #85

runame · 2022-06-21T16:44:04Z

Follow-up PR to #81.

This PR

adds DDP to WMT PyTorch workload,
fixes bug in WMT Jax workload,
improves the data loading for ImageNet PyTorch workload,
adds instructions for running DDP to README.

One question specifically for @mikerabbat: I have implemented the distributed sampling for the TF dataset by simply manually sharding each global batch across devices, see this wrapper which assumes this function has been mapped on the dataset before. I think you mentioned at some point that there is an issue with this, but I'm not sure what it was.

And one question @znado might be able to help with: when running the DDP WMT PyTorch workload with 8xV100s instead of 4xV100s, I'm getting this error:

Check failed: ret == 0 (11 vs. 0)Thread tf_pjrt_thread_pool creation via pthread_create() failed.

I think this is caused by the fact that the TF input pipeline is created for each process separately, i.e. as many times as there are GPUs. The error seems to indicate that this exhausts the resources, see this explanation. I tried setting AUTOTUNE = None (here) and increasing the number of threads which are allowed per process (using torch.set_num_threads(N) with N up to 8; the default ist N=1), but both didn't help.

github-actions · 2022-06-21T16:44:18Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

znado · 2022-06-22T19:29:14Z

I would have hoped that setting AUTOTUNE = None would have fixed that out of threads issue. Beyond the number of threads allowed per process, do you know if the host machines you're running on have a low number of max threads allowed (I could see that being the case on some clusters to avoid accidental fork bombs)? I'm not sure the most relevant place to check, but I know there are some places you can check at the Linux/system level (you may already have, just checking!).

As for sharding the global batch across GPUs like you do, would it instead be more efficient to run each copy of the input pipeline with the per-GPU batch size? Then you don't need to slice the batch (like you do here), you would just need to add a toggleable flag to the input pipeline to not shard the batches. Maybe this would help avoid memory issues (although it looks like the resource issue you're having is with threading, not memory?) To make sure you get a unique batch on each process, you can (hopefully?) fold the process rank into the seed, for example you could add this here:

data_rng = jax.random.fold_in(data_rng, torch.distributed.get_rank())
np_iter = super().build_input_queue(data_rng, ...)

znado

just one question, otherwise this is fantastic, thanks so much for all these fixes!! it's awesome to get the pytorch pipelines sped up

algorithmic_efficiency/workloads/wmt/input_pipeline.py

algorithmic_efficiency/workloads/wmt/wmt_jax/workload.py

Conflicts: reference_submissions/imagenet_vit/imagenet_pytorch/submission.py

runame · 2022-06-28T13:39:48Z

I would have hoped that setting AUTOTUNE = None would have fixed that out of threads issue. Beyond the number of threads allowed per process, do you know if the host machines you're running on have a low number of max threads allowed (I could see that being the case on some clusters to avoid accidental fork bombs)? I'm not sure the most relevant place to check, but I know there are some places you can check at the Linux/system level (you may already have, just checking!).

As for sharding the global batch across GPUs like you do, would it instead be more efficient to run each copy of the input pipeline with the per-GPU batch size? Then you don't need to slice the batch (like you do here), you would just need to add a toggleable flag to the input pipeline to not shard the batches. Maybe this would help avoid memory issues (although it looks like the resource issue you're having is with threading, not memory?) To make sure you get a unique batch on each process, you can (hopefully?) fold the process rank into the seed, for example you could add this here:

I haven't looked much further into the threads issue, but will try again later. PyTorch limits the threads per process to 1 by default when using DDP (OMP_NUM_THREADS=1), but increasing this didn't help -- I'll double check that there is no limit on the hardware/system level. In any case, I think it's fine to merge this PR before this is fixed.

Regarding a more efficient sharding strategy: I also briefly thought about this, but unless I'm missing something there might be examples appearing multiple times in the same global batch with your suggested approach, since even though the local batches won't exactly be the same due to the different random seeds, all processes still have access to the full dataset. When only passing a subset of the dataset to each process, the issue is that the shuffling will be biased, since there cannot be > per_device_batch_size samples from one of the subsets in the same global batch. Hence why I chose the simple but inefficient approach.

znado · 2022-06-29T05:17:59Z

Yeah that's a good point regarding possibly repeated examples. A solution we've used to that is sharding the input files across processes, so you guarantee you have different examples per process (and then you don't need to mess with the RNGs).

runame added 12 commits June 13, 2022 22:51

Address Zack's comments

dd5dcc9

Change ImageNet batch size + fix minor diffs

0452f1f

Add fast_collate fn and PrefetchedWrapper for ImageNet PyTorch workload

955daf2

Fix README typo

d39aab8

Ignore bash scripts and output files

c2a1b6c

Rename PYTORCH_DDP to USE_PYTORCH_DDP

c89e1a4

Add DDP instructions to README

676149e

Add TFDistributedSampler

ba0daf2

Adjust WMT input pipeline for DDP

d1cac47

Fix WMT Jax workload

102aaaf

Add DDP to WMT

4c57f05

Ignore pylint error

b6b6af8

runame requested review from znado, mikerabbat and fsschneider June 21, 2022 16:44

znado approved these changes Jun 23, 2022

View reviewed changes

algorithmic_efficiency/workloads/wmt/input_pipeline.py Outdated Show resolved Hide resolved

algorithmic_efficiency/workloads/wmt/wmt_jax/workload.py Show resolved Hide resolved

runame added 2 commits June 23, 2022 14:20

Minor fixes

5b9ef21

Merge branch 'main' into pytorch-speedups

1317227

Conflicts: reference_submissions/imagenet_vit/imagenet_pytorch/submission.py

znado merged commit ffdfc08 into mlcommons:main Jun 29, 2022

github-actions bot locked and limited conversation to collaborators Jun 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add DDP to WMT + faster ImageNet data loading #85

Add DDP to WMT + faster ImageNet data loading #85

runame commented Jun 21, 2022

github-actions bot commented Jun 21, 2022 •

edited

Loading

znado commented Jun 22, 2022

znado left a comment

runame commented Jun 28, 2022

znado commented Jun 29, 2022

Add DDP to WMT + faster ImageNet data loading #85

Add DDP to WMT + faster ImageNet data loading #85

Conversation

runame commented Jun 21, 2022

github-actions bot commented Jun 21, 2022 • edited Loading

znado commented Jun 22, 2022

znado left a comment

Choose a reason for hiding this comment

runame commented Jun 28, 2022

znado commented Jun 29, 2022

github-actions bot commented Jun 21, 2022 •

edited

Loading