
Fix bugs and implement DDP #81

Merged: znado merged 14 commits into mlcommons:main on Jun 13, 2022

Conversation

@runame (Contributor) commented Jun 13, 2022

Changes implemented in this PR:

  • Implement DDP for MNIST + ImageNet (see the sketch after this list)
  • Fix MNIST and ImageNet bugs
  • Add data_utils.py file which collects helper functions related to the input pipelines
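
A minimal sketch of how a LOCAL_RANK-based DDP setup for the PyTorch workloads could look (the helper name and flag plumbing are illustrative, not the exact PR code; it assumes a torchrun-style launcher that sets LOCAL_RANK):

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

USE_PYTORCH_DDP = 'LOCAL_RANK' in os.environ

def setup_ddp(model):
  # Hypothetical helper: wrap the model for multi-GPU training.
  if USE_PYTORCH_DDP:
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend='nccl')
    model = model.to(local_rank)
    # DDP all-reduces gradients across processes during backward().
    model = DDP(model, device_ids=[local_rank])
  return model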

There will be a follow-up PR which covers:

  • Add DDP to the WMT workload
  • Potential speed-ups for the ImageNet and WMT workloads
  • Maybe improve the interface to let the user choose the number of GPUs to use (right now it simply uses all available GPUs)

@runame runame requested review from znado and fsschneider June 13, 2022 10:06
@github-actions (MLCommons CLA bot): All contributors have signed the MLCommons CLA ✍️ ✅

- crop_rng, flip_rng = tf.random.experimental.stateless_split(rng, 2)
+ # Note (runame): Cannot be done in graph mode, i.e. during ds.map().
+ # Alternative?
+ # crop_rng, flip_rng = tf.random.experimental.stateless_split(rng, 2)
runame (Contributor, Author):

@znado This doesn't work in graph mode (see my comment), what do you think is the best alternative?

Contributor:

your current change is fine for now; it's not good to reuse seeds, but let's just add it to the list of issues. what was the error you got? it's really weird that this couldn't be run in graph mode (but I believe it lol), since all this fn does is call random_uniform

runame (Contributor, Author):

Something like "stateless_split iterating over tf.Tensor is not allowed in Graph execution" (according to my Google search history lol). It's especially weird because stateless_fold_in() works just fine here, even though it is also called inside the ds.map() call, i.e. also executed in graph mode, and it calls the same underlying function. So the issue might be related to the shape argument of stateless_random_uniform(), but I haven't looked into it further.
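
For reference, a hedged sketch of the fold-in alternative inside the input pipeline (the crop size and augmentation choices are illustrative, not the actual workload code):

import tensorflow as tf

def preprocess(image, rng):
  # stateless_fold_in works inside ds.map(), unlike stateless_split.
  crop_rng = tf.random.experimental.stateless_fold_in(rng, 0)
  flip_rng = tf.random.experimental.stateless_fold_in(rng, 1)
  image = tf.image.stateless_random_crop(
      image, size=[224, 224, 3], seed=crop_rng)
  image = tf.image.stateless_random_flip_left_right(image, seed=flip_rng)
  return image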

  if split == 'eval_train':
-   split = 'train'
+   split = 'train[:50000]'
Contributor:

good catch. should we instead, in the caller function, pass in split='train[:{num_eval_train_examples}]'?

runame (Contributor, Author):

Makes a lot of sense, already changed it.
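
For illustration, a hedged sketch of the caller-side version (the dataset name and the num_eval_train_examples value are assumptions):

import tensorflow_datasets as tfds

num_eval_train_examples = 50000  # assumed workload constant

def load_split(split):
  # The caller maps the logical 'eval_train' split to a concrete TFDS slice.
  if split == 'eval_train':
    split = f'train[:{num_eval_train_examples}]'
  return tfds.load('mnist', split=split)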

      yield {'inputs': images, 'targets': labels}
    except StopIteration:
      iterator = iter(iterable)

PYTORCH_DDP = 'LOCAL_RANK' in os.environ
Contributor:

nit: should we rename this to USE_PYTORCH_DDP?

runame (Contributor, Author):

I don't really have any opinion on this, maybe @fsschneider has?

Contributor:

USE_PYTORCH_DDP sounds good to me

Comment on lines +121 to +125
if isinstance(batch, dict):
  inputs = batch['inputs']
  targets = batch['targets']
else:
  inputs, targets = batch
Contributor:

given that we got rid of the DictMNIST class, can we just always assume these are tuples (so we can delete the if/else)?

runame (Contributor, Author):

The custom cycle function returns a dict, but it is only used for training here; for testing, itertools.cycle is used. itertools.cycle caches the whole dataset in memory, which shouldn't happen during training, and even if the dataset size were no issue, it would not re-shuffle. But I don't think it is any less efficient to use my custom cycle function for the test set as well, so we can get rid of the if-statement (and it is consistent with e.g. the ImageNet workload); see the sketch below.
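
A hedged reconstruction of the custom cycle function under discussion, pieced together from the snippets above (the exact PR code may differ):

def cycle(iterable):
  iterator = iter(iterable)
  while True:
    try:
      images, labels = next(iterator)
      yield {'inputs': images, 'targets': labels}
    except StopIteration:
      # Unlike itertools.cycle, nothing is cached: re-creating the iterator
      # re-shuffles when `iterable` is a shuffling DataLoader.
      iterator = iter(iterable)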

@znado znado merged commit 1b5505f into mlcommons:main Jun 13, 2022
@github-actions github-actions bot locked and limited conversation to collaborators Jun 13, 2022
@@ -142,7 +142,7 @@ profile=google
 # pylint configuration
 [pylint.MASTER]
 persistent=no # Pickle collected data for later comparisons.
-cache-size=500 # Set the cache size for astng objects.
+#cache-size=500 # Set the cache size for astng objects.
Contributor:

Did Pylint update and some options become deprecated, or what is happening here?
In general, we could switch to the pylintrc from the Google Style Guide and replace our options, what do you think?

runame (Contributor, Author):

Yes, these options are no longer supported in pylint>=2.14.0. Switching to the pylintrc from the Google Style Guide sounds reasonable.

if FLAGS.framework == 'pytorch':
  # From the docs: "(...) causes cuDNN to benchmark multiple convolution
  # algorithms and select the fastest."
  torch.backends.cudnn.benchmark = True
Contributor:

I think the issue with benchmark = True is that it is less deterministic. It would be interesting to see whether it makes any difference in terms of speed.
Similarly, we might want to set torch.use_deterministic_algorithms, see https://pytorch.org/docs/stable/notes/randomness.html.
Perhaps @mikerabbat can guide us here?
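
A hedged sketch of the trade-off, following the linked PyTorch randomness notes (the helper and its flag are illustrative):

import torch

def configure_cudnn(deterministic):
  if deterministic:
    # Reproducible runs: fixed algorithms, no cuDNN autotuning. Per the
    # linked notes, full determinism may also require setting the
    # CUBLAS_WORKSPACE_CONFIG environment variable.
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)
  else:
    # Faster runs: let cuDNN benchmark convolution algorithms and pick the
    # fastest, at the cost of run-to-run variation.
    torch.backends.cudnn.benchmark = True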
