
Fix bugs and implement DDP #81

Merged: znado merged 14 commits into mlcommons:main on Jun 13, 2022

Conversation

@runame (Contributor) commented Jun 13, 2022

Changes implemented in this PR:

  • Implement DDP for MNIST + ImageNet (see the sketch after this list)
  • Fix MNIST and ImageNet bugs
  • Add data_utils.py file which collects helper functions related to the input pipelines
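
A minimal sketch of how a LOCAL_RANK-based DDP setup for the PyTorch workloads could look (the helper name and flag plumbing are illustrative, not the exact PR code; it assumes a torchrun-style launcher that sets LOCAL_RANK):

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

USE_PYTORCH_DDP = 'LOCAL_RANK' in os.environ

def setup_ddp(model):
  # Hypothetical helper: wrap the model for multi-GPU training.
  if USE_PYTORCH_DDP:
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend='nccl')
    model = model.to(local_rank)
    # DDP all-reduces gradients across processes during backward().
    model = DDP(model, device_ids=[local_rank])
  return model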

There will be a follow-up PR which covers:

  • Add DDP to the WMT workload
  • Potential speed-ups for the ImageNet and WMT workloads
  • Maybe improve the interface to let the user choose the number of GPUs to use (right now it simply uses all available GPUs)

@runame runame requested review from znado and fsschneider June 13, 2022 10:06
@github-actions (MLCommons CLA bot): All contributors have signed the MLCommons CLA ✍️ ✅

- crop_rng, flip_rng = tf.random.experimental.stateless_split(rng, 2)
+ # Note (runame): Cannot be done in graph mode, i.e. during ds.map().
+ # Alternative?
+ # crop_rng, flip_rng = tf.random.experimental.stateless_split(rng, 2)
runame (Contributor, Author):

@znado This doesn't work in graph mode (see my comment), what do you think is the best alternative?

Contributor:

your current change is fine for now; it's not good to reuse seeds, but let's just add it to the list of issues. what was the error you got? it's really weird that this couldn't be run in graph mode (but I believe it lol), since all this fn does is call random_uniform

runame (Contributor, Author):

Something like "stateless_split iterating over tf.Tensor is not allowed in Graph execution" (according to my Google search history lol). It's especially weird because stateless_fold_in() works just fine here, even though it is also called inside the ds.map() call, i.e. also executed in graph mode, and it calls the same underlying function. So the issue might be related to the shape argument of stateless_random_uniform(), but I haven't looked into it further.
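
For reference, a hedged sketch of the fold-in alternative inside the input pipeline (the crop size and augmentation choices are illustrative, not the actual workload code):

import tensorflow as tf

def preprocess(image, rng):
  # stateless_fold_in works inside ds.map(), unlike stateless_split.
  crop_rng = tf.random.experimental.stateless_fold_in(rng, 0)
  flip_rng = tf.random.experimental.stateless_fold_in(rng, 1)
  image = tf.image.stateless_random_crop(
      image, size=[224, 224, 3], seed=crop_rng)
  image = tf.image.stateless_random_flip_left_right(image, seed=flip_rng)
  return image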

  if split == 'eval_train':
-   split = 'train'
+   split = 'train[:50000]'
Contributor:

good catch. should we instead, in the caller function, pass in split='train[:{num_eval_train_examples}]'?

runame (Contributor, Author):

Makes a lot of sense, already changed it.
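
For illustration, a hedged sketch of the caller-side version (the dataset name and the num_eval_train_examples value are assumptions):

import tensorflow_datasets as tfds

num_eval_train_examples = 50000  # assumed workload constant

def load_split(split):
  # The caller maps the logical 'eval_train' split to a concrete TFDS slice.
  if split == 'eval_train':
    split = f'train[:{num_eval_train_examples}]'
  return tfds.load('mnist', split=split)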

      yield {'inputs': images, 'targets': labels}
    except StopIteration:
      iterator = iter(iterable)

PYTORCH_DDP = 'LOCAL_RANK' in os.environ
Contributor:

nit: should we rename this to USE_PYTORCH_DDP?

runame (Contributor, Author):

I don't really have any opinion on this, maybe @fsschneider has?

Contributor:

USE_PYTORCH_DDP sounds good to me

Comment on lines +121 to +125
if isinstance(batch, dict):
  inputs = batch['inputs']
  targets = batch['targets']
else:
  inputs, targets = batch
Contributor:

given that we got rid of the DictMNIST class, can we just always assume these are tuples (so we can delete the if/else)?

runame (Contributor, Author):

The custom cycle function returns a dict, but it is only used for training here; for testing, itertools.cycle is used. itertools.cycle caches the whole dataset in memory, which shouldn't happen during training, and even if the dataset size were no issue, it would not re-shuffle. But I don't think it is any less efficient to use my custom cycle function for the test set as well, so we can get rid of the if-statement (and it is consistent with e.g. the ImageNet workload); see the sketch below.
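
A hedged reconstruction of the custom cycle function under discussion, pieced together from the snippets above (the exact PR code may differ):

def cycle(iterable):
  iterator = iter(iterable)
  while True:
    try:
      images, labels = next(iterator)
      yield {'inputs': images, 'targets': labels}
    except StopIteration:
      # Unlike itertools.cycle, nothing is cached: re-creating the iterator
      # re-shuffles when `iterable` is a shuffling DataLoader.
      iterator = iter(iterable)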

@znado znado merged commit 1b5505f into mlcommons:main Jun 13, 2022
@github-actions github-actions bot locked and limited conversation to collaborators Jun 13, 2022
@@ -142,7 +142,7 @@ profile=google
 # pylint configuration
 [pylint.MASTER]
 persistent=no # Pickle collected data for later comparisons.
-cache-size=500 # Set the cache size for astng objects.
+#cache-size=500 # Set the cache size for astng objects.
Contributor:

Did Pylint update and some options become deprecated, or what is happening here?
In general, we could switch to the pylintrc from the Google Style Guide and replace our options, what do you think?

runame (Contributor, Author):

Yes, these options are no longer supported in pylint>=2.14.0. Switching to the pylintrc from the Google Style Guide sounds reasonable.

if FLAGS.framework == 'pytorch':
  # From the docs: "(...) causes cuDNN to benchmark multiple convolution
  # algorithms and select the fastest."
  torch.backends.cudnn.benchmark = True
Contributor:

I think the issue with benchmark = True is that it is less deterministic. It would be interesting to see whether it makes any difference in terms of speed.
Similarly, we might want to set torch.use_deterministic_algorithms, see https://pytorch.org/docs/stable/notes/randomness.html.
Perhaps @mikerabbat can guide us here?
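
A hedged sketch of the trade-off, following the linked PyTorch randomness notes (the helper and its flag are illustrative):

import torch

def configure_cudnn(deterministic):
  if deterministic:
    # Reproducible runs: fixed algorithms, no cuDNN autotuning. Per the
    # linked notes, full determinism may also require setting the
    # CUBLAS_WORKSPACE_CONFIG environment variable.
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)
  else:
    # Faster runs: let cuDNN benchmark convolution algorithms and pick the
    # fastest, at the cost of run-to-run variation.
    torch.backends.cudnn.benchmark = True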
