Fix bugs and implement DDP #81

Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
-crop_rng, flip_rng = tf.random.experimental.stateless_split(rng, 2)
+# Note (runame): Cannot be done in graph mode, i.e. during ds.map().
+# Alternative?
+# crop_rng, flip_rng = tf.random.experimental.stateless_split(rng, 2)
@znado This doesn't work in graph mode (see my comment), what do you think is the best alternative?
your current change is fine for now, it's not good to reuse seeds but let's just add it to the list of issues. what was the error you got? it's really weird that this couldn't be run in graph mode (but I believe it lol), all this fn does is call random_uniform
Something like "stateless_split iterating over tf.Tensor is not allowed in Graph execution" (according to my google search history lol). It's especially weird because here, stateless_fold_in() works just fine and is also called inside of the ds.map() call, i.e. is also executed in graph mode. And it is calling the same underlying function. So the issue might be related to the shape argument in stateless_random_uniform(), but I haven't looked into it further.
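For reference, a minimal sketch of one possible workaround, not what this PR does: indexing the [num, 2] seed tensor returned by stateless_split instead of tuple-unpacking it avoids iterating over a tf.Tensor inside ds.map(), which is what graph mode rejects. The function name, crop size, and augmentation calls below are assumptions for illustration.

import tensorflow as tf

def _preprocess(rng, image):
  # stateless_split returns a [2, 2] integer tensor of new seeds. Indexing the
  # rows (instead of `a, b = ...` unpacking) avoids iterating over a tf.Tensor
  # inside ds.map(), which graph mode does not allow.
  seeds = tf.random.experimental.stateless_split(rng, num=2)
  crop_rng = seeds[0]
  flip_rng = seeds[1]
  # Hypothetical CIFAR-style augmentations using the two derived seeds.
  image = tf.image.stateless_random_crop(image, size=(32, 32, 3), seed=crop_rng)
  image = tf.image.stateless_random_flip_left_right(image, seed=flip_rng)
  return image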
 if split == 'eval_train':
-  split = 'train'
+  split = 'train[:50000]'
good catch. should we instead pass in split='train[:{num_eval_train_examples}]' in the caller function?
Makes a lot of sense, already changed it.
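A hedged sketch of what that change could look like on the caller side; the function name, dataset name, and signature below are assumptions for illustration:

import tensorflow_datasets as tfds

def _build_dataset(split, num_eval_train_examples, data_dir):
  # Map the workload's logical 'eval_train' split onto a TFDS slicing string,
  # so the evaluation-on-train split is a fixed-size prefix of 'train'.
  if split == 'eval_train':
    split = f'train[:{num_eval_train_examples}]'
  return tfds.load('cifar10', split=split, data_dir=data_dir)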
      yield {'inputs': images, 'targets': labels}
    except StopIteration:
      iterator = iter(iterable)

PYTORCH_DDP = 'LOCAL_RANK' in os.environ
nit: should we rename this to USE_PYTORCH_DDP?
I don't really have any opinion on this, maybe @fsschneider does?
USE_PYTORCH_DDP sounds good to me.
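For context, a minimal sketch of how such a LOCAL_RANK-based flag is typically used to set up DDP; the helper name and backend choice below are assumptions for illustration, not the code in this PR:

import os

import torch
import torch.distributed as dist

# torchrun / torch.distributed.launch export LOCAL_RANK for every worker
# process, so its presence is a reasonable signal that we run under DDP.
USE_PYTORCH_DDP = 'LOCAL_RANK' in os.environ

def maybe_setup_ddp():
  if not USE_PYTORCH_DDP:
    return 0  # single-process run
  local_rank = int(os.environ['LOCAL_RANK'])
  torch.cuda.set_device(local_rank)
  dist.init_process_group(backend='nccl')
  return local_rank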
if isinstance(batch, dict):
  inputs = batch['inputs']
  targets = batch['targets']
else:
  inputs, targets = batch
given that we got rid of the DictMNIST class, can we just always assume these are tuples (so we can delete the if/else)?
The custom cycle function returns a dict, but it is only used for training here; for testing, itertools.cycle is used, which caches the whole dataset in memory. That shouldn't happen during training, because even if the size of the dataset is no issue, it will not re-shuffle. But I don't think it is less efficient to just use my custom cycle function for the test set as well, so we can get rid of the if-statement (and it is consistent with e.g. the ImageNet workload).
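For reference, this is roughly how the custom cycle function reads, reconstructed from the lines quoted above; exact names in the repository may differ:

def cycle(iterable):
  # Unlike itertools.cycle, this does not cache the whole dataset in memory;
  # when the iterator is exhausted it is simply re-created, so a shuffled
  # input pipeline gets re-shuffled on every pass over the data.
  iterator = iter(iterable)
  while True:
    try:
      images, labels = next(iterator)
      yield {'inputs': images, 'targets': labels}
    except StopIteration:
      iterator = iter(iterable)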
@@ -142,7 +142,7 @@ profile=google
 # pylint configuration
 [pylint.MASTER]
 persistent=no # Pickle collected data for later comparisons.
-cache-size=500 # Set the cache size for astng objects.
+#cache-size=500 # Set the cache size for astng objects.
Did Pylint update and some options become deprecated, or what is happening here?
In general, we could switch to the pylintrc from the Google Style Guide and replace our options; what do you think?
Yes, these options are not supported anymore in pylint>=2.14.0. Sounds reasonable to switch to the pylintrc from the Google Style Guide.
if FLAGS.framework == 'pytorch':
  # From the docs: "(...) causes cuDNN to benchmark multiple convolution
  # algorithms and select the fastest."
  torch.backends.cudnn.benchmark = True
I think the issue with benchmark = True is that it is less deterministic. Would be interesting to see whether it makes any difference in terms of speed.
Similarly, we might want to set torch.use_deterministic_algorithms, see https://pytorch.org/docs/stable/notes/randomness.html.
Perhaps @mikerabbat can guide us here?
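A short sketch of the two settings being traded off here; both flags are standard PyTorch APIs, and whether to enable them for the benchmark is exactly the open question above:

import torch

# Fast but less reproducible: cuDNN benchmarks several convolution algorithms
# per input shape and picks the fastest one.
torch.backends.cudnn.benchmark = True

# Fully reproducible alternative (typically slower), per
# https://pytorch.org/docs/stable/notes/randomness.html:
# torch.backends.cudnn.benchmark = False
# torch.use_deterministic_algorithms(True)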
Changes implemented in this PR:
- data_utils.py file which collects helper functions related to the input pipelines

There will be a follow-up PR which covers: