[RFC] Default to infinite epochs, not 1000 #10343
If we were to change this, we would have the opposite problem: the user forgets to set a limit and training runs indefinitely. We have to set a default, so this is why a fixed number of epochs was chosen. The specific number of 1000 is arbitrary and has been kept for backward compatibility. To improve this, one thing we could do is to print a message if no value is passed.
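Something along these lines could surface that default (a rough sketch only; the helper name, wording, and placement are hypothetical, not the Trainer's actual internals):

```python
import warnings


def _resolve_max_epochs(max_epochs, max_steps):
    # Hypothetical sketch: if neither bound is given, fall back to the
    # historical default, but tell the user instead of silently stopping
    # the run at epoch 1000.
    if max_epochs is None and max_steps == -1:
        warnings.warn(
            "`max_epochs` was not set; defaulting to `max_epochs=1000`. "
            "Set `max_epochs=-1` to remove the epoch limit.",
            UserWarning,
        )
        max_epochs = 1000
    return max_epochs
```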
If the user is worried about their energy/AWS bill, they should explicitly set a termination condition appropriate to their workload. I would guess the majority of users aren't somehow forgetting about their training jobs and racking up huge bills. On AWS, the instance isn't gonna shut down when the job ends, so this doesn't even really help (unless they're running it in a container on EKS or something - which is advanced enough that I would expect they would consider setting a proper termination point). And on a personal computer, it's rather hard to forget that you have a training job running. But if we must keep it, yes, printing a message on launch would be much better. And ideally also printing a message when the training ends, saying "run stopped because max_epochs (set to 1000) was reached". As is, I have no idea why my training ended abruptly.
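For context, explicit termination conditions can be combined on the Trainer; a sketch assuming a Lightning version that supports `max_steps`, `max_time`, and `-1` as "no limit":

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# Stop at whichever budget is exhausted first.
trainer = Trainer(
    max_epochs=50,
    max_steps=200_000,
    max_time="00:12:00:00",  # DD:HH:MM:SS of wall-clock time
)

# Or train without an epoch limit and let a monitored metric decide when to stop.
trainer = Trainer(
    max_epochs=-1,
    callbacks=[EarlyStopping(monitor="val_loss", patience=10)],
)
```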
I don't have any strong preferences or arguments for or against the change. Asking the user to set a limit themselves vs. asking them to set -1 explicitly seems like roughly the same amount of effort to me. I definitely agree that we should improve our reporting for when the training loop stops. There can be a few reasons actually. This was definitely discussed before and I think we are all in favor of that (not sure where it was discussed, hard to find). @puhuk This is in a discussion phase, so we should wait and give more people the chance to leave their comments here.
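To illustrate the kind of reporting meant here, a rough sketch (the helper itself is hypothetical; the Trainer attributes it reads exist but their semantics may differ across versions):

```python
def describe_stop_reason(trainer) -> str:
    # Hypothetical helper: summarize why fit() returned, so the user is not
    # left guessing whether a limit, a callback, or the data ended the run.
    if trainer.max_epochs is not None and trainer.current_epoch >= trainer.max_epochs:
        return f"max_epochs ({trainer.max_epochs}) was reached"
    if trainer.max_steps not in (None, -1) and trainer.global_step >= trainer.max_steps:
        return f"max_steps ({trainer.max_steps}) was reached"
    if trainer.should_stop:
        return "a callback requested early stopping"
    return "the dataloader was exhausted"
```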
@awaelchli, got it!
cc @ananthsub, as I remember you commented once about the default epoch number.
I don't think this is quite right - if someone wants to limit the duration of training, the default of 1000 epochs is almost never going to be right/helpful, so they're going to have to set a different limit. So the current state of affairs means that everyone has to set a value for max_epochs. Whereas if the default were -1, then people that are ok with stopping the experiment manually won't have to set any value for this parameter. So I feel that a default of -1 is less effort overall.
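Concretely, the comparison being made (a sketch; assumes `-1` is accepted to mean "no limit", which is what the proposal would make the default):

```python
from pytorch_lightning import Trainer

# Today: anyone who dislikes the silent 1000-epoch cutoff must pick some number.
trainer = Trainer(max_epochs=10_000)

# Proposed: users who stop runs manually would not need to pass anything at all;
# today they still have to spell out the "no limit" value explicitly.
trainer = Trainer(max_epochs=-1)
```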
This issue highlights some great points for improvement:
- We should address this in the trainer documentation on the site.
- What's the right means of capturing hyperparameters that have custom post-processing inside of the trainer?
My suggestion for the default was to make it 1 epoch, so that if execution terminates unexpectedly because of the default, it happens much sooner than after waiting for 1000 epochs. I personally feel that something that stops by default is safer than infinite training. Otherwise users need to figure out how to kill the jobs, which can be painful, especially if they haven't configured early stopping and/or are doing multi-process jobs.
I agree with this.
Agreed. One could also consider 1 step instead of 1 epoch, since 1 epoch is not guaranteed to be "small" or even finite in size (it all depends on the dataloader).
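The step-based variant of that default might look like this (sketch; `max_steps` is the existing step-based counterpart of `max_epochs`):

```python
from pytorch_lightning import Trainer

# A default of a single optimizer step would terminate quickly even when one
# "epoch" is huge or unbounded (e.g. an IterableDataset with no __len__).
trainer = Trainer(max_steps=1)
```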
Hey, I wish to contribute to this. Any steps on how I can get started? |
Hi @Rajathbharadwaj, you can work on 2 separate PRs. These would be:
- add a message/warning when max_epochs is not passed and the default of 1000 is used
- improve the reporting of why the training loop stopped (e.g. max_epochs reached)
@Rajathbharadwaj I have assigned the issue to you, go for it! Feel free to ask questions.
Awesome, thanks! @kaushikb11 @carmocca Also, should this go to
Just an info msg is enough, and it should be a UserWarning. You can update the docs for it here as a note:
@rohitgr7 @kaushikb11 @carmocca could you please let me know if any changes are required? I'm guessing maybe I should remove the print statement? 😄
#10444 would've addressed any ambiguity around defaults for fitting. It could've been an error to not call
@ananthsub I'm sorry, but I'm not sure I understand what you mean.
Currently max_epochs defaults to 1000:
As a user, though, I would expect that if I don't specify an ending point, the training would continue indefinitely. In my own experiments, when the training cut off at 999 epochs, I was confused, and googling the issue didn't readily turn up this line in the documentation. When I checked my logs of all the hyperparameters, max_epochs was set to None (I guess this override is applied internally). So as a user I feel like this is bad UX - I can't see a reason to put an arbitrary cutoff versus defaulting to infinite training.
It's especially frustrating when you've invested significant time into a training run, only to have it prematurely cut off due to this unexpected max_epochs limit.
cc @Borda