[RFC] Default to infinite epochs, not 1000 #10343
If we were to change this, we would have the opposite problem: the user forgets to set a limit and training runs indefinitely. We have to set a default, so this is why a fixed number of epochs was chosen. The specific number of 1000 is arbitrary and has been kept for backward compatibility. To improve this, one thing we could do is to print a message if no value is passed.
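Something along these lines could surface that default (a rough sketch only; the helper name, wording, and placement are hypothetical, not the Trainer's actual internals):

```python
import warnings


def _resolve_max_epochs(max_epochs, max_steps):
    # Hypothetical sketch: if neither bound is given, fall back to the
    # historical default, but tell the user instead of silently stopping
    # the run at epoch 1000.
    if max_epochs is None and max_steps == -1:
        warnings.warn(
            "`max_epochs` was not set; defaulting to `max_epochs=1000`. "
            "Set `max_epochs=-1` to remove the epoch limit.",
            UserWarning,
        )
        max_epochs = 1000
    return max_epochs
```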
If the user is worried about their energy/AWS bill, they should explicitly set a termination condition appropriate to their workload. I would guess the majority of users aren't somehow forgetting about their training jobs and racking up huge bills. On AWS, the instance isn't gonna shut down when the job ends, so this doesn't even really help (unless they're running it in a container on EKS or something - which is advanced enough that I would expect they would consider setting a proper termination point). And on a personal computer, it's rather hard to forget that you have a training job running. But if we must keep it, yes, printing a message on launch would be much better. And ideally also printing a message when the training ends, saying "run stopped because max_epochs (set to 1000) was reached". As is, I have no idea why my training ended abruptly.
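For context, explicit termination conditions can be combined on the Trainer; a sketch assuming a Lightning version that supports `max_steps`, `max_time`, and `-1` as "no limit":

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# Stop at whichever budget is exhausted first.
trainer = Trainer(
    max_epochs=50,
    max_steps=200_000,
    max_time="00:12:00:00",  # DD:HH:MM:SS of wall-clock time
)

# Or train without an epoch limit and let a monitored metric decide when to stop.
trainer = Trainer(
    max_epochs=-1,
    callbacks=[EarlyStopping(monitor="val_loss", patience=10)],
)
```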
I don't have any strong preferences or arguments for or against the change. Asking the user to set a limit themselves vs. asking them to set -1 explicitly seems like roughly the same amount of effort to me. I definitely agree that we should improve our reporting for when the training loop stops. There can be a few reasons actually. This was definitely discussed before and I think we are all in favor of that (not sure where it was discussed, hard to find). @puhuk This is in a discussion phase, so we should wait and give more people the chance to leave their comments here.
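To illustrate the kind of reporting meant here, a rough sketch (the helper itself is hypothetical; the Trainer attributes it reads exist but their semantics may differ across versions):

```python
def describe_stop_reason(trainer) -> str:
    # Hypothetical helper: summarize why fit() returned, so the user is not
    # left guessing whether a limit, a callback, or the data ended the run.
    if trainer.max_epochs is not None and trainer.current_epoch >= trainer.max_epochs:
        return f"max_epochs ({trainer.max_epochs}) was reached"
    if trainer.max_steps not in (None, -1) and trainer.global_step >= trainer.max_steps:
        return f"max_steps ({trainer.max_steps}) was reached"
    if trainer.should_stop:
        return "a callback requested early stopping"
    return "the dataloader was exhausted"
```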
@awaelchli, got it!
cc @ananthsub, as I remember you commented once about the default epoch number.
I don't think this is quite right - if someone wants to limit the duration of training, the default of 1000 epochs is almost never going to be right/helpful, so they're going to have to set a different limit. So the current state of affairs means that everyone has to set a value for max_epochs. Whereas if the default were -1, then people that are ok with stopping the experiment manually won't have to set any value for this parameter. So I feel that a default of -1 is less effort overall.
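Concretely, the comparison being made (a sketch; assumes `-1` is accepted to mean "no limit", which is what the proposal would make the default):

```python
from pytorch_lightning import Trainer

# Today: anyone who dislikes the silent 1000-epoch cutoff must pick some number.
trainer = Trainer(max_epochs=10_000)

# Proposed: users who stop runs manually would not need to pass anything at all;
# today they still have to spell out the "no limit" value explicitly.
trainer = Trainer(max_epochs=-1)
```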
This issue highlights some great points for improvement:
- We should address this in the trainer documentation on the site.
- What's the right means of capturing hyperparameters that have custom post-processing inside of the trainer?
My suggestion for the default was to make it 1 epoch, so that if execution terminates unexpectedly because of the default, it happens much sooner than after waiting for 1000 epochs. I personally feel that something that stops by default is safer than infinite training. Otherwise users need to figure out how to kill the jobs, which can be painful, especially if they haven't configured early stopping and/or are doing multi-process jobs.
I agree with this.
Agreed. One could also consider 1 step instead of 1 epoch, since 1 epoch is not guaranteed to be "small" or even finite in size (it all depends on the dataloader).
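The step-based variant of that default might look like this (sketch; `max_steps` is the existing step-based counterpart of `max_epochs`):

```python
from pytorch_lightning import Trainer

# A default of a single optimizer step would terminate quickly even when one
# "epoch" is huge or unbounded (e.g. an IterableDataset with no __len__).
trainer = Trainer(max_steps=1)
```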
Hey, I wish to contribute to this. Any steps on how I can get started? |
Hi @Rajathbharadwaj, you can work on 2 separate PRs. These would be:
- add a message/warning when max_epochs is not passed and the default of 1000 is used
- improve the reporting of why the training loop stopped (e.g. max_epochs reached)
@Rajathbharadwaj I have assigned the issue to you, go for it! Feel free to ask questions.
Awesome, thanks! @kaushikb11 @carmocca Also, should this go to
Just an info msg is enough, and it should be a UserWarning. You can update the docs for it here as a note:
@rohitgr7 @kaushikb11 @carmocca could you please let me know if any changes are required? I'm guessing maybe I should remove the print statement? 😄
#10444 would've addressed any ambiguity around defaults for fitting. It could've been an error to not call
@ananthsub I'm sorry, but I'm not sure I understand what you mean.
Currently max_epochs defaults to 1000:
As a user, though, I would expect that if I don't specify an ending point, the training would continue indefinitely. In my own experiments, when the training cut off at 999 epochs, I was confused, and googling the issue didn't readily turn up this line in the documentation. When I checked my logs of all the hyperparameters, max_epochs was set to None (I guess this override is applied internally). So as a user I feel like this is bad UX - I can't see a reason to put an arbitrary cutoff versus defaulting to infinite training.
It's especially frustrating when you've invested significant time into a training run, only to have it prematurely cut off due to this unexpected max_epochs limit.
cc @Borda