Provide a Default Parameter for Fit's Checkpoint Restore Path #10573

Closed
blisc opened this issue Nov 16, 2021 · 14 comments · Fixed by #11334
Labels
checkpointing (Related to checkpointing), discussion (In a discussion stage), trainer: argument

Comments


blisc commented Nov 16, 2021

Proposed refactor

Do one of two options:

  1. Do not deprecate resume_from_checkpoint from Trainer's initialization (the deprecation was proposed in #9405: [checkpoint] Resolve 2 different checkpoint loading paths across fit vs validate/test/predict)
  2. Provide a public-facing property that acts as a default when ckpt_path is passed as None to trainer.fit. This property should be accessible after trainer initialization.

Motivation

Our current design pattern is to initialize the Trainer object, then inject our own logic into the Trainer class, and finally call trainer.fit(model) without passing any additional arguments to fit(). Moving this argument from the Trainer's init to fit() would require us to rewrite all of our training scripts.

Pitch

See Proposed refactor
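
For illustration, a minimal sketch of option 2 via a subclass; the subclass and the default_fit_ckpt_path name are hypothetical, not part of the current Trainer API:

from pytorch_lightning import Trainer

class TrainerWithDefaultCkpt(Trainer):
    """Hypothetical subclass: a public default for fit's checkpoint restore path."""

    def __init__(self, *args, default_fit_ckpt_path=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Public and settable after initialization, as option 2 asks for.
        self.default_fit_ckpt_path = default_fit_ckpt_path

    def fit(self, model, ckpt_path=None, **kwargs):
        # Only consulted when the caller does not pass ckpt_path explicitly.
        if ckpt_path is None:
            ckpt_path = self.default_fit_ckpt_path
        return super().fit(model, ckpt_path=ckpt_path, **kwargs)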

blisc added the refactor label Nov 16, 2021

ananthsub commented Nov 16, 2021

Hi @blisc, I originally proposed the deprecation in #9405.

These are my opinions:

  1. The Lightning framework should offer a single convention that works across the different trainer functions. Offering resume_from_checkpoint for fit in the trainer constructor vs ckpt_path as arguments to validate / test / predict was a broken experience.

That means either:

  • Make ckpt_path an argument to each of the trainer functions
  • Or add fit_ckpt_path, validate_ckpt_path, test_ckpt_path and predict_ckpt_path in the trainer constructor.
    While the latter makes things more configurable, why add them in the constructor vs the function the arguments are prefixed with? Where do you draw the line?

> then inject our own logic into the Trainer class

Could you describe what this does and why it blocks calling trainer.fit(model, ckpt_path="...")?

ananthsub added the checkpointing label Nov 16, 2021
carmocca added the discussion and trainer: argument labels and removed the refactor label Nov 16, 2021

blisc commented Nov 16, 2021

> Could you describe what this does and why it blocks calling trainer.fit(model, ckpt_path="...")?

Consider an average lightning training script:

model = MyModel(**model_cfg)
trainer = Trainer(**trainer_cfg)
# Additional logic here to add logging and checkpointing common to all scripts
myfunction(trainer)  # I want some ability to set a default checkpoint for fit
trainer.fit(model)  # Note: we define our dataloaders in the model itself

Since the training script is nearly identical across models, we simply copy it for every model. Requiring a change in fit() forces us to change the script for every model, which becomes infeasible as the number of models grows.
I want some ability to change trainer parameters inside myfunction() so we can update that single function instead of every train_{model}.py file.
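
To make the request concrete, here is a sketch of what myfunction might do, assuming the hypothetical default_fit_ckpt_path attribute from the sketch above existed:

import os

from pytorch_lightning.callbacks import ModelCheckpoint

def myfunction(trainer, exp_dir="./experiments"):
    # Shared logic common to every train_{model}.py script; appending to
    # trainer.callbacks after init is the kind of injection described above.
    trainer.callbacks.append(ModelCheckpoint(dirpath=exp_dir))
    # Hypothetical: register a default so the later trainer.fit(model) call
    # resumes from the last checkpoint without the script passing ckpt_path.
    last_ckpt = os.path.join(exp_dir, "last.ckpt")
    if os.path.exists(last_ckpt):
        trainer.default_fit_ckpt_path = last_ckpt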

ananthsub commented

@blisc - it seems like the main issue is around myfunction and the defaults it adds, which it might not make sense for the Lightning Trainer to support.

#9741 is the issue we filed to allow users the ability to register implementations, which the lightning trainer can then look up & add automatically for callbacks/loggers/profiler/plugins. Would that address your concern?

However, in your example, I don't see why you couldn't look up the checkpoint path for fit from the trainer_cfg and pass it to trainer.fit(model, ckpt_path).
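
For example, something along these lines, building on your script above (the "resume_from_checkpoint" key in trainer_cfg is illustrative):

from pytorch_lightning import Trainer

# Pull the resume path out of the config instead of hard-coding it per script.
ckpt_path = trainer_cfg.pop("resume_from_checkpoint", None)
trainer = Trainer(**trainer_cfg)
trainer.fit(model, ckpt_path=ckpt_path)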


blisc commented Nov 16, 2021

> #9741 is the issue we filed to allow users the ability to register implementations, which the lightning trainer can then look up & add automatically for callbacks/loggers/profiler/plugins. Would that address your concern?

I'm not asking for a custom plugin. I'm asking for Lightning to continue supporting what it has supported in the past.

> However, in your example, I don't see why you couldn't look up the checkpoint path for fit from the trainer_cfg and pass it to trainer.fit(model, ckpt_path).

I'm trying to avoid changing anything in train_{model}.py files.


Your main goal was to unify the Lightning experience, and you already have.

All I am asking for is a parameter on the Lightning Trainer that can be used as the default when the relevant argument is not passed to fit, validate, test, or predict. You already have it for validate, test, and predict. Can you just add one for fit as well? Then everything will be unified!

ananthsub commented

> All I am asking for is a parameter on the Lightning Trainer that can be used as the default when the relevant argument is not passed to fit, validate, test, or predict. You already have it for validate, test, and predict. Can you just add one for fit as well? Then everything will be unified!

Could you share a code pointer for what you're looking for, just so we're clear on the request?


ananthsub commented Nov 16, 2021

Discussed with @blisc - the request is to have a self.fit_ckpt_path defined as an attribute on the Trainer, similar to the ones here: https://github.com/PyTorchLightning/pytorch-lightning/blob/247f5aacc2fc96ab1df14d7176838dc65757a786/pytorch_lightning/trainer/trainer.py#L474-L476

Is this supported behavior?

trainer.validated_ckpt_path = "path"
trainer.validate(model, dataloaders)

vs

trainer.validate(model, dataloaders, ckpt_path="path")

This seems more like a quirk of attributes on the Trainer that were inadvertently made public instead of private. @awaelchli @carmocca other checkpointing API experts, do you know? It's certainly not documented anywhere, and offering only the second option feels more natural to support.

SeanNaren commented

> trainer.validated_ckpt_path = "path"
> trainer.validate(model, dataloaders)

Yikes! IMO that's a strange compromise. If we're now OK with advertising two different ways for our users to set the variable, I'd rather we just introduce Trainer(fit_ckpt_path=...).

awaelchli commented

> This seems more like a quirk of attributes on the Trainer that were inadvertently made public instead of private. @awaelchli @carmocca other checkpointing API experts, do you know?

These attributes have basically been there since the beginning. Their original purpose was to provide read-only, programmatic access to the path that was actually loaded, e.g., when "best" was passed as the checkpoint path. They also served as a way to test the Trainer (no longer done this way).

> trainer.validated_ckpt_path = "path"
> trainer.validate(model, dataloaders)

This is not supposed to work; it is unintended behavior and one should not rely on it, not even today.


blisc commented Nov 17, 2021

OK, it seems like you are leaning towards removing those attributes. In that case, may I suggest the following compromise:

  • Have a single argument in Trainer's initialization.
    • Whether this is still resume_from_checkpoint or a new argument is up to you. For the purpose of this discussion, I'll just call it default_ckpt_path.
    • It will support all possible values of ckpt_path as passed to fit, validate, etc. That is, it will accept either a path to a checkpoint or "best".
    • This will just set a trainer property: trainer.default_ckpt_path = default_ckpt_path.
  • On calls to fit, validate, etc.:
    • Run something like if ckpt_path is None: ckpt_path = trainer.default_ckpt_path (sketched below).

There will only be two ways to set ckpt_path:

  1. Passing it as ckpt_path to the relevant function
  2. Passing default_ckpt_path to the constructor and falling back to that property

This does not block any further refactoring/unification that @ananthsub still has left to do.
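
A minimal sketch of that fallback rule, assuming the hypothetical default_ckpt_path attribute described above:

def _resolve_ckpt_path(trainer, ckpt_path):
    # An explicit ckpt_path always wins; otherwise fall back to the
    # constructor-supplied default, which may be a filesystem path or "best".
    return ckpt_path if ckpt_path is not None else trainer.default_ckpt_path

fit, validate, test, and predict would each apply this rule before loading a checkpoint.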


tchaton commented Nov 22, 2021

Dear @blisc,

> rewrite all of our training scripts.

Would you mind sharing how much of an effort that would be?

Additionally, there is a proposed vision for the Lightning Trainer API to become more modular; see Ananth's response about it in #10444 (comment).


stale bot commented Dec 27, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label Dec 27, 2021

blisc commented Jan 4, 2022

Happy new year, everyone!
Looking at the discussion in #10444, it seems like there was a decision to leave the functionality inside the Lightning Trainer constructor. And from our discussion here, I am hearing that people are mostly OK with adding a fit_ckpt_path to the constructor that does what resume_from_checkpoint used to do.

Can we add fit_ckpt_path to the Lightning Trainer constructor?

stale bot removed the won't fix label Jan 4, 2022

carmocca commented Jan 5, 2022

Happy new year to you too!

The best way forward at the moment is to postpone the removal of resume_from_checkpoint until the 2.0 release (which has no set date). This gives users extra time to update their code and gives us time to decide the best option for the long term.

Similar to the decision in #10410 (comment).

awaelchli commented

> The best way forward at the moment is to postpone the removal of resume_from_checkpoint until the 2.0 release

That would be great.
