Fix interaction with `save_last` and `every_n_epochs` #12391

carmocca · 2022-03-21T12:07:45Z

@carmocca For the following snippet, a `last.ckpt` was generated before this PR, but no longer is after it:

import uuid

import torch
from pytorch_lightning import Trainer, LightningModule
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    tmpdir = f"/tmp/{uuid.uuid4()}"
    print(tmpdir)

    trainer = Trainer(
        default_root_dir=tmpdir,
        max_epochs=1,
        callbacks=[ModelCheckpoint(dirpath=tmpdir, every_n_epochs=10, save_last=True)],
        enable_checkpointing=True,
    )
    model = BoringModel()
    trainer.fit(
        model, train_dataloaders=DataLoader(RandomDataset(32, 64), batch_size=2)
    )

The following contract was respected prior to this PR, but not anymore after:

save_last: When True, always saves the model at the end of the epoch to a file last.ckpt

This is a backward-compatibility-breaking change to behavior that some users rely on, regardless of whether the previous behavior is considered a "bug".
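
The contract quoted above can be checked mechanically once `trainer.fit` returns. Below is a minimal sketch, assuming the checkpoint directory is the `dirpath` passed to `ModelCheckpoint` and that the file is named `last.ckpt` as the quoted docs state. `has_last_checkpoint` is a hypothetical helper written for illustration and is not part of Lightning.

```python
from pathlib import Path


def has_last_checkpoint(dirpath):
    """Return True if a regular file named last.ckpt exists in dirpath.

    Hypothetical helper for illustration only; it inspects the filesystem
    and makes no calls into pytorch_lightning.
    """
    return (Path(dirpath) / "last.ckpt").is_file()
```

Appending `assert has_last_checkpoint(tmpdir)` after the `trainer.fit(...)` call in the snippet would turn the reproduction into a simple regression check for the `save_last=True` behavior.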

Originally posted by @yifuwang in #11805 (comment)

cc @tchaton @rohitgr7 @akihironitta @carmocca @awaelchli @ninginthecloud @jjenniferdai

carmocca self-assigned this Mar 21, 2022

carmocca added this to the 1.6.x milestone Mar 21, 2022

carmocca added the `callback: model checkpoint`, `bug`, and `deprecation` labels and removed the `deprecation` label Mar 21, 2022

carmocca added this to Frameworks Planning Mar 21, 2022

carmocca moved this to Todo in Frameworks Planning Mar 21, 2022

carmocca added the `priority: 0` label Mar 21, 2022

carmocca modified the milestones: 1.6.x, 1.6 Mar 22, 2022

carmocca mentioned this issue Mar 23, 2022

ModelCheckpoint's save_last now ignores every_n_epochs #12418

Merged

12 tasks

carmocca closed this as completed in #12418 Mar 24, 2022

Repository owner moved this from Todo to Done in Frameworks Planning Mar 24, 2022

carmocca mentioned this issue Mar 30, 2022

[RFC] Create a ModelCheckpointBase callback #6504

Closed
