
Commit bcd7c87

ananthsub authored and lexierule committed
Use fsspec in checkpoint connector for fault-tolerant training (#11776)
1 parent 6631bb8 commit bcd7c87
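As context for the change (a hedged sketch, not part of the commit itself): `os.path` helpers treat a remote URL as an ordinary local string, so a directory check on a cloud path always fails, while fsspec dispatches on the path's protocol prefix first. The bucket paths below are hypothetical, for illustration only.

```python
import os.path

import fsspec.utils

# Hypothetical paths for illustration only.
local_path = "/tmp/checkpoints"
remote_path = "s3://my-bucket/checkpoints"

# os.path sees the remote URL as a plain string: it can never be a local
# directory, so a checkpoint lookup based on it silently finds nothing.
print(os.path.isdir(remote_path))              # False

# fsspec instead resolves the protocol and picks a matching filesystem.
print(fsspec.utils.get_protocol(remote_path))  # "s3"
print(fsspec.utils.get_protocol(local_path))   # "file"
```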

File tree

3 files changed: +11 additions, -7 deletions

CHANGELOG.md

Lines changed: 2 additions & 1 deletion

@@ -15,6 +15,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Fixed `restore_optimizers` for mapping states ([#11757](https://github.com/PyTorchLightning/pytorch-lightning/pull/11757))
 - With `DPStrategy`, the batch is not explicitly moved to the device ([#11780](https://github.com/PyTorchLightning/pytorch-lightning/pull/11780))
 - Fixed an issue to avoid the validation progress bar disappearing after `trainer.validate()` ([#11700](https://github.com/PyTorchLightning/pytorch-lightning/pull/11700))
+- Fixed supporting remote filesystems with `Trainer.weights_save_path` for fault-tolerant training ([#11776](https://github.com/PyTorchLightning/pytorch-lightning/pull/11776))
+- Fixed check for available modules ([#11526](https://github.com/PyTorchLightning/pytorch-lightning/pull/11526))


 ## [1.5.9] - 2022-01-18
@@ -25,7 +27,6 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Skip testing with PyTorch 1.7 and Python 3.9 on Ubuntu ([#11217](https://github.com/PyTorchLightning/pytorch-lightning/pull/11217))
 - Fixed type promotion when tensors of higher category than float are logged ([#11401](https://github.com/PyTorchLightning/pytorch-lightning/pull/11401))
 - Fixed the format of the configuration saved automatically by the CLI's `SaveConfigCallback` ([#11532](https://github.com/PyTorchLightning/pytorch-lightning/pull/11532))
-- Fixed check for available modules ([#11526](https://github.com/PyTorchLightning/pytorch-lightning/pull/11526))

 ### Changed

_notebooks

pytorch_lightning/trainer/connectors/checkpoint_connector.py

Lines changed: 8 additions & 5 deletions

@@ -50,15 +50,18 @@ def __init__(self, trainer: "pl.Trainer", resume_from_checkpoint: Optional[_PATH

     @property
     def hpc_resume_path(self) -> Optional[str]:
-        if not os.path.isdir(self.trainer.weights_save_path):
+        weights_save_path = self.trainer.weights_save_path
+        fs = get_filesystem(weights_save_path)
+        if not fs.isdir(weights_save_path):
             return None
-        dir_path_hpc = str(self.trainer.weights_save_path)
+        dir_path_hpc = str(weights_save_path)
         max_version = self.max_ckpt_version_in_folder(dir_path_hpc, "hpc_ckpt_")
         if max_version is not None:
             return os.path.join(dir_path_hpc, f"hpc_ckpt_{max_version}.ckpt")
-        auto_save_checkpoint = os.path.join(dir_path_hpc, ".pl_auto_save.ckpt")
-        if os.path.exists(auto_save_checkpoint):
-            return auto_save_checkpoint
+
+        auto_saved_path = os.path.join(str(self.trainer.weights_save_path), ".pl_auto_save.ckpt")
+        fs = get_filesystem(auto_saved_path)
+        return auto_saved_path if fs.exists(auto_saved_path) else None

     def resume_start(self, checkpoint_path: Optional[_PATH] = None) -> None:
         """Attempts to pre-load the checkpoint file to memory, with the source path determined in this priority:
