clean up assignment of run_ids #272
base: develop
Conversation
Force-pushed from b940e08 to c27b39a
A few comments. Does it work with submit-slurm.py?
I have to say, I am still confused by the logic of restarting a run with the same run_id. What is the expectation: restart from zero or continue? I think the logging code will need to be fixed there.
This is only concerned with assigning run_ids, not whether to start from zero or continue. Where to restart is handled via the
I hope that clears up the logic; contact me if you have any questions/suggestions/... 😅
Thanks for the clarifications. I am skeptical we want to support option 2 since we would need conflict resolution then, which will be very, very complex in our (locally and globally) distributed environment: everyone will name their runs run1, exp1, era5, ... and experiment tracking will rely on a unique identifier. There's a separate description field. wandb also separates the unique, auto-generated id from a not necessarily unique name and description.
I just tried this new logic with the slurm submit script and it works. Waiting for the requested changes below
2. (assign run_id): run train, train_continue or evaluate with --run_id <RUNID> flag => assign a run_id manually to this run

The main use case should be injecting run_ids generated e.g. by a wrapper script, but let's discuss this.
Could you make this function pure and return a new Config object? Performance is definitely not an issue here, and purity should always be preferred; it makes composing behavior easier to understand. You can start the code with
config = config.copy() ... return config
good point
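A minimal sketch of the suggested pattern, assuming Config provides a copy() method as the snippet above implies (the body is illustrative, not the PR's actual implementation):

```python
from __future__ import annotations  # keeps the Config annotation unresolved in this sketch


def set_run_id(config: Config, run_id: str | None, reuse_run_id: bool) -> Config:
    # Copy first so the caller's Config is never mutated (pure function).
    config = config.copy()
    if run_id is not None:
        config.run_id = run_id
    # ... the generated / reused run_id cases would be handled here ...
    return config
```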
Force-pushed from c27b39a to 04bd932
@clessig indeed option 2 is for the slurm wrapper. Without setting the run id at the slurm level, it is hard to track the files being generated and/or the source code being copied. That script already assumes a unique run id and uses the same logic to generate a new run_id. What I would suggest: make it clear in the CLI documentation that the run id should not be set manually.
    # use OmegaConf.unsafe_merge if too slow
    return OmegaConf.merge(base_config, private_config, *overwrite_configs)
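For reference, OmegaConf.merge applies its arguments left to right, so later configs override earlier ones; a small self-contained illustration with made-up values:

```python
from omegaconf import OmegaConf

base_config = OmegaConf.create({"run_id": None, "lr": 1e-4})
overwrite = OmegaConf.create({"lr": 3e-4})
merged = OmegaConf.merge(base_config, overwrite)
assert merged.lr == 3e-4  # the later (overwrite) config wins
```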
def set_run_id(config: Config, run_id: str | None, reuse_run_id: bool):
Small nit: add a -> Config return type.
_logger.info(f"using generated run_id: {config.run_id}") | ||
else: | ||
config.run_id = run_id | ||
_logger.info(f"using assigned run_id: {config.run_id}") |
Based on the conversation with @clessig, maybe make it clearer: f"using assigned run_id: {config.run_id}. If you manually selected this run_id, this is an error."
Yes, you are completely right; this is why this was always there. And agreed, let's make it clear in the documentation that this is not to be used manually.
Determining the run id should follow the following logic:

1. (default case): run train, train_continue or evaluate without any flags => generate a new run_id for this run.
2. (assign run_id): run train, train_continue or evaluate with --run_id <RUNID> flag => assign a run_id manually to this run
Add "For train this should not be used manually.
When running experiments at the weekend, I realized that train_continue is missing an option --new_run_id : bool. We can manually assign a new run_id, but by default it should be auto-generated. Does this PR add this option? Do we have it for evaluate? @grassesi: could you please add it before merging if not already there. Thanks!
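For illustration, the decision logic quoted above, together with the --reuse_run_id case added in this PR, could be summarized roughly like this; the function and generator names are placeholders, not the PR's actual code:

```python
import uuid


def generate_run_id() -> str:
    # Placeholder generator; the project uses its own id scheme.
    return uuid.uuid4().hex[:8]


def resolve_run_id(run_id: str | None, from_run_id: str | None, reuse_run_id: bool) -> str:
    if reuse_run_id and from_run_id is not None:
        # --reuse_run_id: keep the id given via --from_run_id for the current run as well
        return from_run_id
    if run_id is not None:
        # case 2 (assign run_id): explicitly provided, e.g. injected by a wrapper script
        return run_id
    # case 1 (default): auto-generate a fresh, unique id
    return generate_run_id()
```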
@@ -106,7 +106,7 @@ def load_config(

     Args:
         private_home: Configuration file containing platform dependent information and secretes
-        run_id: Run/model id of pretrained WeatherGenerator model to continue training or evaluate
+        from_run_id: Run/model id of pretrained WeatherGenerator model to continue training or evaluate
This is a run id and not a model id. It's a bit of a convention, but we do not use just the plain weights; we load all the config parameters and potentially only overwrite selected ones. Thus, run id seems more appropriate (and this is also what the command line arg is called).
parser.add_argument(
    "--reuse_run_id",
    action="store_true",
    help="Use the id given via --from_run_id also for the current run. The storage location for artifacts will be reused as well. This might overwrite artifacts from previous runns.",
Typo runns -> runs.
There should be no overwrite of artifacts. This happens (except for #280 and potentially other bugs) for *_latest.ckpt and *_latest.json. We should avoid this. It breaks reproducibility (I typically continue from the *_latest ckpt). Can we open a PR for this?
Description

This PR simplifies the logic for managing run_ids:
- run_ids are now assigned in utils.config.set_run_id instead of trainer.Trainer.init, simplifying the call signatures for Trainer.run, Trainer.evaluate and Trainer.init.
- --reuse_run_id has been introduced to re-enable continuing or evaluating a run without using a new run id.
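As a rough illustration of the CLI flags involved (flag names are taken from this PR; defaults and help texts are abbreviated and partly assumed):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--run_id", default=None,
                    help="Assign a run_id manually; intended for wrapper scripts, not manual use.")
parser.add_argument("--from_run_id", default=None,
                    help="Run id of a pretrained model to continue training or evaluate.")
parser.add_argument("--reuse_run_id", action="store_true",
                    help="Use the id given via --from_run_id also for the current run.")

args = parser.parse_args(["--from_run_id", "abc123", "--reuse_run_id"])
print(args.from_run_id, args.reuse_run_id)  # abc123 True
```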
Type of Change

Issue Number
closes #261 #258
Code Compatibility
Code Performance and Testing
- I have run uv run train and (if necessary) uv run evaluate on at least one GPU node and it works
- $WEATHER_GENERATOR_PRIVATE directory

Dependencies
- I have ensured that the code is still pip-installable after the changes and runs
- I have tested that new dependencies themselves are pip-installable.
- I have not introduced new dependencies in the inference portion of the pipeline
- pytest-mock: pytest plugin that allows easy access to unittest.mock functionality => quickly swapping out behavior to isolate test functions from the broader WeatherGenerator system
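A minimal sketch of how pytest-mock's mocker fixture swaps out behavior in a test; the functions below are local stand-ins, not WeatherGenerator code:

```python
import uuid


def generate_run_id() -> str:
    return uuid.uuid4().hex[:8]


def make_run_dir_name() -> str:
    return f"runs/{generate_run_id()}"


def test_run_dir_uses_generated_id(mocker):
    # Patch the generator in this module so the test is deterministic.
    mocker.patch(f"{__name__}.generate_run_id", return_value="abc123")
    assert make_run_dir_name() == "runs/abc123"
```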
Documentation

Additional Notes