clean up assignment of run_ids #272

Open · wants to merge 10 commits into develop from sgrasse/develop/issue_261
Conversation

@grassesi (Contributor) commented May 27, 2025

Description

This PR simplifies the logic for managing run_ids:

  • Determining the run id is now handled by utils.config.set_run_id instead of trainer.Trainer.init, simplifying the call signatures of Trainer.run, Trainer.evaluate and Trainer.init.
  • The new CLI flag --reuse_run_id re-enables continuing or evaluating a run without assigning a new run id (see the sketch below).
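
For orientation, a minimal sketch of the intended behavior, reconstructed from the diff excerpts and log messages quoted later in this thread; Config, its copy() method, and generate_run_id are assumptions, not the actual implementation:

from __future__ import annotations

import logging

_logger = logging.getLogger(__name__)


def set_run_id(config: Config, run_id: str | None, reuse_run_id: bool) -> Config:
    """Resolve the run_id on a copy of config and return it (pure function)."""
    config = config.copy()  # purity: the caller's Config is left untouched
    if reuse_run_id:
        # reuse the run_id already loaded from the run given via --from_run_id
        _logger.info(f"reusing run_id: {config.run_id}")
    elif run_id is None:
        config.run_id = generate_run_id()  # hypothetical helper for fresh ids
        _logger.info(f"using generated run_id: {config.run_id}")
    else:
        config.run_id = run_id  # injected manually, e.g. by a wrapper script
        _logger.info(f"using assigned run_id: {config.run_id}")
    return config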

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Issue Number

closes #261, #258

Code Compatibility

  • I have performed a self-review of my code

Code Performance and Testing

  • I ran uv run train and (if necessary) uv run evaluate on at least one GPU node and it works
  • If the new feature introduces modifications at the config level, I have made sure to have notified the other software developers through Mattermost and updated the paths in the $WEATHER_GENERATOR_PRIVATE directory

Dependencies

  • I have ensured that the code is still pip-installable after the changes and runs

  • I have tested that new dependencies themselves are pip-installable.

  • I have not introduced new dependencies in the inference portion of the pipeline

  • pytest-mock: pytest plugin that allows easy access to unittest.mock functionality => quickly swapping out behavior to isolate test functions from the broader WeatherGenerator system
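
For illustration only, the kind of isolation this enables; mocker.patch is pytest-mock's standard fixture API, while the patched path, the set_run_id import and the make_test_config() helper are assumptions based on this PR:

def test_set_run_id_generates_fresh_id(mocker):
    # Swap out the (assumed) id generator so the test is deterministic and
    # isolated from the rest of the WeatherGenerator system.
    mocker.patch("utils.config.generate_run_id", return_value="fixed123")
    config = set_run_id(make_test_config(), run_id=None, reuse_run_id=False)
    assert config.run_id == "fixed123"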

Documentation

  • My code follows the style guidelines of this project
  • I have updated the documentation and docstrings to reflect the changes
  • I have added comments to my code, particularly in hard-to-understand areas

Additional Notes

@grassesi grassesi marked this pull request as ready for review May 27, 2025 13:15
@grassesi grassesi force-pushed the sgrasse/develop/issue_261 branch from b940e08 to c27b39a Compare May 27, 2025 13:18
@grassesi grassesi mentioned this pull request May 27, 2025
@tjhunter (Collaborator) left a comment:
A few comments. Does it work with submit-slurm.py?

@tjhunter (Collaborator) commented:

I have to say, I am still confused by the logic of restarting a run with the same run_id. What is the expectation? restart from zero or continue? I think the logging code will need to be fixed there.

@grassesi (Contributor, Author) commented:

I have to say, I am still confused by the logic of restarting a run with the same run_id. What is the expectation? restart from zero or continue? I think the logging code will need to be fixed there.

This is only concerned with assigning run_ids, not whether to start from zero or continue. Where to restart is handled via the --epoch flag, as far as I understand it. For clarification, these are the different scenarios:

  1. (default case): run train, train_continue or evaluate without any flags => generate a new run_id for this run.
  2. (assign run_id): run train, train_continue or evaluate with the --run_id <RUNID> flag => assign a run_id manually to this run.
  3. (reuse run_id -> only for train_continue and evaluate): reuse the run_id from the run specified by --from_run_id <RUNID>. Since the correct run_id is already loaded in the config, nothing has to be assigned. This case applies when --reuse_run_id is specified.
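
In terms of the set_run_id helper sketched earlier, the three scenarios map onto the following calls (the id value is hypothetical):

config = set_run_id(config, run_id=None, reuse_run_id=False)      # 1. generate a new id
config = set_run_id(config, run_id="abc123", reuse_run_id=False)  # 2. manual assignment
config = set_run_id(config, run_id=None, reuse_run_id=True)       # 3. reuse id from --from_run_id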

I hope that clears up the logic; contact me if you have any questions/suggestions/... 😅

@clessig (Collaborator) commented May 28, 2025

  2. (assign run_id): run train, train_continue or evaluate with the --run_id <RUNID> flag => assign a run_id manually to this run.

Thanks for the clarifications. I am skeptical we want to support option 2, since we would need conflict resolution then, which will be very, very complex in our (locally and globally) distributed environment: everyone will name their runs run1, exp1, era5, ... and experiment tracking will rely on a unique identifier. There's a separate description field. wandb also separates the unique, auto-generated id from a not necessarily unique name and description.

@tjhunter (Collaborator) left a comment:

I just tried this new logic with the slurm submit script and it works. Waiting for the requested changes below.

@grassesi (Contributor, Author) left a comment:

  2. (assign run_id): run train, train_continue or evaluate with the --run_id <RUNID> flag => assign a run_id manually to this run.

Thanks for the clarifications. I am skeptical we want to support option 2, since we would need conflict resolution then, which will be very, very complex in our (locally and globally) distributed environment: everyone will name their runs run1, exp1, era5, ... and experiment tracking will rely on a unique identifier. There's a separate description field. wandb also separates the unique, auto-generated id from a not necessarily unique name and description.

The main use case should be injecting run_ids generated e.g. by a wrapper script, but let's discuss this.

@grassesi (Contributor, Author) left a comment:
could you make this function pure and return a new Config object? Performance is definitely not an issue here, and purity should always be preferred; it makes composing behavior easier to understand. You can start the code with

config = config.copy()
...
return config

good point
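
For illustration, purity here means the caller's object is never mutated (hypothetical usage):

resolved = set_run_id(original_config, run_id=None, reuse_run_id=False)
assert resolved is not original_config  # a new Config is returned, the input is untouched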

@grassesi grassesi force-pushed the sgrasse/develop/issue_261 branch from c27b39a to 04bd932 Compare May 28, 2025 09:36
@grassesi grassesi requested a review from tjhunter May 28, 2025 12:48
@tjhunter (Collaborator) commented:

@clessig indeed option 2 is for the slurm wrapper. Without setting the run id at slurm level, it is hard to track the files being generated and/or the source code being copied. That script already assumes a unique run id and uses the same logic to generate a new run_id.

What I would suggest: make clear in the documentation of the CLI that the run id should not be manually set.

@tjhunter (Collaborator) left a comment:

@grassesi it looks good to me, but let's make sure that @clessig is on board with this proposed change.


    # use OmegaConf.unsafe_merge if too slow
    return OmegaConf.merge(base_config, private_config, *overwrite_configs)


def set_run_id(config: Config, run_id: str | None, reuse_run_id: bool):
Review comment (Collaborator):
small nit: -> Config return type
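
i.e. the annotated signature would read:

def set_run_id(config: Config, run_id: str | None, reuse_run_id: bool) -> Config: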

_logger.info(f"using generated run_id: {config.run_id}")
else:
config.run_id = run_id
_logger.info(f"using assigned run_id: {config.run_id}")
Review comment (Collaborator):
Based on conversation with @clessig, maybe make it more clear: f"using assigned run_id: {config.run_id}. If you manually selected this run_id, this is an error."
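
Applied to the excerpt above, the suggested message would read:

_logger.info(
    f"using assigned run_id: {config.run_id}. "
    "If you manually selected this run_id, this is an error."
)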

@clessig (Collaborator) commented May 28, 2025

@clessig indeed option 2 is for the slurm wrapper. Without setting the run id at slurm level, it is hard to track the files being generated and/or the source code being copied. That script already assumes a unique run id and uses the same logic to generate a new run_id.

What I would suggest: make clear in the documentation of the CLI that the run id should not be manually set.

Yes, you are completely right--this is why this was always there. And agreed, let's make it clear in the documentation that this is not to be used manually.

Determining the run id should follow the following logic:

1. (default case): run train, train_continue or evaluate without any flags => generate a new run_id for this run.
2. (assign run_id): run train, train_continue or evaluate with --run_id <RUNID> flag => assign a run_id manually to this run
Review comment (Collaborator):
Add "For train this should not be used manually.

@clessig (Collaborator) commented Jun 2, 2025

When running experiments at the weekend, I realized that train_continue misses an option --new_run_id : bool. We can manually assign a new run_id, but by default it should be auto-generated. Does this PR add this option? Do we have it for evaluate?

@grassesi: could you please add it before merging if not already there. Thanks!
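
A hedged sketch of what that flag could look like, mirroring the --reuse_run_id definition quoted further down; the flag name comes from the comment above and the help text is illustrative only:

parser.add_argument(
    "--new_run_id",
    action="store_true",
    help="Auto-generate a new run_id when continuing or evaluating a run.",
)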

@@ -106,7 +106,7 @@ def load_config(

     Args:
         private_home: Configuration file containing platform dependent information and secrets
-        run_id: Run/model id of pretrained WeatherGenerator model to continue training or evaluate
+        from_run_id: Run/model id of pretrained WeatherGenerator model to continue training or evaluate
Review comment (Collaborator):
This is a run id and not a model id. It's a bit of a convention, but we do not use the plain weights; we load all the config parameters, and potentially only overwrite selected ones. Thus, run id seems more appropriate (and this is also what the command line arg is called).

parser.add_argument(
    "--reuse_run_id",
    action="store_true",
    help="Use the id given via --from_run_id also for the current run. The storage location for artifacts will be reused as well. This might overwrite artifacts from previous runns.",
Review comment (Collaborator):
Typo runns -> runs.

There should be no overwriting of artifacts. This happens (except for #280 and potentially other bugs) for *_latest.ckpt and *_latest.json. We should avoid this; it breaks reproducibility (I typically continue from the *_latest ckpt). Can we open a PR for this?

Successfully merging this pull request may close these issues:

  • Evaluation not returning new run_id by default