clean up assignment of run_ids #272

Open · wants to merge 10 commits into develop from sgrasse/develop/issue_261
Conversation

@grassesi (Contributor) commented May 27, 2025

Description

This PR simplifies the logic for managing run_ids:

  • Determining the run id is now handled by utils.config.set_run_id instead of trainer.Trainer.init, simplifying the call signatures of Trainer.run, Trainer.evaluate and Trainer.init.
  • The new CLI flag --reuse_run_id re-enables continuing or evaluating a run without assigning a new run id (see the sketch below).
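
For orientation, a minimal sketch of the intended behavior, reconstructed from the diff excerpts and log messages quoted later in this thread; Config, its copy() method, and generate_run_id are assumptions, not the actual implementation:

from __future__ import annotations

import logging

_logger = logging.getLogger(__name__)


def set_run_id(config: Config, run_id: str | None, reuse_run_id: bool) -> Config:
    """Resolve the run_id on a copy of config and return it (pure function)."""
    config = config.copy()  # purity: the caller's Config is left untouched
    if reuse_run_id:
        # reuse the run_id already loaded from the run given via --from_run_id
        _logger.info(f"reusing run_id: {config.run_id}")
    elif run_id is None:
        config.run_id = generate_run_id()  # hypothetical helper for fresh ids
        _logger.info(f"using generated run_id: {config.run_id}")
    else:
        config.run_id = run_id  # injected manually, e.g. by a wrapper script
        _logger.info(f"using assigned run_id: {config.run_id}")
    return config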

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Issue Number

closes #261, #258

Code Compatibility

  • I have performed a self-review of my code

Code Performance and Testing

  • I ran uv run train and (if necessary) uv run evaluate on at least one GPU node and it works
  • If the new feature introduces modifications at the config level, I have made sure to have notified the other software developers through Mattermost and updated the paths in the $WEATHER_GENERATOR_PRIVATE directory

Dependencies

  • I have ensured that the code is still pip-installable after the changes and runs

  • I have tested that new dependencies themselves are pip-installable.

  • I have not introduced new dependencies in the inference portion of the pipeline

  • pytest-mock: pytest plugin that allows easy access to unittest.mock functionality => quickly swapping out behavior to isolate test functions from the broader WeatherGenerator system
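
For illustration only, the kind of isolation this enables; mocker.patch is pytest-mock's standard fixture API, while the patched path, the set_run_id import and the make_test_config() helper are assumptions based on this PR:

def test_set_run_id_generates_fresh_id(mocker):
    # Swap out the (assumed) id generator so the test is deterministic and
    # isolated from the rest of the WeatherGenerator system.
    mocker.patch("utils.config.generate_run_id", return_value="fixed123")
    config = set_run_id(make_test_config(), run_id=None, reuse_run_id=False)
    assert config.run_id == "fixed123"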

Documentation

  • My code follows the style guidelines of this project
  • I have updated the documentation and docstrings to reflect the changes
  • I have added comments to my code, particularly in hard-to-understand areas

Additional Notes

@grassesi grassesi marked this pull request as ready for review May 27, 2025 13:15
@grassesi grassesi force-pushed the sgrasse/develop/issue_261 branch from b940e08 to c27b39a Compare May 27, 2025 13:18
@grassesi grassesi mentioned this pull request May 27, 2025
@tjhunter (Collaborator) left a comment:
A few comments. Does it work with submit-slurm.py?

@tjhunter (Collaborator) commented:

I have to say, I am still confused by the logic of restarting a run with the same run_id. What is the expectation? restart from zero or continue? I think the logging code will need to be fixed there.

@grassesi (Contributor, Author) commented:

I have to say, I am still confused by the logic of restarting a run with the same run_id. What is the expectation? restart from zero or continue? I think the logging code will need to be fixed there.

This is only concerned with assigning run_ids, not whether to start from zero or continue. Where to restart is handled via the --epoch flag, as far as I understand it. For clarification, these are the different scenarios:

  1. (default case): run train, train_continue or evaluate without any flags => generate a new run_id for this run.
  2. (assign run_id): run train, train_continue or evaluate with the --run_id <RUNID> flag => assign a run_id manually to this run.
  3. (reuse run_id -> only for train_continue and evaluate): reuse the run_id from the run specified by --from_run_id <RUNID>. Since the correct run_id is already loaded in the config, nothing has to be assigned. This case applies when --reuse_run_id is specified.
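
In terms of the set_run_id helper sketched earlier, the three scenarios map onto the following calls (the id value is hypothetical):

config = set_run_id(config, run_id=None, reuse_run_id=False)      # 1. generate a new id
config = set_run_id(config, run_id="abc123", reuse_run_id=False)  # 2. manual assignment
config = set_run_id(config, run_id=None, reuse_run_id=True)       # 3. reuse id from --from_run_id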

I hope that clears up the logic; contact me if you have any questions/suggestions/... 😅

@clessig (Collaborator) commented May 28, 2025

  2. (assign run_id): run train, train_continue or evaluate with the --run_id <RUNID> flag => assign a run_id manually to this run.

Thanks for the clarifications. I am skeptical we want to support option 2, since we would need conflict resolution then, which will be very, very complex in our (locally and globally) distributed environment: everyone will name their runs run1, exp1, era5, ... and experiment tracking will rely on a unique identifier. There's a separate description field. wandb also separates the unique, auto-generated id from a not necessarily unique name and description.

@tjhunter (Collaborator) left a comment:

I just tried this new logic with the slurm submit script and it works. Waiting for the requested changes below.

@grassesi (Contributor, Author) left a comment:

  2. (assign run_id): run train, train_continue or evaluate with the --run_id <RUNID> flag => assign a run_id manually to this run.

Thanks for the clarifications. I am skeptical we want to support option 2, since we would need conflict resolution then, which will be very, very complex in our (locally and globally) distributed environment: everyone will name their runs run1, exp1, era5, ... and experiment tracking will rely on a unique identifier. There's a separate description field. wandb also separates the unique, auto-generated id from a not necessarily unique name and description.

The main use case should be injecting run_ids generated e.g. by a wrapper script, but let's discuss this.

@grassesi (Contributor, Author) left a comment:
could you make this function pure and return a new Config object? Performance is definitely not an issue here, and purity should always be preferred; it makes composing behavior easier to understand. You can start the code with

config = config.copy()
...
return config

good point
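
For illustration, purity here means the caller's object is never mutated (hypothetical usage):

resolved = set_run_id(original_config, run_id=None, reuse_run_id=False)
assert resolved is not original_config  # a new Config is returned, the input is untouched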

@grassesi grassesi force-pushed the sgrasse/develop/issue_261 branch from c27b39a to 04bd932 Compare May 28, 2025 09:36
@grassesi grassesi requested a review from tjhunter May 28, 2025 12:48
@tjhunter (Collaborator) commented:

@clessig indeed option 2 is for the slurm wrapper. Without setting the run id at slurm level, it is hard to track the files being generated and/or the source code being copied. That script already assumes a unique run id and uses the same logic to generate a new run_id.

What I would suggest: make clear in the documentation of the CLI that the run id should not be manually set.

@tjhunter (Collaborator) left a comment:

@grassesi it looks good to me, but let's make sure that @clessig is on board with this proposed change.


    # use OmegaConf.unsafe_merge if too slow
    return OmegaConf.merge(base_config, private_config, *overwrite_configs)


def set_run_id(config: Config, run_id: str | None, reuse_run_id: bool):
Review comment (Collaborator):
small nit: -> Config return type
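
i.e. the annotated signature would read:

def set_run_id(config: Config, run_id: str | None, reuse_run_id: bool) -> Config: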

_logger.info(f"using generated run_id: {config.run_id}")
else:
config.run_id = run_id
_logger.info(f"using assigned run_id: {config.run_id}")
Review comment (Collaborator):
Based on conversation with @clessig, maybe make it more clear: f"using assigned run_id: {config.run_id}. If you manually selected this run_id, this is an error."
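
Applied to the excerpt above, the suggested message would read:

_logger.info(
    f"using assigned run_id: {config.run_id}. "
    "If you manually selected this run_id, this is an error."
)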

@clessig (Collaborator) commented May 28, 2025

@clessig indeed option 2 is for the slurm wrapper. Without setting the run id at slurm level, it is hard to track the files being generated and/or the source code being copied. That script already assumes a unique run id and uses the same logic to generate a new run_id.

What I would suggest: make clear in the documentation of the CLI that the run id should not be manually set.

Yes, you are completely right--this is why this was always there. And agreed, let's make it clear in the documentation that this is not to be used manually.

Determining the run id should follow the following logic:

1. (default case): run train, train_continue or evaluate without any flags => generate a new run_id for this run.
2. (assign run_id): run train, train_continue or evaluate with --run_id <RUNID> flag => assign a run_id manually to this run
Review comment (Collaborator):
Add "For train this should not be used manually.

@clessig (Collaborator) commented Jun 2, 2025

When running experiments at the weekend, I realized that train_continue misses an option --new_run_id : bool. We can manually assign a new run_id, but by default it should be auto-generated. Does this PR add this option? Do we have it for evaluate?

@grassesi: could you please add it before merging if not already there. Thanks!
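
A hedged sketch of what that flag could look like, mirroring the --reuse_run_id definition quoted further down; the flag name comes from the comment above and the help text is illustrative only:

parser.add_argument(
    "--new_run_id",
    action="store_true",
    help="Auto-generate a new run_id when continuing or evaluating a run.",
)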

@@ -106,7 +106,7 @@ def load_config(

     Args:
         private_home: Configuration file containing platform dependent information and secrets
-        run_id: Run/model id of pretrained WeatherGenerator model to continue training or evaluate
+        from_run_id: Run/model id of pretrained WeatherGenerator model to continue training or evaluate
Review comment (Collaborator):
This is a run id and not a model id. It's a bit of a convention, but we do not use the plain weights; we load all the config parameters, and potentially only overwrite selected ones. Thus, run id seems more appropriate (and this is also what the command line arg is called).

parser.add_argument(
    "--reuse_run_id",
    action="store_true",
    help="Use the id given via --from_run_id also for the current run. The storage location for artifacts will be reused as well. This might overwrite artifacts from previous runns.",
Review comment (Collaborator):
Typo runns -> runs.

There should be no overwriting of artifacts. This happens (except for #280 and potentially other bugs) for *_latest.ckpt and *_latest.json. We should avoid this; it breaks reproducibility (I typically continue from the *_latest ckpt). Can we open a PR for this?

Successfully merging this pull request may close these issues:

  • Evaluation not returning new run_id by default