Skip to content

Iluise/develop/stac database #274

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 58 commits into
base: develop
Choose a base branch
from
Open

Conversation

iluise
Copy link
Collaborator

@iluise iluise commented May 28, 2025

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • [ x] New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Issue Number

#196

Code Compatibility

  • [ x] I have performed a self-review of my code

Code Performance and Testing

it doesn't touch the main code

Dependencies

Should I add jsonnet to the uv modules? In principle nobody should touch the jsonnets.

Documentation

  • [ x] My code follows the style guidelines of this project
  • I have updated the documentation and docstrings to reflect the changes
  • [x ] I have added comments to my code, particularly in hard-to-understand areas

Additional Notes

@FussyDuck
Copy link

FussyDuck commented May 28, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ tjhunter
✅ iluise
❌ Ilaria Luise


Ilaria Luise seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@iluise iluise requested a review from tjhunter May 28, 2025 12:31
@iluise iluise self-assigned this May 28, 2025
@iluise iluise added enhancement New feature or request datasets Anything related to the datasets used in the project labels May 28, 2025
@iluise iluise moved this to In Progress in WeatherGen-dev May 28, 2025
Copy link
Collaborator

@tjhunter tjhunter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iluise this is fantastic, thank you for putting this together. I have a few changes suggested that should be easy to do with search & replace.

Also, if you want, you can use jsonnet fmt to nicely format all the jsonnet files

{
"description": "The data catalogue of the WeatherGenerator project",
"id": "weathergen",
"links": [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this file still need to be here? it would be nice to only have jsonnet in this directory to cleanly separate the output from the jsonnet code

// for i, v in enumerate(ds.variables): print(f"\"{v}\": [{ds.statistics["minimum"][i]}, {ds.statistics["maximum"][i]},
// {ds.statistics["mean"][i]}, {ds.statistics["stdev"][i]},
// {ds.statistics_tendencies()["mean"][i]}, {ds.statistics_tendencies()["stdev"][i]}],")
variables: {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is not great with this format is that it is hard to update with new statistics: we can append to the list but we need to document the precise order of the fields. json is not great at dealing with NaN. We can keep that array since you did all the work. If format change is necessary, we can have a more complex variables_v2 key.

"reanalysis"
],
providers: [
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to a common.jsonnet where you put all the common variables such the common provider definition:

{
  providers: {
    copernicus: {
       name: "Copernicus",...
   },
  ecmwf: {...}
  },
  hpc: {
    leonardo: "leonardo",
   hpc2020: "hpc2020",
   ...
  }
}

The reason for also adding the HPCs is that we are going to also read this file from the programs. This ensures that the name of the HPCs exactly match the code names used in the code:
https://xxx/WeatherGenerator-private/-/blob/main/hpc/platform-env.py?ref_type=heads#L10

dataset_name: "cerra-rr-an-oper-0001-mars-5p5km-1984-2020-6h-v2-hmsi.zarr",
type: "application/vnd+zarr",
description: "Anemoi dataset",
locations: ["HPC2020", "EWC", "MareNostrum5", "Leonardo"],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment above. These names do not match the code codenames. this will be easier to track with variables

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EWC is not officially registered, just add it in the list

"atmosphere",
"reanalysis"
],
providers: [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment about reusing teh provider

local fn = import 'functions.libsonnet';

// URL for hrefs
local href_link = "https://raw.githubusercontent.com/ecmwf/WeatherGenerator/refs/heads/iluise/develop/stac-database/database/";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking you could just have 2 href links and just generate 2x the files below?

local href_link = "https://raw.githubusercontent.com/ecmwf/WeatherGenerator/refs/heads/iluise/develop/stac-database/database/";

// TODO: improve this
local era5v8 = import "era5_v8.jsonnet";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is all duplicate, let's try to combine. the only difference is the href link

],

variables: {
"quality_pixel_bitmask": ["TBA", "TBA", "TBA", "TBA", "TBA", "TBA"],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you be consistent with how to represent missing values? It is "NA" above, which is what I would pick

geometry: [-180, 180, -90, 90],

ref_links: [
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need the links in this file? it should be generated for you.
what you could do otherwise is have a extra_ref_links field that you merge with the one you generate

description: "Observation dataset",
locations: ["HPC2020", "JSC"],
size: "120GB",
inodes: "",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"NA"?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Anything related to the datasets used in the project enhancement New feature or request
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

3 participants