Issue/209 xrdatarray b #229

peterdudfield · 2021-10-14T20:14:42Z

Pull Request

Description

There are two different flows for the data representation: 1. saving to a batch file, and 2. loading ready for ML

Each data source outputs an xr.Dataset, which has been validated using a pydantic extension
Batches are made and then saved to .netcdf, each datasource in a different file
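
A minimal sketch of flow 1, assuming xarray and numpy are installed. The names `EXPECTED_DIMS`, `validate_dims` and `make_fake_satellite`, and the dimension names themselves, are hypothetical illustrations, not the pydantic extension used in this PR. The NetCDF write is left as a comment because it needs a NetCDF backend.

```python
import numpy as np
import xarray as xr

# Hypothetical expected dimensions per data source (illustrative only).
EXPECTED_DIMS = {
    "satellite": ("example", "time_index", "x_index", "y_index", "channels_index"),
}


def validate_dims(dataset: xr.Dataset, source: str) -> xr.Dataset:
    """Raise ValueError if any data variable has unexpected dims."""
    expected = EXPECTED_DIMS[source]
    for name, variable in dataset.data_vars.items():
        if tuple(variable.dims) != expected:
            raise ValueError(
                f"{source}: variable {name!r} has dims {variable.dims}, "
                f"expected {expected}"
            )
    return dataset


def make_fake_satellite(batch_size: int = 2) -> xr.Dataset:
    """Build a small all-zero dataset with the expected dims."""
    data = np.zeros((batch_size, 3, 4, 4, 2), dtype=np.float32)
    return xr.Dataset({"data": (EXPECTED_DIMS["satellite"], data)})


satellite = validate_dims(make_fake_satellite(), "satellite")

# Each data source would then go to its own file, for example:
# satellite.to_netcdf("satellite/0.nc")  # path is an assumption
```

Validating before writing means a malformed dataset fails at batch-creation time and never reaches disk.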

Batches are loaded from netcdf
transferred to BatchML, which is a pydantic model of each data source; each data source describes which tensor fields it should have
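
A rough sketch of flow 2, assuming xarray, numpy and pydantic are installed. `SatelliteML`, the field names, and the `from_xr_dataset` / `from_datasets` helpers are hypothetical stand-ins, not the classes added in this PR. Arrays are kept as numpy here; converting them to framework tensors would be a further step.

```python
import numpy as np
import xarray as xr
from pydantic import BaseModel


class SatelliteML(BaseModel):
    """Hypothetical ML-ready view of one data source."""

    class Config:
        # Needed so pydantic accepts np.ndarray as a field type.
        arbitrary_types_allowed = True

    sat_data: np.ndarray

    @classmethod
    def from_xr_dataset(cls, dataset: xr.Dataset) -> "SatelliteML":
        # Pull the raw array out of the (already validated) dataset.
        return cls(sat_data=dataset["data"].values)


class BatchML(BaseModel):
    """Hypothetical container holding one ML model per data source."""

    batch_size: int
    satellite: SatelliteML

    @classmethod
    def from_datasets(cls, satellite: xr.Dataset) -> "BatchML":
        sat = SatelliteML.from_xr_dataset(satellite)
        return cls(batch_size=sat.sat_data.shape[0], satellite=sat)


# In practice the dataset would come from disk, e.g. xr.load_dataset(path).
dataset = xr.Dataset(
    {"data": (("example", "time_index"), np.ones((2, 3), dtype=np.float32))}
)
batch = BatchML.from_datasets(dataset)
```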

*** note that 2. could move to dataloader repo, but wanted to get it all sorted here first

Change batch maker to use xr.Datasets
validate all data source xr.Datasets to have the correct dims
add validation to xr.Datasets for each data source (done for satellite and nwp)
Create BatchML ready for ML training, etc.

Fixes #209

How Has This Been Tested?

Adjusted existing unit tests accordingly
Wrote new unit tests for the new code
ran script/prepare_ml_data.py - it works!!!

Checklist:

My code follows OCF's coding style guidelines
I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
I have checked my code and corrected any misspellings

conftest.py

nowcasting_dataset/data_sources/data_source.py

nowcasting_dataset/data_sources/datasource_output.py

nowcasting_dataset/data_sources/fake.py

nowcasting_dataset/data_sources/gsp/gsp_data_source.py

nowcasting_dataset/data_sources/metadata/metadata_data_source.py

nowcasting_dataset/data_sources/pv/pv_data_source.py

nowcasting_dataset/dataset/batch.py

nowcasting_dataset/dataset/subset.py

jacobbieker

Looks good! Just have a few suggestions possibly

nowcasting_dataset/data_sources/gsp/gsp_data_source.py

nowcasting_dataset/data_sources/nwp/nwp_model.py

nowcasting_dataset/data_sources/pv/pv_data_source.py

nowcasting_dataset/dataset/batch.py

flowirtz

wow, what a beast!

I think this PR is waaaaay too big. We should see if we can split up future work better to end up with smaller PRs. I find it really hard to review PRs this size and it takes quite a long time.

Google's eng-practices make a good case for smaller PRs (they call them CL: Change list): https://google.github.io/eng-practices/review/developer/small-cls.html

Also, my github browser window got really slow during reviewing haha.

I agree with @jacobbieker that we should try to remove commented-out code, as it's contained in the git history anyways.

There are also a bunch of unresolved todo items that are being introduced - I think we should generally try to either:

a) resolve those before merging the PR, or
b) raise an issue or link them to one, and then reference the issue number in the todo comment.

Otherwise I feel like we'll just forget about them or it's unclear who will work on resolving them in the future.
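
One way option b) could be checked automatically is sketched below. This is only an illustration: the function name `find_unlinked_todos` and the issue number in the sample text are made up, and the rule assumed here (a TODO comment counts as linked if it contains `#<number>`) is just one possible convention.

```python
import re
from typing import List, Tuple

# Any "# TODO" comment, and the subset that mentions an issue like "#123".
TODO_ANY = re.compile(r"#\s*TODO\b", re.IGNORECASE)
TODO_WITH_ISSUE = re.compile(r"#\s*TODO\b.*#\d+", re.IGNORECASE)


def find_unlinked_todos(source: str) -> List[Tuple[int, str]]:
    """Return (line number, line) for TODO comments without an issue number."""
    unlinked = []
    for number, line in enumerate(source.splitlines(), start=1):
        if TODO_ANY.search(line) and not TODO_WITH_ISSUE.search(line):
            unlinked.append((number, line.strip()))
    return unlinked


# Sample input; the issue number is hypothetical.
example = """\
x = 1  # TODO: tidy this up
y = 2  # TODO (issue #123): validate dims
"""

unlinked = find_unlinked_todos(example)
```

A check like this could run in CI or a pre-commit hook so unlinked todo items are caught before merge.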

conftest.py

nowcasting_dataset/dataset/xr_utils.py

nowcasting_dataset/data_sources/README.md

nowcasting_dataset/dataset/xr_utils.py

tests/data_sources/satellite/test_satellite_data_source.py

tests/data_sources/test_datasource_output.py

nowcasting_dataset/dataset/subset.py

thanks for the suggestions Flo Co-authored-by: Flo <[email protected]>

peterdudfield · 2021-10-18T11:16:00Z

Yeah, I agree, this was a big one. I really don't like doing big PRs, so I totally agree about breaking them down.

Thanks very much @flowirtz for all the comments

2. option to only delete files in folder, not folders

JackKelly

LGTM!

Thanks for doing all this work!

peterdudfield added 15 commits October 14, 2021 13:11

first go at using xr.Dataset workflow

144e190

move from branch where data was too big

24dd7f4

move subselect functions to seperate file

11ef417

remove validation script.py, fix netcdf dataset test

e64d8d8

Merge branch 'main' into issue/209-xrdatarray-b

0009158

tidy imports

02d0887

pylinter + pv_data test data

19997b1

remove commented code

ffd16d9

add validation of dims in models

f6d75c4

add satellite and nwp specific validation

c8438e4

tidy imports

8628197

tidy unsed functions

3467d61

add fake datetime

ca4d5a8

tidy files

11f24d5

move fake functions into separate file

ff4fded