This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Issue/209 xrdatarray b #229

Merged
merged 32 commits into from
Oct 19, 2021

Contributor

@peterdudfield peterdudfield commented Oct 14, 2021

Pull Request

Description

There are two different flows for the data: 1. saving to a batch file, 2. loading ready for ML

  • Each data source outputs an xr.Dataset which has been validated using a pydantic extension
  • Batches are made and then saved to .netcdf, each data source in a different file
  • Batches are loaded from netcdf
  • Batches are transferred to BatchML, which is a pydantic model of each data source; each data source describes which tensor fields it should have
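The two flows above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `make_fake_dataset`, `SourceML`, the dimension names and the `<name>.nc` file naming are all hypothetical, and it assumes a netCDF backend for xarray is installed.

```python
import tempfile
from pathlib import Path
from typing import Dict

import numpy as np
import xarray as xr
from pydantic import BaseModel

# Hypothetical dimension names, chosen only for this sketch.
DIMS = ("example", "time_index", "x_index", "y_index")


def make_fake_dataset(seed: int) -> xr.Dataset:
    """Stand-in for a data source: returns one batch as an xr.Dataset."""
    rng = np.random.default_rng(seed)
    data = rng.random((2, 3, 4, 4)).astype("float32")
    return xr.Dataset({"data": (DIMS, data)})


def save_batch(batch: Dict[str, xr.Dataset], directory: Path) -> None:
    """Flow 1: write each data source to its own netCDF file."""
    for name, ds in batch.items():
        ds.to_netcdf(directory / f"{name}.nc")


def load_batch(directory: Path) -> Dict[str, xr.Dataset]:
    """Flow 2, first half: read every data source file back into memory."""
    return {p.stem: xr.load_dataset(p) for p in sorted(directory.glob("*.nc"))}


class SourceML(BaseModel):
    """Hypothetical ML-ready model: declares the array field it carries."""

    class Config:
        arbitrary_types_allowed = True  # let pydantic accept numpy arrays

    data: np.ndarray


def to_ml(batch: Dict[str, xr.Dataset]) -> Dict[str, SourceML]:
    """Flow 2, second half: drop the xarray wrapper, keep plain arrays."""
    return {name: SourceML(data=ds["data"].values) for name, ds in batch.items()}


def round_trip() -> Dict[str, SourceML]:
    """Save a two-source batch to a temp dir, load it, and convert it."""
    batch = {"satellite": make_fake_dataset(0), "nwp": make_fake_dataset(1)}
    with tempfile.TemporaryDirectory() as tmp:
        save_batch(batch, Path(tmp))
        loaded = load_batch(Path(tmp))
    return to_ml(loaded)
```

Keeping one file per data source means a single source can be regenerated or loaded without touching the others, at the cost of more files per batch.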

*** Note that flow 2 could move to the dataloader repo, but I wanted to get it all sorted here first


  • Change batch maker to use xr.Datasets
  • Validate that all data source xr.Datasets have the correct dims
  • Add validation to xr.Datasets for each data source (done for satellite and nwp)
  • Create BatchML ready for ML training etc.
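As a rough idea of what a dims check on a data source's xr.Dataset could look like, here is a small sketch. It uses a plain function rather than the pydantic extension from this PR, and the `EXPECTED_DIMS` table and its dimension names are assumptions made for illustration only.

```python
from typing import Dict, Tuple

import numpy as np
import xarray as xr

# Hypothetical expected dims per data source; the real names may differ.
EXPECTED_DIMS: Dict[str, Tuple[str, ...]] = {
    "satellite": ("example", "time_index", "x_index", "y_index", "channels_index"),
    "nwp": ("example", "time_index", "x_index", "y_index", "channels_index"),
}


def validate_dims(name: str, ds: xr.Dataset) -> xr.Dataset:
    """Raise ValueError unless every data variable has the expected dims, in order."""
    expected = EXPECTED_DIMS[name]
    for var_name, var in ds.data_vars.items():
        if var.dims != expected:
            raise ValueError(
                f"{name}.{var_name}: expected dims {expected}, got {var.dims}"
            )
    return ds


# Usage: a correctly shaped dataset passes through unchanged.
good = xr.Dataset({"data": (EXPECTED_DIMS["nwp"], np.zeros((1, 2, 3, 3, 4)))})
validate_dims("nwp", good)
```

Returning the dataset lets the check be chained directly before a save or a conversion step.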

Fixes #209

How Has This Been Tested?

  • Adjusted unit tests accordingly

  • Wrote new unit tests for the new code

  • ran script/prepare_ml_data.py - it works!!!

  • No

  • Yes

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

@peterdudfield peterdudfield marked this pull request as ready for review October 15, 2021 11:00
@peterdudfield peterdudfield marked this pull request as ready for review October 18, 2021 08:03
Member

@jacobbieker jacobbieker left a comment

Looks good! Just have a few suggestions possibly

Contributor

@flowirtz flowirtz left a comment

wow, what a beast!

I think this PR is waaaaay too big. We should see if we can split up future work better to end up with smaller PRs. I find it really hard to review PRs this size and it takes quite a long time.

Google's eng-practices make a good case for smaller PRs (they call them CL: Change list): https://google.github.io/eng-practices/review/developer/small-cls.html

Also, my github browser window got really slow during reviewing haha.


I agree with @jacobbieker that we should try to remove commented-out code, as it's contained in the git history anyways.


There are also a bunch of unresolved todo items that are being introduced - I think we should generally try to either:

a) resolve those before merging the PR, or
b) raise an issue or link them to one, and then reference the issue number in the todo comment.

Otherwise I feel like we'll just forget about them or it's unclear who will work on resolving them in the future.
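One way option b) could be checked automatically is sketched below. This is only an illustration: the convention it assumes (a `# TODO` comment that mentions an issue as `#<number>` on the same line) and the helper name `todos_without_issue` are hypothetical, not something this repo enforces.

```python
import re
from typing import List

# Assumed convention for this sketch: "# TODO: ... #123" on a single line.
TODO_PATTERN = re.compile(r"#\s*TODO\b", re.IGNORECASE)
ISSUE_PATTERN = re.compile(r"#\d+")


def todos_without_issue(source: str) -> List[int]:
    """Return 1-based line numbers of TODO comments that cite no issue number."""
    missing = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = TODO_PATTERN.search(line)
        # Only look for an issue reference after the TODO marker itself.
        if match and not ISSUE_PATTERN.search(line[match.end():]):
            missing.append(lineno)
    return missing
```

A check like this could run in CI or a pre-commit hook so that a todo without an issue link is flagged before the PR is merged.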

@peterdudfield
Contributor Author

Yeah, I agree, this was a big one. I really don't like doing big PRs, so I totally agree about breaking them down.

Thanks very much @flowirtz for all the comments

Member

@JackKelly JackKelly left a comment

LGTM!

Thanks for doing all this work!

@peterdudfield peterdudfield deleted the issue/209-xrdatarray-b branch October 20, 2021 10:23
4 participants