"Big new design" for nowcasting_dataset #213
In terms of the order in which to do this work... here's some initial thoughts (very much up for discussion!)
The thinking being that we can't keep each modality in a single datatype until we've removed PyTorch (because PyTorch requires us to move data from xarray to numpy to tensors); which we can't do until we've removed the PyTorch dataloader; and the best way to remove the PyTorch dataloader is to implement #202 :) Does that sound roughly right?!
Great work @JackKelly putting this all together. I think this will be super useful at making a really solid base for the ML problems we are trying to solve. I'm slightly biased by 1., but yeah, I would do 1. first (#195). I'll just share my thoughts on how #209 can be done, so that data ends up as numpy / tensor values.
Yeah, great work! I also think the outline here for moving forward makes sense. For #86, I think that can mostly come as we refactor the code: we remove PyTorch as we see it. I think I agree with @peterdudfield that we can probably do 3 and 4 at the same time. With the focus on removing the on-the-fly loading, would we be moving the PyTorch dataloader that currently loads the prepared batches to somewhere else, like
I like the idea of leaving it here too; somehow that keeps it wrapped up nicely. I wonder if we could do it as an optional requirement. That means people don't have to install it, but they do need to install it to use the data loader. And it keeps things all here.
Yeah, I think an optional requirement would be the way to go with it.
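To illustrate the optional-requirement idea: a hypothetical sketch of how PyTorch could become a setuptools "extra" for nowcasting_dataset. The exact package metadata and extra name here are assumptions, not the project's actual `setup.py`:

```python
# Hypothetical setup.py fragment: PyTorch as an optional dependency.
# `pip install nowcasting_dataset[torch]` would pull in PyTorch;
# a plain `pip install nowcasting_dataset` stays PyTorch-free.
extras_require = {
    "torch": ["torch"],
}

# In setup.py this dict would be passed through, e.g.:
#   setup(name="nowcasting_dataset", ..., extras_require=extras_require)
print(extras_require)
```

The dataloader code would then import `torch` lazily and raise a helpful error telling users to install the extra if it's missing.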
Here's a draft design sketch for the end-goal, once all the issues listed above are done: To recap: One of the main aims is to significantly simplify nowcasting_dataset (maybe reducing the total lines of code to something like half the current size!) Some specific design features of this new design:
This makes it easy to:
Once this is implemented, I think we may be able to literally delete several large chunks of code:
(UPDATE: Plan moved to top of this thread!) Does that sound OK, @peterdudfield?
Think that sounds good. I would move the test writing from 2. to 1.
I've got the afternoon free for coding so I'm going to make a start on the plan in the comment above... Unless that will clash with your work, @peterdudfield?
go for it |
I love the idea of re-using stuff! But, a few thoughts and questions about doing
What do you think?
Ah, yes, I agree - that would be a nice thing to be able to do, and is a really good idea! (as long as we can make sure Pydantic doesn't slow down the training loop 🙂)
I'll try and put together a test.
That would be really useful, thanks loads! 🙂 I'm sure you've thought about this already but, just for the sake of being overly cautious: in order to show a measurable latency, the test probably needs to use batches of similar size to our "real" batches; probably needs to use an ML model similarly sized to our "real" ML models; and will probably need to train on a bunch of batches (a single batch might be cached in weird ways).
Not using pydantic is slightly faster (but the difference is significant, using a t-test). I ran this code 10 times; Dict was faster by about ~0.1% of the running time. So pydantic is slower, but it's a debate about how much slowdown is OK in exchange for easier-to-read code.
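The benchmark mentioned above isn't shown in the thread; a minimal sketch of what such a comparison could look like is below. To stay self-contained it uses a validating dataclass as a stand-in for a pydantic model, and the `Batch` field names are invented for illustration:

```python
import timeit
from dataclasses import dataclass

# Hypothetical stand-in for a pydantic Batch model: a dataclass with
# manual validation in __post_init__ (real pydantic would validate
# types on construction too).
@dataclass
class Batch:
    sat_data: list
    nwp_data: list

    def __post_init__(self):
        # Minimal consistency check, loosely mimicking pydantic validation.
        if len(self.sat_data) != len(self.nwp_data):
            raise ValueError("sat_data and nwp_data must be the same length")

def access_dict(batch: dict) -> int:
    return len(batch["sat_data"]) + len(batch["nwp_data"])

def access_model(batch: Batch) -> int:
    return len(batch.sat_data) + len(batch.nwp_data)

data = list(range(1000))
dict_batch = {"sat_data": data, "nwp_data": data}
model_batch = Batch(sat_data=data, nwp_data=data)

# Time many repeated accesses, since any per-batch overhead only
# matters when multiplied across a whole training run.
t_dict = timeit.timeit(lambda: access_dict(dict_batch), number=100_000)
t_model = timeit.timeit(lambda: access_model(model_batch), number=100_000)
print(f"dict:  {t_dict:.4f}s")
print(f"model: {t_model:.4f}s")
```

As the comment above notes, a realistic test should also use real-sized batches and a real model; this only isolates the container-access cost.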
Sure, sounds good, thanks loads for doing the test! That's super-useful. (Yay for empirically testing stuff!) Let's go for it!
** moved to #209. Yeah, I will branch off. Yeah, I think in 'main',
My feeling is this can be done in parallel to the #213 comment. I think pydantic validation runs whenever something is changed, i.e. to make sure the user is doing the correct things with the object.
Cool beans, all sounds great! Thank you! I'm genuinely excited to see (edit: I'll copy this comment to #209!)
Just because I got somewhat lost on this, and it relates to how I make openclimatefix/nowcasting_dataloader#2 work: are the time coordinates stored as datetimes in the prepared batches? Or just ints? Same with the x and y coordinates?
Yeah, I think that's the safest way to deal with them at the moment.
As just ints? And if so, are they just seconds since epoch?
Hehe, yeah, we should document precisely what's going to go into the NetCDF files... I'll start a new issue to create that document (UPDATE: #227)... (the shapes etc. will be described in the pydantic models... but it might be clearer to also document the contents of the NetCDF files in a human-readable file somewhere, too... if only - as Peter mentioned the other day - to help define the interface between
In Once all the stuff in this issue is implemented, I think the NetCDF files will have "proper" datetimes in them. And, IIUC, those datetimes will get converted to ints by
Yeah, tensors can't hold datetimes, but having them as datetimes makes it a bit easier to compute the time-of-day, day-of-year, etc. features for the positional encodings; otherwise, I'd probably go with converting them back to datetimes to do so.
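The round-trip discussed above (datetimes → ints for the tensors, then back to datetimes to compute positional-encoding features) can be sketched with the stdlib. The function names and the seconds-since-epoch encoding are assumptions for illustration, not confirmed details of the NetCDF format:

```python
from datetime import datetime, timezone

def datetimes_to_ints(datetimes):
    """Encode datetimes as int seconds since the Unix epoch,
    one plausible int encoding for time coordinates."""
    return [int(dt.timestamp()) for dt in datetimes]

def time_features(seconds_since_epoch):
    """Recover time-of-day and day-of-year features from int coords,
    e.g. for positional encodings in the dataloader."""
    features = []
    for secs in seconds_since_epoch:
        dt = datetime.fromtimestamp(secs, tz=timezone.utc)
        features.append({
            "hour_of_day": dt.hour + dt.minute / 60,  # fractional hours
            "day_of_year": dt.timetuple().tm_yday,
        })
    return features

times = [datetime(2021, 6, 1, 12, 30, tzinfo=timezone.utc)]
print(time_features(datetimes_to_ints(times)))
```

In practice this would be vectorised with `pandas.to_datetime(..., unit="s")` over whole coordinate arrays rather than looped per timestamp.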
If you can wait a few days, hopefully we'll have modified
Yeah! Sounds good, I'll get it working assuming that
Detailed Description
Over the last few weeks, we've come up with a bunch of ideas for how to simplify nowcasting_dataset.
This issue exists to keep track of all of those "new design" issues, to discuss how these individual issues hang together into a single coherent design, and to discuss how to implement this new design in a sequence of easy-to-digest chunks :)
The plan:
In terms of sequencing this work, I'm now thinking I'll do something along the lines of this:
First, do some stand-alone "preparatory" work. Specifically:
- `DataSource`s #204
- `DataSources`, instead of using datetimes. #223
- `DataSource` subclass' constructor arguments to align with the new YAML config field names #270
- `split()` function #299

Implement and write tests for some of the functions in the draft design in the GitHub comment below (for now, these functions won't actually be called in the code):

- `sample_spatial_and_temporal_locations_for_examples()` (done in PR Implement `DataSourceList.sample_spatial_and_temporal_locations_for_examples()` #278)
- `DataSource.prepare_batches()`

Then, the big one (where the code size will hopefully shrink a lot!):

- `prepare_ml_data.py` so it looks like the sketch above. Use `click` to pass in command-line params.
- `DataSourceList` into `Manager`; and maintain DataSources in a `dict` instead of a list? #298
- `n_timesteps_per_batch`?
- `DataSourceList` as the one which defines the geospatial location of each example.
- `history_length` and `forecast_length` to disk for each modality, and use these in dataloader #293

Remove:

- `dataset/datamodule.py`
- `dataset.datasets.NowcastingDataset`
- `dataset.datasets.worker_init_fn()`
- `batch_to_dataset` and `dataset_to_batch` stuff
- `to_numpy` and `from_numpy` stuff (assuming we can go straight from `xr.Dataset` to `torch.Tensor`)

Related issues

- `xarray.Dataset` #211
- `DataSource`s #204
- `DataSource`? #219

A bit more context
I think I've made our lives far harder than they need to be by trying to support two different use-cases:
I think we can make life way easier by dropping support for use-case 1 :)
Here's the broad proposal that it'd be great to discuss:
We drop support for loading data directly from Zarr on-the-fly during ML training (which we haven't done for months, and - now that we're using large NWP images - it would be far too slow). nowcasting_dataset becomes laser-focused on pre-preparing batches (just as we use it now).
This allows us to completely rip out PyTorch from nowcasting_dataset (#86); and enables each "modality" to stay in a single data type throughout nowcasting_dataset (#209). e.g. satellite data stays in an `xr.Dataset`. Each modality would be processed concurrently in different processes; and would be output into different directories (e.g. `train/satellite/` and `train/nwp/`) (#202).

Inspired by and making use of @peterdudfield's Pydantic PR (#195), we'd have a formal schema for the data structures in nowcasting_dataset (#211).
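The "one process per modality, one output directory per modality" idea can be sketched with the stdlib. This is not the real nowcasting_dataset code: the function name is invented, and plain text files stand in for the NetCDF batches that xarray would actually write:

```python
# Minimal sketch: prepare each modality concurrently in its own process,
# writing batches into per-modality directories such as train/satellite/
# and train/nwp/. Text files stand in for xarray-written NetCDF batches.
import tempfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

MODALITIES = ["satellite", "nwp"]

def prepare_batches(modality: str, dst_root: str, n_batches: int = 2) -> list:
    """Write every batch for one modality into its own directory."""
    dst = Path(dst_root) / "train" / modality
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(n_batches):
        path = dst / f"{i}.nc"  # placeholder for a real NetCDF file
        path.write_text(f"{modality} batch {i}")
        written.append(str(path))
    return written

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # One worker process per modality, mirroring the proposed design.
        with ProcessPoolExecutor(max_workers=len(MODALITIES)) as pool:
            for paths in pool.map(prepare_batches, MODALITIES,
                                  [root] * len(MODALITIES)):
                print(paths)
```

Because each modality writes only to its own directory, the workers share no state and need no locking, which is what makes the per-process split attractive.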
The ultimate aim is to simplify the code (I'm lazy!), whilst keeping all the useful functionality, and making the code easier to extend & maintain 🙂
Of course, we'll still use a PyTorch dataloader to load the pre-prepared batches off disk into an ML model. But that's fine; and it should work in a very similar (maybe identical?) fashion to how it works now 🙂
I certainly can't claim to have thought this all through properly! And everything's up for discussion, of course!