This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Can we simplify the code by always keeping the data in one data type (e.g. xr.DataArray) per modality? #209

Closed
Tracked by #213
JackKelly opened this issue Oct 7, 2021 · 19 comments · Fixed by #229


@JackKelly (Member) commented Oct 7, 2021

In the past, nowcasting_dataset was designed to feed data on-the-fly into a PyTorch model. This meant that, as the data flowed through nowcasting_dataset, it would change type: for example, satellite data would start as an xr.DataArray, then be turned into a numpy array (because PyTorch doesn't know what to do with an xr.DataArray), and then be turned into a torch.Tensor.

But, I think we can safely say now that nowcasting_dataset is just for pre-preparing batches (not for loading data on-the-fly). As such, we can probably simplify the code by keeping data in a single container type per modality. For example, satellite data could always live in an xr.DataArray for its entire life while flowing through nowcasting_dataset.

Sorry, I really should've thought of this earlier! But, yeah, I think this could simplify the code quite a lot.

I haven't fully thought through the implications of this, but some changes might be:

  • In the Pydantic models, each field can be just a single type (instead of a Union of types). So, for example, instead of sat_data: Array = Field(... we can just do sat_data: xr.DataArray = Field(...
  • We can get rid of the to_numpy function.
  • For all the modalities which use xarray or pandas data types, we can use dimension names instead of integer indexes, e.g. seq_length = len(sat_data[-4]) becomes seq_length = len(sat_data.time) (see the sketch after this list).
  • Saving and loading data to/from disk becomes super-simple.
  • We'd no longer need the to_xr_dataset and from_xr_dataset methods (which are quite fiddly).
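
As a rough, untested sketch of the dimension-names point (the shape, dims and variable names here are made up):

import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical satellite batch with named dimensions.
sat_data = xr.DataArray(
    np.random.rand(32, 19, 64, 64),
    dims=["example", "time", "y", "x"],
    coords={"time": pd.date_range("2021-10-07", periods=19, freq="5min")},
)

seq_length = sat_data.shape[1]   # positional indexing: fragile if dims are re-ordered
seq_length = len(sat_data.time)  # named dimension: self-documenting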
@JackKelly (Member, Author) commented Oct 7, 2021

And then, I wonder if DataSourceOutput could become a mixture of a Pydantic model (so we get the typing and the validation) and an xr.Dataset (so we get all the goodies that come with xr.Datasets, like plotting and disk IO)? To be specific, we could do:

class DataSourceOutput(pydantic.BaseModel, xr.Dataset):

I haven't tested this, so I have no idea if it'll actually work!

@JackKelly (Member, Author)

Before we can fully implement this, #86 probably needs to be implemented.

JackKelly mentioned this issue Oct 7, 2021 (7 tasks)
JackKelly changed the title from "Can we simplify by always keeping the data in one data type (e.g. xr.DataArray) per modality?" to "Can we simplify the code by always keeping the data in one data type (e.g. xr.DataArray) per modality?" Oct 7, 2021

@gsvitak commented Oct 7, 2021

@JackKelly Although I do not fully understand the use case, I would suggest checking out mixin classes in Python: https://www.thedigitalcatonline.com/blog/2020/03/27/mixin-classes-in-python/

@peterdudfield (Contributor)

Looks like you would still need a to_tensor() function that converts the xr.Datasets to tensors:

https://github.com/openclimatefix/nowcasting_dataset/blob/peter/explore/notebooks/2021-10/2021-10-11/xr_tensor.py
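
Something like this, perhaps (just a rough sketch; the real function would probably need per-variable dtype handling):

import torch
import xarray as xr

def to_tensor(dataset: xr.Dataset) -> dict[str, torch.Tensor]:
    # Convert every data variable in the xr.Dataset to a torch.Tensor.
    return {
        name: torch.from_numpy(data_array.values)
        for name, data_array in dataset.data_vars.items()
    }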

Unless there is a better way?

@JackKelly (Member, Author)

Yes, I agree, we'd definitely still need a to_tensor() function to go from xr.Dataset to torch.Tensor :)

I've been updating issue #25 with some notes about that; and your approach looks good to me, @peterdudfield!

@JackKelly (Member, Author)

Looks good!

Another simplification that we might be able to make is to only represent batches (#212). In other words, we could drop the idea of representing individual "examples" 🙂. I haven't fully thought this through but I think that it was only necessary to represent individual examples back when we wanted nowcasting_dataset to combine multiple modalities into a single batch during pre-processing. But, once #202 is implemented, I think we may be able to move to a world where each DataSource only loads entire batches for pre-processing 🙂

So, in your IPython notebook, we might be able to drop SatelliteBatch because Satellite will always represent a whole batch of satellite data :)

Does that sound OK? Or have I gotten myself confused?!

@peterdudfield (Contributor)

That does make sense.

Yeah, I would expect each 'datasource' to make a BatchObject and save it to a file. So in that case 'Satellite' could be ignored.

I agree Example won't really be needed in this ML pipeline. But it might be useful to have it for visualisation; I kind of like the idea of showing the whole example and trying to understand the behaviour.

@peterdudfield (Contributor)

[Screenshot: drawing of the proposed design]

Just to see it in a drawing.

@JackKelly (Member, Author)

Looks good!

And, great point about still needing individual examples for plotting - I'd forgotten about plotting 🙂

BTW, in that diagram, I wonder if we can keep things super-simple and have class Satellite inherit from xr.Dataset instead of BaseModel? Then we don't have to write any custom to_netcdf or from_netcdf methods on the Satellite class. And class Batch would inherit from BaseModel. Something like this:

class Satellite(xr.Dataset):
    # Define the validation methods required by Pydantic...  And that's about it!

class Batch(BaseModel):
    satellite: Satellite
    # ....
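
Then disk IO should come almost for free from xarray; roughly (the paths are made up):

# Saving is inherited from xr.Dataset:
batch.satellite.to_netcdf("satellite/000.nc")

# Loading is just xr.open_dataset (re-wrapping it as a Satellite is hand-waved here):
satellite = xr.open_dataset("satellite/000.nc")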

@peterdudfield (Contributor) commented Oct 12, 2021

It just depends where we want to do our validation of the satellite data. I was thinking we do it in the Satellite class, i.e. check all values are integers, or that none are -1.

I thought it needed to be a BaseModel for that to take place, or maybe the Batch can take care of it. But it would be good to validate the satellite data before we save it to netcdf.

@JackKelly (Member, Author) commented Oct 12, 2021

I totally agree that the class Satellite should validate the satellite data :)

Don't worry, I'm 99% sure that's possible, and I got it working in this notebook. To be specific: In cell 3, class ImageDataset inherits from xr.Dataset and defines a bunch of validate methods. Pydantic calls these validate methods when an Example object is constructed. For example, cell 8 shows Pydantic complaining when bad Satellite data is passed into the Example constructor. (And class Example inherits from BaseModel)

(That notebook is a bit out-of-date because it still has an Example class! When reading that notebook, maybe rename Example to Batch in your head 🙂 )

@peterdudfield (Contributor) commented Oct 12, 2021

Yeah, but ... hmmm.
We won't call Batch until after the object is saved. So something, before we save it to netcdf, has to be a Pydantic BaseModel, otherwise no validation is done before we save. Unless we call the validate function ourselves, but I'm not sure we can.

https://pydantic-docs.helpmanual.io/usage/types/#classes-with-__get_validators__

Then there is a model:

class Model(BaseModel):
    post_code: PostCode

where PostCode is made into a Pydantic-compatible type, extended from a string.
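
So maybe Satellite could follow the same pattern? Something like this (untested; the 'data' variable name and the checks are just placeholders):

import xarray as xr

class Satellite(xr.Dataset):
    __slots__ = ()  # xarray expects subclasses to declare __slots__

    @classmethod
    def __get_validators__(cls):
        # Pydantic v1 hook: yield the validator(s) to run at construction time.
        yield cls.validate

    @classmethod
    def validate(cls, v):
        if not isinstance(v, xr.Dataset):
            raise TypeError("satellite data must be an xr.Dataset")
        if (v["data"] == -1).any():  # placeholder check for -1 fill values
            raise ValueError("satellite data contains -1 values")
        return v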

@JackKelly (Member, Author)

Ah, yes, sorry, I forgot to explain: If we define Batch so every field is optional, like this:

class Batch(BaseModel):
    satellite: Optional[Satellite]
    nwp: Optional[NWP]
    # ....

Then we can call the Batch constructor individually on each modality (during saving). That is, we can do:

batch = Batch(satellite=satellite)

And leave out all the other fields.

It's a tiny bit of a hack, perhaps, but it does mean that we can define class Satellite etc. as sub-classes of xr.Dataset, which I'm pretty sure will save us a bunch of boilerplate code 🙂
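
So the save path for a single modality might look roughly like this (the file path is made up):

batch = Batch(satellite=satellite)  # Pydantic validates the satellite field here
assert batch.nwp is None            # all other modalities are simply left out
batch.satellite.to_netcdf("satellite/000.nc")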

BTW, I'm happy to tackle this issue (hopefully later this week) as part of #213

@peterdudfield (Contributor) commented Oct 13, 2021

Sorry, I should have commented here.

[Screenshot: diagram of the proposed design]

@JackKelly wrote in #213 (comment):

"This design looks great to me!

Please do implement this now (in a branch other than main).

But, unless I've misunderstood, the code in main (as it stands right now) is incompatible with this design, because, right now, the code in main forces each modality to change data type multiple times as it passes through the code (e.g. satellite data starts as an xarray object, then numpy, then a tensor, and finally goes into the pytorch DataLoader). So Pydantic may complain when Batch.satellite stops being an instance of class Satellite(xr.Dataset) and gets turned into a numpy array, for example. (Or maybe it'll be fine as long as the satellite data is an instance of class Satellite(xr.Dataset) at the time that Batch is instantiated: maybe Pydantic won't notice if, some time after Batch is instantiated, Batch.satellite subsequently changes to a numpy array? Does Pydantic only run its validation checks when Batch is instantiated?)

I plan to rip out the pytorch DataLoader over the next few days. So, hopefully, by the end of the week the code will be fully compatible with your design! So, yeah, if you go ahead and implement it in a new branch while this is on your mind, then hopefully we can merge it later in the week, once I've completely finished #86 and have implemented the design I sketched out in #213 (comment) above 🙂.

Or have I misunderstood? 🙂"

***** response *****

Yeah, I will branch off.

Yeah, I think in 'main':

  1. I will rename Satellite to SatelliteML,
  2. strip out unnecessary 'to_numpy' code, and maybe do some tidying,
  3. make Satellite as described above,
  4. change the current 'Batch' to 'BatchML', and make 'Batch'.

My feeling is this can be done in parallel to #213.

I think Pydantic validation runs whenever something is changed, i.e. to make sure the user is doing correct things with the object.
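
(I believe that, in Pydantic v1, re-validation on attribute assignment is opt-in, something like:)

from pydantic import BaseModel

class Batch(BaseModel):
    class Config:
        # Without this flag, Pydantic v1 only validates when the model is constructed.
        validate_assignment = True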

@JackKelly (Member, Author)

Cool beans, all sounds great! Thank you! I'm genuinely excited to see nowcasting_dataset really mature and become well-designed and easy to modify - thank you for all your work!

@JackKelly (Member, Author)

BTW, do you think we should remove time_30 and, instead, always refer to the "time" dimension as time? This is motivated by two recent conversations about nowcasting_dataset:

  1. We might want to support lots of different sample periods (e.g. hourly NWPs, half-hourly GSP-level PV; 15-minutely PVOutput.org data; 5-minutely satellite data; etc.).
  2. Now that each modality ends up in a different NetCDF file, we don't need to worry about multiple sample periods in the same NetCDF file. So, for example, in the GSP NetCDF files, time will be half-hourly; in the satellite NetCDF files, time will be 5-minutely. (Maybe we could add a sample_period metadata field to each NetCDF file, if necessary? See the sketch below.)
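
A quick sketch of the metadata idea (the attribute name and path are made up):

import xarray as xr

# Hypothetical: record the sample period as a NetCDF attribute.
gsp_dataset = xr.Dataset(attrs={"sample_period_minutes": 30})
gsp_dataset.to_netcdf("gsp/000.nc")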

What do you think?

@peterdudfield (Contributor)

Yeah, I think we can get rid of time_30; we only needed it before because we created one big xr.Dataset.
Yeah, that would be a good metadata field, so we know what it is.

flowirtz moved this to Todo in Nowcasting Oct 15, 2021
Repository owner moved this from Todo to Done in Nowcasting Oct 19, 2021