This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Can we simplify the code by always keeping the data in one data type (e.g. xr.DataArray) per modality? #209

Closed
Tracked by #213
JackKelly opened this issue Oct 7, 2021 · 19 comments · Fixed by #229


@JackKelly (Member) commented Oct 7, 2021

In the past, nowcasting_dataset was designed to feed data on-the-fly into a PyTorch model. This meant that, as the data flowed through nowcasting_dataset, it would change type: for example, satellite data would start as an xr.DataArray, then be turned into a numpy array (because PyTorch doesn't know what to do with an xr.DataArray), and then be turned into a torch.Tensor.

But, I think we can safely say now that nowcasting_dataset is just for pre-preparing batches (not for loading data on-the-fly). As such, we can probably simplify the code by keeping data in a single container type per modality. For example, satellite data could always live in an xr.DataArray for its entire life while flowing through nowcasting_dataset.

Sorry, I really should've thought of this earlier! But, yeah, I think this could simplify the code quite a lot.

I haven't fully thought through the implications of this, but some changes might be:

  • In the Pydantic models, each field can be just a single type (instead of a Union of types). So, for example, instead of sat_data: Array = Field(... we can just do sat_data: xr.DataArray = Field(...
  • We can get rid of the to_numpy function.
  • For all the modalities which use xarray or pandas data types, we can use dimension names instead of integer indexes, e.g. seq_length = len(sat_data[-4]) becomes seq_length = len(sat_data.time) (see the sketch after this list).
  • Saving and loading data to/from disk becomes super-simple.
  • We'd no longer need the to_xr_dataset and from_xr_dataset methods (which are quite fiddly).
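
As a rough, untested sketch of the dimension-names point (the shape, dims and variable names here are made up):

import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical satellite batch with named dimensions.
sat_data = xr.DataArray(
    np.random.rand(32, 19, 64, 64),
    dims=["example", "time", "y", "x"],
    coords={"time": pd.date_range("2021-10-07", periods=19, freq="5min")},
)

seq_length = sat_data.shape[1]   # positional indexing: fragile if dims are re-ordered
seq_length = len(sat_data.time)  # named dimension: self-documenting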
@JackKelly (Member, Author) commented Oct 7, 2021

And then, I wonder if DataSourceOutput could become a mixture of a Pydantic model (so we get the typing and the validation) and an xr.Dataset (so we get all the goodies that come with xr.Datasets, like plotting and disk IO)? To be specific, we could do:

class DataSourceOutput(pydantic.BaseModel, xr.Dataset):

I haven't tested this, so I have no idea if it'll actually work!

@JackKelly (Member, Author)

Before we can fully implement this, #86 probably needs to be implemented.

JackKelly mentioned this issue Oct 7, 2021 (7 tasks)
JackKelly changed the title from "Can we simplify by always keeping the data in one data type (e.g. xr.DataArray) per modality?" to "Can we simplify the code by always keeping the data in one data type (e.g. xr.DataArray) per modality?" Oct 7, 2021

@gsvitak commented Oct 7, 2021

@JackKelly Although I do not fully understand the use case, I would suggest checking out mixin classes in Python: https://www.thedigitalcatonline.com/blog/2020/03/27/mixin-classes-in-python/

@peterdudfield (Contributor)

Looks like you would still need a to_tensor() function that converts the xr.Datasets to tensors:

https://github.com/openclimatefix/nowcasting_dataset/blob/peter/explore/notebooks/2021-10/2021-10-11/xr_tensor.py
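
Something like this, perhaps (just a rough sketch; the real function would probably need per-variable dtype handling):

import torch
import xarray as xr

def to_tensor(dataset: xr.Dataset) -> dict[str, torch.Tensor]:
    # Convert every data variable in the xr.Dataset to a torch.Tensor.
    return {
        name: torch.from_numpy(data_array.values)
        for name, data_array in dataset.data_vars.items()
    }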

Unless there is a better way?

@JackKelly (Member, Author)

Yes, I agree, we'd definitely still need a to_tensor() function to go from xr.Dataset to torch.Tensor :)

I've been updating issue #25 with some notes about that; and your approach looks good to me, @peterdudfield!

@JackKelly (Member, Author)

Looks good!

Another simplification that we might be able to make is to only represent batches (#212). In other words, we could drop the idea of representing individual "examples" 🙂. I haven't fully thought this through but I think that it was only necessary to represent individual examples back when we wanted nowcasting_dataset to combine multiple modalities into a single batch during pre-processing. But, once #202 is implemented, I think we may be able to move to a world where each DataSource only loads entire batches for pre-processing 🙂

So, in your IPython notebook, we might be able to drop SatelliteBatch because Satellite will always represent a whole batch of satellite data :)

Does that sound OK? Or have I gotten myself confused?!

@peterdudfield (Contributor)

That does make sense.

Yeah, I would expect each 'datasource' to make a BatchObject and save it to a file. So in that case 'Satellite' could be ignored.

I agree Example won't really be needed in this ML pipeline. But it might be useful to have it for visualisation; I kind of like the idea of showing the whole example and trying to understand the behaviour.

@peterdudfield (Contributor)

[Screenshot: drawing of the proposed design]

Just to see it in a drawing.

@JackKelly (Member, Author)

Looks good!

And, great point about still needing individual examples for plotting - I'd forgotten about plotting 🙂

BTW, in that diagram, I wonder if we can keep things super-simple and have class Satellite inherit from xr.Dataset instead of BaseModel? Then we don't have to write any custom to_netcdf or from_netcdf methods on the Satellite class. And class Batch would inherit from BaseModel. Something like this:

class Satellite(xr.Dataset):
    # Define the validation methods required by Pydantic...  And that's about it!

class Batch(BaseModel):
    satellite: Satellite
    # ....
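
Then disk IO should come almost for free from xarray; roughly (the paths are made up):

# Saving is inherited from xr.Dataset:
batch.satellite.to_netcdf("satellite/000.nc")

# Loading is just xr.open_dataset (re-wrapping it as a Satellite is hand-waved here):
satellite = xr.open_dataset("satellite/000.nc")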

@peterdudfield (Contributor) commented Oct 12, 2021

It just depends where we want to do our validation of the satellite data. I was thinking we do it in the Satellite class, i.e. check all values are integers, or that none are -1.

I thought it needed to be a BaseModel for that to take place, or maybe the Batch can take care of it. But it would be good to validate the satellite data before we save it to netcdf.

@JackKelly (Member, Author) commented Oct 12, 2021

I totally agree that the class Satellite should validate the satellite data :)

Don't worry, I'm 99% sure that's possible, and I got it working in this notebook. To be specific: In cell 3, class ImageDataset inherits from xr.Dataset and defines a bunch of validate methods. Pydantic calls these validate methods when an Example object is constructed. For example, cell 8 shows Pydantic complaining when bad Satellite data is passed into the Example constructor. (And class Example inherits from BaseModel)

(That notebook is a bit out-of-date because it still has an Example class! When reading that notebook, maybe rename Example to Batch in your head 🙂 )

@peterdudfield (Contributor) commented Oct 12, 2021

Yeah, but ... hmmm.
We won't call Batch until after the object is saved. So something, before we save it to netcdf, has to be a Pydantic BaseModel, otherwise no validation is done before we save. Unless we call the validate function ourselves, but I'm not sure we can.

https://pydantic-docs.helpmanual.io/usage/types/#classes-with-__get_validators__

Then there is a model:

class Model(BaseModel):
    post_code: PostCode

where PostCode is made into a Pydantic-compatible type, extended from a string.
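
So maybe Satellite could follow the same pattern? Something like this (untested; the 'data' variable name and the checks are just placeholders):

import xarray as xr

class Satellite(xr.Dataset):
    __slots__ = ()  # xarray expects subclasses to declare __slots__

    @classmethod
    def __get_validators__(cls):
        # Pydantic v1 hook: yield the validator(s) to run at construction time.
        yield cls.validate

    @classmethod
    def validate(cls, v):
        if not isinstance(v, xr.Dataset):
            raise TypeError("satellite data must be an xr.Dataset")
        if (v["data"] == -1).any():  # placeholder check for -1 fill values
            raise ValueError("satellite data contains -1 values")
        return v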

@JackKelly (Member, Author)

Ah, yes, sorry, I forgot to explain: If we define Batch so every field is optional, like this:

class Batch(BaseModel):
    satellite: Optional[Satellite]
    nwp: Optional[NWP]
    # ....

Then we can call the Batch constructor individually on each modality (during saving). That is, we can do:

batch = Batch(satellite=satellite)

And leave out all the other fields.

It's a tiny bit of a hack, perhaps, but it does mean that we can define class Satellite etc. as sub-classes of xr.Dataset, which I'm pretty sure will save us a bunch of boilerplate code 🙂
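
So the save path for a single modality might look roughly like this (the file path is made up):

batch = Batch(satellite=satellite)  # Pydantic validates the satellite field here
assert batch.nwp is None            # all other modalities are simply left out
batch.satellite.to_netcdf("satellite/000.nc")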

BTW, I'm happy to tackle this issue (hopefully later this week) as part of #213

@peterdudfield (Contributor) commented Oct 13, 2021

Sorry, I should have commented here.

[Screenshot: diagram of the proposed design]

@JackKelly wrote in #213 (comment):

"This design looks great to me!

Please do implement this now (in a branch other than main).

But, unless I've misunderstood, the code in main (as it stands right now) is incompatible with this design, because, right now, the code in main forces each modality to change data type multiple times as it passes through the code (e.g. satellite data starts as an xarray object, then numpy, then a tensor, and finally goes into the pytorch DataLoader). So Pydantic may complain when Batch.satellite stops being an instance of class Satellite(xr.Dataset) and gets turned into a numpy array, for example. (Or maybe it'll be fine as long as the satellite data is an instance of class Satellite(xr.Dataset) at the time that Batch is instantiated: maybe Pydantic won't notice if, some time after Batch is instantiated, Batch.satellite subsequently changes to a numpy array? Does Pydantic only run its validation checks when Batch is instantiated?)

I plan to rip out the pytorch DataLoader over the next few days. So, hopefully, by the end of the week the code will be fully compatible with your design! So, yeah, if you go ahead and implement it in a new branch while this is on your mind, then hopefully we can merge it later in the week, once I've completely finished #86 and have implemented the design I sketched out in #213 (comment) above 🙂.

Or have I misunderstood? 🙂"

***** response *****

Yeah, I will branch off.

Yeah, I think in 'main':

  1. I will rename Satellite to SatelliteML,
  2. strip out unnecessary 'to_numpy' code, and maybe do some tidying,
  3. make Satellite as described above,
  4. change the current 'Batch' to 'BatchML', and make 'Batch'.

My feeling is this can be done in parallel to #213.

I think Pydantic validation runs whenever something is changed, i.e. to make sure the user is doing correct things with the object.
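
(I believe that, in Pydantic v1, re-validation on attribute assignment is opt-in, something like:)

from pydantic import BaseModel

class Batch(BaseModel):
    class Config:
        # Without this flag, Pydantic v1 only validates when the model is constructed.
        validate_assignment = True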

@JackKelly (Member, Author)

Cool beans, all sounds great! Thank you! I'm genuinely excited to see nowcasting_dataset really mature and become well-designed and easy to modify - thank you for all your work!

@JackKelly (Member, Author)

BTW, do you think we should remove time_30 and, instead, always refer to the "time" dimension as time? This is motivated by two recent conversations about nowcasting_dataset:

  1. We might want to support lots of different sample periods (e.g. hourly NWPs, half-hourly GSP-level PV; 15-minutely PVOutput.org data; 5-minutely satellite data; etc.).
  2. Now that each modality ends up in a different NetCDF file, we don't need to worry about multiple sample periods in the same NetCDF file. So, for example, in the GSP NetCDF files, time will be half-hourly; in the satellite NetCDF files, time will be 5-minutely. (Maybe we could add a sample_period metadata field to each NetCDF file, if necessary? See the sketch below.)
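
A quick sketch of the metadata idea (the attribute name and path are made up):

import xarray as xr

# Hypothetical: record the sample period as a NetCDF attribute.
gsp_dataset = xr.Dataset(attrs={"sample_period_minutes": 30})
gsp_dataset.to_netcdf("gsp/000.nc")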

What do you think?

@peterdudfield (Contributor)

Yeah, I think we can get rid of time_30; we only needed it before because we created one big xr.Dataset.
Yeah, that would be a good metadata field, so we know what it is.

flowirtz moved this to Todo in Nowcasting Oct 15, 2021
Repository owner moved this from Todo to Done in Nowcasting Oct 19, 2021