This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Example --> Pydantic #166

Closed
Tracked by #213
peterdudfield opened this issue Sep 27, 2021 · 28 comments · Fixed by #195
Labels
data (New data source or feature; or modification of existing data source), discussion

Comments

@peterdudfield
Contributor

peterdudfield commented Sep 27, 2021

Move Example to Pydantic object

Detailed Description

The Example data (maybe renamed to 'BatchData') is getting a bit bigger, and lacks a bit of structure. It might be good to move it to something with a bit more structure. If we move to Pydantic, then we can use:

  • internal validators to validate the data nicely
  • descriptions of data fields
  • a more tree-like structure, which can grow with more data feeds

We would have to check that it saves and compresses well.

Possible Implementation

BatchData

  • General
    • t0_dt
    • batch_size
  • PV
    • pv_yield
    • ....
  • {Next_Data_Source}
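A minimal sketch of what this hierarchy could look like in Pydantic (the field types, the `gt=0` constraint, and the sub-model names here are assumptions based on the outline above, not the actual implementation):

```python
from datetime import datetime
from typing import List, Optional

from pydantic import BaseModel, Field


class General(BaseModel):
    """Metadata shared by every data source in the batch."""

    t0_dt: datetime = Field(..., description="Forecast start time of each example.")
    batch_size: int = Field(..., gt=0, description="Number of examples in the batch.")


class PV(BaseModel):
    """PV data for the batch."""

    pv_yield: List[float] = Field(..., description="PV yield values.")


class BatchData(BaseModel):
    """Tree-like structure: one sub-model per data source, so a new feed
    (the {Next_Data_Source} slot above) becomes a new optional field."""

    general: General
    pv: Optional[PV] = None
```

Construction then validates automatically, e.g. `batch_size=0` would raise a ValidationError because of the `gt=0` constraint.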
@peterdudfield added the 'enhancement' (New feature or request) label on Sep 27, 2021
@peterdudfield
Contributor Author

Would be interested in your opinions on this @JackKelly and @jacobbieker and how vital it is

@peterdudfield added the 'data' (New data source or feature; or modification of existing data source) and 'discussion' labels, and removed the 'enhancement' (New feature or request) label, on Sep 27, 2021
@jacobbieker
Member

I don't know if it's super urgent, but it sounds really nice to have the extra structure. After #150 there aren't any more data sources that I can think of that are missing from what SatFlow had, other than cloud masks. But it would be good to have better structure for adding more sources as time goes on.

@JackKelly
Member

Sounds good!

We should check if PyTorch is happy to accept Pydantic objects (if not, I guess we can just convert from Pydantic to dicts). I'm not sure if PyTorch will be happy with nested dicts (but I haven't tried!)

Also, at the moment, Example objects are (confusingly!) used to hold two different things:

  1. Individual examples
  2. Entire batches (i.e. multiple examples)

I'm not sure how best to rename Example to a name that's general enough to capture these two uses.

And, at different stages of the pipeline, each field might be a pd.DataFrame or xr.DataArray, a numpy array, or a Tensor :) It would be good to try to make this more explicit in the code, so we always know exactly which data types we're dealing with.

@jacobbieker
Member

For being more explicit about what each data type is throughout the code, I've used schema for that, which worked quite well. Or, apparently, Pydantic has a beta validator as well that we could use?

@peterdudfield
Contributor Author

Yeah, there are definitely a few different things going on.

I was thinking of not passing pydantic objects to PyTorch, just the individual elements.

Perhaps we could have

  1. Individual examples
  2. Batch

We could then have methods on the objects that convert to xr.DataArrays or to numpy arrays.

Pydantic validators already exist:
https://pydantic-docs.helpmanual.io/usage/validators/
But I can see how the 'validation_decorator' could be useful. We can always use a few asserts to check things for the moment.

@peterdudfield
Contributor Author

peterdudfield commented Sep 27, 2021

Oh, and Pydantic objects already have basic validation built in. For example:

from pydantic import BaseModel
class OneGSP(BaseModel):
    gsp_id: int

g = OneGSP(gsp_id='test_string')  # raises a ValidationError:
# gsp_id value is not a valid integer (type=type_error.integer)

@JackKelly
Member

Very nice!

@peterdudfield
Contributor Author

https://github.com/openclimatefix/nowcasting_dataset/blob/issue/166-batch-pydantic/notebooks/2021-10/2021-10-01/pydantic.py

I did a little test to see if PyTorch could handle Pydantic objects - it could not, but it could handle nested dictionaries.

@peterdudfield
Contributor Author

peterdudfield commented Oct 1, 2021

https://github.com/openclimatefix/nowcasting_dataset/tree/issue/166-batch-pydantic/nowcasting_dataset/dataset/model
I've put a general model, plus a satellite model, together. Obviously there is more to add.

I think having it like this will make things really nice, and hopefully easier to work with

I would be interested in some intermediate feedback - rather than doing it all and getting it all wrong. (@JackKelly @jacobbieker )

@JackKelly
Member

Ooh, yeah, I really like the idea of splitting the classes! Nice!

A few random thoughts:

Before inventing too much stuff ourselves, let's absolutely convince ourselves that we can't use an off-the-shelf class like an xr.Dataset to hold our batches. I've learnt the hard way that, in the long run, it's a lot less work if we build our core data structures using off-the-shelf classes (and then having a bunch of functions for transforming / plotting those objects), rather than inventing bespoke classes.

I wonder if there's a way to combine Pydantic with xr.Datasets? So we get the validation and self-documenting features of pydantic; but we also get all the saving / loading / plotting / resampling functionality from xr.Dataset?

One final thought: I do continue to worry about naming a class either Batch or Example, and then using it for both single examples, and entire batches. But I'm honestly not sure what the solution is! Other than, perhaps, something like having an Example class, and then having a Batch class which inherits from the Example class and, potentially, overrides absolutely nothing... literally just:

class Batch(Example):
    pass

And then, instead of doing this:

def join_data_to_batch(data: List[Batch]) -> Batch:
    """ Join several single data items together to make a Batch """

Which is perhaps a bit confusing ("why are they taking a list of batches and outputting a single batch? I don't understand!").

Instead, we could do:

def join_data_to_batch(data: List[Example]) -> Batch:
    """ Join several single data items together to make a Batch """

@peterdudfield
Contributor Author

peterdudfield commented Oct 1, 2021

Yea, have to be a bit careful about re-inventing the wheel.

Yea, tricky, I was thinking of just having a 'to_dataset' function, and 'from_dataset', that way we can use those features.

Yeah, I like that suggestion of being clear with 'Batch', and perhaps 'DataItem'; 'Example' is a bit unclear for me.
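A sketch of that 'to_dataset'/'from_dataset' idea, using a hypothetical Satellite sub-model (the field and dimension names here are illustrative, not from the actual branch):

```python
import numpy as np
import xarray as xr
from pydantic import BaseModel


class Satellite(BaseModel):
    """Hypothetical satellite sub-model; the field holds a raw numpy array."""

    class Config:
        arbitrary_types_allowed = True  # allow numpy arrays as field types

    data: np.ndarray

    def to_dataset(self) -> xr.Dataset:
        """Hand over to xarray, so we get saving/loading/plotting for free."""
        return xr.Dataset({"data": (("time", "x", "y"), self.data)})

    @classmethod
    def from_dataset(cls, ds: xr.Dataset) -> "Satellite":
        """Round-trip back from an xr.Dataset into the validated model."""
        return cls(data=ds["data"].values)
```

So Pydantic owns validation and self-documentation, while `to_dataset()` exposes xarray's features (e.g. `to_netcdf()`) when they're needed.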

@JackKelly
Member

Yeah, I agree Example has so many meanings that it's pretty unclear what an Example class is for!

@JackKelly
Member

JackKelly commented Oct 1, 2021

Yea, tricky, I was thinking of just having a 'to_dataset' function, and 'from_dataset', that way we can use those features.

Yeah, that sounds good!

I honestly don't know if we can use xr.Dataset to hold our batches or not (and, if it's possible, I don't know how elegant it would be!) Maybe one path forward would be to try two different draft implementations: one using xr.Dataset (perhaps trying to bolt on some of the nice features of Pydantic) and one using pure Pydantic. I suspect you're right that we're better off building our own, custom Pydantic class. But, before doing that, I'd be keen to see what it looks and feels like to try using xr.Dataset. If we can use xr.Dataset then we'd get a bunch of functionality for free, which might be nice :)

@peterdudfield
Contributor Author

peterdudfield commented Oct 1, 2021

Perhaps - I need to sketch it out, to try to visualise this a bit more

https://docs.google.com/presentation/d/1jabE_IVi5vWWxWG_dnMItnlwWQmXL2dmQf2gzL58w6c/edit#slide=id.p

@JackKelly
Member

Love this sketch! Very happy to go with whatever you think's best!

@peterdudfield
Contributor Author

[Screenshot: 2021-10-01 at 18:10:36]

I think this is roughly what it is now

@peterdudfield
Contributor Author

[Screenshot: 2021-10-01 at 18:21:28]

@peterdudfield
Contributor Author

[Screenshot: 2021-10-04 at 16:14:25]

@peterdudfield
Contributor Author

Thoughts @JackKelly?
This should make it very clear what items are coming out of each DataSource. This will still use xr.DataArray, so hopefully things will still compress nicely.

The good thing about this design, and I'm hoping to do it this way, is that the old batches are still usable.

@JackKelly
Member

JackKelly commented Oct 5, 2021

Sounds good to me!

One thought that's just popped into my head: Don't feel constrained to continue using a single NetCDF file for each batch. That was a fairly arbitrary design decision of mine. We could, for example, have separate files for each data source, for each batch. So, for example, on disk, within the train/ directory, we might have train/NWP/, train/satellite/, etc. And, in each of these folders, we'd have NetCDF files, identified by the batch number. And, importantly, train/NWP/1.nc and train/satellite/1.nc would still be perfectly aligned in time and space (just as they currently are).
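The directory layout described above could be as simple as one file per modality per batch, with alignment guaranteed by sharing the batch index. A sketch (the helper name is made up):

```python
from pathlib import Path
from typing import Dict, Sequence


def batch_paths(
    root: Path,
    batch_idx: int,
    modalities: Sequence[str] = ("NWP", "satellite"),
) -> Dict[str, Path]:
    """Return one NetCDF path per modality for a given batch index,
    e.g. train/NWP/1.nc and train/satellite/1.nc. Files sharing a batch
    index must cover the same times and locations."""
    return {modality: root / modality / f"{batch_idx}.nc" for modality in modalities}
```

Each path could then be opened with `xr.open_dataset()`, and subsetting the data (e.g. satellite-only) becomes a matter of picking directories.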

I haven't really thought through this much. Some quick pros and cons...

Cons:

  • It makes the "ML loading code" a little more complex. But not much more complex.
  • Some 'modalities' (like PV) will have tiny files on disk (a few kBytes?). And tiny files are inefficient to load (both on the cloud and on our local hardware).

Pros:

  • It'd be very easy to use subsets of the data (e.g. we could share just the pre-prepared satellite batches with the MSc students)
  • We can use whatever file format makes most sense for each 'modality'. e.g. satellite images could be stored as GeoTIFFs (which would make them easy to view).
  • The code to write each batch to disk could live in the superclass for GSP and NWP; and could be overridden by the GSP or NWP classes.

But, yeah, I really have no strong feelings either way about whether we continue to store entire batches in a single NetCDF file; or if we use one NetCDF file per batch per modality; or if we use different file formats.

@jacobbieker any strong opinions?!

@jacobbieker
Member

I think having them as separate files might make the most sense, although yeah, I don't have super strong opinions of it. Having them as separate files for each modality could also give some more flexibility, like if we want to do #87 we could theoretically just add a second file with the wider context for the satellite imagery, without having to necessarily recreate all the rest of the files. If we split it up, it might make the most sense to have them in different formats, with GeoTIFF for satellite, topographic data is already stored like that, etc.

Potentially to fix the issue with small files for PV or certain modalities, we could store the batches in a database, like SQLite? Could be faster to load into memory and fast to query?

I think having a single NetCDF file is nice in terms of its just one large file to load, but I like the modularity of them being separate.

@peterdudfield
Contributor Author

I also like the idea of not saving everything into one batch file.
I'm tempted to put that into a different PR / issue.

  1. Get rid of 'Example' and use a pydantic object
  2. options for saving and loading different data variables

Do you think that makes sense?
I'm aware that sometimes changes grow too big.

@JackKelly
Member

we could theoretically just add a second file with the wider context for the satellite imagery, without having to necessarily recreate all the rest of the files

ooh, that's a really good point that having separate files means we don't have to recreate the whole thing if we only want to update one "modality". Nice! Although there would be a bit more complexity in the nowcasting_dataset code because, if we just update one modality, then we'd have to ensure the updated modality stays perfectly in sync with the data already on disk (e.g. batch i for the new modality needs to cover exactly the same time and location as batch i already on disk).

I'm tempted to put that into a different PR / issue

Very good idea to keep our PRs as "bite-sized" as possible :)

@jacobbieker
Member

Yeah, I agree, keeping the PRs small is good!

As for keeping them in sync, yeah, that could be an issue, which makes it a bit more complicated, but we could read from the modality we are replacing to ensure it's the same? As in, as we update a modality, read in the current version to get the info we need to keep it spatially and temporally aligned, and then replace it with the updated version.
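That read-before-replace check could be a small guard along these lines (a sketch; the coordinate names are assumptions):

```python
import numpy as np
import xarray as xr


def check_aligned(existing: xr.Dataset, replacement: xr.Dataset) -> None:
    """Before overwriting one modality's batch file, check that the
    replacement covers exactly the same times and locations as the
    version already on disk."""
    for coord in ("time", "x", "y"):
        if coord in existing.coords:
            if not np.array_equal(existing[coord].values, replacement[coord].values):
                raise ValueError(f"Coordinate {coord!r} is no longer aligned")
```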

@JackKelly
Member

JackKelly commented Oct 5, 2021

we could read from the modality we are replacing to ensure its the same?

Yeah, that could work!

Saving each "modality" as a different set of files opens up the possibility to further modularise and de-couple nowcasting_dataset... To create a new set of pre-prepared batches, we could run through each modality separately, something like:

  1. Create batches of the "main modality" that will define the position in space and time for each example. e.g. sample from the GSP PV data and save to disk (as files or in a database).
  2. Then create batches of the other "modalities", using the "main modality" to dictate the positions of each example in time and space? Each "modality generator" could run as a separate script, and all these scripts can run concurrently in separate processes. This should give us fairly easy-to-debug concurrent code. And, when our dataset gets really big, we could use multiple machines running in parallel to create the pre-prepared batches.
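The two steps above can be sketched as follows, with plain functions standing in for the separate per-modality scripts (the names and dummy values are made up, and the actual multi-process concurrency is omitted):

```python
from typing import Dict, List, Tuple

# (t0 datetime string, x, y) for one example
Position = Tuple[str, float, float]


def sample_positions(n_examples: int) -> List[Position]:
    """Step 1: the 'main modality' (e.g. GSP PV) picks the position in
    space and time of every example. Dummy values for illustration."""
    return [("2021-10-05T12:00", i * 10.0, i * 20.0) for i in range(n_examples)]


def make_modality_batch(modality: str, positions: List[Position]) -> Dict:
    """Step 2: every other modality builds its batch from the *same*
    positions, so its files stay aligned with the main modality's."""
    return {"modality": modality, "positions": positions}


positions = sample_positions(4)
batches = [make_modality_batch(m, positions) for m in ("NWP", "satellite")]
```

Because every generator consumes the same position list, the per-modality outputs stay aligned even when the generators run as separate processes.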

Any thoughts?

@jacobbieker
Member

Yeah, I think that would be a good way to move forward with it!

@peterdudfield
Contributor Author

peterdudfield commented Oct 6, 2021

PR is ready - for moving away from Example dict to a pydantic object.

#195

It's a bit of a big one, but some of the changes are quite small.
I've tried to add a readme.md file that helps explain things a bit.

Although this does seem like hard work, I think it will set us up nicely to build on, e.g. saving different ones, or even some nice plotting functions.

Note I still need to sort out the scripts, but that shouldn't be too hard

@JackKelly
Member

Great work!

BTW, I've started a separate issue to discuss the idea of splitting different modalities into different files: #202
