This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Example --> Pydantic #166

Closed
Tracked by #213
peterdudfield opened this issue Sep 27, 2021 · 28 comments · Fixed by #195
Labels
data (New data source or feature; or modification of existing data source), discussion

Comments

@peterdudfield
Contributor

peterdudfield commented Sep 27, 2021

Move Example to Pydantic object

Detailed Description

The Example data (maybe renamed to 'BatchData') is getting a bit bigger, and lacks a bit of structure. It might be good to move it to something with a bit more structure. If we move to Pydantic, then we can use:

  • internal validators to validate the data nicely
  • descriptions of data fields
  • a more tree-like structure, which can grow with more data feeds

We would have to check that it saves and compresses well.

Possible Implementation

BatchData

  • General
    • t0_dt
    • batch_size
  • PV
    • pv_yield
    • ....
  • {Next_Data_Source}
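A minimal sketch of what this hierarchy could look like in Pydantic (the field types, the `gt=0` constraint, and the sub-model names here are assumptions based on the outline above, not the actual implementation):

```python
from datetime import datetime
from typing import List, Optional

from pydantic import BaseModel, Field


class General(BaseModel):
    """Metadata shared by every data source in the batch."""

    t0_dt: datetime = Field(..., description="Forecast start time of each example.")
    batch_size: int = Field(..., gt=0, description="Number of examples in the batch.")


class PV(BaseModel):
    """PV data for the batch."""

    pv_yield: List[float] = Field(..., description="PV yield values.")


class BatchData(BaseModel):
    """Tree-like structure: one sub-model per data source, so a new feed
    (the {Next_Data_Source} slot above) becomes a new optional field."""

    general: General
    pv: Optional[PV] = None
```

Construction then validates automatically, e.g. `batch_size=0` would raise a ValidationError because of the `gt=0` constraint.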
@peterdudfield added the 'enhancement' (New feature or request) label on Sep 27, 2021
@peterdudfield
Contributor Author

Would be interested in your opinions on this @JackKelly and @jacobbieker and how vital it is

@peterdudfield added the 'data' (New data source or feature; or modification of existing data source) and 'discussion' labels, and removed the 'enhancement' (New feature or request) label, on Sep 27, 2021
@jacobbieker
Member

I don't know if it's super urgent, but it sounds really nice to have the extra structure. After #150 there aren't any more data sources that I can think of that are missing from what SatFlow had, other than cloud masks. But it would be good to have better structure for adding more sources as time goes on.

@JackKelly
Member

Sounds good!

We should check if PyTorch is happy to accept Pydantic objects (if not, I guess we can just convert from Pydantic to dicts). I'm not sure if PyTorch will be happy with nested dicts (but I haven't tried!)

Also, at the moment, Example objects are (confusingly!) used to hold two different things:

  1. Individual examples
  2. Entire batches (i.e. multiple examples)

I'm not sure how best to rename Example to a name that's general enough to capture these two uses.

And, at different stages of the pipeline, each field might be a pd.DataFrame or xr.DataArray, a numpy array, or a Tensor :) It would be good to try to make this more explicit in the code, so we always know exactly which data types we're dealing with.

@jacobbieker
Member

For being more explicit about what each data type is throughout the code, I've used schema for that, which worked quite well. Or, apparently, Pydantic has a beta validator as well that we could use?

@peterdudfield
Contributor Author

Yeah, there are definitely a few different things going on.

I was thinking of not passing pydantic objects to PyTorch, just the individual elements.

Perhaps we could have

  1. Individual examples
  2. Batch

We could then have methods on the objects that convert to xr.DataArrays or to numpy arrays.

Pydantic validators already exist:
https://pydantic-docs.helpmanual.io/usage/validators/
But I can see how the 'validation_decorator' could be useful. We can always use a few asserts to check things for the moment.

@peterdudfield
Contributor Author

peterdudfield commented Sep 27, 2021

Oh, and Pydantic objects already have basic validation built in. For example:

from pydantic import BaseModel
class OneGSP(BaseModel):
    gsp_id: int

g = OneGSP(gsp_id='test_string')  # raises a ValidationError:
# gsp_id value is not a valid integer (type=type_error.integer)

@JackKelly
Member

Very nice!

@peterdudfield
Contributor Author

https://github.com/openclimatefix/nowcasting_dataset/blob/issue/166-batch-pydantic/notebooks/2021-10/2021-10-01/pydantic.py

I did a little test to see if PyTorch could handle Pydantic objects - it could not, but it could handle nested dictionaries.

@peterdudfield
Contributor Author

peterdudfield commented Oct 1, 2021

https://github.com/openclimatefix/nowcasting_dataset/tree/issue/166-batch-pydantic/nowcasting_dataset/dataset/model
I've put a general model, plus a satellite model, together. Obviously there is more to add.

I think having it like this will make things really nice, and hopefully easier to work with

I would be interested in some intermediate feedback - rather than doing it all and getting it all wrong. (@JackKelly @jacobbieker )

@JackKelly
Member

Ooh, yeah, I really like the idea of splitting the classes! Nice!

A few random thoughts:

Before inventing too much stuff ourselves, let's absolutely convince ourselves that we can't use an off-the-shelf class like an xr.Dataset to hold our batches. I've learnt the hard way that, in the long run, it's a lot less work if we build our core data structures using off-the-shelf classes (and then having a bunch of functions for transforming / plotting those objects), rather than inventing bespoke classes.

I wonder if there's a way to combine Pydantic with xr.Datasets? So we get the validation and self-documenting features of pydantic; but we also get all the saving / loading / plotting / resampling functionality from xr.Dataset?

One final thought: I do continue to worry about naming a class either Batch or Example, and then using it for both single examples, and entire batches. But I'm honestly not sure what the solution is! Other than, perhaps, something like having an Example class, and then having a Batch class which inherits from the Example class and, potentially, overrides absolutely nothing... literally just:

class Batch(Example):
    pass

And then, instead of doing this:

def join_data_to_batch(data: List[Batch]) -> Batch:
    """ Join several single data items together to make a Batch """

Which is perhaps a bit confusing ("why are they taking a list of batches and outputting a single batch? I don't understand!").

Instead, we could do:

def join_data_to_batch(data: List[Example]) -> Batch:
    """ Join several single data items together to make a Batch """

@peterdudfield
Contributor Author

peterdudfield commented Oct 1, 2021

Yea, have to be a bit careful about re-inventing the wheel.

Yea, tricky, I was thinking of just having a 'to_dataset' function, and 'from_dataset', that way we can use those features.

Yeah, I like that suggestion of being clear with 'Batch', and perhaps 'DataItem'; 'Example' is a bit unclear for me.
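A sketch of that 'to_dataset'/'from_dataset' idea, using a hypothetical Satellite sub-model (the field and dimension names here are illustrative, not from the actual branch):

```python
import numpy as np
import xarray as xr
from pydantic import BaseModel


class Satellite(BaseModel):
    """Hypothetical satellite sub-model; the field holds a raw numpy array."""

    class Config:
        arbitrary_types_allowed = True  # allow numpy arrays as field types

    data: np.ndarray

    def to_dataset(self) -> xr.Dataset:
        """Hand over to xarray, so we get saving/loading/plotting for free."""
        return xr.Dataset({"data": (("time", "x", "y"), self.data)})

    @classmethod
    def from_dataset(cls, ds: xr.Dataset) -> "Satellite":
        """Round-trip back from an xr.Dataset into the validated model."""
        return cls(data=ds["data"].values)
```

So Pydantic owns validation and self-documentation, while `to_dataset()` exposes xarray's features (e.g. `to_netcdf()`) when they're needed.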

@JackKelly
Member

Yeah, I agree Example has so many meanings that it's pretty unclear what an Example class is for!

@JackKelly
Member

JackKelly commented Oct 1, 2021

Yea, tricky, I was thinking of just having a 'to_dataset' function, and 'from_dataset', that way we can use those features.

Yeah, that sounds good!

I honestly don't know if we can use xr.Dataset to hold our batches or not (and, if it's possible, I don't know how elegant it would be!) Maybe one path forward would be to try two different draft implementations: one using xr.Dataset (perhaps trying to bolt on some of the nice features of Pydantic) and one using pure Pydantic. I suspect you're right that we're better off building our own, custom Pydantic class. But, before doing that, I'd be keen to see what it looks and feels like to try using xr.Dataset. If we can use xr.Dataset then we'd get a bunch of functionality for free, which might be nice :)

@peterdudfield
Contributor Author

peterdudfield commented Oct 1, 2021

Perhaps - I need to sketch it out, to try to visualise this a bit more

https://docs.google.com/presentation/d/1jabE_IVi5vWWxWG_dnMItnlwWQmXL2dmQf2gzL58w6c/edit#slide=id.p

@JackKelly
Member

Love this sketch! Very happy to go with whatever you think's best!

@peterdudfield
Contributor Author

[Screenshot: 2021-10-01 at 18:10:36]

I think this is roughly what it is now

@peterdudfield
Contributor Author

[Screenshot: 2021-10-01 at 18:21:28]

@peterdudfield
Contributor Author

[Screenshot: 2021-10-04 at 16:14:25]

@peterdudfield
Contributor Author

Thoughts @JackKelly?
This should make it very clear what items are coming out of each DataSource. This will still use xr.DataArray, so hopefully things will still compress nicely.

The good thing about this design, and I'm hoping to do it this way, is that the old batches are still usable.

@JackKelly
Member

JackKelly commented Oct 5, 2021

Sounds good to me!

One thought that's just popped into my head: Don't feel constrained to continue using a single NetCDF file for each batch. That was a fairly arbitrary design decision of mine. We could, for example, have separate files for each data source, for each batch. So, for example, on disk, within the train/ directory, we might have train/NWP/, train/satellite/, etc. And, in each of these folders, we'd have NetCDF files, identified by the batch number. And, importantly, train/NWP/1.nc and train/satellite/1.nc would still be perfectly aligned in time and space (just as they currently are).
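The directory layout described above could be as simple as one file per modality per batch, with alignment guaranteed by sharing the batch index. A sketch (the helper name is made up):

```python
from pathlib import Path
from typing import Dict, Sequence


def batch_paths(
    root: Path,
    batch_idx: int,
    modalities: Sequence[str] = ("NWP", "satellite"),
) -> Dict[str, Path]:
    """Return one NetCDF path per modality for a given batch index,
    e.g. train/NWP/1.nc and train/satellite/1.nc. Files sharing a batch
    index must cover the same times and locations."""
    return {modality: root / modality / f"{batch_idx}.nc" for modality in modalities}
```

Each path could then be opened with `xr.open_dataset()`, and subsetting the data (e.g. satellite-only) becomes a matter of picking directories.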

I haven't really thought through this much. Some quick pros and cons...

Cons:

  • It makes the "ML loading code" a little more complex. But not much more complex.
  • Some 'modalities' (like PV) will have tiny files on disk (a few kBytes?). And tiny files are inefficient to load (both on the cloud and on our local hardware).

Pros:

  • It'd be very easy to use subsets of the data (e.g. we could share just the pre-prepared satellite batches with the MSc students)
  • We can use whatever file format makes most sense for each 'modality'. e.g. satellite images could be stored as GeoTIFFs (which would make them easy to view).
  • The code to write each batch to disk could live in the superclass for GSP and NWP; and could be overridden by the GSP or NWP classes.

But, yeah, I really have no strong feelings either way about whether we continue to store entire batches in a single NetCDF file; or if we use one NetCDF file per batch per modality; or if we use different file formats.

@jacobbieker any strong opinions?!

@jacobbieker
Member

I think having them as separate files might make the most sense, although yeah, I don't have super strong opinions of it. Having them as separate files for each modality could also give some more flexibility, like if we want to do #87 we could theoretically just add a second file with the wider context for the satellite imagery, without having to necessarily recreate all the rest of the files. If we split it up, it might make the most sense to have them in different formats, with GeoTIFF for satellite, topographic data is already stored like that, etc.

Potentially to fix the issue with small files for PV or certain modalities, we could store the batches in a database, like SQLite? Could be faster to load into memory and fast to query?

I think having a single NetCDF file is nice in terms of its just one large file to load, but I like the modularity of them being separate.

@peterdudfield
Contributor Author

I also like the idea of not saving everything into one batch file.
I'm tempted to put that into a different PR / issue.

  1. Get rid of 'Example' and use a pydantic object
  2. options for saving and loading different data variables

Do you think that makes sense?
I'm aware that sometimes changes grow too big.

@JackKelly
Member

we could theoretically just add a second file with the wider context for the satellite imagery, without having to necessarily recreate all the rest of the files

ooh, that's a really good point that having separate files means we don't have to recreate the whole thing if we only want to update one "modality". Nice! Although there would be a bit more complexity in the nowcasting_dataset code because, if we just update one modality, then we'd have to ensure the updated modality stays perfectly in sync with the data already on disk (e.g. batch i for the new modality needs to cover exactly the same time and location as batch i already on disk).

I'm tempted to put that into a different PR / issue

Very good idea to keep our PRs as "bite-sized" as possible :)

@jacobbieker
Member

Yeah, I agree, keeping the PRs small is good!

As for keeping them in sync, yeah, that could be an issue, which makes it a bit more complicated, but we could read from the modality we are replacing to ensure it's the same? As in, as we update a modality, read in the current version to get the info we need to keep it spatially and temporally aligned, and then replace it with the updated version.
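That read-before-replace check could be a small guard along these lines (a sketch; the coordinate names are assumptions):

```python
import numpy as np
import xarray as xr


def check_aligned(existing: xr.Dataset, replacement: xr.Dataset) -> None:
    """Before overwriting one modality's batch file, check that the
    replacement covers exactly the same times and locations as the
    version already on disk."""
    for coord in ("time", "x", "y"):
        if coord in existing.coords:
            if not np.array_equal(existing[coord].values, replacement[coord].values):
                raise ValueError(f"Coordinate {coord!r} is no longer aligned")
```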

@JackKelly
Member

JackKelly commented Oct 5, 2021

we could read from the modality we are replacing to ensure its the same?

Yeah, that could work!

Saving each "modality" as a different set of files opens up the possibility to further modularise and de-couple nowcasting_dataset... To create a new set of pre-prepared batches, we could run through each modality separately, something like:

  1. Create batches of the "main modality" that will define the position in space and time for each example. e.g. sample from the GSP PV data and save to disk (as files or in a database).
  2. Then create batches of the other "modalities", using the "main modality" to dictate the positions of each example in time and space? Each "modality generator" could run as a separate script, and all these scripts can run concurrently in separate processes. This should give us fairly easy-to-debug concurrent code. And, when our dataset gets really big, we could use multiple machines running in parallel to create the pre-prepared batches.
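The two steps above can be sketched as follows, with plain functions standing in for the separate per-modality scripts (the names and dummy values are made up, and the actual multi-process concurrency is omitted):

```python
from typing import Dict, List, Tuple

# (t0 datetime string, x, y) for one example
Position = Tuple[str, float, float]


def sample_positions(n_examples: int) -> List[Position]:
    """Step 1: the 'main modality' (e.g. GSP PV) picks the position in
    space and time of every example. Dummy values for illustration."""
    return [("2021-10-05T12:00", i * 10.0, i * 20.0) for i in range(n_examples)]


def make_modality_batch(modality: str, positions: List[Position]) -> Dict:
    """Step 2: every other modality builds its batch from the *same*
    positions, so its files stay aligned with the main modality's."""
    return {"modality": modality, "positions": positions}


positions = sample_positions(4)
batches = [make_modality_batch(m, positions) for m in ("NWP", "satellite")]
```

Because every generator consumes the same position list, the per-modality outputs stay aligned even when the generators run as separate processes.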

Any thoughts?

@jacobbieker
Member

Yeah, I think that would be a good way to move forward with it!

@peterdudfield
Contributor Author

peterdudfield commented Oct 6, 2021

PR is ready - for moving away from Example dict to a pydantic object.

#195

It's a bit of a big one, but some of the changes are quite small.
I've tried to add a readme.md file that helps explain things a bit.

Although this does seem like hard work, I think it will set us up nicely to build on, e.g. saving different ones, or even some nice plotting functions.

Note I still need to sort out the scripts, but that shouldn't be too hard

@JackKelly
Member

Great work!

BTW, I've started a separate issue to discuss the idea of splitting different modalities into different files: #202
