Example --> Pydantic #166
Comments
Would be interested in your opinions on this @JackKelly and @jacobbieker, and how vital it is.
I don't know if it's super urgent, but it sounds really nice to have the extra structure. After #150 there aren't any more data sources that I can think of that are missing from what SatFlow had, other than cloud masks. But it would be good to have better structure for adding more sources as time goes on.
Sounds good! We should check if PyTorch is happy to accept Pydantic objects (if not, I guess we can just convert from Pydantic to dicts). I'm not sure if PyTorch will be happy with nested dicts (but I haven't tried!). Also, at the moment, I'm not sure how best to rename things. And, at different stages of the pipeline, each field might be a different type.
Yeah, there definitely are a few different things going on. I was thinking of not passing pydantic objects to PyTorch, just the individual elements. Perhaps we could have an object per data source. We could then have methods on the objects that convert to xr.DataArrays, or to numpy arrays. Pydantic validators are around already.
Oh, and pydantic objects already have basic validation built in. For example, this raises a `ValidationError`:

```python
from pydantic import BaseModel

class OneGSP(BaseModel):
    gsp_id: int

g = OneGSP(gsp_id='test_string')  # raises ValidationError: not an int
```
Very nice!
I did a little test to see if PyTorch could handle pydantic objects: it could not, but it could handle nested dictionaries.
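Roughly what that little test exercises can be sketched without PyTorch at all: the default collate function recurses into dicts key by key and stacks the leaves. This toy stand-in (the function name and the example keys are made up for illustration) stacks leaves into lists instead of tensors:

```python
# Toy sketch of nested-dict collation, mimicking the recursion that
# PyTorch's default collate_fn performs over dict-shaped samples.

def collate_nested(examples):
    """Combine a list of (possibly nested) dicts into one dict of lists."""
    first = examples[0]
    if isinstance(first, dict):
        # Recurse into each key, collating the values across examples.
        return {key: collate_nested([ex[key] for ex in examples]) for key in first}
    return list(examples)  # leaf values: stack into a list

batch = collate_nested([
    {"gsp": {"gsp_id": 1, "power": 0.5}, "nwp": {"temperature": 280.0}},
    {"gsp": {"gsp_id": 2, "power": 0.7}, "nwp": {"temperature": 281.0}},
])
# batch == {"gsp": {"gsp_id": [1, 2], "power": [0.5, 0.7]},
#           "nwp": {"temperature": [280.0, 281.0]}}
```

This is why converting from pydantic objects to nested dicts just before handing samples to PyTorch is enough: the structure survives collation.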
https://github.com/openclimatefix/nowcasting_dataset/tree/issue/166-batch-pydantic/nowcasting_dataset/dataset/model I think having it like this will make things really nice, and hopefully easier to work with. I would be interested in some intermediate feedback, rather than doing it all and getting it all wrong. (@JackKelly @jacobbieker)
Ooh, yeah, I really like the idea of splitting the classes! Nice! A few random thoughts: Before inventing too much stuff ourselves, let's absolutely convince ourselves that we can't use an off-the-shelf class. I wonder if there's a way to combine Pydantic with xr.Datasets? So we get the validation and self-documenting features of pydantic; but we also get all the saving / loading / plotting / resampling functionality from xr.Dataset? One final thought: I do continue to worry about the naming. Say we had:

```python
class Batch(Example):
    pass
```

And then did this:

```python
def join_data_to_batch(data: List[Batch]) -> Batch:
    """ Join several single data items together to make a Batch """
```

Which is perhaps a bit confusing ("why are they taking a list of batches and outputting a single batch? I don't understand!"). Instead, we could do:

```python
def join_data_to_batch(data: List[Example]) -> Batch:
    """ Join several single data items together to make a Batch """
```
Yeah, we have to be a bit careful about re-inventing the wheel. Yeah, tricky. I was thinking of just having a 'to_dataset' function, and 'from_dataset'; that way we can use those features. Yeah, I like that suggestion of being clear with 'Batch', and perhaps 'DataItem'; 'Example' is a bit unclear for me.
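A minimal sketch of the 'to_dataset' / 'from_dataset' round-trip idea, with a plain dict standing in for xr.Dataset (the class name and fields here are hypothetical):

```python
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class GSP:
    """Hypothetical data-source object. In practice to_dataset/from_dataset
    would target xr.Dataset; a plain dict stands in for it here."""
    gsp_id: List[int]
    power: List[float]

    def to_dataset(self) -> dict:
        # Hand off to the dataset format for saving / plotting / resampling.
        return asdict(self)

    @classmethod
    def from_dataset(cls, ds: dict) -> "GSP":
        # Rebuild the validated object from the dataset format.
        return cls(**ds)

gsp = GSP(gsp_id=[1, 2], power=[0.5, 0.7])
assert GSP.from_dataset(gsp.to_dataset()) == gsp  # lossless round trip
```

The design point is that the structured object stays the source of truth, and the dataset form is only an interchange format for I/O and plotting.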
Yeah, I agree
Yeah, that sounds good! I honestly don't know if we can use that here.
Perhaps. I need to sketch it out, to try to visualise this a bit more: https://docs.google.com/presentation/d/1jabE_IVi5vWWxWG_dnMItnlwWQmXL2dmQf2gzL58w6c/edit#slide=id.p
Love this sketch! Very happy to go with whatever you think's best!
Thoughts @JackKelly? The good thing about this design: I'm hoping to do it so that the old batches are made obsolete.
Sounds good to me! One thought that's just popped into my head: Don't feel constrained to continue using a single NetCDF file for each batch. That was a fairly arbitrary design decision of mine. We could, for example, have separate files for each data source, for each batch. So, for example, on disk, within the I haven't really thought through this much. Some quick pros and cons... Cons:
Pros:
But, yeah, I really have no strong feelings either way about whether we continue to store entire batches in a single NetCDF file; or if we use one NetCDF file per batch per modality; or if we use different file formats. @jacobbieker any strong opinions?!
I think having them as separate files might make the most sense, although yeah, I don't have super strong opinions on it. Having them as separate files for each modality could also give some more flexibility: if we want to do #87 we could theoretically just add a second file with the wider context for the satellite imagery, without having to recreate all the rest of the files. If we split it up, it might make the most sense to have them in different formats, with GeoTIFF for satellite (topographic data is already stored like that), etc. Potentially, to fix the issue with small files for PV or certain modalities, we could store the batches in a database, like SQLite? That could be faster to load into memory and fast to query. I think having a single NetCDF file is nice in terms of it being just one large file to load, but I like the modularity of them being separate.
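The SQLite thought can be sketched with the standard library alone; everything here (table name, columns, toy payloads) is made up for illustration, not the project's actual schema:

```python
import sqlite3

# Hypothetical sketch: instead of many tiny files for a small modality such
# as PV, pack every example of every batch into one SQLite table and query it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pv (batch_idx INTEGER, example_idx INTEGER, payload BLOB)"
)
conn.executemany(
    "INSERT INTO pv VALUES (?, ?, ?)",
    [(0, i, bytes([i])) for i in range(4)],  # four toy PV examples in batch 0
)
rows = conn.execute(
    "SELECT example_idx FROM pv WHERE batch_idx = 0 ORDER BY example_idx"
).fetchall()
```

One file on disk holds many small records, and loading a batch becomes a single indexed query rather than many small-file reads.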
I also like the idea of not saving everything into one batch file.
Do you think that makes sense?
ooh, that's a really good point that having separate files means we don't have to recreate the whole thing if we only want to update one "modality". Nice! Although there would be a bit more complexity in the
Very good idea to keep our PRs as "bite-sized" as possible :)
Yeah, I agree, keeping the PRs small is good! As for keeping them in sync, yeah, that could be an issue, which makes it a bit more complicated, but we could read from the modality we are replacing to ensure it's the same? As in, as we update a modality, read in the current version to get the info we need to keep it spatially and temporally aligned, and then replace it with the updated version.
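A sketch of that read-then-replace idea, with JSON files standing in for NetCDF and made-up alignment keys (`t0_datetime`, `center_osgb` are hypothetical here):

```python
import json
import tempfile
from pathlib import Path

def update_modality(path: Path, new_values) -> None:
    """Replace one modality's file, but first read the existing version to
    carry over the spatial/temporal alignment info (hypothetical keys)."""
    current = json.loads(path.read_text())
    updated = {
        "t0_datetime": current["t0_datetime"],  # keep temporal alignment
        "center_osgb": current["center_osgb"],  # keep spatial alignment
        "data": new_values,                     # swap in the new payload
    }
    path.write_text(json.dumps(updated))

tmp_dir = Path(tempfile.mkdtemp())
modality_file = tmp_dir / "satellite_000.json"
modality_file.write_text(json.dumps({
    "t0_datetime": "2021-01-01T12:00",
    "center_osgb": [535000, 180000],
    "data": [1, 2],
}))

update_modality(modality_file, [9, 9])
reloaded = json.loads(modality_file.read_text())
```

The payload changes while the alignment metadata is copied forward, so the rewritten modality stays in sync with the untouched ones.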
Yeah, that could work! Saving each "modality" as a different set of files opens up the possibility to further modularise and de-couple
Any thoughts?
Yeah, I think that would be a good way to move forward with it!
The PR is ready for moving away from the Example dict to a pydantic object. It's a bit of a big one, but some of the changes are quite small. Although this does seem like hard work, I think it will set us up nicely to build on this, i.e. saving different ones, or even some nice plotting functions. Note I still need to sort out the scripts, but that shouldn't be too hard.
Great work! BTW, I've started a separate issue to discuss the idea of splitting different modalities into different files: #202
Move Example to Pydantic object
Detailed Description
The Example data (maybe renamed to 'BatchData') is getting a bit bigger, and lacks a bit of structure. It might be good to move it to something with a bit more structure. If we move to pydantic, then we can use its validation features.
We would have to check that it saves and compresses well.
Possible Implementation
BatchData