Can we simplify the code by always keeping the data in one data type (e.g. xr.DataArray) per modality? #209
Comments
And then, I wonder if we could do `class DataSourceOutput(pydantic.BaseModel, xr.Dataset):`? I haven't tested this, so I have no idea if it'll actually work!
Before we can fully implement this, #86 probably needs to be implemented.
@JackKelly Although I do not fully understand the use case, I would suggest checking out mixin classes in Python: https://www.thedigitalcatonline.com/blog/2020/03/27/mixin-classes-in-python/
Looks like you would still need a `to_tensor()` function that moves the xr datasets to tensors. Unless there is a better way?
Yes, I agree, we'd definitely still need a `to_tensor()` function. I've been updating issue #25 with some notes about that; and your approach looks good to me, @peterdudfield!
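As a rough sketch of what such a helper might look like (illustrative only, not the project's actual code; it just pulls each data variable of an `xr.Dataset` out as a `torch.Tensor`):

```python
from typing import Dict

import numpy as np
import torch
import xarray as xr


def to_tensor(dataset: xr.Dataset) -> Dict[str, torch.Tensor]:
    """Convert each data variable of an xr.Dataset into a torch.Tensor."""
    return {
        name: torch.from_numpy(np.ascontiguousarray(dataset[name].values))
        for name in dataset.data_vars
    }


# Tiny usage example with made-up satellite-like data:
ds = xr.Dataset({"data": (("time", "y", "x"), np.zeros((2, 4, 4), dtype=np.float32))})
print(to_tensor(ds)["data"].shape)  # torch.Size([2, 4, 4])
```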
Looks good! Another simplification that we might be able to make is to only represent batches (#212). In other words, we could drop the idea of representing individual "examples" 🙂. I haven't fully thought this through, but I think it was only necessary to represent individual examples back when we wanted `nowcasting_dataset` to feed data on-the-fly into a PyTorch model. So, in your ipython notebook, we might be able to drop `Example`. Does that sound OK? Or have I gotten myself confused?!
That does make sense, yea. I would expect each 'datasource' to make a Batch object and save it to a file. So in that case 'Satellite' could be ignored. I agree `Example` won't really be needed in this ML pipeline.
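For illustration, a per-modality batch saved straight to NetCDF might look something like this (the variable layout and file name are invented; `to_netcdf` is standard xarray):

```python
import numpy as np
import xarray as xr

# Hypothetical satellite batch: dims are (example, time, y, x)
satellite_batch = xr.Dataset(
    {"data": (("example", "time", "y", "x"), np.zeros((4, 12, 8, 8), dtype=np.float32))}
)
satellite_batch.to_netcdf("satellite_batch_000.nc")  # one file per batch, per modality
```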
Looks good! And, great point about still needing individual examples for plotting - I'd forgotten about plotting 🙂

BTW, in that diagram, I wonder if we can keep things super-simple and have:

```python
class Satellite(xr.Dataset):
    # Define the validation methods required by Pydantic...
```

And that's about it!

```python
class Batch(BaseModel):
    satellite: Satellite
    # ....
```
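A minimal sketch of how that could look with pydantic v1's custom-type hook (the `__slots__` line keeps xarray happy when subclassing, and the non-negativity check is just a stand-in for whatever validation the real `Satellite` class would need):

```python
import numpy as np
import pydantic
import xarray as xr


class Satellite(xr.Dataset):
    __slots__ = ()  # xarray asks subclasses to declare __slots__

    # pydantic (v1) calls this hook when Satellite is used as a field type
    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, v):
        if not isinstance(v, xr.Dataset):
            raise TypeError("expected an xr.Dataset")
        if any(float(v[name].min()) < 0 for name in v.data_vars):
            raise ValueError("satellite values must not be negative")
        return v


class Batch(pydantic.BaseModel):
    satellite: Satellite


sat = Satellite({"data": (("time", "y", "x"), np.zeros((2, 4, 4), dtype=np.float32))})
batch = Batch(satellite=sat)  # validation runs here
```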
It just depends where we want to do our validation of the satellite data. I was thinking we do it in the Satellite class, i.e. check all values are integers, or none are -1. I thought it needs to be a BaseModel for that to take place, or maybe the Batch can take care of it. But it would be good to validate the satellite data before we save it to NetCDF.
I totally agree that the validation should live in the Satellite class. Don't worry, I'm 99% sure that's possible, and I got it working in this notebook. To be specific: in cell 3, ... (That notebook is a bit out-of-date because it still has an ...)
Yea, but .... hmmm https://pydantic-docs.helpmanual.io/usage/types/#classes-with-__get_validators__

```python
class Model(BaseModel):
    post_code: PostCode
```

where `PostCode` is made into a pydantic model, extended from a string.
Ah, yes, sorry, I forgot to explain: if we define

```python
class Batch(BaseModel):
    satellite: Optional[Satellite]
    nwp: Optional[NWP]
    # ....
```

then we can call

```python
batch = Batch(satellite=satellite)
```

and leave out all the other fields. It's a tiny bit of a hack, perhaps, but it does mean that we can define ...

BTW, I'm happy to tackle this issue (hopefully later this week) as part of #213
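In other words, something along these lines (a sketch only; `arbitrary_types_allowed` stands in here for the per-modality classes with their own validators):

```python
from typing import Optional

import pydantic
import xarray as xr


class Batch(pydantic.BaseModel):
    # One optional field per modality, so a Batch can be built from any subset of them
    satellite: Optional[xr.Dataset] = None
    nwp: Optional[xr.Dataset] = None

    class Config:
        arbitrary_types_allowed = True  # plain xr.Dataset has no pydantic validators


batch = Batch(satellite=xr.Dataset())  # nwp is simply left as None
```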
Sorry, should have commented here. @JackKelly wrote:

> This design looks great to me! Please do implement this now (in a branch other than main). But, unless I've misunderstood, the code in main (as it stands right now) is incompatible with this design, because, right now, the code in main forces each modality to change data type multiple times as it passes through the code (e.g. satellite data starts as an xarray object, then numpy, then a tensor, and finally into the pytorch DataLoader). So Pydantic may complain when Batch.satellite stops being an instance of class Satellite(xr.Dataset) and gets turned into a numpy array, for example. (Or maybe it'll be fine as long as the satellite data is an instance of class Satellite(xr.Dataset) at the time that Batch is instantiated: maybe Pydantic won't notice if, some time after Batch is instantiated, Batch.satellite subsequently changes to a numpy array? Does Pydantic only run its validation checks when Batch is instantiated?) I plan to rip out the pytorch DataLoader over the next few days. So, hopefully, by the end of the week the code will be fully compatible with your design! So, yeah, if you go ahead and implement it in a new branch, while this is on your mind, then hopefully we can merge it later in the week once I've completely finished #86 and have implemented the design I sketched out in #213 (comment) above 🙂. Or have I misunderstood? 🙂

***** response *****

Yea, I will branch off. Yea, I think in 'main', ...
My feeling is this can be done in parallel to #213 (comment). I think pydantic validation runs any time something is changed, i.e. to make sure the user is doing correct things with the object.
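For what it's worth, pydantic (v1) normally validates only when the model is instantiated; re-validating on attribute assignment is opt-in via `validate_assignment`. A tiny sketch (the field is made up purely to show the behaviour):

```python
import pydantic


class Batch(pydantic.BaseModel):
    n_examples: int  # hypothetical field, just to illustrate

    class Config:
        validate_assignment = True  # without this, only __init__ is validated


batch = Batch(n_examples=32)        # validated here, at instantiation
batch.n_examples = "not a number"   # raises ValidationError thanks to validate_assignment
```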
Cool beans, all sounds great! Thank you! I'm genuinely excited to see ...
BTW, do you think we should remove `time_30`? What do you think?
Yea, I think we can get rid of `time_30`; we only needed it before because we created one big xr.Dataset.
In the past, `nowcasting_dataset` was designed to feed data on-the-fly into a PyTorch model. Which meant that, as the data flowed through `nowcasting_dataset`, the data would change type: for example, satellite data would start as an `xr.DataArray`, then get turned into a `numpy` array (because PyTorch doesn't know what to do with an `xr.DataArray`), and then get turned into a `torch.Tensor`.

But, I think we can safely say now that `nowcasting_dataset` is just for pre-preparing batches (not for loading data on-the-fly). As such, we can probably simplify the code by keeping data in a single container type per modality. For example, satellite data could always live in an `xr.DataArray` for its entire life while flowing through `nowcasting_dataset`.

Sorry, I really should've thought of this earlier! But, yeah, I think this could simplify the code quite a lot.
I haven't fully thought through the implications of this, but some changes might be:

- We can probably remove the `Array` type (which is a `Union` of types). So, for example, instead of `sat_data: Array = Field(...)` we can just do `sat_data: xr.DataArray = Field(...)`.
- We can probably remove the `to_numpy` function.
- `seq_length = len(sat_data[-4])` becomes `seq_length = len(sat_data.time)` (see the sketch below).
- We can probably remove the `to_xr_dataset` and `from_xr_dataset` methods (which are quite fiddly).
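For example, the `seq_length` point above could look like this (a made-up `sat_data`, purely to illustrate reading the length off the labelled `time` dimension):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical satellite DataArray with dims (time, y, x)
sat_data = xr.DataArray(
    np.zeros((12, 64, 64), dtype=np.float32),
    dims=("time", "y", "x"),
    coords={"time": pd.date_range("2021-01-01", periods=12, freq="5min")},
)

seq_length = len(sat_data.time)  # 12, taken straight from the labelled dimension
```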