-
-
Notifications
You must be signed in to change notification settings - Fork 6
Conversation
17fb0fb
to
ef7589c
Compare
nowcasting_dataset/data_sources/optical_flow/optical_flow_data_source.py
Outdated
Show resolved
Hide resolved
nowcasting_dataset/data_sources/optical_flow/optical_flow_data_source.py
Outdated
Show resolved
Hide resolved
nowcasting_dataset/data_sources/satellite/satellite_data_source.py
Outdated
Show resolved
Hide resolved
682870a
to
9ca6f47
Compare
def test_get_example(optical_flow_data_source, x, y, left, right, top, bottom): # noqa: D103 | ||
optical_flow_data_source.open() | ||
t0_dt = pd.Timestamp("2019-01-01T13:00") | ||
sat_data = optical_flow_data_source.get_example( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename variable?
assert top == sat_data.y.values[0] | ||
assert bottom == sat_data.y.values[-1] | ||
assert len(sat_data.x) == pytest.IMAGE_SIZE_PIXELS | ||
assert len(sat_data.y) == pytest.IMAGE_SIZE_PIXELS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could add an assert of what shat '.data' is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im sure you thught about it, but will this then load the sat data twice?
I.e could this process be done in the satellite data source, then the data would only have be loaded once?
The dis advantage is that is not as module then?
Yeah, this would load the satellite data twice, it is to keep it modular, though. The main other option I think if we want things to be more modular is to do all the processing in dataloader instead, and not have it as a Data Source. |
If we do on the fly, does it take long? roughly how long does one example take? |
I just ran it with timeit as the following:
And got this as an output 528 seconds to do the 10000 iterations, or 0.05282874276 seconds per example for computing the optical flow. So might be fast enough for doing on the fly, if we go with 32 examples per batch, in serial it would be 1.69 seconds |
Thanks for doing that, - that could potentially slow down ML process. I think some of the models I was running on the order a second for a batch, so this would double the training time. What would it look like if you added it to satellite data? |
So for just the actual calculation itself of computing the flow, that only takes 0.005079008183200131 seconds per pair of images, or roughly 0.16 seconds per 32 example batch, so if it is included as part of the satellite data source then yeah, it might be a bit faster. It makes the output of the SatelliteDataSource not as nice though, as we'd have to put the data somewhere for the output, while we would still need the actual future satellite images too. So if we don't want it as a separate data source, I would still be inclined to then put it all in the dataloader |
This was computed with
|
With xr. Dataset you can have lots of data varaibles, See Sun, this has azimuth and elevation in it. yea, I agree, either with satelite data source, or dataloader |
Okay, yeah, I think I'd then go with updating the dataloader. If the optical flow is tied to the satellite data, if we want to change how the flow is computed we then have to remake all the satellite data. Instead, saving the satellite data as a slightly oversized image, which we then crop in the dataloader after computing optical flow, might be easier and a bit more flexible, while not taking too much longer than if we just precomputed it all |
So then for this, we'd need to change the SatelliteDataSource to save out images with extra area, possibly combined with the extra spatial extant of #87? So then the next dataset could have an extra large satellite area, which then has the dual benefit of allowing optical flow to work across the whole central image, and include satellite data from father away from te center. But this would also increase the batch file size. What do you think @peterdudfield? |
There is another benefit to using the original projected EUMETSAT data, I think to cover the same amount of area, the image is smaller, so with the new dataset, it might actually not make the file size any bigger. |
Yea, seems sensible to put it in the dataloader then. How much bigger were you thinking of saving the satellite images? I think at the moment they are 64 by 64? Or How much bigger do we need them for optical flow to work? |
This discussion sounds good! Like @peterdudfield I'm a little worried about adding 1.6 seconds per batch to data_loader... Can we compute the optical flow for each example in parallel across multiple CPU cores? A while ago, I wrote some code to compute optical flow in parallel, using Python 3.8's inter-process shared memory feature (otherwise we get slowed down a lot by pickling large image sequences). It'll see if I can dig it out... |
See the |
nowcasting_dataset/data_sources/optical_flow/optical_flow_data_source.py
Show resolved
Hide resolved
nowcasting_dataset/data_sources/optical_flow/optical_flow_data_source.py
Show resolved
Hide resolved
nowcasting_dataset/data_sources/optical_flow/optical_flow_data_source.py
Show resolved
Hide resolved
nowcasting_dataset/data_sources/optical_flow/optical_flow_data_source.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good Jack>
I think its good keeping it as a DataSource - i.e not a dervived data source. This keeps things nice and simple at the moment.
There's a few minor points, and I'll do a few of them. (ill start at the bottom)
# Conflicts: # nowcasting_dataset/config/model.py # tests/test_manager.py
Pull Request
Description
This adds an Optical Flow Data Source. A corresponding PR in nowasting-dataloader is in openclimatefix/nowcasting_dataloader#39
This PR:
prepare_ml
script to call derived data sourcesFixes #96
Fixes #446
How Has This Been Tested?
Unit tests
Checklist: