This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Add Optical Flow Data Source #314

Merged
merged 213 commits into main from jacob/optical-flow-datasource on Dec 6, 2021

Conversation

@jacobbieker (Member) commented Nov 1, 2021

Pull Request

Description

This adds an Optical Flow Data Source. A corresponding PR in nowcasting-dataloader is openclimatefix/nowcasting_dataloader#39.

This PR:

  • Computes optical flow for all datetimes up to t0
  • Generates future satellite images by applying optical flow from t0 through the forecast time
  • Adds a DerivedDataSource class for data sources that are derived from other ones
  • Updates the configuration model for OpticalFlowDataSource
  • Updates the prepare_ml script to call derived data sources
  • Updates the Manager to create derived batches
  • Fixes repeated log entries (#446)

Fixes #96
Fixes #446

How Has This Been Tested?

Unit tests


Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

@jacobbieker added the "enhancement" (New feature or request) and "data" (New data source or feature; or modification of existing data source) labels on Nov 1, 2021
@jacobbieker self-assigned this on Nov 1, 2021
@jacobbieker force-pushed the jacob/optical-flow-datasource branch 2 times, most recently from 17fb0fb to ef7589c on November 3, 2021 13:18
@jacobbieker marked this pull request as ready for review on November 3, 2021 15:56
@jacobbieker force-pushed the jacob/optical-flow-datasource branch from 682870a to 9ca6f47 on November 3, 2021 16:04
def test_get_example(optical_flow_data_source, x, y, left, right, top, bottom):  # noqa: D103
    optical_flow_data_source.open()
    t0_dt = pd.Timestamp("2019-01-01T13:00")
    sat_data = optical_flow_data_source.get_example(
Contributor comment:
rename variable?

assert top == sat_data.y.values[0]
assert bottom == sat_data.y.values[-1]
assert len(sat_data.x) == pytest.IMAGE_SIZE_PIXELS
assert len(sat_data.y) == pytest.IMAGE_SIZE_PIXELS
Contributor comment:

could add an assert of what shape '.data' is

@peterdudfield (Contributor) left a comment:

I'm sure you thought about it, but will this then load the sat data twice?
I.e. could this process be done in the satellite data source, so the data would only have to be loaded once?

The disadvantage is that it's not as modular then?

@jacobbieker (Member, Author)

I'm sure you thought about it, but will this then load the sat data twice? I.e. could this process be done in the satellite data source, so the data would only have to be loaded once?

The disadvantage is that it's not as modular then?

Yeah, this would load the satellite data twice; it is to keep it modular, though. The other main option, if we want things to be more modular, is to do all the processing in the dataloader instead, and not have it as a Data Source.

@peterdudfield (Contributor)

Yeah, this would load the satellite data twice; it is to keep it modular, though. The other main option, if we want things to be more modular, is to do all the processing in the dataloader instead, and not have it as a Data Source.

If we do it on the fly, does it take long? Roughly how long does one example take?

@jacobbieker (Member, Author)

If we do it on the fly, does it take long? Roughly how long does one example take?

I just ran it with timeit, as follows:

import timeit

setup = """
import numpy as np
import pandas as pd
from pathlib import Path
from nowcasting_dataset.data_sources import OpticalFlowDataSource

optical_flow_data_source = OpticalFlowDataSource(
    image_size_pixels=128,
    zarr_path="/home/jacob/Development/nowcasting_dataset/tests/data/sat_data.zarr",
    history_minutes=0,
    forecast_minutes=5,
    channels=("HRV",),
)

optical_flow_data_source.open()
"""

code = """
t0_dt = pd.Timestamp("2019-01-01T13:00")
optical_flow_data = optical_flow_data_source.get_example(
    t0_dt=t0_dt, x_meters_center=0, y_meters_center=0
)
"""

print(timeit.timeit(setup=setup, stmt=code, number=10000))

The output was 528 seconds for the 10,000 iterations, i.e. about 0.0528 seconds per example for computing the optical flow. So it might be fast enough to do on the fly: with 32 examples per batch, it would take about 1.69 seconds per batch in serial.

@peterdudfield (Contributor)


Thanks for doing that. That could potentially slow down the ML process; some of the models I was running took on the order of a second per batch, so this would double the training time.

What would it look like if you added it to the satellite data?
I'm a little worried that if it loads the satellite data twice, the prepare-data script's runtime would increase by about 50%, assuming sat and NWP take about the same time at the moment.

@jacobbieker (Member, Author) commented Nov 3, 2021

What would it look like if you added it to the satellite data? I'm a little worried that if it loads the satellite data twice, the prepare-data script's runtime would increase by about 50%, assuming sat and NWP take about the same time at the moment.

For just the actual calculation of the flow, it only takes about 0.005 seconds per pair of images, or roughly 0.16 seconds per 32-example batch, so if it is included as part of the satellite data source then yes, it might be a bit faster. It makes the output of the SatelliteDataSource less clean, though, as we'd have to put the flow somewhere in the output while still needing the actual future satellite images too. So if we don't want it as a separate data source, I would still be inclined to put it all in the dataloader.

@jacobbieker (Member, Author)

This was computed with:

import timeit

setup = """
import numpy as np
import cv2
"""

code = """
previous_image = np.random.random((128, 128))
t0_image = np.random.random((128, 128))
cv2.calcOpticalFlowFarneback(
    prev=previous_image,
    next=t0_image,
    flow=None,
    pyr_scale=0.5,
    levels=2,
    winsize=40,
    iterations=3,
    poly_n=5,
    poly_sigma=0.7,
    flags=cv2.OPTFLOW_FARNEBACK_GAUSSIAN,
)
"""

result = timeit.timeit(setup=setup, stmt=code, number=10000)
print(result)
print(result / 10000)

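For context on the "applying optical flow from t0 to forecast time" step described above: once the flow field is computed, the t0 image is warped forward along it. Here is a minimal numpy-only sketch of warping with a dense flow field; the function name and the nearest-neighbour sampling are illustrative assumptions, not the PR's actual implementation (which uses OpenCV's interpolating remap).

```python
import numpy as np


def warp_image(image: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp `image` forward by a dense flow field.

    image: (H, W) array; flow: (H, W, 2) array of per-pixel (dx, dy)
    displacements, as produced by cv2.calcOpticalFlowFarneback.
    Nearest-neighbour sampling is used here for simplicity, where
    cv2.remap would interpolate.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # For each output pixel, sample the source pixel the flow points from.
    src_x = np.clip(np.rint(xs - flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys - flow[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]


# A uniform flow of (+2, 0) shifts image content 2 pixels to the right.
img = np.zeros((8, 8))
img[4, 3] = 1.0
flow = np.zeros((8, 8, 2))
flow[..., 0] = 2.0
warped = warp_image(img, flow)
```

Applying the flow repeatedly (or scaling it) extrapolates further into the forecast horizon, at the cost of growing border regions with no source data.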
@peterdudfield (Contributor)


With xr.Dataset you can have lots of data variables. At the moment we have 'data', but we could also have 'optical_flow'.

See Sun, which has azimuth and elevation in it.

Yeah, I agree: either in the satellite data source, or the dataloader.

@jacobbieker (Member, Author) commented Nov 4, 2021


Okay, yeah, I think I'd then go with updating the dataloader. If the optical flow is tied to the satellite data, then changing how the flow is computed means remaking all the satellite data. Instead, saving the satellite data as a slightly oversized image, which we then crop in the dataloader after computing optical flow, might be easier and a bit more flexible, while not taking too much longer than if we just precomputed it all.

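The oversized-then-crop idea can be sketched as follows; the margin size and helper name here are illustrative assumptions, not values from the PR. Warping pushes undefined data in at the borders, so saving a margin and cropping the centre after the flow step discards those artifacts:

```python
import numpy as np


def crop_center(image: np.ndarray, out_size: int) -> np.ndarray:
    """Crop the central out_size x out_size window from an oversized image.

    Saving images with a margin and cropping after flow-based warping
    means the border pixels, where the warp has no source data, are
    discarded rather than ending up in the training example.
    """
    h, w = image.shape
    top = (h - out_size) // 2
    left = (w - out_size) // 2
    return image[top:top + out_size, left:left + out_size]


# e.g. save 160x160 images so a clean 128x128 crop survives warping
oversized = np.arange(160 * 160, dtype=np.float32).reshape(160, 160)
cropped = crop_center(oversized, 128)
```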
@jacobbieker (Member, Author) commented Nov 4, 2021


So for this, we'd need to change the SatelliteDataSource to save out images with extra area, possibly combined with the extra spatial extent of #87? The next dataset could then have an extra-large satellite area, which has the dual benefit of allowing optical flow to work across the whole central image and including satellite data from farther away from the center. But it would also increase the batch file size. What do you think @peterdudfield?

@jacobbieker (Member, Author)


There is another benefit to using the original projected EUMETSAT data: I think that to cover the same amount of area the image is smaller, so with the new dataset it might actually not make the file size any bigger.

@peterdudfield (Contributor)


Yeah, it seems sensible to put it in the dataloader then.
For #87, I thought you would keep a batch region of satellite data, and then just add satellite imagery (perhaps at lower resolution) for the wider context.

How much bigger were you thinking of saving the satellite images? I think at the moment they are 64 by 64. How much bigger do we need them for optical flow to work?

@JackKelly (Member)

This discussion sounds good!

Like @peterdudfield, I'm a little worried about adding 1.6 seconds per batch to the data_loader...

Can we compute the optical flow for each example in parallel across multiple CPU cores?

A while ago I wrote some code to compute optical flow in parallel, using Python 3.8's inter-process shared memory feature (otherwise we get slowed down a lot by pickling large image sequences). I'll see if I can dig it out...

@JackKelly (Member)

See the compute_optical_flow function in this notebook; it computes optical flow in parallel using SharedMemoryManager: https://github.com/openclimatefix/predict_pv_yield/blob/main/notebooks/16_maxpool.ipynb

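For readers unfamiliar with the feature mentioned above, here is a minimal sketch of the multiprocessing.shared_memory mechanics (Python 3.8+) that make zero-copy sharing of image sequences possible. It shows a single-process round-trip through a named shared block rather than the full parallel pipeline in the linked notebook; the variable names are illustrative.

```python
from multiprocessing import shared_memory

import numpy as np

# Image sequence we want worker processes to read without pickling it.
images = np.random.default_rng(0).random((12, 128, 128)).astype(np.float32)

# Allocate a named shared-memory block and copy the array into it.
shm = shared_memory.SharedMemory(create=True, size=images.nbytes)
shared_view = np.ndarray(images.shape, dtype=images.dtype, buffer=shm.buf)
shared_view[:] = images

# A worker process would attach by name instead of receiving a pickle;
# here we attach in the same process just to show the round-trip.
attached = shared_memory.SharedMemory(name=shm.name)
arr = np.ndarray(images.shape, dtype=images.dtype, buffer=attached.buf)
same = bool(np.array_equal(arr, images))

# Clean up: release the numpy views before closing, then unlink once.
del arr
attached.close()
del shared_view
shm.close()
shm.unlink()
```

In the real pipeline, the workers would receive only `shm.name` plus the array shape and dtype, attach, compute flow for their slice of examples, and write results into a second shared block.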
@peterdudfield (Contributor) left a comment:

Looks very good, Jacob.

I think it's good keeping it as a DataSource, i.e. not a derived data source. This keeps things nice and simple for the moment.

There are a few minor points, and I'll do a few of them (I'll start at the bottom).

@peterdudfield peterdudfield merged commit dc0e383 into main Dec 6, 2021
@peterdudfield peterdudfield deleted the jacob/optical-flow-datasource branch December 6, 2021 14:08
Development

Successfully merging this pull request may close these issues: "Some log entries are repeated" and "Optical flow: Predict future PV yield".

6 participants