This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Add Optical Flow Data Source #314

Merged
merged 213 commits into main from jacob/optical-flow-datasource on Dec 6, 2021

Conversation

@jacobbieker (Member) commented Nov 1, 2021

Pull Request

Description

This adds an Optical Flow Data Source. A corresponding PR in nowcasting-dataloader is openclimatefix/nowcasting_dataloader#39.

This PR:

  • Computes optical flow for all datetimes up to t0
  • Generates future satellite images by applying optical flow from t0 through the forecast time
  • Adds a DerivedDataSource class for data sources that are derived from other ones
  • Updates the configuration model for OpticalFlowDataSource
  • Updates the prepare_ml script to call derived data sources
  • Updates the Manager to create derived batches
  • Fixes repeated log entries (#446)

Fixes #96
Fixes #446

How Has This Been Tested?

Unit tests


Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

@jacobbieker added the "enhancement" (New feature or request) and "data" (New data source or feature; or modification of existing data source) labels on Nov 1, 2021
@jacobbieker self-assigned this on Nov 1, 2021
@jacobbieker force-pushed the jacob/optical-flow-datasource branch 2 times, most recently from 17fb0fb to ef7589c on November 3, 2021 13:18
@jacobbieker marked this pull request as ready for review on November 3, 2021 15:56
@jacobbieker force-pushed the jacob/optical-flow-datasource branch from 682870a to 9ca6f47 on November 3, 2021 16:04
def test_get_example(optical_flow_data_source, x, y, left, right, top, bottom):  # noqa: D103
    optical_flow_data_source.open()
    t0_dt = pd.Timestamp("2019-01-01T13:00")
    sat_data = optical_flow_data_source.get_example(
Contributor comment:
rename variable?

assert top == sat_data.y.values[0]
assert bottom == sat_data.y.values[-1]
assert len(sat_data.x) == pytest.IMAGE_SIZE_PIXELS
assert len(sat_data.y) == pytest.IMAGE_SIZE_PIXELS
Contributor comment:

could add an assert of what shape '.data' is

@peterdudfield (Contributor) left a comment:

I'm sure you thought about it, but will this then load the sat data twice?
I.e. could this process be done in the satellite data source, so the data would only have to be loaded once?

The disadvantage is that it's not as modular then?

@jacobbieker (Member, Author)

I'm sure you thought about it, but will this then load the sat data twice? I.e. could this process be done in the satellite data source, so the data would only have to be loaded once?

The disadvantage is that it's not as modular then?

Yeah, this would load the satellite data twice; it is to keep it modular, though. The other main option, if we want things to be more modular, is to do all the processing in the dataloader instead, and not have it as a Data Source.

@peterdudfield (Contributor)

Yeah, this would load the satellite data twice; it is to keep it modular, though. The other main option, if we want things to be more modular, is to do all the processing in the dataloader instead, and not have it as a Data Source.

If we do it on the fly, does it take long? Roughly how long does one example take?

@jacobbieker (Member, Author)

If we do it on the fly, does it take long? Roughly how long does one example take?

I just ran it with timeit, as follows:

import timeit

setup = """
import numpy as np
import pandas as pd
from pathlib import Path
from nowcasting_dataset.data_sources import OpticalFlowDataSource

optical_flow_data_source = OpticalFlowDataSource(
    image_size_pixels=128,
    zarr_path="/home/jacob/Development/nowcasting_dataset/tests/data/sat_data.zarr",
    history_minutes=0,
    forecast_minutes=5,
    channels=("HRV",),
)

optical_flow_data_source.open()
"""

code = """
t0_dt = pd.Timestamp("2019-01-01T13:00")
optical_flow_data = optical_flow_data_source.get_example(
    t0_dt=t0_dt, x_meters_center=0, y_meters_center=0
)
"""

print(timeit.timeit(setup=setup, stmt=code, number=10000))

The output was 528 seconds for the 10,000 iterations, i.e. about 0.0528 seconds per example for computing the optical flow. So it might be fast enough to do on the fly: with 32 examples per batch, it would take about 1.69 seconds per batch in serial.

@peterdudfield (Contributor)


Thanks for doing that. That could potentially slow down the ML process; some of the models I was running took on the order of a second per batch, so this would double the training time.

What would it look like if you added it to the satellite data?
I'm a little worried that if it loads the satellite data twice, the prepare-data script's runtime would increase by about 50%, assuming sat and NWP take about the same time at the moment.

@jacobbieker (Member, Author) commented Nov 3, 2021

What would it look like if you added it to the satellite data? I'm a little worried that if it loads the satellite data twice, the prepare-data script's runtime would increase by about 50%, assuming sat and NWP take about the same time at the moment.

For just the actual calculation of the flow, it only takes about 0.005 seconds per pair of images, or roughly 0.16 seconds per 32-example batch, so if it is included as part of the satellite data source then yes, it might be a bit faster. It makes the output of the SatelliteDataSource less clean, though, as we'd have to put the flow somewhere in the output while still needing the actual future satellite images too. So if we don't want it as a separate data source, I would still be inclined to put it all in the dataloader.

@jacobbieker (Member, Author)

This was computed with:

import timeit

setup = """
import numpy as np
import cv2
"""

code = """
previous_image = np.random.random((128, 128))
t0_image = np.random.random((128, 128))
cv2.calcOpticalFlowFarneback(
    prev=previous_image,
    next=t0_image,
    flow=None,
    pyr_scale=0.5,
    levels=2,
    winsize=40,
    iterations=3,
    poly_n=5,
    poly_sigma=0.7,
    flags=cv2.OPTFLOW_FARNEBACK_GAUSSIAN,
)
"""

result = timeit.timeit(setup=setup, stmt=code, number=10000)
print(result)
print(result / 10000)

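For context on the "applying optical flow from t0 to forecast time" step described above: once the flow field is computed, the t0 image is warped forward along it. Here is a minimal numpy-only sketch of warping with a dense flow field; the function name and the nearest-neighbour sampling are illustrative assumptions, not the PR's actual implementation (which uses OpenCV's interpolating remap).

```python
import numpy as np


def warp_image(image: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp `image` forward by a dense flow field.

    image: (H, W) array; flow: (H, W, 2) array of per-pixel (dx, dy)
    displacements, as produced by cv2.calcOpticalFlowFarneback.
    Nearest-neighbour sampling is used here for simplicity, where
    cv2.remap would interpolate.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # For each output pixel, sample the source pixel the flow points from.
    src_x = np.clip(np.rint(xs - flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys - flow[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]


# A uniform flow of (+2, 0) shifts image content 2 pixels to the right.
img = np.zeros((8, 8))
img[4, 3] = 1.0
flow = np.zeros((8, 8, 2))
flow[..., 0] = 2.0
warped = warp_image(img, flow)
```

Applying the flow repeatedly (or scaling it) extrapolates further into the forecast horizon, at the cost of growing border regions with no source data.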
@peterdudfield (Contributor)


With xr.Dataset you can have lots of data variables. At the moment we have 'data', but we could also have 'optical_flow'.

See Sun, which has azimuth and elevation in it.

Yeah, I agree: either in the satellite data source, or the dataloader.

@jacobbieker (Member, Author) commented Nov 4, 2021


Okay, yeah, I think I'd then go with updating the dataloader. If the optical flow is tied to the satellite data, then changing how the flow is computed means remaking all the satellite data. Instead, saving the satellite data as a slightly oversized image, which we then crop in the dataloader after computing optical flow, might be easier and a bit more flexible, while not taking too much longer than if we just precomputed it all.

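The oversized-then-crop idea can be sketched as follows; the margin size and helper name here are illustrative assumptions, not values from the PR. Warping pushes undefined data in at the borders, so saving a margin and cropping the centre after the flow step discards those artifacts:

```python
import numpy as np


def crop_center(image: np.ndarray, out_size: int) -> np.ndarray:
    """Crop the central out_size x out_size window from an oversized image.

    Saving images with a margin and cropping after flow-based warping
    means the border pixels, where the warp has no source data, are
    discarded rather than ending up in the training example.
    """
    h, w = image.shape
    top = (h - out_size) // 2
    left = (w - out_size) // 2
    return image[top:top + out_size, left:left + out_size]


# e.g. save 160x160 images so a clean 128x128 crop survives warping
oversized = np.arange(160 * 160, dtype=np.float32).reshape(160, 160)
cropped = crop_center(oversized, 128)
```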
@jacobbieker (Member, Author) commented Nov 4, 2021


So for this, we'd need to change the SatelliteDataSource to save out images with extra area, possibly combined with the extra spatial extent of #87? The next dataset could then have an extra-large satellite area, which has the dual benefit of allowing optical flow to work across the whole central image and including satellite data from farther away from the center. But it would also increase the batch file size. What do you think @peterdudfield?

@jacobbieker (Member, Author)


There is another benefit to using the original projected EUMETSAT data: I think that to cover the same amount of area the image is smaller, so with the new dataset it might actually not make the file size any bigger.

@peterdudfield (Contributor)


Yeah, it seems sensible to put it in the dataloader then.
For #87, I thought you would keep a batch region of satellite data, and then just add satellite imagery (perhaps at lower resolution) for the wider context.

How much bigger were you thinking of saving the satellite images? I think at the moment they are 64 by 64. How much bigger do we need them for optical flow to work?

@JackKelly (Member)

This discussion sounds good!

Like @peterdudfield, I'm a little worried about adding 1.6 seconds per batch to the data_loader...

Can we compute the optical flow for each example in parallel across multiple CPU cores?

A while ago I wrote some code to compute optical flow in parallel, using Python 3.8's inter-process shared memory feature (otherwise we get slowed down a lot by pickling large image sequences). I'll see if I can dig it out...

@JackKelly (Member)

See the compute_optical_flow function in this notebook; it computes optical flow in parallel using SharedMemoryManager: https://github.com/openclimatefix/predict_pv_yield/blob/main/notebooks/16_maxpool.ipynb

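For readers unfamiliar with the feature mentioned above, here is a minimal sketch of the multiprocessing.shared_memory mechanics (Python 3.8+) that make zero-copy sharing of image sequences possible. It shows a single-process round-trip through a named shared block rather than the full parallel pipeline in the linked notebook; the variable names are illustrative.

```python
from multiprocessing import shared_memory

import numpy as np

# Image sequence we want worker processes to read without pickling it.
images = np.random.default_rng(0).random((12, 128, 128)).astype(np.float32)

# Allocate a named shared-memory block and copy the array into it.
shm = shared_memory.SharedMemory(create=True, size=images.nbytes)
shared_view = np.ndarray(images.shape, dtype=images.dtype, buffer=shm.buf)
shared_view[:] = images

# A worker process would attach by name instead of receiving a pickle;
# here we attach in the same process just to show the round-trip.
attached = shared_memory.SharedMemory(name=shm.name)
arr = np.ndarray(images.shape, dtype=images.dtype, buffer=attached.buf)
same = bool(np.array_equal(arr, images))

# Clean up: release the numpy views before closing, then unlink once.
del arr
attached.close()
del shared_view
shm.close()
shm.unlink()
```

In the real pipeline, the workers would receive only `shm.name` plus the array shape and dtype, attach, compute flow for their slice of examples, and write results into a second shared block.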
@peterdudfield (Contributor) left a comment:

Looks very good, Jacob.

I think it's good keeping it as a DataSource, i.e. not a derived data source. This keeps things nice and simple for the moment.

There are a few minor points, and I'll do a few of them (I'll start at the bottom).

@peterdudfield peterdudfield merged commit dc0e383 into main Dec 6, 2021
@peterdudfield peterdudfield deleted the jacob/optical-flow-datasource branch December 6, 2021 14:08
Development

Successfully merging this pull request may close these issues: "Some log entries are repeated" and "Optical flow: Predict future PV yield".

6 participants