If you want to measure anything that has to do with the spatial relationship of features, you eventually run into the issue of needing to go across chunk boundaries (assuming a spatially chunked ddf). A typical example is a spatial lag, i.e. the mean of values on neighbouring (touching) features.
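To illustrate what I mean by a spatial lag, here is a minimal single-machine sketch with geopandas and libpysal (the input file and the `value` column are just placeholders):

```python
# Minimal sketch of a spatial lag on an in-memory GeoDataFrame.
# "buildings.gpkg" and the "value" column are placeholder names.
import geopandas
from libpysal import weights

gdf = geopandas.read_file("buildings.gpkg")

# contiguity ("touching") neighbours
w = weights.Queen.from_dataframe(gdf)
w.transform = "r"  # row-standardise, so the lag is the mean over neighbours

gdf["value_lag"] = weights.lag_spatial(w, gdf["value"])
```

This works only because the whole dataset is in memory; with a spatially chunked ddf, features on a chunk edge are missing some of their neighbours.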
The way raster analysis deals with this is by using dask.array.map_overlap. That essentially copies a bit of neighbouring data to each chunk to create overlaps, so you can do your spatial lag fully within a chunk. See https://docs.dask.org/en/latest/array-overlap.html
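For comparison, a small example of how map_overlap is used on a dask array (the focal-mean filter is just an illustrative choice):

```python
# Each chunk is extended by `depth` cells copied from neighbouring chunks,
# the function runs on the extended chunk, and the extra cells are trimmed.
import dask.array as da
from scipy.ndimage import uniform_filter

x = da.random.random((1000, 1000), chunks=(250, 250))

# 3x3 focal mean; depth=1 copies one row/column of cells from each neighbour
smoothed = x.map_overlap(uniform_filter, size=3, depth=1, boundary="reflect")
result = smoothed.compute()
```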
I believe that we need an analogue of map_overlap for vector data. It is naturally a significantly more complex issue since we do not know what the data look like (they are not on a grid). But I believe it is doable and could be a massive game-changer. For example, I would be able to base 90% of momepy on dask-geopandas.
The trick is to define which features should be overlapping. For that, you need to know how far you have to go for each particular operation, but we can specify:
- a distance threshold (everything within n meters from the chunk boundary) - see the sketch after this list
- a topological threshold (everything within n steps of contiguity)
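As a rough sketch of the distance-threshold idea (not an existing dask-geopandas API), each spatial partition could be extended with a "halo" of features from the rest of the dataset that fall within the threshold of the partition's bounds; `partition`, `full_gdf` and `distance` are hypothetical names used only for illustration:

```python
# Illustration only: extend a spatial partition with features that
# intersect its bounding box grown by the distance threshold.
import pandas
import shapely

def expand_partition(partition, full_gdf, distance):
    # bounding box of the partition, grown by the distance threshold
    halo = shapely.box(*partition.total_bounds).buffer(distance)
    # spatial-index query for features intersecting the grown box
    hits = full_gdf.iloc[full_gdf.sindex.query(halo, predicate="intersects")]
    # append the halo features, keeping each feature only once
    combined = pandas.concat([partition, hits])
    return combined[~combined.index.duplicated()]
```

After the per-chunk computation, the halo features would be trimmed again, mirroring what map_overlap does for arrays.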
I have actually already tested this approach with a topological threshold, using custom single-core functions, and it works well.
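Purely to illustrate the topological variant (this is not the code referred to above), the halo could be built by walking n steps of contiguity outwards from the features touching the chunk boundary; `gdf`, `boundary_ids` and `steps` are hypothetical names:

```python
# Illustration only: collect features within `steps` rings of Queen
# contiguity around a set of seed features (e.g. those on a chunk edge).
from libpysal import weights

def contiguity_halo(gdf, boundary_ids, steps):
    w = weights.Queen.from_dataframe(gdf)
    selected = set(boundary_ids)
    frontier = set(boundary_ids)
    for _ in range(steps):
        # neighbours of the current frontier that are not yet selected
        frontier = {n for i in frontier for n in w.neighbors[i]} - selected
        selected |= frontier
    return gdf.loc[list(selected)]
```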
We obviously first need spatial re-chunking and spatial indexing, but this is something I'd like to put on a roadmap (maybe for GSoC?).
- dask.array implementation - https://docs.dask.org/en/latest/array-overlap.html?highlight=map_overlap#dask.array.map_overlap
- dask.dataframe implementation - https://docs.dask.org/en/latest/dataframe-api.html?highlight=map_overlap#dask.dataframe.DataFrame.map_overlap