(algorithms)=
# Parallel Algorithms

`flox` outsources the core GroupBy operation to the vectorized implementations controlled by the
[`engine` kwarg](engines.md). Applying these implementations to a parallel array type like dask
can be hard. Performance strongly depends on how the groups are distributed amongst the blocks of an array.
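
As a rough illustration of what that outsourcing looks like, the sketch below calls `flox.groupby_reduce` with an explicit `engine` (the array and group labels are invented for this example):

```python
import numpy as np
import flox

array = np.random.rand(12)        # a year of made-up monthly values
labels = np.repeat([0, 1, 2], 4)  # invented group labels

# engine="numpy" selects the numpy_groupies-based implementation;
# other engines (e.g. "flox", "numba") are chosen the same way.
result, groups = flox.groupby_reduce(array, labels, func="mean", engine="numpy")
```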

`flox` implements 4 strategies for grouped reductions, each appropriate for a particular distribution of groups
among the blocks of a dask array. Switch between the various strategies by passing `method`
and/or `reindex` to either {py:func}`flox.groupby_reduce` or {py:func}`flox.xarray.xarray_reduce`.

Your options are:
1. `method="map-reduce"` with `reindex=False`
1. `method="map-reduce"` with `reindex=True`
1. [`method="blockwise"`](method-blockwise)
1. [`method="cohorts"`](method-cohorts)

The most appropriate strategy for your problem will depend on the chunking of your dataset,
and the distribution of group labels across those chunks.
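
As a minimal sketch (the array, labels, and chunk sizes below are invented purely for illustration), switching strategies is just a matter of passing a different `method` and/or `reindex`:

```python
import dask.array as da
import numpy as np
import xarray as xr
import flox
import flox.xarray

labels = np.tile(np.arange(1, 13), 10)      # invented group labels, e.g. months
array = da.random.random((120,), chunks=4)  # dask-backed data, 4 timesteps per block

# option 2: map-reduce, reindexing each blockwise result to the full set of groups
result, groups = flox.groupby_reduce(
    array, labels, func="sum", method="map-reduce", reindex=True
)

# option 4: exploit the repeating pattern of labels across blocks
result, groups = flox.groupby_reduce(array, labels, func="sum", method="cohorts")

# the same switch works through the xarray interface
da_x = xr.DataArray(array, dims="time", coords={"label": ("time", labels)})
flox.xarray.xarray_reduce(da_x, "label", func="sum", method="map-reduce")
```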

```{tip}
Currently these strategies are implemented for dask. We would like to generalize to other parallel array types
as appropriate (e.g. Ramba, cubed, arkouda). Please open an issue to discuss if you are interested.
```

(xarray-split)=
## Background: Xarray's current GroupBy strategy

@@ -82,9 +85,10 @@ A bigger advantage is that this approach allows grouping by a dask array so gro

For example, consider `groupby("time.month")` with monthly frequency data and chunksize of 4 along `time`.
![cohorts-schematic](/../diagrams/cohorts-month-chunk4.png)
With `reindex=True`, each block will become 3x its original size at the blockwise step: input blocks have 4 timesteps while each output block
has a value for all 12 months. One could use `reindex=False` to control memory usage but also see [`method="cohorts"`](method-cohorts) below.
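
For instance, a sketch of limiting memory at the blockwise step (shapes and chunking are invented to mirror the schematic above):

```python
import dask.array as da
import numpy as np
import flox

# 10 years of made-up monthly data, chunked 4 timesteps per block
months = np.tile(np.arange(1, 13), 10)
array = da.random.random((120,), chunks=4)

# reindex=False keeps each blockwise result at the <= 4 groups actually present
# in that block, instead of expanding every block out to all 12 months
result, groups = flox.groupby_reduce(
    array, months, func="mean", method="map-reduce", reindex=False
)
```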

(method-blockwise)=
## `method="blockwise"`

One case where `method="map-reduce"` doesn't work well is "resampling" reductions. An

@@ -113,6 +117,7 @@ so that all members of a group are in a single block. Then, the groupby operatio

1. Works better when multiple groups are already in a single block, so that the initial
   rechunking only involves a small amount of communication (see the sketch below).
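
A hypothetical sketch of `method="blockwise"` (all shapes, chunk sizes, and labels are invented), assuming the data are already chunked so that every group lives entirely within one block:

```python
import dask.array as da
import numpy as np
import flox

# four days of hourly data; each day is exactly one block, so no group
# spans more than one block and the reduction can run blockwise
array = da.random.random((96,), chunks=24)
day = np.repeat(np.arange(4), 24)

result, groups = flox.groupby_reduce(array, day, func="max", method="blockwise")
```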

(method-cohorts)=
## `method="cohorts"`

The `map-reduce` strategy is quite effective but can involve some unnecessary communication. It is sometimes possible to exploit