# Aggregations
`flox` implements all common reductions provided by `numpy_groupies` in `aggregations.py`. Control this by passing
the `func` kwarg:

- `"sum"`, `"nansum"`
- `"prod"`, `"nanprod"`
- `"count"` - number of non-NaN elements by group
- `"mean"`, `"nanmean"`
- `"var"`, `"nanvar"`
- `"std"`, `"nanstd"`
- `"argmin"`
- `"argmax"`
- `"first"`
- `"last"`
```{tip}
We would like to add support for `cumsum`, `cumprod` ([issue](https://github.com/xarray-contrib/flox/issues/91)). Contributions are welcome!
```
## Custom Aggregations
`flox` also allows you to specify a custom Aggregation (again inspired by dask.dataframe),
though this might not be fully functional at the moment. See `aggregations.py` for examples.
See the ["Custom Aggregations"](user-stories/custom-aggregations.ipynb) user story for a more user-friendly example.

## Engines

`flox` provides multiple options, using the `engine` kwarg, for computing the core GroupBy reduction on numpy or other non-dask array types. Aggregating over other array types will work if the array type supports [ufunc.reduceat](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.reduceat.html) or [ufunc.at](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.at.html).
1.`engine="numpy"` wraps `numpy_groupies.aggregate_numpy`. This uses indexing tricks and functions like `np.bincount`, or the ufunc `.at` methods
7
+
(.e.g `np.maximum.at`) to provided reasonably performant aggregations.
8
+
1.`engine="numba"` wraps `numpy_groupies.aggregate_numba`. This uses `numba` kernels for the core aggregation.
9
+
1.`engine="flox"` uses the `ufunc.reduceat` method after first argsorting the array so that all group members occur sequentially. This was copied from
10
+
a [gist by Stephan Hoyer](https://gist.github.com/shoyer/f538ac78ae904c936844)
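
To make that last idea concrete, here is a small numpy-only sketch of the sort-then-`reduceat` trick; it is purely illustrative, and the real implementation additionally handles NaNs, fill values, dtypes, and more:

```python
import numpy as np

array = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
labels = np.array([1, 0, 1, 2, 0, 2])

# sort so that members of each group are contiguous
order = np.argsort(labels, kind="stable")
sorted_array = array[order]
sorted_labels = labels[order]

# offsets mark where each group starts in the sorted array
offsets = np.concatenate(([0], np.flatnonzero(np.diff(sorted_labels)) + 1))

# one vectorized reduction per group, with no Python loop over groups
group_sums = np.add.reduceat(sorted_array, offsets)
# group_sums -> array([ 70.,  40., 100.]) for groups 0, 1, 2
```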
See [](arrays) for more details.
### Tradeoffs
For the common case of reducing an nD array by a 1D array of group labels (e.g. `groupby("time.month")`), `engine="flox"` *can* be faster.
The reason is that `numpy_groupies` converts all groupby problems to a 1D problem, which can involve [some overhead](https://github.com/ml31415/numpy-groupies/pull/46).
It is possible to optimize this a bit in `flox` or `numpy_groupies`, but the work has not been done yet.
The advantage of `engine="numpy"` is that it tends to work for more array types, since it appears to be more common to implement `np.bincount`, and not `np.add.reduceat`.
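
In practice, switching engines is just a kwarg change, which makes it easy to benchmark them on your own problem. A sketch (`engine="numba"` additionally requires `numba` to be installed):

```python
import numpy as np

import flox

array = np.random.rand(10, 12)      # e.g. (space, time)
labels = np.tile(np.arange(3), 4)   # 12 timesteps falling in 3 groups

# the same groupby-mean computed with different engines
for engine in ["flox", "numpy", "numba"]:
    result, groups = flox.groupby_reduce(array, labels, func="mean", engine=engine)
```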
```{tip}
Other potential engines we could add are [`numbagg`](https://github.com/numbagg/numbagg) ([stalled PR here](https://github.com/xarray-contrib/flox/pull/72)) and [`datashader`](https://github.com/xarray-contrib/flox/issues/142).
Both use numba for high-performance aggregations. Contributions or discussion is very welcome!
```

## `method="cohorts"`

Consider `groupby("time.month")` with monthly frequency data and a chunksize of 4 along `time`.
With `method="map-reduce", reindex=True`, each block will become 3x its original size at the blockwise step: input blocks have 4 timesteps while the output block
has a value for all 12 months. Note that the blockwise groupby-reduction *does not reduce* the data since there is only one element in each
group. In addition, since `map-reduce` will make the final result have only one chunk of size 12 along the new `month`
dimension, the final result has chunk sizes 3x that of the input, which may not be ideal.
However, because a chunksize of 4 evenly divides the number of groups (12), all we need to do is index out blocks
0, 3, 7 and then apply the `"map-reduce"` strategy to form the final result for months Jan-Apr. Repeat for the
remaining groups of months (May-Aug; Sep-Dec) and then concatenate. This is the essence of `method="cohorts"`.
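
As a sketch of what this looks like in code, with a made-up dask array holding 4 years of monthly values chunked by 4 along `time`:

```python
import dask.array as da
import numpy as np

import flox

array = da.ones((48,), chunks=4)        # 4 years of monthly data, 4 timesteps per block
month = np.tile(np.arange(1, 13), 4)    # group labels, known at graph construction time

# strategy 1: blockwise reduce, then tree-combine every intermediate
mapred, groups = flox.groupby_reduce(array, month, func="mean", method="map-reduce")

# strategy 2: gather blocks into cohorts (Jan-Apr, May-Aug, Sep-Dec), apply
# map-reduce to each cohort independently, then concatenate the results
cohort, groups = flox.groupby_reduce(array, month, func="mean", method="cohorts")
```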
### Summary
We can generalize this idea for more complicated problems (inspired by the `split_out` kwarg in `dask.dataframe.groupby`).
We first apply the groupby-reduction blockwise, then split and reindex blocks to create a new array with which we complete the reduction
using `map-reduce`. Because the split or shuffle step occurs after the blockwise reduction, we *sometimes* communicate a significantly smaller
amount of data than if we split or shuffled the input array.
```{image} /../diagrams/new-cohorts-annotated.svg
:alt: cohorts-strategy-schematic
:width: 100%
```
### Tradeoffs

1. Generalizes well; when there's exactly one group per chunk, this replicates Xarray's
   strategy, which is optimal. For resampling-type reductions, as long as the array
   is chunked appropriately ({py:func}`flox.core.rechunk_for_blockwise`, {py:func}`flox.xarray.rechunk_for_blockwise`), `method="cohorts"` is equivalent to `method="blockwise"`!
1. Group labels must be known at graph construction time, so this only works for numpy arrays.
1. Currently implemented for grouping by 1D arrays; an nD generalization seems possible.
1. This does require more tasks and a more complicated graph, but the communication overhead can be significantly lower.
1. The detection of "cohorts" is currently slow but could be improved.
1. The extra effort of detecting cohorts and multiple copying of intermediate blocks may be worthwhile only if the chunk sizes are small
   relative to the approximate period of group labels, or small relative to the size of spatially localized groups.
### Example: sensitivity to chunking
One annoyance is that if the chunksize doesn't evenly divide the number of groups, we still end up splitting a number of chunks.
Consider our earlier example, `groupby("time.month")` with monthly frequency data and chunksize of 4 along `time`.
With a chunksize of `5` along `time` instead, we find 8 cohorts (note the original xarray strategy is equivalent to constructing 12 cohorts).
In this case, it seems better to rechunk to a size of `4` along `time`.
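
A sketch of that fix with plain dask; the sizes are made up (30 years of monthly labels):

```python
import dask.array as da
import numpy as np

import flox

month = np.tile(np.arange(1, 13), 30)   # 360 monthly labels
array = da.ones((360,), chunks=5)       # a chunksize of 5 does not evenly divide 12 groups

# rechunk so the chunksize evenly divides the number of groups; cohort
# boundaries then line up with block boundaries
array = array.rechunk(4)
result, groups = flox.groupby_reduce(array, month, func="mean", method="cohorts")
```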
If you have ideas for improving this case, please open an issue.