Skip to content

groupby(multi-index level) not working correctly on a multi-indexed DataArray or DataSet #6836

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
4 tasks done
emmaai opened this issue Jul 27, 2022 · 11 comments · Fixed by #7105 or #7830
Closed
4 tasks done

Comments

@emmaai
Copy link

emmaai commented Jul 27, 2022

What happened?

run the code block below with 2022.6.0

midx = pd.MultiIndex.from_product([list("abc"), [0, 1]], names=("one", "two"))

mda = xr.DataArray(np.random.rand(6, 3), [("x", midx), ("y", range(3))])

mda.groupby("one").groups

output:

In [15]: mda.groupby("one").groups
Out[15]: 
{('a', 0): [0],
 ('a', 1): [1],
 ('b', 0): [2],
 ('b', 1): [3],
 ('c', 0): [4],
 ('c', 1): [5]}

What did you expect to happen?

as it was with 2022.3.0

In [6]: mda.groupby("one").groups
Out[6]: {'a': [0, 1], 'b': [2, 3], 'c': [4, 5]}

Minimal Complete Verifiable Example

import pandas as pd
import numpy as np
import xarray as XR

midx = pd.MultiIndex.from_product([list("abc"), [0, 1]], names=("one", "two"))

mda = xr.DataArray(np.random.rand(6, 3), [("x", midx), ("y", range(3))])

mda.groupby("one").groups

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

N/A

Anything else we need to know?

N/A

Environment

INSTALLED VERSIONS

commit: None
python: 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.11.0-1025-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 2022.6.0
pandas: 1.4.3
numpy: 1.22.4
scipy: 1.7.3
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2022.04.1
distributed: 2022.4.1
matplotlib: 3.5.1
cartopy: 0.20.3
seaborn: 0.11.2
numbagg: None
fsspec: 2022.01.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 45.2.0
pip: 22.2
conda: None
pytest: 7.1.2
IPython: 7.31.0
sphinx: None

@emmaai emmaai added bug needs triage Issue that has not been reviewed by xarray team member labels Jul 27, 2022
@dcherian dcherian changed the title groupby not working correctly on a multi-indexed DataArray or DataSet groupby(multi-index level) not working correctly on a multi-indexed DataArray or DataSet Jul 27, 2022
@dcherian dcherian added topic-groupby regression and removed needs triage Issue that has not been reviewed by xarray team member labels Jul 27, 2022
@dcherian
Copy link
Contributor

dcherian commented Aug 11, 2022

@benbovy I tracked this down to

>>> mda.one.to_index()
# v2022.06.0
MultiIndex([('a', 0),
            ('a', 1),
            ('b', 0),
            ('b', 1),
            ('c', 0),
            ('c', 1)],
           names=['one', 'two'])

# v2022.03.0
Index(['a', 'a', 'b', 'b', 'c', 'c'], dtype='object', name='x')

We call to_index here in safe_cast_to_index:

xarray/xarray/core/utils.py

Lines 115 to 140 in f8fee90

def safe_cast_to_index(array: Any) -> pd.Index:
"""Given an array, safely cast it to a pandas.Index.
If it is already a pandas.Index, return it unchanged.
Unlike pandas.Index, if the array has dtype=object or dtype=timedelta64,
this function will not attempt to do automatic type conversion but will
always return an index with dtype=object.
"""
if isinstance(array, pd.Index):
index = array
elif hasattr(array, "to_index"):
# xarray Variable
index = array.to_index()
elif hasattr(array, "to_pandas_index"):
# xarray Index
index = array.to_pandas_index()
elif hasattr(array, "array") and isinstance(array.array, pd.Index):
# xarray PandasIndexingAdapter
index = array.array
else:
kwargs = {}
if hasattr(array, "dtype") and array.dtype.kind == "O":
kwargs["dtype"] = object
index = pd.Index(np.asarray(array), **kwargs)
return _maybe_cast_to_cftimeindex(index)

Not sure if the fix should be only in the GroupBy specifically or more generally in safe_cast_to_index

The GroupBy context is

group_as_index = safe_cast_to_index(group)

@FabianHofmann
Copy link
Contributor

After trying to dig down further into the code, I saw that grouping over levels seems to be broken generally (up-to-date main branch at time of writing), i.e.

import pandas as pd
import numpy as np
import xarray as xr

midx = pd.MultiIndex.from_product([list("abc"), [0, 1]], names=("one", "two"))
mda = xr.DataArray(np.random.rand(6, 3), [("x", midx), ("y", range(3))])
mda.groupby("one").sum()

raises:

  File ".../xarray/xarray/core/_reductions.py", line 5055, in sum
    return self.reduce(

  File ".../xarray/xarray/core/groupby.py", line 1191, in reduce
    return self.map(reduce_array, shortcut=shortcut)

  File ".../xarray/xarray/core/groupby.py", line 1095, in map
    return self._combine(applied, shortcut=shortcut)

  File ".../xarray/xarray/core/groupby.py", line 1127, in _combine
    index, index_vars = create_default_index_implicit(coord)

  File ".../xarray/xarray/core/indexes.py", line 974, in create_default_index_implicit
    index = PandasMultiIndex(array, name)

  File ".../xarray/xarray/core/indexes.py", line 552, in __init__
    raise ValueError(

ValueError: conflicting multi-index level name 'one' with dimension 'one'

in the function create_default_index_implicit. I am still a bit puzzled how to approach this. Any idea @benbovy?

@benbovy
Copy link
Member

benbovy commented Sep 22, 2022

Thanks @emmaai for the issue report and thanks @dcherian and @FabianHofmann for tracking it down.

There is a lot of complexity related to pandas.MultiIndex special cases and it's been difficult to avoid new issues arising during the index refactor.

create_default_index_implicit has some hacks to create xarray objects directly from pandas.MultiIndex instances (e.g., xr.Dataset(coords={"x": pd_midx})) or even from xarray objects wrapping multi-indexes. The error raised here suggests that the issue should fixed before this call... Probably in safe_cast_to_index indeed.

We should probably avoid using .to_index() internally, or should we even deprecate it? The fact that mda.one.to_index() (in v2022.3.0) doesn't return the same result than mda.indexes["one"] adds more confusion than it adds value. Actually, in the long-term I'd be for deprecating all pandas.MultiIndex special cases in Xarray.

jjpr-mit added a commit to brain-score/brainio that referenced this issue Sep 26, 2022
mschrimpf added a commit to brain-score/brainio that referenced this issue Sep 27, 2022
* unpin xarray, numpy, pandas, netcdf4

* Fix deprecation "TypeError: Using a DataArray object to construct a variable is ambiguous, please extract the data using the .data property."  See pydata/xarray#6508.  There are other errors, however.

* See Unidata/netcdf4-python#1175 "Regression in createVariable between 1.5.8 and 1.6.0".

* Testing on Python 3.7 only covers through xarray 0.20.2.  This is an experiment.

* Fix get_metadata to account for the internal restructuring of indexes in xarray 2022.06.  Update tests for same.

* Fixed bug where we were stripping off attrs inadvertently when writing netCDF.

* TestPlainGroupby.test_on_data_array fails with Python = 3.8.13, numpy = 1.23.2, xarray = 2022.06.0.  That means it has nothing to do with BrainIO.

* test_on_data_array should not involve any BrainIO classes.  This is to test for bugs in xarray.  With xarray==2022.06.0, this test fails.

* xarray 2022.06.0 has a bug which breaks BrainIO:  pydata/xarray#6836.

* Adapt get_metadata to the change in the index API between 2022.03.0 and 2022.06.0.  Now test_get_metadata passes under 2022.03.0 and 2022.06.0.

* Getting an error from tests on Travis (but not locally):  RuntimeError: NetCDF: Filter error: bad id or parameters or duplicate filter.  This might fix it?

* Compression test failed:  assert 614732 > 615186.  This might fix it.

* Travis doesn't offer python 3.10 yet.  Make sample assembly bigger so compression has an effect.

* Bump minor version.

Authored-by: Jonathan Prescott-Roy <[email protected]> and Martin Schrimpf <[email protected]>
mschrimpf added a commit to brain-score/brainio that referenced this issue Oct 11, 2022
it still has the bug introduced in 2022.6.0 pydata/xarray#6836
mschrimpf added a commit to brain-score/brainio that referenced this issue Oct 11, 2022
it still has the bug introduced in 2022.6.0 pydata/xarray#6836
@benbovy
Copy link
Member

benbovy commented Nov 14, 2022

From #7282 it looks like we need to convert the multi-index level to a single index when casting the group to an index. And from #7105 we can fix that in safe_cast_to_index() (sometimes the full multi-index is expected) so we probably need a special case in groupby.

@benbovy benbovy reopened this Nov 14, 2022
@benbovy
Copy link
Member

benbovy commented Nov 14, 2022

we can fix that in safe_cast_to_index()

...we cannot fix that in safe_cast_to_index() (or we can add a parameter to specify the desired result).

@mschrimpf
Copy link

Is there hope for groupby working on multi-indexed DataArrays again in the future? We -- and from the issue history it looks like others too -- are currently pinning xarray<2022.6 even though we would love to use newer versions.

@dcherian
Copy link
Contributor

dcherian commented Apr 5, 2023

I think we could special-case extracting a multiindex level here:

group = obj[group]

group at that stage should have values

['a', 'a', 'b', 'b', 'c', 'c']

@mschrimpf Can you try that and send in a PR?

@benbovy
Copy link
Member

benbovy commented Apr 12, 2023

A special-case sounds reasonable to me as well as a temporary fix before looking into if/how we can refactor groupby so that it works with multiple kinds of built-in and/or custom indexes.

@emmaai
Copy link
Author

emmaai commented Apr 13, 2023

I solved it temporarily by reset_index to groupby and set_xindex after, if anyone is looking.

dcherian added a commit to dcherian/xarray that referenced this issue May 8, 2023
dcherian added a commit that referenced this issue Jun 6, 2023
* Fix .groupby(multi index level)

Closes #6836

* Update xarray/tests/test_groupby.py

* mypy: Add _DummyGroup.to_index

---------

Co-authored-by: Illviljan <[email protected]>
dstansby pushed a commit to dstansby/xarray that referenced this issue Jun 28, 2023
* Fix .groupby(multi index level)

Closes pydata#6836

* Update xarray/tests/test_groupby.py

* mypy: Add _DummyGroup.to_index

---------

Co-authored-by: Illviljan <[email protected]>
@carynbear
Copy link

This is happening again :(

@max-sixty
Copy link
Collaborator

@carynbear please could you post a reproducible example? A new bug report would be ideal...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants