groupby(multi-index level) not working correctly on a multi-indexed DataArray or DataSet #6836

emmaai · 2022-07-27T04:06:59Z

What happened?

run the code block below with 2022.6.0

midx = pd.MultiIndex.from_product([list("abc"), [0, 1]], names=("one", "two"))

mda = xr.DataArray(np.random.rand(6, 3), [("x", midx), ("y", range(3))])

mda.groupby("one").groups

output:

In [15]: mda.groupby("one").groups
Out[15]: 
{('a', 0): [0],
 ('a', 1): [1],
 ('b', 0): [2],
 ('b', 1): [3],
 ('c', 0): [4],
 ('c', 1): [5]}

What did you expect to happen?

as it was with 2022.3.0

In [6]: mda.groupby("one").groups
Out[6]: {'a': [0, 1], 'b': [2, 3], 'c': [4, 5]}

Minimal Complete Verifiable Example

import pandas as pd
import numpy as np
import xarray as XR

midx = pd.MultiIndex.from_product([list("abc"), [0, 1]], names=("one", "two"))

mda = xr.DataArray(np.random.rand(6, 3), [("x", midx), ("y", range(3))])

mda.groupby("one").groups

MVCE confirmation

Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
Complete example — the example is self-contained, including all data and the text of any traceback.
Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

N/A

Anything else we need to know?

N/A

Environment

INSTALLED VERSIONS

commit: None
python: 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.11.0-1025-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 2022.6.0
pandas: 1.4.3
numpy: 1.22.4
scipy: 1.7.3
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2022.04.1
distributed: 2022.4.1
matplotlib: 3.5.1
cartopy: 0.20.3
seaborn: 0.11.2
numbagg: None
fsspec: 2022.01.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 45.2.0
pip: 22.2
conda: None
pytest: 7.1.2
IPython: 7.31.0
sphinx: None

The text was updated successfully, but these errors were encountered:

dcherian · 2022-08-11T16:28:10Z

@benbovy I tracked this down to

>>> mda.one.to_index()
# v2022.06.0
MultiIndex([('a', 0),
            ('a', 1),
            ('b', 0),
            ('b', 1),
            ('c', 0),
            ('c', 1)],
           names=['one', 'two'])

# v2022.03.0
Index(['a', 'a', 'b', 'b', 'c', 'c'], dtype='object', name='x')

We call to_index here in safe_cast_to_index:

xarray/xarray/core/utils.py

Lines 115 to 140 in f8fee90

    
           def safe_cast_to_index(array: Any) -> pd.Index: 
        
               """Given an array, safely cast it to a pandas.Index. 
        
               If it is already a pandas.Index, return it unchanged. 
        
               Unlike pandas.Index, if the array has dtype=object or dtype=timedelta64, 
        
               this function will not attempt to do automatic type conversion but will 
        
               always return an index with dtype=object. 
        
               """ 
        
               if isinstance(array, pd.Index): 
        
                   index = array 
        
               elif hasattr(array, "to_index"): 
        
                   # xarray Variable 
        
                   index = array.to_index() 
        
               elif hasattr(array, "to_pandas_index"): 
        
                   # xarray Index 
        
                   index = array.to_pandas_index() 
        
               elif hasattr(array, "array") and isinstance(array.array, pd.Index): 
        
                   # xarray PandasIndexingAdapter 
        
                   index = array.array 
        
               else: 
        
                   kwargs = {} 
        
                   if hasattr(array, "dtype") and array.dtype.kind == "O": 
        
                       kwargs["dtype"] = object 
        
                   index = pd.Index(np.asarray(array), **kwargs) 
        
               return _maybe_cast_to_cftimeindex(index)

Not sure if the fix should be only in the GroupBy specifically or more generally in safe_cast_to_index

The GroupBy context is

xarray/xarray/core/groupby.py

Line 434 in f8fee90

group_as_index = safe_cast_to_index(group)

FabianHofmann · 2022-09-22T12:03:52Z

After trying to dig down further into the code, I saw that grouping over levels seems to be broken generally (up-to-date main branch at time of writing), i.e.

import pandas as pd
import numpy as np
import xarray as xr

midx = pd.MultiIndex.from_product([list("abc"), [0, 1]], names=("one", "two"))
mda = xr.DataArray(np.random.rand(6, 3), [("x", midx), ("y", range(3))])
mda.groupby("one").sum()

raises:

  File ".../xarray/xarray/core/_reductions.py", line 5055, in sum
    return self.reduce(

  File ".../xarray/xarray/core/groupby.py", line 1191, in reduce
    return self.map(reduce_array, shortcut=shortcut)

  File ".../xarray/xarray/core/groupby.py", line 1095, in map
    return self._combine(applied, shortcut=shortcut)

  File ".../xarray/xarray/core/groupby.py", line 1127, in _combine
    index, index_vars = create_default_index_implicit(coord)

  File ".../xarray/xarray/core/indexes.py", line 974, in create_default_index_implicit
    index = PandasMultiIndex(array, name)

  File ".../xarray/xarray/core/indexes.py", line 552, in __init__
    raise ValueError(

ValueError: conflicting multi-index level name 'one' with dimension 'one'

in the function create_default_index_implicit. I am still a bit puzzled how to approach this. Any idea @benbovy?

benbovy · 2022-09-22T14:39:24Z

Thanks @emmaai for the issue report and thanks @dcherian and @FabianHofmann for tracking it down.

There is a lot of complexity related to pandas.MultiIndex special cases and it's been difficult to avoid new issues arising during the index refactor.

create_default_index_implicit has some hacks to create xarray objects directly from pandas.MultiIndex instances (e.g., xr.Dataset(coords={"x": pd_midx})) or even from xarray objects wrapping multi-indexes. The error raised here suggests that the issue should fixed before this call... Probably in safe_cast_to_index indeed.

We should probably avoid using .to_index() internally, or should we even deprecate it? The fact that mda.one.to_index() (in v2022.3.0) doesn't return the same result than mda.indexes["one"] adds more confusion than it adds value. Actually, in the long-term I'd be for deprecating all pandas.MultiIndex special cases in Xarray.

* unpin xarray, numpy, pandas, netcdf4 * Fix deprecation "TypeError: Using a DataArray object to construct a variable is ambiguous, please extract the data using the .data property." See pydata/xarray#6508. There are other errors, however. * See Unidata/netcdf4-python#1175 "Regression in createVariable between 1.5.8 and 1.6.0". * Testing on Python 3.7 only covers through xarray 0.20.2. This is an experiment. * Fix get_metadata to account for the internal restructuring of indexes in xarray 2022.06. Update tests for same. * Fixed bug where we were stripping off attrs inadvertently when writing netCDF. * TestPlainGroupby.test_on_data_array fails with Python = 3.8.13, numpy = 1.23.2, xarray = 2022.06.0. That means it has nothing to do with BrainIO. * test_on_data_array should not involve any BrainIO classes. This is to test for bugs in xarray. With xarray==2022.06.0, this test fails. * xarray 2022.06.0 has a bug which breaks BrainIO: pydata/xarray#6836. * Adapt get_metadata to the change in the index API between 2022.03.0 and 2022.06.0. Now test_get_metadata passes under 2022.03.0 and 2022.06.0. * Getting an error from tests on Travis (but not locally): RuntimeError: NetCDF: Filter error: bad id or parameters or duplicate filter. This might fix it? * Compression test failed: assert 614732 > 615186. This might fix it. * Travis doesn't offer python 3.10 yet. Make sample assembly bigger so compression has an effect. * Bump minor version. Authored-by: Jonathan Prescott-Roy <[email protected]> and Martin Schrimpf <[email protected]>

it still has the bug introduced in 2022.6.0 pydata/xarray#6836

benbovy · 2022-11-14T13:49:47Z

From #7282 it looks like we need to convert the multi-index level to a single index when casting the group to an index. And from #7105 we can fix that in safe_cast_to_index() (sometimes the full multi-index is expected) so we probably need a special case in groupby.

benbovy · 2022-11-14T13:55:02Z

we can fix that in safe_cast_to_index()

...we cannot fix that in safe_cast_to_index() (or we can add a parameter to specify the desired result).

mschrimpf · 2023-04-05T18:23:51Z

Is there hope for groupby working on multi-indexed DataArrays again in the future? We -- and from the issue history it looks like others too -- are currently pinning xarray<2022.6 even though we would love to use newer versions.

dcherian · 2023-04-05T20:35:20Z

I think we could special-case extracting a multiindex level here:

xarray/xarray/core/groupby.py

Line 469 in d4db166

group = obj[group]

group at that stage should have values

['a', 'a', 'b', 'b', 'c', 'c']

@mschrimpf Can you try that and send in a PR?

benbovy · 2023-04-12T09:42:39Z

A special-case sounds reasonable to me as well as a temporary fix before looking into if/how we can refactor groupby so that it works with multiple kinds of built-in and/or custom indexes.

emmaai · 2023-04-13T06:53:56Z

I solved it temporarily by reset_index to groupby and set_xindex after, if anyone is looking.

Closes pydata#6836

* Fix .groupby(multi index level) Closes #6836 * Update xarray/tests/test_groupby.py * mypy: Add _DummyGroup.to_index --------- Co-authored-by: Illviljan <[email protected]>

* Fix .groupby(multi index level) Closes pydata#6836 * Update xarray/tests/test_groupby.py * mypy: Add _DummyGroup.to_index --------- Co-authored-by: Illviljan <[email protected]>

carynbear · 2024-08-20T22:56:04Z

This is happening again :(

max-sixty · 2024-08-20T22:58:38Z

@carynbear please could you post a reproducible example? A new bug report would be ideal...

emmaai added bug needs triage Issue that has not been reviewed by xarray team member labels Jul 27, 2022

emmaai mentioned this issue Jul 27, 2022

Fix issues in plugins and version update opendatacube/odc-stats#60

Merged

nunupeke mentioned this issue Jul 27, 2022

sel by slice not working for multi-index containing float-values #6838

Open

4 tasks

dcherian changed the title ~~groupby not working correctly on a multi-indexed DataArray or DataSet~~ groupby(multi-index level) not working correctly on a multi-indexed DataArray or DataSet Jul 27, 2022

dcherian added topic-groupby regression and removed needs triage Issue that has not been reviewed by xarray team member labels Jul 27, 2022

dcherian mentioned this issue Jul 27, 2022

test against xarray HEAD opendatacube/odc-stats#61

Open

benbovy mentioned this issue Aug 10, 2022

Add CRS/projection information to xarray objects #2288

Open

benbovy self-assigned this Aug 31, 2022

FabianHofmann mentioned this issue Sep 6, 2022

Pypsa-native constraints as class objects? PyPSA/PyPSA#437

Closed

jjpr-mit added a commit to brain-score/brainio that referenced this issue Sep 26, 2022

xarray 2022.06.0 has a bug which breaks BrainIO: pydata/xarray#6836.

66affb8

benbovy mentioned this issue Sep 27, 2022

Explicit indexes: next steps #6293

Open

49 tasks

benbovy mentioned this issue Sep 29, 2022

Fix to_index(): return multiindex level as single index #7105

Merged

3 tasks

mschrimpf added a commit to brain-score/brainio that referenced this issue Oct 11, 2022

disallow xarray 2022.9.0

349f46c

it still has the bug introduced in 2022.6.0 pydata/xarray#6836

mschrimpf mentioned this issue Oct 11, 2022

disallow xarray 2022.9.0 brain-score/brainio#29

Merged

mschrimpf added a commit to brain-score/brainio that referenced this issue Oct 11, 2022

disallow xarray 2022.9.0 (#29)

7dc287a

it still has the bug introduced in 2022.6.0 pydata/xarray#6836

dcherian closed this as completed in #7105 Oct 12, 2022

This was referenced Nov 10, 2022

disallow xarray 2022.11 brain-score/brainio#34

Merged

groupby and mean on a MultiIndex level raises ValueError #7282

Closed

benbovy reopened this Nov 14, 2022

mschrimpf mentioned this issue Dec 5, 2022

pin xarray at <2022.6 brain-score/brainio#36

Merged

FabianHofmann mentioned this issue Dec 13, 2022

constraints.py: circumvent xarray groupby with multiindex PyPSA/PyPSA#522

Merged

5 tasks

benbovy mentioned this issue Dec 14, 2022

groupby+map performance regression on MultiIndex dataset #7376

Closed

4 tasks

Ariana-B mentioned this issue Dec 16, 2022

Incompatibilities with xarray > 2022.03 opendatacube/datacube-core#1373

Closed

SpacemanPaul mentioned this issue Dec 20, 2022

pin xarray version < 2022.6 opendatacube/datacube-core#1374

Merged

3 tasks

emmaai mentioned this issue Feb 2, 2023

fix xarray groupby opendatacube/odc-tools#541

Merged

dcherian added a commit to dcherian/xarray that referenced this issue May 8, 2023

Fix .groupby(multi index level)

89880b5

Closes pydata#6836

dcherian mentioned this issue May 8, 2023

Fix .groupby(multi index level) #7830

Merged

3 tasks

dcherian closed this as completed in #7830 Jun 6, 2023

dcherian added a commit that referenced this issue Jun 6, 2023

Fix .groupby(multi index level) (#7830)

c668582

* Fix .groupby(multi index level) Closes #6836 * Update xarray/tests/test_groupby.py * mypy: Add _DummyGroup.to_index --------- Co-authored-by: Illviljan <[email protected]>

carynbear mentioned this issue Aug 20, 2024

groupby on multi-index level is grouping by the multi-index (reappearance of #6836) #9388

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

groupby(multi-index level) not working correctly on a multi-indexed DataArray or DataSet #6836

groupby(multi-index level) not working correctly on a multi-indexed DataArray or DataSet #6836

emmaai commented Jul 27, 2022

INSTALLED VERSIONS

dcherian commented Aug 11, 2022 •

edited

Loading

FabianHofmann commented Sep 22, 2022

benbovy commented Sep 22, 2022

benbovy commented Nov 14, 2022

benbovy commented Nov 14, 2022

mschrimpf commented Apr 5, 2023

dcherian commented Apr 5, 2023

benbovy commented Apr 12, 2023

emmaai commented Apr 13, 2023

carynbear commented Aug 20, 2024

max-sixty commented Aug 20, 2024

groupby(multi-index level) not working correctly on a multi-indexed DataArray or DataSet #6836

groupby(multi-index level) not working correctly on a multi-indexed DataArray or DataSet #6836

Comments

emmaai commented Jul 27, 2022

What happened?

What did you expect to happen?

Minimal Complete Verifiable Example

MVCE confirmation

Relevant log output

Anything else we need to know?

Environment

INSTALLED VERSIONS

dcherian commented Aug 11, 2022 • edited Loading

FabianHofmann commented Sep 22, 2022

benbovy commented Sep 22, 2022

benbovy commented Nov 14, 2022

benbovy commented Nov 14, 2022

mschrimpf commented Apr 5, 2023

dcherian commented Apr 5, 2023

benbovy commented Apr 12, 2023

emmaai commented Apr 13, 2023

carynbear commented Aug 20, 2024

max-sixty commented Aug 20, 2024

dcherian commented Aug 11, 2022 •

edited

Loading