"cannot reindex from a duplicate axis" when groupby().apply() on MultiIndex columns #16646

bdforbes · 2017-06-09T05:58:27Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df = pd.DataFrame(
    np.ones([6, 4], dtype=int),
    columns=pd.MultiIndex.from_product([['A', 'B'], [1, 2]])
)

(
    df
    .groupby(level=0, axis=1)
    .apply(
        lambda df: pd.concat(
            [df.xs(df.name, axis=1), df.sum(axis=1).to_frame('Total')], 
            axis=1
        )
    )
)

Problem description

The code above produces the error:

cannot reindex from a duplicate axis

I believe this is a bug because, as described below, essentially the same operations can be successfully run along the other axis.

Expected Output

It should, as far as I can tell, produce the following output:

The desired output can be obtained when working on the transposed DataFrame along the index rather than the columns:

(
    df
    .T
    .groupby(level=0)
    .apply(
        lambda df: pd.concat(
            [df.xs(df.name), df.sum().to_frame('Total').T]
        )
    )
    .T
)

Output of `pd.show_versions()`

pd.show_versions()
pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 34.3.3
Cython: None
numpy: 1.13.0
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.5
s3fs: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

bdforbes · 2017-06-09T07:01:43Z

I've just found a simpler MWE. This fails:

(
    df
    .groupby(level=0, axis=1)
    .apply(
        lambda df: 2*df.xs(df.name, axis=1)
    )
)

This succeeds:

(
    df.T
    .groupby(level=0)
    .apply(
        lambda df: 2*df.xs(df.name)
    )
)

jreback · 2017-06-09T10:33:01Z

this is very very odd to odd. what are you actually trying to do?

jreback · 2017-06-09T10:35:41Z

for example your 'failing' ones:

Out[11]: 
A    [[1, 1, 2], [1, 1, 2], [1, 1, 2], [1, 1, 2], [...
B    [[1, 1, 2], [1, 1, 2], [1, 1, 2], [1, 1, 2], [...
dtype: object

In [10]: (
    ...:     df
    ...:     .groupby(level=0, axis=1)
    ...:     .apply(
    ...:         lambda df: 2*df.xs(df.name, axis=1).values
    ...:     )
    ...: )
Out[10]: 
A    [[2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]]
B    [[2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]]
dtype: object

Makes no sense to do this. You can certainly iterate over the groups yourself and concat (though you will find exactly the same issue if you don't drop the indexes). By definition pandas wants to align return objects.

bdforbes · 2017-06-09T11:39:24Z

I'm actually trying to adapt the code at https://stackoverflow.com/a/38964596/336001, which recursively adds subtotal rows into a DataFrame, to work along columns instead. Here is the original code, which adds subtotal rows:

def append_tot(df):
    if hasattr(df, 'name') and df.name is not None:
        xs = df.xs(df.name)
    else:
        xs = df
    gb = xs.groupby(level=0)
    n = xs.index.nlevels
    name = tuple('Total' if i == 0 else '' for i in range(n))
    tot = gb.sum().sum().rename(name).to_frame().T
    if n > 1:
        sm = gb.apply(append_tot)
    else:
        sm = gb.sum()
    return pd.concat([sm, tot])

The adapted version to add subtotal columns looks like this:

def append_tot(df):
    if hasattr(df, 'name') and df.name is not None:
        xs = df.xs(df.name, axis=1)
    else:
        xs = df
    gb = xs.groupby(level=0, axis=1)
    n = xs.columns.nlevels
    name = tuple('Total' if i == 0 else '' for i in range(n))
    tot = gb.sum().sum(axis=1).rename(name).to_frame()
    if n > 1:
        sm = gb.apply(append_tot)
    else:
        sm = gb.sum()
    return pd.concat([sm, tot], axis=1)

This code will produce the "cannot reindex from a duplicate index" error, whereas the original code provided in the Stack Overflow thread works fine.

The minimal working example I've found was just the simplest way to reproduce the error.

kernc · 2017-09-01T10:04:08Z

I'm facing the same issue when trying to interpolate each series within a group:

>>> df = pd.DataFrame({
...     'node': [59, 59, 59, 314, 314, 314, 59],
...     'ping': [116, np.nan, 106, 87, 80, np.nan, 118],
...     'mode': ['2G', np.nan, '4G', '3G', np.nan, '3G', '2G']},
...     columns=['node', 'ping', 'mode'],
...     index=pd.to_datetime(['2017-07-13 00:30:00',
...                           '2017-07-13 00:30:00',
...                           '2017-07-13 00:30:00',
...                           '2017-07-13 00:30:00',
...                           '2017-07-13 01:00:00',
...                           '2017-07-13 01:00:00',
...                           '2017-07-13 01:00:00']))
>>> df
                     node   ping mode
2017-07-13 00:30:00    59  116.0   2G
2017-07-13 00:30:00    59    NaN  NaN
2017-07-13 00:30:00    59  106.0   4G
2017-07-13 00:30:00   314   87.0   3G
2017-07-13 01:00:00   314   80.0  NaN
2017-07-13 01:00:00   314    NaN   3G
2017-07-13 01:00:00    59  118.0   2G

>>> def interpolator(series):  # ffill for categoricals, linear otherwise
...     if series.dtype == object:
...         return series.ffill()
...     return series.interpolate()

>>> df.groupby('node').apply(lambda subdf: subdf.apply(interpolator))
---------------------------------------------------------------------------
ValueError: cannot reindex from a duplicate axis

Anything obvious?

Edit: It works, however, if I df.reset_index(inplace=True) before grouping.

bdforbes · 2017-09-01T22:44:46Z

I never resolved my earlier problem, I managed to get around it by performing my operations along the other axis and transposing. It doesn't appear that your issue fits that pattern however.

dayuhuang · 2017-10-05T21:46:29Z

I have the same issue here. This seems to have something to do with index of the data frame. Please see the following example to reproduce the error:

data = pd.DataFrame({'x':[1,2,3],'y':[0,1,0]}, index=[0,0,1])
data.groupby(by='y').apply(lambda u: u)

Error: ValueError: cannot reindex from a duplicate axis

However, the following code which only differs by one element in the index will execute without producing the error:

data = pd.DataFrame({'x':[1,2,3],'y':[0,1,0]}, index=[0,1,1])
data.groupby(by='y').apply(lambda u: u)

Note that the only change is the index of the second row.

python 3.5
pandas 0.20.3

jreback · 2017-10-06T10:30:42Z

I think you can just set the index in this case (this is internally where the error happens), rather than re-index, as its a transformation. PR's welcome.

TomAugspurger · 2018-01-31T15:21:00Z

Duplicate of #19437 (that's for duplicate index, this is for duplicate columns with axis=1). Should be the same fix.

jtlz2 · 2018-05-16T08:01:43Z

@kernc Thank you thank you for the amazing workaround!

i.e.

It works, however, if I df.reset_index(inplace=True) before grouping.

bnaul · 2020-03-13T19:03:12Z

@TomAugspurger I'm not sure this was actually fixed along with #19437, I'm seeing the same issue on 1.0.1:

[nav] In [62]: data = pd.DataFrame({'x':[1,2,3],'y':[0,1,0]}, index=[0,0,1])
          ...: data.groupby(by='y').apply(lambda u: u)
---------------------------------------------------------------------------
...
~/model/.venv/lib/python3.7/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3097         # trying to reindex on an axis with duplicates
   3098         if not self.is_unique and len(indexer):
-> 3099             raise ValueError("cannot reindex from a duplicate axis")
   3100
   3101     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis

If I understand the above comments correctly it seems that this should still be considered a bug (as opposed to unsupported behavior), right?

TomAugspurger · 2020-03-20T19:56:53Z

Yeah, seems to not be a duplicate.

smithto1 · 2020-07-28T20:45:45Z

take

bdforbes · 2020-08-08T07:52:58Z

Thanks, the fix in #35441 appears to work.

To test it, I installed pandas 1.1.0 in a virtual environment and manually patched the lines in core/groupby/ops.py following #35441. Is there an easier way to test a PR that has just gone into master? Do I have to clone the repo and build it from scratch? I'm on Windows and I haven't tried this before.

smithto1 · 2020-08-08T20:49:36Z

@bdforbes If you follow the instructions on website, it's not too hard.

https://pandas.pydata.org/docs/development/contributing.html#creating-a-python-environment

There is an hour or two to set up a C-compiler, git etc, but after that you can clone the repo and work current master/PR branches.

bdforbes · 2020-08-08T22:34:40Z

Thanks @smithto1, the instructions look pretty clear.

pwwang · 2021-07-15T21:33:11Z

Another interesting case:

df = DataFrame(dict(a=[1,1,2,2], b=[1,2,3,4])).groupby('a')
def _applied(subdf):
    subdf.reset_index(drop=True, inplace=True)
    return DataFrame(dict(c=subdf.b+1))
df.apply(_applied)  # ValueError: cannot reindex from a duplicate axis

def _applied2(subdf):
    subdf = subdf.reset_index(drop=True)
    return DataFrame(dict(c=subdf.b+1))
df.apply(_applied) # works

pandas: v1.3.0

rhshadrach · 2021-07-15T22:01:17Z

@pwwang - mutating a DataFrame in apply is not supported, see here:

https://pandas.pydata.org/docs/user_guide/gotchas.html#mutating-with-user-defined-function-udf-methods

jreback added Apply Apply, Aggregate, Transform, Map Bug Groupby MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode Difficulty Intermediate labels Oct 6, 2017

jreback added this to the Next Major Release milestone Oct 6, 2017

TomAugspurger closed this as completed Jan 31, 2018

TomAugspurger modified the milestones: Next Major Release, No action Jan 31, 2018

TomAugspurger added the Duplicate Report Duplicate issue or pull request label Jan 31, 2018

TomAugspurger reopened this Mar 20, 2020

mroeschke removed the Duplicate Report Duplicate issue or pull request label Apr 3, 2020

github-actions bot assigned smithto1 Jul 28, 2020

smithto1 mentioned this issue Jul 28, 2020

BUG: GroupBy.apply() throws erroneous ValueError with duplicate axes #35441

Merged

5 tasks

jreback removed this from the No action milestone Aug 3, 2020

jreback added this to the 1.2 milestone Aug 3, 2020

jreback closed this as completed in #35441 Aug 6, 2020

Uh oh!

"cannot reindex from a duplicate axis" when groupby().apply() on MultiIndex columns #16646

"cannot reindex from a duplicate axis" when groupby().apply() on MultiIndex columns #16646

Comments

bdforbes commented Jun 9, 2017

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

bdforbes commented Jun 9, 2017

Uh oh!

jreback commented Jun 9, 2017

Uh oh!

jreback commented Jun 9, 2017

Uh oh!

bdforbes commented Jun 9, 2017

Uh oh!

kernc commented Sep 1, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bdforbes commented Sep 1, 2017

Uh oh!

dayuhuang commented Oct 5, 2017

Uh oh!

jreback commented Oct 6, 2017

Uh oh!

TomAugspurger commented Jan 31, 2018

Uh oh!

jtlz2 commented May 16, 2018

Uh oh!

bnaul commented Mar 13, 2020

Uh oh!

TomAugspurger commented Mar 20, 2020

Uh oh!

smithto1 commented Jul 28, 2020

Uh oh!

bdforbes commented Aug 8, 2020

Uh oh!

smithto1 commented Aug 8, 2020

Uh oh!

bdforbes commented Aug 8, 2020

Uh oh!

pwwang commented Jul 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rhshadrach commented Jul 15, 2021

Uh oh!

Output of `pd.show_versions()`

kernc commented Sep 1, 2017 •

edited

Loading

pwwang commented Jul 15, 2021 •

edited

Loading