KeyError for crosstab on Series with same name. #6319

Closed
theandygross opened this issue Feb 10, 2014 · 17 comments
Labels
Bug MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@theandygross
Contributor

Doing a crosstab on two Series with the same name raises a KeyError. This happens because the crosstab function stores its data in a dictionary keyed by Series name, so the two inputs collide. Not sure if this is a feature or a bug, but a default similar to the behavior when unnamed Series are cross-tabulated would be desirable to me.

In [56]:

s1 = pd.Series([1,1,2,2,3,3], name='s')
s2 = pd.Series([1,1,1,2,2,2], name='s')

pd.crosstab(s1, s2)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-56-9d16b2abac9f> in <module>()
      2 s2 = pd.Series([1,1,1,2,2,2], name='s')
      3 
----> 4 pd.crosstab(s1, s2)

/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas-0.13.0_247_g82bcbb8-py2.7-linux-x86_64.egg/pandas/tools/pivot.pyc in crosstab(rows, cols, values, rownames, colnames, aggfunc, margins, dropna)
    368         df['__dummy__'] = 0
    369         table = df.pivot_table('__dummy__', rows=rownames, cols=colnames,
--> 370                                aggfunc=len, margins=margins, dropna=dropna)
    371         return table.fillna(0).astype(np.int64)
    372     else:

/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas-0.13.0_247_g82bcbb8-py2.7-linux-x86_64.egg/pandas/tools/pivot.pyc in pivot_table(data, values, rows, cols, aggfunc, fill_value, margins, dropna)
    108         to_unstack = [agged.index.names[i]
    109                       for i in range(len(rows), len(keys))]
--> 110         table = agged.unstack(to_unstack)
    111 
    112     if not dropna:

/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas-0.13.0_247_g82bcbb8-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in unstack(self, level)
   3339         """
   3340         from pandas.core.reshape import unstack
-> 3341         return unstack(self, level)
   3342 
   3343     #----------------------------------------------------------------------

/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas-0.13.0_247_g82bcbb8-py2.7-linux-x86_64.egg/pandas/core/reshape.pyc in unstack(obj, level)
    416 def unstack(obj, level):
    417     if isinstance(level, (tuple, list)):
--> 418         return _unstack_multiple(obj, level)
    419 
    420     if isinstance(obj, DataFrame):

/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas-0.13.0_247_g82bcbb8-py2.7-linux-x86_64.egg/pandas/core/reshape.pyc in _unstack_multiple(data, clocs)
    275     index = data.index
    276 
--> 277     clocs = [index._get_level_number(i) for i in clocs]
    278 
    279     rlocs = [i for i in range(index.nlevels) if i not in clocs]

/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas-0.13.0_247_g82bcbb8-py2.7-linux-x86_64.egg/pandas/core/index.pyc in _get_level_number(self, level)
   2197         except ValueError:
   2198             if not isinstance(level, int):
-> 2199                 raise KeyError('Level %s not found' % str(level))
   2200             elif level < 0:
   2201                 level += self.nlevels

KeyError: 'Level s not found'
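
The collision described at the top of this report can be illustrated with a minimal sketch. This is not the actual pandas source; it only assumes, as stated above, that the inputs are collected in a plain dict keyed by Series name. The variable name `data` is chosen for this illustration.

```python
import pandas as pd

s1 = pd.Series([1, 1, 2, 2, 3, 3], name="s")
s2 = pd.Series([1, 1, 1, 2, 2, 2], name="s")

# Hypothetical stand-in for the internal storage: one dict entry per name.
data = {}
for series in (s1, s2):
    data[series.name] = series  # the second "s" overwrites the first

# Only one column survives, so there is nothing left to cross-tabulate.
frame = pd.DataFrame(data)
print(list(frame.columns))
```

With a single surviving key, any later step that expects one level per input has nothing to look up for the second Series.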
@jreback
Contributor

jreback commented Mar 22, 2014

I am not sure this is a bug. What would you expect this to do?

@theandygross
Contributor Author

I can't remember if the previous versions added to the column labels or just dropped them. It would look something like either:

s_col  1  2
s_row      
1      2  0
2      1  1
3      0  2

or

col_0  1  2
row_0      
1      2  0
2      1  1
3      0  2
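
For reference, the first layout can be requested explicitly by naming the axes through the `rownames` and `colnames` parameters of `pd.crosstab`. This is a workaround sketch rather than a fix; the labels `s_row` and `s_col` are simply the ones used in the example output above.

```python
import pandas as pd

s1 = pd.Series([1, 1, 2, 2, 3, 3], name="s")
s2 = pd.Series([1, 1, 1, 2, 2, 2], name="s")

# Distinct axis names sidestep the shared Series name.
table = pd.crosstab(s1, s2, rownames=["s_row"], colnames=["s_col"])
print(table)
```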

@johnhess

I'm having the exact same issue manifested in a slightly different way. In my case, I'm using DataFrame.pivot_table and specifying the same column as rows and as cols.

With a simple dataframe like this

>>> df=pd.DataFrame([[1,2],[3,4],[5,6]], columns=["a","b"])
>>> df
   a  b
0  1  2
1  3  4
2  5  6

When we pivot a x b we get

>>> df.pivot_table(rows="a", cols="b", aggfunc='count')["a"]
b   2   4   6
a            
1   1 NaN NaN
3 NaN   1 NaN
5 NaN NaN   1

[3 rows x 3 columns]

which makes perfect sense. Similarly, I would expect that if I pivoted a x a that I would get

>>> df.pivot_table(rows="a", cols="a", aggfunc='count')["a"]
a   1   3   5
a            
1   1 NaN NaN
3 NaN   1 NaN
5 NaN NaN   1

[3 rows x 3 columns]
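
One possible way to get an a x a table like the one above is to pivot against a copy of the column under a second name. This is only a sketch: `a_col` is a hypothetical name for the copy, the counts are taken over column `b`, and the call uses the `index`/`columns` keywords of later pandas releases instead of `rows`/`cols`.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "b"])

# Duplicate "a" under a hypothetical second name so the row and
# column keys no longer share a label.
table = df.assign(a_col=df["a"]).pivot_table(
    index="a", columns="a_col", values="b", aggfunc="count"
)
print(table)
```

The column axis is then labeled `a_col` rather than `a`, which is the price of keeping the two keys distinct.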

@jreback jreback added this to the 0.15.0 milestone Mar 25, 2014
@TomAugspurger
Contributor

@johnhess for your problem you can work around it with something like

In [18]: pd.DataFrame(np.diag(df.a), index=df.index, columns=df.index)
Out[18]: 
   0  1  2
0  1  0  0
1  0  3  0
2  0  0  5

[3 rows x 3 columns]

It looks like these are caused by the same issue: how unstack handles indices with duplicate names:

ipdb> agged
     __dummy__
s s           
1 1          3
2 3          3

I don't think that agged.unstack('s') is well defined here (which 's' gets put in the columns?). So that correctly raises an error (the error message could be clearer though).
But in the examples for this issue we already know that the s that came from the second argument to crosstab goes in the column (and for pivot_table the cols argument goes to the column).
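
The ambiguity can be reproduced without crosstab. The sketch below hand-builds a Series of counts for the two Series from the original report on an index whose two levels are both named `s` (an assumption made for illustration, not the intermediate pandas itself builds). Unstacking by name fails, while unstacking by position is unambiguous.

```python
import pandas as pd

# Counts for the (s1, s2) pairs in the original report; both levels named "s".
index = pd.MultiIndex.from_tuples(
    [(1, 1), (2, 1), (2, 2), (3, 2)], names=["s", "s"]
)
counts = pd.Series([2, 1, 1, 2], index=index)

# By name: which "s" should move to the columns? pandas refuses to guess.
try:
    counts.unstack("s")
    raised = False
except (KeyError, ValueError) as err:
    raised = True
    print(type(err).__name__, err)

# By position: level 1 is the second argument, so it goes to the columns.
wide = counts.unstack(level=1, fill_value=0)
print(wide)
```

The exception type depends on the pandas version, which is why the sketch catches both `KeyError` and `ValueError`.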

@jreback any objection to having checks in crosstab and pivot_table to check for these special cases? I can do it this weekend.

@jreback
Contributor

jreback commented Mar 28, 2014

@TomAugspurger that sounds reasonable. Maybe a ValueError or something with an explanation. In theory it could take a suffix argument, but this seems like a special case.

@johnhess

@TomAugspurger Thanks for taking the time to look into it! I have another workaround in at the moment, so I'm safe, but I worry that others will expect pandas to crosstab any two valid series and end up with surprise errors. In my case, users of my application have an interface to crosstab any two columns and I hadn't realized I needed a special case when they have the same name.
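
For anyone in the same position, a caller-side special case might look like the sketch below. The helper name `crosstab_any` and its `suffixes` parameter are hypothetical and not part of pandas; the helper renames the inputs only when their names collide and otherwise defers to `pd.crosstab` unchanged.

```python
import pandas as pd


def crosstab_any(rows, cols, suffixes=("_row", "_col")):
    """Cross-tabulate two Series, disambiguating names only on collision.

    Hypothetical helper, not part of pandas.
    """
    if rows.name is not None and rows.name == cols.name:
        # Series.rename with a scalar returns a copy with a new name.
        rows = rows.rename(f"{rows.name}{suffixes[0]}")
        cols = cols.rename(f"{cols.name}{suffixes[1]}")
    return pd.crosstab(rows, cols)


df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "b"])
same = crosstab_any(df["a"], df["a"])   # axes labeled a_row / a_col
mixed = crosstab_any(df["a"], df["b"])  # names left untouched
print(same)
```

Renaming only on collision keeps the output labels unchanged for the common case where the two columns already differ.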

@hayd
Contributor

hayd commented Mar 28, 2014

👍 to a better error message, imo better to raise than infer here.

@jreback jreback modified the milestones: 0.14.0, 0.15.0 Mar 28, 2014
@TomAugspurger
Contributor

@hayd I addressed the error msg in #6738

I agree that df.unstack(level='level') and df.stack(level='level') should both raise when it's ambiguous which level is meant because of duplicate names in the index.

However, in the cases of pivot_table(rows='level', columns='level') and crosstab(s1, s1), there isn't any ambiguity. The first arg/rows arg goes in the rows and the second/cols arg goes in the columns. The current issue of raising is just the implementation not handling that case. Are you ok with inferring there?

@jreback
Contributor

jreback commented Mar 29, 2014

It might make sense in pivot to collapse the index (just droplevel(1)) when row and column refer to a single name.

@jreback
Contributor

jreback commented Apr 21, 2014

@TomAugspurger for 0.14? or since you fixed the error can bump?

@TomAugspurger
Contributor

I'm not seeing a quick fix here. The current pivot_table implementation depends on not having any repeats in index or columns. So I guess bump for now unless I come up with a clever fix.

@TomAugspurger TomAugspurger modified the milestones: 0.15.0, 0.14.0 Apr 21, 2014
@hayd
Contributor

hayd commented Apr 22, 2014

> The current issue of raising is just the implementation not handling that case

can you raise a NotImplementedError for this part?

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@jreback
Contributor

jreback commented Dec 29, 2017

this appears fixed. if someone could locate the reference we can close.

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Dec 29, 2017
@mroeschke
Member

Closed by #16028

@jreback jreback closed this as completed Jan 1, 2018
@kasuteru
Contributor

kasuteru commented Jul 6, 2018

This is not fixed for me, using pandas version 0.23.1:

print(pd.__version__)
df = pd.DataFrame(data={"a":[1,2,3,4]})
a1 = df["a"]
a2 = df["a"]
pd.crosstab(a1,a2)

still raises an error for me.

Edit: Submitted as issue #21765

@TomAugspurger
Contributor

TomAugspurger commented Jul 6, 2018 via email

@kasuteru
Contributor

kasuteru commented Jul 6, 2018

I submitted it as #21765. Turns out that the example provided here also fails, so I used that.


7 participants