pd.crosstab, categorical data and missing instances #16367

peisenha · 2017-05-16T12:43:37Z

Code Sample, a copy-pastable example if possible

import pandas as pd
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
pd.crosstab(foo, bar)

col_0  d  e
row_0      
a      1  0
b      0  1
c      0  0

Problem description

This is from the documentation:

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

However, f is not included in the table while c is.

Please let me know if this is in fact a bug, then I will be glad to write give writing a patch a try.

Thanks a lot in advance!

Expected Output

col_0 d e f
row_0
a 1 0 0
b 0 1 0
c 0 0 0

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 2.7.12.final.0 python-bits: 64 OS: Linux OS-release: 4.8.0-49-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 2.8.7
pip: 8.1.1
setuptools: 20.7.0
Cython: None
numpy: 1.12.1
scipy: 0.17.0
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2017-05-16T12:52:06Z

The docstring even has an example further down

>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> crosstab(foo, bar)  # 'c' and 'f' are not represented in the data,
                        # but they still will be counted in the output
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0

which is not what I get, so defiantly a bug somewhere. I suspect that #15511 may be related, since

In [14]: crosstab(foo, bar, dropna=False)
Out[14]:
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0

does produce the correct output. crosstab is defined in pandas/core/reshape/pivot.py, if you want to start there.

peisenha · 2017-05-16T13:00:33Z

Thanks a lot for the swift response. I will give it a try.

…

On Tue, May 16, 2017 at 2:52 PM, Tom Augspurger ***@***.***> wrote: The docstring even has an example further down >>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])>>> crosstab(foo, bar) # 'c' and 'f' are not represented in the data, # but they still will be counted in the output col_0 d e f row_0 a 1 0 0 b 0 1 0 c 0 0 0 which is not what I get, so defiantly a bug somewhere. I suspect that #15511 <#15511> may be related, since In [14]: crosstab(foo, bar, dropna=False) Out[14]: col_0 d e f row_0 a 1 0 0 b 0 1 0 c 0 0 0 does produce the correct output. crosstab is defined in pandas/core/reshape/pivot.py, if you want to start there. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#16367 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADcMJ6Was7MAnYZ3x3ITFPtYjAvz_GVGks5r6ZwOgaJpZM4Ncb6_> .

-- *Philipp Eisenhauer* Economist Mail [email protected] Web www.eisenhauer.io Repository https://github.com/peisenha

peisenha · 2017-05-16T14:00:42Z

#15193 and #15511 are two related issues. Looking at the source code and the discussion, it seems to me that dropping empty columns is the desired behavior for dropna=True (default).

This is the relevant code in pivot_table()

    # GH 15193 Makse sure empty columns are removed if dropna=True
    if isinstance(table, DataFrame) and dropna:
        table = table.dropna(how='all', axis=1)

If you agree, just let me know and I will be glad to adjust the documentation accordingly.

TomAugspurger · 2017-05-16T14:14:14Z

It seems like the resolution from #12298 was that all the categories should be present in the output. #15511 seems to go against that... So I think this is a regression and not just a doc issue.

I think the issue is that the meaning of dropna becomes ambiguous when you have a Categorical. With a categorical that isn't observed, you're going to introduce NaNs by definition. I hate to make the dropna argument to pivot_table even more confusing, but I think its behavior should depend on the type of the index or columns. This may be tricky to do correctly, as columns in pivot_table can be a list :/

peisenha · 2017-05-16T14:33:36Z

Alright, if you would like me to take a crack at it, let me know. I will be glad to provide a fix and a regression test ... As this would be my first contribution to the library, I will probably need some guidance in the process.

TomAugspurger · 2017-05-16T14:40:57Z

Yeah, it'd be great if you can take a shot.

But first, let's see if @jreback and @jorisvandenbossche agree that the documented version is correct, and that treating dropna differently for Categoricals is a good idea.

jorisvandenbossche · 2017-05-16T15:28:13Z

Focusing on crosstab for a moment, what is an example when you get a column with NaNs? (trying to understand the usecase of the dropna method)

TomAugspurger · 2017-05-16T15:35:04Z

In this case, crosstab is a pivot_table followed up by a fillna(0):

        table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
                               aggfunc=len, margins=margins, dropna=dropna)
        table = table.fillna(0).astype(np.int64)

so the change to pivot_table's NaN handling affected this downstream.

jorisvandenbossche · 2017-05-16T15:38:55Z

So the dropna in crosstab is actually also a bit confusing explantion, as you are not dropping all NaN columns, but all 0 columns :-)
But apart from that, I am still wondering when you get such a column. Because the columns are created from the values, so how is it possible (apart from using categoricals) that a value is not present?

TomAugspurger · 2017-05-16T15:51:55Z

so how is it possible (apart from using categoricals) that a value is not present?

I guess it can happen if you have multiple levels, some of which aren't observed:

In [21]: df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 3], 'C': ['a', 'a', 'a', 'b']})

In [22]: pd.crosstab(df.A, [df.B, df.C], dropna=False)
Out[22]:
B  1     2     3
C  a  b  a  b  a  b
A
1  1  0  1  0  0  0
2  1  0  0  0  0  1

jreback · 2017-06-01T11:15:17Z

from the example above.

In [7]: pd.crosstab(df.A, [df.B, df.C], dropna=False)
Out[7]: 
B  1     2     3   
C  a  b  a  b  a  b
A                  
1  1  0  1  0  0  0
2  1  0  0  0  0  1

In [8]: pd.crosstab(df.A, [df.B, df.C], dropna=True)
Out[8]: 
B  1  2  3
C  a  a  b
A         
1  1  1  0
2  1  0  1

I think the issue is that the meaning of dropna is just confusing. maybe have strings instead?
are there 3 cases here that are useful?

jorisvandenbossche · 2018-09-01T08:48:03Z

Could a solution to this problem be to change the default of dropna to None instead of True?
So if dropna=None would then depend on the dtype: False for categorical, True for other dtypes.

AbhayGoyal · 2021-01-24T00:10:38Z

Hey, what needs to be done for this, maybe I can give it a try?

Yoshishiro · 2021-02-10T12:03:59Z

Hello! I am interested to help in this issue.

MarcoGorelli · 2021-02-22T15:16:40Z

I just tried this on master and got

>>> import pandas as pd
>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> pd.crosstab(foo, bar)
col_0  d  e
row_0      
a      1  0
b      0  1
>>> pd.crosstab(foo, bar, dropna=False)
col_0  d  e  f
row_0         
a      1  0  0
b      0  1  0
c      0  0  0

which seems correct and in accordance with the description given alongside the example in the docs.

The only part which strikes me as not correct is that the docs still read

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

So, if that line in the docs is changed to

When using dropna=False, any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

then can we close the issue? Changing the default type of dropna would be a breaking change, and I'm not sure it would be worth it

TomAugspurger added Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode Effort Medium labels May 16, 2017

TomAugspurger added this to the 0.20.2 milestone May 16, 2017

jreback modified the milestones: 0.21.0, 0.20.2 Jun 2, 2017

gfyoung mentioned this issue Jul 20, 2017

Groupby inconsistency with categorical values #17032

Closed

jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017

TomAugspurger added the good first issue label Oct 11, 2017

jreback removed the Difficulty Novice label Dec 15, 2017

kokes mentioned this issue Aug 28, 2018

crosstab in 0.23.4 does not respect categorical variables #22453

Closed

jbrockmendel removed the Effort Medium label Oct 21, 2019

mroeschke added Bug Categorical Categorical Data Type labels May 11, 2020

mroeschke added Docs and removed Bug Regression Functionality that used to work in a prior pandas version labels Jun 12, 2021

simonjayhawkins mentioned this issue May 29, 2022

BUG: Crosstab inconsistent with using aggfunc=sum when MultiIndex has category #47147

Open

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pd.crosstab, categorical data and missing instances #16367

pd.crosstab, categorical data and missing instances #16367

peisenha commented May 16, 2017

TomAugspurger commented May 16, 2017

peisenha commented May 16, 2017 via email

peisenha commented May 16, 2017

TomAugspurger commented May 16, 2017

peisenha commented May 16, 2017

TomAugspurger commented May 16, 2017

jorisvandenbossche commented May 16, 2017 •

edited

Loading

TomAugspurger commented May 16, 2017

jorisvandenbossche commented May 16, 2017

TomAugspurger commented May 16, 2017

jreback commented Jun 1, 2017

jorisvandenbossche commented Sep 1, 2018

AbhayGoyal commented Jan 24, 2021

Yoshishiro commented Feb 10, 2021

MarcoGorelli commented Feb 22, 2021

pd.crosstab, categorical data and missing instances #16367

pd.crosstab, categorical data and missing instances #16367

Comments

peisenha commented May 16, 2017

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

TomAugspurger commented May 16, 2017

peisenha commented May 16, 2017 via email

peisenha commented May 16, 2017

TomAugspurger commented May 16, 2017

peisenha commented May 16, 2017

TomAugspurger commented May 16, 2017

jorisvandenbossche commented May 16, 2017 • edited Loading

TomAugspurger commented May 16, 2017

jorisvandenbossche commented May 16, 2017

TomAugspurger commented May 16, 2017

jreback commented Jun 1, 2017

jorisvandenbossche commented Sep 1, 2018

AbhayGoyal commented Jan 24, 2021

Yoshishiro commented Feb 10, 2021

MarcoGorelli commented Feb 22, 2021

Output of `pd.show_versions()`

jorisvandenbossche commented May 16, 2017 •

edited

Loading