Skip to content

pd.crosstab, categorical data and missing instances #16367

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
peisenha opened this issue May 16, 2017 · 15 comments
Open

pd.crosstab, categorical data and missing instances #16367

peisenha opened this issue May 16, 2017 · 15 comments
Labels
Categorical Categorical Data Type Docs good first issue Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@peisenha
Copy link

Code Sample, a copy-pastable example if possible

import pandas as pd
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
pd.crosstab(foo, bar)

col_0  d  e
row_0      
a      1  0
b      0  1
c      0  0

Problem description

This is from the documentation:

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

However, f is not included in the table while c is.

Please let me know if this is in fact a bug, then I will be glad to write give writing a patch a try.

Thanks a lot in advance!

Expected Output

col_0 d e f
row_0
a 1 0 0
b 0 1 0
c 0 0 0

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.12.final.0 python-bits: 64 OS: Linux OS-release: 4.8.0-49-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 2.8.7
pip: 8.1.1
setuptools: 20.7.0
Cython: None
numpy: 1.12.1
scipy: 0.17.0
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Copy link
Contributor

The docstring even has an example further down

>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> crosstab(foo, bar)  # 'c' and 'f' are not represented in the data,
                        # but they still will be counted in the output
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0

which is not what I get, so defiantly a bug somewhere. I suspect that #15511 may be related, since

In [14]: crosstab(foo, bar, dropna=False)
Out[14]:
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0

does produce the correct output. crosstab is defined in pandas/core/reshape/pivot.py, if you want to start there.

@TomAugspurger TomAugspurger added Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode Effort Medium labels May 16, 2017
@TomAugspurger TomAugspurger added this to the 0.20.2 milestone May 16, 2017
@peisenha
Copy link
Author

peisenha commented May 16, 2017 via email

@peisenha
Copy link
Author

#15193 and #15511 are two related issues. Looking at the source code and the discussion, it seems to me that dropping empty columns is the desired behavior for dropna=True (default).

This is the relevant code in pivot_table()

    # GH 15193 Makse sure empty columns are removed if dropna=True
    if isinstance(table, DataFrame) and dropna:
        table = table.dropna(how='all', axis=1)

If you agree, just let me know and I will be glad to adjust the documentation accordingly.

@TomAugspurger
Copy link
Contributor

It seems like the resolution from #12298 was that all the categories should be present in the output. #15511 seems to go against that... So I think this is a regression and not just a doc issue.

I think the issue is that the meaning of dropna becomes ambiguous when you have a Categorical. With a categorical that isn't observed, you're going to introduce NaNs by definition. I hate to make the dropna argument to pivot_table even more confusing, but I think its behavior should depend on the type of the index or columns. This may be tricky to do correctly, as columns in pivot_table can be a list :/

@peisenha
Copy link
Author

Alright, if you would like me to take a crack at it, let me know. I will be glad to provide a fix and a regression test ... As this would be my first contribution to the library, I will probably need some guidance in the process.

@TomAugspurger
Copy link
Contributor

Yeah, it'd be great if you can take a shot.

But first, let's see if @jreback and @jorisvandenbossche agree that the documented version is correct, and that treating dropna differently for Categoricals is a good idea.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 16, 2017

Focusing on crosstab for a moment, what is an example when you get a column with NaNs? (trying to understand the usecase of the dropna method)

@TomAugspurger
Copy link
Contributor

In this case, crosstab is a pivot_table followed up by a fillna(0):

        table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
                               aggfunc=len, margins=margins, dropna=dropna)
        table = table.fillna(0).astype(np.int64)

so the change to pivot_table's NaN handling affected this downstream.

@jorisvandenbossche
Copy link
Member

So the dropna in crosstab is actually also a bit confusing explantion, as you are not dropping all NaN columns, but all 0 columns :-)
But apart from that, I am still wondering when you get such a column. Because the columns are created from the values, so how is it possible (apart from using categoricals) that a value is not present?

@TomAugspurger
Copy link
Contributor

so how is it possible (apart from using categoricals) that a value is not present?

I guess it can happen if you have multiple levels, some of which aren't observed:

In [21]: df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 3], 'C': ['a', 'a', 'a', 'b']})

In [22]: pd.crosstab(df.A, [df.B, df.C], dropna=False)
Out[22]:
B  1     2     3
C  a  b  a  b  a  b
A
1  1  0  1  0  0  0
2  1  0  0  0  0  1

@jreback
Copy link
Contributor

jreback commented Jun 1, 2017

from the example above.

In [7]: pd.crosstab(df.A, [df.B, df.C], dropna=False)
Out[7]: 
B  1     2     3   
C  a  b  a  b  a  b
A                  
1  1  0  1  0  0  0
2  1  0  0  0  0  1

In [8]: pd.crosstab(df.A, [df.B, df.C], dropna=True)
Out[8]: 
B  1  2  3
C  a  a  b
A         
1  1  1  0
2  1  0  1

I think the issue is that the meaning of dropna is just confusing. maybe have strings instead?
are there 3 cases here that are useful?

@jorisvandenbossche
Copy link
Member

Could a solution to this problem be to change the default of dropna to None instead of True?
So if dropna=None would then depend on the dtype: False for categorical, True for other dtypes.

@mroeschke mroeschke added Bug Categorical Categorical Data Type labels May 11, 2020
@AbhayGoyal
Copy link

Hey, what needs to be done for this, maybe I can give it a try?

@Yoshishiro
Copy link

Hello! I am interested to help in this issue.

@MarcoGorelli
Copy link
Member

I just tried this on master and got

>>> import pandas as pd
>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> pd.crosstab(foo, bar)
col_0  d  e
row_0      
a      1  0
b      0  1
>>> pd.crosstab(foo, bar, dropna=False)
col_0  d  e  f
row_0         
a      1  0  0
b      0  1  0
c      0  0  0

which seems correct and in accordance with the description given alongside the example in the docs.

The only part which strikes me as not correct is that the docs still read

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

So, if that line in the docs is changed to

When using dropna=False, any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

then can we close the issue? Changing the default type of dropna would be a breaking change, and I'm not sure it would be worth it

@mroeschke mroeschke added Docs and removed Bug Regression Functionality that used to work in a prior pandas version labels Jun 12, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Docs good first issue Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

9 participants