pd.crosstab, categorical data and missing instances #16367
Comments
The docstring even has an example further down:

>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> crosstab(foo, bar)  # 'c' and 'f' are not represented in the data,
                        # but they still will be counted in the output
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0

which is not what I get, so definitely a bug somewhere. I suspect that #15511 may be related, since

In [14]: crosstab(foo, bar, dropna=False)
Out[14]:
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0

does produce the correct output. |
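For readers who need every declared category in the table regardless of how their pandas version interprets dropna, one possible workaround is to reindex the result against the categories. This is only a sketch and is not proposed anywhere in this thread; it relies on the public `DataFrame.reindex` method and the `Categorical.categories` attribute, and the names `table` and `full` are illustrative.

```python
import pandas as pd

foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

# Whatever crosstab returns by default (some categories may be missing).
table = pd.crosstab(foo, bar)

# Force one row/column per declared category; unobserved ones are filled with 0.
full = table.reindex(index=foo.categories, columns=bar.categories, fill_value=0)
print(full)
```

Because the reindex is driven by the declared categories, the shape of `full` does not depend on the dropna default.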
Thanks a lot for the swift response. I will give it a try.
On Tue, May 16, 2017 at 2:52 PM, Tom Augspurger wrote:
The docstring even has an example further down

>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> crosstab(foo, bar)  # 'c' and 'f' are not represented in the data,
                        # but they still will be counted in the output
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0

which is not what I get, so definitely a bug somewhere. I suspect that #15511 may be related, since

In [14]: crosstab(foo, bar, dropna=False)
Out[14]:
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0

does produce the correct output. crosstab is defined in pandas/core/reshape/pivot.py, if you want to start there.
--
*Philipp Eisenhauer*
Economist
Mail [email protected]
Web www.eisenhauer.io
Repository https://github.com/peisenha
|
#15193 and #15511 are two related issues. Looking at the source code and the discussion, it seems to me that dropping empty columns is the desired behavior for dropna=True (default). This is the relevant code in pivot_table()
If you agree, just let me know and I will be glad to adjust the documentation accordingly. |
It seems like the resolution from #12298 was that all the categories should be present in the output. #15511 seems to go against that... So I think this is a regression and not just a doc issue. I think the issue is that the meaning of |
Alright, if you would like me to take a crack at it, let me know. I will be glad to provide a fix and a regression test ... As this would be my first contribution to the library, I will probably need some guidance in the process. |
Yeah, it'd be great if you can take a shot. But first, let's see if @jreback and @jorisvandenbossche agree that the documented version is correct, and that treating |
Focusing on |
In this case, crosstab is a

table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
                       aggfunc=len, margins=margins, dropna=dropna)
table = table.fillna(0).astype(np.int64)

so the change to |
So the explanation of dropna in crosstab is actually also a bit confusing, as you are not dropping all-NaN columns, but all-0 columns :-) |
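To illustrate the point about zeros versus NaN, here is a rough re-creation of the two steps in the pivot_table snippet quoted above, run on a small made-up frame. The data, the column names `A` and `B`, and the variable `raw` are assumptions for illustration; the real crosstab also handles names, margins and normalization, which this sketch ignores.

```python
import numpy as np
import pandas as pd

# Made-up data: the pairs (A=1, B=3) and (A=2, B=2) never occur.
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 3]})
df["__dummy__"] = 0  # placeholder column whose rows are merely counted

# Step 1: count rows per (A, B) pair; unobserved pairs come back as NaN.
raw = df.pivot_table("__dummy__", index="A", columns="B", aggfunc=len)

# Step 2: NaN becomes 0, so the final table only ever contains zeros.
table = raw.fillna(0).astype(np.int64)

print(raw)
print(table)
```

The intermediate `raw` table is where missing combinations still show up as NaN; by the time the user sees the result they are indistinguishable from a count of zero.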
I guess it can happen if you have multiple levels, some of which aren't observed:

In [21]: df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 3], 'C': ['a', 'a', 'a', 'b']})
In [22]: pd.crosstab(df.A, [df.B, df.C], dropna=False)
Out[22]:
B  1     2     3
C  a  b  a  b  a  b
A
1  1  0  1  0  0  0
2  1  0  0  0  0  1 |
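A small comparison may make the multi-level case easier to follow. The sketch below reuses the frame from the example above and simply lists the column labels produced with and without dropna=False; the variable names are illustrative, and the exact set of columns returned by the default call may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 3], "C": ["a", "a", "a", "b"]})

# Default call: column pairs that never occur in the data may be left out.
observed = pd.crosstab(df.A, [df.B, df.C])

# dropna=False: the full product of the B and C levels is kept.
full = pd.crosstab(df.A, [df.B, df.C], dropna=False)

observed_cols = list(observed.columns)
full_cols = list(full.columns)
print(observed_cols)
print(full_cols)
```

Either way the counts add up to the four rows of the frame; only the number of all-zero columns changes.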
from the example above.
I think the issue is that the meaning of |
Could a solution to this problem be to change the default of |
Hey, what needs to be done for this, maybe I can give it a try? |
Hello! I am interested to help in this issue. |
I just tried this on master and got

>>> import pandas as pd
>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> pd.crosstab(foo, bar)
col_0  d  e
row_0
a      1  0
b      0  1
>>> pd.crosstab(foo, bar, dropna=False)
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0

which seems correct and in accordance with the description given alongside the example in the docs. The only part which strikes me as not correct is that the docs still read
So, if that line in the docs is changed to
then can we close the issue? Changing the default type of |
Code Sample, a copy-pastable example if possible
Problem description
This is from the documentation:
However, f is not included in the table while c is.
Please let me know if this is in fact a bug; if so, I will be glad to give writing a patch a try.
Thanks a lot in advance!
Expected Output
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0
Output of
pd.show_versions()
pandas: 0.20.1
pytest: 2.8.7
pip: 8.1.1
setuptools: 20.7.0
Cython: None
numpy: 1.12.1
scipy: 0.17.0
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None