Groupby ignores unobserved combinations when passing more than one categorical column even if observed=True #27075
Comments
Looks like a bug. Interested in investigating where it's coming from? IIRC, groupby may ask the index for the output values, which is where CategoricalIndex says "here's all the categories, including unobserved". But MultiIndex wouldn't have that.
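To make the CategoricalIndex vs. MultiIndex point concrete, here is a minimal sketch using only public pandas calls. It is one reading of the comment above, not a trace of the groupby internals; the data is borrowed from the reproducible example further down the thread.

```python
import pandas as pd

state = pd.Categorical(["AK", "AL", "AK", "AL"])
gender = pd.Categorical(["M", "M", "M", "F"])

# A CategoricalIndex carries its full category list, observed or not.
single = pd.CategoricalIndex(state)
print(list(single.categories))

# A MultiIndex built from the rows only holds the tuples that actually occur.
observed = pd.MultiIndex.from_arrays(
    [state, gender], names=["state", "gender"]
).unique()

# The full Cartesian product has to be constructed explicitly.
full = pd.MultiIndex.from_product(
    [state.categories, gender.categories], names=["state", "gender"]
)

# Combinations that are valid categories but never appear in the data.
missing = sorted(set(full) - set(observed))
print(missing)
```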
Ran into what seems to be the same issue. Here's a reproducible example, with some different fixes in the answers: https://stackoverflow.com/q/57385009/7954504.
df = pd.DataFrame({
    "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
    "gender": pd.Categorical(["M", "M", "M", "F"]),
    "name": list("abcd"),
})
Incorrect result:
>>> df.groupby(["state", "gender"])["name"].count()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64
Should be:
state  gender
AK     M         2
       F         0
AL     F         1
       M         1
Name: name, dtype: int64
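Until the behaviour is fixed, one possible workaround is to reindex the result against the full product of the categories. This is a sketch under that assumption and not necessarily one of the fixes from the linked Stack Overflow answers; `full_index` and `full_counts` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
    "gender": pd.Categorical(["M", "M", "M", "F"]),
    "name": list("abcd"),
})

# Count only the observed combinations first.
counts = df.groupby(["state", "gender"], observed=True)["name"].count()

# Build every (state, gender) pair from the declared categories.
full_index = pd.MultiIndex.from_product(
    [df["state"].cat.categories, df["gender"].cat.categories],
    names=["state", "gender"],
)

# Reindex so that unobserved pairs appear with a count of 0.
full_counts = counts.reindex(full_index, fill_value=0)
print(full_counts)
```

The rows come out in category order, so the unobserved ("AK", "F") pair is listed first with a count of 0.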
Please check on 0.25.
>>> import pandas as pd
>>> pd.__version__
'0.25.0'
>>> df = pd.DataFrame({
... "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
... "gender": pd.Categorical(["M", "M", "M", "F"]),
... "name": list("abcd"),
... })
>>> df.groupby(["state", "gender"])["name"].count()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64
I'd be willing to take a stab at investigating this one when I have some time next week.
Super-Summary
The change may need to be made in
(Pdb) self = grouper
(Pdb) all_labels = [ping.labels for ping in self.groupings]
(Pdb) p all_labels
[array([0, 1, 0, 1], dtype=int8), array([1, 1, 1, 0], dtype=int8)]
(Pdb) from pandas.core.sorting import get_group_index
(Pdb) self.shape
(2, 2)
(Pdb) group_index = get_group_index(all_labels, self.shape, sort=True, xnull=True)
(Pdb) p group_index
array([1, 3, 1, 2])
(Pdb) from pandas.core.sorting import compress_group_index
(Pdb) compress_group_index(group_index, sort=self.sort)
(array([0, 2, 0, 1]), array([1, 2, 3]))  # BAD
Given the two arrays from
With That is partly because
I.e. in
You have:
After which everything boils down to
And last but not least, Line 368 in 486ade0
This is the underlying call that produces the
(Pdb) p group_index
array([1, 3, 1, 2])
(Pdb) table.get_labels_groupby(group_index)
(array([0, 1, 0, 2]), array([1, 3, 2]))
Feel free to read further, but everything below this point may be outdated, as it logs how I got from A to Z to arrive at the above.
OUTDATED
I suspect that there may be several problems in
That is, the
grouper = BaseGrouper(group_axis, groupings, sort=sort, mutated=mutated)
Here is a recreation of
while the individual
Here is the resulting
(Pdb) p self.grouper
[AK, AL, AK, AL]
Categories (2, object): [AK, AL]
(Pdb) p self._group_index
CategoricalIndex(['AK', 'AL'], categories=['AK', 'AL'], ordered=False, dtype='category')
(Pdb) p self._labels
array([0, 1, 0, 1], dtype=int8)
(Pdb) self.result_index
CategoricalIndex(['AK', 'AL'], categories=['AK', 'AL'], ordered=False, dtype='category')
And here is the resulting
(Pdb) p self.grouper
[M, M, M, F]
Categories (2, object): [F, M]
(Pdb) p self._group_index
CategoricalIndex(['F', 'M'], categories=['F', 'M'], ordered=False, dtype='category')
(Pdb) p self._labels
array([1, 1, 1, 0], dtype=int8)
(Pdb) self.result_index
CategoricalIndex(['F', 'M'], categories=['F', 'M'], ordered=False, dtype='category')
So to take a first stab at this:
(Pdb) grouper.recons_labels
[array([0, 1, 1]), array([1, 0, 1])]
(Pdb) grouper.group_info
(array([0, 2, 0, 1]), array([1, 2, 3]), 3)
(Pdb) grouper._get_compressed_labels()
(array([0, 2, 0, 1]), array([1, 2, 3]))
Called in pandas/pandas/core/groupby/ops.py Line 311 in 486ade0
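The numbers in the pdb sessions above can be reproduced with plain NumPy, which may make the flatten-then-compress step easier to follow. This is a stand-in sketch, not the pandas implementation: `np.ravel_multi_index` and `np.unique` are assumed here to play the roles of `get_group_index` and `compress_group_index` for this small case.

```python
import numpy as np

# Per-column category codes, as printed by `p all_labels` above.
state_codes = np.array([0, 1, 0, 1])
gender_codes = np.array([1, 1, 1, 0])
shape = (2, 2)  # number of categories per column

# Flatten each (state, gender) code pair into one integer id.
group_index = np.ravel_multi_index((state_codes, gender_codes), shape)
print(group_index)  # [1 3 1 2]

# "Compress": keep only the ids that occur and renumber them densely.
obs_group_ids, comp_ids = np.unique(group_index, return_inverse=True)
print(comp_ids, obs_group_ids)  # [0 2 0 1] [1 2 3]

# Decoding the observed ids gives back per-column codes, one per observed group.
recons = np.unravel_index(obs_group_ids, shape)
print([r.tolist() for r in recons])  # [[0, 1, 1], [1, 0, 1]]
```

Id 0, which stands for the ("AK", "F") pair, never makes it into `obs_group_ids`, so anything built from the compressed ids has three groups instead of four.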
In [35]: df = pd.DataFrame({
...: "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
...: "gender": pd.Categorical(["M", "M", "M", "F"]),
...: "name": list("abcd"),
...: })
In [36]: df.groupby(["state", "gender"]).grouper.result_index
Out[36]:
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M')],
           names=['state', 'gender'])
It's worth mentioning that this does seem to be a problem with
First observation: it seems like this occurs for some, but not all, methods of the
>>> bn = df.groupby(["state", "gender"])["name"]
>>> bn.size()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64
>>> bn.count()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64
>>> bn.nunique()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64
>>> bn.first()
state  gender
AK     F         NaN
       M           a
AL     F           d
       M           b
Name: name, dtype: object
The next thing I did was to step into
In
(Pdb) list 1310,1331
1310         """
1311         Compute count of group, excluding missing values.
1312
1313         Returns
1314         -------
1315         Series
1316             Count of values within each group.
1317         """
1318  ->     ids, _, ngroups = self.grouper.group_info
1319         val = self.obj._internal_get_values()
1320
1321         mask = (ids != -1) & ~isna(val)
1322         ids = ensure_platform_int(ids)
1323         minlength = ngroups or 0
1324         out = np.bincount(ids[mask], minlength=minlength)
1325
1326         return Series(
1327             out,
1328             index=self.grouper.result_index,
1329             name=self._selection_name,
1330             dtype="int64",
1331         )
(Pdb) unt 1325
> ...lib/python3.7/site-packages/pandas/core/groupby/generic.py(1326)count()
-> return Series(
(Pdb) p ids
array([0, 2, 0, 1])
(Pdb) p ngroups
3
(Pdb) p mask
array([ True, True, True, True])
(Pdb) p out
array([2, 1, 1])
Namely:
>>> bn.groups
{('AK', 'M'): Int64Index([0, 2], dtype='int64'),
('AL', 'F'): Int64Index([3], dtype='int64'),
 ('AL', 'M'): Int64Index([1], dtype='int64')}
But what seems to stand out most is
>>> bn.grouper.result_index
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M')],
           names=['state', 'gender'])
Regardless of whether
Something similar happens for
(Pdb) list 267, 277
267         """
268         Compute group sizes
269
270         """
271  ->     ids, _, ngroup = self.group_info
272         ids = ensure_platform_int(ids)
273         if ngroup:
274             out = np.bincount(ids[ids != -1], minlength=ngroup)
275         else:
276             out = []
277         return Series(out, index=self.result_index, dtype="int64")
(Pdb) unt 277
> ...lib/python3.7/site-packages/pandas/core/groupby/ops.py(277)size()
-> return Series(out, index=self.result_index, dtype="int64")
(Pdb) p out
array([2, 1, 1])
(Pdb) p self.result_index
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M')],
           names=['state', 'gender'])
So it would appear initially that both ... all of the above led me to
(Pdb) p grouper.result_index
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M')],
           names=['state', 'gender'])
Stepping further down, this is what has been passed to
(Pdb) args
obj = state gender name
0 AK M a
1 AL M b
2 AK M c
3 AL F d
key = ['state', 'gender']
axis = 0
level = None
sort = True
observed = False
mutated = False
validate = True
Skipping ahead to 547: pandas determines that this is not a single column name str but a list of them. Not much to see there; keys is
Then we enter this big loop on 606. At this point (0th iteration):
(Pdb) gpr
'state'
(Pdb) p is_in_axis(gpr) # df.groupby('name')
True
(Pdb) p is_categorical_dtype(gpr) # even though the *Series* is categorical, gpr is just str
False
Finally
(Pdb) p type(self.grouper)
<class 'pandas.core.arrays.categorical.Categorical'>
Then
(Pdb) p is_categorical_dtype(self.grouper)
True
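The `count()` and `size()` listings above both end in an `np.bincount` call whose length is fixed by the number of groups. A small sketch of that step, using the ids printed in the pdb output, shows why the unobserved pair drops out. The second half is an illustration of what uncompressed ids would give, not a proposed patch.

```python
import numpy as np
import pandas as pd

values = np.array(["a", "b", "c", "d"], dtype=object)  # the "name" column

# Compressed ids and group count, as printed by `p ids` and `p ngroups`.
ids = np.array([0, 2, 0, 1])
ngroups = 3

# Same masking as in the listing: drop rows with id -1 or a missing value.
mask = (ids != -1) & ~pd.isna(values)
out = np.bincount(ids[mask], minlength=ngroups)
print(out)  # [2 1 1]

# With the uncompressed ids (one slot per category pair) the empty
# ("AK", "F") slot survives as a zero.
full_ids = np.array([1, 3, 1, 2])
full_out = np.bincount(full_ids[mask], minlength=4)
print(full_out)  # [0 2 1 1]
```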
I think this is the same issue as #23865
This seems to be fixed in versions >=1.0.0
Great, would love to have a PR with validation tests (note we likely have some of these examples already).
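A validation test along these lines might look like the sketch below. The test name and structure are illustrative and not taken from the pandas test suite; it assumes a pandas version in which the fix is present and passes `observed=False` explicitly.

```python
import pandas as pd


def test_groupby_two_categoricals_includes_unobserved():
    df = pd.DataFrame({
        "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
        "gender": pd.Categorical(["M", "M", "M", "F"]),
        "name": list("abcd"),
    })
    result = df.groupby(["state", "gender"], observed=False)["name"].count()

    # Every pair of declared categories should appear in the result index.
    expected_pairs = {
        (s, g)
        for s in df["state"].cat.categories
        for g in df["gender"].cat.categories
    }
    counts = dict(zip(result.index.tolist(), result.tolist()))
    assert set(counts) == expected_pairs

    # The unobserved pair should be reported with a count of zero.
    assert counts[("AK", "F")] == 0
    assert counts[("AK", "M")] == 2
```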
take
@jreback This issue can also be closed. It is addressed by the linked Pull Request. (The PR wasn't linked at the time it was merged so this wasn't done automatically.)
thanks @smithto1
Code Sample, a copy-pastable example if possible
Problem description
groupby([cols]) returns a result for all categories if only one categorical column is provided (e.g. ['A']), but it only shows the observed combinations if multiple categorical columns are provided (e.g. ['A', 'B']), regardless of the setting of observed. I would expect to get a result for all combinations of the categories of the grouping columns.
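The contrast described above can be checked with a small script. The data here is hypothetical and is not the code sample from the original report: "z" and "q" are declared categories that never occur. No output is shown for the two-column case because it depends on the installed pandas version.

```python
import pandas as pd

# Hypothetical data: categories "z" (in A) and "q" (in B) are never observed.
df = pd.DataFrame({
    "A": pd.Categorical(["x", "y"], categories=["x", "y", "z"]),
    "B": pd.Categorical(["p", "p"], categories=["p", "q"]),
    "C": [1, 2],
})

# One categorical key: every declared category shows up in the result.
single = df.groupby("A", observed=False)["C"].count()
print(len(single))

# Two categorical keys: compare the result size with the full product.
multi = df.groupby(["A", "B"], observed=False)["C"].count()
n_expected = len(df["A"].cat.categories) * len(df["B"].cat.categories)
print(len(multi), "of", n_expected, "combinations present")
```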
Expected Output
A result for all combinations of the categories of the groupby columns. For the example above:
A B
0 10 1
1 11 1
2 12 0
0 10 0
1 11 0
2 12 0
0 10 0
1 11 0
2 12 0
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.2
pytest: 4.5.0
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: 2.0.1
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.8
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.7.0
gcsfs: None