Groupby ignores unobserved combinations when passing more than one categorical column even if observed=True #27075

harmbuisman · 2019-06-27T13:43:34Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df_test = pd.DataFrame()
df_test['A'] = pd.Series(np.arange(0,2), dtype='category').cat.set_categories(list(range(0,3)))
df_test['B'] = pd.Series(np.arange(10,12), dtype='category').cat.set_categories(list(range(10,13)))

print("Test DF:")
print(df_test)

print("\nThe following are as expected, unobserved categories have size = 0:")
print(df_test.groupby('A').size())
print(df_test.groupby('B').size())

print("\nThe following does not consider categories, I would expect 9 result lines here:")
print(df_test.groupby(['A','B']).size())

print("\nExpected:")
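# Build the full Cartesian product of the categories; only (0, 10) and (1, 11) are observed.
expected_index = pd.MultiIndex.from_product([range(0, 3), range(10, 13)], names=['A', 'B'])
print(pd.Series([1, 0, 0, 0, 1, 0, 0, 0, 0], index=expected_index))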

Problem description

groupby([cols]) gives back a result for all categories if only one column that is categorical is provided (e.g. ['A']), but it only shows the observed combinations if multiple categorical columns are provided ['A', 'B'], regardless of the setting of observed. I would expect that I would get a result for all combinations of the categorical columns.

Expected Output

A result for all combinations of the categorical categories of the groupby columns. For the example above:
A B
0 10 1
0 11 0
0 12 0
1 10 0
1 11 1
1 12 0
2 10 0
2 11 0
2 12 0
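Until groupby produces this itself, the expected result can be assembled by hand. The following is only a sketch of one possible workaround, not something proposed in this thread: it assumes the categories are plain integers with no missing values, and the names plain, observed_sizes, and full_index are made up for illustration. It groups on non-categorical copies of the columns and then reindexes against the product of the declared categories.

```python
import pandas as pd

df_test = pd.DataFrame({
    "A": pd.Categorical([0, 1], categories=[0, 1, 2]),
    "B": pd.Categorical([10, 11], categories=[10, 11, 12]),
})

# Group on plain integer copies so only observed pairs are returned,
# independent of how a given pandas version treats categoricals.
plain = df_test.astype({"A": "int64", "B": "int64"})
observed_sizes = plain.groupby(["A", "B"]).size()

# Every combination of the declared categories (3 x 3 = 9 pairs).
full_index = pd.MultiIndex.from_product(
    [df_test["A"].cat.categories, df_test["B"].cat.categories],
    names=["A", "B"],
)

# Unobserved pairs get a size of 0.
expected = observed_sizes.reindex(full_index, fill_value=0)
print(expected)
```

The reindex step is what supplies the zero rows; the groupby itself only ever sees the two observed pairs.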

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: 4.5.0
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: 2.0.1
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.8
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.7.0
gcsfs: None


TomAugspurger · 2019-07-08T15:26:22Z

Looks like a bug. Interested in investigating where it's coming from? IIRC, groupby may ask the index for the output values, which is where CategoricalIndex says "here's all the categories, including unobserved". But MultiIndex wouldn't have that.

bsolomon1124 · 2019-08-07T01:31:04Z

Ran into what seems to be the same issue: here's a reproducible example, with some different fixes in the answers: https://stackoverflow.com/q/57385009/7954504.

df = pd.DataFrame({
    "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
    "gender": pd.Categorical(["M", "M", "M", "F"]),
    "name": list("abcd"),
})

Incorrect result:

>>> df.groupby(["state", "gender"])["name"].count()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64

Should be:

state  gender
AK     F         0
       M         2
AL     F         1
       M         1
Name: name, dtype: int64

jreback · 2019-08-07T01:32:53Z

please check on 0.25

bsolomon1124 · 2019-08-07T01:34:53Z

@jreback

please check on 0.25

>>> import pandas as pd
>>> pd.__version__
'0.25.0'
>>> df = pd.DataFrame({
...     "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
...     "gender": pd.Categorical(["M", "M", "M", "F"]),
...     "name": list("abcd"),
... })
>>> df.groupby(["state", "gender"])["name"].count()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64

I'd be willing to take a stab at investigating this one when I have some time next week.

bsolomon1124 · 2019-08-07T04:30:30Z

Super-Summary

BaseGrouper._get_compressed_labels() does not account for observed=False. observed=False is never passed through from _get_grouper(). The individual Grouping instances are OK, but BaseGrouper fails to handle the cartesian product when observed=False.

The change may need to be made in compress_group_index() or get_group_index(), which are called from BaseGrouper._get_compressed_labels() when keys is a list.

(Pdb) self = grouper
(Pdb) all_labels = [ping.labels for ping in self.groupings]
(Pdb) p all_labels
[array([0, 1, 0, 1], dtype=int8), array([1, 1, 1, 0], dtype=int8)]
(Pdb) from pandas.core.sorting import get_group_index
(Pdb) self.shape
(2, 2)
(Pdb) group_index = get_group_index(all_labels, self.shape, sort=True, xnull=True)
(Pdb) p group_index
array([1, 3, 1, 2])
(Pdb) from pandas.core.sorting import compress_group_index
(Pdb) compress_group_index(group_index, sort=self.sort)
(array([0, 2, 0, 1]), array([1, 2, 3]))  # BAD

Given the two arrays from all_labels, there are, as expected, only 3 unique groups:

(0, 1) occurs twice
(1, 1) and (1, 0) occur once

With observed=False, the missing (0, 0) pair needs to be accounted for but is not.

That is partly because get_group_index() gets the offsets into the hypothetical list representing the totally ordered cartesian product of all possible label combinations.

I.e. in

(Pdb) get_group_index(all_labels, self.shape, sort=True, xnull=False)
array([1, 3, 1, 2])

You have:

index	pair	exists
0	0,0	no
1	0,1	yes (twice)
2	1,0	yes
3	1,1	yes
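The offset table above can be reproduced outside pandas. This is a sketch that uses np.ravel_multi_index as a stand-in for get_group_index(); it is not the pandas implementation, and the variable names are illustrative. It shows which flat offset is never hit.

```python
import numpy as np

# Per-grouping integer codes, as printed by the debugger above.
labels = (np.array([0, 1, 0, 1]), np.array([1, 1, 1, 0]))
shape = (2, 2)  # number of categories per grouping

# Row-major offset of each (state, gender) pair into the 2 x 2 product.
group_index = np.ravel_multi_index(labels, shape)

observed_ids = np.unique(group_index)
all_ids = np.arange(np.prod(shape))

# Offsets that exist in the product but never occur in the data.
missing_ids = np.setdiff1d(all_ids, observed_ids)
missing_pairs = list(zip(*np.unravel_index(missing_ids, shape)))

print(group_index)    # offsets per row
print(missing_pairs)  # the unobserved (state, gender) code pair
```

Here the only missing offset is 0, which decodes back to the code pair (0, 0), i.e. the row marked "no" in the table.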

After that, everything boils down to compress_group_index(), which ends with return comp_ids, obs_group_ids. In this example, obs_group_ids has length 3 but should have length 4.

And last but not least, compress_group_index() calls the Cython method Int64HashTable.get_labels_groupby().

pandas/pandas/core/sorting.py

Line 368 in 486ade0

comp_ids, obs_group_ids = table.get_labels_groupby(group_index)

This is the underlying call that produces the obs_group_ids of length 3 rather than 4:

(Pdb) p group_index
array([1, 3, 1, 2])
(Pdb) table.get_labels_groupby(group_index)
(array([0, 1, 0, 2]), array([1, 3, 2]))
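The two compressed results seen so far (first-appearance order here, sorted order from compress_group_index() earlier) can be imitated with public functions. This is a sketch only: pd.factorize and np.unique are stand-ins chosen for illustration, not what pandas calls internally, which per the trace above is Int64HashTable.get_labels_groupby().

```python
import numpy as np
import pandas as pd

group_index = np.array([1, 3, 1, 2])

# Order of first appearance, like the hash-table call above.
codes_unsorted, ids_unsorted = pd.factorize(group_index)

# Sorted ids, like compress_group_index(..., sort=True).
ids_sorted, codes_sorted = np.unique(group_index, return_inverse=True)

print(codes_unsorted, ids_unsorted)
print(codes_sorted, ids_sorted)
```

Either way only three ids survive, because both approaches look solely at values that occur in group_index; offset 0 is dropped at this step.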

Feel free to read further, but everything below this point may be outdated as it logs how I got from A to Z to arrive at the above.

OUTDATED

I suspect that there may be several problems in pandas.core.groupby.ops.BaseGrouper and how it handles the interaction between multiple categorical groupers.

That is, the Grouper class handles each individual column OK in isolation, but then things go south at:

grouper = BaseGrouper(group_axis, groupings, sort=sort, mutated=mutated)

Here is a recreation of BaseGrouper.result_index():

(Pdb) from pandas.core.index import MultiIndex
(Pdb) levels = [ping.result_index for ping in grouper.groupings]
(Pdb) MultiIndex(levels=levels, codes=grouper.recons_labels, verify_integrity=False, names=grouper.names)
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M')],
           names=['state', 'gender'])

while the individual Grouping members themselves seem OK.

Here is the resulting Grouping for state:

(Pdb) p self.grouper
[AK, AL, AK, AL]
Categories (2, object): [AK, AL]
(Pdb) p self._group_index
CategoricalIndex(['AK', 'AL'], categories=['AK', 'AL'], ordered=False, dtype='category')
(Pdb) p self._labels
array([0, 1, 0, 1], dtype=int8)
(Pdb) self.result_index
CategoricalIndex(['AK', 'AL'], categories=['AK', 'AL'], ordered=False, dtype='category')

And here is the resulting Grouping for gender:

(Pdb) p self.grouper
[M, M, M, F]
Categories (2, object): [F, M]
(Pdb) p self._group_index
CategoricalIndex(['F', 'M'], categories=['F', 'M'], ordered=False, dtype='category')
(Pdb) p self._labels
array([1, 1, 1, 0], dtype=int8)
(Pdb) self.result_index
CategoricalIndex(['F', 'M'], categories=['F', 'M'], ordered=False, dtype='category')

BaseGrouper does not even seem to have any concept of observed=False and therefore cannot handle the interaction between each Grouping instance and create the Cartesian product of their indices.

So to take a first stab at this:

The Grouping instances are OK. They are handling categorical + observed like they should.
BaseGrouper probably needs an observed parameter that _get_grouper() supplies to it as an argument, at

pandas/pandas/core/groupby/grouper.py

Line 664 in 486ade0

grouper = BaseGrouper(group_axis, groupings, sort=sort, mutated=mutated)

BaseGrouper.recons_labels is wrong; it gets used in BaseGrouper.result_index, which, when passed length-3 codes, produces a MultiIndex that is missing an element. (Note: MultiIndex.from_product(levels) might be an alternative here; I'm not sure.)
There is a chain of issues; BaseGrouper.recons_labels uses BaseGrouper.group_info, which in turn uses BaseGrouper._get_compressed_labels(), which is also off if observed=False. (Again, none of these have any idea that observed is False.)

(Pdb) grouper.recons_labels
[array([0, 1, 1]), array([1, 0, 1])]
(Pdb) grouper.group_info
(array([0, 2, 0, 1]), array([1, 2, 3]), 3)
(Pdb) grouper._get_compressed_labels()
(array([0, 2, 0, 1]), array([1, 2, 3]))

Called in

pandas/pandas/core/groupby/ops.py

Line 311 in 486ade0

def _get_compressed_labels(self):

The per-grouping labels it starts from are OK:

(Pdb) [ping.labels for ping in grouper.groupings]
[array([0, 1, 0, 1], dtype=int8), array([1, 1, 1, 0], dtype=int8)]

The observed arg gets passed all the way down to Grouping.__init__(), within a loop. This constructor gets called for each column str name in keys.
.result_index excludes the categorical labels that it should include when observed=False
This happens in groupby() itself before any further methods are called, namely in pandas.core.groupby.grouper._get_grouper which calls Grouping.__init__().
Calling .groupby('a') (a single column name where the Series is Categorical) does work (see below). This might draw attention then to grouper = BaseGrouper(group_axis, groupings, sort=sort, mutated=mutated) in pandas/core/groupby/grouper.py, in that perhaps BaseGrouper cannot infer that the resulting index needs to be a Cartesian product.

In [35]: df = pd.DataFrame({
    ...:     "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
    ...:     "gender": pd.Categorical(["M", "M", "M", "F"]),
    ...:     "name": list("abcd"),
    ...: })

In [36]: df.groupby(["state", "gender"]).grouper.result_index
Out[36]:
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M')],
           names=['state', 'gender'])
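To make the difference concrete, here is a sketch comparing an index built from the length-3 reconstructed codes with one built via MultiIndex.from_product(levels), the alternative floated earlier. It is an illustration under simplifying assumptions: the levels are plain string lists rather than the CategoricalIndex objects the groupings actually hold, and this is not how BaseGrouper.result_index is implemented.

```python
import pandas as pd

levels = [["AK", "AL"], ["F", "M"]]
names = ["state", "gender"]

# Codes as reported by grouper.recons_labels: one entry per observed group.
observed_codes = [[0, 1, 1], [1, 0, 1]]
observed_index = pd.MultiIndex(levels=levels, codes=observed_codes, names=names)

# Full Cartesian product of the levels, which is what observed=False implies.
full_index = pd.MultiIndex.from_product(levels, names=names)

print(list(observed_index))
print(list(full_index))
print(full_index.difference(observed_index).tolist())
```

The difference is the single unobserved pair, ('AK', 'F').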

It's worth mentioning that this does seem to be a problem only when by is a list of keys rather than a single key. That is, for a single by key, there is a (correct) difference between observed=False and observed=True:

In [4]: s = pd.Series(pd.Categorical(["M", "M", "M"], categories=["M", "F"]))
In [6]: s.groupby(s).count()  # observed=False
Out[6]:
M    3
F    0
dtype: int64
In [7]: s.groupby(s, observed=True).count()
Out[7]:
M    3
dtype: int64

First observation: it seems like this occurs for some, but not all, methods of the GroupBy object:

>>> bn = df.groupby(["state", "gender"])["name"]

>>> bn.size()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64

>>> bn.count()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64

>>> bn.nunique()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64

>>> bn.first()
state  gender
AK     F         NaN
       M           a
AL     F           d
       M           b
Name: name, dtype: object

The next thing I did was to step into df.groupby(["state", "gender"])["name"].count() and see what looked off, via pdb.run('bn.count()').

In pandas/core/groupby/generic.py, in the call to SeriesGroupBy.count(), several things stand out:

(Pdb) list 1310,1331
1310            """
1311            Compute count of group, excluding missing values.
1312
1313            Returns
1314            -------
1315            Series
1316                Count of values within each group.
1317            """
1318 ->         ids, _, ngroups = self.grouper.group_info
1319            val = self.obj._internal_get_values()
1320
1321            mask = (ids != -1) & ~isna(val)
1322            ids = ensure_platform_int(ids)
1323            minlength = ngroups or 0
1324            out = np.bincount(ids[mask], minlength=minlength)
1325
1326            return Series(
1327                out,
1328                index=self.grouper.result_index,
1329                name=self._selection_name,
1330                dtype="int64",
1331            )
(Pdb) unt 1325
> ...lib/python3.7/site-packages/pandas/core/groupby/generic.py(1326)count()
-> return Series(
(Pdb) p ids
array([0, 2, 0, 1])
(Pdb) p ngroups
3
(Pdb) p mask
array([ True,  True,  True,  True])
(Pdb) p out
array([2, 1, 1])

Namely:

bn.ngroups is 3, not 4
Same for bn.groups:

>>> bn.groups
{('AK', 'M'): Int64Index([0, 2], dtype='int64'),
 ('AL', 'F'): Int64Index([3], dtype='int64'),
 ('AL', 'M'): Int64Index([1], dtype='int64')}

But what seems to stand out most is .grouper.result_index:

>>> bn.grouper.result_index
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M')],
           names=['state', 'gender'])

Regardless of whether ngroups and groups are "right" (should observed control these? That seems undefined), the final Series constructor determines the output, and it never stood a chance: it is built from a MultiIndex of length 3, not 4, and a corresponding out of length 3.
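The counting step itself is easy to replay with NumPy. This sketch reuses the arrays printed in the debugger session; n_possible is a name introduced here for the size of the full product (2 states x 2 genders). Counting over the uncompressed offsets is shown only to illustrate what a length-4 result would look like, not as the fix pandas adopted.

```python
import numpy as np

# Compressed ids and group count, as seen in count().
ids = np.array([0, 2, 0, 1])
ngroups = 3
out = np.bincount(ids, minlength=ngroups)

# Uncompressed offsets into the full product.
group_index = np.array([1, 3, 1, 2])
n_possible = 4
out_full = np.bincount(group_index, minlength=n_possible)

print(out)       # one count per observed group
print(out_full)  # one count per possible group, zero where unobserved
```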

Something similar happens for .size(); here is pdb.run('bn.grouper.size()'):

(Pdb) list 267, 277
267             """
268             Compute group sizes
269
270             """
271  ->         ids, _, ngroup = self.group_info
272             ids = ensure_platform_int(ids)
273             if ngroup:
274                 out = np.bincount(ids[ids != -1], minlength=ngroup)
275             else:
276                 out = []
277             return Series(out, index=self.result_index, dtype="int64")
(Pdb) unt 277
> ...lib/python3.7/site-packages/pandas/core/groupby/ops.py(277)size()
-> return Series(out, index=self.result_index, dtype="int64")
(Pdb) p out
array([2, 1, 1])
(Pdb) p self.result_index
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M')],
           names=['state', 'gender'])

So it would appear initially that both out and result_index need to be fixed.

... all of the above led me to _GroupBy.__init__(), which calls pandas.core.groupby.grouper._get_grouper. Stepping into that, the index is already created as a result of grouper, exclusions, obj = _get_grouper(...):

(Pdb) p grouper.result_index
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M')],
           names=['state', 'gender'])

Stepping further down, this is what has been passed to _get_grouper():

(Pdb) args
obj =   state gender name
0    AK      M    a
1    AL      M    b
2    AK      M    c
3    AL      F    d
key = ['state', 'gender']
axis = 0
level = None
sort = True
observed = False
mutated = False
validate = True

Skipping ahead to line 547, pandas determines that this is not a single column name str but a list of them. Not much to see there; keys is ['state', 'gender'].

Then we enter this big loop on 606. At this point (0th iteration):

(Pdb) gpr
'state'
(Pdb) p is_in_axis(gpr)  # df.groupby('name')
True
(Pdb) p is_categorical_dtype(gpr)  # even though the *Series* is categorical, gpr is just str
False

Finally Grouping.__init__() gets called for each gpr str. Within Grouping.__init__():

(Pdb) p type(self.grouper)
<class 'pandas.core.arrays.categorical.Categorical'>

Then elif is_categorical_dtype(self.grouper) evals to True.

(Pdb) p is_categorical_dtype(self.grouper)
True

mojones · 2019-11-28T09:38:10Z

I think this is the same issue as #23865

mojones · 2020-03-20T10:57:16Z

This seems to be fixed in versions >=1.0.0

jreback · 2020-03-20T10:59:17Z

great, would love to have a PR with validation tests (note we likely have some of these examples already)
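A validation test along these lines might look like the sketch below. It assumes pandas 1.0 or later, per the comment above that the behaviour is fixed there; the function name is hypothetical and this is not the test that was eventually merged.

```python
import pandas as pd


def test_groupby_two_categoricals_keeps_unobserved():
    # Hypothetical test name; data taken from the example in this thread.
    df = pd.DataFrame({
        "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
        "gender": pd.Categorical(["M", "M", "M", "F"]),
        "name": list("abcd"),
    })

    result = df.groupby(["state", "gender"], observed=False)["name"].count()

    expected_index = pd.MultiIndex.from_product(
        [["AK", "AL"], ["F", "M"]], names=["state", "gender"]
    )
    # All four combinations are present, with 0 for the unobserved ('AK', 'F').
    assert list(result.index) == list(expected_index)
    assert result.tolist() == [0, 2, 1, 1]


test_groupby_two_categoricals_keeps_unobserved()
```

Passing observed=False explicitly keeps the check independent of whatever default a given pandas version uses.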

smithto1 · 2020-06-26T16:28:00Z

take

smithto1 · 2020-07-08T21:46:10Z

@jreback This issue can also be closed. It is addressed by the linked Pull Request. (The PR wasn't linked at the time it was merged so this wasn't done automatically.)

jreback · 2020-07-08T21:47:42Z

thanks @smithto1

TomAugspurger added the Categorical (Categorical Data Type) and Groupby labels Jul 8, 2019

TomAugspurger added this to the Contributions Welcome milestone Jul 8, 2019

TomAugspurger added Difficulty Intermediate labels Jul 8, 2019

jbrockmendel removed Effort Medium labels Oct 21, 2019

rhshadrach added the Needs Tests Unit test(s) needed to prevent regressions label Jun 14, 2020

github-actions bot assigned smithto1 Jun 26, 2020

smithto1 mentioned this issue Jun 26, 2020

TST: add test to ensure that df.groupby() returns the missing categories when grouping on 2 pd.Categoricals #35022

Merged

5 tasks

jreback modified the milestones: Contributions Welcome, 1.1 Jul 8, 2020

jreback closed this as completed Jul 8, 2020
