Groupby ignores unobserved combinations when passing more than one categorical column even if observed=True #27075

Closed
harmbuisman opened this issue Jun 27, 2019 · 11 comments · Fixed by #35022
Labels: Categorical, Groupby, Needs Tests

@harmbuisman

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df_test = pd.DataFrame()
df_test['A'] = pd.Series(np.arange(0,2), dtype='category').cat.set_categories(list(range(0,3)))
df_test['B'] = pd.Series(np.arange(10,12), dtype='category').cat.set_categories(list(range(10,13)))

print("Test DF:")
print(df_test)

print("\nThe following are as expected, unobserved categories have size = 0:")
print(df_test.groupby('A').size())
print(df_test.groupby('B').size())

print("\nThe following does not consider categories, I would expect 9 result lines here:")
print(df_test.groupby(['A','B']).size())

print("\nExpected:")
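idx = pd.MultiIndex.from_product([range(0, 3), range(10, 13)], names=['A', 'B'])
# One row per (A, B) combination; only (0, 10) and (1, 11) are observed
print(pd.Series([1, 0, 0, 0, 1, 0, 0, 0, 0], index=idx))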

Problem description

groupby([cols]) returns a result for every category when a single categorical column is passed (e.g. ['A']), but returns only the observed combinations when several categorical columns are passed (['A', 'B']), regardless of the observed setting. I would expect a result for every combination of the categories of those columns.

Expected Output

A result for all combinations of the categorical categories of the groupby columns. For the example above:
A B
0 10 1
0 11 0
0 12 0
1 10 0
1 11 1
1 12 0
2 10 0
2 11 0
2 12 0
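
Until the grouped result itself covers every combination, one possible workaround is to reindex the observed counts onto the full Cartesian product of the categories. The sketch below is only an illustration, not the fix adopted in pandas; the names `full_index` and `counts` are introduced here.

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({
    "A": pd.Categorical(np.arange(0, 2), categories=list(range(0, 3))),
    "B": pd.Categorical(np.arange(10, 12), categories=list(range(10, 13))),
})

# Every combination of the declared categories, observed or not (3 x 3 = 9).
full_index = pd.MultiIndex.from_product(
    [df_test["A"].cat.categories, df_test["B"].cat.categories],
    names=["A", "B"],
)

# Count only the observed pairs, then fill the unobserved ones with 0.
counts = (
    df_test.groupby(["A", "B"], observed=True)
    .size()
    .reindex(full_index, fill_value=0)
)
print(counts)
```

Because the reindex step builds the product explicitly, the result does not depend on how a given pandas version treats observed=False for multiple keys.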

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: 4.5.0
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: 2.0.1
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.8
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.7.0
gcsfs: None

@TomAugspurger
Contributor

Looks like a bug. Interested in investigating where it's coming from? IIRC, groupby may ask the index for the output values, which is where CategoricalIndex says "here's all the categories, including unobserved". But MultiIndex wouldn't have that.
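
The difference in index types can be checked directly. This is a small sketch using the frame from the report; the variable names `one_key` and `two_keys` are illustrative only.

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({
    "A": pd.Categorical(np.arange(0, 2), categories=list(range(0, 3))),
    "B": pd.Categorical(np.arange(10, 12), categories=list(range(10, 13))),
})

one_key = df_test.groupby("A", observed=False).size()
two_keys = df_test.groupby(["A", "B"], observed=False).size()

# A single categorical key yields a CategoricalIndex that lists every category.
print(type(one_key.index).__name__, len(one_key))

# Several keys yield a MultiIndex, which is assembled from codes instead.
print(type(two_keys.index).__name__)
```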

TomAugspurger added the Categorical and Groupby labels on Jul 8, 2019
TomAugspurger added this to the Contributions Welcome milestone on Jul 8, 2019
@bsolomon1124

bsolomon1124 commented Aug 7, 2019

Ran into what seems to be the same issue: here's a reproducible example, with some different fixes in the answers: https://stackoverflow.com/q/57385009/7954504.

df = pd.DataFrame({
    "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
    "gender": pd.Categorical(["M", "M", "M", "F"]),
    "name": list("abcd"),
})

Incorrect result:

>>> df.groupby(["state", "gender"])["name"].count()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64

Should be:

state  gender
AK     F         0
       M         2
AL     F         1
       M         1
Name: name, dtype: int64

@jreback
Contributor

jreback commented Aug 7, 2019

please check on 0.25

@bsolomon1124

bsolomon1124 commented Aug 7, 2019

@jreback

please check on 0.25

>>> import pandas as pd
>>> pd.__version__
'0.25.0'
>>> df = pd.DataFrame({
...     "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
...     "gender": pd.Categorical(["M", "M", "M", "F"]),
...     "name": list("abcd"),
... })
>>> df.groupby(["state", "gender"])["name"].count()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64

I'd be willing to take a stab at investigating this one when I have some time next week.

@bsolomon1124

bsolomon1124 commented Aug 7, 2019

Super-Summary

BaseGrouper._get_compressed_labels() does not account for observed=False. observed=False is never passed through from _get_grouper(). The individual Grouping instances are OK, but BaseGrouper fails to handle the cartesian product when observed=False.

The change may need to be made in compress_group_index() or get_group_index(), which are called from BaseGrouper._get_compressed_labels() when keys is a list.

(Pdb) self = grouper
(Pdb) all_labels = [ping.labels for ping in self.groupings]
(Pdb) p all_labels
[array([0, 1, 0, 1], dtype=int8), array([1, 1, 1, 0], dtype=int8)]
(Pdb) from pandas.core.sorting import get_group_index
(Pdb) self.shape
(2, 2)
(Pdb) group_index = get_group_index(all_labels, self.shape, sort=True, xnull=True)
(Pdb) p group_index
array([1, 3, 1, 2])
(Pdb) from pandas.core.sorting import compress_group_index
(Pdb) compress_group_index(group_index, sort=self.sort)
(array([0, 2, 0, 1]), array([1, 2, 3]))  # BAD

Given the two arrays from all_labels, there are, as expected, only 3 unique groups:

  • (0, 1) occurs twice
  • (1, 1) and (1, 0) occur once

With observed=False, the missing (0, 0) pair needs to be accounted for but is not.

That is partly because get_group_index() "gets the offsets into the hypothetical list representing the totally ordered cartesian product of all possible label combinations."

I.e. in

(Pdb) get_group_index(all_labels, self.shape, sort=True, xnull=False)
array([1, 3, 1, 2])

You have:

index  pair  exists
0      0,0   no
1      0,1   yes (twice)
2      1,0   yes
3      1,1   yes
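
The offsets in this table can be reproduced with plain NumPy. The sketch below uses np.ravel_multi_index as a stand-in for get_group_index(); it is not the pandas implementation, and the variable names are illustrative.

```python
import numpy as np

state_codes = np.array([0, 1, 0, 1])   # AK=0, AL=1
gender_codes = np.array([1, 1, 1, 0])  # F=0, M=1
shape = (2, 2)                         # number of categories per key

# Row-major offset into the 2 x 2 product: state_code * 2 + gender_code.
flat = np.ravel_multi_index((state_codes, gender_codes), shape)
print(flat)  # [1 3 1 2]

# Offsets that never occur correspond to the unobserved combinations.
missing = np.setdiff1d(np.arange(np.prod(shape)), flat)
pairs = [tuple(int(c) for c in np.unravel_index(m, shape)) for m in missing]
print(pairs)  # [(0, 0)]
```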

After that, everything boils down to compress_group_index(), which returns (comp_ids, obs_group_ids). In this example, obs_group_ids has length 3 but should have length 4.

And last but not least, compress_group_index() calls the Cython method Int64HashTable.get_labels_groupby().

comp_ids, obs_group_ids = table.get_labels_groupby(group_index)

This is the underlying call that produces the obs_group_ids of length 3 rather than 4:

(Pdb) p group_index
array([1, 3, 1, 2])
(Pdb) table.get_labels_groupby(group_index)
(array([0, 1, 0, 2]), array([1, 3, 2]))
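
What the compression step does to these offsets can be approximated with np.unique. This is a rough sketch that mirrors the sorted output of compress_group_index() shown earlier; it does not use the Cython hash table, and `all_group_ids` is a hypothetical name for what an observed=False path would need.

```python
import numpy as np

group_index = np.array([1, 3, 1, 2])

# Compression keeps only the offsets that actually occur and renumbers them.
obs_group_ids, comp_ids = np.unique(group_index, return_inverse=True)
print(comp_ids)       # [0 2 0 1]
print(obs_group_ids)  # [1 2 3]

# With observed=False, one slot per offset in the full 2 x 2 product
# would be needed, so offset 0 (the pair 0,0) must not be dropped.
n_groups = 2 * 2
all_group_ids = np.arange(n_groups)
print(all_group_ids)  # [0 1 2 3]
```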

Feel free to read further, but everything below this point may be outdated as it logs how I got from A to Z to arrive at the above.


OUTDATED

I suspect that there may be several problems in pandas.core.groupby.ops.BaseGrouper and how it handles the interaction between multiple categorical groupers.

That is, the Grouping class handles each individual column OK in isolation, but then things go south at:

grouper = BaseGrouper(group_axis, groupings, sort=sort, mutated=mutated)

Here is a recreation of BaseGrouper.result_index():

(Pdb) from pandas.core.index import MultiIndex
(Pdb) levels = [ping.result_index for ping in grouper.groupings]
(Pdb) MultiIndex(levels=levels, codes=grouper.recons_labels, verify_integrity=False, names=grouper.names)
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M')],
           names=['state', 'gender'])

while the individual Grouping members themselves seem OK.

Here is the resulting Grouping for state:

(Pdb) p self.grouper
[AK, AL, AK, AL]
Categories (2, object): [AK, AL]
(Pdb) p self._group_index
CategoricalIndex(['AK', 'AL'], categories=['AK', 'AL'], ordered=False, dtype='category')
(Pdb) p self._labels
array([0, 1, 0, 1], dtype=int8)
(Pdb) self.result_index
CategoricalIndex(['AK', 'AL'], categories=['AK', 'AL'], ordered=False, dtype='category')

And here is the resulting Grouping for gender:

(Pdb) p self.grouper
[M, M, M, F]
Categories (2, object): [F, M]
(Pdb) p self._group_index
CategoricalIndex(['F', 'M'], categories=['F', 'M'], ordered=False, dtype='category')
(Pdb) p self._labels
array([1, 1, 1, 0], dtype=int8)
(Pdb) self.result_index
CategoricalIndex(['F', 'M'], categories=['F', 'M'], ordered=False, dtype='category')

BaseGrouper does not even seem to have any concept of observed=False and therefore cannot handle the interaction between each Grouping instance and create the Cartesian product of their indices.

So to take a first stab at this:

  1. The Grouping instances are OK. They handle categorical + observed as they should.
  2. BaseGrouper probably needs an observed parameter that _get_grouper() supplies as an argument, at grouper = BaseGrouper(group_axis, groupings, sort=sort, mutated=mutated).
  3. BaseGrouper.recons_labels is wrong; it is used in BaseGrouper.result_index, which, when passed length-3 codes, produces a MultiIndex that is missing an element. (Note: MultiIndex.from_product(levels) might be an alternative here; I'm not sure.)
  4. There is a chain of issues: BaseGrouper.recons_labels uses BaseGrouper.group_info, which in turn uses BaseGrouper._get_compressed_labels(), which is also off if observed=False. (Again, none of these have any idea that observed is False.)
(Pdb) grouper.recons_labels
[array([0, 1, 1]), array([1, 0, 1])]
(Pdb) grouper.group_info
(array([0, 2, 0, 1]), array([1, 2, 3]), 3)
(Pdb) grouper._get_compressed_labels()
(array([0, 2, 0, 1]), array([1, 2, 3]))
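
The MultiIndex.from_product(levels) idea from item 3 can be tried in isolation. This sketch rebuilds the two per-key result indexes by hand (an assumption; in the debugger they come from ping.result_index) and shows that the product has the missing fourth entry.

```python
import pandas as pd

# Hand-built stand-ins for [ping.result_index for ping in grouper.groupings].
levels = [
    pd.CategoricalIndex(["AK", "AL"], name="state"),
    pd.CategoricalIndex(["F", "M"], name="gender"),
]

# The Cartesian product of the per-key indexes has 2 x 2 = 4 entries.
full_index = pd.MultiIndex.from_product(levels, names=["state", "gender"])
print(full_index.tolist())
# [('AK', 'F'), ('AK', 'M'), ('AL', 'F'), ('AL', 'M')]
```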

Called in _get_compressed_labels(self); this is OK:

(Pdb) [ping.labels for ping in grouper.groupings]
[array([0, 1, 0, 1], dtype=int8), array([1, 1, 1, 0], dtype=int8)]
  • The observed arg gets passed all the way down to Grouping.__init__(), within a loop. This constructor gets called for each column str name in keys.
  • .result_index excludes the categorical labels that it should include when observed=False
  • This happens in groupby() itself before any further methods are called, namely in pandas.core.groupby.grouper._get_grouper which calls Grouping.__init__().
  • Calling .groupby('a') (a single column name where the Series is Categorical) does work (see below). This might draw attention then to grouper = BaseGrouper(group_axis, groupings, sort=sort, mutated=mutated) in pandas/core/groupby/grouper.py, in that perhaps BaseGrouper cannot infer that the resulting index needs to be a Cartesian product.
In [35]: df = pd.DataFrame({
    ...:     "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
    ...:     "gender": pd.Categorical(["M", "M", "M", "F"]),
    ...:     "name": list("abcd"),
    ...: })

In [36]: df.groupby(["state", "gender"]).grouper.result_index
Out[36]:
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M')],
           names=['state', 'gender'])

It's worth mentioning that this does seem to be a problem only when by is a list of keys rather than a single key. That is, for a single by key, there is a (correct) difference between observed=False and observed=True:

In [4]: s = pd.Series(pd.Categorical(["M", "M", "M"], categories=["M", "F"]))
In [6]: s.groupby(s).count()  # observed=False
Out[6]:
M    3
F    0
dtype: int64
In [7]: s.groupby(s, observed=True).count()
Out[7]:
M    3
dtype: int64

First observation: it seems like this occurs for some, but not all, methods of the GroupBy object:

>>> bn = df.groupby(["state", "gender"])["name"]

>>> bn.size()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64

>>> bn.count()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64

>>> bn.nunique()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64

>>> bn.first()
state  gender
AK     F         NaN
       M           a
AL     F           d
       M           b
Name: name, dtype: object

The next thing I did was to step into df.groupby(["state", "gender"])["name"].count() and see what looked off, via pdb.run('bn.count()').

In pandas/core/groupby/generic.py, in the call to SeriesGroupBy.count(), several things stand out:

(Pdb) list 1310,1331
1310            """
1311            Compute count of group, excluding missing values.
1312
1313            Returns
1314            -------
1315            Series
1316                Count of values within each group.
1317            """
1318 ->         ids, _, ngroups = self.grouper.group_info
1319            val = self.obj._internal_get_values()
1320
1321            mask = (ids != -1) & ~isna(val)
1322            ids = ensure_platform_int(ids)
1323            minlength = ngroups or 0
1324            out = np.bincount(ids[mask], minlength=minlength)
1325
1326            return Series(
1327                out,
1328                index=self.grouper.result_index,
1329                name=self._selection_name,
1330                dtype="int64",
1331            )
(Pdb) unt 1325
> ...lib/python3.7/site-packages/pandas/core/groupby/generic.py(1326)count()
-> return Series(
(Pdb) p ids
array([0, 2, 0, 1])
(Pdb) p ngroups
3
(Pdb) p mask
array([ True,  True,  True,  True])
(Pdb) p out
array([2, 1, 1])

Namely:

  • bn.ngroups is 3, not 4
  • Same for bn.groups:
>>> bn.groups
{('AK', 'M'): Int64Index([0, 2], dtype='int64'),
 ('AL', 'F'): Int64Index([3], dtype='int64'),
 ('AL', 'M'): Int64Index([1], dtype='int64')}

But what seems to stand out most is .grouper.result_index:

>>> bn.grouper.result_index
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M')],
           names=['state', 'gender'])

Regardless of whether ngroups and groups are "right" (e.g. should observed control these? that seems undefined), the ultimate output is determined by the final Series constructor and it never stood a chance because it gets built with a MultiIndex of length 3, not 4, and the corresponding out which is length 3.
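
For comparison, here is a sketch of the same bincount-plus-Series construction when both inputs cover all four combinations. It reuses the uncompressed offsets from get_group_index() as ids and builds the index with from_product; both choices are assumptions made for illustration, not what count() does internally.

```python
import numpy as np
import pandas as pd

ids = np.array([1, 3, 1, 2])                # offsets into the 2 x 2 product
val = np.array(list("abcd"), dtype=object)  # the "name" column values
mask = (ids != -1) & ~pd.isna(val)

# minlength=4 reserves a slot for the unobserved (AK, F) group.
out = np.bincount(ids[mask], minlength=4)

index = pd.MultiIndex.from_product(
    [["AK", "AL"], ["F", "M"]], names=["state", "gender"]
)
result = pd.Series(out, index=index, name="name", dtype="int64")
print(result)
```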

Something similar happens for .size(); here is pdb.run('bn.grouper.size()'):

(Pdb) list 267, 277
267             """
268             Compute group sizes
269
270             """
271  ->         ids, _, ngroup = self.group_info
272             ids = ensure_platform_int(ids)
273             if ngroup:
274                 out = np.bincount(ids[ids != -1], minlength=ngroup)
275             else:
276                 out = []
277             return Series(out, index=self.result_index, dtype="int64")
(Pdb) unt 277
> ...lib/python3.7/site-packages/pandas/core/groupby/ops.py(277)size()
-> return Series(out, index=self.result_index, dtype="int64")
(Pdb) p out
array([2, 1, 1])
(Pdb) p self.result_index
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M')],
           names=['state', 'gender'])

So it would appear initially that both out and result_index need to be fixed.


... all of the above led me to _GroupBy.__init__(), which calls pandas.core.groupby.grouper._get_grouper. Stepping into that, the index is already created as a result of grouper, exclusions, obj = _get_grouper(...):

(Pdb) p grouper.result_index
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M')],
           names=['state', 'gender'])

Stepping further down, this is what has been passed to _get_grouper():

(Pdb) args
obj =   state gender name
0    AK      M    a
1    AL      M    b
2    AK      M    c
3    AL      F    d
key = ['state', 'gender']
axis = 0
level = None
sort = True
observed = False
mutated = False
validate = True

Skipping ahead to line 547, pandas determines that this is not a single column name str but a list of them. Not much to see there; keys is ['state', 'gender'].

Then we enter this big loop at line 606. At this point (0th iteration):

(Pdb) gpr
'state'
(Pdb) p is_in_axis(gpr)  # df.groupby('name')
True
(Pdb) p is_categorical_dtype(gpr)  # even though the *Series* is categorical, gpr is just str
False

Finally Grouping.__init__() gets called for each gpr str. Within Grouping.__init__():

(Pdb) p type(self.grouper)
<class 'pandas.core.arrays.categorical.Categorical'>

Then elif is_categorical_dtype(self.grouper) evals to True.

(Pdb) p is_categorical_dtype(self.grouper)
True

@mojones
Contributor

mojones commented Nov 28, 2019

I think this is the same issue as #23865

@mojones
Contributor

mojones commented Mar 20, 2020

This seems to be fixed in versions >=1.0.0

@jreback
Contributor

jreback commented Mar 20, 2020

Great, would love to have a PR with validation tests (note we likely have some of these examples already).
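
A validation test for the example in this thread might look like the sketch below. It assumes pandas >= 1.0, where the behavior is reported as fixed; the test name is hypothetical and this is not the test that was eventually merged.

```python
import pandas as pd


def test_groupby_two_categoricals_observed_false():
    df = pd.DataFrame({
        "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
        "gender": pd.Categorical(["M", "M", "M", "F"]),
        "name": list("abcd"),
    })

    result = df.groupby(["state", "gender"], observed=False)["name"].count()

    # All four combinations must be present, with 0 for the unobserved one.
    assert result.to_dict() == {
        ("AK", "F"): 0,
        ("AK", "M"): 2,
        ("AL", "F"): 1,
        ("AL", "M"): 1,
    }
```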

rhshadrach added the Needs Tests label on Jun 14, 2020
@smithto1
Member

take

@smithto1
Member

smithto1 commented Jul 8, 2020

@jreback This issue can also be closed. It is addressed by the linked Pull Request. (The PR wasn't linked at the time it was merged so this wasn't done automatically.)

jreback closed this as completed on Jul 8, 2020
@jreback
Contributor

jreback commented Jul 8, 2020

thanks @smithto1
