BUG: groupby.groups with NA categories fails #61364

Merged: 3 commits, Apr 28, 2025
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -806,6 +806,7 @@ Groupby/resample/rolling
 ^^^^^^^^^^^^^^^^^^^^^^^^
 - Bug in :meth:`.DataFrameGroupBy.__len__` and :meth:`.SeriesGroupBy.__len__` would raise when the grouping contained NA values and ``dropna=False`` (:issue:`58644`)
 - Bug in :meth:`.DataFrameGroupBy.any` that returned True for groups where all Timedelta values are NaT. (:issue:`59712`)
+- Bug in :meth:`.DataFrameGroupBy.groups` and :meth:`.SeriesGroupBy.groups` would fail when the groups were :class:`Categorical` with an NA value (:issue:`61356`)
 - Bug in :meth:`.DataFrameGroupBy.groups` and :meth:`.SeriesGroupby.groups` that would not respect groupby argument ``dropna`` (:issue:`55919`)
 - Bug in :meth:`.DataFrameGroupBy.median` where nat values gave an incorrect result. (:issue:`57926`)
 - Bug in :meth:`.DataFrameGroupBy.quantile` when ``interpolation="nearest"`` is inconsistent with :meth:`DataFrame.quantile` (:issue:`47942`)
20 changes: 17 additions & 3 deletions pandas/core/groupby/grouper.py
@@ -12,11 +12,16 @@

 import numpy as np

+from pandas._libs import (
+    algos as libalgos,
+)
 from pandas._libs.tslibs import OutOfBoundsDatetime
 from pandas.errors import InvalidIndexError
 from pandas.util._decorators import cache_readonly

 from pandas.core.dtypes.common import (
+    ensure_int64,
+    ensure_platform_int,
     is_list_like,
     is_scalar,
 )
@@ -38,7 +43,10 @@
 )
 from pandas.core.series import Series

-from pandas.io.formats.printing import pprint_thing
+from pandas.io.formats.printing import (
+    PrettyDict,
+    pprint_thing,
+)

 if TYPE_CHECKING:
     from collections.abc import (
@@ -668,8 +676,14 @@ def _codes_and_uniques(self) -> tuple[npt.NDArray[np.signedinteger], ArrayLike]:
     def groups(self) -> dict[Hashable, Index]:
         codes, uniques = self._codes_and_uniques
         uniques = Index._with_infer(uniques, name=self.name)
-        cats = Categorical.from_codes(codes, uniques, validate=False)
-        return self._index.groupby(cats)
+
+        r, counts = libalgos.groupsort_indexer(ensure_platform_int(codes), len(uniques))
+        counts = ensure_int64(counts).cumsum()
+        _result = (r[start:end] for start, end in zip(counts, counts[1:]))
+        # map to the label
+        result = {k: self._index.take(v) for k, v in zip(uniques, _result)}
+
+        return PrettyDict(result)

     @property
     def observed_grouping(self) -> Grouping:
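The new body of `groups` sorts row positions by group code and then slices the sorted positions at cumulative group boundaries. The pure-Python sketch below illustrates that idea. It assumes, judging from how the diff slices the result, that the indexer is a stable sort of positions by code and that the first slot of the counts array holds rows coded `-1` (dropped rows). The functions `groupsort_indexer` and `build_groups` here are hypothetical stand-ins written for illustration, not the pandas implementations, and plain lists stand in for `Index` objects.

```python
from itertools import accumulate


def groupsort_indexer(codes, ngroups):
    """Stable counting sort of row positions by group code.

    Hypothetical stand-in for illustration only. Code -1 marks a dropped
    row; its count is stored in counts[0], group k's count in counts[k + 1].
    """
    counts = [0] * (ngroups + 1)
    for code in codes:
        counts[code + 1] += 1
    # Starting offset inside the indexer for each bucket.
    starts = [0] + list(accumulate(counts))[:-1]
    indexer = [0] * len(codes)
    for pos, code in enumerate(codes):
        indexer[starts[code + 1]] = pos
        starts[code + 1] += 1
    return indexer, counts


def build_groups(index, codes, uniques):
    """Map each group label to the index labels of its rows."""
    indexer, counts = groupsort_indexer(codes, len(uniques))
    bounds = list(accumulate(counts))
    # Consecutive boundaries delimit one group each; the -1 bucket is skipped.
    slices = (indexer[start:end] for start, end in zip(bounds, bounds[1:]))
    return {
        key: [index[i] for i in positions]
        for key, positions in zip(uniques, slices)
    }


NA = float("nan")
index = ["x", "y", "z"]
# Assumed codes/uniques for observed=False, dropna=False: NA gets its own code.
uniques = ["a", "d", "b", NA]
codes = [0, 3, 0]
groups = build_groups(index, codes, uniques)
print(groups)
```

Because the labels are only used as dictionary keys, this approach never has to place an NA value among the categories of a `Categorical`, which appears to be what the removed `Categorical.from_codes` call required.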
17 changes: 17 additions & 0 deletions pandas/tests/groupby/test_categorical.py
@@ -506,6 +506,23 @@ def test_observed_groups(observed):
     tm.assert_dict_equal(result, expected)


+def test_groups_na_category(dropna, observed):
+    # https://github.com/pandas-dev/pandas/issues/61356
+    df = DataFrame(
+        {"cat": Categorical(["a", np.nan, "a"], categories=list("adb"))},
+        index=list("xyz"),
+    )
+    g = df.groupby("cat", observed=observed, dropna=dropna)
+
+    result = g.groups
+    expected = {"a": Index(["x", "z"])}
+    if not dropna:
+        expected |= {np.nan: Index(["y"])}
Contributor

When both `observed` and `dropna` are False, should NaN come after the unobserved groups? That seems more intuitive to me, especially for an ordered categorical.

Member Author

No. If you do an operation like sum, the order here matches the order in that result.

Contributor

This is what I'm getting on both main and 2.2.3.

>>> df = DataFrame(
...         {"cat": Categorical(["a", np.nan, "a"], categories=list("adb"))},
...         index=list("xyz"),
...     )
>>> df["val"] = [1, 2, 3]
>>> g = df.groupby("cat", observed=False, dropna=False)
>>> g.sum()
     val
cat
a      4
d      0
b      0
NaN    2

Contributor

Ah, tm.assert_dict_equal appears to be order-invariant, so it doesn't matter for the test.

Member Author
@rhshadrach, Apr 28, 2025

Ah, I see now. I was correct in that the order was the same, but I failed to notice that the test added the groups in the incorrect order. I do wonder if assert_dict_equal should default to checking the order (perhaps with an argument to ignore order).

+    if not observed:
+        expected |= {"b": Index([]), "d": Index([])}
+    tm.assert_dict_equal(result, expected)
+
+
 @pytest.mark.parametrize(
     "keys, expected_values, expected_index_levels",
     [
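The review thread observes that the test builds `expected` in a different key order than the grouped result, yet still passes because the comparison ignores order. The sketch below shows why plain dict equality behaves that way and what an order-sensitive check could look like. `assert_dict_equal_ordered` and its `check_order` argument are hypothetical names invented for this illustration; they are not part of `pandas._testing`. Plain lists and the string `"NaN"` stand in for `Index` objects and the NA key.

```python
def assert_dict_equal_ordered(left, right, check_order=True):
    """Hypothetical helper: dict equality plus an optional key-order check."""
    assert left == right, "contents differ"
    if check_order:
        assert list(left) == list(right), "key order differs"


# Key order matching the g.sum() output quoted in the thread: a, d, b, NaN.
result = {"a": ["x", "z"], "d": [], "b": [], "NaN": ["y"]}
# Key order in which the test builds `expected`: a, NaN, b, d.
expected = {"a": ["x", "z"], "NaN": ["y"], "b": [], "d": []}

# Python dict equality compares keys and values, not insertion order.
print(result == expected)  # True

# With the order check disabled, the hypothetical helper passes as well.
assert_dict_equal_ordered(result, expected, check_order=False)
```

Calling the helper with the default `check_order=True` on these two dicts would raise `AssertionError`, which is the stricter behavior the last comment in the thread muses about.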