Commit 602cba4

review comments
1 parent c61318d commit 602cba4

File tree

14 files changed

+197
-89
lines changed

Diff for: doc/source/groupby.rst

+51-23
@@ -91,10 +91,10 @@ The mapping can be specified many different ways:
9191
- A Python function, to be called on each of the axis labels.
9292
- A list or NumPy array of the same length as the selected axis.
9393
- A dict or ``Series``, providing a ``label -> group name`` mapping.
94-
- For ``DataFrame`` objects, a string indicating a column to be used to group.
94+
- For ``DataFrame`` objects, a string indicating a column to be used to group.
9595
Of course ``df.groupby('A')`` is just syntactic sugar for
9696
``df.groupby(df['A'])``, but it makes life simpler.
97-
- For ``DataFrame`` objects, a string indicating an index level to be used to
97+
- For ``DataFrame`` objects, a string indicating an index level to be used to
9898
group.
9999
- A list of any of the above things.
100100

@@ -120,7 +120,7 @@ consider the following ``DataFrame``:
120120
'D' : np.random.randn(8)})
121121
df
122122
123-
On a DataFrame, we obtain a GroupBy object by calling :meth:`~DataFrame.groupby`.
123+
On a DataFrame, we obtain a GroupBy object by calling :meth:`~DataFrame.groupby`.
124124
We could naturally group by either the ``A`` or ``B`` columns, or both:
125125

126126
.. ipython:: python
@@ -360,8 +360,8 @@ Index level names may be specified as keys directly to ``groupby``.
360360
DataFrame column selection in GroupBy
361361
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
362362

363-
Once you have created the GroupBy object from a DataFrame, you might want to do
364-
something different for each of the columns. Thus, using ``[]`` similar to
363+
Once you have created the GroupBy object from a DataFrame, you might want to do
364+
something different for each of the columns. Thus, using ``[]`` similar to
365365
getting a column from a DataFrame, you can do:
366366

367367
.. ipython:: python
@@ -421,7 +421,7 @@ statement if you wish: ``for (k1, k2), group in grouped:``.
421421
Selecting a group
422422
-----------------
423423

424-
A single group can be selected using
424+
A single group can be selected using
425425
:meth:`~pandas.core.groupby.DataFrameGroupBy.get_group`:
426426

427427
.. ipython:: python
@@ -444,8 +444,8 @@ perform a computation on the grouped data. These operations are similar to the
444444
:ref:`aggregating API <basics.aggregate>`, :ref:`window functions API <stats.aggregate>`,
445445
and :ref:`resample API <timeseries.aggregate>`.
446446

447-
An obvious one is aggregation via the
448-
:meth:`~pandas.core.groupby.DataFrameGroupBy.aggregate` or equivalently
447+
An obvious one is aggregation via the
448+
:meth:`~pandas.core.groupby.DataFrameGroupBy.aggregate` or equivalently
449449
:meth:`~pandas.core.groupby.DataFrameGroupBy.agg` method:
450450

451451
.. ipython:: python
@@ -517,12 +517,12 @@ Some common aggregating functions are tabulated below:
517517
:meth:`~pd.core.groupby.DataFrameGroupBy.nth`;Take nth value, or a subset if n is a list
518518
:meth:`~pd.core.groupby.DataFrameGroupBy.min`;Compute min of group values
519519
:meth:`~pd.core.groupby.DataFrameGroupBy.max`;Compute max of group values
520-
521520

522-
The aggregating functions above will exclude NA values. Any function which
521+
522+
The aggregating functions above will exclude NA values. Any function which
523523
reduces a :class:`Series` to a scalar value is an aggregation function and will work,
524524
a trivial example is ``df.groupby('A').agg(lambda ser: 1)``. Note that
525-
:meth:`~pd.core.groupby.DataFrameGroupBy.nth` can act as a reducer *or* a
525+
:meth:`~pd.core.groupby.DataFrameGroupBy.nth` can act as a reducer *or* a
526526
filter, see :ref:`here <groupby.nth>`.
527527

528528
.. _groupby.aggregate.multifunc:
@@ -732,7 +732,7 @@ and that the transformed data contains no NAs.
732732
.. note::
733733

734734
Some functions will automatically transform the input when applied to a
735-
GroupBy object, but returning an object of the same shape as the original.
735+
GroupBy object, but returning an object of the same shape as the original.
736736
Passing ``as_index=False`` will not affect these transformation methods.
737737

738738
For example: ``fillna, ffill, bfill, shift.``.
@@ -926,7 +926,7 @@ The dimension of the returned result can also change:
926926

927927
In [11]: grouped.apply(f)
928928

929-
``apply`` on a Series can operate on a returned value from the applied function,
929+
``apply`` on a Series can operate on a returned value from the applied function,
930930
that is itself a series, and possibly upcast the result to a DataFrame:
931931

932932
.. ipython:: python
@@ -984,20 +984,48 @@ will be (silently) dropped. Thus, this does not pose any problems:
984984
985985
df.groupby('A').std()
986986
987-
Note that ``df.groupby('A').colname.std().`` is more efficient than
987+
Note that ``df.groupby('A').colname.std().`` is more efficient than
988988
``df.groupby('A').std().colname``, so if the result of an aggregation function
989-
is only interesting over one column (here ``colname``), it may be filtered
989+
is only interesting over one column (here ``colname``), it may be filtered
990990
*before* applying the aggregation function.
991991

992+
.. _groupby.observed:
993+
994+
observed handling
995+
~~~~~~~~~~~~~~~~~
996+
997+
When using a ``Categorical`` grouper (as a single grouper or as part of multiple groupers), the ``observed`` keyword
998+
controls whether to return a cartesian product of all possible grouper values (``observed=False``) or only those
999+
that are actually observed (``observed=True``). The ``observed`` keyword will default to ``True`` in the future.
1000+
1001+
Show only the observed values:
1002+
1003+
.. ipython:: python
1004+
1005+
pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count()
1006+
1007+
Show all values:
1008+
1009+
.. ipython:: python
1010+
1011+
pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
1012+
1013+
The dtype of the resulting index will *always* include *all* of the categories that were grouped.
1014+
1015+
.. ipython:: python
1016+
1017+
s = pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
1018+
s.index.dtype
1019+
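The cartesian-product behavior described above is easiest to see with more than one categorical grouper. The following is a minimal sketch; the frame and its column names are made up for illustration, and only ``groupby`` with its ``observed`` keyword comes from the text.

```python
import pandas as pd

# Hypothetical data: "A" has an unused category "z", "B" uses all of its categories.
df = pd.DataFrame({
    "A": pd.Categorical(["a", "a", "b"], categories=["a", "b", "z"]),
    "B": pd.Categorical(["c", "d", "c"], categories=["c", "d"]),
    "values": [1, 2, 3],
})

# observed=False: one row per combination of categories (3 * 2 = 6).
all_combos = df.groupby(["A", "B"], observed=False)["values"].count()

# observed=True: only the combinations that occur in the data (3).
seen_only = df.groupby(["A", "B"], observed=True)["values"].count()

print(len(all_combos), len(seen_only))
```

Unobserved combinations get a count of 0 in the ``observed=False`` result, so both results sum to the same total.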
9921020
.. _groupby.missing:
9931021

9941022
NA and NaT group handling
9951023
~~~~~~~~~~~~~~~~~~~~~~~~~
9961024

997-
If there are any NaN or NaT values in the grouping key, these will be
998-
automatically excluded. In other words, there will never be an "NA group" or
999-
"NaT group". This was not the case in older versions of pandas, but users were
1000-
generally discarding the NA group anyway (and supporting it was an
1025+
If there are any NaN or NaT values in the grouping key, these will be
1026+
automatically excluded. In other words, there will never be an "NA group" or
1027+
"NaT group". This was not the case in older versions of pandas, but users were
1028+
generally discarding the NA group anyway (and supporting it was an
10011029
implementation headache).
10021030

10031031
Grouping with ordered factors
@@ -1084,8 +1112,8 @@ This shows the first or last n rows from each group.
10841112
Taking the nth row of each group
10851113
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10861114

1087-
To select from a DataFrame or Series the nth item, use
1088-
:meth:`~pd.core.groupby.DataFrameGroupBy.nth`. This is a reduction method, and
1115+
To select from a DataFrame or Series the nth item, use
1116+
:meth:`~pd.core.groupby.DataFrameGroupBy.nth`. This is a reduction method, and
10891117
will return a single row (or no row) per group if you pass an int for n:
10901118

10911119
.. ipython:: python
@@ -1153,7 +1181,7 @@ Enumerate groups
11531181
.. versionadded:: 0.20.2
11541182

11551183
To see the ordering of the groups (as opposed to the order of rows
1156-
within a group given by ``cumcount``) you can use
1184+
within a group given by ``cumcount``) you can use
11571185
:meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`.
11581186

11591187

@@ -1273,7 +1301,7 @@ Regroup columns of a DataFrame according to their sum, and sum the aggregated on
12731301
Multi-column factorization
12741302
~~~~~~~~~~~~~~~~~~~~~~~~~~
12751303
1276-
By using :meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`, we can extract
1304+
By using :meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`, we can extract
12771305
information about the groups in a way similar to :func:`factorize` (as described
12781306
further in the :ref:`reshaping API <reshaping.factorize>`) but which applies
12791307
naturally to multiple columns of mixed type and different

Diff for: doc/source/whatsnew/v0.23.0.txt

+30-3
@@ -548,20 +548,47 @@ change to ``observed=True`` in the future. (:issue:`14942`, :issue:`8138`, :issu
548548
df['C'] = ['foo', 'bar'] * 2
549549
df
550550

551-
Previous Behavior (show all values):
551+
``observed`` must now be passed when grouping by categoricals, or a
552+
``FutureWarning`` will show:
553+
554+
.. ipython:: python
555+
:okwarning:
556+
557+
df.groupby(['A', 'B', 'C']).count()
558+
559+
560+
To suppress the warning and keep the previous behavior (show all values):
552561

553562
.. ipython:: python
554563

555-
.. code-block:: python
556564
df.groupby(['A', 'B', 'C'], observed=False).count()
557565

558566

559-
New Behavior (show only observed values):
567+
Future Behavior (show only observed values):
560568

561569
.. ipython:: python
562570

563571
df.groupby(['A', 'B', 'C'], observed=True).count()
564572

573+
For pivoting operations, this behavior is *already* controlled by the ``dropna`` keyword:
574+
575+
.. ipython:: python
576+
577+
cat1 = pd.Categorical(["a", "a", "b", "b"],
578+
categories=["a", "b", "z"], ordered=True)
579+
cat2 = pd.Categorical(["c", "d", "c", "d"],
580+
categories=["c", "d", "y"], ordered=True)
581+
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
582+
df
583+
584+
.. ipython:: python
585+
586+
pd.pivot_table(df, values='values', index=['A', 'B'],
587+
dropna=True)
588+
pd.pivot_table(df, values='values', index=['A', 'B'],
589+
dropna=False)
590+
591+
565592
.. _whatsnew_0230.api_breaking.deprecate_panel:
566593

567594
Deprecate Panel

Diff for: pandas/conftest.py

+9
@@ -66,6 +66,15 @@ def ip():
6666
return InteractiveShell()
6767

6868

69+
@pytest.fixture(params=[True, False])
70+
def observed(request):
71+
""" pass in the observed keyword to groupby for [True, False]
72+
This indicates whether categoricals should return values for
73+
values which are not in the grouper [False], or only values which
74+
appear in the grouper [True] """
75+
return request.param
76+
77+
6978
@pytest.fixture(params=[None, 'gzip', 'bz2', 'zip',
7079
pytest.param('xz', marks=td.skip_if_no_lzma)])
7180
def compression(request):

Diff for: pandas/core/arrays/categorical.py

+20-2
@@ -635,7 +635,7 @@ def _set_categories(self, categories, fastpath=False):
635635

636636
self._dtype = new_dtype
637637

638-
def _codes_for_groupby(self, sort):
638+
def _codes_for_groupby(self, sort, observed):
639639
"""
640640
If sort=False, return a copy of self, coded with categories as
641641
returned by .unique(), followed by any categories not appearing in
@@ -649,6 +649,8 @@ def _codes_for_groupby(self, sort):
649649
----------
650650
sort : boolean
651651
The value of the sort parameter groupby was called with.
652+
observed : boolean
653+
Account only for the observed values
652654
653655
Returns
654656
-------
@@ -659,6 +661,22 @@ def _codes_for_groupby(self, sort):
659661
categories in the original order.
660662
"""
661663

664+
# we only care about observed values
665+
if observed:
666+
unique_codes = unique1d(self.codes)
667+
cat = self.copy()
668+
669+
take_codes = unique_codes[unique_codes != -1]
670+
if self.ordered:
671+
take_codes = np.sort(take_codes)
672+
673+
# we recode according to the uniques
674+
cat._categories = self.categories.take(take_codes)
675+
cat._codes = _recode_for_categories(self.codes,
676+
self.categories,
677+
cat._categories)
678+
return cat
679+
662680
# Already sorted according to self.categories; all is fine
663681
if sort:
664682
return self
@@ -2117,7 +2135,7 @@ def unique(self):
21172135
# exclude nan from indexer for categories
21182136
take_codes = unique_codes[unique_codes != -1]
21192137
if self.ordered:
2120-
take_codes = sorted(take_codes)
2138+
take_codes = np.sort(take_codes)
21212139
return cat.set_categories(cat.categories.take(take_codes))
21222140

21232141
def _values_for_factorize(self):
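The recoding done in the ``observed`` branch of ``_codes_for_groupby`` can be illustrated without pandas. This is a plain-Python sketch of the idea, not the library's implementation: ``recode_observed`` is a hypothetical name, and lists stand in for the codes array and the categories index.

```python
def recode_observed(codes, categories, ordered):
    """Keep only categories that occur in `codes`; -1 marks a missing value."""
    # Unique codes in order of first appearance, skipping the missing marker.
    seen = list(dict.fromkeys(c for c in codes if c != -1))
    if ordered:
        # Ordered categoricals keep their original category order.
        seen.sort()

    new_categories = [categories[c] for c in seen]
    # Map each old code to its position among the kept categories.
    remap = {old: new for new, old in enumerate(seen)}
    new_codes = [remap.get(c, -1) for c in codes]
    return new_codes, new_categories


print(recode_observed([2, 0, 2, -1], ["a", "b", "c"], ordered=False))
# ([0, 1, 0, -1], ['c', 'a'])
print(recode_observed([2, 0, 2, -1], ["a", "b", "c"], ordered=True))
# ([1, 0, 1, -1], ['a', 'c'])
```

In both cases the unused category ``"b"`` is dropped and missing values stay coded as ``-1``.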

Diff for: pandas/core/groupby/groupby.py

+14-13
@@ -1664,10 +1664,11 @@ def nth(self, n, dropna=None):
16641664

16651665
if dropna not in ['any', 'all']:
16661666
if isinstance(self._selected_obj, Series) and dropna is True:
1667-
warnings.warn("the dropna='%s' keyword is deprecated,"
1667+
warnings.warn("the dropna={dropna} keyword is deprecated, "
16681668
"use dropna='all' instead. "
16691669
"For a Series groupby, dropna must be "
1670-
"either None, 'any' or 'all'." % (dropna),
1670+
"either None, 'any' or 'all'.".format(
1671+
dropna=dropna),
16711672
FutureWarning,
16721673
stacklevel=2)
16731674
dropna = 'all'
@@ -2961,27 +2962,27 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,
29612962
# a passed Categorical
29622963
elif is_categorical_dtype(self.grouper):
29632964

2964-
self.grouper = self.grouper._codes_for_groupby(self.sort)
2965-
codes = self.grouper.codes
2966-
categories = self.grouper.categories
2967-
2968-
# we make a CategoricalIndex out of the cat grouper
2969-
# preserving the categories / ordered attributes
2970-
self._labels = codes
2971-
29722965
# Use the observed values of the grouper if indicated
29732966
observed = self.observed
29742967
if observed is None:
29752968
msg = ("pass observed=True to ensure that a "
29762969
"categorical grouper only returns the "
29772970
"observed groupers, or\n"
2978-
"observed=False to return NA for non-observed"
2979-
"values\n")
2971+
"observed=False to include "
2972+
"unobserved categories.\n")
29802973
warnings.warn(msg, FutureWarning, stacklevel=5)
29812974
observed = False
29822975

2976+
grouper = self.grouper
2977+
self.grouper = self.grouper._codes_for_groupby(
2978+
self.sort, observed)
2979+
categories = self.grouper.categories
2980+
2981+
# we make a CategoricalIndex out of the cat grouper
2982+
# preserving the categories / ordered attributes
2983+
self._labels = self.grouper.codes
29832984
if observed:
2984-
codes = algorithms.unique1d(codes)
2985+
codes = algorithms.unique1d(grouper.codes)
29852986
else:
29862987
codes = np.arange(len(categories))
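The fallback in this hunk (warn when ``observed`` is unset, then behave as ``observed=False``) can be sketched in isolation. ``resolve_observed`` is a hypothetical helper written for illustration, its message is paraphrased, and it uses ``stacklevel=2`` where the diff uses ``5``.

```python
import warnings


def resolve_observed(observed=None):
    """Hypothetical helper: warn when `observed` is unset, then use False."""
    if observed is None:
        warnings.warn(
            "pass observed=True to return only observed groupers, or "
            "observed=False to include unobserved categories.",
            FutureWarning,
            stacklevel=2,
        )
        return False
    return observed
```

Passing ``observed`` explicitly, with either value, skips the warning entirely.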
29872988

Diff for: pandas/core/indexes/category.py

+2-2
@@ -782,9 +782,9 @@ def _concat_same_dtype(self, to_concat, name):
782782
result.name = name
783783
return result
784784

785-
def _codes_for_groupby(self, sort):
785+
def _codes_for_groupby(self, sort, observed):
786786
""" Return a Categorical adjusted for groupby """
787-
return self.values._codes_for_groupby(sort)
787+
return self.values._codes_for_groupby(sort, observed)
788788

789789
@classmethod
790790
def _add_comparison_methods(cls):
