pandas-dev
doc/source/groupby.rst (+51 -23)
doc/source/whatsnew/v0.23.0.txt (+30 -3)
pandas/conftest.py (+9)
pandas/core/arrays/categorical.py (+20 -2)
pandas/core/groupby/groupby.py (+14 -13)
pandas/core/indexes/category.py (+2 -2)
@@ -91,10 +91,10 @@ The mapping can be specified many different ways:
   - A Python function, to be called on each of the axis labels.
   - A list or NumPy array of the same length as the selected axis.
   - A dict or ``Series``, providing a ``label -> group name`` mapping.
-  - For ``DataFrame`` objects, a string indicating a column to be used to group. 
+  - For ``DataFrame`` objects, a string indicating a column to be used to group.
     Of course ``df.groupby('A')`` is just syntactic sugar for
     ``df.groupby(df['A'])``, but it makes life simpler.
-  - For ``DataFrame`` objects, a string indicating an index level to be used to 
+  - For ``DataFrame`` objects, a string indicating an index level to be used to
     group.
   - A list of any of the above things.
 
@@ -120,7 +120,7 @@ consider the following ``DataFrame``:
                       'D' : np.random.randn(8)})
    df
 
-On a DataFrame, we obtain a GroupBy object by calling :meth:`~DataFrame.groupby`. 
+On a DataFrame, we obtain a GroupBy object by calling :meth:`~DataFrame.groupby`.
 We could naturally group by either the ``A`` or ``B`` columns, or both:
 
 .. ipython:: python
@@ -360,8 +360,8 @@ Index level names may be specified as keys directly to ``groupby``.
 DataFrame column selection in GroupBy
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Once you have created the GroupBy object from a DataFrame, you might want to do 
-something different for each of the columns. Thus, using ``[]`` similar to 
+Once you have created the GroupBy object from a DataFrame, you might want to do
+something different for each of the columns. Thus, using ``[]`` similar to
 getting a column from a DataFrame, you can do:
 
 .. ipython:: python
@@ -421,7 +421,7 @@ statement if you wish: ``for (k1, k2), group in grouped:``.
 Selecting a group
 -----------------
 
-A single group can be selected using 
+A single group can be selected using
 :meth:`~pandas.core.groupby.DataFrameGroupBy.get_group`:
 
 .. ipython:: python
@@ -444,8 +444,8 @@ perform a computation on the grouped data. These operations are similar to the
 :ref:`aggregating API <basics.aggregate>`, :ref:`window functions API <stats.aggregate>`,
 and :ref:`resample API <timeseries.aggregate>`.
 
-An obvious one is aggregation via the 
-:meth:`~pandas.core.groupby.DataFrameGroupBy.aggregate` or equivalently 
+An obvious one is aggregation via the
+:meth:`~pandas.core.groupby.DataFrameGroupBy.aggregate` or equivalently
 :meth:`~pandas.core.groupby.DataFrameGroupBy.agg` method:
 
 .. ipython:: python
@@ -517,12 +517,12 @@ Some common aggregating functions are tabulated below:
 	:meth:`~pd.core.groupby.DataFrameGroupBy.nth`;Take nth value, or a subset if n is a list
 	:meth:`~pd.core.groupby.DataFrameGroupBy.min`;Compute min of group values
 	:meth:`~pd.core.groupby.DataFrameGroupBy.max`;Compute max of group values
-	
 
-The aggregating functions above will exclude NA values. Any function which 
+
+The aggregating functions above will exclude NA values. Any function which
 reduces a :class:`Series` to a scalar value is an aggregation function and will work,
 a trivial example is ``df.groupby('A').agg(lambda ser: 1)``. Note that
-:meth:`~pd.core.groupby.DataFrameGroupBy.nth` can act as a reducer *or* a 
+:meth:`~pd.core.groupby.DataFrameGroupBy.nth` can act as a reducer *or* a
 filter, see :ref:`here <groupby.nth>`.
 
 .. _groupby.aggregate.multifunc:
@@ -732,7 +732,7 @@ and that the transformed data contains no NAs.
 .. note::
 
    Some functions will automatically transform the input when applied to a
-   GroupBy object, but returning an object of the same shape as the original. 
+   GroupBy object, but returning an object of the same shape as the original.
    Passing ``as_index=False`` will not affect these transformation methods.
 
    For example: ``fillna, ffill, bfill, shift.``.
@@ -926,7 +926,7 @@ The dimension of the returned result can also change:
 
     In [11]: grouped.apply(f)
 
-``apply`` on a Series can operate on a returned value from the applied function, 
+``apply`` on a Series can operate on a returned value from the applied function,
 that is itself a series, and possibly upcast the result to a DataFrame:
 
 .. ipython:: python
@@ -984,20 +984,48 @@ will be (silently) dropped. Thus, this does not pose any problems:
 
    df.groupby('A').std()
 
-Note that ``df.groupby('A').colname.std().`` is more efficient than 
+Note that ``df.groupby('A').colname.std()`` is more efficient than
 ``df.groupby('A').std().colname``, so if the result of an aggregation function
-is only interesting over one column (here ``colname``), it may be filtered 
+is only interesting over one column (here ``colname``), it may be filtered
 *before* applying the aggregation function.
 
+.. _groupby.observed:
+
+observed handling
+~~~~~~~~~~~~~~~~~
+
+When using a ``Categorical`` grouper (as a single grouper or as part of multiple groupers), the ``observed`` keyword
+controls whether to return a cartesian product of all possible grouper values (``observed=False``) or only those
+that are observed in the data (``observed=True``). The ``observed`` keyword will default to ``True`` in the future.
+
+Show only the observed values:
+
+.. ipython:: python
+
+   pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count()
+
+Show all values:
+
+.. ipython:: python
+
+   pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
+
+The returned dtype of the grouped result will *always* include *all* of the categories that were grouped.
+
+.. ipython:: python
+
+   s = pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
+   s.index.dtype
+
 .. _groupby.missing:
 
 NA and NaT group handling
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 
-If there are any NaN or NaT values in the grouping key, these will be 
-automatically excluded. In other words, there will never be an "NA group" or 
-"NaT group". This was not the case in older versions of pandas, but users were 
-generally discarding the NA group anyway (and supporting it was an 
+If there are any NaN or NaT values in the grouping key, these will be
+automatically excluded. In other words, there will never be an "NA group" or
+"NaT group". This was not the case in older versions of pandas, but users were
+generally discarding the NA group anyway (and supporting it was an
 implementation headache).
 
 Grouping with ordered factors
@@ -1084,8 +1112,8 @@ This shows the first or last n rows from each group.
 Taking the nth row of each group
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-To select from a DataFrame or Series the nth item, use 
-:meth:`~pd.core.groupby.DataFrameGroupBy.nth`. This is a reduction method, and 
+To select from a DataFrame or Series the nth item, use
+:meth:`~pd.core.groupby.DataFrameGroupBy.nth`. This is a reduction method, and
 will return a single row (or no row) per group if you pass an int for n:
 
 .. ipython:: python
@@ -1153,7 +1181,7 @@ Enumerate groups
 .. versionadded:: 0.20.2
 
 To see the ordering of the groups (as opposed to the order of rows
-within a group given by ``cumcount``) you can use 
+within a group given by ``cumcount``) you can use
 :meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`.
 
 
@@ -1273,7 +1301,7 @@ Regroup columns of a DataFrame according to their sum, and sum the aggregated on
 Multi-column factorization
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-By using :meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`, we can extract 
+By using :meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`, we can extract
 information about the groups in a way similar to :func:`factorize` (as described
 further in the :ref:`reshaping API <reshaping.factorize>`) but which applies
 naturally to multiple columns of mixed type and different
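
Review note on the new ``observed`` section: a rough sketch of the "cartesian product" wording, using two categorical groupers at once. The frame and its column names are made up for illustration, and the sketch assumes a pandas version whose ``groupby`` accepts ``observed``.

```python
import pandas as pd

# Hypothetical data: each categorical declares one category ("z", "y")
# that never occurs in the rows.
df = pd.DataFrame({
    "A": pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "z"]),
    "B": pd.Categorical(["c", "d", "c", "d"], categories=["c", "d", "y"]),
    "values": [1, 2, 3, 4],
})

# observed=False: one group per combination of declared categories (3 x 3).
all_groups = df.groupby(["A", "B"], observed=False).size()

# observed=True: only the combinations that actually occur in the data.
seen_groups = df.groupby(["A", "B"], observed=True).size()

print(len(all_groups), len(seen_groups))
```

With this data the first result should hold nine entries (five of them zero) and the second only the four combinations present in the rows.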
 
@@ -548,20 +548,47 @@ change to ``observed=True`` in the future. (:issue:`14942`, :issue:`8138`, :issu
    df['C'] = ['foo', 'bar'] * 2
    df
 
-Previous Behavior (show all values):
+``observed`` must now be passed when grouping by categoricals, or a
+``FutureWarning`` will show:
+
+.. ipython:: python
+   :okwarning:
+
+   df.groupby(['A', 'B', 'C']).count()
+
+
+To suppress the warning and keep the previous behavior (show all values), pass ``observed=False``:
 
 .. ipython:: python
 
-.. code-block:: python
    df.groupby(['A', 'B', 'C'], observed=False).count()
 
 
-New Behavior (show only observed values):
+Future Behavior (show only observed values):
 
 .. ipython:: python
 
    df.groupby(['A', 'B', 'C'], observed=True).count()
 
+For pivoting operations, this behavior is *already* controlled by the ``dropna`` keyword:
+
+.. ipython:: python
+
+   cat1 = pd.Categorical(["a", "a", "b", "b"],
+                         categories=["a", "b", "z"], ordered=True)
+   cat2 = pd.Categorical(["c", "d", "c", "d"],
+                         categories=["c", "d", "y"], ordered=True)
+   df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
+   df
+
+.. ipython:: python
+
+   pd.pivot_table(df, values='values', index=['A', 'B'],
+                  dropna=True)
+   pd.pivot_table(df, values='values', index=['A', 'B'],
+                  dropna=False)
+
+
 .. _whatsnew_0230.api_breaking.deprecate_panel:
 
 Deprecate Panel
 
@@ -66,6 +66,15 @@ def ip():
     return InteractiveShell()
 
 
+@pytest.fixture(params=[True, False])
+def observed(request):
+    """ pass in the observed keyword to groupby for [True, False]
+    This indicates whether categoricals should return results for
+    categories which do not appear in the grouper [False], or only for
+    values which appear in the grouper [True] """
+    return request.param
+
+
 @pytest.fixture(params=[None, 'gzip', 'bz2', 'zip',
                         pytest.param('xz', marks=td.skip_if_no_lzma)])
 def compression(request):
 
@@ -635,7 +635,7 @@ def _set_categories(self, categories, fastpath=False):
 
         self._dtype = new_dtype
 
-    def _codes_for_groupby(self, sort):
+    def _codes_for_groupby(self, sort, observed):
         """
         If sort=False, return a copy of self, coded with categories as
         returned by .unique(), followed by any categories not appearing in
@@ -649,6 +649,8 @@ def _codes_for_groupby(self, sort):
         ----------
         sort : boolean
             The value of the sort parameter groupby was called with.
+        observed : boolean
+            Account only for the observed values
 
         Returns
         -------
@@ -659,6 +661,22 @@ def _codes_for_groupby(self, sort):
             categories in the original order.
         """
 
+        # we only care about observed values
+        if observed:
+            unique_codes = unique1d(self.codes)
+            cat = self.copy()
+
+            take_codes = unique_codes[unique_codes != -1]
+            if self.ordered:
+                take_codes = np.sort(take_codes)
+
+            # we recode according to the uniques
+            cat._categories = self.categories.take(take_codes)
+            cat._codes = _recode_for_categories(self.codes,
+                                                self.categories,
+                                                cat._categories)
+            return cat
+
         # Already sorted according to self.categories; all is fine
         if sort:
             return self
@@ -2117,7 +2135,7 @@ def unique(self):
         # exclude nan from indexer for categories
         take_codes = unique_codes[unique_codes != -1]
         if self.ordered:
-            take_codes = sorted(take_codes)
+            take_codes = np.sort(take_codes)
         return cat.set_categories(cat.categories.take(take_codes))
 
     def _values_for_factorize(self):
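
Review note on the ``observed`` branch of ``_codes_for_groupby``: the recoding step can be sketched in plain Python. This is a simplified stand-in, not the pandas implementation: ``recode_observed`` is a hypothetical name, lists replace arrays, and it assumes the unique codes are taken in order of first appearance.

```python
def recode_observed(codes, categories, ordered=False):
    """Keep only categories whose code occurs in ``codes``; -1 marks missing."""
    # Unique codes in order of first appearance, skipping the missing marker.
    take_codes = [c for c in dict.fromkeys(codes) if c != -1]
    if ordered:
        # Ordered categoricals keep their declared category order.
        take_codes = sorted(take_codes)

    new_categories = [categories[c] for c in take_codes]
    # Map each old code to its position among the kept categories.
    old_to_new = {old: new for new, old in enumerate(take_codes)}
    new_codes = [old_to_new.get(c, -1) for c in codes]
    return new_codes, new_categories


codes = [2, 0, 2, -1]            # refers to "c", "a", "c", missing
categories = ["a", "b", "c"]     # "b" is never observed

print(recode_observed(codes, categories))
print(recode_observed(codes, categories, ordered=True))
```

In both calls the unobserved category ``"b"`` disappears and the remaining codes are renumbered to point into the shorter category list.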
 
@@ -1664,10 +1664,11 @@ def nth(self, n, dropna=None):
 
         if dropna not in ['any', 'all']:
             if isinstance(self._selected_obj, Series) and dropna is True:
-                warnings.warn("the dropna='%s' keyword is deprecated,"
+                warnings.warn("the dropna={dropna} keyword is deprecated, "
                               "use dropna='all' instead. "
                               "For a Series groupby, dropna must be "
-                              "either None, 'any' or 'all'." % (dropna),
+                              "either None, 'any' or 'all'.".format(
+                                  dropna=dropna),
                               FutureWarning,
                               stacklevel=2)
                 dropna = 'all'
@@ -2961,27 +2962,27 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,
             # a passed Categorical
             elif is_categorical_dtype(self.grouper):
 
-                self.grouper = self.grouper._codes_for_groupby(self.sort)
-                codes = self.grouper.codes
-                categories = self.grouper.categories
-
-                # we make a CategoricalIndex out of the cat grouper
-                # preserving the categories / ordered attributes
-                self._labels = codes
-
                 # Use the observed values of the grouper if indicated
                 observed = self.observed
                 if observed is None:
                     msg = ("pass observed=True to ensure that a "
                            "categorical grouper only returns the "
                            "observed groupers, or\n"
-                           "observed=False to return NA for non-observed"
-                           "values\n")
+                           "observed=False to include "
+                           "unobserved categories.\n")
                     warnings.warn(msg, FutureWarning, stacklevel=5)
                     observed = False
 
+                grouper = self.grouper
+                self.grouper = self.grouper._codes_for_groupby(
+                    self.sort, observed)
+                categories = self.grouper.categories
+
+                # we make a CategoricalIndex out of the cat grouper
+                # preserving the categories / ordered attributes
+                self._labels = self.grouper.codes
                 if observed:
-                    codes = algorithms.unique1d(codes)
+                    codes = algorithms.unique1d(grouper.codes)
                 else:
                     codes = np.arange(len(categories))
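
Review note on the ``observed is None`` fallback: a minimal standalone sketch of the warn-then-default pattern. ``resolve_observed`` is a hypothetical helper, not a pandas function. Adjacent string literals are joined without any separator, so each piece of the message carries its own trailing space or newline.

```python
import warnings


def resolve_observed(observed=None):
    """Hypothetical helper: fall back to False and emit a FutureWarning."""
    if observed is None:
        # Adjacent literals are concatenated as-is, so every piece needs
        # its own trailing space or newline.
        msg = ("pass observed=True to ensure that a "
               "categorical grouper only returns the "
               "observed groupers, or\n"
               "observed=False to include "
               "unobserved categories.\n")
        warnings.warn(msg, FutureWarning, stacklevel=2)
        observed = False
    return observed


# Capture the warning instead of printing it to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = resolve_observed()

print(value, caught[0].category.__name__)
```

Passing ``True`` or ``False`` explicitly skips the warning entirely, which is the migration path the message describes.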
 
 
@@ -782,9 +782,9 @@ def _concat_same_dtype(self, to_concat, name):
         result.name = name
         return result
 
-    def _codes_for_groupby(self, sort):
+    def _codes_for_groupby(self, sort, observed):
         """ Return a Categorical adjusted for groupby """
-        return self.values._codes_for_groupby(sort)
+        return self.values._codes_for_groupby(sort, observed)
 
     @classmethod
     def _add_comparison_methods(cls):