-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: A new GroupBy method to slice rows preserving index and order #42947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 46 commits
72fd66d
d0ebbeb
33d7992
78e9ced
4d098cd
f84c365
d937757
e206912
1788f1b
f6977fa
bca4fdd
66536b1
d075c67
df1a767
f2e9f79
a9f9848
e42c86d
bab88c9
a74bd33
c77de1d
0d750bb
e952c25
2a6aafc
b7f8bfe
86e0c2e
6f75502
8de5ff2
f51fa88
3063f3a
4228251
41b1c73
ce36210
70dcdb5
c024e41
8abcac3
add5727
bcd1dd9
25459f7
fa6b86c
fefbacf
fa9f7e3
b589420
89deee3
e28cdfb
424ab14
258530d
d49e48f
1dd6258
f84f5c0
c068162
536298e
0e73278
4cfde7b
6343c9f
33a2225
6ca80c2
e94d4a8
0d91dca
acc3993
df52694
f42ae41
898fad4
ffaaf25
02ec03c
0691f99
7cad2c0
44120e1
88b8ac5
0ee53cd
945a482
9412e3e
138b791
6b29c82
ea45bc6
179912e
19edf00
94f6e99
ae21059
5b8142b
c8e0950
6ce90c4
4f6cbe1
7d92c79
19b21bb
1a055e4
ca164cf
337b15c
69d8956
cecc674
98a9460
95eb548
4c8644b
d0e9aa0
10cca16
9ccebf1
a3db969
4c4ba92
13ff29f
acf67b1
a3db6d1
ee8a86b
f4b24b0
82360f5
ba836dc
8abcad7
86c8e20
ee33df0
4a1aac9
d9671a6
a6dbc61
f65093c
534ea54
511c8fd
97c3ac0
90a4cb8
b58b235
f5ed6bf
21b3637
88613a9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,238 @@ | ||
from __future__ import annotations | ||
|
||
from typing import ( | ||
Iterable, | ||
cast, | ||
) | ||
|
||
import numpy as np | ||
|
||
from pandas._typing import ( | ||
FrameOrSeries, | ||
PositionalIndexer, | ||
) | ||
from pandas.util._decorators import ( | ||
cache_readonly, | ||
doc, | ||
) | ||
|
||
from pandas.core.dtypes.common import ( | ||
is_integer, | ||
is_list_like, | ||
) | ||
|
||
from pandas.core.groupby import groupby | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. how are you importing this? (when this file itself is imported by group) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yah this looks like a circular import that is going to be really fragile There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I only did this as a mixin because I was copying the style of df.iloc. Which does not attempt to type itself. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok this is just for typing, so if you put it in a TYPE_CHECKING block will be ok (e.g .as the groupby.GroupBy should be separated out to avoid deps) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I will wait until my There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It tested OK, so I have added the conditional type checking.; |
||
from pandas.core.indexes.api import CategoricalIndex | ||
|
||
|
||
class GroupByIndexingMixin: | ||
""" | ||
Mixin for adding .rows to GroupBy. | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
""" | ||
|
||
@property | ||
johnzangwill marked this conversation as resolved.
Show resolved
Hide resolved
|
||
def _rows(self) -> _rowsGroupByIndexer: | ||
return _rowsGroupByIndexer(cast(groupby.GroupBy, self)) | ||
|
||
|
||
@doc(GroupByIndexingMixin._rows) | ||
class _rowsGroupByIndexer: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. its ok and preferred to use capitalcase, e.g. RowsGroupByIndexer) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Done
Unfortunately, I need groupby because I need There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
OK, I have tested making Unless you say otherwise, I will make that change. Of course, it effects the entire system, but I suppose we could always back it out... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I've committed it. Fingers crossed... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. no we don't care, this is an internal routine (its not on a groupby rather a grouper which is private) |
||
def __init__(self, grouped: groupby.GroupBy): | ||
self.grouped = grouped | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can the name here be more obvious? i had to double-check that this didnt correspond to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I really struggled with that! I have renamed it groupByObject, which hopefully sums it up... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Now it is renamed |
||
|
||
def __getitem__(self, arg: PositionalIndexer | tuple) -> FrameOrSeries: | ||
johnzangwill marked this conversation as resolved.
Show resolved
Hide resolved
|
||
""" | ||
Positional index for selection by integer location per group. | ||
|
||
Used to implement GroupBy._rows which is used to implement GroupBy.nth | ||
when keyword dropna is None or absent. | ||
The behaviour extends GroupBy.nth and handles DataFrame.groupby() | ||
keyword parameters such as as_index and dropna in a compatible way. | ||
|
||
The additions to nth(arg) are: | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
- Handles iterables such as range. | ||
- Handles slice(start, stop, step) with | ||
start: positive, negative or None. | ||
stop: positive, negative or None. | ||
step: positive or None. | ||
|
||
Parameters | ||
---------- | ||
arg : PositionalIndexer | tuple | ||
Allowed values are: | ||
- Integer | ||
- Integer values iterable such as list or range | ||
- Slice | ||
- Comma separated list of integers and slices | ||
|
||
Returns | ||
------- | ||
Series | ||
The filtered subset of the original groupby Series. | ||
DataFrame | ||
The filtered subset of the original groupby DataFrame. | ||
|
||
See Also | ||
-------- | ||
DataFrame.iloc : Purely integer-location based indexing for selection by | ||
position. | ||
GroupBy.head : Return first n rows of each group. | ||
GroupBy.tail : Return last n rows of each group. | ||
GroupBy.nth : Take the nth row from each group if n is an int, or a | ||
subset of rows, if n is a list of ints. | ||
|
||
Examples | ||
-------- | ||
>>> df = pd.DataFrame([["a", 1], ["a", 2], ["a", 3], ["b", 4], ["b", 5]], | ||
johnzangwill marked this conversation as resolved.
Show resolved
Hide resolved
|
||
... columns=["A", "B"]) | ||
>>> df.groupby("A", as_index=False)._rows[1:2] | ||
A B | ||
1 a 2 | ||
4 b 5 | ||
|
||
>>> df.groupby("A", as_index=False)._rows[1, -1] | ||
A B | ||
1 a 2 | ||
2 a 3 | ||
4 b 5 | ||
""" | ||
with groupby.group_selection_context(self.grouped): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if this is the reason you need There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I also needed it in line 35 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. for the |
||
if isinstance(arg, tuple): | ||
if all(is_integer(i) for i in arg): | ||
mask = self._handle_list(arg) | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is a weird extra line There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
else: | ||
mask = self._handle_tuple(arg) | ||
|
||
elif isinstance(arg, slice): | ||
mask = self._handle_slice(arg) | ||
|
||
elif is_integer(arg): | ||
mask = self._handle_int(cast(int, arg)) | ||
|
||
elif is_list_like(arg): | ||
mask = self._handle_list(cast(Iterable[int], arg)) | ||
|
||
else: | ||
raise TypeError( | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
f"Invalid index {type(arg)}. " | ||
"Must be integer, list-like, slice or a tuple of " | ||
"integers and slices" | ||
) | ||
|
||
ids, _, _ = self.grouped.grouper.group_info | ||
|
||
# Drop NA values in grouping | ||
mask &= ids != -1 | ||
rhshadrach marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
if mask is None or mask is True: | ||
result = self.grouped._selected_obj[:] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. does _selected_obj vs. _obj_with_exclusions make a difference here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't know. This is the method used by head, tail and nth. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There is no There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
produces
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You are right. It comes in from SelectionMixin and seems to lose the grouped column. |
||
|
||
johnzangwill marked this conversation as resolved.
Show resolved
Hide resolved
|
||
else: | ||
result = self.grouped._selected_obj[mask] | ||
|
||
if self.grouped.as_index: | ||
result_index = self.grouped.grouper.result_index | ||
result.index = result_index[ids[mask]] | ||
|
||
if not self.grouped.observed and isinstance( | ||
result_index, CategoricalIndex | ||
): | ||
result = result.reindex(result_index) | ||
|
||
result = self.grouped._reindex_output(result) | ||
if self.grouped.sort: | ||
result = result.sort_index() | ||
|
||
return result | ||
|
||
def _handle_int(self, arg: int) -> bool | np.ndarray: | ||
rhshadrach marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if arg >= 0: | ||
return self._ascending_count == arg | ||
|
||
else: | ||
return self._descending_count == (-arg - 1) | ||
|
||
def _handle_list(self, args: Iterable[int]) -> bool | np.ndarray: | ||
rhshadrach marked this conversation as resolved.
Show resolved
Hide resolved
|
||
positive = [arg for arg in args if arg >= 0] | ||
negative = [-arg - 1 for arg in args if arg < 0] | ||
|
||
mask: bool | np.ndarray = False | ||
|
||
if positive: | ||
mask |= np.isin(self._ascending_count, positive) | ||
|
||
if negative: | ||
mask |= np.isin(self._descending_count, negative) | ||
|
||
return mask | ||
|
||
def _handle_tuple(self, args: tuple) -> bool | np.ndarray: | ||
mask: bool | np.ndarray = False | ||
|
||
for arg in args: | ||
if is_integer(arg): | ||
mask |= self._handle_int(cast(int, arg)) | ||
|
||
elif isinstance(arg, slice): | ||
mask |= self._handle_slice(arg) | ||
|
||
else: | ||
raise ValueError( | ||
f"Invalid argument {type(arg)}. Should be int or slice." | ||
) | ||
|
||
return mask | ||
|
||
def _handle_slice(self, arg: slice) -> bool | np.ndarray: | ||
start = arg.start | ||
stop = arg.stop | ||
step = arg.step | ||
|
||
if step is not None and step < 0: | ||
raise ValueError(f"Invalid step {step}. Must be non-negative") | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
mask: bool | np.ndarray = True | ||
|
||
if step is None: | ||
step = 1 | ||
|
||
if start is None: | ||
if step > 1: | ||
mask &= self._ascending_count % step == 0 | ||
|
||
elif start >= 0: | ||
mask &= self._ascending_count >= start | ||
|
||
if step > 1: | ||
mask &= (self._ascending_count - start) % step == 0 | ||
|
||
else: | ||
mask &= self._descending_count < -start | ||
|
||
offset_array = self._descending_count + start + 1 | ||
limit_array = ( | ||
self._ascending_count + self._descending_count + (start + 1) | ||
) < 0 | ||
offset_array = np.where( | ||
limit_array, self._ascending_count, offset_array | ||
) | ||
|
||
mask &= offset_array % step == 0 | ||
|
||
if stop is not None: | ||
if stop >= 0: | ||
mask &= self._ascending_count < stop | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. extra line There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Don't understand. All my if-then-elses have blank lines. Do you want them all removed? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I have removed most of the blank lines before elif and else |
||
else: | ||
mask &= self._descending_count >= -stop | ||
|
||
return mask | ||
|
||
@cache_readonly | ||
def _ascending_count(self) -> np.ndarray: | ||
return self.grouped._cumcount_array() | ||
|
||
@cache_readonly | ||
def _descending_count(self) -> np.ndarray: | ||
return self.grouped._cumcount_array(ascending=False) |
Uh oh!
There was an error while loading. Please reload this page.