ENH: A new GroupBy method to slice rows preserving index and order #42947
Merged
119 commits
72fd66d
ENH: A new GroupBy method to slice rows preserving index and order
johnzangwill d0ebbeb
Formatting
johnzangwill 33d7992
Formatting
johnzangwill 78e9ced
Formatting
johnzangwill 4d098cd
Formatting
johnzangwill f84c365
Formatting
johnzangwill d937757
Add iloc to test_tab_completion
johnzangwill e206912
Add iloc to groupby/base.py
johnzangwill 1788f1b
Documentation
johnzangwill f6977fa
Cosmetics to make pre-commit happy
johnzangwill bca4fdd
Improve docstring
johnzangwill 66536b1
Delete a.md
johnzangwill d075c67
Add to doc and improve test
johnzangwill df1a767
Tidy-up for pre-commit
johnzangwill f2e9f79
Update groupbyindexing.py
johnzangwill a9f9848
Split a long line
johnzangwill e42c86d
GroupBy.rows implementation
johnzangwill bab88c9
Add rows to rst file
johnzangwill a74bd33
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill c77de1d
Change iloc to rows in test_allowlist.py
johnzangwill 0d750bb
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill e952c25
Add to base.py
johnzangwill 2a6aafc
Tidy some whitespace for pep8speaks
johnzangwill b7f8bfe
Tidied mask code
johnzangwill 86e0c2e
test_rows.py formatting
johnzangwill 6f75502
Correct docstring bullet format
johnzangwill 8de5ff2
Update test_rows.py
johnzangwill f51fa88
Remove blank line at end of docstring
johnzangwill 3063f3a
Small change to force rebuild
johnzangwill 4228251
Make rows 100% compatible with nth
johnzangwill 41b1c73
Temporarily reroute nth list and slice to rows
johnzangwill ce36210
Rows for all non-dropna calls + types and tests
johnzangwill 70dcdb5
Merge branch 'master' into groupby_iloc
johnzangwill c024e41
Changes for flake8
johnzangwill 8abcac3
just one more comma...
johnzangwill add5727
Add type hints
johnzangwill bcd1dd9
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 25459f7
Delete my build.cmd. Accidental commit
johnzangwill fa6b86c
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill fefbacf
jreback 12 Sep requested changes
johnzangwill fa9f7e3
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill b589420
remove white-space
johnzangwill 89deee3
Get rid of np.int test
johnzangwill e28cdfb
Revert "Get rid of np.int test"
johnzangwill 424ab14
Try again...
johnzangwill 258530d
More jreback requested changes
johnzangwill d49e48f
More tweaks
johnzangwill 1dd6258
Whitespace
johnzangwill f84f5c0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill c068162
Remove blank lines in conditionals
johnzangwill 536298e
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 0e73278
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill 4cfde7b
Mainly variable changes and some formatting
johnzangwill 6343c9f
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 33a2225
Make group_selection_context a private GroupBy class method
johnzangwill 6ca80c2
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill e94d4a8
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 0d91dca
Add conditional typing for groupby import
johnzangwill acc3993
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill df52694
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill f42ae41
Delete Example section
johnzangwill 898fad4
Changes for @rhshadrach.
johnzangwill ffaaf25
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 02ec03c
Remove more docstrings from tests
johnzangwill 0691f99
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill 7cad2c0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 44120e1
Don't need to check for None anymore
johnzangwill 88b8ac5
Speed up by checking dropna
johnzangwill 0ee53cd
Implement head, tail. column axis, change _rows to _middle and remove…
johnzangwill 945a482
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 9412e3e
Change _middle to _body
johnzangwill 138b791
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill 6b29c82
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill ea45bc6
Change class name to match
johnzangwill 179912e
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill 19edf00
Add negative values to test_body.py/test_against_head_and_tail()
johnzangwill 94f6e99
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill ae21059
Add _body docstring
johnzangwill 5b8142b
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill c8e0950
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill 6ce90c4
Make nth a link
johnzangwill 4f6cbe1
Improve doc
johnzangwill 7d92c79
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 19b21bb
Simplify examples
johnzangwill 1a055e4
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill ca164cf
Fix FrameOrSeries typing problem
johnzangwill 337b15c
Fix more new typing problems
johnzangwill 69d8956
More typing problems
johnzangwill cecc674
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 98a9460
More typing woes
johnzangwill 95eb548
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill 4c8644b
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill d0e9aa0
Create test_body.py
johnzangwill 10cca16
Merge branch 'master' into groupby_iloc
johnzangwill 9ccebf1
Merge branch 'master' into groupby_iloc
johnzangwill a3db969
Resolve conflicts
johnzangwill 4c4ba92
Avoid groupby name clash
johnzangwill 13ff29f
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill acf67b1
Delete duplicated test_body.py
johnzangwill a3db6d1
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill ee8a86b
Merge branch 'master' into groupby_iloc
johnzangwill f4b24b0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 82360f5
Rename test_body.py to test_indexing.py
johnzangwill ba836dc
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 8abcad7
@jreback suggested renames
johnzangwill 86c8e20
Update whatsnew v1.4.0
johnzangwill ee33df0
Correct typo in doc
johnzangwill 4a1aac9
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill d9671a6
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill a6dbc61
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill f65093c
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 534ea54
Resolve with another branch
johnzangwill 511c8fd
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 97c3ac0
NDFrameT cannot be used like that
johnzangwill 90a4cb8
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill b58b235
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill f5ed6bf
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 21b3637
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill 88613a9
Merge branch 'master' into groupby_iloc
johnzangwill
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -125,6 +125,7 @@ | |
"groups", | ||
"head", | ||
"hist", | ||
"iloc", | ||
"indices", | ||
"ndim", | ||
"ngroups", | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,214 @@ | ||
from __future__ import annotations | ||
|
||
import numpy as np | ||
|
||
from pandas.util._decorators import doc | ||
|
||
|
||
class GroupByIndexingMixin: | ||
""" | ||
Mixin for adding .iloc to GroupBy. | ||
""" | ||
|
||
@property | ||
def iloc(self) -> _ilocGroupByIndexer: | ||
""" | ||
Integer location-based indexing for selection by position per group. | ||
|
||
Similar to ``.apply(lambda x: x.iloc[i:j, k:l])``, but much faster and returns | ||
a subset of rows from the original DataFrame with the original index and order | ||
preserved. | ||
|
||
The output is compatible with head() and tail(). | ||
The output is different from take() and nth(), which do not preserve the index or order. | ||
|
||
Inputs | ||
------ | ||
Allowed inputs for the first index are: | ||
|
||
- An integer, e.g. ``5``. | ||
- A slice object with ints and positive step, e.g. ``1:``, ``4:-3:2``. | ||
|
||
Allowed inputs for the second index are as for DataFrame.iloc, namely: | ||
|
||
- An integer, e.g. ``5``. | ||
- A list or array of integers, e.g. ``[4, 3, 0]``. | ||
- A slice object with ints, e.g. ``1:7``. | ||
- A boolean array. | ||
- A ``callable`` function with one argument (the calling Series or | ||
DataFrame) and that returns valid output for indexing (one of the above). | ||
|
||
Returns | ||
------- | ||
Series or DataFrame | ||
|
||
Notes | ||
----- | ||
Neither GroupBy.nth() nor GroupBy.take() takes a slice argument, and | ||
neither of them preserves the original DataFrame order and index. | ||
They are both slow for large integer lists, and take() is very slow for large group counts. | ||
|
||
Use Case | ||
-------- | ||
Suppose that we have a multi-indexed DataFrame with a large primary index and a secondary | ||
index sorted to a different order for each primary. | ||
To reduce the DataFrame to a middle slice of each secondary, group by the primary and then | ||
use iloc. | ||
This preserves the original DataFrame's order and indexing. | ||
(See tests/groupby/test_groupby_iloc) | ||
|
||
Examples | ||
-------- | ||
>>> df = pd.DataFrame([["a", 1], ["a", 2], ["a", 3], ["b", 4], ["b", 5]], | ||
... columns=["A", "B"]) | ||
>>> df.groupby("A").iloc[1:2] | ||
A B | ||
1 a 2 | ||
4 b 5 | ||
>>> df.groupby("A").iloc[:-1, -1:] | ||
B | ||
0 1 | ||
1 2 | ||
3 4 | ||
""" | ||
return _ilocGroupByIndexer(self) | ||
|
||
|
||
@doc(GroupByIndexingMixin.iloc) | ||
class _ilocGroupByIndexer: | ||
def __init__(self, grouped): | ||
self.grouped = grouped | ||
self.reversed = False | ||
self._cached_ascending_count = None | ||
self._cached_descending_count = None | ||
|
||
def __getitem__(self, arg): | ||
self.reversed = False | ||
|
||
if isinstance(arg, tuple): | ||
return self._handle_item(arg[0], arg[1]) | ||
|
||
else: | ||
return self._handle_item(arg, None) | ||
|
||
def _handle_item(self, arg0, arg1): | ||
typeof_arg = type(arg0) | ||
|
||
if typeof_arg == slice: | ||
start = arg0.start | ||
stop = arg0.stop | ||
step = arg0.step | ||
|
||
if step is not None and step < 0: | ||
raise ValueError( | ||
f"GroupBy.iloc row slice step must be positive. Slice was {start}:{stop}:{step}" | ||
) | ||
# self.reversed = True | ||
# start = None if start is None else -start - 1 | ||
# stop = None if stop is None else -stop - 1 | ||
# step = -step | ||
|
||
return self._handle_slice(start, stop, step, arg1) | ||
|
||
elif typeof_arg == int: | ||
# A stop of arg0 + 1 would be 0 for arg0 == -1 and select nothing | ||
return self._handle_slice(arg0, None if arg0 == -1 else arg0 + 1, 1, arg1) | ||
|
||
else: | ||
raise ValueError( | ||
f"GroupBy.iloc row must be an integer or a slice, not a {typeof_arg}" | ||
) | ||
|
||
def _handle_slice(self, start, stop, step, arg1): | ||
mask = None | ||
if step is None: | ||
step = 1 | ||
|
||
self.grouped._reset_group_selection() | ||
|
||
if start is None: | ||
if step > 1: | ||
mask = self._ascending_count % step == 0 | ||
|
||
else: | ||
if start >= 0: | ||
mask = self._ascending_count >= start | ||
|
||
if step > 1: | ||
mask &= (self._ascending_count - start) % step == 0 | ||
|
||
else: | ||
mask = self._descending_count < -start | ||
|
||
if step > 1: | ||
# | ||
# If start is negative and -start exceeds the length of a group, | ||
# then step must count from the first row of that group | ||
# rather than from the calculated offset. | ||
# | ||
# _ascending_count + _descending_count + 1 gives the length of the | ||
# current group, which lets us switch between offset_array and | ||
# _ascending_count depending on whether -start exceeds the group size. | ||
# | ||
offset_array = self._descending_count + start + 1 | ||
limit_array = ( | ||
self._ascending_count + self._descending_count + (start + 1) | ||
) < 0 | ||
offset_array = np.where( | ||
limit_array, self._ascending_count, offset_array | ||
) | ||
|
||
mask &= offset_array % step == 0 | ||
|
||
if stop is not None: | ||
if stop >= 0: | ||
if mask is None: | ||
mask = self._ascending_count < stop | ||
|
||
else: | ||
mask &= self._ascending_count < stop | ||
else: | ||
if mask is None: | ||
mask = self._descending_count >= -stop | ||
|
||
else: | ||
mask &= self._descending_count >= -stop | ||
|
||
if mask is None: | ||
arg0 = slice(None) | ||
|
||
else: | ||
arg0 = mask | ||
|
||
if arg1 is None: | ||
return self._selected_obj.iloc[arg0] | ||
|
||
else: | ||
return self._selected_obj.iloc[arg0, arg1] | ||
|
||
@property | ||
def _ascending_count(self): | ||
if self._cached_ascending_count is None: | ||
self._cached_ascending_count = self.grouped._cumcount_array() | ||
if self.reversed: | ||
self._cached_ascending_count = self._cached_ascending_count[::-1] | ||
|
||
return self._cached_ascending_count | ||
|
||
@property | ||
def _descending_count(self): | ||
if self._cached_descending_count is None: | ||
self._cached_descending_count = self.grouped._cumcount_array( | ||
ascending=False | ||
) | ||
if self.reversed: | ||
self._cached_descending_count = self._cached_descending_count[::-1] | ||
|
||
return self._cached_descending_count | ||
|
||
@property | ||
def _selected_obj(self): | ||
if self.reversed: | ||
return self.grouped._selected_obj.iloc[::-1] | ||
|
||
else: | ||
return self.grouped._selected_obj |
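The mask logic in `_handle_slice` can be illustrated outside the PR with public pandas API only. The following is a minimal sketch, not the PR's implementation: `grouped_slice` is a hypothetical helper name, it uses the public `GroupBy.cumcount()` in place of the private `_cumcount_array()`, and it supports a step of 1 only.

```python
import numpy as np
import pandas as pd


def grouped_slice(df, by, start=None, stop=None):
    """Hypothetical helper: keep rows start:stop of every group (step 1 only)."""
    grouped = df.groupby(by)
    # Position of each row counted from the front of its group: 0, 1, 2, ...
    ascending = grouped.cumcount().to_numpy()
    # Position of each row counted from the back of its group: n-1, ..., 1, 0
    descending = grouped.cumcount(ascending=False).to_numpy()

    mask = np.ones(len(df), dtype=bool)
    if start is not None:
        mask &= (ascending >= start) if start >= 0 else (descending < -start)
    if stop is not None:
        mask &= (ascending < stop) if stop >= 0 else (descending >= -stop)
    # Boolean selection keeps the original index and row order.
    return df[mask]


df = pd.DataFrame(
    [["a", 1], ["a", 2], ["a", 3], ["b", 4], ["b", 5]], columns=["A", "B"]
)
middle = grouped_slice(df, "A", 1, 2)           # like 1:2 within each group
all_but_last = grouped_slice(df, "A", stop=-1)  # like :-1 within each group
```

Because the result is a boolean selection on the original frame, the rows come back under their original index labels, which is the property the docstring above emphasises.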
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,146 @@ | ||
""" Test positional grouped indexing with iloc GH#42864""" | ||
|
||
import random | ||
|
||
import pandas as pd | ||
import pandas._testing as tm | ||
|
||
|
||
def test_doc_examples(): | ||
"""Test the examples in the documentation""" | ||
|
||
df = pd.DataFrame( | ||
[["a", 1], ["a", 2], ["a", 3], ["b", 4], ["b", 5]], columns=["A", "B"] | ||
) | ||
|
||
grouped = df.groupby("A") | ||
result = grouped.iloc[1:2, :] | ||
expected = pd.DataFrame([["a", 2], ["b", 5]], columns=["A", "B"], index=[1, 4]) | ||
|
||
tm.assert_frame_equal(result, expected) | ||
|
||
result = grouped.iloc[:-1, -1:] | ||
expected = pd.DataFrame([1, 2, 4], columns=["B"], index=[0, 1, 3]) | ||
|
||
tm.assert_frame_equal(result, expected) | ||
|
||
|
||
def test_multiindex(): | ||
"""Test the multiindex mentioned as the use-case in the documentation""" | ||
|
||
def make_df_from_data(data): | ||
rows = {} | ||
for date in dates: | ||
for level in data[date]: | ||
rows[(date, level[0])] = {"A": level[1], "B": level[2]} | ||
|
||
df = pd.DataFrame.from_dict(rows, orient="index") | ||
df.index.names = ("Date", "Item") | ||
return df | ||
|
||
ndates = 1000 | ||
nitems = 40 | ||
dates = pd.date_range("20130101", periods=ndates, freq="D") | ||
items = [f"item {i}" for i in range(nitems)] | ||
|
||
data = {} | ||
for date in dates: | ||
levels = [ | ||
(item, random.randint(0, 10000) / 100, random.randint(0, 10000) / 100) for item in items | ||
] | ||
levels.sort(key=lambda x: x[1]) | ||
data[date] = levels | ||
|
||
df = make_df_from_data(data) | ||
result = df.groupby("Date").iloc[3:7] | ||
|
||
sliced = {date: data[date][3:7] for date in dates} | ||
expected = make_df_from_data(sliced) | ||
|
||
tm.assert_frame_equal(result, expected) | ||
|
||
|
||
def test_against_head_and_tail(): | ||
"""Test gives the same results as grouped head and tail""" | ||
|
||
n_groups = 100 | ||
n_rows_per_group = 30 | ||
|
||
data = { | ||
"group": [f"group {g}" for j in range(n_rows_per_group) for g in range(n_groups)], | ||
"value": [ | ||
random.randint(0, 10000) / 100 | ||
for j in range(n_rows_per_group) | ||
for g in range(n_groups) | ||
] | ||
} | ||
df = pd.DataFrame(data) | ||
grouped = df.groupby("group") | ||
|
||
for i in [1, 5, 29, 30, 31, 1000]: | ||
result = grouped.iloc[:i, :] | ||
expected = grouped.head(i) | ||
|
||
tm.assert_frame_equal(result, expected) | ||
|
||
result = grouped.iloc[-i:, :] | ||
expected = grouped.tail(i) | ||
|
||
tm.assert_frame_equal(result, expected) | ||
|
||
|
||
def test_against_df_iloc(): | ||
"""Test that a single group gives the same results as DataFrame.iloc""" | ||
|
||
n_rows_per_group = 30 | ||
|
||
data = { | ||
"group": ["group 0" for j in range(n_rows_per_group)], | ||
"value": [random.randint(0, 10000) / 100 for j in range(n_rows_per_group)] | ||
} | ||
df = pd.DataFrame(data) | ||
grouped = df.groupby("group") | ||
|
||
for start in [None, 0, 1, 10, 29, 30, 1000, -1, -10, -29, -30, -1000]: | ||
for stop in [None, 0, 1, 10, 29, 30, 1000, -1, -10, -29, -30, -1000]: | ||
for step in [None, 1, 2, 3, 10, 29, 30, 100]: | ||
result = grouped.iloc[start:stop:step, :] | ||
expected = df.iloc[start:stop:step, :] | ||
|
||
tm.assert_frame_equal(result, expected) | ||
|
||
|
||
def test_series(): | ||
"""Test grouped Series""" | ||
|
||
ser = pd.Series([1, 2, 3, 4, 5], index=["a", "a", "a", "b", "b"]) | ||
grouped = ser.groupby(level=0) | ||
result = grouped.iloc[1:2] | ||
expected = pd.Series([2, 5], index=["a", "b"]) | ||
|
||
tm.assert_series_equal(result, expected) | ||
|
||
|
||
def test_step(): | ||
"""Test grouped slice with step""" | ||
|
||
data = [["x", f"x{i}"] for i in range(5)] | ||
data += [["y", f"y{i}"] for i in range(4)] | ||
data += [["z", f"z{i}"] for i in range(3)] | ||
df = pd.DataFrame(data, columns=["A", "B"]) | ||
|
||
grouped = df.groupby("A") | ||
|
||
for step in [1, 2, 3, 4, 5]: | ||
result = grouped.iloc[::step, :] | ||
|
||
data = [["x", f"x{i}"] for i in range(0, 5, step)] | ||
data += [["y", f"y{i}"] for i in range(0, 4, step)] | ||
data += [["z", f"z{i}"] for i in range(0, 3, step)] | ||
|
||
index = [0 + i for i in range(0, 5, step)] | ||
index += [5 + i for i in range(0, 4, step)] | ||
index += [9 + i for i in range(0, 3, step)] | ||
|
||
expected = pd.DataFrame(data, columns=["A", "B"], index=index) | ||
|
||
tm.assert_frame_equal(result, expected) |
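The expectation checked by `test_step` can also be reproduced with released pandas alone. This is a sketch under assumptions: `grouped_step` is a hypothetical helper, it covers only the `::step` case with a positive step, and the reference result is built group by group with plain `DataFrame.iloc`.

```python
import pandas as pd


def grouped_step(df, by, step):
    """Hypothetical helper: keep every step-th row of each group (like ::step)."""
    # Zero-based position of each row inside its own group.
    position = df.groupby(by).cumcount().to_numpy()
    return df[position % step == 0]


data = [["x", f"x{i}"] for i in range(5)]
data += [["y", f"y{i}"] for i in range(4)]
data += [["z", f"z{i}"] for i in range(3)]
df = pd.DataFrame(data, columns=["A", "B"])

every_second = grouped_step(df, "A", 2)

# Reference: slice each group separately, then restore the original order.
reference = pd.concat([group.iloc[::2] for _, group in df.groupby("A")]).sort_index()
```

The mask-based version touches each row once, whereas the reference slices and concatenates one frame per group, which is the cost the vectorised approach is meant to avoid.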