-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Maintain column order with groupby.nth #22811
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @reidy-p! Thanks for updating the PR.
Comment last updated on October 20, 2018 at 22:24 Hours UTC |
@@ -497,7 +497,8 @@ def _set_group_selection(self): | |||
|
|||
if len(groupers): | |||
# GH12839 clear selected obj cache when group selection changes | |||
self._group_selection = ax.difference(Index(groupers)).tolist() | |||
self._group_selection = ax.difference(Index(groupers), | |||
sort=False).tolist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Index.difference
tries to sort its result by default and this means that sometimes the order of the columns was changed from the original DataFrame. I added a new sort
parameter to Index.difference
with a default of True to control this.
Codecov Report
@@ Coverage Diff @@
## master #22811 +/- ##
=========================================
Coverage ? 92.2%
=========================================
Files ? 169
Lines ? 50927
Branches ? 0
=========================================
Hits ? 46955
Misses ? 3972
Partials ? 0
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. there are a couple of other PRs out there which add a sort=True default to the set ops, though I think , though might be for intersection IIRC. can you see if you can locate / reference.
sort : bool, default True | ||
Sort the resulting index if possible | ||
|
||
.. versionadded:: 0.24.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make sure this is added to all subclasses as well (mutli, interval) I think have there own impl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you do this (in this PR), can ideally update the tests for .difference for all types to parameterize it where appropriate
try: | ||
the_diff = sorting.safe_sort(the_diff) | ||
except TypeError: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some tests in the index tests to exercise this (prob just parameterize the parameter in the tests)
the other issue is #17378 |
just wanted to make you aware of the other issues not super necessary to actually do it in this PR unless it’s straightforwars |
Mostly cross referencing for myself. #21603 should become a lot easier when this is completed; |
lgtm. can you rebase, ping on green. |
dc2428c
to
acf06ad
Compare
@jreback I've rebased now but do you want me to add @mroeschke I had the exact same thought when I was working on a PR for |
acf06ad
to
67218b7
Compare
yes new PR after this is merged. |
sort : bool, default True | ||
Sort the resulting index if possible | ||
|
||
.. versionadded:: 0.24.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you do this (in this PR), can ideally update the tests for .difference for all types to parameterize it where appropriate
8f881c8
to
cdc638b
Compare
I have made some updates but I just realised that I need to add the new sort parameter to the tests for the subclasses of Index |
922f1eb
to
f7446b5
Compare
@@ -1040,7 +1040,11 @@ def func(self, other): | |||
'objects that have compatible dtypes') | |||
raise TypeError(msg.format(op=op_name)) | |||
|
|||
result = getattr(self._multiindex, op_name)(other._multiindex) | |||
if op_name == 'difference': | |||
result = getattr(self._multiindex, op_name)(other._multiindex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit awkward at the moment because difference
is the only set operation with the sort
parameter. But if we add a sort
parameter to the other set operations I think we can get rid of the if
statement
@@ -2791,8 +2791,14 @@ def difference(self, other, sort=True): | |||
labels=[[]] * self.nlevels, | |||
names=result_names, verify_integrity=False) | |||
|
|||
difference = set(self._ndarray_values) - set(other._ndarray_values) | |||
this = self._get_unique_index() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old way of doing this using set
did not preserve the original order so I took this code from the difference
method in pandas/core/indexes/base.py
:
pandas/pandas/core/indexes/base.py
Lines 2950 to 2957 in 145c227
this = self._get_unique_index() | |
indexer = this.get_indexer(other) | |
indexer = indexer.take((indexer != -1).nonzero()[0]) | |
label_diff = np.setdiff1d(np.arange(this.size), indexer, | |
assume_unique=True) | |
the_diff = this.values.take(label_diff) |
rng1 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) | ||
@pytest.mark.parametrize("sort", [True, False]) | ||
def test_difference(self, tz, sort): | ||
rng_dates = ['1/2/2000', '1/3/2000', '1/1/2000', '1/4/2000', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to ensure that the sort
parameter was getting a proper test with unsorted data so I have rewritten some tests to have unsorted data (e.g., by manually specifying a list of dates here rather than using date_range
). I have made similar changes to other existing tests.
bf9c4f9
to
d4ec2d9
Compare
can you rebase and fixup |
ea186c0
to
27ae813
Compare
27ae813
to
d030680
Compare
d030680
to
2e4b31a
Compare
rebased. though will look again. |
thanks @reidy-p nice job! |
…fixed * upstream/master: DOC: more consistent flake8-commands in contributing.rst (pandas-dev#23724) DOC: Fixed the doctsring for _set_axis_name (GH 22895) (pandas-dev#22969) DOC: Improve GL03 message re: blank lines at end of docstrings. (pandas-dev#23649) TST: add tests for keeping dtype in Series.update (pandas-dev#23604) TST: For GH4861, Period and datetime in multiindex (pandas-dev#23776) TST: move .str-test to strings.py & parametrize it; precursor to pandas-dev#23582 (pandas-dev#23777) STY: isort tests/scalar, tests/tslibs, import libwindow instead of _window (pandas-dev#23787) BUG: fixed .str.contains(..., na=False) for categorical series (pandas-dev#22170) BUG: Maintain column order with groupby.nth (pandas-dev#22811) API/DEPR: replace kwarg "pat" with "sep" in str.[r]partition (pandas-dev#23767) CLN: Finish isort core (pandas-dev#23765) TST: Mark test_pct_max_many_rows with pytest.mark.single (pandas-dev#23799)
git diff upstream/master -u -- "*.py" | flake8 --diff