BUG: Maintain column order with groupby.nth #22811

reidy-p · 2018-09-23T10:52:42Z

closes nth() mixes column order #20760
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

pep8speaks · 2018-09-23T10:52:47Z

Hello @reidy-p! Thanks for updating the PR.

There are no PEP8 issues in the file pandas/core/groupby/groupby.py !
There are no PEP8 issues in the file pandas/core/indexes/base.py !
There are no PEP8 issues in the file pandas/core/indexes/interval.py !
There are no PEP8 issues in the file pandas/core/indexes/multi.py !
There are no PEP8 issues in the file pandas/tests/groupby/test_nth.py !
There are no PEP8 issues in the file pandas/tests/indexes/common.py !
There are no PEP8 issues in the file pandas/tests/indexes/datetimes/test_setops.py !
There are no PEP8 issues in the file pandas/tests/indexes/interval/test_interval.py !
There are no PEP8 issues in the file pandas/tests/indexes/multi/test_set_ops.py !
There are no PEP8 issues in the file pandas/tests/indexes/period/test_period.py !
There are no PEP8 issues in the file pandas/tests/indexes/period/test_setops.py !
There are no PEP8 issues in the file pandas/tests/indexes/test_base.py !
There are no PEP8 issues in the file pandas/tests/indexes/timedeltas/test_timedelta.py !

Comment last updated on October 20, 2018 at 22:24 Hours UTC

reidy-p · 2018-09-23T10:55:45Z

pandas/core/groupby/groupby.py

@@ -497,7 +497,8 @@ def _set_group_selection(self):

        if len(groupers):
            # GH12839 clear selected obj cache when group selection changes
-            self._group_selection = ax.difference(Index(groupers)).tolist()
+            self._group_selection = ax.difference(Index(groupers),
+                                                  sort=False).tolist()


Index.difference tries to sort its result by default and this means that sometimes the order of the columns was changed from the original DataFrame. I added a new sort parameter to Index.difference with a default of True to control this.
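
To illustrate the effect described above, here is a minimal sketch using a made-up column index (the names `columns` and `groupers` are illustrative, not taken from the PR). It assumes a pandas version in which `Index.difference` accepts `sort=False`; the default is left untouched because its spelling has differed between releases.

```python
import pandas as pd

# Column order of a hypothetical DataFrame and its grouping key.
columns = pd.Index(["C", "A", "B"])
groupers = pd.Index(["A"])

# Default behaviour: the result is sorted when the labels are comparable.
sorted_selection = columns.difference(groupers).tolist()

# With sort=False the original column order is kept.
ordered_selection = columns.difference(groupers, sort=False).tolist()

print(sorted_selection)   # ['B', 'C']
print(ordered_selection)  # ['C', 'B']
```

The second result is what `_set_group_selection` needs if the selected columns are to come back in the order the user defined them.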

codecov · 2018-09-23T11:59:48Z

Codecov Report

❗ No coverage uploaded for pull request base (master@960a73f).
The diff coverage is 81.81%.

@@            Coverage Diff            @@
##             master   #22811   +/-   ##
=========================================
  Coverage          ?    92.2%           
=========================================
  Files             ?      169           
  Lines             ?    50927           
  Branches          ?        0           
=========================================
  Hits              ?    46955           
  Misses            ?     3972           
  Partials          ?        0

Flag	Coverage Δ
#multiple	`90.62% <81.81%> (?)`
#single	`42.3% <45.45%> (?)`

Impacted Files	Coverage Δ
pandas/core/indexes/multi.py	`95.46% <100%> (ø)`
pandas/core/groupby/groupby.py	`96.47% <100%> (ø)`
pandas/core/indexes/base.py	`96.55% <66.66%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 960a73f...13a23f7. Read the comment docs.

jreback

lgtm. there are a couple of other PRs out there which add a sort=True default to the set ops, though I think they might be for intersection IIRC. can you see if you can locate / reference them?

jreback · 2018-09-23T12:13:36Z

pandas/core/indexes/base.py

+        sort : bool, default True
+            Sort the resulting index if possible
+
+            .. versionadded:: 0.24.0


can you make sure this is added to all subclasses as well (multi, interval)? I think they have their own impl.

can you do this (in this PR)? ideally also update the tests for .difference for all types, parametrizing where appropriate

jreback · 2018-09-23T12:14:10Z

pandas/core/indexes/base.py

-        try:
-            the_diff = sorting.safe_sort(the_diff)
-        except TypeError:
-            pass


can you add some tests in the index tests to exercise this (prob just parameterize the parameter in the tests)
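
For readers following along, the "sort if possible" behaviour that the removed try/except implemented can be sketched in plain Python. This is a toy over lists, not the pandas implementation: `difference_sketch` is a hypothetical helper, and it assumes the sort is attempted only when `sort` is true and silently skipped when the labels are not comparable.

```python
def difference_sketch(values, other, sort=True):
    """Toy order-preserving difference over plain lists (not pandas code)."""
    exclude = set(other)
    seen = set()
    result = []
    for value in values:
        # Keep first occurrences that are not in `other`, in original order.
        if value not in exclude and value not in seen:
            seen.add(value)
            result.append(value)
    if sort:
        try:
            result = sorted(result)
        except TypeError:
            # Mixed, non-comparable labels: fall back to the original order.
            pass
    return result


print(difference_sketch(["c", "b", "a"], ["a"]))              # ['b', 'c']
print(difference_sketch(["c", "b", "a"], ["a"], sort=False))  # ['c', 'b']
print(difference_sketch(["b", 1, "a"], ["a"]))                # ['b', 1]
```

The three calls cover the cases a parametrized test would want: sortable data with and without sorting, and data for which sorting fails.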

jreback · 2018-09-25T14:23:48Z

the other issue is #17378

reidy-p · 2018-09-26T19:49:30Z

#20809 and #17878 are also related. I'm working on trying to expand the pull request to the other set ops but it will take a few more days at least.

jreback · 2018-09-27T00:18:45Z

just wanted to make you aware of the other issues

not super necessary to actually do it in this PR unless it’s straightforward

mroeschke · 2018-09-30T22:14:23Z

Mostly cross referencing for myself. #21603 should become a lot easier when this is completed; first and last could not be written strictly in terms of nth due to the column ordering.

jreback · 2018-10-01T12:10:21Z

lgtm. can you rebase, ping on green.

reidy-p · 2018-10-01T20:04:14Z

@jreback I've rebased now but do you want me to add sort=True to the other set operations in this PR too (or just the subclasses of Index) as you suggested above? I've started it but it needs a bit more work. I can put it in a new PR if it's easier.

@mroeschke I had the exact same thought when I was working on a PR for first and last which is why I opened this PR first!

jreback · 2018-10-07T23:01:29Z

> PR too (or just the subclasses of Index) as you suggested above? I've started it but it needs a bit more work. I can put it in a new PR if it's easier.

yes new PR after this is merged.

jreback · 2018-10-07T23:02:32Z

pandas/core/indexes/base.py

+        sort : bool, default True
+            Sort the resulting index if possible
+
+            .. versionadded:: 0.24.0


can you do this (in this PR)? ideally also update the tests for .difference for all types, parametrizing where appropriate

reidy-p · 2018-10-11T21:58:15Z

I have made some updates but I just realised that I need to add the new sort parameter to the tests for the subclasses of Index

reidy-p · 2018-10-20T22:25:22Z

pandas/core/indexes/interval.py

@@ -1040,7 +1040,11 @@ def func(self, other):
                       'objects that have compatible dtypes')
                raise TypeError(msg.format(op=op_name))

-            result = getattr(self._multiindex, op_name)(other._multiindex)
+            if op_name == 'difference':
+                result = getattr(self._multiindex, op_name)(other._multiindex,


This is a bit awkward at the moment because difference is the only set operation with the sort parameter. But if we add a sort parameter to the other set operations I think we can get rid of the if statement
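
The awkwardness can be seen in a small stand-alone sketch. `ToyIndex` and `apply_setop` are hypothetical names invented for illustration; the only thing taken from the diff is the pattern of dispatching through `getattr` and special-casing `difference`.

```python
class ToyIndex:
    """Hypothetical stand-in for an index; only difference accepts sort."""

    def __init__(self, values):
        self.values = list(values)

    def difference(self, other, sort=True):
        kept = [v for v in self.values if v not in other.values]
        return ToyIndex(sorted(kept) if sort else kept)

    def intersection(self, other):
        return ToyIndex([v for v in self.values if v in other.values])


def apply_setop(left, right, op_name, sort=True):
    """Dispatch to a set operation by name, passing sort only where accepted."""
    method = getattr(left, op_name)
    if op_name == "difference":
        return method(right, sort=sort)
    return method(right)


left = ToyIndex([3, 1, 2])
print(apply_setop(left, ToyIndex([1]), "difference", sort=False).values)  # [3, 2]
print(apply_setop(left, ToyIndex([2, 3]), "intersection").values)         # [3, 2]
```

If every set operation took `sort`, the branch would collapse to a single `method(right, sort=sort)` call.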

reidy-p · 2018-10-20T22:28:49Z

pandas/core/indexes/multi.py

@@ -2791,8 +2791,14 @@ def difference(self, other, sort=True):
                              labels=[[]] * self.nlevels,
                              names=result_names, verify_integrity=False)

-        difference = set(self._ndarray_values) - set(other._ndarray_values)
+        this = self._get_unique_index()


The old way of doing this using set did not preserve the original order so I took this code from the difference method in pandas/core/indexes/base.py:

pandas/pandas/core/indexes/base.py

Lines 2950 to 2957 in 145c227

        this = self._get_unique_index()

        indexer = this.get_indexer(other)
        indexer = indexer.take((indexer != -1).nonzero()[0])

        label_diff = np.setdiff1d(np.arange(this.size), indexer,
                                  assume_unique=True)
        the_diff = this.values.take(label_diff)
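
The quoted lines can be tried on a flat integer index as a rough, runnable sketch. Two things here are assumptions rather than part of the PR: the wrapper name `ordered_difference` is made up, and the public `unique()` stands in for the private `_get_unique_index()`. The MultiIndex case discussed above is not covered.

```python
import numpy as np
import pandas as pd


def ordered_difference(left, right):
    """Labels of `left` not in `right`, in their original order."""
    this = left.unique()  # public stand-in for the private _get_unique_index()
    # Position in `this` of each label of `right`; -1 marks labels not found.
    indexer = this.get_indexer(right)
    indexer = indexer.take((indexer != -1).nonzero()[0])
    # Positions of `this` that were not matched, in ascending position order.
    label_diff = np.setdiff1d(np.arange(this.size), indexer,
                              assume_unique=True)
    return this.values.take(label_diff)


left = pd.Index([30, 10, 20, 40])
right = pd.Index([10, 99])
print(list(ordered_difference(left, right)))  # [30, 20, 40]
```

Because the difference is taken over positions rather than over a Python `set` of labels, the surviving labels come back in the order they had in `left`.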

reidy-p · 2018-10-20T22:30:08Z

pandas/tests/indexes/datetimes/test_setops.py

-        rng1 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz)
+    @pytest.mark.parametrize("sort", [True, False])
+    def test_difference(self, tz, sort):
+        rng_dates = ['1/2/2000', '1/3/2000', '1/1/2000', '1/4/2000',


I wanted to ensure that the sort parameter was getting a proper test with unsorted data so I have rewritten some tests to have unsorted data (e.g., by manually specifying a list of dates here rather than using date_range). I have made similar changes to other existing tests.
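
A test in this spirit might look like the sketch below. It is not the test from the PR: the function name and dates are invented, and it parametrizes over `None` and `False` on the assumption that these are the values accepted by the installed pandas (the PR itself used `True` and `False`).

```python
import pandas as pd
import pytest


@pytest.mark.parametrize("sort", [None, False])
def test_difference_unsorted(sort):
    # Deliberately unsorted input, so a sorted result is distinguishable
    # from an order-preserving one.
    dates = ["2000-01-02", "2000-01-03", "2000-01-01",
             "2000-01-04", "2000-01-05"]
    rng = pd.DatetimeIndex(dates)
    other = pd.DatetimeIndex(["2000-01-04", "2000-01-06"])

    expected = pd.DatetimeIndex(["2000-01-02", "2000-01-03",
                                 "2000-01-01", "2000-01-05"])
    if sort is None:
        expected = expected.sort_values()

    result = rng.difference(other, sort=sort)
    assert list(result) == list(expected)
```

With input built by `date_range`, both branches would produce the same result and the `sort=False` path would go untested.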

jreback · 2018-11-01T01:19:01Z

can you rebase and fixup

jreback · 2018-11-18T22:59:33Z

rebased. though will look again.

jreback · 2018-11-20T01:10:58Z

thanks @reidy-p nice job!

…fixed
* upstream/master:
  DOC: more consistent flake8-commands in contributing.rst (pandas-dev#23724)
  DOC: Fixed the doctsring for _set_axis_name (GH 22895) (pandas-dev#22969)
  DOC: Improve GL03 message re: blank lines at end of docstrings. (pandas-dev#23649)
  TST: add tests for keeping dtype in Series.update (pandas-dev#23604)
  TST: For GH4861, Period and datetime in multiindex (pandas-dev#23776)
  TST: move .str-test to strings.py & parametrize it; precursor to pandas-dev#23582 (pandas-dev#23777)
  STY: isort tests/scalar, tests/tslibs, import libwindow instead of _window (pandas-dev#23787)
  BUG: fixed .str.contains(..., na=False) for categorical series (pandas-dev#22170)
  BUG: Maintain column order with groupby.nth (pandas-dev#22811)
  API/DEPR: replace kwarg "pat" with "sep" in str.[r]partition (pandas-dev#23767)
  CLN: Finish isort core (pandas-dev#23765)
  TST: Mark test_pct_max_many_rows with pytest.mark.single (pandas-dev#23799)

reidy-p commented Sep 23, 2018

jreback added the Bug label Sep 23, 2018

jreback added this to the 0.24.0 milestone Sep 23, 2018

jreback added the Groupby label Sep 23, 2018

jreback requested changes Sep 23, 2018

jreback approved these changes Oct 1, 2018

reidy-p force-pushed the nth_column_order branch from dc2428c to acf06ad Compare October 1, 2018 19:59

reidy-p force-pushed the nth_column_order branch from acf06ad to 67218b7 Compare October 3, 2018 22:34

jreback requested changes Oct 7, 2018

reidy-p force-pushed the nth_column_order branch 3 times, most recently from 8f881c8 to cdc638b Compare October 11, 2018 21:49

reidy-p force-pushed the nth_column_order branch 3 times, most recently from 922f1eb to f7446b5 Compare October 20, 2018 22:23

reidy-p commented Oct 20, 2018

reidy-p force-pushed the nth_column_order branch 2 times, most recently from bf9c4f9 to d4ec2d9 Compare October 28, 2018 16:00

reidy-p force-pushed the nth_column_order branch 3 times, most recently from ea186c0 to 27ae813 Compare November 3, 2018 14:24

reidy-p force-pushed the nth_column_order branch from 27ae813 to d030680 Compare November 10, 2018 23:21

reidy-p added 3 commits November 10, 2018 23:23

BUG: Maintain column order with groupby.nth

bc68c37

Add optional sort parameter to difference method in subclasses

351138d

add more tests

2e4b31a

reidy-p force-pushed the nth_column_order branch from d030680 to 2e4b31a Compare November 10, 2018 23:25

Merge branch 'master' into PR_TOOL_MERGE_PR_22811

13a23f7

jreback approved these changes Nov 18, 2018

jreback merged commit 71ba5bf into pandas-dev:master Nov 20, 2018

reidy-p deleted the nth_column_order branch December 8, 2018 13:05

reidy-p mentioned this pull request Dec 28, 2018

ENH: Add sort parameter to other set operations if possible #24471

Closed

6 tasks

jorisvandenbossche mentioned this pull request Jan 28, 2019

Index.intersection changed behavior to sort by default in pandas 0.24 #24959

Closed

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

BUG: Maintain column order with groupby.nth (pandas-dev#22811)

a5ad5fc

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

BUG: Maintain column order with groupby.nth (pandas-dev#22811)

01ea768

wence- mentioned this pull request Nov 21, 2023

BUG: Discrepency between documentation and output for outer merge on index when left and right indices match and are unique #55992

Closed

3 tasks
