Skip to content

WIP/API: Implemented NDFrame.argsort() and NDFrame.ordering(). #12707

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

seth-p
Copy link
Contributor

@seth-p seth-p commented Mar 24, 2016

... still a work in progress

@gfyoung
Copy link
Member

gfyoung commented Mar 24, 2016

@jreback : Whether or not this change is useful/correct aside, should this wait until a PR for fixing the rest of the compatibility issues with fromnumeric.py is done first? That would be useful in paving/shaping the way for the development of other fromnumeric.py functions like this one.

@jreback
Copy link
Contributor

jreback commented Mar 24, 2016

@seth-p not going to accept this. you are adding ANOTHER way of doing things which is simply not necessary w.r.t. our discussion in the other issue. .argsort does exactly what the docs say it should.

I think you really have to have a good case for diverging here and I just don't see a usecase at all.

@jreback jreback closed this Mar 24, 2016
@seth-p
Copy link
Contributor Author

seth-p commented Mar 24, 2016

@jreback, I feel like you didn't actually read everything I wrote in #12694. Let me try again. If you're still not convinced, then I give up.

@gfyoung, could you chime in?

Yes, I pointed out in #12694 (comment) that the documentation of Series.argsort() is correct. The problem isn't that the implementation doesn't correspond to the documentation. The problem is that the implementation/documentation is utterly useless. Not useless in the sense that there's a more efficient way to accomplish the same thing. Not useless in the sense that it's not Pythonic, or not Panda-ic/ish. Useless in the sense that there is no possible logical question to which it is the answer.

Let me give an analogy. Consider the following absurd function. On first glance it might look like a plausible thing to do. And I could probably document it "correctly" with a one-sentence summary that sounds logical. But it is absurd. The resulting Series makes no sense. I would hope that if someone submitted a PR for Series.absurd() that it would be rejected.

In [203]: def absurd(s):
   .....:     return pd.Series(s.values, index=s.index.sort_values()).reindex(s.index)
   .....:

In [204]: s = pd.Series([500, 100, np.nan, 300, 200, 600, np.nan], index=list('bacgfde'))

In [205]: s
Out[205]:
b    500.0
a    100.0
c      NaN
g    300.0
f    200.0
d    600.0
e      NaN
dtype: float64

In [206]: absurd(s)
Out[206]:
b    100.0
a    500.0
c      NaN
g      NaN
f    600.0
d    300.0
e    200.0
dtype: float64

Anyway, let's return to .argsort(). Let me make three statements about a = s.argsort(). Please let me know if you disagree with any of them.

  1. The only valid interpretation of a.values is as indices into s.dropna(). They have no meaning as indices into s. So if you're going to be using them with s.dropna() anyway, you might as well eliminate the putative NaN handling from .argsort() and just use s.dropna().argsort(). In other words, `theNaNhandling in the existing``.argsort()`` complicates things and doesn't help at all.
  2. There is no connection between a.values and a.index (which is equal to s.index). The only plausible use of a is in using a.values and ignoring a.index; there is no way that using a as a Series (in a way that actually uses a.index) makes sense.
  3. In view of 1 and 2 above, defining s.argsort() to return simply the NumPy s.values.argsort() would (a) give a more useful treatment of NaNs, and (b) avoid a spurious and misleading association between a.values and a.index.

In view of 3, I propose to redefine Series.argsort() simply as self.values.argsort(), and also extend it to DataFrame, Panel, and Panel4D. Separately, I propose a separate .ordering() method (for Series, DataFrame, Panel, and Panel4D) that accomplishes what I think whoever implemented Series.argsort() must have thought they were implementing; at any rate, it makes sense, and is potentially useful.

@gfyoung
Copy link
Member

gfyoung commented Mar 24, 2016

@seth-p : Not being a heavy user of argsort in general (both in numpy or pandas), I will not comment on the validity of this PR or whether or not this is a beneficial change, as that is up to @jreback (or other users with merging powers) to decide. However, I will say two things:

  1. Your argument does nothing to answer @jreback original concern, which is use case. You need to illustrate why your changes would be useful in the context of every class you are adding it to.

  2. You can't make the claim that the implementation is "utterly useless." On what grounds can you make that assertion? Your own opinion is not sufficient to support it. There are too many users out there of this library for a single person's opinion to carry any weight.

@seth-p
Copy link
Contributor Author

seth-p commented Mar 24, 2016

@gfyoung: Fair enough.

1a. My proposed implementation of .argsort() for all data structures is as useful as NumPy's. To be honest, I can't think of a clean use in higher dimensions off the top of my head, but in one dimension the use is clear, e.g. a[b.argsort()] will give you a sorted by the values in b. (Perhaps one can somehow accomplish the same thing in higher dimensions, though I haven't figured it out.) If we're going to implement it simply as s.values.argsort(), might as well support higher-dimensional objects as well.

1b. My proposed implementation of .ordering() is useful in the following scenario. Suppose df is a data frame of stock market capitalizations, with df.columns being stocks and df.index being days. Then (-df).ordering() < 500 gives me a mask of the largest 500 stocks on each day, or more simply (-df).ordering() lets me determine the market cap rank of a given stock on a given day.

2 . Yes, I realize that in general one can't make a blanket statement that something is useless, but in this case the connection between the values and the index of the resulting Series is so tenuous/meaningless/nonexistent, that I find it extremely hard to fathom. I challenge anyone to show me an example of a plausible use of s.argsort() in the presence of NaNs that doesn't (a) simply use s.dropna() (so that the NaN handling is superfluous), and (b) makes use of the Series structure (and not just the values).

@jreback
Copy link
Contributor

jreback commented Mar 24, 2016

@seth-p

I would assert these cover all the cases where .argsort would actually be used. Further [4] is way more performant. Unless you can come up with a case which is actually useful, I don't see a problem. It performs exactly as advertised and when used in a pandonic way (meaning with the index still attached) its quite useful. Using ONLY numpy routines is not that useful, but numpy requires you to track way more things manually anyhow.

In [1]: s = pd.Series([500, 100, np.nan, 300, 200, 600, np.nan], index=list('bacgfde'))

In [2]: s
Out[2]: 
b    500.0
a    100.0
c      NaN
g    300.0
f    200.0
d    600.0
e      NaN
dtype: float64

# what you need/want if you actually sort things
In [3]: s.sort_values()
Out[3]: 
a    100.0
f    200.0
g    300.0
b    500.0
d    600.0
c      NaN
e      NaN
dtype: float64

# this is a topk
In [4]: s.nlargest(2)
Out[4]: 
d    600.0
b    500.0
dtype: float64

# of course you can position nans at the beginning or end
In [5]: s.reset_index(drop=True).sort_values()
Out[5]: 
1    100.0
4    200.0
3    300.0
0    500.0
5    600.0
2      NaN
6      NaN
dtype: float64

@jreback
Copy link
Contributor

jreback commented Mar 24, 2016

So maybe we should deprecate / change .argsort, but replacing with another routine is just plain API bloat and not going to happen.

@seth-p
Copy link
Contributor Author

seth-p commented Mar 24, 2016

I'm fine dropping/depracating Series.argsort(). The one thing you lose over my proposed .argsort() is the ability to do a[b.argsort()] where a and b are of the same length but have different indexes, though (i) you could argue that this use case is non-Panda-ish; and (ii) one could always do a[b.values.argsort()]. I feel much stronger about eliminating/depracating the existing Series.argsort() than I do about my proposed replacement.

As for my proposed .ordering(), I don't see a way using DataFrame.sort_values() to accomplish what I described for a DataFrame. Suppose I have two DataFramess, returns and market_caps, each of the form DataFrame(values, index=days, columns=stocks). With my .ordering() I could calculate a Series consisting of the mean return each day of the 500 largest stocks (that day): returns[(-market_caps).ordering() < 500].mean(axis=1). How would one accomplish this using .sort_values() or .nlargest() or some other existing functions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants