-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: unify sort API #5190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
What happens if you have columns named 'index' or 'columns'? |
Is the version of mergesort used stable? Assuming yes there are times then to prefer merge over quick. |
mergesort is stable. the point is that right now |
Fwiw java uses different sorts for primitives vs objects : http://docs.oracle.com/javase/7/docs/api/java/util/Arrays.html#sort(int[], int, int) Stability for objects speed for primitives. |
My point being there might actually be a valid reason to gave different defaults. |
There IS often a valid reason to have different sorts available, that's why there is a keyword, but its incorrect to have a routine which does the same thing have different defaults. aside from the fact that java is NOT a good example of coding :) these are all primitives which are being sorted |
For 0.13 - could we change |
I think we ought to fix the API - but let's think thru |
okay |
You can't make quicksort the default, sort stability is a contract and you can't just break it. numpy arrays and series currently have compatible signatures for sort:
I find it hard to believe that's a coincidence or just an act of cargo-cult when the API was designed. In general, more "consistency" achieved by breaking existing code is not automatically a good move. Once an API is established, the useability argument is moot since more people are used to it then not. There have been too many breaking changes in recent releases and we seem to treat TL;DR: if a 1000 people need to modify working code to accomodate this change, is it still worth doing? |
@y-p okay, so let's use that signature everywhere instead |
You think we should drop the |
quick sort is not the default, not sure where u see that I don't think anything actually needs to change here except allowing inplace arguments (which sort has inplace=True and order has inplace=False) and IMHO just because numpy had setup an API doesn't mean it's what pandas should necessarily do - most of the time yes but by definition pandas does provide a richer environment |
The suggested unified signature:
|
sorry I wrote what the default are! I think they happen to be weird that sorting in place vs a copy has 2 different defaults guess numpy defaults quick sort |
Like I said, I don't think the reason was cargo-cult there must have been a stronger reason |
The more detailed the argument the less effective it is I guess. Will the proposed changes break a large chunk of existing code and if so, is it worth it? |
The sort options are a little confusing so I'm glad to see a unified API in the works. As of 0.14.1, the series |
@patricktokeeffe unfortunately this won't be address in 0.15.0 (unless someone steps up). Its not difficult, just a bit of testing / deprecation. Interested? |
Sounds interesting. Is there a branch in progress? Most of the discussion seems to be in this issue. |
nope just this issue feel free to poke around and see what you would change to make things consistent would ideally be done in 0.15 if we are to break anything |
I do think How about this unified signature? It has:
def sort(self, on_cols=None, on_index=None, axis=0, level=None, ascending=True,
inplace=False, kind='quicksort', na_position='last'): The current default behavior for So the revised proposed changes are:
This still swaps the first two arguments of df.sort_index, but I don't understand how df.sort differs from df.sort_index right now anyway. Regarding syntax:
There is still refinement to do here. I haven't used MultiIndexes, for example, so I don't know quite they should be handled. Probably like they are in the columns/by keywords of df.sort/sort_index but passed to on_index instead? |
Per discussion in #7121, level sorting can be folded into
|
can you post signature for new sort? |
current proposal: def sort(self, on_cols=None, on_index=None, axis=0, level=None, ascending=True,
inplace=False, kind='quicksort', na_position='last', sort_remaining=None): |
why do you need
|
I introduced Would users ever sort on columns and index? Like: |
That last is possible if you wanted to sort by a particular |
@patricktokeeffe can you post a complete proposal (e.g. signature / doc string), functions to add/change, and and API changes to expect. I will upate the top of this PR with it. |
OK. revised proposed signature (complete proposal forthcoming...): def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False,
kind='quicksort', na_position='last', sort_remaining=None): The sort_remaining keyword comes from Series/DataFrame.sortlevel. |
For reference, the 0.14.1 signatures are: Series.sort(axis=0, ascending=True, kind='quicksort', na_position='last', inplace=True)
Series.sort_index(ascending=True)
Series.sortlevel(level=0, ascending=True, sort_remaining=True)
DataFrame.sort(columns=None, axis=0, ascending=True, inplace=False, kind='quicksort',
na_position='last')
DataFrame.sort_index(axis=0, by=None, ascending=True, inplace=False, kind='quicksort',
na_position='last')
DataFrame.sortlevel(level=0, axis=0, ascending=True, inplace=False, sort_remaining=True) Proposed unified signature for # maybe axis first, then by?
def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False,
kind='quicksort', na_position='last', sort_remaining=True):
"""Sort by labels (along either axis), by the values in column(s) or both.
If both, labels take precedence over columns. If neither is specified,
behavior is object-dependent: series = on values, dataframe = on index.
Parameters
----------
by : column name or list of column names
if not None, sort on values in specified column name; perform nested
sort if list of column names specified. this argument ignored by series
axis : {0, 1}
sort index/rows (0) or columns (1); for Series, only 0 can be specified
level : int or level name or list of ints or list of column names
if not None, sort on values in specified index level(s)
ascending : bool or list of bool
Sort ascending vs. descending. Specify list for multiple sort orders.
inplace : bool
if True, perform operation in-place (without creating new instance)
kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}
Choice of sorting algorithm. See np.sort for more information.
‘mergesort’ is the only stable algorithm. For data frames, this option is
only applied when sorting on a single column or label.
na_position : {'first', 'last'}
‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end
sort_remaining : bool
if true and sorting by level and index is multilevel, sort by other levels
too (in order) after sorting by specified level
""" The Series.sort_index(level=0, ascending=True, inplace=False, kind='quicksort',
na_position='last', sort_remaining=True)
DataFrame.sort_index(level=0, axis=0, by=None, ascending=True, inplace=False,
kind='quicksort', na_position='last', sort_remaining=True)
# by is DEPRECATED, see change 7
DataFrame.sort_columns(by=None, level=0, ascending=True, inplace=False,
kind='quicksort', na_position='last', sort_remaining=True)
# or maybe level=None Proposed changes:
Notes:
Syntax:
This is getting pretty hairy... Please point out contradictions/logic flaws/etc. |
I am going to close this issue, and have you open a new one (with your text from above, just copy-past). |
related is #2094
related is #6847 (fixes kind and some arg ordering)
related is #7121 (make
sortlevel
a part ofsort_index
by adding level arg)the sorting API is currently inconsistent and confusing. here is what exists:
Series:
sort
: callsSeries.order
, in-place, defaultsquicksort
order
: do the sort on values, return a new object, defaultsmergesort
sort_index
: sort by labels, returns new objectFrame:
sort
: callssort_index
sort_index
: sorts by the index with no args, otherwise a nested sort of the passed columnsThe semantics are different between
Series
andDataFrame
. InSeries
,sort
mean in-place,order
returns a new object.sort/order
sort on the values, whilesort_index
sorts on the index. For aDataFrame
,sort
andsort_index
are the same and sort on a column/list of columns;inplace
is a keyword.Proposed signature of combined methods. We need to break a
Series
API here. becausesort
is an in-place method which is quite inconsistent with everything else.This is what I think we should do:
Series.sort/order
be the same.index
to provide index sorting (which means sort by the specifiied axis)inplace=False
(which is the same as now, except forSeries.sort
).Series.sort_index
doess.sort('index')
DataFrame.sort_index
doesdf.sort('index')
Series.order
DataFrame.sort_columns
to perform axis=1 sortingThis does switch the argument to the current
sort_index
, (e.g. axis is currently first), but I think then allows more natural syntaxdf.sort()
ordf.sort_index()
ordf.sort_index('index')
sort on the index labelsdf.sort(['A','B'],axis=1)
sort on these columns (allow 'index' here as well to sort on the index too)df.sort_columns()
ordf.sort('columns')
sort on the column labelsdf.sort_columns()
defaultsaxis=1
, sodf.sort_columns(['A','B'])
is equiv of - -df.sort(['A','B'],axis=1)
s.sort()
sort on the valuess.sort('index')
ors.sort_index()
sort on the series indexThe text was updated successfully, but these errors were encountered: