Skip to content

PERF: optimized median func when bottleneck not present #16509

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Jan 22, 2018
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -298,6 +298,7 @@ Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Improved performance of instantiating :class:`SparseDataFrame` (:issue:`16773`)
- Improved performance of :func:`median()` when bottleneck is not installed but :func:`np.namnmedian` is present (:issue:`16468`)
- :attr:`Series.dt` no longer performs frequency inference, yielding a large speedup when accessing the attribute (:issue:`17210`)


Expand Down
17 changes: 13 additions & 4 deletions pandas/core/nanops.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
from distutils.version import LooseVersion

import numpy as np
from pandas import compat
from pandas import compat, _np_version_under1p9
from pandas._libs import tslib, algos, lib
from pandas.core.dtypes.common import (
_get_dtype,
Expand Down Expand Up @@ -344,12 +344,22 @@ def nanmedian(values, axis=None, skipna=True):

values, mask, dtype, dtype_max = _get_values(values, skipna)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use _np_version_under1p19 which are already defined

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then define

if _np_under_version1p19:
       agg1d = get_median
       aggaxis = lambda values, axis: np.apply_along_axis(get_median, axis, values)
else:
      agg1d = np.nanmedian
      aggaxis = lambda values, axis: np.nanmedian(values, axis)

then the code is almost the same as before, just replace with agg1d and aggaxis


if not skipna:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you adding this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests compare nanmedian to numpy's median function. If this special case were not there, this function would return a non-null result for input with null values and skipna turned off. Here is a debug log demonstrating this: https://gist.github.com/rohanp/1082ad5a5e199f6b1cb8ade965c4519f

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so fix the test to do the correct comparison

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the desired behavior for nanmedian to be like median when skipna is turned off though? If not, then what is the skipna flag supposed to do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, so use the skipna flag as a switch when generating which function to use.

return _wrap_results(np.median(values, axis=axis), dtype)

def get_median(x):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this will break in numpy < 1.9, you need to do this conditionally.

mask = notna(x)
if not skipna and not mask.all():
return np.nan
return algos.median(_values_from_object(x[mask]))

if _np_version_under1p9:
agg1d = get_median
aggaxis = lambda values, axis: np.apply_along_axis(get_median, axis, values)
else:
agg1d = np.nanmedian
aggaxis = np.nanmedian

if not is_float_dtype(values):
values = values.astype('f8')
values[mask] = np.nan
Expand All @@ -363,8 +373,7 @@ def get_median(x):
if values.ndim > 1:
# there's a non-empty array to apply over otherwise numpy raises
if notempty:
return _wrap_results(
np.apply_along_axis(get_median, axis, values), dtype)
return _wrap_results(aggaxis(values, axis), dtype)

# must return the correct shape, but median is not defined for the
# empty set so return nans of shape "everything but the passed axis"
Expand All @@ -377,7 +386,7 @@ def get_median(x):
return _wrap_results(ret, dtype)

# otherwise return a scalar value
return _wrap_results(get_median(values) if notempty else np.nan, dtype)
return _wrap_results(agg1d(values) if notempty else np.nan, dtype)


def _get_counts_nanvar(mask, axis, ddof, dtype=float):
Expand Down