ENH: better dtype inference when doing DataFrame reductions #52788
Changes from 52 commits
```diff
@@ -52,7 +52,7 @@ def _reductions(
     axis : int, optional, default None
     """
     if not skipna:
-        if mask.any(axis=axis) or check_below_min_count(values.shape, None, min_count):
+        if mask.any() or check_below_min_count(values.shape, None, min_count):
             return libmissing.NA
         else:
             return func(values, axis=axis, **kwargs)
```
```diff
@@ -119,11 +119,11 @@ def _minmax(
             # min/max with empty array raise in numpy, pandas returns NA
             return libmissing.NA
         else:
-            return func(values)
+            return func(values, axis=axis)
     else:
         subset = values[~mask]
         if subset.size:
-            return func(subset)
+            return func(values, where=~mask, axis=axis, initial=subset[0])
         else:
             # min/max with empty array raise in numpy, pandas returns NA
             return libmissing.NA
```

Review discussion on the new `func(values, where=~mask, axis=axis, initial=subset[0])` call:

- Is there a reason you needed to pass the …
- This is intended to work for 2d arrays returning a 1d array, but without …
- Ah, OK, I see your changes …
- I've changed back to use …
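For readers unfamiliar with the numpy keywords used in the new `_minmax` call, here is a small standalone sketch (toy data, not pandas code) of how `where` and `initial` interact in a reduction:

```python
import numpy as np

# ``where`` restricts the reduction to unmasked positions; ``initial`` gives numpy
# a starting value so slots with nothing left to reduce do not raise.  Note that
# ``initial`` also takes part in the comparison, so it must not be able to "win"
# against real values unless that is intended (hence np.inf for a minimum).
values = np.array([[1.0, 5.0], [7.0, 3.0], [9.0, 2.0]])
mask = np.array([[False, True], [False, False], [True, True]])  # True = missing

col_min = np.min(values, axis=0, where=~mask, initial=np.inf)
print(col_min)  # [1. 3.] -- masked entries are ignored
```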
```diff
@@ -1533,6 +1533,12 @@ def _reduce(self, name: str, *, skipna: bool = True, **kwargs):
 
         return result.as_py()
 
+    def _reduce_and_wrap(self, name: str, *, skipna: bool = True, kwargs):
+        """Takes the result of ``_reduce`` and wraps it in a ndarray/extensionArray."""
+        result = self._reduce_pyarrow(name, skipna=skipna, **kwargs)
+        result = pa.array([result.as_py()], type=result.type)
+        return type(self)(result)
+
     def __setitem__(self, key, value) -> None:
         """Set one or more values inplace.
```

Review discussion on the new `_reduce_and_wrap` method:

- why are you adding another method here? what's wrong with just fixing `_reduce`?
- … Also, we can't supply …
- so why don't you just update `_reduce`?
- There was a version similar to this that added a `keepdims` keyword to `_reduce` and we decided that this was better because it didn't require a deprecation path for third-party EAs.
- `_reduce` calls other methods, e.g. `sum`. It's in those methods that the failures happen when we give `keepdims=True`, and those methods are public. Do we want to change their signatures (and the `._reduce` signature) to include `keepdims`?
- If it's just adding the `keepdims` keyword to …
- Compatibility concerns aside, I think the …
- If a hypothetical EA wanted to do a reduction lazily, that would be much easier with a `keepdims` keyword than with a `_reduce_and_wrap` method. Just a thought, not worth contorting ourselves over a hypothetical EA.
- I'm not opposed to a … One way to address the compatibility concerns could be to introspect the signature of … In v3.0 we'll skip the signature introspection and make the …
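As a rough illustration of the `keepdims` alternative discussed above, a caller could introspect `_reduce` and only pass the keyword when the EA supports it. This is only a hedged sketch of the idea, not code from this PR; `call_reduce` is a hypothetical helper.

```python
import inspect

# Hypothetical compatibility shim: pass keepdims=True only to EAs whose _reduce
# already accepts it; otherwise take the scalar result and wrap it into a
# length-1 array ourselves so the dtype is preserved.
def call_reduce(values, name: str, skipna: bool = True, **kwargs):
    if "keepdims" in inspect.signature(values._reduce).parameters:
        return values._reduce(name, skipna=skipna, keepdims=True, **kwargs)
    result = values._reduce(name, skipna=skipna, **kwargs)
    return type(values)._from_sequence([result], dtype=values.dtype)
```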
```diff
@@ -2099,6 +2099,10 @@ def _reverse_indexer(self) -> dict[Hashable, npt.NDArray[np.intp]]:
     # ------------------------------------------------------------------
     # Reductions
 
+    def _reduce_and_wrap(self, name: str, *, skipna: bool = True, kwargs):
+        result = self._reduce(name, skipna=skipna, **kwargs)
+        return type(self)([result], dtype=self.dtype)
+
     def min(self, *, skipna: bool = True, **kwargs):
         """
         The minimum value of the object.
```

Review discussion on `_reduce_and_wrap` for `Categorical`:

- do we get here with e.g. any/all?
- Categorical doesn't support any/all, IDK why actually; seems like it could, if the categories do. Do you have any specific issue or other array in mind?
- gentle ping...
- maybe a comment that if any/all are ever supported then we shouldn't do this wrapping?
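As an illustration of what this wrapping gives back (assumed behaviour, not a test from the PR):

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c"], ordered=True)

# The same pattern as _reduce_and_wrap above: take the scalar reduction result
# and put it back into a length-1 Categorical that keeps the original dtype.
wrapped = type(cat)([cat.min()], dtype=cat.dtype)
print(wrapped)        # ['a'] with the original (ordered) categories
print(wrapped.dtype)  # category
```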
```diff
@@ -34,6 +34,10 @@
     Shape,
     npt,
 )
+from pandas.compat import (
+    IS64,
+    is_platform_windows,
+)
 from pandas.errors import AbstractMethodError
 from pandas.util._decorators import doc
 from pandas.util._validators import validate_fillna_kwargs
```
```diff
@@ -1088,25 +1092,62 @@ def _reduce(self, name: str, *, skipna: bool = True, **kwargs):
 
         # median, skew, kurt, sem
         op = getattr(nanops, f"nan{name}")
-        result = op(data, axis=0, skipna=skipna, mask=mask, **kwargs)
+        axis = kwargs.pop("axis", None)
+        result = op(data, axis=axis, skipna=skipna, mask=mask, **kwargs)
         if np.isnan(result):
-            return libmissing.NA
+            result = libmissing.NA
 
-        return result
+        return self._wrap_reduction_result(
+            name, result, skipna=skipna, axis=axis, **kwargs
+        )
+
+    def _reduce_and_wrap(self, name: str, *, skipna: bool = True, kwargs):
+        df = self.reshape(-1, 1)
+        res = df._reduce(name=name, skipna=skipna, axis=0, **kwargs)
+        return res
 
     def _wrap_reduction_result(self, name: str, result, skipna, **kwargs):
+        axis = kwargs["axis"]
         if isinstance(result, np.ndarray):
-            axis = kwargs["axis"]
             if skipna:
                 # we only retain mask for all-NA rows/columns
                 mask = self._mask.all(axis=axis)
             else:
                 mask = self._mask.any(axis=axis)
 
             return self._maybe_mask_result(result, mask)
+        elif result is libmissing.NA and self.ndim == 2:
+            result = self._wrap_na_result(name=name, axis=axis)
+            return result
         return result
```
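A small standalone illustration (toy arrays, not pandas internals) of the mask logic in `_wrap_reduction_result` above:

```python
import numpy as np

data_mask = np.array([[False, True, True],
                      [False, False, True]])  # True = NA; columns reduced with axis=0

# skipna=True: a column's result is only NA when the whole column is NA
mask_skipna = data_mask.all(axis=0)     # [False False  True]
# skipna=False: any NA in the column makes the result NA
mask_no_skipna = data_mask.any(axis=0)  # [False  True  True]
print(mask_skipna, mask_no_skipna)
```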
|
||
def _wrap_na_result(self, *, name, axis): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this feels like it got a lot more complicated than the first attempt at "keepdims". does this just address more corner cases the first attempt missed? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, this is not great, and I've been tinkered quite a lot with other implementations. I think/hope I will have a simpler solution tonight. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, I looked into this and unless I do a relatively major refactoring, it looks like I just move complexity around by changing this. So without a bigger rework of Suggestions/ideas that show otherwise welcome, of course. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. to add, the underlying issue is that the functions in The solution would be to return proper 2D results from
and the wrap the result in an array in However, I 'm not sure if that's the direction we want to pursue because unless we want to support 2d masked arrays, the current solution could still be simpler than building out 2d reduction support. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Longer term, that indeed sounds as a good alternative, even for the current 1D-reshaped-as-2D case. One potential disadvantage is that you still need to do some actual calculation on the data, unnecessarily, to get the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We only "know" because of the hard-coding here, for example IMO that's a possible future PR, if we choose to go in that direction. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
In the end, also numpy kind of "hard codes" this, but just in the implementation itself. And we get the difficulty of combining the algo of numpy with our own logic, where we also want to know this information. So I don't think it's necessarily "wrong" to hard code this on our side as well (i.e. essentially keep it as you are doing in this PR, also long term (at least as long as the arrays are only 1D)). Sidenote, from seeing the additional complexity for 32bit systems in deciding the result dtype, I do wonder if we actually want to get rid of that system-dependent behaviour? (we also don't follow numpy's behaviour in the constructors, but always default to int64) |
```diff
+    def _wrap_na_result(self, *, name, axis):
+        mask_size = self.shape[1] if axis == 0 else self.shape[0]
+        mask = np.ones(mask_size, dtype=bool)
+
+        float_dtyp = "float32" if self.dtype == "Float32" else "float64"
+        if name in ["mean", "median", "var", "std", "skew"]:
+            np_dtype = float_dtyp
+        elif name in ["min", "max"] or self.dtype.itemsize == 8:
+            np_dtype = self.dtype.numpy_dtype.name
+        else:
+            is_windows_or_32bit = is_platform_windows() or not IS64
+            int_dtyp = "int32" if is_windows_or_32bit else "int64"
+            uint_dtyp = "uint32" if is_windows_or_32bit else "uint64"
+            np_dtype = {"b": int_dtyp, "i": int_dtyp, "u": uint_dtyp, "f": float_dtyp}[
+                self.dtype.kind
+            ]
+
+        value = np.array([1], dtype=np_dtype)
+        return self._maybe_mask_result(value, mask=mask)
+
+    def _wrap_min_count_reduction_result(
+        self, name: str, result, skipna, min_count, **kwargs
+    ):
+        if min_count == 0 and isinstance(result, np.ndarray):
+            return self._maybe_mask_result(result, np.zeros(result.shape, dtype=bool))
+        return self._wrap_reduction_result(name, result, skipna, **kwargs)
+
     def sum(
         self,
         *,
```
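For context on the review discussion above, here is a hedged sketch of the alternative that was floated: infer the result dtype by running the numpy reduction on an empty array instead of hard-coding it per operation. This is illustrative only; it is not how the PR is implemented, and `infer_result_dtype` is a hypothetical helper.

```python
import numpy as np

def infer_result_dtype(name: str, dtype: np.dtype) -> np.dtype:
    # min/max always preserve the input dtype but raise on empty input,
    # so short-circuit them; for the rest, let numpy decide.
    if name in ("min", "max"):
        return dtype
    empty = np.empty(0, dtype=dtype)
    return getattr(np, name)(empty).dtype

print(infer_result_dtype("sum", np.dtype("int32")))   # int64 on most 64-bit platforms
print(infer_result_dtype("mean", np.dtype("int32")))  # float64 (numpy warns about the empty slice)
```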
```diff
@@ -1124,8 +1165,8 @@ def sum(
             min_count=min_count,
             axis=axis,
         )
-        return self._wrap_reduction_result(
-            "sum", result, skipna=skipna, axis=axis, **kwargs
+        return self._wrap_min_count_reduction_result(
+            "sum", result, skipna=skipna, min_count=min_count, axis=axis, **kwargs
         )
 
     def prod(
```
```diff
@@ -1137,15 +1178,16 @@ def prod(
         **kwargs,
     ):
         nv.validate_prod((), kwargs)
 
         result = masked_reductions.prod(
             self._data,
             self._mask,
             skipna=skipna,
             min_count=min_count,
+            axis=axis,
         )
-        return self._wrap_reduction_result(
-            "prod", result, skipna=skipna, axis=axis, **kwargs
+        return self._wrap_min_count_reduction_result(
+            "prod", result, skipna=skipna, min_count=min_count, axis=axis, **kwargs
         )
 
     def mean(self, *, skipna: bool = True, axis: AxisInt | None = 0, **kwargs):
```
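The behaviour the `min_count` wrapping is preserving, illustrated on a 1D nullable array (outputs assumed from current pandas semantics):

```python
import pandas as pd

arr = pd.array([pd.NA, pd.NA], dtype="Int64")

# With the default min_count=0 an all-NA sum is 0, not NA, so the wrapped
# result must not be masked in that case; min_count=1 flips it back to NA.
print(arr.sum())             # 0
print(arr.sum(min_count=1))  # <NA>
```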
```diff
@@ -1192,23 +1234,29 @@ def std(
 
     def min(self, *, skipna: bool = True, axis: AxisInt | None = 0, **kwargs):
         nv.validate_min((), kwargs)
-        return masked_reductions.min(
+        result = masked_reductions.min(
             self._data,
             self._mask,
             skipna=skipna,
             axis=axis,
         )
+        return self._wrap_reduction_result(
+            "min", result, skipna=skipna, axis=axis, **kwargs
+        )
 
     def max(self, *, skipna: bool = True, axis: AxisInt | None = 0, **kwargs):
         nv.validate_max((), kwargs)
-        return masked_reductions.max(
+        result = masked_reductions.max(
             self._data,
             self._mask,
             skipna=skipna,
             axis=axis,
         )
+        return self._wrap_reduction_result(
+            "max", result, skipna=skipna, axis=axis, **kwargs
+        )
 
-    def any(self, *, skipna: bool = True, **kwargs):
+    def any(self, *, skipna: bool = True, axis: AxisInt | None = 0, **kwargs):
         """
         Return whether any element is truthy.
```

Review discussion on the `axis` keyword added to `any`:

- Is it needed to add the …
- I'll look into it, could be connected to your previous comment.
- I think this works, and I've made another version; will see if it passes, and then I'll look into your other comments.
```diff
@@ -1227,6 +1275,7 @@ def any(self, *, skipna: bool = True, **kwargs):
             If `skipna` is False, the result will still be True if there is
             at least one element that is truthy, otherwise NA will be returned
             if there are NA's present.
+        axis : int, optional, default 0
         **kwargs : any, default None
             Additional keywords have no effect but might be accepted for
             compatibility with NumPy.
@@ -1270,7 +1319,6 @@ def any(self, *, skipna: bool = True, **kwargs):
         >>> pd.array([0, 0, pd.NA]).any(skipna=False)
         <NA>
         """
-        kwargs.pop("axis", None)
         nv.validate_any((), kwargs)
 
         values = self._data.copy()
@@ -1289,7 +1337,7 @@ def any(self, *, skipna: bool = True, **kwargs):
         else:
             return self.dtype.na_value
 
-    def all(self, *, skipna: bool = True, **kwargs):
+    def all(self, *, skipna: bool = True, axis: AxisInt | None = 0, **kwargs):
         """
         Return whether all elements are truthy.
@@ -1308,6 +1356,7 @@ def all(self, *, skipna: bool = True, **kwargs):
             If `skipna` is False, the result will still be False if there is
             at least one element that is falsey, otherwise NA will be returned
             if there are NA's present.
+        axis : int, optional, default 0
         **kwargs : any, default None
             Additional keywords have no effect but might be accepted for
             compatibility with NumPy.
@@ -1351,7 +1400,6 @@ def all(self, *, skipna: bool = True, **kwargs):
         >>> pd.array([1, 0, pd.NA]).all(skipna=False)
         False
         """
-        kwargs.pop("axis", None)
         nv.validate_all((), kwargs)
 
         values = self._data.copy()
@@ -1361,7 +1409,7 @@ def all(self, *, skipna: bool = True, **kwargs):
         # bool, int, float, complex, str, bytes,
         # _NestedSequence[Union[bool, int, float, complex, str, bytes]]]"
         np.putmask(values, self._mask, self._truthy_value)  # type: ignore[arg-type]
-        result = values.all()
+        result = values.all(axis=axis)
 
         if skipna:
             return result
```
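A toy illustration (not pandas internals) of the skipna trick `any`/`all` rely on, now applied along an axis:

```python
import numpy as np

values = np.array([[True, False],
                   [True, True]])
mask = np.array([[False, True],
                 [False, False]])   # True = NA

filled = values.copy()
# For all(), NA positions are overwritten with the "truthy" value so they can
# never flip a column's result; any() does the same with the "falsey" value.
np.putmask(filled, mask, True)
print(filled.all(axis=0))  # [ True  True ] -- per-column result, NAs ignored
```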
Review discussion on the whatsnew note for this change:

- Just wondering, is it worth adding this note? It's a significant bug fix / enhancement (that changes behaviour, so we give it visibility in the whatsnew notes), but we still don't do this for many other fixes / enhancements.
- I'll wait to see what the consensus is here.
- I agree with Joris.
- Ok, I've removed the note.
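For reference, the user-facing effect the note (and this PR) is about, sketched with assumed outputs on a build that includes the change:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, pd.NA]}, dtype="Int64")

result = df.sum()
print(result.dtype)  # Int64 -- the nullable dtype is preserved by the DataFrame reduction
```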