ENH: better dtype inference when doing DataFrame reductions #52788


Merged
merged 79 commits into from
Jul 13, 2023
Changes from 52 commits
79 commits
1e7e563
ENH: better dtype inference when doing DataFrame reductions
topper-123 Apr 19, 2023
6397977
precommit issues
topper-123 Apr 19, 2023
0e797b9
fix failures
topper-123 Apr 19, 2023
b846e70
fix failures
topper-123 Apr 19, 2023
76ce594
mypy + some docs
topper-123 Apr 20, 2023
7644598
doc linting linting
topper-123 Apr 20, 2023
51da9ef
refactor to use _reduce_with_wrap
topper-123 Apr 20, 2023
8d925cd
docstring linting
topper-123 Apr 20, 2023
d7d1989
pyarrow failure + linting
topper-123 Apr 20, 2023
54bcb60
pyarrow failure + linting
topper-123 Apr 20, 2023
03b8ce4
linting
topper-123 Apr 20, 2023
e0af36f
doc stuff
topper-123 Apr 20, 2023
64d8d60
linting fixes
topper-123 Apr 21, 2023
a95e5b9
fix fix doc string
topper-123 Apr 22, 2023
e7a75e4
remove _wrap_na_result
topper-123 Apr 22, 2023
2e64191
doc string example
topper-123 Apr 23, 2023
b6c1dc8
pyarrow + categorical
topper-123 Apr 24, 2023
32f9a73
silence bugs
topper-123 Apr 25, 2023
8bf7ba8
silence errors
topper-123 Apr 25, 2023
35b07c5
silence errors II
topper-123 Apr 25, 2023
6a390d4
fix errors III
topper-123 Apr 25, 2023
8dc2acf
various fixups
topper-123 Apr 25, 2023
5a65c70
various fixups
topper-123 Apr 25, 2023
9cb34ec
delay fixing windows and 32bit failures
topper-123 Apr 26, 2023
8521f18
BUG: Adding a columns to a Frame with RangeIndex columns using a non-…
topper-123 Apr 23, 2023
82cd91e
DOC: Update whatsnew (#52882)
phofl Apr 23, 2023
e0bc63e
CI: Change development python version to 3.10 (#51133)
phofl Apr 26, 2023
7cf26ae
update
topper-123 Apr 27, 2023
6330840
update
topper-123 Apr 29, 2023
efae9dc
add docs
topper-123 May 1, 2023
b585f3b
fix windows tests
topper-123 May 1, 2023
52763ab
fix windows tests
topper-123 May 1, 2023
d4f2a84
remove guards for 32bit linux
topper-123 May 2, 2023
7bfe3fe
add bool tests + fix 32-bit failures
topper-123 May 2, 2023
f48ea09
fix pre-commit failures
topper-123 May 2, 2023
bbd8cb8
fix mypy failures
topper-123 May 2, 2023
c6e9a80
rename _reduce_with -> _reduce_and_wrap
topper-123 May 2, 2023
5200896
assert missing attributes
topper-123 May 2, 2023
26d4059
reduction dtypes on windows and 32bit systems
topper-123 May 3, 2023
b6bd75e
add tests for min_count=0
topper-123 May 3, 2023
44dcdce
PERF:median with axis=1
topper-123 May 4, 2023
3ebcbff
median with axis=1 fix
topper-123 May 4, 2023
99d034e
streamline Block.reduce
topper-123 May 5, 2023
79df9db
fix comments
topper-123 May 6, 2023
d01fc1d
FIX preserve dtype with datetime columns of different resolution when…
glemaitre May 14, 2023
bc582f6
BUG Merge not behaving correctly when having `MultiIndex` with a sing…
Charlie-XIAO May 16, 2023
a7fd1b1
BUG: preserve dtype for right/outer merge of datetime with different …
jorisvandenbossche May 17, 2023
1781d30
remove special BooleanArray.sum method
topper-123 May 22, 2023
68fd316
remove BooleanArray.prod
topper-123 May 23, 2023
8ceb57d
fixes
topper-123 May 27, 2023
4375cb2
Update doc/source/whatsnew/v2.1.0.rst
topper-123 May 29, 2023
f7b354c
Update pandas/core/array_algos/masked_reductions.py
topper-123 May 29, 2023
f91c6ca
small cleanup
topper-123 May 29, 2023
9a881fa
small cleanup
topper-123 May 29, 2023
9d50f85
Merge branch 'master' into reduction_dtypes_II
topper-123 May 31, 2023
026696f
Merge branch 'master' into reduction_dtypes_II
topper-123 May 31, 2023
f603de0
only reduce 1d
topper-123 May 31, 2023
a7e69ad
Merge branch 'reduction_dtypes_II' of https://github.com/topper-123/p…
topper-123 May 31, 2023
772998f
fix after #53418
topper-123 May 31, 2023
b20a289
Merge branch 'master' into reduction_dtypes_II
topper-123 Jun 1, 2023
082ddd9
update according to comments
topper-123 Jun 3, 2023
8032514
revome note
topper-123 Jun 3, 2023
3a3ec95
update _minmax
topper-123 Jun 5, 2023
77992f7
Merge branch 'master' into reduction_dtypes_II
topper-123 Jun 5, 2023
23f22fb
Merge branch 'master' into reduction_dtypes_II
topper-123 Jun 10, 2023
3b8d8f0
Merge branch 'master' into reduction_dtypes_II
topper-123 Jun 10, 2023
1e39b65
Merge branch 'master' into reduction_dtypes_II
topper-123 Jun 19, 2023
1ed3e2d
Merge branch 'master' into reduction_dtypes_II
topper-123 Jun 24, 2023
467073a
Merge branch 'master' into reduction_dtypes_II
topper-123 Jun 27, 2023
dd0bfe8
Merge branch 'master' into reduction_dtypes_II
topper-123 Jun 29, 2023
49334c7
REF: add keepdims parameter to ExtensionArray._reduce + remove Extens…
topper-123 Jun 29, 2023
5634106
REF: add keepdims parameter to ExtensionArray._reduce + remove Extens…
topper-123 Jun 29, 2023
f85deab
fix whatsnew
topper-123 Jun 29, 2023
6519712
fix _reduce call
topper-123 Jun 29, 2023
74410f6
Merge branch 'master' into reduction_dtypes_II
topper-123 Jul 7, 2023
e7503dc
Merge branch 'master' into reduction_dtypes_II
topper-123 Jul 12, 2023
24e2d11
Merge branch 'master' into reduction_dtypes_II
topper-123 Jul 12, 2023
e3afa18
simplify test
topper-123 Jul 12, 2023
899a2fb
add tests for any/all
topper-123 Jul 13, 2023
1 change: 1 addition & 0 deletions doc/source/reference/extensions.rst
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@ objects.
api.extensions.ExtensionArray._from_sequence_of_strings
api.extensions.ExtensionArray._hash_pandas_object
api.extensions.ExtensionArray._reduce
api.extensions.ExtensionArray._reduce_and_wrap
api.extensions.ExtensionArray._values_for_argsort
api.extensions.ExtensionArray._values_for_factorize
api.extensions.ExtensionArray.argsort
Expand Down
9 changes: 8 additions & 1 deletion doc/source/user_guide/integer_na.rst
Original file line number Diff line number Diff line change
Expand Up @@ -126,13 +126,20 @@ These dtypes can be merged, reshaped & casted.
pd.concat([df[["A"]], df[["B", "C"]]], axis=1).dtypes
df["A"].astype(float)

Reduction and groupby operations such as 'sum' work as well.
Reduction and groupby operations such as :meth:`~DataFrame.sum` work as well.

.. ipython:: python

df.sum(numeric_only=True)
df.sum()
df.groupby("B").A.sum()

.. versionchanged:: 2.1.0

When doing reduction operations (:meth:`~DataFrame.sum` etc.) on numeric-only data
frames, the integer array dtype will be maintained. Previously, the dtype of the
reduction result would have been a numpy numeric dtype.
Member

Just wondering, is it worth adding this note? It's a significant bug fix / enhancement (that changes behaviour, so we give it visibility in the whatsnew notes), but we still don't do this for many other fixes / enhancements

Contributor Author

I'll wait to see what the consensus is here.

Member

I agree with Joris

Contributor Author

Ok, I've removed the note.


Scalar NA Value
---------------

Expand Down
6 changes: 6 additions & 0 deletions doc/source/user_guide/pyarrow.rst
Original file line number Diff line number Diff line change
Expand Up @@ -152,6 +152,12 @@ The following are just some examples of operations that are accelerated by nativ
ser_dt = pd.Series([datetime(2022, 1, 1), None], dtype=pa_type)
ser_dt.dt.strftime("%Y-%m")

.. versionchanged:: 2.1.0

When doing :class:`DataFrame` reduction operations (:meth:`~DataFrame.sum` etc.) on
pyarrow data, the dtype will now be maintained when possible. Previously, the dtype
of the reduction result would have been a numpy numeric dtype.

I/O Reading
-----------

Expand Down
39 changes: 36 additions & 3 deletions doc/source/whatsnew/v2.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -14,10 +14,43 @@ including other versions of pandas.
Enhancements
~~~~~~~~~~~~

.. _whatsnew_210.enhancements.enhancement1:
.. _whatsnew_210.enhancements.reduction_extension_dtypes:

enhancement1
^^^^^^^^^^^^
DataFrame reductions preserve extension dtypes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions of pandas, the results of DataFrame reductions
(:meth:`DataFrame.sum`, :meth:`DataFrame.mean`, etc.) had numpy dtypes even when the DataFrames
were of extension dtypes. Pandas can now keep the dtypes when doing reductions over DataFrame
columns with a common dtype (:issue:`52788`).

*Old Behavior*

.. code-block:: ipython

In [1]: df = pd.DataFrame({"a": [1, 1, 2, 1], "b": [np.nan, 2.0, 3.0, 4.0]}, dtype="Int64")
In [2]: df.sum()
Out[2]:
a 5
b 9
dtype: int64
In [3]: df = df.astype("int64[pyarrow]")
In [4]: df.sum()
Out[4]:
a 5
b 9
dtype: int64

*New Behavior*

.. ipython:: python

df = pd.DataFrame({"a": [1, 1, 2, 1], "b": [np.nan, 2.0, 3.0, 4.0]}, dtype="Int64")
df.sum()
df = df.astype("int64[pyarrow]")
df.sum()

Notice that the dtype is now a masked dtype and a pyarrow dtype, respectively, while previously it was a numpy integer dtype.

.. _whatsnew_210.enhancements.enhancement2:

Expand Down
6 changes: 3 additions & 3 deletions pandas/core/array_algos/masked_reductions.py
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ def _reductions(
axis : int, optional, default None
"""
if not skipna:
if mask.any(axis=axis) or check_below_min_count(values.shape, None, min_count):
if mask.any() or check_below_min_count(values.shape, None, min_count):
return libmissing.NA
else:
return func(values, axis=axis, **kwargs)
Expand Down Expand Up @@ -119,11 +119,11 @@ def _minmax(
# min/max with empty array raise in numpy, pandas returns NA
return libmissing.NA
else:
return func(values)
return func(values, axis=axis)
else:
subset = values[~mask]
if subset.size:
return func(subset)
return func(values, where=~mask, axis=axis, initial=subset[0])
Member

Is there a reason you needed to pass the axis keyword here? I don't think this can ever work / have an impact right now, since values will be flattened because of the subsetting with mask.

Contributor Author

This is intended to work for 2d arrays returning a 1d array, but without the axis keyword it defaults to axis=None, which won't return a 1d array.

Member

Ah, OK, I see you changed func(subset) to func(values, ..) (so calculating on the 2D values instead of the 1D subset). But in that case, creating subset is unnecessary overhead, only to essentially check that mask is not all-True.

Contributor Author

I've changed back to using subset. To use where=~mask, the func requires initial to have a value, so there is no real saving from using where=~mask in this case.
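The trade-off being discussed can be seen directly in numpy (a minimal sketch with made-up data; `where=` requires `initial` so that fully-masked lanes have a defined start value):

```python
import numpy as np

values = np.array([[5.0, 2.0],
                   [np.nan, 7.0]])
mask = np.isnan(values)  # True marks missing entries

# subsetting flattens to 1D, so the per-column structure is lost
flat_min = values[~mask].min()  # 2.0

# where= keeps the axis, but numpy then requires an `initial` value
col_min = np.min(values, axis=0, where=~mask, initial=np.inf)  # array([5., 2.])
```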

else:
# min/max with empty array raise in numpy, pandas returns NA
return libmissing.NA
Expand Down
6 changes: 6 additions & 0 deletions pandas/core/arrays/arrow/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -1533,6 +1533,12 @@ def _reduce(self, name: str, *, skipna: bool = True, **kwargs):

return result.as_py()

def _reduce_and_wrap(self, name: str, *, skipna: bool = True, kwargs):
Contributor

why are you adding another method here? what's wrong with just fixing _reduce?

Contributor Author

_reduce on 1d arrays only returns a scalar, and we can't differentiate between scalars from reductions of e.g. numpy.int64 and pandas Int64 arrays. Reductions that return pd.NA are just as bad, because pd.NA holds no dtype info.

Also, we can't supply keepdims to _reduce, because pandas raises when keepdims is given as a parameter in the reduction methods.
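The ambiguity is easy to reproduce (a minimal sketch; the exact scalar types can vary across pandas/numpy versions):

```python
import numpy as np
import pandas as pd

masked = pd.array([1, 2, pd.NA], dtype="Int64")
plain = np.array([4, 5, 6], dtype=np.int64)

# both reductions come back as bare scalars, so the caller can no longer
# tell which array dtype produced them
masked_total = masked.sum()  # 3
plain_total = plain.sum()    # 15

# and an all-NA reduction is just pd.NA, which carries no dtype info at all
na_total = pd.array([pd.NA], dtype="Int64").sum(skipna=False)
```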

Contributor

so why don't you just update _reduce?

Member

There was a version similar to this that added a keepdims keyword to _reduce, and we decided that this approach was better because it didn't require a deprecation path for 3rd-party EAs.

Contributor Author

_reduce calls other methods, e.g. sum. It's in those methods that the failures happen when we pass keepdims=True, and those methods are public. Do we want to change their signatures (and the ._reduce signature) to include keepdims?

Contributor Author

If it's just adding the keepdims keyword to _reduce, that will be relatively easy technically. It's adding it to sum etc. that will probably take more effort. Also note @jbrockmendel's comment about the deprecation path.

Member

Compatibility concerns aside, I think the keepdims argument to _reduce is the nicer solution.
But, external EAs don't have this keyword, so that means we would need to add some compatibility code anyway, everywhere we call the _reduce method (check if it supports the new keyword, and if not, still wrap the result in an np.array([res]), just like the current base implementation of _reduce_and_wrap does). With that, I am not sure that would be an improvement over the current solution.

Member

If a hypothetical EA wanted to do a reduction lazily, that would be much easier with a keepdims keyword than with a _reduce_and_wrap method. Just a thought, not worth contorting ourselves over a hypothetical EA
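For comparison, this mirrors numpy's own `keepdims`, where the reduction result stays an array and therefore keeps its dtype (sketch):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0], dtype=np.float32)

scalar = a.sum()             # a bare scalar; the array container is gone
kept = a.sum(keepdims=True)  # array([6.], dtype=float32): container and dtype survive
```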

Contributor Author (topper-123, Jun 25, 2023)

I'm not opposed to a keepdims parameter in _reduce, except for the compatibility concerns, but I would like a decision.

One way to address the compatibility concerns could be to introspect the signature of _reduce to see if it has a keepdims parameter or not. If it does, call _reduce with keepdims=True when doing dataframe reductions. If it doesn't, call it without a keepdims parameter, emit a warning that keepdims will become required in the future and wrap the scalar reduction result in a numpy array like result = np.array(result).reshape(1), to keep the current behavior.

In v3.0 we'll skip the signature introspection and make the keepdims parameter required.
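The signature-introspection fallback sketched in this comment could look roughly like this (`ThirdPartyArray` and `reduce_and_wrap` are made-up names for illustration, not pandas API):

```python
import inspect
import warnings

import numpy as np

class ThirdPartyArray:
    # an older extension array whose _reduce predates any keepdims parameter
    def _reduce(self, name, *, skipna=True, **kwargs):
        return 3  # always a bare scalar

def reduce_and_wrap(arr, name, *, skipna=True):
    params = inspect.signature(type(arr)._reduce).parameters
    if "keepdims" in params:
        # new-style EA: let it return a length-1 array itself
        return arr._reduce(name, skipna=skipna, keepdims=True)
    warnings.warn(
        "'keepdims' will become a required parameter of _reduce",
        FutureWarning,
    )
    # old-style EA: wrap the scalar result to keep the current behavior
    return np.array([arr._reduce(name, skipna=skipna)])
```

Calling `reduce_and_wrap(ThirdPartyArray(), "sum")` takes the fallback branch, emitting the FutureWarning and returning `np.array([3])`.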

"""Takes the result of ``_reduce`` and wraps it an a ndarray/extensionArray."""
result = self._reduce_pyarrow(name, skipna=skipna, **kwargs)
result = pa.array([result.as_py()], type=result.type)
return type(self)(result)

def __setitem__(self, key, value) -> None:
"""Set one or more values inplace.

Expand Down
29 changes: 29 additions & 0 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -136,6 +136,7 @@ class ExtensionArray:
_from_sequence_of_strings
_hash_pandas_object
_reduce
_reduce_and_wrap
_values_for_argsort
_values_for_factorize

Expand Down Expand Up @@ -184,6 +185,7 @@ class ExtensionArray:

* _accumulate
* _reduce
* _reduce_and_wrap

One can implement methods to handle parsing from strings that will be used
in methods such as ``pandas.io.parsers.read_csv``.
Expand Down Expand Up @@ -1425,6 +1427,11 @@ def _reduce(self, name: str, *, skipna: bool = True, **kwargs):
Raises
------
TypeError : subclass does not define reductions

See Also
--------
ExtensionArray._reduce_and_wrap
Calls ``_reduce`` and wraps the result in a ndarray/ExtensionArray.
"""
meth = getattr(self, name, None)
if meth is None:
Expand All @@ -1434,6 +1441,28 @@ def _reduce(self, name: str, *, skipna: bool = True, **kwargs):
)
return meth(skipna=skipna, **kwargs)

def _reduce_and_wrap(self, name: str, *, skipna: bool = True, kwargs):
"""
Call ``_reduce`` and wrap the result in an ndarray/ExtensionArray.

This is used to control the returned dtype when doing reductions in DataFrames,
and ensures the correct dtype for e.g. ``DataFrame({"a": extr_arr2}).sum()``.

Returns
-------
ndarray or ExtensionArray

Examples
--------
>>> arr = pd.array([1, 2, pd.NA])
>>> arr._reduce_and_wrap("sum", kwargs={})
<IntegerArray>
[3]
Length: 1, dtype: Int64
"""
result = self._reduce(name, skipna=skipna, **kwargs)
return np.array([result])

# https://github.com/python/typeshed/issues/2148#issuecomment-520783318
# Incompatible types in assignment (expression has type "None", base class
# "object" defined the type as "Callable[[object], int]")
Expand Down
4 changes: 4 additions & 0 deletions pandas/core/arrays/categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -2099,6 +2099,10 @@ def _reverse_indexer(self) -> dict[Hashable, npt.NDArray[np.intp]]:
# ------------------------------------------------------------------
# Reductions

def _reduce_and_wrap(self, name: str, *, skipna: bool = True, kwargs):
result = self._reduce(name, skipna=skipna, **kwargs)
return type(self)([result], dtype=self.dtype)
Member

do we get here with e.g. any/all?

Contributor Author (topper-123, May 23, 2023)

Categorical doesn't support any/all, IDK why actually, seems like it could, if the categories do.

Do you have any specific issue or other array in mind?

Contributor Author

gentle ping...

Member

maybe a comment that if any/all are ever supported then we shouldn't do this wrapping?
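For the reductions Categorical does support today (min/max on ordered categories), the wrapping is straightforward (a minimal sketch):

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "a"], ordered=True)

res = cat.min()  # 'a', a bare category value with no dtype info
wrapped = type(cat)([res], dtype=cat.dtype)  # length-1 Categorical

# the categorical dtype (categories + ordered flag) survives the reduction
wrapped.dtype == cat.dtype  # True
```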


def min(self, *, skipna: bool = True, **kwargs):
"""
The minimum value of the object.
Expand Down
80 changes: 64 additions & 16 deletions pandas/core/arrays/masked.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,10 @@
Shape,
npt,
)
from pandas.compat import (
IS64,
is_platform_windows,
)
from pandas.errors import AbstractMethodError
from pandas.util._decorators import doc
from pandas.util._validators import validate_fillna_kwargs
Expand Down Expand Up @@ -1088,25 +1092,62 @@ def _reduce(self, name: str, *, skipna: bool = True, **kwargs):

# median, skew, kurt, sem
op = getattr(nanops, f"nan{name}")
result = op(data, axis=0, skipna=skipna, mask=mask, **kwargs)

axis = kwargs.pop("axis", None)
result = op(data, axis=axis, skipna=skipna, mask=mask, **kwargs)
if np.isnan(result):
return libmissing.NA
result = libmissing.NA

return result
return self._wrap_reduction_result(
name, result, skipna=skipna, axis=axis, **kwargs
)

def _reduce_and_wrap(self, name: str, *, skipna: bool = True, kwargs):
df = self.reshape(-1, 1)
res = df._reduce(name=name, skipna=skipna, axis=0, **kwargs)
return res

def _wrap_reduction_result(self, name: str, result, skipna, **kwargs):
axis = kwargs["axis"]
if isinstance(result, np.ndarray):
axis = kwargs["axis"]
if skipna:
# we only retain mask for all-NA rows/columns
mask = self._mask.all(axis=axis)
else:
mask = self._mask.any(axis=axis)

return self._maybe_mask_result(result, mask)
elif result is libmissing.NA and self.ndim == 2:
result = self._wrap_na_result(name=name, axis=axis)
return result
return result

def _wrap_na_result(self, *, name, axis):
Member

this feels like it got a lot more complicated than the first attempt at "keepdims". does this just address more corner cases the first attempt missed?

Contributor Author

Yeah, this is not great, and I've tinkered quite a lot with other implementations. I think/hope I will have a simpler solution tonight.

Contributor Author

Ok, I looked into this and unless I do a relatively major refactoring, it looks like I just move complexity around by changing this. So without a bigger rework of arrays_algos, I don't see any clearly better method.

Suggestions/ideas that show otherwise welcome, of course.

Contributor Author

To add: the underlying issue is that the functions in masked_reductions.py only return scalar NA values. This means we don't get the type information when doing reductions that return NA, but have to infer it, and the inference is complex because it depends on the calling method, the dtype of the calling data, and the OS platform.

The solution would be to return proper 2D results from the masked_reductions.py functions, i.e. return a tuple instead of a scalar:

  1. return (scalar_value, None) when returning a scalar
  2. return (value_array, mask_array) when returning a masked array

and then wrap the result in an array in BaseMaskedArray._wrap_reduction_result when it should be a masked array.

However, I'm not sure if that's the direction we want to pursue because, unless we want to support 2d masked arrays, the current solution could still be simpler than building out 2d reduction support.
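A minimal sketch of that (value_array, mask_array) return convention for a 2D masked sum (the function name and shape handling here are made up for illustration, not the PR's implementation):

```python
import numpy as np

def masked_sum_2d(values, mask, axis=0):
    # treat masked entries as 0 for the sum itself
    filled = np.where(mask, 0, values)
    value_array = filled.sum(axis=axis)
    # a result position is NA only when every contributing entry was masked
    mask_array = mask.all(axis=axis)
    return value_array, mask_array

values = np.array([[1, 2], [3, 4]])
mask = np.array([[False, True], [False, True]])
vals, out_mask = masked_sum_2d(values, mask)  # ([4, 0], [False, True])
```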

Member

Longer term, that indeed sounds as a good alternative, even for the current 1D-reshaped-as-2D case. One potential disadvantage is that you still need to do some actual calculation on the data, unnecessarily, to get the value_array return value (so we can use that to determine the result dtype). Calculating the actual value is "unnecessary" if you know the result will be masked. Of course it avoids hardcoding result dtypes. But for the case of skipna=False with a large array, that might introduce a slowdown?

Contributor Author

We only "know" because of the hard-coding here, for example Series([1, 2, pd.NA], dtype="int8").sum() is Int32 on some systems and Int64 on others. So to avoid the hardcoding we will have to do a calculation, so it's a trade-off (or we could hard-code the dtype when we're 100 % sure we get a NA, but that'll be about the same complexity as now + the extra code for doing real 2d ops). I think we should only do the 2d ops if we want it independent of this issue here.

IMO that's a possible future PR, if we choose to go in that direction.
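The platform dependence mentioned here comes straight from numpy, which promotes sums of small integer dtypes to the default platform integer (sketch; the observed dtype differs between 64-bit Unix and Windows/32-bit systems):

```python
import numpy as np

total = np.array([1, 2, 3], dtype=np.int8).sum()
# total.dtype is int64 on 64-bit Linux/macOS but int32 on Windows and
# 32-bit platforms -- the reason _wrap_na_result special-cases those systems
```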

Member

We only "know" because of the hard-coding here

In the end, also numpy kind of "hard codes" this, but just in the implementation itself. And we get the difficulty of combining the algo of numpy with our own logic, where we also want to know this information. So I don't think it's necessarily "wrong" to hard code this on our side as well (i.e. essentially keep it as you are doing in this PR, also long term (at least as long as the arrays are only 1D)).

Sidenote, from seeing the additional complexity for 32bit systems in deciding the result dtype, I do wonder if we actually want to get rid of that system-dependent behaviour? (we also don't follow numpy's behaviour in the constructors, but always default to int64)
Although given that we still call numpy for the actual operation, it probably doesn't reduce complexity to ensure this dtype guarantee (we would need to move the 32bit check to the implementation of the masked sum)

mask_size = self.shape[1] if axis == 0 else self.shape[0]
mask = np.ones(mask_size, dtype=bool)

float_dtyp = "float32" if self.dtype == "Float32" else "float64"
if name in ["mean", "median", "var", "std", "skew"]:
np_dtype = float_dtyp
elif name in ["min", "max"] or self.dtype.itemsize == 8:
np_dtype = self.dtype.numpy_dtype.name
else:
is_windows_or_32bit = is_platform_windows() or not IS64
int_dtyp = "int32" if is_windows_or_32bit else "int64"
uint_dtyp = "uint32" if is_windows_or_32bit else "uint64"
np_dtype = {"b": int_dtyp, "i": int_dtyp, "u": uint_dtyp, "f": float_dtyp}[
self.dtype.kind
]

value = np.array([1], dtype=np_dtype)
return self._maybe_mask_result(value, mask=mask)

def _wrap_min_count_reduction_result(
self, name: str, result, skipna, min_count, **kwargs
):
if min_count == 0 and isinstance(result, np.ndarray):
return self._maybe_mask_result(result, np.zeros(result.shape, dtype=bool))
return self._wrap_reduction_result(name, result, skipna, **kwargs)
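The min_count distinction this helper encodes is visible from the public API (sketch):

```python
import pandas as pd

s = pd.Series([pd.NA, pd.NA], dtype="Int64")

s.sum()             # 0: with the default min_count=0 an all-NA sum is defined
s.sum(min_count=1)  # <NA>: fewer than min_count valid values, so the result is masked
```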

def sum(
self,
*,
Expand All @@ -1124,8 +1165,8 @@ def sum(
min_count=min_count,
axis=axis,
)
return self._wrap_reduction_result(
"sum", result, skipna=skipna, axis=axis, **kwargs
return self._wrap_min_count_reduction_result(
"sum", result, skipna=skipna, min_count=min_count, axis=axis, **kwargs
)

def prod(
Expand All @@ -1137,15 +1178,16 @@ def prod(
**kwargs,
):
nv.validate_prod((), kwargs)

result = masked_reductions.prod(
self._data,
self._mask,
skipna=skipna,
min_count=min_count,
axis=axis,
)
return self._wrap_reduction_result(
"prod", result, skipna=skipna, axis=axis, **kwargs
return self._wrap_min_count_reduction_result(
"prod", result, skipna=skipna, min_count=min_count, axis=axis, **kwargs
)

def mean(self, *, skipna: bool = True, axis: AxisInt | None = 0, **kwargs):
Expand Down Expand Up @@ -1192,23 +1234,29 @@ def std(

def min(self, *, skipna: bool = True, axis: AxisInt | None = 0, **kwargs):
nv.validate_min((), kwargs)
return masked_reductions.min(
result = masked_reductions.min(
self._data,
self._mask,
skipna=skipna,
axis=axis,
)
return self._wrap_reduction_result(
"min", result, skipna=skipna, axis=axis, **kwargs
)

def max(self, *, skipna: bool = True, axis: AxisInt | None = 0, **kwargs):
nv.validate_max((), kwargs)
return masked_reductions.max(
result = masked_reductions.max(
self._data,
self._mask,
skipna=skipna,
axis=axis,
)
return self._wrap_reduction_result(
"max", result, skipna=skipna, axis=axis, **kwargs
)

def any(self, *, skipna: bool = True, **kwargs):
def any(self, *, skipna: bool = True, axis: AxisInt | None = 0, **kwargs):
Member

Is it needed to add the axis keyword here (it's not actually being used?)

Contributor Author (topper-123, May 29, 2023)

I'll look into it, could be connected to your previous comment.

Contributor Author

I think this works, and I've made another version; we'll see if it passes, and then I'll look into your other comments.

"""
Return whether any element is truthy.

Expand All @@ -1227,6 +1275,7 @@ def any(self, *, skipna: bool = True, **kwargs):
If `skipna` is False, the result will still be True if there is
at least one element that is truthy, otherwise NA will be returned
if there are NA's present.
axis : int, optional, default 0
**kwargs : any, default None
Additional keywords have no effect but might be accepted for
compatibility with NumPy.
Expand Down Expand Up @@ -1270,7 +1319,6 @@ def any(self, *, skipna: bool = True, **kwargs):
>>> pd.array([0, 0, pd.NA]).any(skipna=False)
<NA>
"""
kwargs.pop("axis", None)
nv.validate_any((), kwargs)

values = self._data.copy()
Expand All @@ -1289,7 +1337,7 @@ def any(self, *, skipna: bool = True, **kwargs):
else:
return self.dtype.na_value

def all(self, *, skipna: bool = True, **kwargs):
def all(self, *, skipna: bool = True, axis: AxisInt | None = 0, **kwargs):
"""
Return whether all elements are truthy.

Expand All @@ -1308,6 +1356,7 @@ def all(self, *, skipna: bool = True, **kwargs):
If `skipna` is False, the result will still be False if there is
at least one element that is falsey, otherwise NA will be returned
if there are NA's present.
axis : int, optional, default 0
**kwargs : any, default None
Additional keywords have no effect but might be accepted for
compatibility with NumPy.
Expand Down Expand Up @@ -1351,7 +1400,6 @@ def all(self, *, skipna: bool = True, **kwargs):
>>> pd.array([1, 0, pd.NA]).all(skipna=False)
False
"""
kwargs.pop("axis", None)
nv.validate_all((), kwargs)

values = self._data.copy()
Expand All @@ -1361,7 +1409,7 @@ def all(self, *, skipna: bool = True, **kwargs):
# bool, int, float, complex, str, bytes,
# _NestedSequence[Union[bool, int, float, complex, str, bytes]]]"
np.putmask(values, self._mask, self._truthy_value) # type: ignore[arg-type]
result = values.all()
result = values.all(axis=axis)

if skipna:
return result
Expand Down