-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MNT: Bump dev pin on NumPy #60987
MNT: Bump dev pin on NumPy #60987
Conversation
>>> s.argmax() | ||
2 | ||
np.int64(2) | ||
>>> s.argmin() | ||
0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few of these where I'm wondering if we should be returning Python scalars instead of NumPy. Should issues be opened for these?
cc @pandas-dev/pandas-core
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think generally we always want to return Python scalars (IIRC we got a lot of issues about this in iteration and iteration-like APIs in the past)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even just wrapping the result of Series._reduce
with maybe_box_naive
breaks 692 tests. From a cursory look, they're tests that are expecting a NumPy scalar back. A lot however are something like op(data).any().any()
so that they will work with DataFrame and Series. I plan to bring this up in the next dev meeting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should always return Python scalars. I'm surprised at the amount of failures that expect NumPy scalars
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd think you need a deprecation on this, because people may have code that depends on the result being a numpy scalar. I think that the tests we have in pandas-stubs
for typing may depend on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could put it up behind a future
option, maybe something like always_return_python_scalars
pandas/core/algorithms.py
Outdated
@@ -415,7 +415,7 @@ def unique(values): | |||
|
|||
>>> pd.unique(pd.array([1 + 1j, 2, 3])) | |||
<NumpyExtensionArray> | |||
[(1+1j), (2+0j), (3+0j)] | |||
[np.complex128(1+1j), np.complex128(2+0j), np.complex128(3+0j)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged #54268 a while back to make these reprs render like the Python scalars/pre numpy 2.0. It appears that PR didn't touch all the relevant repr methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do a precursor to fix the reprs for arrays, and then revert some of the outputs here.
Your comment only applies to EA reprs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say any repr in pandas shouldn't be showing the NEP 51 style repr for scalars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say any repr in pandas shouldn't be showing the NEP 51 style repr for scalars.
The main motivation behind NEP 51 is that the Python numerical types behave differently from the NumPy scalars.
They give a few examples such as
Python programmers sometimes write obj is True and will surprised when an object that shows as True fails to pass the test. It is much easier to understand this behavior when the value is shown as np.True_.
So I think we should be showing the NEP 51 style repr for scalars.
@mroeschke you have approved this PR so I assume that was a typo?
@mroeschke @WillAyd - it will be a significant change to start returning Python objects instead of NumPy across the board. Do we want to make that (in separate PRs) before merging this? |
pandas/core/arrays/datetimelike.py
Outdated
print(offset1, offset2) | ||
print(type(offset1), type(offset2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print
in pandas??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doh, thanks.
>>> s.argmax() | ||
2 | ||
np.int64(2) | ||
>>> s.argmin() | ||
0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd think you need a deprecation on this, because people may have code that depends on the result being a numpy scalar. I think that the tests we have in pandas-stubs
for typing may depend on this.
@@ -14,7 +14,7 @@ pytest-localserver | |||
PyQt5>=5.15.9 | |||
coverage | |||
python-dateutil | |||
numpy<2 | |||
numpy<3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At what point do we stop numpy < 2
support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spec 0 says Q3 2025. I would suggest pandas 4.0, but perhaps it could be earlier. In any case, this doesn't seem like the best place to discuss this at length.
If it is a significant effort then I would say as long as we don't introduce new NumPy scalars in return values then we are good. We can leave it to separate exercises to hack away at that |
Right - this PR does not introduce any new instances of NumPy scalars, just adjusts the docstrings with the new repr as of NumPy 2.0. |
So prior to NumPy 2.0 these returned Python scalars but now they return NumPy scalars? |
No, the same object is being returned - always a NumPy scalar. Only the repr is changing when you go from NumPy < 2 to NumPy >= 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
pandas/core/arrays/base.py
Outdated
@@ -1962,7 +1962,7 @@ def _formatter(self, boxed: bool = False) -> Callable[[Any], str | None]: | |||
... return lambda x: "*" + str(x) + "*" if boxed else repr(x) + "*" | |||
>>> MyExtensionArray(np.array([1, 2, 3, 4])) | |||
<MyExtensionArray> | |||
[1*, 2*, 3*, 4*] | |||
[np.int64(1)*, np.int64(2)*, np.int64(3)*, np.int64(4)*] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NEP 51 is for the output of numpy scalars to help users distinguish the NumPy scalars from the Python builtin types and clarify their behavior and that array representation will not be affected since it already includes the dtype= when necessary.
So should probably change the example here to not use the NEP 51 repr in the output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Simon. Historically I don't think we have ever favored exposing NumPy scalars directly to end users. If that has happened as an implementation detail that is one thing, but explicitly showing that to an end user runs counter to what we have done in the psat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
@@ -1775,7 +1775,8 @@ def to_tuples(self, na_tuple: bool = True) -> np.ndarray: | |||
[(0, 1], (1, 2]] | |||
Length: 2, dtype: interval[int64, right] | |||
>>> idx.to_tuples() | |||
array([(0, 1), (1, 2)], dtype=object) | |||
array([(np.int64(0), np.int64(1)), (np.int64(1), np.int64(2))], | |||
dtype=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In https://github.com/pandas-dev/pandas/pull/60987/files#r2016346826 I commented because the dtype of the array was the same as the dtype of the scalars.
Thinking some more, we currently have...
pd.Series([123,"123"])
# 0 123
# 1 123
# dtype: object
so for consistency, even object arrays should probably not show the NEP51 repr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's an option here, this is what happens when you call str
on a tuple of NumPy scalars.
print(str((np.int64(0), np.int64(1))))
# (np.int64(0), np.int64(1))
arr = pd.array([(np.int64(0), np.int64(1)), (np.int64(1), np.int64(2))], dtype=object)
print(type(arr[0]), str(arr[0]))
# <class 'tuple'> (np.int64(0), np.int64(1))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems reasonable however extending the example I gave for the strings in an object array...
s = pd.Series([123, "123", ("123", "123")])
print(s)
print(s[2])
# 0 123
# 1 123
# 2 (123, 123)
# dtype: object
# ('123', '123')
however, we don't currently display the repr for strings in a collection either when the array is displayed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my bad, the numpy array of values does show the repr for the strings in a tuple.
s.values
# array([123, '123', ('123', '123')], dtype=object)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rhshadrach lgtm
@Dr-Irv - friendly ping |
Sorry - I was out the past 5 days! Will look now |
Thanks @rhshadrach |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.