ENH: Make maybe_convert_object respect dtype itemsize #40908
pandas/_libs/tslibs/util.pxd
@@ -195,6 +200,24 @@ cdef inline bint is_nan(object val):
     return is_complex_object(val) and val != val


+cdef inline int64_t get_itemsize(object val):
if this is only used in lib.pyx, i think better to put it there
pandas/_libs/lib.pyx
         elif seen.is_bool and not seen.nan_:
-            return bools.view(np.bool_)
+            result = bools.view(np.bool_)
+    if result is not None:
should this be tightened to something like result is floats or result is uints or result is ints? i.e. exclude datetimes/timedeltas/bools
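A rough sketch of what the tighter check would mean, in plain Python rather than the Cython in lib.pyx. The helper name should_downcast and the stand-in buffers are hypothetical; only the identity comparison itself comes from the comment above.

```python
import numpy as np


def should_downcast(result, floats, uints, ints):
    """Hypothetical helper: True only when result is one of the numeric buffers."""
    # Identity (not equality) comparison, so a bool/datetime/timedelta
    # result never qualifies, and neither does None.
    return result is floats or result is uints or result is ints


# Stand-ins for the buffers the conversion routine fills in.
floats = np.zeros(2, dtype=np.float64)
uints = np.zeros(2, dtype=np.uint64)
ints = np.zeros(2, dtype=np.int64)
bools = np.zeros(2, dtype=np.uint8)

numeric_ok = should_downcast(floats, floats, uints, ints)
bool_ok = should_downcast(bools.view(np.bool_), floats, uints, ints)
print(numeric_ok, bool_ok)
```

Compared with result is not None, this leaves every non-numeric result untouched without needing a separate dtype check.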
pandas/_libs/lib.pyx
+            result = bools.view(np.bool_)
+    if result is not None:
+        if itemsize_max > 0:
+            curr_itemsize = cnp.PyArray_ITEMSIZE(result)
I'd just use result.dtype.itemsize and not bother with the C API.
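For reference, a minimal illustration of the attribute mentioned above: every NumPy array reports its per-element size in bytes through dtype.itemsize, so no C-API call is needed for the comparison. The array names are illustrative and not taken from the PR.

```python
import numpy as np

floats = np.array([1.5, 2.5], dtype=np.float64)
small = floats.astype(np.float32)

# dtype.itemsize is the number of bytes used by one element
print(floats.dtype.itemsize)  # 8
print(small.dtype.itemsize)   # 4
```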
can you add some explicit tests for this, maybe around
Edit: Turns out, this was an issue with Series.count. Thanks @jbrockmendel and @jreback - will make the changes and add tests. But I want to figure out what the right output here is:
The count is computed by summing an ndarray of Booleans, which NumPy gives back on a 32-bit machine as an int32. This PR is making it stay an int32 (master would convert to int64). Do we want
(with perhaps a better way to get at the 4 vs 8 bytes)
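To make the question concrete, here is a small NumPy-only sketch (not the pandas code path itself) of where the platform dependence comes from and how an explicit cast removes it. The variable names are made up for the example.

```python
import numpy as np

mask = np.array([True, False, True])

# Summing booleans yields NumPy's platform default integer, so the
# dtype of this scalar can differ between 32-bit and 64-bit builds.
count = mask.sum()

# An explicit cast pins the result regardless of platform.
count64 = count.astype("int64")

print(count.dtype.itemsize)  # 4 or 8 bytes, depending on the platform
print(count64.dtype)         # int64
```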
…ybe_convert_object_itemsize Conflicts: pandas/core/series.py
we diverge from numpy by always casting a list of python ints to int64 (numpy would do int32)
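A quick way to see the divergence described above, assuming a working pandas and NumPy install. NumPy picks its platform default integer for a list of Python ints, while pandas always produces int64.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3]

np_dtype = np.array(values).dtype   # platform default integer
pd_dtype = pd.Series(values).dtype  # int64 on every platform

print(np_dtype, pd_dtype)
```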
Thanks @jreback - this was actually an issue with Series.count. All paths of DataFrame.count cast to int64, and when level is not None, Series would do this as well. But when using Series.count with level=None, no cast was done. I'm hopeful adding this in (along with some other test fixes) will fix CI.
The xfailed test on
results in
@@ -1891,7 +1891,7 @@ def count(self, level=None):
         2
         """
         if level is None:
-            return notna(self._values).sum()
+            return notna(self._values).sum().astype("int64")
All paths for Series.count and DataFrame.count result in int64 except for this one.
@jbrockmendel - changes made, @jreback - test added. This still needs a whatsnew note. I think at least the impact on Series/DataFrame construction with numpy types should perhaps be in a section of its own. But maybe_convert_objects is used in many places, so I'm not sure if more should be mentioned. When I follow this up with #40790 I'll add on the impact to UDF methods.
Now I'm second guessing making its own section - I'd guess that construction via a list of NumPy scalars is not so common. Maybe just a single line in other enhancements?
minor comment. any perf issues?
         elif seen.is_bool and not seen.nan_:
-            return bools.view(np.bool_)
+            result = bools.view(np.bool_)
+    if result is uints or result is ints or result is floats or result is complexes:
can you put a blank line & comment here (e.g. casting to itemsize)
@jreback - changes made
I updated the ASV in the OP, and included some timeit using maybe_convert_objects directly.
lgtm merge on greenish
@jreback - greenish. Failure is:
thanks @rhshadrach
Precursor to #40790
This adds support for e.g. float32 NumPy dtypes to maybe_convert_objects. If any non-NumPy scalar is hit, the behavior is the same as master. This is my first foray into the NumPy C-API, so any tips are appreciated. In particular, I couldn't figure out how to use the C API to do the cast:
Not sure if there should also be specific logic for EAs/nullable types.
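The behavior described above can be approximated in pure Python. This is only a sketch under simplifying assumptions: it handles floats only, the function name convert_floats is hypothetical, and it is not the Cython implementation in lib.pyx. The itemsize_max bookkeeping mirrors the variable name visible in the diff hunks.

```python
import numpy as np


def convert_floats(values):
    """Hypothetical float-only analogue of the itemsize-respecting conversion."""
    itemsize_max = 0
    for val in values:
        if isinstance(val, np.floating):
            if itemsize_max != -1:
                itemsize_max = max(itemsize_max, val.dtype.itemsize)
        else:
            # Any non-NumPy scalar disables the downcast (same as master).
            itemsize_max = -1

    # Default path: always build a float64 array first.
    result = np.array([float(v) for v in values], dtype=np.float64)

    # Cast to the widest itemsize seen among the NumPy scalars.
    if itemsize_max > 0 and itemsize_max != result.dtype.itemsize:
        result = result.astype(np.dtype(f"f{itemsize_max}"))
    return result


all_f32 = convert_floats([np.float32(1.0), np.float32(2.0)])
mixed = convert_floats([np.float32(1.0), 2.0])
print(all_f32.dtype)  # float32
print(mixed.dtype)    # float64
```

A plain Python float in the list sends the whole result back to float64, which matches the "same as master" fallback in the description.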
From a full ASV run:
Specific timings via %timeit on maybe_convert_objects directly: