ENH: Make maybe_convert_object respect dtype itemsize #40908

rhshadrach · 2021-04-12T22:18:39Z

Ensure all linting tests pass, see here for how to run them

Precursor to #40790

This adds support for e.g. float32 NumPy dtypes to maybe_convert_object. If any non-NumPy scalar is hit, the behavior is the same as master. This is my first foray into the NumPy C-API, so any tips are appreciated. In particular, I couldn't figure out how to use the C API to do the cast:

result = result.astype(result.dtype.kind + str(itemsize))

Not sure if there should also be specific logic for EAs/nullable types.

From a full ASV run:

       before           after         ratio
     [7d4757b4]       [c1288962]
     <maybe_convert_object_itemsize~12>       <maybe_convert_object_itemsize>
+         768±8μs         1.01±0ms     1.32  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'float')
+         779±9μs         1.02±0ms     1.31  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'float')
+         579±5μs         710±20μs     1.23  arithmetic.NumericInferOps.time_divide(<class 'numpy.int8'>)
+        608±30μs         715±20μs     1.18  arithmetic.NumericInferOps.time_divide(<class 'numpy.uint8'>)
+      3.18±0.2μs       3.69±0.3μs     1.16  index_cached_properties.IndexCache.time_engine('UInt64Index')
+     1.58±0.01ms      1.83±0.01ms     1.16  ctors.SeriesConstructors.time_series_constructor(<function arr_dict at 0x7f5964729820>, False, 'float')
+      3.30±0.2μs       3.80±0.5μs     1.15  index_cached_properties.IndexCache.time_engine('TimedeltaIndex')
+        943±80ns       1.08±0.1μs     1.15  index_cached_properties.IndexCache.time_inferred_type('Float64Index')
+     1.73±0.01ms         1.98±0ms     1.14  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'int')
+     1.72±0.01ms         1.96±0ms     1.14  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'int')
+     5.19±0.03μs       5.93±0.3μs     1.14  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, datetime.timezone(datetime.timedelta(seconds=3600)))
+     1.65±0.01ms      1.89±0.01ms     1.14  ctors.SeriesConstructors.time_series_constructor(<function arr_dict at 0x7f5964729820>, True, 'float')
+     6.06±0.06μs       6.91±0.1μs     1.14  tslibs.tz_convert.TimeTZConvert.time_tz_convert_from_utc(1, datetime.timezone(datetime.timedelta(seconds=3600)))
+     7.46±0.04μs       8.47±0.5μs     1.14  tslibs.offsets.OffestDatetimeArithmetic.time_subtract(<BusinessDay>)
+     1.15±0.09μs      1.27±0.08μs     1.11  index_cached_properties.IndexCache.time_values('UInt64Index')
-        11.7±1ms       10.5±0.1ms     0.90  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.object_'>, 18, 'outside')
-        641±30ns         577±20ns     0.90  index_cached_properties.IndexCache.time_is_monotonic('RangeIndex')
-        641±20ns         576±20ns     0.90  index_cached_properties.IndexCache.time_shape('Int64Index')
-     3.37±0.03ms      3.02±0.02ms     0.90  timeseries.ResampleSeries.time_resample('period', '5min', 'ohlc')
-        787±20ns         702±20ns     0.89  index_cached_properties.IndexCache.time_is_monotonic_decreasing('RangeIndex')
-     1.25±0.04μs      1.12±0.04μs     0.89  index_cached_properties.IndexCache.time_is_all_dates('Int64Index')
-      1.50±0.1ms      1.33±0.02ms     0.88  dtypes.SelectDtypes.time_select_dtype_string_exclude('float32')
-     1.19±0.09μs      1.05±0.08μs     0.88  index_cached_properties.IndexCache.time_inferred_type('UInt64Index')
-        57.6±1μs       50.3±0.2μs     0.87  frame_methods.Dtypes.time_frame_dtypes
-         168±3μs        146±0.6μs     0.87  algos.isin.IsinWithArangeSorted.time_isin(<class 'numpy.uint64'>, 8000)
-        552±20ns         480±20ns     0.87  index_cached_properties.IndexCache.time_is_monotonic_increasing('RangeIndex')
-      1.50±0.1ms         1.30±0ms     0.87  dtypes.SelectDtypes.time_select_dtype_string_exclude('complex64')
-        1.89±0ms      1.64±0.01ms     0.87  period.DataFramePeriodColumn.time_set_index
-      11.0±0.2μs      9.50±0.08μs     0.87  period.Indexing.time_series_loc
-        110±20ms      93.1±0.06ms     0.85  algos.isin.IsInLongSeriesLookUpDominates.time_isin('float32', 1000, 'random_hits')
-        392±30ns         330±10ns     0.84  index_cached_properties.IndexCache.time_inferred_type('RangeIndex')
-      4.45±0.4ms      3.68±0.02ms     0.83  algorithms.Factorize.time_factorize(True, True, 'Int64')
-      9.60±0.8ms      7.75±0.01ms     0.81  algorithms.Factorize.time_factorize(True, False, 'string')
-      11.2±0.8ms       8.90±0.1ms     0.80  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.object_'>, 18, 'inside')
-        615±30ns         484±30ns     0.79  index_cached_properties.IndexCache.time_is_monotonic_increasing('Int64Index')
-       73.8±20ms       55.4±0.5ms     0.75  algos.isin.IsInLongSeriesLookUpDominates.time_isin('float32', 1000, 'monotone_hits')
-       180±0.1μs        128±0.5μs     0.71  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int8Engine'>, <class 'numpy.int8'>), 'monotonic_incr')
-        128±40ms      90.8±0.03ms     0.71  algos.isin.IsInLongSeriesLookUpDominates.time_isin('object', 5, 'monotone_hits')
-       186±0.3μs          131±2μs     0.70  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int16Engine'>, <class 'numpy.int16'>), 'monotonic_incr')
-      1.72±0.4μs       1.17±0.1μs     0.68  index_cached_properties.IndexCache.time_inferred_type('TimedeltaIndex')
-       216±0.8μs          140±3μs     0.65  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.UInt32Engine'>, <class 'numpy.uint32'>), 'monotonic_incr')
-     1.55±0.01ms          972±7μs     0.63  algos.isin.IsinWithArangeSorted.time_isin(<class 'numpy.int64'>, 100000)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Specific timings via %timeit on maybe_convert_objects directly:

np.array of python integers:
127 ms ± 80.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <--- PR
120 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   <--- master

np.array of int32:
179 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <--- PR
133 ms ± 274 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)    <--- master

np.array of int32 with last one a python int:
178 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <--- PR
133 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)    <--- master

timeit code

values = np.array(range(1_000_000), dtype="object")
print('np.array of python integers:')
%timeit maybe_convert_objects(values)
print()

values = np.array([np.int32(1)] * 1_000_000, dtype="object")
print('np.array of int32:')
%timeit maybe_convert_objects(values)
print()

values = np.array([np.int32(1)] * 999_999 + [1], dtype="object")
print('np.array of int32 with last one a python int:')
%timeit maybe_convert_objects(values)
print()

jbrockmendel · 2021-04-12T22:25:05Z

pandas/_libs/tslibs/util.pxd

@@ -195,6 +200,24 @@ cdef inline bint is_nan(object val):
    return is_complex_object(val) and val != val


+cdef inline int64_t get_itemsize(object val):


if this is only used in lib.pyx, i think better to put it there

jbrockmendel · 2021-04-12T22:26:47Z

pandas/_libs/lib.pyx

                elif seen.is_bool and not seen.nan_:
-                    return bools.view(np.bool_)
+                    result = bools.view(np.bool_)
+        if result is not None:


should this be tightened to something like result is floats or result is uints or result is ints? i.e. exclude datetimes/timedeltas/bools

jbrockmendel · 2021-04-12T22:28:30Z

pandas/_libs/lib.pyx

+                    result = bools.view(np.bool_)
+        if result is not None:
+            if itemsize_max > 0:
+                curr_itemsize = cnp.PyArray_ITEMSIZE(result)


id just use result.dtype.itemsize and not bother with the C API.

jreback · 2021-04-13T13:25:05Z

can you add some explict tests for this, maybe around

pandas/tests/dtypes/test_inference.py:        result = lib.maybe_convert_objects(arr)
pandas/tests/dtypes/test_inference.py:    def test_maybe_convert_objects_uint64(self):
pandas/tests/dtypes/test_inference.py:        tm.assert_numpy_array_equal(lib.maybe_convert_objects(arr), exp)
pandas/tests/dtypes/test_inference.py:        tm.assert_numpy_array_equal(lib.maybe_convert_objects(arr), exp)
pandas/tests/dtypes/test_inference.py:        tm.assert_numpy_array_equal(lib.maybe_convert_objects(arr), exp)
pandas/tests/dtypes/test_inference.py:        tm.assert_numpy_array_equal(lib.maybe_convert_objects(arr), exp)
pandas/tests/dtypes/test_inference.py:    def test_maybe_convert_objects_datetime(self):
pandas/tests/dtypes/test_inference.py:        out = lib.maybe_convert_objects(
pandas/tests/dtypes/test_inference.py:        out = lib.maybe_convert_objects(
pandas/tests/dtypes/test_inference.py:        out = lib.maybe_convert_objects(
pandas/tests/dtypes/test_inference.py:    def test_maybe_convert_objects_timedelta64_nat(self):
pandas/tests/dtypes/test_inference.py:        result = lib.maybe_convert_objects(arr, convert_timedelta=True)
pandas/tests/dtypes/test_inference.py:    def test_maybe_convert_objects_nullable_integer(self, exp):
pandas/tests/dtypes/test_inference.py:        result = lib.maybe_convert_objects(arr, convert_to_nullable_integer=True)
pandas/tests/dtypes/test_inference.py:    def test_maybe_convert_objects_bool_nan(self):
pandas/tests/dtypes/test_inference.py:        out = lib.maybe_convert_objects(ind.values, safe=1)
pandas/tests/dtypes/test_inference.py:        result = lib.maybe_convert_objects(arr, convert_datetime=True)

rhshadrach · 2021-04-13T20:46:11Z

Edit: Turns out, this was an issue with Series.count.

Thanks @jbrockmendel and @jreback - will make changes and tests. But I want to figure out what the right output here is:

        df = DataFrame(
            {"A": [None, 2, 3], "B": [1.0, np.nan, 3.0], "C": ["foo", None, "bar"]}
        )
    
        # Function aggregate
        result = getattr(df, how)({"A": "count"})
        expected = Series({"A": 2})
    
>       tm.assert_series_equal(result, expected)
E       AssertionError: Attributes of Series are different
E       
E       Attribute "dtype" are different
E       [left]:  int32
E       [right]: int64

The count is computed by summing an ndarray of Booleans, which NumPy gives back on a 32-bit machine as an int32. This PR is making it stay an int32 (master would convert to int64).

Do we want Series({"A": 2}) to come back on a 32-bit machine as being an int32 as well? If the answer is no, then I'm thinking this PR should be scrapped. If we do, I think we can implement that here as well:

    if result is uints or result is ints or result is floats:
        if itemsize_max > 0 and itemsize_max != result.dtype.itemsize:
            result = result.astype(result.dtype.kind + str(itemsize_max))
        else:
            result = result.astype(result.dtype.kind + str(np.intp().dtype.itemsize))
        return result

(with perhaps a better way to get at the 4 vs 8 bytes)

…ybe_convert_object_itemsize � Conflicts: � pandas/core/series.py

jreback · 2021-04-14T02:39:24Z

we diverge with numpy by always casting a list of python into to int64 (numpy would do int32)

rhshadrach · 2021-04-14T02:41:21Z

Thanks @jreback - this was actually an issue with Series.count. All paths of DataFrame.count cast to int64, and when level is not None, Series would do this as well. But when using Series.count with level=None, no cast was done. I'm hopeful adding this in (along with some other test fixes) will fix CI.

rhshadrach · 2021-04-15T04:05:42Z

The xfailed test on DataFrame.replace is due to a change in NumPy behavior:

values = np.array(['a'], dtype='object')
mask = np.array([True])
value = np.int8(1)
np.putmask(values, mask, value)
print(type(values[0]))

results in <class 'int'> with NumPy 1.19 but <class 'numpy.int8'> with NumPy >= 1.20

rhshadrach · 2021-04-15T04:30:06Z

pandas/core/series.py

@@ -1891,7 +1891,7 @@ def count(self, level=None):
        2
        """
        if level is None:
-            return notna(self._values).sum()
+            return notna(self._values).sum().astype("int64")


All paths for Series.count and DataFrame.count result in int64 except for this one.

rhshadrach · 2021-04-15T20:10:12Z

@jbrockmendel - changes made, @jreback - test added.

This still needs a whatsnew note, I think at least the impact on Series/DataFrame construction with numpy types should perhaps be in a section of its own. But maybe_convert_objects is used in many places, not sure if there should be more mentioned. When I follow this up with #40790 I'll add on the impact to UDF methods.

rhshadrach · 2021-04-15T20:15:30Z

Now I'm second guessing making its own section - I'd guess that construction via a list of NumPy scalars is not so common. Maybe just a single line in other enhancements?

jreback

minor comment. any perf issues?

jreback · 2021-04-16T00:42:00Z

pandas/_libs/lib.pyx

                elif seen.is_bool and not seen.nan_:
-                    return bools.view(np.bool_)
+                    result = bools.view(np.bool_)
+        if result is uints or result is ints or result is floats or result is complexes:


can you put a blank line & comment here (e.g. casting to itemsize)

rhshadrach · 2021-04-19T20:40:34Z

@jreback - changes made

any perf issues?

I updated the ASV in the OP, and included some timeit using maybe_convert_objects directly.

jreback

lgtm merge on greenish

rhshadrach · 2021-04-21T04:24:30Z

@jreback - greenish. Failure is:

      raise AssertionError(f"Caused unexpected warning(s): {repr(extra_warnings)}")
E AssertionError: Caused unexpected warning(s): [('ResourceWarning', ResourceWarning('unclosed transport <asyncio.sslproto._SSLProtocolTransport object at 0x000001FA4204DFC8>'), 'C:\Miniconda\envs\pandas-dev\lib\asyncio\sslproto.py', 322)]

jreback · 2021-04-21T12:53:31Z

thanks @rhshadrach

ENH: Make maybe_convert_object respect dtype itemsize

314ed71

rhshadrach added Enhancement Dtype Conversions Unexpected or buggy dtype conversions labels Apr 12, 2021

rhshadrach added 2 commits April 12, 2021 18:20

fixup

a8008d1

fixup

83c3ab5

jbrockmendel reviewed Apr 12, 2021

View reviewed changes

Move get_itemsize, use dtype.itemsize, safer casting

64086a4

rhshadrach added 2 commits April 13, 2021 22:34

Fix Series.count and tests

24044de

Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…

127158a

…ybe_convert_object_itemsize � Conflicts: � pandas/core/series.py

rhshadrach added 3 commits April 13, 2021 23:52

WIP

5950138

Added test

45f5f31

Fixed replace

8e1e7c2

rhshadrach marked this pull request as draft April 15, 2021 03:42

rhshadrach added 2 commits April 14, 2021 23:57

Remove higher itemsize for earlier versions of numpy

f13edc0

fixup

5d3ffe0

rhshadrach commented Apr 15, 2021

View reviewed changes

np_version_under1p19 -> np_version_under1p20

c128896

rhshadrach marked this pull request as ready for review April 15, 2021 20:06

jreback requested changes Apr 16, 2021

View reviewed changes

whatsnew and comment

c03fbf6

jreback added this to the 1.3 milestone Apr 20, 2021

Merge branch 'master' into maybe_convert_object_itemsize

968942c

jreback approved these changes Apr 20, 2021

View reviewed changes

jreback merged commit 549e39b into pandas-dev:master Apr 21, 2021

rhshadrach deleted the maybe_convert_object_itemsize branch April 21, 2021 15:37

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021

ENH: Make maybe_convert_object respect dtype itemsize (pandas-dev#40908)

65b2f4a

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

ENH: Make maybe_convert_object respect dtype itemsize (pandas-dev#40908)

d0a43fc

simonjayhawkins mentioned this pull request Jun 15, 2022

BUG: uint16 inserted as int16 when assigning row with dict #47294

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Make maybe_convert_object respect dtype itemsize #40908

ENH: Make maybe_convert_object respect dtype itemsize #40908

rhshadrach commented Apr 12, 2021 •

edited

Loading

jbrockmendel Apr 12, 2021

jbrockmendel Apr 12, 2021

jbrockmendel Apr 12, 2021

jreback commented Apr 13, 2021

rhshadrach commented Apr 13, 2021 •

edited

Loading

jreback commented Apr 14, 2021

rhshadrach commented Apr 14, 2021

rhshadrach commented Apr 15, 2021 •

edited

Loading

rhshadrach Apr 15, 2021

rhshadrach commented Apr 15, 2021

rhshadrach commented Apr 15, 2021

jreback left a comment

jreback Apr 16, 2021

rhshadrach commented Apr 19, 2021

jreback left a comment

rhshadrach commented Apr 21, 2021

jreback commented Apr 21, 2021

		@@ -195,6 +200,24 @@ cdef inline bint is_nan(object val):
		return is_complex_object(val) and val != val


		cdef inline int64_t get_itemsize(object val):

ENH: Make maybe_convert_object respect dtype itemsize #40908

ENH: Make maybe_convert_object respect dtype itemsize #40908

Conversation

rhshadrach commented Apr 12, 2021 • edited Loading

jbrockmendel Apr 12, 2021

Choose a reason for hiding this comment

jbrockmendel Apr 12, 2021

Choose a reason for hiding this comment

jbrockmendel Apr 12, 2021

Choose a reason for hiding this comment

jreback commented Apr 13, 2021

rhshadrach commented Apr 13, 2021 • edited Loading

jreback commented Apr 14, 2021

rhshadrach commented Apr 14, 2021

rhshadrach commented Apr 15, 2021 • edited Loading

rhshadrach Apr 15, 2021

Choose a reason for hiding this comment

rhshadrach commented Apr 15, 2021

rhshadrach commented Apr 15, 2021

jreback left a comment

Choose a reason for hiding this comment

jreback Apr 16, 2021

Choose a reason for hiding this comment

rhshadrach commented Apr 19, 2021

jreback left a comment

Choose a reason for hiding this comment

rhshadrach commented Apr 21, 2021

jreback commented Apr 21, 2021

rhshadrach commented Apr 12, 2021 •

edited

Loading

rhshadrach commented Apr 13, 2021 •

edited

Loading

rhshadrach commented Apr 15, 2021 •

edited

Loading