PERF/ENH: add fast astyping for Categorical #37355

arw2019 · 2020-10-23T04:14:28Z

closes ENH: decode for Categoricals #8628
tests added / passed
benchmarks added
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

To illustrate the speed-up, the set-up is (from OP):

import numpy as np
import pandas as pd

rng = np.random.default_rng()

df = pd.DataFrame(
    rng.choice(np.array(list("abcde")), 4_000_000).reshape(1_000_000, 4),
    columns=list("ABCD"),
)

for col in df.columns:
    df[col] = df[col].astype("category")

On master

In [5]: %timeit [df[col].astype('unicode') for col in df.columns] 
250 ms ± 601 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

versus on this branch

In [5]: %timeit [df[col].astype('unicode') for col in df.columns] 
5.38 ms ± 47 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

pandas/core/generic.py

topper-123 · 2020-10-23T18:07:39Z

This changes the behaviour unfortunately:

>>> n = 100_000
>>> cat = pd.Categorical(["a", "b"] * n)
>>> ser = pd.Series(cat)
>>> ser.astype(str).dtype
dtype('O')  # master
CategoricalDtype(categories=['a', 'b'], ordered=False)  # this PR, different dtype

Can you make it work in some different way?

arw2019 · 2020-10-23T18:33:40Z

Thanks for the catch! Fixed this now
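
A regression check for the dtype issue reported above might look like the sketch below. It only asserts that the result is no longer categorical, because the exact string dtype returned by `astype(str)` can differ between pandas versions. The variable names are illustrative and not taken from the PR's test suite.

```python
import pandas as pd

n = 1_000
ser = pd.Series(pd.Categorical(["a", "b"] * n))

result = ser.astype(str)

# The cast must leave the categorical dtype behind ...
assert not isinstance(result.dtype, pd.CategoricalDtype)
# ... while preserving the values element for element.
assert result.tolist() == ["a", "b"] * n
```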

arw2019 · 2020-10-23T18:35:33Z

pandas/core/internals/blocks.py

@@ -596,6 +597,17 @@ def astype(self, dtype, copy: bool = False, errors: str = "raise"):

            return self.make_block(Categorical(self.values, dtype=dtype))

+        elif (  # GH8628
+            is_categorical_dtype(self.values.dtype)
+            and not (is_object_dtype(dtype) or is_string_like_dtype(dtype))


Could define a new helper for this in core/dtypes/common.py (e.g. is_string_like_or_object_dtype)?

arw2019 · 2020-10-23T19:36:04Z

It seems like adding this to astype has a lot of repercussions downstream, casting objects in undesirable ways. The special casing needed to make the tests pass is a code smell.

I wonder if it might make sense, instead of adding this to astype, to create a dedicated method for astyping a Categorical via this method?

In []: cat = pd.Categorical(["a", "b", "a", "a"] ) 
   ...: s = pd.Series(cat) 
   ...: s.astype_categories(dtype=str)                                      
Out[]: 
0    a
1    b
2    a
3    a
dtype: object

jreback · 2020-10-23T21:36:11Z

see also #37371; this might be the same path.

jbrockmendel · 2020-10-26T01:27:36Z

pandas/core/internals/blocks.py

+                or is_datetime_or_timedelta_dtype(dtype)
+            )
+            and copy is True
+        ):


this seems really convoluted, in a method that is already too complicated as it is (xref #22369)

Do you have a good idea where the perf improvement comes from? e.g. could we push this down into Categorical.astype?

Agreed that the amount of special casing here is not good (and even with that a bunch of tests are still failing)

The perf improvement is from astyping just the category labels instead of astyping each array entry separately.

Categorical.astype seems like a good location for this - will give that a go
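
The idea can be sketched outside pandas internals as follows. This is a minimal illustration, not the code merged in the PR: it casts the handful of category labels once and then expands them through the integer codes, and it assumes there are no missing values (a code of -1 would need separate handling).

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["a", "b", "a", "a", "c"])

# Slow path: materialise every element, then cast each entry.
slow = np.asarray(cat).astype(str)

# Fast path: cast only the category labels, then index them by the codes.
new_categories = np.asarray(cat.categories).astype(str)
fast = new_categories[cat.codes]

# Both paths produce the same values.
assert (slow == fast).all()
```

The saving grows with the ratio of array length to number of categories, which is why the 1,000,000-row, 5-category example in the PR description benefits so much.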

jreback · 2020-10-26T12:31:22Z

pandas/core/arrays/categorical.py

+
+        new_categories = self.categories.astype(dtype)
+        obj = Categorical.from_codes(self.codes, categories=new_categories)
+        return np.array(obj.categories[self.codes], copy=copy)


you don't need to pass dtype here?

I think not, because we've already astyped the categories a few lines up

jreback · 2020-10-26T12:32:19Z

asv_bench/benchmarks/categoricals.py

+        for col in self.df.columns:
+            self.df[col] = self.df[col].astype("category")
+
+    def astype_unicode(self):


can you add benchmarks for other types of categories (int, dti, for example)? pls show the results of the benchmarks.

Ok!

Posted int benchmark in main thread + will add/post more

right pls update these for int,float,string,datetime
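
A parametrised benchmark covering the requested category types could be sketched like this. The class name, parameter names and sizes are hypothetical and are not the benchmark that was added in asv_bench/benchmarks/categoricals.py; the only asv conventions relied on are `params`, `param_names`, `setup` and the `time_` method prefix.

```python
import numpy as np
import pandas as pd


class AstypeCategorical:
    # Hypothetical benchmark class; asv times methods named "time_*".
    params = ["int", "float", "str", "datetime"]
    param_names = ["kind"]

    def setup(self, kind):
        n = 100_000
        rng = np.random.default_rng(0)
        codes = rng.integers(0, 8, n)  # 8 distinct categories

        if kind == "int":
            values = codes
            self.target = "int64"
        elif kind == "float":
            values = codes.astype("float64")
            self.target = "float64"
        elif kind == "str":
            values = np.array(list("abcdefgh"), dtype=object)[codes]
            self.target = str
        else:
            values = pd.date_range("2020-01-01", periods=8)[codes]
            self.target = "datetime64[ns]"

        self.ser = pd.Series(values).astype("category")

    def time_astype(self, kind):
        # Cast the categorical Series back to its underlying dtype.
        self.ser.astype(self.target)
```

Keeping the number of categories small relative to the array length matches the set-up in the PR description, where the fast path has the most room to help.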

arw2019 · 2020-10-29T04:45:50Z

Re: review comments:

`int` benchmark results

In [1]: import numpy as np 
   ...: import pandas as pd 
   ...:  
   ...: rng = np.random.default_rng() 
   ...:  
   ...: df = pd.DataFrame( 
   ...:     rng.choice(np.arange(8), 4_000_000).reshape(1_000_000, 4), 
   ...:     columns=list("ABCD"), 
   ...: ) 
   ...:  
   ...: for col in df.columns: 
   ...:     df[col] = df[col].astype("category") 
   ...:

On master:

In [2]: %timeit [df[col].astype('int') for col in df.columns]                                           
20.9 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

On this branch:

In [4]: %timeit [df[col].astype('int') for col in df.columns]                                                                                                                                                      
18.7 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

jreback · 2020-10-31T15:32:21Z

this is not showing much of an improvement. can you try some other dtypes as well?

jbrockmendel · 2020-11-17T23:09:58Z

pandas/core/arrays/categorical.py

+            try:
+                astyped_cats = self.categories.astype(dtype=dtype, copy=copy)
+            except (TypeError, ValueError):
+                raise ValueError(


why change TypeError?

It's to fix the error message for CategoricalIndex. If we don't catch TypeError we end up with TypeError: Cannot cast Index to dtype float64 (below) versus something like TypeError: Cannot cast object to dtype float64

In [2]: idx = pd.CategoricalIndex(["a", "b", "c", "a", "b", "c"])

In [3]: idx.astype('float')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/workspaces/pandas-arw2019/pandas/core/indexes/base.py in astype(self, dtype, copy)
    700         try:
--> 701             casted = self._values.astype(dtype, copy=copy)
    702         except (TypeError, ValueError) as err:

ValueError: could not convert string to float: 'a'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-4-38d56ec15c36> in <module>
----> 1 idx.astype('float')

/workspaces/pandas-arw2019/pandas/core/indexes/category.py in astype(self, dtype, copy)
    369     @doc(Index.astype)
    370     def astype(self, dtype, copy=True):
--> 371         res_data = self._data.astype(dtype, copy=copy)
    372         return Index(res_data, name=self.name)
    373

/workspaces/pandas-arw2019/pandas/core/arrays/categorical.py in astype(self, dtype, copy)
    427             # GH8628 (PERF): astype category codes instead of astyping array
    428             try:
--> 429                 astyped_cats = self.categories.astype(dtype=dtype, copy=copy)
    430             except (ValueError):
    431                 raise ValueError(

/workspaces/pandas-arw2019/pandas/core/indexes/base.py in astype(self, dtype, copy)
    701             casted = self._values.astype(dtype, copy=copy)
    702         except (TypeError, ValueError) as err:
--> 703             raise TypeError(
    704                 f"Cannot cast {type(self).__name__} to dtype {dtype}"
    705             ) from err

TypeError: Cannot cast Index to dtype float64

ok this is fine, but can you add a comment for this then right here (so future readers understand)
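
The pattern under discussion can be illustrated without pandas. In this sketch `cast_categories` is a stand-in for `Index.astype`, which wraps the low-level failure in a `TypeError` that names `Index`; the outer function catches both exception types so the caller sees one consistent error. The function names and the error message are hypothetical, since the actual message in the PR is not shown in this thread.

```python
def cast_categories(categories, dtype):
    # Stand-in for Index.astype: re-raises low-level failures as TypeError.
    try:
        return [dtype(c) for c in categories]
    except (TypeError, ValueError) as err:
        raise TypeError(f"Cannot cast Index to dtype {dtype.__name__}") from err


def astype_categorical(categories, codes, dtype):
    try:
        new_cats = cast_categories(categories, dtype)
    except (TypeError, ValueError) as err:
        # Catch TypeError too, so the message does not mention "Index".
        # Hypothetical wording, not the message used in pandas.
        raise ValueError(f"Cannot cast object dtype to {dtype.__name__}") from err
    # Expand the cast categories through the integer codes.
    return [new_cats[c] for c in codes]
```

Chaining with `from err` keeps the original failure visible in the traceback while the top-level message stays independent of the internal `Index` detour.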

jreback · 2020-11-18T01:09:17Z

ci / checks failing

arw2019 · 2020-11-18T01:16:40Z

Hmmm the mypy complaint looks unrelated (also getting the same thing on my other PRs)

mypy --version
mypy 0.782
Performing static analysis using mypy
pandas/core/indexes/datetimelike.py:776: error: Argument 1 to "_simple_new" of "DatetimeIndexOpsMixin" has incompatible type "Union[ExtensionArray, Any]"; expected "Union[DatetimeArray, TimedeltaArray, PeriodArray]"  [arg-type]
Found 1 error in 1 file (checked 1119 source files)
Performing static analysis using mypy DONE

Anyway will keep looking and ping when green

jreback · 2020-11-18T01:22:36Z

oh i think that's fixed on master

jreback

small comment, ping on green.

jreback · 2020-11-18T13:51:12Z

pandas/core/arrays/categorical.py

+            try:
+                astyped_cats = self.categories.astype(dtype=dtype, copy=copy)
+            except (TypeError, ValueError):
+                raise ValueError(


ok this is fine, but can you add a comment for this then right here (so future readers understand)

arw2019 · 2020-11-18T17:40:37Z

@jreback Green + addressed comment

jreback · 2020-11-18T18:21:56Z

thanks @arw2019 very nice

arw2019 · 2020-11-18T20:36:42Z

thanks @jreback @jbrockmendel for the reviews!

arw2019 changed the title ~~PERF/ENH: add fast astyping for categorical input~~ PERF/ENH: add fast astyping for Categorical Oct 23, 2020

jbrockmendel reviewed Oct 23, 2020

arw2019 force-pushed the GH8628 branch from 967a218 to d3c45c5 Compare October 23, 2020 16:02

arw2019 added 6 commits October 23, 2020 16:03

PERF/ENH: add fast astyping for categorical input

5d82b02

replace is_categorical -> is_categorical_dtype

c18ae4e

ASV: add astyping benchmark

d7c0575

DOC: whatsnew

856995f

feedback: move change core/generic -> internals

57817a4

rewrite categorical check in Block.astype

1050d9e

arw2019 force-pushed the GH8628 branch from d3c45c5 to 1050d9e Compare October 23, 2020 16:03

arw2019 added 2 commits October 23, 2020 18:04

rewrite the fix

c8c05cc

rewrite the fix

b8141c4

fix handling of strings

3d3bcf1

arw2019 commented Oct 23, 2020

arw2019 added 2 commits October 23, 2020 18:42

improve readability

3714d09

add more special casing...

f8f501f

jbrockmendel reviewed Oct 26, 2020

arw2019 added 2 commits October 26, 2020 01:50

Merge remote-tracking branch 'upstream/master' into GH8628

f4b5952

feedback: move changes to Categorical.astype

2ec7ded

jreback added the Categorical (Categorical Data Type) and Performance (Memory or execution speed performance) labels Oct 26, 2020

jreback requested changes Oct 26, 2020

feedback: add comment in Categorical.astype

cd110bc

arw2019 marked this pull request as draft October 31, 2020 05:30

arw2019 added 6 commits November 11, 2020 05:50

Merge branch 'master' of https://github.com/pandas-dev/pandas into GH8628

19e22e2

TST/CI (32bit): fix up int conversions

d195d91

fix merge error

3351cb1

merge with upstream/master

38696d9

merge with master

071deec

TST: use np.intp in dtype tests

13fa086

jbrockmendel reviewed Nov 17, 2020

Merge branch 'master' of https://github.com/pandas-dev/pandas into GH8628

dda6804

CI: fix 32bit again

1016894

arw2019 closed this Nov 18, 2020

arw2019 reopened this Nov 18, 2020

jreback requested changes Nov 18, 2020

arw2019 added 4 commits November 18, 2020 10:46

DOC: add note re: CategoricalIndex TypeError catch

a9544b3

merge with upstream/master

527b15a

Merge branch 'GH8628' of https://github.com/arw2019/pandas into GH8628

9c29946

CI: fix merge error

7e9fc32

jreback approved these changes Nov 18, 2020

jreback merged commit cc957d1 into pandas-dev:master Nov 18, 2020

arw2019 deleted the GH8628 branch November 18, 2020 20:36

ma3da mentioned this pull request Nov 26, 2020

BUG: DataFrame.at setter of categorical DF overwrites entire row #37763

Closed

3 tasks

This was referenced Jan 26, 2021

BUG: astype from categorical to np.int32 conversion returns np.int64 in pandas 1.2.0 and 1.2.0 #39402

Closed

BUG: Concatenating categorical datetime columns raises a ValueError since v1.2 #39443

Closed

arw2019 mentioned this pull request Feb 5, 2021

BUG: fix Categorical.astype for dtype=np.int32 argument #39615

Merged

4 tasks

timlod mentioned this pull request Jun 3, 2021

BUG: undocumented astype("category").astype(str) type inconsistency between pandas 1.1 & 1.2 #41797

Closed

3 tasks
