Skip to content

PERF/ENH: add fast astyping for Categorical #37355

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 50 commits into from
Nov 18, 2020
Merged

Conversation

arw2019
Copy link
Member

@arw2019 arw2019 commented Oct 23, 2020

  • closes ENH: decode for Categoricals #8628
  • tests added / passed
  • benchmarks added
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

To illustrate the speed-up, the set-up is (from OP):

import numpy as np
import pandas as pd

rng = np.random.default_rng()

df = pd.DataFrame(
    rng.choice(np.array(list("abcde")), 4_000_000).reshape(1_000_000, 4),
    columns=list("ABCD"),
)

for col in df.columns:
    df[col] = df[col].astype("category")

On master

In [5]: %timeit [df[col].astype('unicode') for col in df.columns] 
250 ms ± 601 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

versus on this branch

In [5]: %timeit [df[col].astype('unicode') for col in df.columns] 
5.38 ms ± 47 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@arw2019 arw2019 changed the title PERF/ENH: add fast astyping for categorical input PERF/ENH: add fast astyping for Categorical Oct 23, 2020
@topper-123
Copy link
Contributor

This changes the behaviour unfortunately:

>>> n = 100_000
>>> cat = pd.Categorical(["a", "b"] * n)
>>> ser = pd.Series(cat)
>>> ser.astype(str).dtype
dtype('O')  # master
CategoricalDtype(categories=['a', 'b'], ordered=False)  # this PR, different dtype

Can you make it work in some different way?

@arw2019
Copy link
Member Author

arw2019 commented Oct 23, 2020

This changes the behaviour unfortunately:

>>> n = 100_000
>>> cat = pd.Categorical(["a", "b"] * n)
>>> ser = pd.Series(cat)
>>> ser.astype(str).dtype
dtype('O')  # master
CategoricalDtype(categories=['a', 'b'], ordered=False)  # this PR, different dtype

Can you make it work in some different way?

Thanks for the catch! Fixed this now

@@ -596,6 +597,17 @@ def astype(self, dtype, copy: bool = False, errors: str = "raise"):

return self.make_block(Categorical(self.values, dtype=dtype))

elif ( # GH8628
is_categorical_dtype(self.values.dtype)
and not (is_object_dtype(dtype) or is_string_like_dtype(dtype))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could define a new method for this in core/dtypes/common.py (is_string_like_or_object_dtype) ?

@arw2019
Copy link
Member Author

arw2019 commented Oct 23, 2020

It seems like adding this to astype has a lot of repercussions downstream, casting objects in undesirable ways. The special casing needed to make the tests pass is a code smell.

I wonder if it might make sense, instead of adding this to astype, to create a dedicated method for astyping a Categorical via this method?

In []: cat = pd.Categorical(["a", "b", "a", "a"] ) 
   ...: s = pd.Series(cat) 
   ...: s.astype_categories(dtype=str)                                      
Out[]: 
0    a
1    b
2    a
3    a
dtype: object

@jreback
Copy link
Contributor

jreback commented Oct 23, 2020

see also #37371; this might be the same path.

or is_datetime_or_timedelta_dtype(dtype)
)
and copy is True
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems really convoluted, in a method that is already too complicated as it is (xref #22369)

Do you have a good idea where the perf improvement comes from? e.g. could we push this down into Categorical.astype?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that the amount of special casing here is not good (and even with that a bunch of tests are still failing)

The perf improvement is from astyping just the category labels instead of astyping each array entry separately.

Categorical.astype seems like a good location for this - will give that a go

@jreback jreback added Categorical Categorical Data Type Performance Memory or execution speed performance labels Oct 26, 2020

new_categories = self.categories.astype(dtype)
obj = Categorical.from_codes(self.codes, categories=new_categories)
return np.array(obj.categories[self.codes], copy=copy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to pass dtype here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think no because we've already astyped the category a few lines up

for col in self.df.columns:
self.df[col] = self.df[col].astype("category")

def astype_unicode(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add benchmarks for other types of categories (int, dti) for example. show the results of the benchmarks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok!

Posted int benchmark in main thread + will add/post more

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right pls update these for int,float,string,datetime

@arw2019
Copy link
Member Author

arw2019 commented Oct 29, 2020

Re: review comments:

int benchmark results

In [1]: import numpy as np 
   ...: import pandas as pd 
   ...:  
   ...: rng = np.random.default_rng() 
   ...:  
   ...: df = pd.DataFrame( 
   ...:     rng.choice(np.arange(8), 4_000_000).reshape(1_000_000, 4), 
   ...:     columns=list("ABCD"), 
   ...: ) 
   ...:  
   ...: for col in df.columns: 
   ...:     df[col] = df[col].astype("category") 
   ...:                                                                                                 

On master:

In [2]: %timeit [df[col].astype('int') for col in df.columns]                                           
20.9 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

On this branch:

In [4]: %timeit [df[col].astype('int') for col in df.columns]                                                                                                                                                      
18.7 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@arw2019 arw2019 marked this pull request as draft October 31, 2020 05:30
@jreback
Copy link
Contributor

jreback commented Oct 31, 2020

this is not showing much of an improvement. can you try someother dtypes as well.

try:
astyped_cats = self.categories.astype(dtype=dtype, copy=copy)
except (TypeError, ValueError):
raise ValueError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change TypeError?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to fix the error message for CategoricalIndex. If we don't catch TypeError we end up with TypeError: Cannot cast Index to dtype float64 (below) versus something like TypeError: Cannot cast object to dtype float64

In [2]: idx = pd.CategoricalIndex(["a", "b", "c", "a", "b", "c"])

In [3]: idx.astype('float')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/workspaces/pandas-arw2019/pandas/core/indexes/base.py in astype(self, dtype, copy)
    700         try:
--> 701             casted = self._values.astype(dtype, copy=copy)
    702         except (TypeError, ValueError) as err:

ValueError: could not convert string to float: 'a'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-4-38d56ec15c36> in <module>
----> 1 idx.astype('float')

/workspaces/pandas-arw2019/pandas/core/indexes/category.py in astype(self, dtype, copy)
    369     @doc(Index.astype)
    370     def astype(self, dtype, copy=True):
--> 371         res_data = self._data.astype(dtype, copy=copy)
    372         return Index(res_data, name=self.name)
    373 

/workspaces/pandas-arw2019/pandas/core/arrays/categorical.py in astype(self, dtype, copy)
    427             # GH8628 (PERF): astype category codes instead of astyping array
    428             try:
--> 429                 astyped_cats = self.categories.astype(dtype=dtype, copy=copy)
    430             except (ValueError):
    431                 raise ValueError(

/workspaces/pandas-arw2019/pandas/core/indexes/base.py in astype(self, dtype, copy)
    701             casted = self._values.astype(dtype, copy=copy)
    702         except (TypeError, ValueError) as err:
--> 703             raise TypeError(
    704                 f"Cannot cast {type(self).__name__} to dtype {dtype}"
    705             ) from err

TypeError: Cannot cast Index to dtype float64

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is fine, but can you add a comment for this then right here (so future readers understand)

@jreback
Copy link
Contributor

jreback commented Nov 18, 2020

ci / checks failing

@arw2019
Copy link
Member Author

arw2019 commented Nov 18, 2020

ci / checks failing

Hmmm the mypy complaint looks unrelated (also getting the same thing on my other PRs)

mypy --version
mypy 0.782
Performing static analysis using mypy
pandas/core/indexes/datetimelike.py:776: error: Argument 1 to "_simple_new" of "DatetimeIndexOpsMixin" has incompatible type "Union[ExtensionArray, Any]"; expected "Union[DatetimeArray, TimedeltaArray, PeriodArray]"  [arg-type]
Found 1 error in 1 file (checked 1119 source files)
Performing static analysis using mypy DONE

Anyway will keep looking and ping when green

@jreback
Copy link
Contributor

jreback commented Nov 18, 2020

ci / checks failing

Hmmm the mypy complaint looks unrelated (also getting the same thing on my other PRs)

mypy --version
mypy 0.782
Performing static analysis using mypy
pandas/core/indexes/datetimelike.py:776: error: Argument 1 to "_simple_new" of "DatetimeIndexOpsMixin" has incompatible type "Union[ExtensionArray, Any]"; expected "Union[DatetimeArray, TimedeltaArray, PeriodArray]"  [arg-type]
Found 1 error in 1 file (checked 1119 source files)
Performing static analysis using mypy DONE

Anyway will keep looking and ping when green

oh i think that's fixed on master

@arw2019 arw2019 closed this Nov 18, 2020
@arw2019 arw2019 reopened this Nov 18, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment, ping on green.

try:
astyped_cats = self.categories.astype(dtype=dtype, copy=copy)
except (TypeError, ValueError):
raise ValueError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is fine, but can you add a comment for this then right here (so future readers understand)

@arw2019
Copy link
Member Author

arw2019 commented Nov 18, 2020

@jreback Green + addressed comment

@jreback jreback merged commit cc957d1 into pandas-dev:master Nov 18, 2020
@jreback
Copy link
Contributor

jreback commented Nov 18, 2020

thanks @arw2019 very nice

@arw2019
Copy link
Member Author

arw2019 commented Nov 18, 2020

thanks @jreback @jbrockmendel for the reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: decode for Categoricals
5 participants