-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF/ENH: add fast astyping for Categorical #37355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This changes the behaviour unfortunately: >>> n = 100_000
>>> cat = pd.Categorical(["a", "b"] * n)
>>> ser = pd.Series(cat)
>>> ser.astype(str).dtype
dtype('O') # master
CategoricalDtype(categories=['a', 'b'], ordered=False) # this PR, different dtype Can you make it work in some different way? |
Thanks for the catch! Fixed this now |
pandas/core/internals/blocks.py
Outdated
@@ -596,6 +597,17 @@ def astype(self, dtype, copy: bool = False, errors: str = "raise"): | |||
|
|||
return self.make_block(Categorical(self.values, dtype=dtype)) | |||
|
|||
elif ( # GH8628 | |||
is_categorical_dtype(self.values.dtype) | |||
and not (is_object_dtype(dtype) or is_string_like_dtype(dtype)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could define a new method for this in core/dtypes/common.py
(is_string_like_or_object_dtype
) ?
It seems like adding this to astype has a lot of repercussions downstream, casting objects in undesirable ways. The special casing needed to make the tests pass is a code smell. I wonder if it might make sense, instead of adding this to
|
see also #37371; this might be the same path. |
pandas/core/internals/blocks.py
Outdated
or is_datetime_or_timedelta_dtype(dtype) | ||
) | ||
and copy is True | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems really convoluted, in a method that is already too complicated as it is (xref #22369)
Do you have a good idea where the perf improvement comes from? e.g. could we push this down into Categorical.astype
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that the amount of special casing here is not good (and even with that a bunch of tests are still failing)
The perf improvement is from astyping just the category labels instead of astyping each array entry separately.
Categorical.astype
seems like a good location for this - will give that a go
pandas/core/arrays/categorical.py
Outdated
|
||
new_categories = self.categories.astype(dtype) | ||
obj = Categorical.from_codes(self.codes, categories=new_categories) | ||
return np.array(obj.categories[self.codes], copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to pass dtype here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think no because we've already astyped the category a few lines up
asv_bench/benchmarks/categoricals.py
Outdated
for col in self.df.columns: | ||
self.df[col] = self.df[col].astype("category") | ||
|
||
def astype_unicode(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add benchmarks for other types of categories (int, dti) for example. show the results of the benchmarks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok!
Posted int benchmark in main thread + will add/post more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right pls update these for int,float,string,datetime
Re: review comments:
|
this is not showing much of an improvement. can you try someother dtypes as well. |
pandas/core/arrays/categorical.py
Outdated
try: | ||
astyped_cats = self.categories.astype(dtype=dtype, copy=copy) | ||
except (TypeError, ValueError): | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change TypeError?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's to fix the error message for CategoricalIndex
. If we don't catch TypeError
we end up with TypeError: Cannot cast Index to dtype float64
(below) versus something like TypeError: Cannot cast object to dtype float64
In [2]: idx = pd.CategoricalIndex(["a", "b", "c", "a", "b", "c"])
In [3]: idx.astype('float')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/workspaces/pandas-arw2019/pandas/core/indexes/base.py in astype(self, dtype, copy)
700 try:
--> 701 casted = self._values.astype(dtype, copy=copy)
702 except (TypeError, ValueError) as err:
ValueError: could not convert string to float: 'a'
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
<ipython-input-4-38d56ec15c36> in <module>
----> 1 idx.astype('float')
/workspaces/pandas-arw2019/pandas/core/indexes/category.py in astype(self, dtype, copy)
369 @doc(Index.astype)
370 def astype(self, dtype, copy=True):
--> 371 res_data = self._data.astype(dtype, copy=copy)
372 return Index(res_data, name=self.name)
373
/workspaces/pandas-arw2019/pandas/core/arrays/categorical.py in astype(self, dtype, copy)
427 # GH8628 (PERF): astype category codes instead of astyping array
428 try:
--> 429 astyped_cats = self.categories.astype(dtype=dtype, copy=copy)
430 except (ValueError):
431 raise ValueError(
/workspaces/pandas-arw2019/pandas/core/indexes/base.py in astype(self, dtype, copy)
701 casted = self._values.astype(dtype, copy=copy)
702 except (TypeError, ValueError) as err:
--> 703 raise TypeError(
704 f"Cannot cast {type(self).__name__} to dtype {dtype}"
705 ) from err
TypeError: Cannot cast Index to dtype float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok this is fine, but can you add a comment for this then right here (so future readers understand)
ci / checks failing |
Hmmm the mypy complaint looks unrelated (also getting the same thing on my other PRs)
Anyway will keep looking and ping when green |
oh i think that's fixed on master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comment, ping on green.
pandas/core/arrays/categorical.py
Outdated
try: | ||
astyped_cats = self.categories.astype(dtype=dtype, copy=copy) | ||
except (TypeError, ValueError): | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok this is fine, but can you add a comment for this then right here (so future readers understand)
@jreback Green + addressed comment |
thanks @arw2019 very nice |
thanks @jreback @jbrockmendel for the reviews! |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
To illustrate the speed-up, the set-up is (from OP):
On master
versus on this branch