ENH: decode for Categoricals #8628
Comments
I agree it would be nicer to support this directly (and it is quite easy). Here is how:
love for you to make a change to |
The slow part is converting even numpy ndarrays from object dtype back into unicode (though I'm not sure exactly why you want that, given that pandas normally uses object arrays for strings):
Right, but if you know you have a categorical, then you only need to do str ops on a relatively small amount of data (the categories) compared to object arrays.
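To illustrate the point about only touching the categories, here is a minimal sketch. The helper name `upper_via_categories` is hypothetical and not part of pandas; it relies only on `Series.cat.categories`, the `.str` accessor on an Index, and `Series.cat.rename_categories`. It assumes the transformed categories stay unique, since pandas rejects duplicate categories.

```python
import pandas as pd


def upper_via_categories(s: pd.Series) -> pd.Series:
    """Upper-case a categorical string Series by transforming only its categories.

    Hypothetical helper for illustration; assumes the upper-cased
    categories remain unique (rename_categories raises otherwise).
    """
    # The string operation runs once per distinct category,
    # not once per row.
    new_categories = s.cat.categories.str.upper()
    # The integer codes are reused unchanged; only the labels are swapped.
    return s.cat.rename_categories(new_categories)


s = pd.Series(["a", "b", "a", None], dtype="category")
result = upper_via_categories(s)
```

With two distinct categories the string operation runs twice regardless of the number of rows, which is where the speedup over an object array would come from.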
@jreback Yep, somehow I missed that in your earlier comment. To complete my earlier benchmarks:
I think it would be useful to have an inverse equivalent to |
There seems to be no direct way to return to the original dtype, and the documentation recommends: "To get back to the original Series or numpy array, use Series.astype(original_dtype) or np.asarray(categorical)"
That's slow, and a `decode` or `decat` method would be trivial:
I was working with ~10 categories (some of them longer strings) on a dataset of 20 million rows, where the difference was even bigger (unfortunately I can't reproduce it with dummy data), and using `astype` took minutes, which felt more like a bug than merely a performance issue.
Given the current limitations on exporting categorical data, having a fast `decode` method would be very convenient. Since the categories are most often strings, an optional parameter for direct character set encoding would also be good to have for such a method.
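For reference, a rough sketch of what such a method could look like, written as a standalone function. This is not the snippet from the original post and not a pandas API: the name `decode`, its `encoding` parameter, and the choice of `np.nan` for missing values are all assumptions. It uses only `Categorical.categories`, `Categorical.codes`, and `numpy.ndarray.take`.

```python
import numpy as np
import pandas as pd


def decode(cat: pd.Categorical, encoding=None) -> np.ndarray:
    """Expand a Categorical back into an object ndarray (hypothetical helper).

    If `encoding` is given, each distinct category is encoded to bytes
    once, before the expansion, instead of once per row.
    """
    categories = np.asarray(cat.categories, dtype=object)
    if encoding is not None:
        # Assumes every category is a str; encode once per category.
        categories = np.array(
            [c.encode(encoding) for c in categories], dtype=object
        )

    codes = cat.codes
    if len(categories) == 0:
        # No categories at all, so every value is missing.
        return np.full(len(codes), np.nan, dtype=object)

    # Code -1 marks a missing value; take() would wrap it around to the
    # last category, so those positions are overwritten afterwards.
    out = categories.take(codes)
    out[codes == -1] = np.nan
    return out


cat = pd.Categorical(["x", "y", None, "x"])
decoded = decode(cat)
decoded_bytes = decode(cat, encoding="utf-8")
```

The expansion is a single integer-indexed lookup, and any per-string work such as encoding is done once per category rather than once per row.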