ENH: decode for Categoricals #8628

Closed
fkaufer opened this issue Oct 24, 2014 · 5 comments · Fixed by #37355
Labels
API Design · Categorical (Categorical Data Type) · Performance (Memory or execution speed performance)

@fkaufer
fkaufer commented Oct 24, 2014

There seems to be no direct way to return to the original dtype, and the documentation recommends: "To get back to the original Series or numpy array, use Series.astype(original_dtype) or np.asarray(categorical)"

That's slow and a decode or decat method would be trivial:

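import numpy as np
import pandas as pd
# 1e6 rows x 4 columns of random letters; sizes are ints because current NumPy rejects float sizes such as 4e6
df = pd.DataFrame(np.random.choice(list(u'abcde'), 4000000).reshape(1000000, 4),
    columns=list(u'ABCD'))
for col in df.columns: df[col] = df[col].astype('category')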

%timeit for col in df.columns: df[col].astype('unicode')      
1 loops, best of 3: 1.06 s per loop

%timeit for col in df.columns: cats=df[col].cat.categories; cats[df[col].cat.codes]    
10 loops, best of 3: 33.2 ms per loop   
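
The codes-into-categories lookup timed above can be wrapped into a small helper. This is only a sketch: `decode_categorical` is a hypothetical name, not a pandas method, and it assumes missing values (code -1) should come back as NaN in an object-dtype result.

```python
import numpy as np
import pandas as pd


def decode_categorical(series):
    """Hypothetical helper (not a pandas API): map codes back to category values."""
    categories = np.asarray(series.cat.categories, dtype=object)
    codes = np.asarray(series.cat.codes)
    # Start from all-missing, then fill the rows that have a real code.
    values = np.full(len(codes), np.nan, dtype=object)
    present = codes >= 0  # pandas stores missing values as code -1
    values[present] = categories[codes[present]]
    return pd.Series(values, index=series.index, name=series.name, dtype=object)


s = pd.Series(["a", "b", None, "a"], dtype="category")
decoded = decode_categorical(s)
print(decoded.tolist())
```

The mask matters because a plain `categories[codes]` would silently map -1 to the last category.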

I was working with ~10 categories (some of them longer strings) on a dataset of 20 million rows, where the difference was even bigger (unfortunately I can't reproduce it with dummy data). There, astype took minutes, which felt more like a bug than a mere performance issue.

Given the current limitations on exporting categorical data, having a fast decode method would be very convenient. Since categories are most often strings, an optional parameter for direct character set encoding would also be good to have for such a method.

%timeit for col in df.columns: df[col].astype('unicode').str.encode('latin1')  
1 loops, best of 3: 3.95 s per loop
%timeit for col in df.columns: cats=pd.Series(df[col].cat.categories).str.encode('latin1'); cats[df[col].cat.codes]                                                                  
10 loops, best of 3: 74.5 ms per loop   
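
The encoding variant follows the same idea: encode each category once, then expand by code. A minimal sketch, assuming a hypothetical `decode_encoded` helper and assuming missing values should become `None`:

```python
import numpy as np
import pandas as pd


def decode_encoded(series, encoding="latin1"):
    """Hypothetical helper: decode a string categorical straight to bytes."""
    # Encode each distinct category once (a handful of strings) ...
    encoded = np.array(
        [c.encode(encoding) for c in series.cat.categories], dtype=object
    )
    codes = np.asarray(series.cat.codes)
    # ... then expand to full length with a single integer lookup.
    values = np.full(len(codes), None, dtype=object)
    present = codes >= 0  # code -1 marks a missing value
    values[present] = encoded[codes[present]]
    return values


s = pd.Series(["\xe9", "a", "\xe9"], dtype="category")
raw = decode_encoded(s)
```

The cost of `str.encode` then scales with the number of categories rather than the number of rows.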
@jreback
Contributor

jreback commented Oct 24, 2014

I agree it would be nicer to support this directly (and it is quite easy). Here is how:

.astype simply needs to do the following. It is essentially not implemented at the moment and falls back to object-dtype behavior, which is completely non-performant when you are doing things like this:

In [14]: pd.Categorical.from_codes(df['A'].cat.codes,categories=df['A'].cat.categories.astype('unicode'))
Out[14]: 
[b, d, b, a, d, ..., c, a, a, d, a]
Length: 1000000
Categories (5, object): [a, b, c, d, e]

In [15]: %timeit pd.Categorical.from_codes(df['A'].cat.codes,categories=df['A'].cat.categories.astype('unicode'))  
100 loops, best of 3: 3.74 ms per loop
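
The same from_codes pattern applies to any dtype change on the categories. A small sketch with integer categories converted to strings (the sample data is made up for illustration; the result stays categorical, as in the output above):

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 10, 30], dtype="category")

# Convert only the three categories to strings, then reattach the unchanged codes.
converted = pd.Categorical.from_codes(
    np.asarray(s.cat.codes),
    categories=s.cat.categories.astype(str),
)
result = pd.Series(converted, index=s.index)
```

Only the categories go through the conversion, so the work does not grow with the number of rows.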

I'd love for you to make a change to astype to support this; it's pretty straightforward.

jreback added the API Design, Performance (Memory or execution speed performance), and Categorical (Categorical Data Type) labels on Oct 24, 2014
jreback added this to the 0.15.1 milestone on Oct 24, 2014
@shoyer
Member

shoyer commented Oct 25, 2014

np.asarray is pretty fast, only slightly slower than doing it manually:

%timeit np.asarray(df['A'].cat.categories[df['A'].cat.codes])
100 loops, best of 3: 9.91 ms per loop
%timeit np.asarray(df['A'])
100 loops, best of 3: 13.3 ms per loop

The slow part is converting from object dtype back into unicode, even for plain numpy ndarrays (though I'm not sure why exactly you want that, given that pandas normally uses object arrays for strings):

x = np.asarray(df['A'])
%timeit np.asarray(x, unicode)
10 loops, best of 3: 112 ms per loop

@jreback
Contributor

jreback commented Oct 25, 2014

Right, but if you know you have a categorical, then you only need to do string ops on a (relatively speaking) very small amount of data compared to object arrays.

@shoyer
Member

shoyer commented Oct 25, 2014

@jreback Yep, somehow I missed that in your earlier comment.

To complete my earlier benchmarks:

%timeit df['A'].cat.categories.astype(unicode)[df['A'].cat.codes]
100 loops, best of 3: 8.34 ms per loop
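
A rough way to rerun this comparison on Python 3 is sketched below. It assumes `str` in place of the Python 2 `unicode` type used in the thread, uses a smaller row count so it finishes quickly, and the function names are made up for illustration. Timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

n_rows = 100_000  # assumption: smaller than the thread's 1e6 rows so it runs quickly
col = pd.Series(np.random.choice(list("abcde"), n_rows)).astype("category")
codes = np.asarray(col.cat.codes)


def via_object_array():
    # Materialize every row first, then convert all rows to a unicode array.
    return np.asarray(np.asarray(col), dtype=str)


def via_categories():
    # Convert only the categories, then expand with the integer codes.
    return np.asarray(col.cat.categories.astype(str))[codes]


for fn in (via_object_array, via_categories):
    seconds = min(timeit.repeat(fn, number=3, repeat=3)) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```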

jreback modified the milestones: 0.15.2, 0.16.0 on Nov 29, 2014
jreback modified the milestones: 0.16.0, Next Major Release on Mar 6, 2015
@wesm
Member

wesm commented Jul 6, 2018

I think it would be useful to have an inverse equivalent to factorize and categorical conversions
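
For factorize, such an inverse could look roughly like the sketch below. `unfactorize` is a hypothetical name rather than an existing pandas function, and returning an object array with NaN for missing values is an assumption.

```python
import numpy as np
import pandas as pd


def unfactorize(codes, uniques, na_value=np.nan):
    """Hypothetical inverse of pd.factorize (not a pandas function)."""
    codes = np.asarray(codes)
    uniques = np.asarray(uniques, dtype=object)
    out = np.full(len(codes), na_value, dtype=object)
    present = codes >= 0  # factorize marks missing values with -1
    out[present] = uniques[codes[present]]
    return out


values = np.array(["b", "a", None, "b"], dtype=object)
codes, uniques = pd.factorize(values)
restored = unfactorize(codes, uniques)
```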

jreback modified the milestones: Contributions Welcome, 1.2 on Oct 31, 2020