ENH: decode for Categoricals #8628

fkaufer · 2014-10-24T22:10:05Z

There seems to be no direct way to return to the original dtype, and the documentation recommends: "To get back to the original Series or numpy array, use Series.astype(original_dtype) or np.asarray(categorical)"

That's slow and a decode or decat method would be trivial:

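import numpy as np
import pandas as pd

# integer sizes: newer numpy versions reject float sizes/shapes such as 4e6
df = pd.DataFrame(np.random.choice(list(u'abcde'), 4000000).reshape(1000000, 4),
    columns=list(u'ABCD'))
for col in df.columns: df[col] = df[col].astype('category')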

%timeit for col in df.columns: df[col].astype('unicode')      
1 loops, best of 3: 1.06 s per loop

%timeit for col in df.columns: cats=df[col].cat.categories; cats[df[col].cat.codes]    
10 loops, best of 3: 33.2 ms per loop
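
A minimal sketch of what such a decode helper might look like. The name `decode` is hypothetical and not part of pandas. One caveat it tries to address: plain `cats[codes]` maps the missing-value code -1 to the last category, so the sketch masks those positions explicitly.

```python
import numpy as np
import pandas as pd


def decode(series):
    """Hypothetical helper: rebuild plain values from a categorical Series."""
    codes = series.cat.codes.to_numpy()
    categories = np.asarray(series.cat.categories, dtype=object)
    # take() returns a new array; a code of -1 wraps around to the last category
    values = categories.take(codes)
    # so restore missing values explicitly
    values[codes == -1] = np.nan
    return pd.Series(values, index=series.index, name=series.name)


s = pd.Series(["a", "b", None, "a"], dtype="category")
decoded = decode(s)
```

Only the small categories array is touched per distinct value; the per-row work is a single integer lookup.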

I was working with ~10 categories (some of them longer strings) on a dataset with 20 million rows, where the difference was even bigger (unfortunately I can't reproduce it with dummy data). There, astype took minutes, which felt like a bug rather than just a performance issue.

Given the current limitations on exporting categorical data, having a fast decode method would be very convenient. Since categories are most often strings, an optional parameter for direct character set encoding would also be good to have for such a method.

%timeit for col in df.columns: df[col].astype('unicode').str.encode('latin1')  
1 loops, best of 3: 3.95 s per loop
%timeit for col in df.columns: cats=pd.Series(df[col].cat.categories).str.encode('latin1'); cats[df[col].cat.codes]                                                                  
10 loops, best of 3: 74.5 ms per loop
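
The encoding variant can be sketched the same way: encode each category once, then expand through the codes. This assumes all categories are `str`; the function name `decode_encoded` and its `encoding` parameter are hypothetical, not an existing pandas API.

```python
import numpy as np
import pandas as pd


def decode_encoded(series, encoding="latin1"):
    """Hypothetical helper: decode a categorical Series straight to bytes."""
    codes = series.cat.codes.to_numpy()
    # encode each distinct category exactly once
    encoded = np.array(
        [c.encode(encoding) for c in series.cat.categories], dtype=object
    )
    out = encoded.take(codes)
    # code -1 marks a missing value; take() would have wrapped it around
    out[codes == -1] = None
    return out


s = pd.Series(["é", "a", None, "é"], dtype="category")
raw = decode_encoded(s)
```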


jreback · 2014-10-24T23:46:55Z

I agree it would be nicer to support this directly (and it is quite easy). Here is how:

.astype simply needs to do this (it's essentially not implemented at the moment and falls back to object-dtype behavior, which is completely non-performant when you are doing things like this):

In [14]: pd.Categorical.from_codes(df['A'].cat.codes,categories=df['A'].cat.categories.astype('unicode'))
Out[14]: 
[b, d, b, a, d, ..., c, a, a, d, a]
Length: 1000000
Categories (5, object): [a, b, c, d, e]

In [15]: %timeit pd.Categorical.from_codes(df['A'].cat.codes,categories=df['A'].cat.categories.astype('unicode'))  
100 loops, best of 3: 3.74 ms per loop

I'd love for you to make a change to astype to support this; it's pretty straightforward.
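
The idea above, converting only the categories and reusing the codes unchanged, as a small sketch. The helper name `astype_categories` is hypothetical, and the integer-to-string conversion is just an illustrative assumption.

```python
import pandas as pd


def astype_categories(series, dtype):
    """Hypothetical helper: change the dtype of the categories, keep the codes."""
    cat = series.cat
    # the conversion runs over the (few) categories, not over every row
    new_categories = cat.categories.astype(dtype)
    values = pd.Categorical.from_codes(
        cat.codes.to_numpy(), categories=new_categories, ordered=cat.ordered
    )
    return pd.Series(values, index=series.index, name=series.name)


s = pd.Series([10, 20, 10, 30], dtype="category")
converted = astype_categories(s, str)
```

The result is still categorical; only its categories have changed type. This relies on the converted categories remaining unique, which `from_codes` requires.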

shoyer · 2014-10-25T01:02:52Z

np.asarray is pretty fast, only slightly slower than doing it manually:

%timeit np.asarray(df['A'].cat.categories[df['A'].cat.codes])
100 loops, best of 3: 9.91 ms per loop
%timeit np.asarray(df['A'])
100 loops, best of 3: 13.3 ms per loop

The slow part is converting even numpy ndarrays from object dtype back into unicode (though I'm not sure why exactly you want that, given that pandas normally uses object arrays for strings):

x = np.asarray(df['A'])
%timeit np.asarray(x, unicode)
10 loops, best of 3: 112 ms per loop

jreback · 2014-10-25T01:09:55Z

Right, but if you know you have a categorical, then you only need to do str ops on a very small amount of data, relatively speaking (compared to object arrays).

shoyer · 2014-10-25T01:13:49Z

@jreback Yep, somehow I missed that in your earlier comment.

To complete my earlier benchmarks:

%timeit df['A'].cat.categories.astype(unicode)[df['A'].cat.codes]
100 loops, best of 3: 8.34 ms per loop

wesm · 2018-07-06T22:14:29Z

I think it would be useful to have an inverse equivalent to factorize and categorical conversions
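
For illustration, a rough sketch of what an inverse of `pd.factorize` could look like. The name `unfactorize` is hypothetical; no such function is claimed to exist in pandas.

```python
import numpy as np
import pandas as pd


def unfactorize(codes, uniques):
    """Hypothetical inverse of pd.factorize: expand codes back into values."""
    codes = np.asarray(codes)
    uniques = np.asarray(uniques, dtype=object)
    out = np.empty(len(codes), dtype=object)
    valid = codes >= 0
    out[valid] = uniques[codes[valid]]
    # factorize uses -1 as the sentinel for missing values
    out[~valid] = np.nan
    return out


values = np.array(["b", "a", None, "b"], dtype=object)
codes, uniques = pd.factorize(values)
restored = unfactorize(codes, uniques)
```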

jreback added the API Design, Performance (Memory or execution speed performance), and Categorical (Categorical Data Type) labels Oct 24, 2014

jreback added this to the 0.15.1 milestone Oct 24, 2014

This was referenced Oct 25, 2014

ENH: categorical dataexport - graceful degradation #8633

Closed

ENH: Categorical serialized #7621

Closed

jreback modified the milestones: 0.15.2, 0.16.0 Nov 29, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

arw2019 mentioned this issue Oct 23, 2020

PERF/ENH: add fast astyping for Categorical #37355

Merged

6 tasks

jreback modified the milestones: Contributions Welcome, 1.2 Oct 31, 2020

jreback closed this as completed in #37355 Nov 18, 2020
