ENH: serialization of categorical to HDF5 (GH7621) #8793

jreback · 2014-11-12T01:55:30Z

This is implemented by storing the codes directly in the table, plus a metadata VLArray of the categories.
Querying and appending work as expected. The only quirk is that I don't allow appending to a table unless the new data has exactly the same categories; otherwise the codes become meaningless.
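To illustrate why the append restriction matters, here is a minimal pure-Python sketch. It is not the code in this PR: `encode` and `check_appendable` are hypothetical helpers, and treating "exactly the same categories" as "same labels in the same order" is an assumption.

```python
def encode(values, categories):
    """Map each value to its position in `categories` (its code)."""
    lookup = {cat: code for code, cat in enumerate(categories)}
    return [lookup[v] for v in values]


def check_appendable(stored_categories, new_categories):
    """Refuse an append unless the category lists match exactly, order included."""
    if list(stored_categories) != list(new_categories):
        raise ValueError("cannot append: categories differ from those already stored")


stored = ["a", "b", "c"]
same = ["a", "b", "c"]
different = ["b", "c", "d"]

check_appendable(stored, same)  # passes silently

# The same label gets a different code under each list, so codes written
# under `different` would be misread when decoded with `stored`.
code_under_stored = encode(["b"], stored)[0]
code_under_different = encode(["b"], different)[0]

try:
    check_appendable(stored, different)
    rejected = False
except ValueError:
    rejected = True
```

Here the label `'b'` would be written as a different integer depending on which category list is in force, which is the sense in which the codes "become meaningless" after a mismatched append.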

This has the nice property of drastically shrinking the storage cost compared to regular strings (which are stored as fixed width of the maximum for that particular column).

I bumped the actual HDF5 storage version to current (was 0.10.1). It's not strictly necessary, as this is a completely optional feature, but I am adding the sub-group space 'meta' (which, FYI, we can use for other things, e.g. to store the column labels and avoid the 64KB limit in attrs; there is an issue about this somewhere).

In [14]: df = DataFrame({'a' : Series(list('abccdef')).astype('category'), 'b' : np.random.randn(7)})

In [15]: df
Out[15]: 
   a         b
0  a -0.094609
1  b -1.814638
2  c  0.214974
3  c -0.195395
4  d  0.206022
5  e  1.130589
6  f -0.832810

In [19]: store = pd.HDFStore('test.h5',mode='w')

In [20]: store.append('df',df,data_columns=['a'])

In [21]: store.select('df',where=["a in ['b','d']"])
Out[21]: 
   a         b
1  b -1.814638
4  d  0.206022

In [22]: store.select('df',where=["a in ['b','d']"]).dtypes
Out[22]: 
a    category
b     float64
dtype: object
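Since column `a` is stored as integer codes, a `where` on labels has to be matched against codes somehow. The thread does not spell out how the query is rewritten, so the following is only a pure-Python sketch of one plausible translation, assuming the categories are the sorted unique labels:

```python
values = list("abccdef")

# Assumed category order: sorted unique labels.
categories = sorted(set(values))
lookup = {label: code for code, label in enumerate(categories)}

# What the table column would hold under that assumption.
codes = [lookup[v] for v in values]

# Translate the requested labels into codes, then filter on the codes.
wanted = ["b", "d"]
wanted_codes = {lookup[w] for w in wanted if w in lookup}
matching_rows = [i for i, c in enumerate(codes) if c in wanted_codes]
```

Under these assumptions the matching row positions are 1 and 4, the same rows returned by the `select` above.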

In [25]: store.get_storer('df').group
Out[25]: 
/df (Group) u''
  children := ['table' (Table), 'meta' (Group)]

In [26]: store.get_storer('df').group.table
Out[26]: 
/df/table (Table(7,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "a": Int8Col(shape=(), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (3855,)
  autoindex := True
  colindexes := {
    "a": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

In [27]: store.get_storer('df').group.meta 
Out[27]: 
/df/meta (Group) u''
  children := ['a' (VLArray)]

jreback · 2014-11-12T01:56:46Z

cc @JanSchulz
cc @bashtage
cc @shoyer
cc @immerrr

jankatins · 2014-11-13T18:58:27Z

Not sure what to say here: I've no expertise in PyTables, sorry... :-/

bashtage · 2014-11-13T23:28:39Z

Does using VLArray affect performance with compression? Fixed-length strings can be compressed, while I think VLArray data cannot.

jreback · 2014-11-14T14:02:15Z

@bashtage I actually changed this back to a regular Array, really for more 'visibility', e.g. you can actually inspect these objects, whereas VLArray objects get pickled. I don't really think there is any actual perf issue. This is just a single array of the categories, which is usually much, much smaller than the table.

jreback · 2014-11-14T14:07:05Z

This allows future expandability, because the array can then be 2-d, for example.
cc @shoyer

In [2]: s = Series(list('aabbcdedfab')).astype('category').to_hdf('test.h5','s',mode='w',format='table')

In [3]: !ptdump -avd test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/s (Group) ''
  /s._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['values'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'values': {'ordered': True}, 'index': {}},
    levels := 1,
    metadata := ['values'],
    nan_rep := 'nan',
    non_index_axes := [(1, ['values'])],
    pandas_type := 'series_table',
    pandas_version := '0.15.2',
    table_type := 'appendable_series',
    values_cols := ['values']]
/s/table (Table(11,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values": Int8Col(shape=(), dflt=0, pos=1)}
  byteorder := 'little'
  chunkshape := (7281,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "values": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /s/table._v_attrs (AttributeSet), 11 attributes:  [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0,
    FIELD_1_NAME := 'values',
    NROWS := 11,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer',
    values_dtype := 'category',
    values_kind := ['values']]
  Data dump:
[0] (0, 0)
[1] (1, 0)
[2] (2, 1)
[3] (3, 1)
[4] (4, 2)
[5] (5, 3)
[6] (6, 4)
[7] (7, 3)
[8] (8, 5)
[9] (9, 0)
[10] (10, 1)
/s/meta (Group) ''
  /s/meta._v_attrs (AttributeSet), 3 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0']
/s/meta/values (Array(6,)) ''
  atom := StringAtom(itemsize=1, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
  /s/meta/values._v_attrs (AttributeSet), 5 attributes:
   [CLASS := 'ARRAY',
    FLAVOR := 'numpy',
    TITLE := '',
    VERSION := '2.4',
    kind := 'string']
  Data dump:
[0] a
[1] b
[2] c
[3] d
[4] e
[5] f
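The two data dumps above (integer codes in `/s/table`, labels in `/s/meta/values`) can be reproduced with a small pure-Python sketch. This is an illustration of the encoding only, not pandas' implementation, and it assumes the categories are the sorted unique labels, which matches the dump:

```python
values = list("aabbcdedfab")

# Categories: sorted unique labels (assumption consistent with the dump above).
categories = sorted(set(values))

# Codes: the position of each value within `categories`.
lookup = {label: code for code, label in enumerate(categories)}
codes = [lookup[v] for v in values]

# Round trip: codes plus categories recover the original values.
decoded = [categories[c] for c in codes]
```

The `codes` list corresponds to the second field of each row in the table dump, and `categories` corresponds to the six entries of the `/s/meta/values` array.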

bashtage · 2014-11-14T14:51:37Z

That change makes sense. And with compression large chunks of whitespace might be less of an issue anyway.

jreback · 2014-11-14T15:03:03Z

For example (the saving is actually a function of the max_length of the strings stored):

In [27]: df = DataFrame({'A' : np.random.randn(5), 'B' : Series(['a','foo','bar','a really long string','baz'])})

In [28]: df_cat = df.copy()

In [29]: df_cat['B'] = df_cat['B'].astype('category')

In [30]: pd.concat([df]*10000).to_hdf('test1.h5','df',mode='w',format='table')

In [31]: pd.concat([df_cat]*10000).to_hdf('test_cat.h5','df',mode='w',format='table')

In [33]: !ls -ltr *.h5
-rw-rw-r--  1 jreback  staff  1876493 Nov 14 10:02 test1.h5
-rw-rw-r--  1 jreback  staff   895756 Nov 14 10:02 test_cat.h5
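A rough back-of-envelope check of those file sizes, as a sketch only: it assumes one byte per character for the fixed-width string column and one byte per int8 code, and it ignores the index, the float column, chunking and any other HDF5 overhead.

```python
labels = ["a", "foo", "bar", "a really long string", "baz"]
n_rows = len(labels) * 10000            # the frame is repeated 10000 times

# Fixed-width string column: every row costs the length of the longest label.
width = max(len(s) for s in labels)
string_bytes = n_rows * width

# Categorical column: one int8 code per row, plus the categories stored once.
code_bytes = n_rows * 1
categories_bytes = len(set(labels)) * width

estimated_saving = string_bytes - (code_bytes + categories_bytes)

# Difference between the two file sizes in the listing above.
observed_saving = 1876493 - 895756
```

The estimate comes out at roughly 950 KB against an observed difference of roughly 981 KB, so the longest string in the column accounts for most of the gap between the two files.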

jreback · 2014-11-14T18:11:15Z

@jorisvandenbossche ?

jorisvandenbossche · 2014-11-15T11:02:10Z

doc/source/categorical.rst

+.. versionadded:: 0.15.2
+
+Writing data (`Series`, `Frames`) to a HDF store that contains a ``category`` dtype was implemented
+in 0.15.2. Queries work the same as if it was an object array (but the ``Categorical`` is store in a more efficient manner)


store -> stored

jorisvandenbossche · 2014-11-15T11:11:03Z

Is this a format change? What will happen if someone uses an older version of pandas to read an HDF file that was saved with 0.15.2 (or was such a thing never supported)?

jreback · 2014-11-15T16:22:51Z

OK, I updated to make this more explicit. It is now backwards AND forwards compatible, in that you can read a file written in 0.15.2 or later in a prior version.

You will get the codes in the table (as that is how they are stored).
The categories are now stored as a regular pathed array, so they can also be retrieved.
So it is lossless in a forward way (but requires the user to apply the categories themselves, as the Categorical type did not exist prior to 0.15.0).

The following is in 0.15.2:

In [1]:    dfc = DataFrame({ 'A' : Series(list('aabbcdba')).astype('category'),
   ...:                      'B' : np.random.randn(8) })

In [2]:    store = pd.HDFStore('test.h5', mode='w')

In [3]:    store.append('df', dfc, format='table', data_columns=['A'])

In [4]:    result = store.select('df', where="A in ['b','c']")

In [5]:    result
Out[5]: 
   A         B
2  b  0.259910
3  b -0.489301
4  c -1.681019
6  b -2.147062

In [6]:    result.dtypes
Out[6]: 
A    category
B     float64
dtype: object

In [7]: store
Out[7]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df                        frame_table  (typ->appendable,nrows->8,ncols->2,indexers->[index],dc->[A])
/df/meta/A/meta            series       (shape->[1])                                                 

In [8]: store.select('df/meta/A/meta')
Out[8]: 
0    a
1    b
2    c
3    d
dtype: object

And reading the same file in 0.15.1:

In [1]: store = pd.HDFStore('pandas/test.h5')

In [2]: store
Out[2]: 
<class 'pandas.io.pytables.HDFStore'>
File path: pandas/test.h5
/df                        frame_table  (typ->appendable,nrows->8,ncols->2,indexers->[index],dc->[A])
/df/meta/A/meta            series       (shape->[1])                                                 

In [3]: store.select('df')
Out[3]: 
   A         B
0  0 -0.906125
1  0  1.324821
2  1  0.259910
3  1 -0.489301
4  2 -1.681019
5  3  0.711411
6  1 -2.147062
7  0  0.797939

In [4]: store.select('df').dtypes
Out[4]: 
A       int8
B    float64
dtype: object

In [5]: store.select('df/meta/A/meta')
Out[5]: 
0    a
1    b
2    c
3    d
dtype: object
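In the older version, putting the labels back is a plain lookup of each code in the categories selected from the meta path. A minimal pure-Python sketch using the values shown above; `decode` is a hypothetical helper, and treating a code of -1 as a missing value follows the pandas `Categorical` convention, assumed here to carry over to the stored codes:

```python
codes = [0, 0, 1, 1, 2, 3, 1, 0]      # column A as read back in 0.15.1
categories = ["a", "b", "c", "d"]     # contents of df/meta/A/meta


def decode(codes, categories):
    """Replace each code with its label; a negative code is treated as missing."""
    return [categories[c] if c >= 0 else None for c in codes]


labels = decode(codes, categories)
```

The decoded labels match the `list('aabbcdba')` used to build column `A` in the 0.15.2 session.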

jreback · 2014-11-15T16:30:09Z

doc/source/categorical.rst

+
+   The format of the ``Categoricals` is readable by prior versions of pandas (< 0.15.2), but will retrieve
+   the data as an integer based column (e.g. the ``codes``). However, the ``categories`` *can* be retrieved
+   but require the user to select them manually using the explicity meta path.


I already did explicity -> explicit (but haven't pushed yet)

jorisvandenbossche · 2014-11-15T16:31:45Z

You now added some docs to categorical.rst, but maybe also add (or refer to) something in io.rst#pytables?

jreback · 2014-11-15T16:45:12Z

hmm ok sure

jreback · 2014-11-15T16:59:17Z

ok, fixed up

jreback · 2014-11-16T14:40:09Z

@jorisvandenbossche any further comments?

jorisvandenbossche · 2014-11-16T21:49:52Z

nope, no further comments! (but for the actual pytables interaction, I am not familiar with that)

jankatins mentioned this pull request Nov 12, 2014

ENH: Categorical serialized #7621

Closed

4 tasks

jreback added Categorical Categorical Data Type IO HDF5 read_hdf, HDFStore labels Nov 12, 2014

jreback added this to the 0.15.2 milestone Nov 12, 2014

jreback force-pushed the cat_hdf branch 3 times, most recently from 6e25082 to 199de84 Compare November 13, 2014 11:33

jreback force-pushed the cat_hdf branch from 199de84 to c64348a Compare November 13, 2014 23:10

jreback force-pushed the cat_hdf branch from c64348a to 33998ff Compare November 14, 2014 18:12

jorisvandenbossche reviewed Nov 15, 2014

jreback force-pushed the cat_hdf branch from 33998ff to ed8d131 Compare November 15, 2014 16:18

jreback reviewed Nov 15, 2014

jreback force-pushed the cat_hdf branch from ed8d131 to d6735d6 Compare November 15, 2014 16:58

jreback force-pushed the cat_hdf branch from d6735d6 to 37f2f21 Compare November 15, 2014 18:54

jreback force-pushed the cat_hdf branch from 37f2f21 to 7c43d8e Compare November 15, 2014 22:32

ENH: serialization of categorical to HDF5 (GH7621)

fa378ab

jreback force-pushed the cat_hdf branch from 7c43d8e to fa378ab Compare November 16, 2014 13:32

jreback added a commit that referenced this pull request Nov 17, 2014

Merge pull request #8793 from jreback/cat_hdf

e0680ec

ENH: serialization of categorical to HDF5 (GH7621)

jreback merged commit e0680ec into pandas-dev:master Nov 17, 2014
