ENH: serialization of categorical to HDF5 (GH7621) #8793

Merged
merged 1 commit into from
Nov 17, 2014

Conversation

jreback
Contributor

@jreback jreback commented Nov 12, 2014

This is implemented by storing the codes directly in the table, along with a metadata VLArray of the categories.
Querying and appending work as expected. The only quirk is that I don't allow you to append to a table unless the new data has exactly the same categories; otherwise the codes become meaningless.
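
To illustrate why appending with different categories is refused, here is a minimal sketch in plain pandas (no HDF5 involved). The `incoming` data and the `same_categories` helper are hypothetical and only stand in for the check described above; they are not the code in this PR.

```python
import pandas as pd

# Hypothetical data: the column from the example, plus a batch with other categories.
existing = pd.Series(list("abccdef")).astype("category")
incoming = pd.Series(list("xyz")).astype("category")

# What the table column would hold (integer codes) and what the
# metadata array would hold (the categories).
codes = existing.cat.codes.tolist()
categories = existing.cat.categories.tolist()


def same_categories(left, right):
    """Hypothetical helper: True only if both columns share identical categories."""
    return left.cat.categories.tolist() == right.cat.categories.tolist()


# Decoding the incoming codes against the stored categories gives wrong labels.
misdecoded = [categories[code] for code in incoming.cat.codes]

print(codes)
print(categories)
print(same_categories(existing, incoming))
print(misdecoded)
```

The incoming values `x`, `y`, `z` would be read back as `a`, `b`, `c`, which is why the append is rejected rather than silently re-coded.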

This has the nice property of drastically shrinking the storage cost compared to regular strings (which are stored at a fixed width equal to the longest string in that particular column).

I bumped the actual HDF5 storage version to current (was 0.10.1). It's not strictly necessary as this is a completely optional feature, but I am adding the sub-group space 'meta' (which FYI we can use for other things, e.g. to store the column labels and avoid the 64KB limit in attrs; there is an issue about this somewhere).

In [14]: df = DataFrame({'a' : Series(list('abccdef')).astype('category'), 'b' : np.random.randn(7)})

In [15]: df
Out[15]: 
   a         b
0  a -0.094609
1  b -1.814638
2  c  0.214974
3  c -0.195395
4  d  0.206022
5  e  1.130589
6  f -0.832810

In [19]: store = pd.HDFStore('test.h5',mode='w')

In [20]: store.append('df',df,data_columns=['a'])

In [21]: store.select('df',where=["a in ['b','d']"])
Out[21]: 
   a         b
1  b -1.814638
4  d  0.206022

In [22]: store.select('df',where=["a in ['b','d']"]).dtypes
Out[22]: 
a    category
b     float64
dtype: object

In [25]: store.get_storer('df').group
Out[25]: 
/df (Group) u''
  children := ['table' (Table), 'meta' (Group)]

In [26]: store.get_storer('df').group.table
Out[26]: 
/df/table (Table(7,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "a": Int8Col(shape=(), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (3855,)
  autoindex := True
  colindexes := {
    "a": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

In [27]: store.get_storer('df').group.meta 
Out[27]: 
/df/meta (Group) u''
  children := ['a' (VLArray)]

@jankatins jankatins mentioned this pull request Nov 12, 2014
4 tasks
@jreback
Contributor Author

jreback commented Nov 12, 2014

cc @JanSchulz
cc @bashtage
cc @shoyer
cc @immerrr

@jreback jreback added Categorical Categorical Data Type IO HDF5 read_hdf, HDFStore labels Nov 12, 2014
@jreback jreback added this to the 0.15.2 milestone Nov 12, 2014
@jreback jreback force-pushed the cat_hdf branch 3 times, most recently from 6e25082 to 199de84 Compare November 13, 2014 11:33
@jankatins
Contributor

Not sure what to say here: I have no expertise in PyTables, sorry... :-/

@bashtage
Contributor

Does using VLArray affect performance with compression? Fixed-length strings can be compressed, while I think VLArray data cannot.

@jreback
Contributor Author

jreback commented Nov 14, 2014

@bashtage I actually changed this back to a regular Array, really for more 'visibility', e.g. you can actually inspect these objects, whereas VLArray objects get pickled. I don't really think there is any actual perf issue. This is just a single array of the categories, and it is usually much, much smaller than the table itself.

@jreback
Contributor Author

jreback commented Nov 14, 2014

This allows future expandability because the array can then be 2-d, for example.
cc @shoyer

In [2]: s = Series(list('aabbcdedfab')).astype('category').to_hdf('test.h5','s',mode='w',format='table')

In [3]: !ptdump -avd test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/s (Group) ''
  /s._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['values'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'values': {'ordered': True}, 'index': {}},
    levels := 1,
    metadata := ['values'],
    nan_rep := 'nan',
    non_index_axes := [(1, ['values'])],
    pandas_type := 'series_table',
    pandas_version := '0.15.2',
    table_type := 'appendable_series',
    values_cols := ['values']]
/s/table (Table(11,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values": Int8Col(shape=(), dflt=0, pos=1)}
  byteorder := 'little'
  chunkshape := (7281,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "values": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /s/table._v_attrs (AttributeSet), 11 attributes:  [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0,
    FIELD_1_NAME := 'values',
    NROWS := 11,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer',
    values_dtype := 'category',
    values_kind := ['values']]
  Data dump:
[0] (0, 0)
[1] (1, 0)
[2] (2, 1)
[3] (3, 1)
[4] (4, 2)
[5] (5, 3)
[6] (6, 4)
[7] (7, 3)
[8] (8, 5)
[9] (9, 0)
[10] (10, 1)
/s/meta (Group) ''
  /s/meta._v_attrs (AttributeSet), 3 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0']
/s/meta/values (Array(6,)) ''
  atom := StringAtom(itemsize=1, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
  /s/meta/values._v_attrs (AttributeSet), 5 attributes:
   [CLASS := 'ARRAY',
    FLAVOR := 'numpy',
    TITLE := '',
    VERSION := '2.4',
    kind := 'string']
  Data dump:
[0] a
[1] b
[2] c
[3] d
[4] e
[5] f

@bashtage
Contributor

That change makes sense. And with compression large chunks of whitespace might be less of an issue anyway.

@jreback
Contributor Author

jreback commented Nov 14, 2014

For example: the saving is actually a function of the maximum length of the strings stored.

In [27]: df = DataFrame({'A' : np.random.randn(5), 'B' : Series(['a','foo','bar','a really long string','baz'])})

In [28]: df_cat = df.copy()

In [29]: df_cat['B'] = df_cat['B'].astype('category')

In [30]: pd.concat([df]*10000).to_hdf('test1.h5','df',mode='w',format='table')

In [31]: pd.concat([df_cat]*10000).to_hdf('test_cat.h5','df',mode='w',format='table')

In [33]: !ls -ltr *.h5
-rw-rw-r--  1 jreback  staff  1876493 Nov 14 10:02 test1.h5
-rw-rw-r--  1 jreback  staff   895756 Nov 14 10:02 test_cat.h5
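
These file sizes are roughly what a back-of-the-envelope estimate predicts. The sketch below is only an approximation under stated assumptions: 8 bytes each for the index and the float column, a string column as wide as the longest value, one byte per category code, and a fixed-width categories array. It ignores HDF5 metadata, column indexes and chunk padding, which presumably account for the remaining difference.

```python
# Rough raw-size estimate for the two files; not a measurement.
values = ["a", "foo", "bar", "a really long string", "baz"]
n_rows = len(values) * 10000  # pd.concat([df] * 10000)

index_bytes = 8   # assumed Int64Col for the index
float_bytes = 8   # assumed Float64Col for column A
string_width = max(len(v) for v in values)  # fixed width = longest string
code_bytes = 1    # assumed Int8Col holding the category codes

# Column B stored as fixed-width strings.
as_strings = n_rows * (index_bytes + float_bytes + string_width)

# Column B stored as codes, plus one small array holding the categories.
categories_array = len(set(values)) * string_width
as_codes = n_rows * (index_bytes + float_bytes + code_bytes) + categories_array

print(as_strings)
print(as_codes)
```

Under these assumptions the string version needs about 1.8 MB of raw row data and the categorical version about 0.85 MB, close to the 1876493 and 895756 bytes listed above.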

@jreback
Contributor Author

jreback commented Nov 14, 2014

@jorisvandenbossche ?

.. versionadded:: 0.15.2

Writing data (`Series`, `Frames`) to a HDF store that contains a ``category`` dtype was implemented
in 0.15.2. Queries work the same as if it was an object array (but the ``Categorical`` is store in a more efficient manner)
Member

store -> stored

@jorisvandenbossche
Member

Is this a format change? What will happen if someone wants to read an HDF file saved with 0.15.2 using an older version of pandas (or was such a thing never supported)?

@jreback
Contributor Author

jreback commented Nov 15, 2014

Ok, I updated to make this more explicit. It is now backwards AND forwards compatible, in that you can read a file written by >= 0.15.2 in a prior version.

You will get the codes in the table (as that is how they are stored).
The categories are now stored as a regular pathed array, so they can also be retrieved.
So it is lossless in a forward way (but requires the user to apply the categories by hand, as the Categorical type did not exist prior to 0.15.0).

The following in 0.15.2

In [1]:    dfc = DataFrame({ 'A' : Series(list('aabbcdba')).astype('category'),
   ...:                      'B' : np.random.randn(8) })

In [2]:    store = pd.HDFStore('test.h5', mode='w')

In [3]:    store.append('df', dfc, format='table', data_columns=['A'])

In [4]:    result = store.select('df', where="A in ['b','c']")

In [5]:    result
Out[5]: 
   A         B
2  b  0.259910
3  b -0.489301
4  c -1.681019
6  b -2.147062

In [6]:    result.dtypes
Out[6]: 
A    category
B     float64
dtype: object

In [7]: store
Out[7]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df                        frame_table  (typ->appendable,nrows->8,ncols->2,indexers->[index],dc->[A])
/df/meta/A/meta            series       (shape->[1])                                                 

In [8]: store.select('df/meta/A/meta')
Out[8]: 
0    a
1    b
2    c
3    d
dtype: object

and in 0.15.1 reading the same file

In [1]: store = pd.HDFStore('pandas/test.h5')

In [2]: store
Out[2]: 
<class 'pandas.io.pytables.HDFStore'>
File path: pandas/test.h5
/df                        frame_table  (typ->appendable,nrows->8,ncols->2,indexers->[index],dc->[A])
/df/meta/A/meta            series       (shape->[1])                                                 

In [3]: store.select('df')
Out[3]: 
   A         B
0  0 -0.906125
1  0  1.324821
2  1  0.259910
3  1 -0.489301
4  2 -1.681019
5  3  0.711411
6  1 -2.147062
7  0  0.797939

In [4]: store.select('df').dtypes
Out[4]: 
A       int8
B    float64
dtype: object

In [5]: store.select('df/meta/A/meta')
Out[5]: 
0    a
1    b
2    c
3    d
dtype: object
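
The two pieces read back above (the integer codes in column `A` and the categories under the meta path) can be recombined by hand. This is a minimal sketch, assuming a pandas version that has `Categorical.from_codes`; the values are typed in from the output above rather than read from the store.

```python
import pandas as pd

# Codes as they appear in column A when read without categorical support.
codes = pd.Series([0, 0, 1, 1, 2, 3, 1, 0], dtype="int8")

# Categories as returned by selecting the meta path.
categories = pd.Series(["a", "b", "c", "d"])

# Map each code back to its category label.
restored = pd.Categorical.from_codes(codes.to_numpy(), categories=list(categories))

print(restored.tolist())
```

The result matches the original `list('aabbcdba')` column written in 0.15.2.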


The format of the ``Categoricals` is readable by prior versions of pandas (< 0.15.2), but will retrieve
the data as an integer based column (e.g. the ``codes``). However, the ``categories`` *can* be retrieved
but require the user to select them manually using the explicity meta path.
Contributor Author

I already did explicity -> explicit (but haven't pushed yet)

@jorisvandenbossche
Member

You have now added some docs to categorical.rst, but maybe also add (or refer to) something in io.rst#pytables?

@jreback
Contributor Author

jreback commented Nov 15, 2014

hmm ok sure

@jreback
Contributor Author

jreback commented Nov 15, 2014

ok, fixed up

@jreback
Contributor Author

jreback commented Nov 16, 2014

@jorisvandenbossche any further comments?

@jorisvandenbossche
Member

nope, no further comments! (but for the actual pytables interaction, I am not familiar with that)

jreback added a commit that referenced this pull request Nov 17, 2014
ENH: serialization of categorical to HDF5 (GH7621)
@jreback jreback merged commit e0680ec into pandas-dev:master Nov 17, 2014