ENH: serialization of categorical to HDF5 (GH7621) #8793
cc @JanSchulz |
Not sure what to say here: I've no expertise in pytable, sorry... :-/ |
Does using VLarray affect performance with compression? Fixed length strings can be compressed while I think VLArray data cannot. |
@bashtage I actually changed this back to a regular Array, really for more 'visibility', e.g. you can actually inspect these objects, whereas a |
this allows future expandability because the array can then be 2-d, for example
|
That change makes sense. And with compression large chunks of whitespace might be less of an issue anyway. |
for example. It's actually a function of the max_length of the strings stored.
|
.. versionadded:: 0.15.2 | ||
|
||
Writing data (`Series`, `Frames`) to a HDF store that contains a ``category`` dtype was implemented | ||
in 0.15.2. Queries work the same as if it was an object array (but the ``Categorical`` is store in a more efficient manner) |
store -> stored
Is this a format change? What will happen if someone wants to read with an older version of pandas an hdf file that is saved with 0.15.2 (or was such a thing never supported?) |
Ok, I updated to make this more explicit. It is now backwards AND forwards compatible, in that you can read a file written by >= 0.15.2 in a prior version. You will get the codes in the table (as that is how they are stored). The following in 0.15.2
and in 0.15.1 reading the same file
|
|
||
The format of the ``Categoricals`` is readable by prior versions of pandas (< 0.15.2), but will retrieve | ||
the data as an integer based column (e.g. the ``codes``). However, the ``categories`` *can* be retrieved | ||
but require the user to select them manually using the explicity meta path. |
I already did explicity -> explict (but haven't pushed yet)
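For readers following the compatibility note above, here is a minimal sketch of rebuilding a categorical from integer codes plus a separately retrieved categories array. The `codes` and `categories` values are hypothetical stand-ins, not data read from an actual store, and the sketch assumes a pandas version that provides `Categorical.from_codes`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for what a reader would have in hand:
# the integer codes from the table column, and the categories
# array selected manually from the metadata path.
codes = np.array([0, 1, 0, 2, -1], dtype="int8")  # -1 marks a missing value
categories = np.array(["a", "b", "c"], dtype=object)

# Map each code back to its category label.
restored = pd.Categorical.from_codes(codes, categories)
print(restored)
```

The codes alone carry no labels, which is why the categories have to be fetched separately before the column is meaningful again.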
You now added some docs to categorical.rst, but maybe also add (or refer to) something in io.rst#pytables ? |
hmm ok sure |
ok, fixed up |
@jorisvandenbossche any further comments? |
nope, no further comments! (but for the actual pytables interaction, I am not familiar with that) |
ENH: serialization of categorical to HDF5 (GH7621)
This is implemented by storing the codes directly in the table, along with a metadata VLArray of the categories.
Query and appending work as expected. The only quirk is that I don't allow you to append to a table unless the new data has exactly the same categories. Otherwise the codes become meaningless.
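A small sketch of why the append restriction matters. The helper `same_categories` is hypothetical and only illustrates the idea of the check; it is not the code in this PR.

```python
import pandas as pd


def same_categories(existing, new):
    """Hypothetical guard: True only if both hold identical categories in the same order."""
    return list(existing.categories) == list(new.categories)


stored = pd.Categorical(["a", "b", "a"], categories=["a", "b"])
compatible = pd.Categorical(["b", "b"], categories=["a", "b"])
conflicting = pd.Categorical(["b", "c"], categories=["b", "c"])

# Code 0 means "a" in the stored data but "b" in the conflicting data,
# so appending the raw codes would silently relabel values.
print(stored.categories[0], conflicting.categories[0])
print(same_categories(stored, compatible), same_categories(stored, conflicting))
```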
This has the nice property of drastically shrinking the storage cost compared to regular strings (which are stored as fixed width of the maximum for that particular column).
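To give a rough feel for the size difference, the sketch below compares in-memory sizes with NumPy. This is only an analogy under stated assumptions: it does not measure actual HDF5 files, compression is ignored, and the sample labels are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample data: three distinct labels, one of them long.
values = ["a", "bb", "a_much_longer_category_label"] * 1000

# Fixed-width byte strings: every element takes the width of the longest label.
fixed = np.array(values, dtype="S")
fixed_bytes = fixed.nbytes

# Categorical layout: one small integer code per row plus the labels stored once.
cat = pd.Categorical(values)
codes_bytes = cat.codes.nbytes
categories_bytes = np.array(list(cat.categories), dtype="S").nbytes

print(fixed.dtype.itemsize, fixed_bytes, codes_bytes + categories_bytes)
```

The gap grows with the length of the longest string and with the number of rows, since the fixed-width cost is paid per row while the labels are paid for only once.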
I bumped the actual HDF5 storage version to current (was 0.10.1). It's not strictly necessary, as this is a completely optional feature, but I am adding the sub-group space 'meta' (which, FYI, we can use for other things, e.g. to store the column labels and avoid the 64KB limit in attrs; there is an issue about this somewhere).