Adding arbitrary object serialization #1421


Open · wants to merge 1 commit into main

Conversation

lewisacidic
Contributor

This adds support for object serialization using the netCDF4-python backend.

Minimal working example (at least, it appears to work); no tests yet.

I added an allow_object kwarg (rather than allow_pickle; there's no reason to firmly attach pickle to the API, since other backends could use a different serializer).

This keyword is now accepted by:

  • to_netcdf
  • AbstractDataStore (a True value raises NotImplementedError for everything but NetCDF4DataStore)
  • cf_encoder, which when True alters its behaviour to let dtype('O') through

NetCDF4DataStore handles this independently from the cf_encoder/decoder. The dtype support made it hard to decouple, plus I think object serialization is a backend-dependent issue.

There's a lot of potential for refactoring; I just pushed this to get opinions on whether this was a reasonable approach. I'm relatively new to open source, so I would appreciate any constructive feedback/criticism!

  • Closes #xxxx
  • Tests added / passed
  • Passes git diff upstream/master | flake8 --diff
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

^ these will come later!

This adds support for object serialization using the netCDF4-python
backend.

Minimum working (maybe?) example, no tests yet.
@shoyer
Member

shoyer commented May 23, 2017

Thanks for giving this a shot!

I added allow_object kwarg (rather than allow_pickle, no reason to firmly attach pickle to the api, could use something else for other backends).

I'm having a hard time imagining any other serialization formats for arbitrary Python objects. pickle is pretty standard, though we might switch the argument for to_netcdf to pickle_protocol to allow indicating the pickle version (which would default to None, meaning don't pickle).

One additional reason for favoring allow_pickle is that it's the argument used by np.save and np.load.

NetCDF4DataStore handles this independently from the cf_encoder/decoder. The dtype support made it hard to decouple, plus I think object serialization is a backend dependent issue.

Yes, this is a little tricky. The current design is not great here. Ideally, though, we would still keep all of the encoding/decoding logic separate from the datastores. I need to think about this a little more.


One other concern is how to represent this data on disk in netCDF/HDF5 variables. Ideally, we would have a format that could work -- at least in principle -- with h5netcdf/h5py as well as netCDF4-python.

Annoyingly, these libraries currently have incompatible dtype support:

  • netCDF4-python supports variable length types with custom name. It does not support HDF5's opaque type, though the netCDF-C libraries do support opaque types, so adding them to netCDF4-Python would be relatively straightforward.
  • h5py supports variable length types, but not their name field. It maps np.void to HDF5's opaque type, which would be a pretty sensible storage type if netCDF4-Python supported it.

So if we want something that works with both, we'll need to add some additional metadata field in the form of an attribute to indicate how to do the decoding. Maybe something like _PickleProtocol, which would store the version of the pickle protocol used to write the data?


I have some inline comments I'll add below.

@@ -9,6 +9,7 @@
import re
import warnings
from collections import Mapping, MutableMapping, Iterable
from six.moves import cPickle as pickle
Member

xarray doesn't depend on six, so you need to use a try/except here:

try:
    import cPickle as pickle
except ImportError:
    import pickle

Contributor Author

Wow sorry, I actually forgot it wasn't a builtin, and I lazily didn't set up a proper dev environment. Definitely a simple fix.

Member

this try/except statement should probably go in xarray's pycompat module.


@functools.partial(np.vectorize, otypes='O')
def encode_pickle(obj):
    return np.frombuffer(pickle.dumps(obj), dtype=np.uint8)
Member

We probably want to use a later version of the pickle format -- at least version 2 (which introduced the binary version) if not pickle.HIGHEST_PROTOCOL. Possibly this should be a user controllable argument.

For reference, numpy.save uses protocol=2 and pandas.DataFrame.to_pickle uses HIGHEST_PROTOCOL (which is protocol=2 on Python 2, and currently protocol=4 on Python 3).

Contributor Author

I think six handles the protocol issue, which is why I didn't do anything here; without six we can handle that manually. I don't know much about pickle compatibility between Python 2 and 3 (i.e. dump in 2, load in 3); perhaps that would be the nicest configuration to default to?

Member

The way that pickle works, any version of Python can load older pickles, but old versions of Python can't load newer pickles. So protocol=2 is a maximally backwards compatible option, but misses out on any later pickle improvements.
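For illustration, a small sketch of that compatibility property (the protocol is recorded in the stream, so loading never needs to be told which version was used):

```python
import pickle

obj = {"values": [1, 2, 3], "name": "example"}

# protocol=2 is the most backward-compatible binary protocol
# (readable by any Python >= 2.3); HIGHEST_PROTOCOL picks up
# later improvements at the cost of forward compatibility.
data_v2 = pickle.dumps(obj, protocol=2)
data_hi = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

# loads never takes a protocol argument -- the version is read
# from the stream itself, so newer Pythons load older pickles fine.
assert pickle.loads(data_v2) == obj
assert pickle.loads(data_hi) == obj
```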

Member

let's set HIGHEST_PROTOCOL in pycompat as well.

# TODO: move this from conventions to backends? (it's not CF related)
if var.dtype.kind == 'O':
    dims, data, attrs, encoding = _var_as_tuple(var)
    missing = pd.isnull(data)
    if missing.any():
        # nb. this will fail for dask.array data
        non_missing_values = data[~missing]
        inferred_dtype = _infer_dtype(non_missing_values)
        inferred_dtype = _infer_dtype(non_missing_values,
Member

If _infer_dtype fails, we want to break out of this function and return the original var, not copy the data and put it on a new Variable object (which happens below).

@lewisacidic
Contributor Author

Thanks for giving this a shot!

No problem, I really want this feature so I can use xarray for a cheminformatics library I'm working on! Hopefully we can work out the best way to do this whilst keeping everything as nice and organised as it was before I touched the code...

I'm having a hard time imagining any other serialization formats for serializing arbitrary Python objects. pickle is pretty standard, though we might switch the argument for to_netcdf to pickle_protocol to allow indicating the pickle version (which would default to None, for don't pickle).

I couldn't think of others either; it made sense as a keyword for cf_encoder, where no pickling currently happens, which is why I changed it originally. allow_pickle definitely makes sense; I've used np.save loads of times but didn't know that keyword existed. Perhaps two keywords, allow_pickle and pickle_protocol, so it's more explicit, and so we can set a default protocol?

Yes, this is a little tricky. The current design is not great here. Ideally, though, we would still keep all of the encoding/decoding logic separate from the datastores. I need to think about this a little more.

Yeah, it definitely isn't great; I wanted a working example, and that's what I managed before sleeping! I'll keep looking through the code to familiarize myself a bit more with it. I would be interested to see what you suggest!


One other concern is how to represent this data on disk in netCDF/HDF5 variables. Ideally, we would have a format that could work -- at least in principle -- with h5netcdf/h5py as well as netCDF4-python.

Yeah, that would definitely be good.

Annoyingly, these libraries currently have incompatible dtype support:

  • netCDF4-python supports variable length types with custom name. It does not support HDF5's opaque type, though the netCDF-C libraries do support opaque types, so adding them to netCDF4-Python would be relatively straightforward.
  • h5py supports variable length types, but not their name field. It maps np.void to HDF5's opaque type, which would be a pretty sensible storage type if netCDF4-Python supported it.

So if we want something that works with both, we'll need to add some additional metadata field in the form of an attribute to indicate how to do the decoding. Maybe something like _PickleProtocol, which would store the version of the pickle protocol used to write the data?

The np.void type was pretty much designed for this sort of thing from what I can see; I was pretty surprised that netCDF4-python didn't have something similar, hence the strange np.uint8 stuff. We could do the same for h5py using h5py.special_dtype, although as you say there is no name, so it can't work the same for h5py as it does now. Pickle works out the protocol automatically (there's no protocol keyword for load or loads), so we wouldn't really need to save the protocol as an attribute, although it would be a way to work out which variables to unpickle once saved, if we went this route. It seems a shame not to use np.void though, so perhaps it makes sense to add the opaque types to netCDF4-python and forget the np.uint8 trick.

@lewisacidic
Contributor Author

lewisacidic commented May 24, 2017

Btw, just had a look: the pickle protocol is the second byte of any pickle.

For protocol 2+ anyway...
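That can be verified directly: for protocol 2 and up, a pickle starts with the PROTO opcode (0x80), followed by a single byte holding the protocol number. A quick sketch:

```python
import pickle

obj = [1, 2, 3]
for proto in (2, 3, 4):
    data = pickle.dumps(obj, protocol=proto)
    # byte 0 is the PROTO opcode, byte 1 the protocol number
    assert data[0] == 0x80
    assert data[1] == proto
```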

@shoyer
Member

shoyer commented May 24, 2017

Pickle works out the protocol automatically (no protocol keyword for load or loads), so we wouldn't really need to save the protocol as an attribute, although it would be a way to work out which variables to unpickle once saved, if we went this route.

I think we do want some sort of marker attribute, but I agree that it doesn't need to include the pickle version.

Maybe the attribute _FileFormat = 'python-pickle' would make sense? This would have the advantage of being obvious to anyone inspecting the netCDF file with standard tools (not xarray).

It seems a shame not to use np.void though, so perhaps it makes sense to add the opaque types to netCDF4-python and forget the np.uint8 trick.

I think netCDF actually maps np.int8 -> NC_BYTE, so that's at least some justification for this choice: http://www.unidata.ucar.edu/software/netcdf/docs/data_type.html

Certainly handling opaque types in netCDF4-python would be nice, though I don't think it should be a blocker for this. I suspect the reason this isn't done is that NumPy maps bytes -> np.string_ even on Python 3. Thus np.void is used far less often than it should be. Also, the repr for np.void has been pretty poor, though that's being worked on currently in numpy/numpy#8981.
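A minimal sketch of that dtype mapping (illustrative only; behavior as observed in recent NumPy):

```python
import numpy as np

# NumPy infers a fixed-width bytestring dtype for bytes input,
# not an opaque void type, even on Python 3.
arr = np.array([b'abc'])
assert arr.dtype == np.dtype('S3')

# An np.void scalar has to be constructed explicitly.
blob = np.void(b'abc')
assert blob.dtype.kind == 'V'
assert blob.dtype.itemsize == 3
```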

@shoyer
Member

shoyer commented May 24, 2017

How about something like the following:

In encode_cf_variable, create a new variable with pickle encoded data (if appropriate). This looks something like:

def encode_cf_variable(var, allow_pickle=False):
    ...
    if var.dtype == object:
        if allow_pickle:
            var = maybe_encode_pickle(var)
        else:
            raise TypeError
    return var

def maybe_encode_pickle(var):
    if var.dtype == object:
        attrs = var.attrs.copy()
        safe_setitem(attrs, '_FileFormat', 'python-pickle')
        protocol = var.encoding.pop('pickle_protocol', 2)
        data = utils.encode_pickle(var.values, protocol=protocol)
        var = Variable(var.dims, data, attrs, var.encoding)
    return var

This reuses the encoding parameter for setting the pickle protocol version, which is already what we use for similar variable specific encoding details.

In the netCDF backends, add a check for variables with dtype == object and a _FileFormat attribute. If this is the case, call a create_vlen_int8_dtype method to create an appropriate dtype using backend-specific methods (the behavior on the base class should be to raise an error), and proceed with writing the data in the usual way.

For decoding, reverse the process. Convert custom vlen dtypes to dtype=object in the appropriate backend specific array wrapper type, but don't decode data. If allow_pickle=True and var.attrs['_FileFormat'] == 'python-pickle', then decode_cf_variable should do the unpickling, moving _FileFormat from attrs to encoding.

For bonus points, generalize handling of vlen types with encoding. Something like encoding={'dtype': {'vlen': np.int8}} could indicate that a special vlen with np.int8 data should be created to encode this variable's data. (This is inspired by h5py's API for special types.)
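To make the round trip concrete, here is a minimal sketch outside of xarray, modeled on the PR's encode_pickle helper; decode_pickle is a hypothetical counterpart, not part of the PR:

```python
import functools
import pickle

import numpy as np


@functools.partial(np.vectorize, otypes='O')
def encode_pickle(obj):
    # each element becomes a variable-length uint8 buffer (protocol 2 here)
    return np.frombuffer(pickle.dumps(obj, protocol=2), dtype=np.uint8)


@functools.partial(np.vectorize, otypes='O')
def decode_pickle(buf):
    # hypothetical reverse: the protocol is read from the stream itself
    return pickle.loads(buf.tobytes())


values = np.empty(3, dtype=object)
values[0] = {'a': 1}
values[1] = [1, 2]
values[2] = 'text'

roundtripped = decode_pickle(encode_pickle(values))
assert all(a == b for a, b in zip(roundtripped, values))
```

This keeps pickling entirely in the encoding layer, as suggested above; the backend then only has to know how to store object arrays of uint8 buffers as a vlen type.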

@lewisacidic
Contributor Author

Sounds great, thanks for the feedback! I'll probably not get to this until the weekend, but I'll have another crack at it then.

@lewisacidic lewisacidic changed the title First pass adding arbitrary object serialization Adding arbitrary object serialization May 28, 2017
@lewisacidic
Contributor Author

Sorry for the holdup. I made progress 3 weekends ago, but didn't get a fully working example, and I've been swamped the whole of this month. This weekend I'll be free to finish this off.

Member

@jhamman left a comment

@richlewis42 - this seems to be coming along. Let us know if you need any help pushing it forward.

@lewisacidic
Contributor Author

Hi all, sorry for the lack of communication on this. I'm writing up my PhD thesis at the moment, and my deadline is increasingly looming. Once I've handed in I'll finish this off, and am more than happy to help with several other things I've encountered (I'll create the issues now). Once again, sorry for the delay!
