Adding arbitrary object serialization #1421
base: main
Conversation
This adds support for object serialization using the netCDF4-python backend. Minimum working (maybe?) example, no tests yet.
Thanks for giving this a shot!
I'm having a hard time imagining any other formats for serializing arbitrary Python objects. One additional reason for favoring …
Yes, this is a little tricky. The current design is not great here. Ideally, though, we would still keep all of the encoding/decoding logic separate from the datastores. I need to think about this a little more. One other concern is how to represent this data on disk in netCDF/HDF5 variables. Ideally, we would have a format that could work -- at least in principle -- with both. Annoyingly, these libraries currently have incompatible dtype support:
So if we want something that works with both, we'll need to add some additional metadata field, in the form of an attribute, to indicate how to do decoding. Maybe something like … I have some inline comments I'll add below.
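As a rough sketch of how such a marker attribute could drive decoding (the `_FileFormat` name and `python-pickle` value are taken from the proposal later in this thread; the helper function itself is hypothetical):

```python
import pickle

import numpy as np


def maybe_decode_by_marker(data, attrs):
    """Hypothetical decode dispatch: if a marker attribute says the
    bytes are a pickle, unpickle them; otherwise pass data through."""
    if attrs.get('_FileFormat') == 'python-pickle':
        return pickle.loads(np.asarray(data, dtype=np.uint8).tobytes())
    return data


# round-trip: an arbitrary object stored as raw uint8 bytes
obj = {'a': [1, 2, 3]}
raw = np.frombuffer(pickle.dumps(obj), dtype=np.uint8)
assert maybe_decode_by_marker(raw, {'_FileFormat': 'python-pickle'}) == obj
assert maybe_decode_by_marker(raw, {}) is raw
```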
@@ -9,6 +9,7 @@
 import re
 import warnings
 from collections import Mapping, MutableMapping, Iterable
+from six.moves import cPickle as pickle
xarray doesn't depend on six, so you need to use a try/except here:

try:
    import cPickle as pickle
except ImportError:
    import pickle
Wow sorry, I actually forgot it wasn't a builtin, and I lazily didn't set up a proper dev environment. Definitely a simple fix.
this try/except statement should probably go in xarray's pycompat module.
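A minimal sketch of what that pycompat addition could look like (the exact module layout is an assumption; the re-exported HIGHEST_PROTOCOL also covers the suggestion further down this thread):

```python
# hypothetical addition to xarray's pycompat module
try:
    import cPickle as pickle  # Python 2: C-accelerated pickle
except ImportError:
    import pickle  # Python 3: pickle is already C-accelerated

# re-export so callers can write pycompat.HIGHEST_PROTOCOL
HIGHEST_PROTOCOL = pickle.HIGHEST_PROTOCOL
```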
@functools.partial(np.vectorize, otypes='O')
def encode_pickle(obj):
    return np.frombuffer(pickle.dumps(obj), dtype=np.uint8)
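For reference, the decode counterpart of this helper would reverse np.frombuffer with pickle.loads. A self-contained sketch of the round trip (decode_pickle is illustrative, not from the PR):

```python
import functools
import pickle

import numpy as np


@functools.partial(np.vectorize, otypes='O')
def encode_pickle(obj):
    # each element becomes a 1-D uint8 array of its pickle bytes
    return np.frombuffer(pickle.dumps(obj), dtype=np.uint8)


@functools.partial(np.vectorize, otypes='O')
def decode_pickle(buf):
    # reverse: raw bytes back to the original Python object
    return pickle.loads(np.asarray(buf, dtype=np.uint8).tobytes())


values = np.empty(3, dtype=object)
values[0] = {'x': 1}
values[1] = (2, 3)
values[2] = 'text'
roundtrip = decode_pickle(encode_pickle(values))
assert list(roundtrip) == list(values)
```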
We probably want to use a later version of the pickle format -- at least version 2 (which introduced the binary version), if not pickle.HIGHEST_PROTOCOL. Possibly this should be a user-controllable argument.

For reference, numpy.save uses protocol=2, and pandas.DataFrame.to_pickle uses HIGHEST_PROTOCOL (which is protocol=2 on Python 2, and currently protocol=4 on Python 3).
I think six handles the protocol issue, which is why I didn't do anything here, but with no six we can handle that manually. I don't know much about pickle 2 and 3 compatibility (i.e. dump in 2, load in 3); perhaps that would be the nicest configuration to default to?
The way that pickle works, any version of Python can load older pickles, but old versions of Python can't load newer pickles. So protocol=2 is a maximally backwards compatible option, but misses out on any later pickle improvements.
let's set HIGHEST_PROTOCOL in pycompat as well.
 # TODO: move this from conventions to backends? (it's not CF related)
 if var.dtype.kind == 'O':
     dims, data, attrs, encoding = _var_as_tuple(var)
     missing = pd.isnull(data)
     if missing.any():
         # nb. this will fail for dask.array data
         non_missing_values = data[~missing]
-        inferred_dtype = _infer_dtype(non_missing_values)
+        inferred_dtype = _infer_dtype(non_missing_values,
If _infer_dtype fails, we want to break out of this function and return the original var, not copy the data and put it on a new Variable object (which happens below).
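A self-contained sketch of that suggested control flow, using a simplified stand-in for _infer_dtype (the real xarray helper operates on Variables; the assumption here is that inference failure is signalled with a ValueError):

```python
import numpy as np


def _infer_dtype(values):
    """Stand-in for xarray's helper: infer a concrete dtype, raising
    ValueError when elements are arbitrary Python objects."""
    inferred = np.asarray(list(values)).dtype
    if inferred.kind == 'O':
        raise ValueError('cannot serialize arbitrary Python objects')
    return inferred


def maybe_encode_dtype(values):
    """On inference failure, return the original values untouched
    rather than copying them onto a new container."""
    try:
        dtype = _infer_dtype(values)
    except ValueError:
        return values  # bail out: leave the object data as-is
    return np.asarray(list(values), dtype=dtype)


assert maybe_encode_dtype([1.5, 2.5]).dtype.kind == 'f'
objs = [object(), object()]
assert maybe_encode_dtype(objs) is objs  # original returned untouched
```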
No problem, I really want this feature so I can use xarray for a cheminformatics library I'm working on! Hopefully we can work out the best way to do this whilst keeping everything as nice and organised as it was before I touched the code...
I couldn't think of others either - it made sense as a keyword for …
Yeah, it definitely isn't great, I wanted a working example and that's what I managed to do before sleeping! I'll keep looking through the code to familiarize myself a bit more with it - I would be interested to see what you suggest!
Yeah, that would definitely be good.
The np.void type was pretty much designed for this sort of thing from what I can see; I was pretty surprised that netCDF4-python didn't have something similar, hence the strange …
The pickle protocol is the second byte of any pickle, btw, just had a look. For protocol 2+ anyway...
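This checks out for protocols 2 and up: the pickle stream starts with the PROTO opcode (0x80) followed by a one-byte protocol number. A quick verification:

```python
import pickle

payload = pickle.dumps({'a': 1}, protocol=2)
assert payload[0:1] == b'\x80'  # PROTO opcode, emitted for protocol >= 2
assert payload[1] == 2          # second byte is the protocol number

latest = pickle.dumps({'a': 1}, protocol=pickle.HIGHEST_PROTOCOL)
assert latest[1] == pickle.HIGHEST_PROTOCOL
```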
I think we do want some sort of marker attribute, but I agree that it doesn't need to include the pickle version. Maybe the attribute …
I think netCDF actually maps … Certainly, handling opaque types in netCDF4-python would be nice, though I don't think it should be a blocker for this. I suspect the reason this isn't done is that NumPy maps …
How about something like the following. In …:

def encode_cf_variable(var, allow_pickle=False):
    ...
    if var.dtype == object:
        if allow_pickle:
            var = maybe_encode_pickle(var)
        else:
            raise TypeError
    return var

def maybe_encode_pickle(var):
    if var.dtype == object:
        attrs = var.attrs.copy()
        safe_setitem(attrs, '_FileFormat', 'python-pickle')
        protocol = var.encoding.pop('pickle_protocol', 2)
        data = utils.encode_pickle(var.values, protocol=protocol)
        var = Variable(var.dims, data, attrs, var.encoding)
    return var

This reuses the … In the netCDF backends, add a check for variables with … For decoding, reverse the process: convert custom vlen dtypes to … For bonus points, generalize handling of vlen types with …
Sounds great, thanks for the feedback! I'll probably not get to this until the weekend, but I'll have another crack at it then.
Sorry for the holdup. I made progress 3 weekends ago, but didn't get a fully working example, and I've been swamped the whole of this month. This weekend I'll be free to finish this off.
@richlewis42 - this seems to be coming along. Let us know if you need any help pushing it forward.
Hi all, sorry for the lack of communication on this. I'm writing up my PhD thesis at the moment, and my deadline is looming ever closer. Once I've handed in, I'll finish this off, and I'm more than happy to help with several other things I've encountered (I'll create the issues now). Once again, sorry for the delay!
This adds support for object serialization using the netCDF4-python backend. Minimum working (at least it appears to be) example, no tests yet.

I added an allow_object kwarg (rather than allow_pickle; no reason to firmly attach pickle to the API, could use something else for other backends). This is now a kwarg for:

- to_netcdf
- AbstractDataStore (a True value raises NotImplementedError for everything but NetCDF4DataStore)
- cf_encoder, which when True alters its behaviour to allow dtype('O') through.

NetCDF4DataStore handles this independently from the cf_encoder/decoder. The dtype support made it hard to decouple, plus I think object serialization is a backend-dependent issue.

There's a lot of potential for refactoring; I just pushed this to get opinions about whether this was a reasonable approach. I'm relatively new to open source, so would appreciate any constructive feedback/criticisms!

- git diff upstream/master | flake8 --diff
- whats-new.rst for all changes and api.rst for new API

^ these will come later!