Numpy string coding #5264
So far I've just added a single test which fails. I don't think the test should fail, although I'm not sure what the ...
What problem are you trying to solve here? This vlen string stuff is an internal API that isn't really intended for use outside Xarray.
Somehow I ended up with the following:

```python
import numpy as np
import pandas as pd

# I don't know how the strings ended up being np.str_....
scenarios = [np.str_(v) for v in ["scenario_a", "scenario_b", "scenario_c"]]
years = range(2015, 2100 + 1)

tdf = pd.DataFrame(
    data=np.random.random((len(scenarios), len(years))),
    columns=years,
    index=scenarios,
)
tdf.index.name = "scenario"
tdf.columns.name = "year"
tdf = tdf.stack()
tdf.name = "tas"

txr = tdf.to_xarray()

# raises error shown below
txr.to_netcdf("test.nc")
```
Error:

```
Traceback (most recent call last):
  File "scratch.py", line 20, in <module>
    txr.to_netcdf("test.nc")
  File ".../lib/python3.7/site-packages/xarray/core/dataarray.py", line 2741, in to_netcdf
    return dataset.to_netcdf(*args, **kwargs)
  File ".../lib/python3.7/site-packages/xarray/core/dataset.py", line 1699, in to_netcdf
    invalid_netcdf=invalid_netcdf,
  File ".../lib/python3.7/site-packages/xarray/backends/api.py", line 1108, in to_netcdf
    dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
  File ".../lib/python3.7/site-packages/xarray/backends/api.py", line 1154, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File ".../lib/python3.7/site-packages/xarray/backends/common.py", line 256, in store
    variables, check_encoding_set, writer, unlimited_dims=unlimited_dims
  File ".../lib/python3.7/site-packages/xarray/backends/common.py", line 294, in set_variables
    name, v, check, unlimited_dims=unlimited_dims
  File ".../lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 464, in prepare_variable
    variable, self.format, raise_on_invalid_encoding=check_encoding
  File ".../lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 131, in _get_datatype
    datatype = _nc4_dtype(var)
  File ".../lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 154, in _nc4_dtype
    raise ValueError(f"unsupported dtype for netCDF4 variable: {var.dtype}")
ValueError: unsupported dtype for netCDF4 variable: object
```
@shoyer any further thoughts on this now that the scope is clearer?
I agree, this should totally work. It's not obvious to me how to best fix it, though.
I assume it's not as trivial as just changing e.g. xarray/xarray/coding/strings.py (Line 32 in f9a535c) to also cover np.str_?
I think the issue must be somewhere around this line, where xarray attempts to infer a dtype for object arrays (Line 215 in f9a535c).
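To make the suspicion above concrete, here is a minimal sketch of how np.str_ elements behave inside an object array. This is not xarray's inference code: `infer_element_type` is a hypothetical helper, and the sketch assumes that inference looks only at the first element of the array.

```python
import numpy as np

# An object array whose elements are np.str_, as in the report above.
values = np.array([np.str_("scenario_a"), np.str_("scenario_b")], dtype=object)
first = values[0]

# np.str_ subclasses str, so an isinstance check passes even though
# type(first) is not the builtin str.
is_str_instance = isinstance(first, str)
exact_type_is_str = type(first) is str


def infer_element_type(arr):
    """Hypothetical helper: map the first element to plain str or bytes."""
    if arr.size == 0:
        return None
    element = arr.ravel()[0]
    if isinstance(element, str):
        return str  # report the builtin type, not type(element)
    if isinstance(element, bytes):
        return bytes
    return None


inferred = infer_element_type(values)
```

The point of the sketch is that returning `type(element)` would propagate np.str_ downstream, whereas returning the builtin `str` would not.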
Force-pushed from 33f0e09 to ca6abdb.
I tried pushing a fix. It's unclear to me whether the change should be in how the dtypes are inferred (given that the inference code seems to do what it is meant to...) or whether ...
My suggestion is that either the code at Lines 160 to 161 in f9a535c or the underlying create_vlen_dtype should be updated, so it never puts np.str_ inside a custom vlen dtype. Instead, we should normalize element_type to always be str or bytes inside the vlen dtype.
To add a bit more clarification: the vlen dtype should correspond to an HDF5/netCDF4 compatible data-type, like a variable length string or bytes.
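A rough sketch of what that normalization could look like follows. It is only an illustration of the suggestion, not the change that was merged: `sketch_create_vlen_dtype` is a hypothetical stand-in for create_vlen_dtype, and storing the element type under an "element_type" metadata key is an assumption made for this example.

```python
import numpy as np


def sketch_create_vlen_dtype(element_type):
    """Hypothetical stand-in: build an object dtype tagged with str or bytes."""
    # Collapse subclasses such as np.str_ / np.bytes_ to the builtin types,
    # so the tag always names an HDF5/netCDF4 compatible string type.
    if issubclass(element_type, str):
        element_type = str
    elif issubclass(element_type, bytes):
        element_type = bytes
    else:
        raise TypeError(f"unsupported element type: {element_type!r}")
    # The metadata key name is an assumption for this sketch.
    return np.dtype("O", metadata={"element_type": element_type})


str_dtype = sketch_create_vlen_dtype(np.str_)
bytes_dtype = sketch_create_vlen_dtype(np.bytes_)
```

With this shape, any later check of the form "is the element type str?" sees the builtin type regardless of which string subclass was passed in.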
Something like 59ed7d5? (Obviously missing proper tests, but just to get a sense of whether the idea is plausible.)
Force-pushed from f2e1550 to 5149cc7.
Ignoring failing CI due to fsspec (see #5615 (comment)).
Force-pushed from 5149cc7 to 5e15269.
@shoyer can I bother you again now that CI is passing please?
@lewisjarednz fyi
Looks great, thanks! Please move the test, then we can merge this
Looks good to me, nice work!
Thanks @znicholls!
Fixes handling of numpy string types in coding
pre-commit run --all-files
whats-new.rst