Skip to content

Numpy string coding #5264

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Dec 30, 2021
Merged

Numpy string coding #5264

merged 13 commits into from
Dec 30, 2021

Conversation

znicholls
Copy link
Contributor

@znicholls znicholls commented May 6, 2021

Fixes handling of numpy string types in coding

  • Tests added
  • Passes pre-commit run --all-files
  • User visible changes (including notable bug fixes) are documented in whats-new.rst

@znicholls
Copy link
Contributor Author

So far I've just added a single test which fails. I don't think the test should fail although I'm not sure what the np.str_ type actually is so maybe this isn't a bug? Help/advice greatly appreciated.

@shoyer
Copy link
Member

shoyer commented May 6, 2021

So far I've just added a single test which fails. I don't think the test should fail although I'm not sure what the np.str_ type actually is so maybe this isn't a bug? Help/advice greatly appreciated.

What problem are you trying to solve here?

This vlen string stuff is an internal API that isn't really intended for use outside Xarray.

@znicholls
Copy link
Contributor Author

znicholls commented May 6, 2021

What problem are you trying to solve here?

Somehow I ended up with np.str_ in a pandas dataframe (how is unclear to me but this seems to be a valid string type), which then exploded when I converted to xarray and attempted to save as netCDF. Minimal example below.

import numpy as np
import pandas as pd


# I don't know how the strings ended up being np.str_....
scenarios = [np.str_(v) for v in ["scenario_a", "scenario_b", "scenario_c"]]
years = range(2015, 2100 + 1)
tdf = pd.DataFrame(
    data=np.random.random((len(scenarios), len(years))),
    columns=years,
    index=scenarios,
)
tdf.index.name = "scenario"
tdf.columns.name = "year"
tdf = tdf.stack()
tdf.name = "tas"

txr = tdf.to_xarray()
# raises error shown below
txr.to_netcdf("test.nc")

# error
Traceback (most recent call last):
  File "scratch.py", line 20, in <module>
    txr.to_netcdf("test.nc")
  File ".../lib/python3.7/site-packages/xarray/core/dataarray.py", line 2741, in to_netcdf
    return dataset.to_netcdf(*args, **kwargs)
  File ".../lib/python3.7/site-packages/xarray/core/dataset.py", line 1699, in to_netcdf
    invalid_netcdf=invalid_netcdf,
  File ".../lib/python3.7/site-packages/xarray/backends/api.py", line 1108, in to_netcdf
    dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
  File ".../lib/python3.7/site-packages/xarray/backends/api.py", line 1154, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File ".../lib/python3.7/site-packages/xarray/backends/common.py", line 256, in store
    variables, check_encoding_set, writer, unlimited_dims=unlimited_dims
  File ".../lib/python3.7/site-packages/xarray/backends/common.py", line 294, in set_variables
    name, v, check, unlimited_dims=unlimited_dims
  File ".../lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 464, in prepare_variable
    variable, self.format, raise_on_invalid_encoding=check_encoding
  File ".../lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 131, in _get_datatype
    datatype = _nc4_dtype(var)
  File ".../lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 154, in _nc4_dtype
    raise ValueError(f"unsupported dtype for netCDF4 variable: {var.dtype}")
ValueError: unsupported dtype for netCDF4 variable: object

@znicholls
Copy link
Contributor Author

@shoyer any further thoughts on this now that the scope is clearer?

@shoyer
Copy link
Member

shoyer commented May 26, 2021

I agree, this should totally work. It's not obvious to me how to best fix it, though.

@znicholls
Copy link
Contributor Author

I agree, this should totally work. It's not obvious to me how to best fix it, though.

I assume it's not as trivial as just changing e.g.

return dtype.kind == "U" or check_vlen_dtype(dtype) == str
to also know about np.str_?

@shoyer
Copy link
Member

shoyer commented May 26, 2021

I think the issue must be somewhere around this line, where xarray attempts to infer a dtype for object arrays:

inferred_dtype = _infer_dtype(non_missing_values, name)

@znicholls
Copy link
Contributor Author

I tried pushing a fix. It's unclear to me whether the change should be in how the dtypes are inferred (given that the inference code seems to do what it is meant to...) or whether is_unicode_dtype simply needs to be updated to know about np.str_ (which is the fix I just tried).

@github-actions
Copy link
Contributor

github-actions bot commented Jul 14, 2021

Unit Test Results

         6 files           6 suites   55m 20s ⏱️
16 230 tests 14 495 ✔️ 1 735 💤 0
90 576 runs  82 392 ✔️ 8 184 💤 0

Results for commit fc8252e.

♻️ This comment has been updated with latest results.

@shoyer
Copy link
Member

shoyer commented Jul 14, 2021

My suggestion is that either _infer_dtype (

if isinstance(element, (bytes, str)):
return strings.create_vlen_dtype(type(element))

) or the underlying create_vlen_dtype should be updated, so it never puts np.str_ inside a custom vlen dtype. Instead, we should normalize element_type to always be str or bytes inside the vlen dtype.

@shoyer
Copy link
Member

shoyer commented Jul 14, 2021

To add a bit more clarification: the vlen dtype should correspond to an HDF5/netCDF4 compatible data-type, like a variable length string or bytes. np.str_ is just a NumPy variant of str, so the correct dtype is create_vlen_dtype(str).

@znicholls
Copy link
Contributor Author

znicholls commented Jul 14, 2021

Something like 59ed7d5? (Obviously missing proper tests but just to get a sense of whether the idea is plausible)

@znicholls
Copy link
Contributor Author

xarray/tests/test_backends.py::test_open_fsspec appears to fail because of the release of ffspec 2021.7.0 so this PR will probably have to wait until a fix for that is added (presumably elsewhere to keep the changes clear).

@znicholls znicholls force-pushed the numpy-str-encoding branch from f2e1550 to 5149cc7 Compare July 19, 2021 23:11
@znicholls
Copy link
Contributor Author

Ignoring failing CI due to fsspec (see #5615 (comment))

@znicholls znicholls marked this pull request as ready for review October 1, 2021 01:26
@znicholls
Copy link
Contributor Author

@shoyer can I bother you again now that CI is passing please?

@znicholls
Copy link
Contributor Author

@lewisjarednz fyi

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks! Please move the test, then we can merge this

@znicholls znicholls requested a review from shoyer October 2, 2021 08:36
@Illviljan
Copy link
Contributor

Looks good to me, nice work!

@Illviljan Illviljan added the plan to merge Final call for comments label Nov 11, 2021
@Illviljan Illviljan merged commit f75c3be into pydata:main Dec 30, 2021
@Illviljan
Copy link
Contributor

Thanks @znicholls!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants