Numpy string coding #5264

znicholls · 2021-05-06T02:35:55Z

Fixes handling of numpy string types in coding

Tests added
Passes pre-commit run --all-files
User visible changes (including notable bug fixes) are documented in whats-new.rst

znicholls · 2021-05-06T02:37:57Z

So far I've just added a single test which fails. I don't think the test should fail although I'm not sure what the np.str_ type actually is so maybe this isn't a bug? Help/advice greatly appreciated.

shoyer · 2021-05-06T04:06:04Z

> So far I've just added a single test which fails. I don't think the test should fail although I'm not sure what the np.str_ type actually is so maybe this isn't a bug? Help/advice greatly appreciated.

What problem are you trying to solve here?

This vlen string stuff is an internal API that isn't really intended for use outside Xarray.

znicholls · 2021-05-06T05:21:19Z

> What problem are you trying to solve here?

Somehow I ended up with np.str_ in a pandas dataframe (how is unclear to me but this seems to be a valid string type), which then exploded when I converted to xarray and attempted to save as netCDF. Minimal example below.

import numpy as np
import pandas as pd


# I don't know how the strings ended up being np.str_....
scenarios = [np.str_(v) for v in ["scenario_a", "scenario_b", "scenario_c"]]
years = range(2015, 2100 + 1)
tdf = pd.DataFrame(
    data=np.random.random((len(scenarios), len(years))),
    columns=years,
    index=scenarios,
)
tdf.index.name = "scenario"
tdf.columns.name = "year"
tdf = tdf.stack()
tdf.name = "tas"

txr = tdf.to_xarray()
# raises error shown below
txr.to_netcdf("test.nc")

# error
Traceback (most recent call last):
  File "scratch.py", line 20, in <module>
    txr.to_netcdf("test.nc")
  File ".../lib/python3.7/site-packages/xarray/core/dataarray.py", line 2741, in to_netcdf
    return dataset.to_netcdf(*args, **kwargs)
  File ".../lib/python3.7/site-packages/xarray/core/dataset.py", line 1699, in to_netcdf
    invalid_netcdf=invalid_netcdf,
  File ".../lib/python3.7/site-packages/xarray/backends/api.py", line 1108, in to_netcdf
    dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
  File ".../lib/python3.7/site-packages/xarray/backends/api.py", line 1154, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File ".../lib/python3.7/site-packages/xarray/backends/common.py", line 256, in store
    variables, check_encoding_set, writer, unlimited_dims=unlimited_dims
  File ".../lib/python3.7/site-packages/xarray/backends/common.py", line 294, in set_variables
    name, v, check, unlimited_dims=unlimited_dims
  File ".../lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 464, in prepare_variable
    variable, self.format, raise_on_invalid_encoding=check_encoding
  File ".../lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 131, in _get_datatype
    datatype = _nc4_dtype(var)
  File ".../lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 154, in _nc4_dtype
    raise ValueError(f"unsupported dtype for netCDF4 variable: {var.dtype}")
ValueError: unsupported dtype for netCDF4 variable: object

znicholls · 2021-05-26T06:55:48Z

@shoyer any further thoughts on this now that the scope is clearer?

shoyer · 2021-05-26T07:05:38Z

I agree, this should totally work. It's not obvious to me how to best fix it, though.

znicholls · 2021-05-26T08:58:23Z

> I agree, this should totally work. It's not obvious to me how to best fix it, though.

I assume it's not as trivial as just changing e.g. this line (xarray/xarray/coding/strings.py, line 32 in f9a535c)

return dtype.kind == "U" or check_vlen_dtype(dtype) == str

to also know about np.str_?
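To make the question concrete, here is a minimal sketch of why an equality check against str misses np.str_. The two functions are hypothetical stand-ins written for illustration (their argument plays the role of whatever check_vlen_dtype(dtype) returns); they are not xarray's code.

```python
import numpy as np


def is_unicode_element_type(element_type):
    # Mirrors the `== str` comparison from the quoted line: comparing two
    # classes with == only succeeds when they are the same class.
    return element_type == str


def is_unicode_element_type_subclass_aware(element_type):
    # Hypothetical alternative: accept any subclass of str, such as np.str_.
    return isinstance(element_type, type) and issubclass(element_type, str)


strict_plain = is_unicode_element_type(str)
strict_numpy = is_unicode_element_type(np.str_)
relaxed_numpy = is_unicode_element_type_subclass_aware(np.str_)
relaxed_bytes = is_unicode_element_type_subclass_aware(bytes)

print(strict_plain, strict_numpy, relaxed_numpy, relaxed_bytes)
# True False True False
```

np.str_ is a subclass of str, so isinstance and issubclass checks accept it, while a direct comparison of the two classes does not.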

shoyer · 2021-05-26T16:58:29Z

I think the issue must be somewhere around this line (xarray/xarray/conventions.py, line 215 in f9a535c), where xarray attempts to infer a dtype for object arrays:

inferred_dtype = _infer_dtype(non_missing_values, name)

znicholls · 2021-07-14T01:41:38Z

I tried pushing a fix. It's unclear to me whether the change should be in how the dtypes are inferred (given that the inference code seems to do what it is meant to...) or whether is_unicode_dtype simply needs to be updated to know about np.str_ (which is the fix I just tried).

github-actions · 2021-07-14T01:52:10Z

Unit Test Results

6 files, 6 suites, 55m 20s ⏱️
16 230 tests: 14 495 ✔️, 1 735 💤, 0 ❌
90 576 runs: 82 392 ✔️, 8 184 💤, 0 ❌

Results for commit fc8252e.

♻️ This comment has been updated with latest results.

shoyer · 2021-07-14T02:06:26Z

My suggestion is that either _infer_dtype (xarray/xarray/conventions.py, lines 160 to 161 in f9a535c)

if isinstance(element, (bytes, str)):
    return strings.create_vlen_dtype(type(element))

or the underlying create_vlen_dtype should be updated, so it never puts np.str_ inside a custom vlen dtype. Instead, we should normalize element_type to always be str or bytes inside the vlen dtype.

shoyer · 2021-07-14T02:08:23Z

To add a bit more clarification: the vlen dtype should correspond to an HDF5/netCDF4 compatible data-type, like a variable length string or bytes. np.str_ is just a NumPy variant of str, so the correct dtype is create_vlen_dtype(str).
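A minimal sketch of that normalization, assuming a vlen dtype is an object dtype that carries its element type in the dtype metadata. normalize_element_type and make_vlen_dtype are hypothetical helpers written for illustration; they are not xarray's _infer_dtype or create_vlen_dtype.

```python
import numpy as np


def normalize_element_type(element):
    """Hypothetical helper: collapse str/bytes subclasses to the builtin type."""
    if isinstance(element, str):  # also true for np.str_
        return str
    if isinstance(element, bytes):  # also true for np.bytes_
        return bytes
    raise TypeError(f"not a string-like element: {type(element)!r}")


def make_vlen_dtype(element_type):
    """Hypothetical stand-in for create_vlen_dtype.

    Assumption: the element type is stored in the object dtype's metadata.
    """
    return np.dtype("O", metadata={"element_type": element_type})


element = np.str_("scenario_a")

# Using type(element) directly records np.str_ in the dtype.
naive = make_vlen_dtype(type(element))
# Normalizing first records the builtin str instead.
normalized = make_vlen_dtype(normalize_element_type(element))

print(naive.metadata["element_type"] is str)       # False
print(normalized.metadata["element_type"] is str)  # True
```

With the normalized dtype, a later comparison of the stored element type against str succeeds, regardless of whether the original values were str or np.str_.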

znicholls · 2021-07-14T02:19:55Z

Something like 59ed7d5? (Obviously missing proper tests but just to get a sense of whether the idea is plausible)

znicholls · 2021-07-14T02:35:18Z

xarray/tests/test_backends.py::test_open_fsspec appears to fail because of the release of fsspec 2021.7.0, so this PR will probably have to wait until a fix for that is added (presumably elsewhere, to keep the changes clear).

znicholls · 2021-07-19T23:19:01Z

Ignoring failing CI due to fsspec (see #5615 (comment))

znicholls · 2021-10-01T01:27:15Z

@shoyer can I bother you again now that CI is passing please?

znicholls · 2021-10-01T01:27:24Z

@lewisjarednz fyi

shoyer

Looks great, thanks! Please move the test, then we can merge this

xarray/tests/test_coding_strings.py

Illviljan · 2021-11-11T21:03:23Z

Looks good to me, nice work!

Illviljan · 2021-12-30T23:40:28Z

Thanks @znicholls!

max-sixty added the needs work label Jun 12, 2021

znicholls force-pushed the numpy-str-encoding branch from 33f0e09 to ca6abdb Compare July 14, 2021 01:40

znicholls force-pushed the numpy-str-encoding branch from f2e1550 to 5149cc7 Compare July 19, 2021 23:11

znicholls added 6 commits October 1, 2021 08:57

Add failing test

c286199

Try fix

f2edd52

Lint

d8fa99f

Require netCDF4 for test

35cab0e

Move fix to infer dtype

bf5edd0

Update thanks to @shoyer

5e15269

znicholls force-pushed the numpy-str-encoding branch from 5149cc7 to 5e15269 Compare September 30, 2021 22:58

Whats new

61f63df

znicholls marked this pull request as ready for review October 1, 2021 01:26

shoyer reviewed Oct 1, 2021

xarray/tests/test_coding_strings.py (outdated, resolved)

Move test and add comment

fc8252e

znicholls requested a review from shoyer October 2, 2021 08:36

Illviljan added 2 commits November 11, 2021 21:16

Merge branch 'main' into pr/5264

cb29c5d

Update whats-new.rst

28a96c2

Illviljan removed the needs work label Nov 11, 2021

Illviljan added the plan to merge Final call for comments label Nov 11, 2021

Illviljan added 3 commits December 29, 2021 11:48

Merge branch 'main' into pr/5264

4182666

Update whats-new.rst

7af84da

Merge branch 'main' into pr/5264

e450cd2

Illviljan merged commit f75c3be into pydata:main Dec 30, 2021
