
open_dataset with chunks="auto" fails when a netCDF4 variable/coordinate is encoded as NC_STRING #7868


Closed
ghiggi opened this issue May 23, 2023 · 8 comments · Fixed by #7869
Labels
topic-metadata Relating to the handling of metadata (i.e. attrs and encoding)

Comments


ghiggi commented May 23, 2023

What is your issue?

I noticed that open_dataset with chunks="auto" fails when netCDF4 variables/coordinates are encoded as NC_STRING.
The reason is that xarray reads netCDF4 NC_STRING as object dtype, and dask cannot estimate the size of an object dtype.

As a workaround, the user must currently rewrite the netCDF4 file, specifying a fixed-length string dtype in the encoding of each string DataArray (e.g. "S2" if the maximum string length is 2), so that the data are written as NC_CHAR and xarray reads them back with a fixed-length byte-string dtype.

Below is a reproducible example:

import xarray as xr
import numpy as np

# Define string DataArray
arr = np.array(["M6", "M3"], dtype=str)
print(arr.dtype)  # <U2
da = xr.DataArray(data=arr, dims="time")
data_vars = {"str_arr": da}

# Create dataset
ds_nc_string = xr.Dataset(data_vars=data_vars)

# Set chunking to see behaviour at read-time
ds_nc_string["str_arr"] = ds_nc_string["str_arr"].chunk(1)  # chunks ((1,1),)
 
# Write dataset with NC_STRING
ds_nc_string["str_arr"].encoding["dtype"] = str
ds_nc_string.to_netcdf("/tmp/nc_string.nc")

# Write dataset with NC_CHAR
ds_nc_char = xr.Dataset(data_vars=data_vars)
ds_nc_char["str_arr"].encoding["dtype"] = "S2"
ds_nc_char.to_netcdf("/tmp/nc_char.nc")

# With NC_STRING, chunks="auto" does not work:
# --> NC_STRING is read as object dtype, and dask cannot estimate the chunk size!
# With chunks={}, the NC_STRING array is read as a single dask chunk!!!
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks="auto") # NotImplementedError
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks={})     # Works
ds_nc_string.chunks  # chunks (2,)

# With NC_CHAR, chunks={} and chunks="auto" both work and return the same result!
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks={})   
ds_nc_char.chunks # chunks (2,)
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks="auto")
ds_nc_char.chunks # chunks (2,)

# NC_STRING is read back as object 
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks=None)
ds_nc_string["str_arr"].dtype  #  object 

# NC_CHAR is read back as fixed length byte-string representation (S2) 
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks=None)
ds_nc_char["str_arr"].dtype            #  S2 
ds_nc_char["str_arr"].data.astype(str) #  U2 

Questions:

  • Shouldn't open_dataset automatically decode the NC_CHAR fixed-length byte-string representation into Unicode strings?
  • Shouldn't open_dataset automatically read NC_STRING as Unicode strings (i.e. convert object to str)?
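As a minimal sketch of what the NC_CHAR decoding step could look like on the user side (the helper name `decode_bytes_to_str` is hypothetical, not an xarray API), fixed-length byte strings can be converted to Unicode with a plain `astype`:

```python
import numpy as np
import xarray as xr

# Hypothetical helper (not an xarray API): decode every fixed-length
# byte-string variable ("S..." dtype, i.e. what NC_CHAR is read back as)
# into the corresponding fixed-length Unicode dtype ("U...").
def decode_bytes_to_str(ds: xr.Dataset) -> xr.Dataset:
    for name in list(ds.variables):
        if ds.variables[name].dtype.kind == "S":
            ds[name] = ds.variables[name].astype(str)
    return ds

# Mimic what open_dataset returns for an NC_CHAR-encoded variable.
ds = xr.Dataset({"str_arr": ("time", np.array([b"M6", b"M3"], dtype="S2"))})
ds = decode_bytes_to_str(ds)
print(ds["str_arr"].dtype)  # <U2
```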

Related issues are:

@ghiggi ghiggi added the needs triage Issue that has not been reviewed by xarray team member label May 23, 2023

kmuehlbauer commented May 23, 2023

@ghiggi Thanks for getting this back into action. I got dragged away from the one string object issue in #7654. I'll split this out and add a PR.

@kmuehlbauer kmuehlbauer added topic-metadata Relating to the handling of metadata (i.e. attrs and encoding) and removed needs triage Issue that has not been reviewed by xarray team member labels May 23, 2023
@kmuehlbauer
Contributor

@ghiggi I'd appreciate it if you could test your workflows against #7869. Your example and the one over in #7652 are working AFAICT.


ghiggi commented May 24, 2023

Thanks @kmuehlbauer! #7869 solves the issue!

Summarizing:

  • With preserve vlen string dtypes, allow vlen string fill_values #7869, netCDF4 NC_STRING variable arrays are now read into xarray with a Unicode dtype (instead of object).
  • As a consequence, dask can estimate the array's size and xr.open_dataset(fpath, chunks="auto") no longer raises NotImplementedError.
  • NC_CHAR variable arrays continue to be read into xarray with a fixed-length byte-string dtype. Maybe something more could be done to also decode NC_CHAR to Unicode; however, this might introduce backward incompatibilities and may be better addressed in a separate PR.

Thanks again @kmuehlbauer for resolving the problem in less than 2 hours 🥇

@kmuehlbauer
Contributor

@ghiggi Glad it works, but we still have to check if that is the correct location for the fix, as it's not CF specific.

@kmuehlbauer
Contributor

My main question here is: why does dask not try to retrieve the object types from dtype.metadata? Or does it, and fails for some reason?


ghiggi commented May 24, 2023

A dask array with object dtype can contain arbitrary Python objects (e.g. I have seen geometries and matplotlib collections inside dask arrays with object dtype). As a consequence, dask does not attempt a conversion (e.g. to str) to estimate the array size, since AFAIK there is no clean way to attach an attribute to the dtype indicating that the objects are actually strings.
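The size-estimation limitation can be reproduced with dask alone. This sketch (independent of xarray/netCDF) shows that "auto" chunking works for a fixed-width Unicode dtype but raises NotImplementedError for object dtype:

```python
import numpy as np
import dask.array as dask_array

# Fixed-width Unicode dtype: the itemsize is known (2 chars * 4 bytes = 8),
# so dask can pick chunk sizes automatically.
unicode_arr = np.array(["M6", "M3"], dtype="<U2")
print(unicode_arr.dtype.itemsize)  # 8
da_ok = dask_array.from_array(unicode_arr, chunks="auto")

# object dtype: each element is a pointer to an arbitrary Python object,
# so dask cannot estimate the per-element size and refuses "auto" chunking.
try:
    dask_array.from_array(unicode_arr.astype(object), chunks="auto")
except NotImplementedError as err:
    print(type(err).__name__)  # NotImplementedError
```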

With your PR, the dtype is no longer object when the dask array is created, and I guess this is what solves the issue.
Did I overlook something?

@kmuehlbauer
Contributor

Thanks @ghiggi for your comment.

The problem is that we have at least two contradictory user requests here, see #7328 and #7862.

I'm sure there is a solution to accommodate both sides.


ghiggi commented Jun 28, 2023

UPDATE
I updated the reproducible code snippet after seeing PR #7948.
Using chunks={} instead of chunks="auto", it is already possible to read NC_STRING arrays into a (single-chunk) dask array. If PR #7948 is merged, chunks={} will return a dask array with the same chunk sizes as in the netCDF file.

NOTE:
PR #7948 does not address the chunks="auto" problem nor the decoding of the strings. That side of the problem is addressed by PR #7869.
