
open_dataset with chunks="auto" fails when a netCDF4 variable/coordinate is encoded as NC_STRING #7868


Closed
ghiggi opened this issue May 23, 2023 · 8 comments · Fixed by #7869
Labels
topic-metadata Relating to the handling of metadata (i.e. attrs and encoding)

Comments


ghiggi commented May 23, 2023

What is your issue?

I noticed that open_dataset with chunks="auto" fails when netCDF4 variables/coordinates are encoded as NC_STRING.
The reason is that xarray reads netCDF4 NC_STRING as object dtype, and dask cannot estimate the size of an object dtype.

As a workaround, the user must currently rewrite the netCDF4 file, specifying a fixed-length string dtype in the encoding of each string DataArray (e.g. "S2" if the maximum string length is 2), so that the data are written as NC_CHAR and xarray reads them back with a fixed-length byte-string dtype.

Below is a reproducible example:

import xarray as xr
import numpy as np

# Define string DataArray
arr = np.array(["M6", "M3"], dtype=str)
print(arr.dtype)  # <U2
da = xr.DataArray(data=arr, dims="time")
data_vars = {"str_arr": da}

# Create dataset
ds_nc_string = xr.Dataset(data_vars=data_vars)

# Set chunking to see behaviour at read-time
ds_nc_string["str_arr"] = ds_nc_string["str_arr"].chunk(1)  # chunks ((1,1),)
 
# Write dataset with NC_STRING
ds_nc_string["str_arr"].encoding["dtype"] = str
ds_nc_string.to_netcdf("/tmp/nc_string.nc")

# Write dataset with NC_CHAR
ds_nc_char = xr.Dataset(data_vars=data_vars)
ds_nc_char["str_arr"].encoding["dtype"] = "S2"
ds_nc_char.to_netcdf("/tmp/nc_char.nc")

# With NC_STRING, chunks="auto" does not work:
# --> NC_STRING is read as object dtype, and dask cannot estimate the chunk size!
# With chunks={}, the NC_STRING array is read as a single dask chunk!!!
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks="auto") # NotImplementedError
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks={})     # Works
ds_nc_string.chunks  # chunks (2,)

# With NC_CHAR, chunks={} and chunks="auto" both work and return the same result!
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks={})   
ds_nc_char.chunks # chunks (2,)
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks="auto")
ds_nc_char.chunks # chunks (2,)

# NC_STRING is read back as object 
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks=None)
ds_nc_string["str_arr"].dtype  #  object 

# NC_CHAR is read back as fixed length byte-string representation (S2) 
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks=None)
ds_nc_char["str_arr"].dtype            #  S2 
ds_nc_char["str_arr"].data.astype(str) #  U2 

Questions:

  • Shouldn't open_dataset automatically decode the NC_CHAR fixed-length byte-string representation into Unicode strings?
  • Shouldn't open_dataset automatically read NC_STRING as Unicode strings (i.e. convert object to str)?
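As a minimal sketch of what the NC_CHAR decoding step could look like on the user side (the helper name `decode_bytes_to_str` is hypothetical, not an xarray API), fixed-length byte strings can be converted to Unicode with a plain `astype`:

```python
import numpy as np
import xarray as xr

# Hypothetical helper (not an xarray API): decode every fixed-length
# byte-string variable ("S..." dtype, i.e. what NC_CHAR is read back as)
# into the corresponding fixed-length Unicode dtype ("U...").
def decode_bytes_to_str(ds: xr.Dataset) -> xr.Dataset:
    for name in list(ds.variables):
        if ds.variables[name].dtype.kind == "S":
            ds[name] = ds.variables[name].astype(str)
    return ds

# Mimic what open_dataset returns for an NC_CHAR-encoded variable.
ds = xr.Dataset({"str_arr": ("time", np.array([b"M6", b"M3"], dtype="S2"))})
ds = decode_bytes_to_str(ds)
print(ds["str_arr"].dtype)  # <U2
```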

Related issues are:

@ghiggi ghiggi added the needs triage Issue that has not been reviewed by xarray team member label May 23, 2023

kmuehlbauer commented May 23, 2023

@ghiggi Thanks for getting this back into action. I got dragged away from the one string object issue in #7654. I'll split this out and add a PR.

@kmuehlbauer kmuehlbauer added topic-metadata Relating to the handling of metadata (i.e. attrs and encoding) and removed needs triage Issue that has not been reviewed by xarray team member labels May 23, 2023
@kmuehlbauer
Contributor

@ghiggi I'd appreciate it if you could test your workflows against #7869. Your example and the one over in #7652 are working AFAICT.


ghiggi commented May 24, 2023

Thanks @kmuehlbauer! #7869 solves the issue!

Summarizing:

  • With preserve vlen string dtypes, allow vlen string fill_values #7869, netCDF4 NC_STRING variable arrays are now read into xarray with a Unicode dtype (instead of object).
  • As a consequence, dask can estimate the array's size and xr.open_dataset(fpath, chunks="auto") no longer raises NotImplementedError.
  • NC_CHAR variable arrays continue to be read into xarray with a fixed-length byte-string dtype. Maybe something more could be done to also decode NC_CHAR to Unicode; however, this might introduce backward incompatibilities and may be better addressed in a separate PR.

Thanks again @kmuehlbauer for resolving the problem in less than 2 hours 🥇

@kmuehlbauer
Contributor

@ghiggi Glad it works, but we still have to check if that is the correct location for the fix, as it's not CF specific.

@kmuehlbauer
Contributor

My main question here is: why does dask not try to retrieve the object types from dtype.metadata? Or does it, and fails for some reason?


ghiggi commented May 24, 2023

A dask array with object dtype can contain arbitrary Python objects (e.g. I have seen geometries and matplotlib collections inside dask arrays with object dtype). As a consequence, dask does not attempt a conversion (e.g. to str) to estimate the array size, since AFAIK there is no clean way to attach an attribute to the dtype indicating that the objects are actually strings.
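The size-estimation limitation can be reproduced with dask alone. This sketch (independent of xarray/netCDF) shows that "auto" chunking works for a fixed-width Unicode dtype but raises NotImplementedError for object dtype:

```python
import numpy as np
import dask.array as dask_array

# Fixed-width Unicode dtype: the itemsize is known (2 chars * 4 bytes = 8),
# so dask can pick chunk sizes automatically.
unicode_arr = np.array(["M6", "M3"], dtype="<U2")
print(unicode_arr.dtype.itemsize)  # 8
da_ok = dask_array.from_array(unicode_arr, chunks="auto")

# object dtype: each element is a pointer to an arbitrary Python object,
# so dask cannot estimate the per-element size and refuses "auto" chunking.
try:
    dask_array.from_array(unicode_arr.astype(object), chunks="auto")
except NotImplementedError as err:
    print(type(err).__name__)  # NotImplementedError
```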

With your PR, the dtype is no longer object when the dask array is created, and I guess this is what solves the issue.
Did I overlook something?

@kmuehlbauer
Contributor

Thanks @ghiggi for your comment.

The problem is that we have at least two contradictory user requests here, see #7328 and #7862.

I'm sure there is a solution to accommodate both sides.


ghiggi commented Jun 28, 2023

UPDATE
I updated the reproducible code snippet after seeing PR #7948.
Using chunks={} instead of chunks="auto", it is already possible to read NC_STRING arrays into a (single-chunk) dask array. If PR #7948 is merged, chunks={} will return a dask array with the same chunk sizes as in the netCDF file.

NOTE:
PR #7948 does not address the chunks="auto" problem nor the decoding of the strings. That side of the problem is addressed by PR #7869.
