I recently tried to use xarray to open some netCDF files stored in a bucket, and was surprised how hard it was to figure out the right incantation to make this work.
The fact that passing an fsspec URL (like `"s3://bucket/path/data.zarr"`) to `open_dataset` "just works" for zarr is a little misleading, since it makes you think you could do something similar for other types of files. However, this doesn't work for netCDF, GRIB, and I assume most others.
That said, `h5netcdf` does work if you pass an fsspec file-like object (I'm not sure whether other engines support this as well). But to add to the confusion, you can't pass the `fsspec.OpenFile` you get from `fsspec.open`; you have to pass a concrete type like `S3File`, `GCSFile`, etc.:
```python
>>> import xarray as xr
>>> import fsspec
>>> url = "s3://noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp"  # a netCDF file in s3
```
You can't use the URL as a string directly:
```python
>>> xr.open_dataset(url, engine='h5netcdf')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = 's3://noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
```
Ok, what about fsspec.open?
```python
>>> f = fsspec.open(url)
>>> f
<OpenFile 'noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp'>
>>> xr.open_dataset(f, engine='h5netcdf')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
...
File ~/miniconda3/envs/xarray-buckets/lib/python3.10/site-packages/xarray/backends/common.py:23, in _normalize_path(path)
     21 def _normalize_path(path):
     22     if isinstance(path, os.PathLike):
---> 23         path = os.fspath(path)
     25     if isinstance(path, str) and not is_remote_uri(path):
     26         path = os.path.abspath(os.path.expanduser(path))

File ~/miniconda3/envs/xarray-buckets/lib/python3.10/site-packages/fsspec/core.py:98, in OpenFile.__fspath__(self)
     96 def __fspath__(self):
     97     # may raise if cannot be resolved to local file
---> 98     return self.open().__fspath__()

AttributeError: 'S3File' object has no attribute '__fspath__'
```
But if you somehow know that an `fsspec.OpenFile` isn't actually a file-like object, and you double-open it, then it works! (xref #5879 (comment))

(And even then, you have to know to use the `h5netcdf` engine, and not `netcdf4` or `scipy`.)
Some things that might be nice:

- Explicit documentation on working with data in cloud storage, perhaps broken down by file type/engine (xref improve docs on zarr + cloud storage #2712). It might be nice to have a table/quick reference of which engines support reading from cloud storage, and how to pass in the URL (string? fsspec file object?).
- An informative error linking to those docs when opening fails and `is_remote_uri(filename_or_obj)` is true.
- Either make `fsspec.OpenFile` objects work, so you don't have to do the double-open, or raise an informative error when one is passed in telling you what to do instead.
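On the error-message suggestion: a hypothetical sketch of what such a check could look like, reusing the regex from xarray's `is_remote_uri`. The `normalize_path` helper and the message text are my own invention (and in practice the check would have to be engine-aware, since zarr legitimately accepts remote URI strings):

```python
import os
import re

# Same pattern xarray's is_remote_uri uses: a protocol prefix like "s3://"
# or "https://", or a chained "::" URL.
def is_remote_uri(path: str) -> bool:
    return bool(re.search(r"^[a-z][a-z0-9]*(\://|\:\:)", path))

def normalize_path(filename_or_obj: str) -> str:
    """Hypothetical variant of xarray's _normalize_path that fails loudly
    for remote URIs instead of silently treating them as local paths."""
    if is_remote_uri(filename_or_obj):
        raise ValueError(
            f"cannot open remote URI {filename_or_obj!r} with this engine; "
            "pass an open fsspec file object instead, e.g. "
            "xr.open_dataset(fsspec.open(uri).open(), engine='h5netcdf'). "
            "See the xarray docs on reading from cloud storage."
        )
    return os.path.abspath(os.path.expanduser(filename_or_obj))
```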
As more and more data is available on cloud storage, newcomers to xarray will probably be increasingly looking to use it with remote data. Since xarray already supports this in some cases, this is great! With a few tweaks to docs and error messages, I think we could change an experience that took me multiple hours of debugging and reading the source into an easy 30sec experience for new users.
cc @martindurant @phobson