Commit fffb03c

flamingbear, TomNicholas, pre-commit-ci[bot], and keewis authored
add open_datatree to xarray (#8697)
* DAS-2060: Skips datatree_ CI
  Adds additional ignore to mypy
  Adds additional ignore to doctests
  Excludes xarray/datatree_ from all pre-commmit.ci
* DAS-2070: Migrate open_datatree into xarray.
  First stab. Will need to add/move tests.
* DAS-2060: replace relative import of datatree to library
* DAS-2060: revert the exporting of NodePath from datatree
  I mistakenly thought we wanted to use the hidden version of datatree_ and we do not.
* Don't expose open_datatree at top level
  We do not want to expose open_datatree at top level until all of the code is migrated.
* Point datatree imports to xarray.datatree_.datatree
* Updates function signatures for mypy.
* Move io tests, remove undefined reference to documentation.
  Also starts fixing simple mypy errors
* Pass bare-minimum tests.
* Update pyproject.toml to exclude imported datatree_ modules.
  Add some typing for mygrated tests.
  Adds display_expand_groups to core options.
* Adding back type ignores
  This is cargo-cult. I wonder if there's a different CI test that wanted these and since this is now excluded at the top level. I'm putting them back until migration into main codebase.
* Refactor open_datatree back together.
  puts common parts in common.
* Removes TODO comment
* typo fix
  Co-authored-by: Tom Nicholas <[email protected]>
* typo 2
  Co-authored-by: Tom Nicholas <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Call raised exception
* Add unpacking notation to kwargs
* Use final location for DataTree doc strings
  Co-authored-by: Justus Magin <[email protected]>
* fix comment from open_dataset to open_datatree
  Co-authored-by: Justus Magin <[email protected]>
* Revert "fix comment from open_dataset to open_datatree"
  This reverts commit aab1744.
* Change sphynx link from meth to func
* Update whats-new.rst
* Fix what-new.rst formatting.

---------

Co-authored-by: Tom Nicholas <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Justus Magin <[email protected]>
1 parent 4806412 commit fffb03c

25 files changed, +263 -193 lines changed

doc/roadmap.rst

+1 -1
@@ -156,7 +156,7 @@ types would also be highly useful for xarray users.
 By pursuing these improvements in NumPy we hope to extend the benefits
 to the full scientific Python community, and avoid tight coupling
 between xarray and specific third-party libraries (e.g., for
-implementing untis). This will allow xarray to maintain its domain
+implementing units). This will allow xarray to maintain its domain
 agnostic strengths.

 We expect that we may eventually add some minimal interfaces in xarray

doc/whats-new.rst

+8 -1
@@ -90,9 +90,16 @@ Internal Changes
   when the data isn't datetime-like. (:issue:`8718`, :pull:`8724`)
   By `Maximilian Roos <https://github.com/max-sixty>`_.

-- Move `parallelcompat` and `chunk managers` modules from `xarray/core` to `xarray/namedarray`. (:pull:`8319`)
+- Move ``parallelcompat`` and ``chunk managers`` modules from ``xarray/core`` to ``xarray/namedarray``. (:pull:`8319`)
   By `Tom Nicholas <https://github.com/TomNicholas>`_ and `Anderson Banihirwe <https://github.com/andersy005>`_.

+- Imports ``datatree`` repository and history into internal
+  location. (:pull:`8688`) By `Matt Savoie <https://github.com/flamingbear>`_
+  and `Justus Magin <https://github.com/keewis>`_.
+
+- Adds :py:func:`open_datatree` into ``xarray/backends`` (:pull:`8697`) By `Matt
+  Savoie <https://github.com/flamingbear>`_.
+
 .. _whats-new.2024.01.1:

 v2024.01.1 (23 Jan, 2024)

pyproject.toml

+5
@@ -96,6 +96,11 @@ warn_redundant_casts = true
 warn_unused_configs = true
 warn_unused_ignores = true

+# Ignore mypy errors for modules imported from datatree_.
+[[tool.mypy.overrides]]
+module = "xarray.datatree_.*"
+ignore_errors = true
+
 # Much of the numerical computing stack doesn't have type annotations yet.
 [[tool.mypy.overrides]]
 ignore_missing_imports = true

xarray/backends/api.py

+29
@@ -69,6 +69,7 @@
     T_NetcdfTypes = Literal[
         "NETCDF4", "NETCDF4_CLASSIC", "NETCDF3_64BIT", "NETCDF3_CLASSIC"
     ]
+    from xarray.datatree_.datatree import DataTree

 DATAARRAY_NAME = "__xarray_dataarray_name__"
 DATAARRAY_VARIABLE = "__xarray_dataarray_variable__"
@@ -788,6 +789,34 @@ def open_dataarray(
     return data_array


+def open_datatree(
+    filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
+    engine: T_Engine = None,
+    **kwargs,
+) -> DataTree:
+    """
+    Open and decode a DataTree from a file or file-like object, creating one tree node for each group in the file.
+
+    Parameters
+    ----------
+    filename_or_obj : str, Path, file-like, or DataStore
+        Strings and Path objects are interpreted as a path to a netCDF file or Zarr store.
+    engine : str, optional
+        Xarray backend engine to use. Valid options include `{"netcdf4", "h5netcdf", "zarr"}`.
+    **kwargs : dict
+        Additional keyword arguments passed to :py:func:`~xarray.open_dataset` for each group.
+    Returns
+    -------
+    xarray.DataTree
+    """
+    if engine is None:
+        engine = plugins.guess_engine(filename_or_obj)
+
+    backend = plugins.get_backend(engine)
+
+    return backend.open_datatree(filename_or_obj, **kwargs)
+
+
 def open_mfdataset(
     paths: str | NestedSequence[str | os.PathLike],
     chunks: T_Chunks | None = None,
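
The new `open_datatree` resolves an engine, looks up the matching backend, and delegates to that backend's own `open_datatree`. That dispatch pattern can be sketched without xarray installed. This is a stand-in under stated assumptions, not xarray's implementation: `BACKENDS`, `guess_engine`, `ToyBackend`, and `ToyZarrBackend` are hypothetical names, and the engine guess is a plain suffix check chosen only for illustration.

```python
from __future__ import annotations


class ToyBackend:
    """Hypothetical backend base class; implementing open_datatree is optional."""

    def open_datatree(self, filename_or_obj: str, **kwargs) -> dict:
        raise NotImplementedError()


class ToyZarrBackend(ToyBackend):
    def open_datatree(self, filename_or_obj: str, **kwargs) -> dict:
        # A real backend would read groups from storage; this returns a marker.
        return {"engine": "zarr", "source": filename_or_obj, "kwargs": kwargs}


# Hypothetical registry mapping engine names to backend instances.
BACKENDS: dict[str, ToyBackend] = {"zarr": ToyZarrBackend(), "plain": ToyBackend()}


def guess_engine(filename_or_obj: str) -> str:
    # Assumption of this sketch: guess from the file suffix only.
    return "zarr" if filename_or_obj.endswith(".zarr") else "plain"


def open_datatree(filename_or_obj: str, engine: str | None = None, **kwargs) -> dict:
    # Same three steps as the function in the diff: guess, look up, delegate.
    if engine is None:
        engine = guess_engine(filename_or_obj)
    backend = BACKENDS[engine]
    return backend.open_datatree(filename_or_obj, **kwargs)
```

A backend that does not override `open_datatree` falls through to the base method and raises `NotImplementedError`, which is how an unsupported engine would surface in this sketch.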

xarray/backends/common.py

+58 -1
@@ -19,8 +19,12 @@
 if TYPE_CHECKING:
     from io import BufferedIOBase

+    from h5netcdf.legacyapi import Dataset as ncDatasetLegacyH5
+    from netCDF4 import Dataset as ncDataset
+
     from xarray.core.dataset import Dataset
     from xarray.core.types import NestedSequence
+    from xarray.datatree_.datatree import DataTree

 # Create a logger object, but don't add any handlers. Leave that to user code.
 logger = logging.getLogger(__name__)
@@ -127,6 +131,43 @@ def _decode_variable_name(name):
     return name


+def _open_datatree_netcdf(
+    ncDataset: ncDataset | ncDatasetLegacyH5,
+    filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
+    **kwargs,
+) -> DataTree:
+    from xarray.backends.api import open_dataset
+    from xarray.datatree_.datatree import DataTree
+    from xarray.datatree_.datatree.treenode import NodePath
+
+    ds = open_dataset(filename_or_obj, **kwargs)
+    tree_root = DataTree.from_dict({"/": ds})
+    with ncDataset(filename_or_obj, mode="r") as ncds:
+        for path in _iter_nc_groups(ncds):
+            subgroup_ds = open_dataset(filename_or_obj, group=path, **kwargs)
+
+            # TODO refactor to use __setitem__ once creation of new nodes by assigning Dataset works again
+            node_name = NodePath(path).name
+            new_node: DataTree = DataTree(name=node_name, data=subgroup_ds)
+            tree_root._set_item(
+                path,
+                new_node,
+                allow_overwrite=False,
+                new_nodes_along_path=True,
+            )
+    return tree_root
+
+
+def _iter_nc_groups(root, parent="/"):
+    from xarray.datatree_.datatree.treenode import NodePath
+
+    parent = NodePath(parent)
+    for path, group in root.groups.items():
+        gpath = parent / path
+        yield str(gpath)
+        yield from _iter_nc_groups(group, parent=gpath)
+
+
 def find_root_and_group(ds):
     """Find the root and group name of a netCDF4/h5netcdf dataset."""
     hierarchy = ()
@@ -458,6 +499,11 @@ class BackendEntrypoint:
     - ``guess_can_open`` method: it shall return ``True`` if the backend is able to open
       ``filename_or_obj``, ``False`` otherwise. The implementation of this
       method is not mandatory.
+    - ``open_datatree`` method: it shall implement reading from file, variables
+      decoding and it returns an instance of :py:class:`~datatree.DataTree`.
+      It shall take in input at least ``filename_or_obj`` argument. The
+      implementation of this method is not mandatory. For more details see
+      <reference to open_datatree documentation>.

     Attributes
     ----------
@@ -496,7 +542,7 @@ def open_dataset(
         Backend open_dataset method used by Xarray in :py:func:`~xarray.open_dataset`.
         """

-        raise NotImplementedError
+        raise NotImplementedError()

     def guess_can_open(
         self,
@@ -508,6 +554,17 @@ def guess_can_open(

         return False

+    def open_datatree(
+        self,
+        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
+        **kwargs: Any,
+    ) -> DataTree:
+        """
+        Backend open_datatree method used by Xarray in :py:func:`~xarray.open_datatree`.
+        """
+
+        raise NotImplementedError()
+

 # mapping of engine name to (module name, BackendEntrypoint Class)
 BACKEND_ENTRYPOINTS: dict[str, tuple[str | None, type[BackendEntrypoint]]] = {}
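
The recursive walk in `_iter_nc_groups` yields each group's absolute path before descending into it, so parents are always seen before their children. A small sketch of that traversal, assuming a hypothetical `ToyGroup` class with a `groups` mapping in place of a netCDF dataset and `pathlib.PurePosixPath` in place of `NodePath`:

```python
from pathlib import PurePosixPath


class ToyGroup:
    """Hypothetical stand-in for a netCDF group exposing a ``groups`` mapping."""

    def __init__(self, **children: "ToyGroup") -> None:
        self.groups = dict(children)


def iter_groups(root, parent="/"):
    # PurePosixPath stands in for NodePath here (an assumption of this sketch).
    parent = PurePosixPath(parent)
    for name, group in root.groups.items():
        gpath = parent / name
        yield str(gpath)  # emit the parent first ...
        yield from iter_groups(group, parent=gpath)  # ... then its descendants


root = ToyGroup(a=ToyGroup(b=ToyGroup()), c=ToyGroup())
paths = list(iter_groups(root))
print(paths)
```

The root itself is never yielded, which matches the diff: the root dataset is opened separately and the generator only supplies the subgroup paths.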

xarray/backends/h5netcdf_.py

+11
@@ -11,6 +11,7 @@
     BackendEntrypoint,
     WritableCFDataStore,
     _normalize_path,
+    _open_datatree_netcdf,
     find_root_and_group,
 )
 from xarray.backends.file_manager import CachingFileManager, DummyFileManager
@@ -38,6 +39,7 @@

     from xarray.backends.common import AbstractDataStore
     from xarray.core.dataset import Dataset
+    from xarray.datatree_.datatree import DataTree


 class H5NetCDFArrayWrapper(BaseNetCDF4Array):
@@ -423,5 +425,14 @@ def open_dataset( # type: ignore[override] # allow LSP violation, not supporti
         )
         return ds

+    def open_datatree(
+        self,
+        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
+        **kwargs,
+    ) -> DataTree:
+        from h5netcdf.legacyapi import Dataset as ncDataset
+
+        return _open_datatree_netcdf(ncDataset, filename_or_obj, **kwargs)
+

 BACKEND_ENTRYPOINTS["h5netcdf"] = ("h5netcdf", H5netcdfBackendEntrypoint)

xarray/backends/netCDF4_.py

+11
@@ -16,6 +16,7 @@
     BackendEntrypoint,
     WritableCFDataStore,
     _normalize_path,
+    _open_datatree_netcdf,
     find_root_and_group,
     robust_getitem,
 )
@@ -44,6 +45,7 @@

     from xarray.backends.common import AbstractDataStore
     from xarray.core.dataset import Dataset
+    from xarray.datatree_.datatree import DataTree

 # This lookup table maps from dtype.byteorder to a readable endian
 # string used by netCDF4.
@@ -667,5 +669,14 @@ def open_dataset( # type: ignore[override] # allow LSP violation, not supporti
         )
         return ds

+    def open_datatree(
+        self,
+        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
+        **kwargs,
+    ) -> DataTree:
+        from netCDF4 import Dataset as ncDataset
+
+        return _open_datatree_netcdf(ncDataset, filename_or_obj, **kwargs)
+

 BACKEND_ENTRYPOINTS["netcdf4"] = ("netCDF4", NetCDF4BackendEntrypoint)

xarray/backends/zarr.py

+44
@@ -34,6 +34,7 @@

     from xarray.backends.common import AbstractDataStore
     from xarray.core.dataset import Dataset
+    from xarray.datatree_.datatree import DataTree


 # need some special secret attributes to tell us the dimensions
@@ -1039,5 +1040,48 @@ def open_dataset( # type: ignore[override] # allow LSP violation, not supporti
         )
         return ds

+    def open_datatree(
+        self,
+        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
+        **kwargs,
+    ) -> DataTree:
+        import zarr
+
+        from xarray.backends.api import open_dataset
+        from xarray.datatree_.datatree import DataTree
+        from xarray.datatree_.datatree.treenode import NodePath
+
+        zds = zarr.open_group(filename_or_obj, mode="r")
+        ds = open_dataset(filename_or_obj, engine="zarr", **kwargs)
+        tree_root = DataTree.from_dict({"/": ds})
+        for path in _iter_zarr_groups(zds):
+            try:
+                subgroup_ds = open_dataset(
+                    filename_or_obj, engine="zarr", group=path, **kwargs
+                )
+            except zarr.errors.PathNotFoundError:
+                subgroup_ds = Dataset()
+
+            # TODO refactor to use __setitem__ once creation of new nodes by assigning Dataset works again
+            node_name = NodePath(path).name
+            new_node: DataTree = DataTree(name=node_name, data=subgroup_ds)
+            tree_root._set_item(
+                path,
+                new_node,
+                allow_overwrite=False,
+                new_nodes_along_path=True,
+            )
+        return tree_root
+
+
+def _iter_zarr_groups(root, parent="/"):
+    from xarray.datatree_.datatree.treenode import NodePath
+
+    parent = NodePath(parent)
+    for path, group in root.groups():
+        gpath = parent / path
+        yield str(gpath)
+        yield from _iter_zarr_groups(group, parent=gpath)
+

 BACKEND_ENTRYPOINTS["zarr"] = ("zarr", ZarrBackendEntrypoint)
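
Both the netCDF and Zarr paths attach each subgroup with `_set_item(..., allow_overwrite=False, new_nodes_along_path=True)`. The sketch below illustrates what those flag names suggest, using a nested dict rather than `DataTree`. It is an assumption about the intended semantics, not the library's implementation, and `set_item` and the `{"data", "children"}` node layout are hypothetical.

```python
def set_item(tree, path, value, *, allow_overwrite=False, new_nodes_along_path=True):
    """Insert ``value`` at ``path`` in a nested dict of {"data": ..., "children": {...}}."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        raise ValueError("cannot set the root node")
    node = tree
    for name in parts[:-1]:
        if name not in node["children"]:
            if not new_nodes_along_path:
                raise KeyError(f"no node named {name!r} along {path!r}")
            # Create an empty intermediate node so deeper paths can be attached.
            node["children"][name] = {"data": None, "children": {}}
        node = node["children"][name]
    leaf = parts[-1]
    if leaf in node["children"] and not allow_overwrite:
        raise KeyError(f"node {path!r} already exists")
    node["children"][leaf] = {"data": value, "children": {}}


tree = {"data": "root", "children": {}}
for p in ["/a", "/a/b", "/c/d"]:
    set_item(tree, p, f"dataset at {p}")
```

In this sketch `/c` is created as an empty placeholder when `/c/d` is inserted, and a second insert at `/a` is rejected because overwriting is disabled.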

xarray/core/options.py

+3
@@ -20,6 +20,7 @@
         "display_expand_coords",
         "display_expand_data_vars",
         "display_expand_data",
+        "display_expand_groups",
         "display_expand_indexes",
         "display_default_indexes",
         "enable_cftimeindex",
@@ -44,6 +45,7 @@ class T_Options(TypedDict):
         display_expand_coords: Literal["default", True, False]
         display_expand_data_vars: Literal["default", True, False]
         display_expand_data: Literal["default", True, False]
+        display_expand_groups: Literal["default", True, False]
         display_expand_indexes: Literal["default", True, False]
         display_default_indexes: Literal["default", True, False]
         enable_cftimeindex: bool
@@ -68,6 +70,7 @@ class T_Options(TypedDict):
     "display_expand_coords": "default",
     "display_expand_data_vars": "default",
     "display_expand_data": "default",
+    "display_expand_groups": "default",
     "display_expand_indexes": "default",
     "display_default_indexes": False,
     "enable_cftimeindex": True,

xarray/datatree_/datatree/__init__.py

-10
@@ -1,25 +1,15 @@
 # import public API
 from .datatree import DataTree
 from .extensions import register_datatree_accessor
-from .io import open_datatree
 from .mapping import TreeIsomorphismError, map_over_subtree
 from .treenode import InvalidTreeError, NotFoundInTreeError

-try:
-    # NOTE: the `_version.py` file must not be present in the git repository
-    # as it is generated by setuptools at install time
-    from ._version import __version__
-except ImportError:  # pragma: no cover
-    # Local copy or not installed with setuptools
-    __version__ = "999"

 __all__ = (
     "DataTree",
-    "open_datatree",
     "TreeIsomorphismError",
     "InvalidTreeError",
     "NotFoundInTreeError",
     "map_over_subtree",
     "register_datatree_accessor",
-    "__version__",
 )

xarray/datatree_/datatree/datatree.py

+2 -1
@@ -16,6 +16,7 @@
     List,
     Mapping,
     MutableMapping,
+    NoReturn,
     Optional,
     Set,
     Tuple,
@@ -160,7 +161,7 @@ def __setitem__(self, key, val) -> None:
             "use `.copy()` first to get a mutable version of the input dataset."
         )

-    def update(self, other) -> None:
+    def update(self, other) -> NoReturn:
         raise AttributeError(
             "Mutation of the DatasetView is not allowed, please use `.update` on the wrapping DataTree node, "
             "or use `dt.to_dataset()` if you want a mutable dataset. If calling this from within `map_over_subtree`,"

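The `update` annotation changes from `None` to `NoReturn` because the method never returns normally: it always raises. A minimal sketch of that typing pattern, using a hypothetical `ReadOnlyView` class loosely modeled on the read-only view in the hunk above:

```python
from typing import NoReturn


class ReadOnlyView:
    """Hypothetical read-only wrapper around a mapping."""

    def __init__(self, data: dict) -> None:
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def update(self, other) -> NoReturn:
        # Always raises, so NoReturn describes the behavior more precisely than None.
        raise AttributeError("mutation of the view is not allowed")
```

With `NoReturn`, a type checker can treat any code following a call to `update` as unreachable, which a `None` annotation would not convey.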
xarray/datatree_/datatree/formatting_html.py

-3
@@ -10,9 +10,6 @@
     datavar_section,
     dim_section,
 )
-from xarray.core.options import OPTIONS
-
-OPTIONS["display_expand_groups"] = "default"


 def summarize_children(children: Mapping[str, Any]) -> str:
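
The commit replaces the import-time assignment `OPTIONS["display_expand_groups"] = "default"` in `formatting_html.py` with a declared entry in `xarray/core/options.py`: a key in the typed options mapping plus a default value. A minimal sketch of that declare-then-default pattern follows. `ToyOptions` and `set_option` are hypothetical names, and the validation shown is an assumption rather than xarray's own.

```python
from typing import Literal, TypedDict


class ToyOptions(TypedDict):
    display_expand_data: Literal["default", True, False]
    display_expand_groups: Literal["default", True, False]


# Every option is declared with its default in one place, so no other module
# needs to mutate the mapping as a side effect of being imported.
OPTIONS: ToyOptions = {
    "display_expand_data": "default",
    "display_expand_groups": "default",
}


def set_option(name: str, value) -> None:
    if name not in OPTIONS:
        raise ValueError(f"{name!r} is not a known option")
    if not (value == "default" or isinstance(value, bool)):
        raise ValueError(f"invalid value for {name!r}: {value!r}")
    OPTIONS[name] = value  # type: ignore[literal-required]
```

Declaring the key centrally means an unknown option name is rejected up front instead of being silently added by whichever module happens to be imported first.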
