
Add asynchronous load method #10327


Draft: wants to merge 65 commits into main
Conversation

TomNicholas (Member) commented on May 16, 2025

Adds a .load_async() method to Variable, which works by plumbing an async get_duck_array all the way down until it finally reaches the async methods that zarr v3 exposes.

Needs a lot of refactoring before it could be merged, but it works.

API:

  • Variable.load_async
  • DataArray.load_async
  • Dataset.load_async
  • DataTree.load_async
  • load_dataset?
  • load_dataarray?

Comment on lines +277 to +278
async def async_getitem(key: indexing.ExplicitIndexer) -> np.typing.ArrayLike:
raise NotImplementedError("Backend does not support asynchronous loading")
Contributor:

Yes absolutely.
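As a sketch of that opt-in design (class names follow the diff above; the `AsyncCapableArray` subclass is hypothetical), the base class raises by default while async-capable backends such as zarr v3 override the hook:

```python
import asyncio

class BackendArray:
    # Default hook: backends that cannot load asynchronously raise,
    # so only async-capable backends opt in by overriding this.
    async def async_getitem(self, key):
        raise NotImplementedError("Backend does not support asynchronous loading")

class AsyncCapableArray(BackendArray):
    # Hypothetical backend wrapping in-memory data; a real backend
    # would await an actual store read here.
    def __init__(self, data):
        self._data = data

    async def async_getitem(self, key):
        await asyncio.sleep(0)  # stand-in for awaiting the store
        return self._data[key]

result = asyncio.run(AsyncCapableArray([10, 20, 30]).async_getitem(1))  # → 20
```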

@@ -267,13 +268,23 @@ def robust_getitem(array, key, catch=Exception, max_retries=6, initial_delay=500
time.sleep(1e-3 * next_delay)


class BackendArray(NdimSizeLenMixin, indexing.ExplicitlyIndexed):
class BackendArray(ABC, NdimSizeLenMixin, indexing.ExplicitlyIndexed):
Contributor:

This class is public API, and this is a backwards-incompatible change.

Comment on lines +574 to +578
# load everything else concurrently
coros = [
v.load_async() for k, v in self.variables.items() if k not in chunked_data
]
await asyncio.gather(*coros)
Contributor:

We should rate-limit all gather calls with a Semaphore, using something like this:

async def async_gather(*coros, concurrency: Optional[int] = None, return_exceptions: bool = False) -> list[Any]:
    """Execute a gather while limiting the number of concurrent tasks.

    Args:
        coros: coroutines
            list of coroutines to execute
        concurrency: int
            concurrency limit
            if None, defaults to config_obj.get('async.concurrency', 4)
            if <= 0, no concurrency limit

    """
    if concurrency is None:
        concurrency = int(config_obj.get("async.concurrency", 4))

    if concurrency > 0:
        # if concurrency > 0, we use a semaphore to limit the number of concurrent coroutines
        semaphore = asyncio.Semaphore(concurrency)

        async def sem_coro(coro):
            async with semaphore:
                return await coro

        results = await asyncio.gather(*(sem_coro(c) for c in coros), return_exceptions=return_exceptions)
    else:
        results = await asyncio.gather(*coros, return_exceptions=return_exceptions)

    return results
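A self-contained demonstration of the same semaphore pattern (stdlib only; `bounded_gather`, `task`, and the concurrency counter are illustrative stand-ins for the snippet above):

```python
import asyncio

async def bounded_gather(*coros, concurrency=4):
    # The semaphore caps how many coroutines run at once; the rest queue.
    semaphore = asyncio.Semaphore(concurrency)

    async def sem_coro(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(sem_coro(c) for c in coros))

running = 0
peak = 0

async def task(i):
    # Track peak concurrency so we can observe the limit being enforced.
    global running, peak
    running += 1
    peak = max(peak, running)
    await asyncio.sleep(0.01)
    running -= 1
    return i

results = asyncio.run(bounded_gather(*(task(i) for i in range(10)), concurrency=3))
```

Note that `asyncio.gather` preserves input order, so the semaphore bounds concurrency without reordering results.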

case "ds":
return ds

def assert_time_as_expected(
Contributor:

Let's instead use mocks to assert that the async methods were called; that is all Xarray is responsible for here.
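A sketch of that mock-based style using `unittest.mock.AsyncMock` (the `LazyArray` wrapper here is a hypothetical stand-in for xarray's lazy indexing classes):

```python
import asyncio
from unittest.mock import AsyncMock

class LazyArray:
    # Hypothetical lazy wrapper: loading delegates to the backend's
    # async_getitem, which is exactly what the test should assert on.
    def __init__(self, backend_array):
        self.backend_array = backend_array

    async def async_get_duck_array(self):
        return await self.backend_array.async_getitem(slice(None))

backend = AsyncMock()
backend.async_getitem.return_value = [1, 2, 3]

result = asyncio.run(LazyArray(backend).async_get_duck_array())
backend.async_getitem.assert_awaited_once()  # the async path was exercised
```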

async def _async_ensure_cached(self):
duck_array = await self.array.async_get_duck_array()
self.array = as_indexable(duck_array)

def get_duck_array(self):
self._ensure_cached()
Contributor:

_ensure_cached seems like pointless indirection; it is only used once. Let's consolidate.

Member Author:

Removed in 884ce13, but I still feel like it could be simplified further. Does it really need to have the side-effect of re-assigning to self.array?

return self

async def load_async(self, **kwargs) -> Self:
# TODO refactor this to pull out the common chunked_data codepath
Contributor:

Let's instead just have the sync methods issue a blocking call to the async versions.

Member Author:

I don't think that would solve the use case in xpublish though? You need to be able to asynchronously trigger loading for a bunch of separate dataset objects, which requires an async load api to be exposed, no?

Member Author (@TomNicholas, May 29, 2025):

Oh I understand what you mean now, you're not talking about the API, you're just talking about my comment about internal refactoring. You're proposing we do what zarr does internally, which makes sense.
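That internal pattern (roughly what zarr's sync helper does: a single background event loop thread services all blocking calls) can be sketched as:

```python
import asyncio
import threading

# One background event loop, shared by all blocking callers.
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()

def sync(coro):
    # Submit the coroutine to the background loop and block until it finishes.
    return asyncio.run_coroutine_threadsafe(coro, _loop).result()

async def load_async():
    # Stand-in for Variable.load_async doing awaited I/O.
    await asyncio.sleep(0.01)
    return "loaded"

def load():
    # The sync method is just a blocking call into the async one.
    return sync(load_async())
```

This keeps a single implementation of the loading logic while still exposing both sync and async entry points.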

def get_duck_array(self):
self._ensure_cached()
return self.array.get_duck_array()
duck_array = self.array.get_duck_array()
Contributor:

I changed this logic to call get_duck_array only once.

@@ -490,6 +490,23 @@ def test_sub_array(self) -> None:
assert isinstance(child.array, indexing.NumpyIndexingAdapter)
assert isinstance(wrapped.array, indexing.LazilyIndexedArray)

async def test_async_wrapper(self) -> None:
Contributor:

Added new tests.

[
("sel", {"x": 2}),
("sel", {"x": [2, 3]}),
("sel", {"x": slice(2, 4)}),
Contributor:

New test.

print("inside LazilyVectorizedIndexedArray.async_get_duck_array")
from xarray.backends.common import BackendArray

if isinstance(self.array, BackendArray):
Contributor:

I think this is a lot cleaner. In my previous refactor, I was trying hard to not depend on BackendArray but that's unavoidable now for async stuff AFAICT

def get_duck_array():
raise NotImplementedError

async def async_get_duck_array():
Contributor:

80/20 on this. Alternatively, we special case IndexingAdapter inside async_get_duck_array which seems worse to me.
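A sketch of that base-class default (names mirror the diff; `IndexingAdapter` wrapping a plain list is illustrative): the async variant falls back to the sync path, so in-memory adapters need no special casing at call sites.

```python
import asyncio

class ExplicitlyIndexed:
    def get_duck_array(self):
        raise NotImplementedError

    async def async_get_duck_array(self):
        # Default: in-memory data has nothing to await, so delegate
        # to the synchronous method.
        return self.get_duck_array()

class IndexingAdapter(ExplicitlyIndexed):
    # Wraps already in-memory data; inherits the async fallback.
    def __init__(self, data):
        self._data = data

    def get_duck_array(self):
        return self._data
```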

Comment on lines +160 to +162
if isinstance(data, IndexingAdapter):
# These wrap in-memory arrays, and async isn't needed
return data.get_duck_array()
Contributor:

Could be removed now that I added async_get_duck_array to the base class.

Suggested change
if isinstance(data, IndexingAdapter):
# These wrap in-memory arrays, and async isn't needed
return data.get_duck_array()

chunks=(5, 5),
dtype="f4",
dimension_names=["x", "y"],
attributes={"add_offset": 1, "scale_factor": 2},
Contributor:

Very important: now we test the decoding infra.

duck_array = await self.array.async_get_duck_array()
# ensure the array object is cached in-memory
self.array = as_indexable(duck_array)
return duck_array
Contributor:

Might need a deep copy here to match previous behavior

Member Author (@TomNicholas):

Notes to self:

  • Try to consolidate indexing tests with those in test_variable.py, potentially by defining a subclass of Variable that only implements async methods
  • Use create_test_data, write to a zarr (memory)store, and open lazily - this will help test decoding machinery.

Successfully merging this pull request may close these issues.

Add an asynchronous load method?