
Add asynchronous load method #10327


Draft: wants to merge 65 commits into main
Conversation

TomNicholas (Member) commented on May 16, 2025

Adds a .load_async() method to Variable, which works by plumbing an async get_duck_array all the way down until it finally reaches the async methods that zarr v3 exposes.

Needs a lot of refactoring before it could be merged, but it works.

API:

  • Variable.load_async
  • DataArray.load_async
  • Dataset.load_async
  • DataTree.load_async
  • load_dataset?
  • load_dataarray?

Comment on lines +277 to +278
async def async_getitem(key: indexing.ExplicitIndexer) -> np.typing.ArrayLike:
raise NotImplementedError("Backend does not support asynchronous loading")
Contributor:

Yes absolutely.
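As a sketch of that opt-in design (class names follow the diff above; the `AsyncCapableArray` subclass is hypothetical), the base class raises by default while async-capable backends such as zarr v3 override the hook:

```python
import asyncio

class BackendArray:
    # Default hook: backends that cannot load asynchronously raise,
    # so only async-capable backends opt in by overriding this.
    async def async_getitem(self, key):
        raise NotImplementedError("Backend does not support asynchronous loading")

class AsyncCapableArray(BackendArray):
    # Hypothetical backend wrapping in-memory data; a real backend
    # would await an actual store read here.
    def __init__(self, data):
        self._data = data

    async def async_getitem(self, key):
        await asyncio.sleep(0)  # stand-in for awaiting the store
        return self._data[key]

result = asyncio.run(AsyncCapableArray([10, 20, 30]).async_getitem(1))  # → 20
```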

@@ -267,13 +268,23 @@ def robust_getitem(array, key, catch=Exception, max_retries=6, initial_delay=500
time.sleep(1e-3 * next_delay)


class BackendArray(NdimSizeLenMixin, indexing.ExplicitlyIndexed):
class BackendArray(ABC, NdimSizeLenMixin, indexing.ExplicitlyIndexed):
Contributor:

This class is public API, and this is a backwards-incompatible change.

Comment on lines +574 to +578
# load everything else concurrently
coros = [
v.load_async() for k, v in self.variables.items() if k not in chunked_data
]
await asyncio.gather(*coros)
Contributor:

We should rate-limit all gather calls with a Semaphore, using something like this:

async def async_gather(*coros, concurrency: Optional[int] = None, return_exceptions: bool = False) -> list[Any]:
    """Execute a gather while limiting the number of concurrent tasks.

    Args:
        coros: coroutines
            list of coroutines to execute
        concurrency: int
            concurrency limit
            if None, defaults to config_obj.get('async.concurrency', 4)
            if <= 0, no concurrency limit

    """
    if concurrency is None:
        concurrency = int(config_obj.get("async.concurrency", 4))

    if concurrency > 0:
        # if concurrency > 0, we use a semaphore to limit the number of concurrent coroutines
        semaphore = asyncio.Semaphore(concurrency)

        async def sem_coro(coro):
            async with semaphore:
                return await coro

        results = await asyncio.gather(*(sem_coro(c) for c in coros), return_exceptions=return_exceptions)
    else:
        results = await asyncio.gather(*coros, return_exceptions=return_exceptions)

    return results
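A self-contained demonstration of the same semaphore pattern (stdlib only; `bounded_gather`, `task`, and the concurrency counter are illustrative stand-ins for the snippet above):

```python
import asyncio

async def bounded_gather(*coros, concurrency=4):
    # The semaphore caps how many coroutines run at once; the rest queue.
    semaphore = asyncio.Semaphore(concurrency)

    async def sem_coro(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(sem_coro(c) for c in coros))

running = 0
peak = 0

async def task(i):
    # Track peak concurrency so we can observe the limit being enforced.
    global running, peak
    running += 1
    peak = max(peak, running)
    await asyncio.sleep(0.01)
    running -= 1
    return i

results = asyncio.run(bounded_gather(*(task(i) for i in range(10)), concurrency=3))
```

Note that `asyncio.gather` preserves input order, so the semaphore bounds concurrency without reordering results.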

case "ds":
return ds

def assert_time_as_expected(
Contributor:

Let's instead use mocks to assert that the async methods were called; that is all Xarray is responsible for here.
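A sketch of that mock-based style using `unittest.mock.AsyncMock` (the `LazyArray` wrapper here is a hypothetical stand-in for xarray's lazy indexing classes):

```python
import asyncio
from unittest.mock import AsyncMock

class LazyArray:
    # Hypothetical lazy wrapper: loading delegates to the backend's
    # async_getitem, which is exactly what the test should assert on.
    def __init__(self, backend_array):
        self.backend_array = backend_array

    async def async_get_duck_array(self):
        return await self.backend_array.async_getitem(slice(None))

backend = AsyncMock()
backend.async_getitem.return_value = [1, 2, 3]

result = asyncio.run(LazyArray(backend).async_get_duck_array())
backend.async_getitem.assert_awaited_once()  # the async path was exercised
```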

async def _async_ensure_cached(self):
duck_array = await self.array.async_get_duck_array()
self.array = as_indexable(duck_array)

def get_duck_array(self):
self._ensure_cached()
Contributor:

_ensure_cached seems like pointless indirection; it is only used once. Let's consolidate.

Member Author:

Removed in 884ce13, but I still feel like it could be simplified further. Does it really need to have the side-effect of re-assigning to self.array?

return self

async def load_async(self, **kwargs) -> Self:
# TODO refactor this to pull out the common chunked_data codepath
Contributor:

Let's instead just have the sync methods issue a blocking call to the async versions.

Member Author:

I don't think that would solve the use case in xpublish though? You need to be able to asynchronously trigger loading for a bunch of separate dataset objects, which requires an async load api to be exposed, no?

Member Author (@TomNicholas, May 29, 2025):

Oh I understand what you mean now, you're not talking about the API, you're just talking about my comment about internal refactoring. You're proposing we do what zarr does internally, which makes sense.
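That internal pattern (roughly what zarr's sync helper does: a single background event loop thread services all blocking calls) can be sketched as:

```python
import asyncio
import threading

# One background event loop, shared by all blocking callers.
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()

def sync(coro):
    # Submit the coroutine to the background loop and block until it finishes.
    return asyncio.run_coroutine_threadsafe(coro, _loop).result()

async def load_async():
    # Stand-in for Variable.load_async doing awaited I/O.
    await asyncio.sleep(0.01)
    return "loaded"

def load():
    # The sync method is just a blocking call into the async one.
    return sync(load_async())
```

This keeps a single implementation of the loading logic while still exposing both sync and async entry points.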

def get_duck_array(self):
self._ensure_cached()
return self.array.get_duck_array()
duck_array = self.array.get_duck_array()
Contributor:

I changed this logic to call get_duck_array only once.

@@ -490,6 +490,23 @@ def test_sub_array(self) -> None:
assert isinstance(child.array, indexing.NumpyIndexingAdapter)
assert isinstance(wrapped.array, indexing.LazilyIndexedArray)

async def test_async_wrapper(self) -> None:
Contributor:

Added new tests.

[
("sel", {"x": 2}),
("sel", {"x": [2, 3]}),
("sel", {"x": slice(2, 4)}),
Contributor:

New test.

print("inside LazilyVectorizedIndexedArray.async_get_duck_array")
from xarray.backends.common import BackendArray

if isinstance(self.array, BackendArray):
Contributor:

I think this is a lot cleaner. In my previous refactor, I was trying hard to not depend on BackendArray but that's unavoidable now for async stuff AFAICT

def get_duck_array():
raise NotImplementedError

async def async_get_duck_array():
Contributor:

80/20 on this. Alternatively, we special case IndexingAdapter inside async_get_duck_array which seems worse to me.
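A sketch of that base-class default (names mirror the diff; `IndexingAdapter` wrapping a plain list is illustrative): the async variant falls back to the sync path, so in-memory adapters need no special casing at call sites.

```python
import asyncio

class ExplicitlyIndexed:
    def get_duck_array(self):
        raise NotImplementedError

    async def async_get_duck_array(self):
        # Default: in-memory data has nothing to await, so delegate
        # to the synchronous method.
        return self.get_duck_array()

class IndexingAdapter(ExplicitlyIndexed):
    # Wraps already in-memory data; inherits the async fallback.
    def __init__(self, data):
        self._data = data

    def get_duck_array(self):
        return self._data
```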

Comment on lines +160 to +162
if isinstance(data, IndexingAdapter):
# These wrap in-memory arrays, and async isn't needed
return data.get_duck_array()
Contributor:

Could be removed now that I added async_get_duck_array to the base class.

Suggested change
if isinstance(data, IndexingAdapter):
# These wrap in-memory arrays, and async isn't needed
return data.get_duck_array()

chunks=(5, 5),
dtype="f4",
dimension_names=["x", "y"],
attributes={"add_offset": 1, "scale_factor": 2},
Contributor:

Very important: now we test the decoding infra.

duck_array = await self.array.async_get_duck_array()
# ensure the array object is cached in-memory
self.array = as_indexable(duck_array)
return duck_array
Contributor:

Might need a deep copy here to match previous behavior

Member Author (@TomNicholas):

Notes to self:

  • Try to consolidate indexing tests with those in test_variable.py, potentially by defining a subclass of Variable that only implements async methods
  • Use create_test_data, write to a zarr (memory)store, and open lazily - this will help test decoding machinery.

Successfully merging this pull request may close these issues.

Add an asynchronous load method?