Add asynchronous load method #10327
Conversation
async def async_getitem(key: indexing.ExplicitIndexer) -> np.typing.ArrayLike:
    raise NotImplementedError("Backend does not support asynchronous loading")
Yes absolutely.
xarray/backends/common.py (outdated)
@@ -267,13 +268,23 @@ def robust_getitem(array, key, catch=Exception, max_retries=6, initial_delay=500
        time.sleep(1e-3 * next_delay)


-class BackendArray(NdimSizeLenMixin, indexing.ExplicitlyIndexed):
+class BackendArray(ABC, NdimSizeLenMixin, indexing.ExplicitlyIndexed):
This class is public API and this is a backwards incompatible change.
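For illustration only (not code from this PR): a backwards-compatible alternative would keep BackendArray non-abstract and give it a default async_getitem that raises, so existing third-party subclasses continue to work and only async-capable backends override the method. A minimal sketch, using a stand-in class rather than the real xarray.backends.common.BackendArray:

from numpy.typing import ArrayLike


class BackendArray:  # stand-in for xarray.backends.common.BackendArray
    async def async_getitem(self, key) -> ArrayLike:
        # Backends that never override this simply don't support asynchronous
        # loading, but they still subclass and instantiate without errors.
        raise NotImplementedError("This backend does not support asynchronous loading")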
# load everything else concurrently
coros = [
    v.load_async() for k, v in self.variables.items() if k not in chunked_data
]
await asyncio.gather(*coros)
We should rate-limit all gather calls with a Semaphore, using something like this:
import asyncio
from typing import Any, Optional


async def async_gather(
    *coros, concurrency: Optional[int] = None, return_exceptions: bool = False
) -> list[Any]:
    """Execute a gather while limiting the number of concurrent tasks.

    Args:
        coros: list of coroutines to execute
        concurrency: concurrency limit
            if None, defaults to config_obj.get('async.concurrency', 4)
            if <= 0, no concurrency limit
    """
    if concurrency is None:
        # config_obj is the reviewer's own configuration object, not defined here
        concurrency = int(config_obj.get("async.concurrency", 4))
    if concurrency > 0:
        # use a semaphore to limit the number of concurrently running coroutines
        semaphore = asyncio.Semaphore(concurrency)

        async def sem_coro(coro):
            async with semaphore:
                return await coro

        results = await asyncio.gather(
            *(sem_coro(c) for c in coros), return_exceptions=return_exceptions
        )
    else:
        results = await asyncio.gather(*coros, return_exceptions=return_exceptions)
    return results
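With a helper like this, the raw await asyncio.gather(*coros) above becomes roughly await async_gather(*coros), with the limit coming from configuration. Note that config_obj is the reviewer's own configuration object and is shown only as an example, not an existing xarray API.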
case "ds": | ||
return ds | ||
|
||
def assert_time_as_expected( |
Let's instead use mocks to assert that the async methods were called; making those calls is all that xarray is responsible for.
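A rough sketch of what that could look like (everything here is illustrative: the ds fixture, the pytest-asyncio marker, and ZarrArrayWrapper.async_getitem, which only exists with this PR, are assumptions):

from unittest.mock import patch

import pytest
import xarray as xr
from xarray.backends.zarr import ZarrArrayWrapper


@pytest.mark.asyncio
async def test_load_async_calls_backend_async_path(ds: xr.Dataset) -> None:
    # Keep the real behaviour by using the original coroutine as the side effect,
    # while the autospecced mock records every call made through the async path.
    real = ZarrArrayWrapper.async_getitem
    with patch.object(ZarrArrayWrapper, "async_getitem", autospec=True) as spy:
        spy.side_effect = real
        await ds.load_async()
    assert spy.call_count > 0  # xarray dispatched to the backend's async method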
xarray/core/indexing.py (outdated)
    async def _async_ensure_cached(self):
        duck_array = await self.array.async_get_duck_array()
        self.array = as_indexable(duck_array)

    def get_duck_array(self):
        self._ensure_cached()
_ensure_cached seems like pointless indirection; it is only used once. Let's consolidate.
Removed in 884ce13, but I still feel like it could be simplified further. Does it really need to have the side effect of re-assigning to self.array?
        return self

    async def load_async(self, **kwargs) -> Self:
        # TODO refactor this to pull out the common chunked_data codepath
let's instead just have the sync methods issue a blocking call to the async versions.
I don't think that would solve the use case in xpublish though? You need to be able to asynchronously trigger loading for a bunch of separate dataset objects, which requires exposing an async load API, no?
Oh, I understand what you mean now: you're not talking about the API, just about my comment on internal refactoring. You're proposing we do what zarr does internally, which makes sense.
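For reference, a minimal sketch of that zarr-style pattern, with all names illustrative rather than xarray's actual implementation: coroutines run on one long-lived background event loop, and the blocking method simply waits on the async one.

import asyncio
import threading
from collections.abc import Coroutine
from typing import Any, TypeVar

T = TypeVar("T")

# One event loop running forever in a daemon thread, shared by all sync callers.
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()


def sync(coro: Coroutine[Any, Any, T]) -> T:
    # Schedule the coroutine on the background loop and block until it finishes.
    return asyncio.run_coroutine_threadsafe(coro, _loop).result()


class Variable:  # illustrative stand-in, not xarray.Variable
    async def load_async(self):
        ...  # the real async loading logic would live here

    def load(self):
        # The synchronous method is just a blocking call into the async version.
        return sync(self.load_async())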
 def get_duck_array(self):
-    self._ensure_cached()
-    return self.array.get_duck_array()
+    duck_array = self.array.get_duck_array()
I changed this logic to call get_duck_array only once.
@@ -490,6 +490,23 @@ def test_sub_array(self) -> None:
        assert isinstance(child.array, indexing.NumpyIndexingAdapter)
        assert isinstance(wrapped.array, indexing.LazilyIndexedArray)

    async def test_async_wrapper(self) -> None:
added new tests.
xarray/tests/test_async.py (outdated)
[
    ("sel", {"x": 2}),
    ("sel", {"x": [2, 3]}),
    ("sel", {"x": slice(2, 4)}),
new test
print("inside LazilyVectorizedIndexedArray.async_get_duck_array") | ||
from xarray.backends.common import BackendArray | ||
|
||
if isinstance(self.array, BackendArray): |
I think this is a lot cleaner. In my previous refactor I was trying hard not to depend on BackendArray, but that's unavoidable now for the async machinery, AFAICT.
xarray/core/indexing.py (outdated)
def get_duck_array():
    raise NotImplementedError


async def async_get_duck_array():
80/20 on this. Alternatively, we special-case IndexingAdapter inside async_get_duck_array, which seems worse to me.
if isinstance(data, IndexingAdapter):
    # These wrap in-memory arrays, and async isn't needed
    return data.get_duck_array()
This could be removed now that I added async_get_duck_array to the base class:
-if isinstance(data, IndexingAdapter):
-    # These wrap in-memory arrays, and async isn't needed
-    return data.get_duck_array()
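For context, a rough sketch of the kind of base-class default that makes the special case above redundant; the names mirror the discussion but this is illustrative, not xarray's exact code:

class IndexingAdapter:
    """Wraps an array that is already in memory."""

    def __init__(self, array):
        self.array = array

    def get_duck_array(self):
        return self.array

    async def async_get_duck_array(self):
        # There is no real I/O to await for in-memory data, so defer to the sync path.
        return self.get_duck_array()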
    chunks=(5, 5),
    dtype="f4",
    dimension_names=["x", "y"],
    attributes={"add_offset": 1, "scale_factor": 2},
very important: now we test the decoding infra
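(Background for readers: add_offset and scale_factor are exactly the attributes that trigger xarray's CF decoding, so this fixture exercises the lazy decoding wrappers on the async path rather than just the bare backend array. Schematically, decoding computes raw * scale_factor + add_offset:)

import numpy as np

raw = np.array([0, 1, 2], dtype="f4")
decoded = raw * 2 + 1  # scale_factor=2, add_offset=1 -> [1., 3., 5.]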
    duck_array = await self.array.async_get_duck_array()
    # ensure the array object is cached in-memory
    self.array = as_indexable(duck_array)
    return duck_array
Might need a deep copy here to match previous behavior
Adds an .async_load() method to Variable, which works by plumbing async get_duck_array all the way down until it finally gets to the async methods zarr v3 exposes. Needs a lot of refactoring before it could be merged, but it works.

Notes to self:
- whats-new.rst
- api.rst
- API (usage sketch below):
  - Variable.load_async
  - DataArray.load_async
  - Dataset.load_async
  - DataTree.load_async
  - load_dataset?
  - load_dataarray?
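A sketch of how the proposed API might be used, e.g. from async server code such as xpublish; the store path is illustrative and the example assumes the zarr backend:

import asyncio

import xarray as xr


async def main() -> None:
    ds = xr.open_dataset("data/example.zarr", engine="zarr")  # lazy open
    await ds.load_async()  # load all variables concurrently via the async backend
    print(ds)


asyncio.run(main())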