Commit 1f9eb68

Micro optimize dataset.isel for speed on large datasets
This commit targets datasets with many "scalar" variables (that is, variables without any dimensions). This can happen when you have many small pieces of metadata that describe various facts about an experimental condition. For example, we have about 80 of these in our datasets (and I want to increase this number).

Our datasets are quite large (on the order of 1 TB uncompressed), so we often have one dimension whose length is in the tens of thousands. However, it has become quite slow to index into the dataset.

We therefore often "carefully slice out the metadata we need" prior to doing anything with our dataset, but that isn't quite possible when you want to orchestrate things with a parent application.

These optimizations are likely "minor", but considering the results of the benchmark, I think they are quite worthwhile:

* main (as of pydata#9001) - 2.5k its/s
* With pydata#9002 - 4.2k its/s
* With this Pull Request (on top of pydata#9002) - 6.1k its/s

Thanks for considering.
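The benchmark behind the numbers above is not included in this commit. As a rough, standalone illustration of why scalar variables benefit, the sketch below times the two ways of picking out relevant indexers using only the standard library. The names `select_old`, `select_new`, `scalar_dims`, and the `"time"` indexer are hypothetical stand-ins, not xarray code, and the timings it prints are not comparable to the its/s figures quoted above.

```python
import timeit

# Hypothetical stand-ins: 80 scalar variables (empty dims) and one indexer.
scalar_dims = [() for _ in range(80)]
indexers = {"time": slice(0, 10)}


def select_old(dims_list, indexers):
    # Pre-change approach: scan every indexer for every variable.
    return [{k: v for k, v in indexers.items() if k in dims} for dims in dims_list]


def select_new(dims_list, indexers):
    # Post-change approach: build the key set once and skip empty dims early.
    all_keys = set(indexers.keys())
    out = []
    for dims in dims_list:
        if dims and (keys := all_keys.intersection(dims)):
            out.append({k: indexers[k] for k in keys})
        else:
            out.append({})
    return out


# Both strategies must select the same indexers.
assert select_old(scalar_dims, indexers) == select_new(scalar_dims, indexers)

if __name__ == "__main__":
    for fn in (select_old, select_new):
        seconds = timeit.timeit(lambda: fn(scalar_dims, indexers), number=10_000)
        print(f"{fn.__name__}: {seconds:.4f} s for 10,000 calls")
```

The gap between the two should widen as the number of indexers grows, since the old approach iterates over every indexer even for variables that have no dimensions at all.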
1 parent deb2082 commit 1f9eb68

1 file changed: 14 additions, 4 deletions

xarray/core/dataset.py

@@ -2983,20 +2983,30 @@ def isel(
         coord_names = self._coord_names.copy()
 
         indexes, index_variables = isel_indexes(self.xindexes, indexers)
+        all_keys = set(indexers.keys())
 
         for name, var in self._variables.items():
             # preserve variable order
             if name in index_variables:
                 var = index_variables[name]
-            else:
-                var_indexers = {k: v for k, v in indexers.items() if k in var.dims}
-                if var_indexers:
+                dims.update(zip(var.dims, var.shape))
+            # Fastpath, skip all of this for variables with no dimensions
+            # Keep the result cached for future dictionary update
+            elif var_dims := var.dims:
+                # Large datasets with alot of metadata may have many scalars
+                # without any relevant dimensions for slicing.
+                # Pick those out quickly and avoid paying the cost below
+                # of resolving the var_indexers variables
+                if var_indexer_keys := all_keys.intersection(var_dims):
+                    var_indexers = {k: indexers[k] for k in var_indexer_keys}
                     var = var.isel(var_indexers)
                     if drop and var.ndim == 0 and name in coord_names:
                         coord_names.remove(name)
                         continue
+                    # Update our reference to `var_dims` after the call to isel
+                    var_dims = var.dims
+                dims.update(zip(var_dims, var.shape))
             variables[name] = var
-            dims.update(zip(var.dims, var.shape))
 
         return self._construct_direct(
             variables=variables,
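To make the restructured loop easier to follow in isolation, here is a minimal sketch that runs the old and new control flow against a toy variable type and checks that they agree. `Var`, `isel_old`, and `isel_new` are hypothetical stand-ins written for this illustration, not xarray classes or functions. Index variables and the `drop` handling are left out, and the toy `isel` only models integer indexers that remove a dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Hypothetical stand-in for a variable: dimension names plus shape."""

    dims: tuple
    shape: tuple

    def isel(self, indexers):
        # Integer indexers drop the dimension; slices are not modelled here.
        kept = [(d, s) for d, s in zip(self.dims, self.shape) if d not in indexers]
        return Var(tuple(d for d, _ in kept), tuple(s for _, s in kept))


def isel_old(variables, indexers):
    out, dims = {}, {}
    for name, var in variables.items():
        # Every variable pays for a scan over all indexers.
        var_indexers = {k: v for k, v in indexers.items() if k in var.dims}
        if var_indexers:
            var = var.isel(var_indexers)
        out[name] = var
        dims.update(zip(var.dims, var.shape))
    return out, dims


def isel_new(variables, indexers):
    out, dims = {}, {}
    all_keys = set(indexers.keys())
    for name, var in variables.items():
        # Scalars have empty dims, so the walrus test is falsy and they
        # skip straight to the assignment below.
        if var_dims := var.dims:
            if keys := all_keys.intersection(var_dims):
                var = var.isel({k: indexers[k] for k in keys})
                var_dims = var.dims  # refresh after indexing
            dims.update(zip(var_dims, var.shape))
        out[name] = var
    return out, dims


variables = {
    "signal": Var(("time", "channel"), (10_000, 4)),
    "gain": Var(("channel",), (4,)),
    "operator": Var((), ()),  # scalar metadata
}
old = isel_old(variables, {"time": 0})
new = isel_new(variables, {"time": 0})
assert old == new
```

Skipping `dims.update` for scalars is safe because `zip` over empty `dims` and `shape` yields nothing, so the old code's update was already a no-op for them.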
