
Clean-up indexing adapter classes #10355


Open · wants to merge 6 commits into main

Conversation

@benbovy (Member) commented May 26, 2025

  • User visible changes (including notable bug fixes) are documented in whats-new.rst

Mostly cherry-picked from #10296 (cleaner in a separate PR).

This PR also prevents converting a whole pd.RangeIndex or a CoordinateTransform object into a numpy array when showing the array repr. The object is sliced / indexed into a smaller one before doing the conversion. For large indexes this can make a big difference:

>>> import pandas as pd
>>> import xarray as xr
>>> from xarray.indexes import PandasIndex

>>> xidx = PandasIndex(pd.RangeIndex(2_000_000_000), "x")
>>> ds = xr.Dataset(coords=xr.Coordinates.from_xindex(xidx))
>>> ds   # before this PR, was loading 16GB into memory!
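Why the slicing helps can be sketched outside xarray (the snippet below is illustrative, not the PR's code): slicing a pd.RangeIndex yields another small RangeIndex, so only the handful of edge values shown in the repr are ever converted to numpy.

```python
import numpy as np
import pandas as pd

idx = pd.RangeIndex(2_000_000_000)

# Slicing keeps the lazy RangeIndex representation; no 16GB allocation.
head, tail = idx[:3], idx[-3:]

# Only six values are materialized for display.
edges = np.asarray(head.append(tail))
print(edges.tolist())  # [0, 1, 2, 1999999997, 1999999998, 1999999999]
```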

However, I'm now wondering how we should format the array (inline) repr for those two cases. It is useful to show some concrete values but at the same time it is misleading to show them like plain numpy arrays.

>>> print(ds.x.variable)
<xarray.IndexVariable 'x' (x: 2000000000)> Size: 16GB
array([         0,          1,          2, ..., 1999999997, 1999999998,
       1999999999])

>>> ds.x.variable._in_memory
True       # (not really, actually!)

It would certainly help to have some visual mark for lazy variables (perhaps next to their size), but that's another topic.

benbovy added 5 commits May 26, 2025 15:29
Grouping the logic into one method will make it easier to override the
behavior in subclasses (interval index) without affecting readability
much. It also yields more DRY code.
Prevent numpy.dtype conversions or casts from being implemented in
various places; gather the logic into one method.
Maybe slice PandasIndexingAdapter / CoordinateTransformAdapter before
formatting them as arrays.

For PandasIndexingAdapter, this prevents converting a large
pd.RangeIndex into an explicit index or array.

result = self.array[key]
def _get_array_subset(self) -> np.ndarray:
Contributor

So this is only to share code between PandasIndexingAdapter and PandasMultiIndexAdapter?

@benbovy (Member, Author) commented May 27, 2025

This is now used directly in PandasIndexingAdapter to avoid converting a large pd.RangeIndex into a numpy array just for showing the first and last few values.

In #10296 it is also used (overridden in PandasIntervalIndexingAdapter) to avoid converting a large pd.IntervalIndex into a 2-d array. EDIT: actually it isn't (yet).

@@ -651,6 +655,12 @@ def short_array_repr(array):
def short_data_repr(array):
"""Format "data" for DataArray and Variable."""
internal_data = getattr(array, "variable", array)._data

if isinstance(
internal_data, PandasIndexingAdapter | CoordinateTransformIndexingAdapter
Contributor

Can't we get away with adding _repr_inline_ to CoordinateTransformIndexingAdapter? I think it's preferable to avoid this kind of special casing.

@benbovy (Member, Author)

This function does not format the inline repr but the data of a DataArray or variable.

We could add a short_data_repr method to those adapter classes and check if internal_data has such a method here, although it is not much different from this special casing.
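That alternative could look roughly like this (hypothetical hook name and stand-in class, not an existing xarray API):

```python
# Hypothetical duck-typing alternative: adapters provide their own short
# data repr, and the formatting function dispatches on the hook's presence.
class LazyRangeAdapter:
    def __init__(self, n: int):
        self.n = n

    def _short_data_repr_(self) -> str:  # assumed hook, not a real method
        return f"[{self.n} values, lazily indexed]"

def short_data_repr(internal_data) -> str:
    if hasattr(internal_data, "_short_data_repr_"):
        return internal_data._short_data_repr_()
    return repr(internal_data)

print(short_data_repr(LazyRangeAdapter(2_000_000_000)))
# [2000000000 values, lazily indexed]
```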

Contributor

I guess a better request is to see whether we can just reuse format_array_flat which already does indexing and should just work for these classes.

@benbovy (Member, Author) commented May 30, 2025

Hmm again format_array_flat is for formatting the inline repr, whereas the special case introduced here is for formatting the data repr.

Without this special case, short_data_repr() will convert the indexing adapters into numpy arrays via PandasIndexingAdapter.get_duck_array() and CoordinateTransformIndexingAdapter.get_duck_array() over their full shape / size. For both the inline and the data reprs, we want to select only the first and last relevant items before doing this conversion.

first_n_items() and last_n_items() in xarray.core.formatting do similar things to PandasIndexingAdapter._get_array_subset() and CoordinateTransformIndexingAdapter._get_array_subset(). We could perhaps reuse the former two instead, although for the data repr (at least for CoordinateTransform, and possibly later for pd.IntervalIndex) we don't want a flattened result, so this would involve more refactoring. It also wouldn't remove the special case here.
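As a rough sketch of the idea behind first_n_items (simplified; the real function in xarray.core.formatting differs in details): shrink each dimension before flattening and converting, so the full array is never materialized.

```python
import math
import numpy as np

def first_n_items(array, n_desired: int) -> np.ndarray:
    # Index down each dimension first; only then flatten, convert to
    # numpy, and truncate to the desired number of items.
    if array.size > n_desired:
        i = math.ceil(n_desired ** (1 / max(array.ndim, 1)))
        array = array[tuple(slice(min(i, s)) for s in array.shape)]
    return np.ravel(np.asarray(array))[:n_desired]

print(first_n_items(np.arange(100).reshape(10, 10), 3).tolist())  # [0, 1, 10]
```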

Alternatively we could tweak Variable._in_memory such that it returns False when ._data is a PandasIndexingAdapter (only when it wraps a pd.RangeIndex) or a CoordinateTransformIndexingAdapter, which will turn their data repr from, e.g.,

>>> xidx = PandasIndex(pd.RangeIndex(2_000_000_000), "x")
>>> ds = xr.Dataset(coords=xr.Coordinates.from_xindex(xidx))
>>> print(ds.x.variable)
<xarray.IndexVariable 'x' (x: 2000000000)> Size: 16GB
array([         0,          1,          2, ..., 1999999997, 1999999998,
       1999999999])

into

>>> print(ds.x.variable)
<xarray.IndexVariable 'x' (x: 2000000000)> Size: 16GB
[2000000000 values with dtype=dtype('int64')]

On one hand I like seeing the actual first and last values of a (lazy) range index or coordinate transform. On the other hand I find it a bit confusing that it is shown like a plain numpy array.
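The _in_memory tweak floated above could be prototyped along these lines (stand-in classes and assumed behavior, not the actual xarray implementation):

```python
import numpy as np
import pandas as pd

# Minimal stand-ins (assumptions) for the adapter classes discussed above.
class PandasIndexingAdapter:
    def __init__(self, array: pd.Index):
        self.array = array

class CoordinateTransformIndexingAdapter:
    pass

def in_memory(data) -> bool:
    # A wrapped pd.RangeIndex and coordinate transforms report as lazy,
    # so the data repr falls back to "[N values with dtype=...]".
    if isinstance(data, PandasIndexingAdapter):
        return not isinstance(data.array, pd.RangeIndex)
    if isinstance(data, CoordinateTransformIndexingAdapter):
        return False
    return isinstance(data, np.ndarray)

print(in_memory(PandasIndexingAdapter(pd.RangeIndex(10))))    # False
print(in_memory(PandasIndexingAdapter(pd.Index([1, 2, 3]))))  # True
```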
