Add `nbytes` to repr? #8690

max-sixty · 2024-01-31T20:13:59Z

Is your feature request related to a problem?

Would having the nbytes value in the Dataset repr be reasonable?

I frequently find myself logging this separately. For example:

<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
-    air      (time, lat, lon) float32 dask.array<chunksize=(2920, 25, 53), meta=np.ndarray>
+    air      (time, lat, lon) float32 15MB dask.array<chunksize=(2920, 25, 53), meta=np.ndarray> 
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Describe the solution you'd like

No response

Describe alternatives you've considered

Status quo :)

Additional context

No response

The text was updated successfully, but these errors were encountered:

TomNicholas · 2024-02-01T18:49:15Z

I agree - I'm constantly checking this attribute. It would be nice to also quickly see the total nbytes of the whole dataset, but I'm not sure where that would go in the repr.

max-sixty · 2024-02-01T19:37:57Z

It would be nice to also quickly see the total nbytes of the whole dataset,

Yes very much agree. Maaaybe after Dimensions: (lat: 25, time: 2920, lon: 53); 16MB??

Or if there's some consensus about adding to data vars, we could start with that. Though arguably it's more useful to have for the whole object...

norlandrhagen · 2024-02-01T23:52:42Z

This would be a really nice addition!

etienneschalk · 2024-02-04T16:30:21Z

Hello,

I can suggest the following:

Use "natural human units" (multiples of 1000 like "MB"), not binary units (1024) ;
Max unit is "YB" (Yotta, 10**24) (this is arbitrary; is there real-life use cases of required larger units ❓ ) ;
Put the nbytes representation in the "Data variables" section like suggested in the initial post ;
Do not print decimal part, as suggested in all posts above. Helps conciseness ;
For DataArray only representation and total size of a Dataset, put the rendered size into the header of the repr. There is room for short string content like a size. DataArrays representation already uses this place to put the name and dimensions. Datasets don't make use of this space yet and there is plenty of room.

If more customization capabilities are needed, eg choosing between "human" and "binary" prefixes, there exists a library under MIT license, humanize, that specializes into rendering various numbers, including file sizes. Some of its code could potentially be extracted and integrated into xarray.

Examples

<xarray.Dataset 10kB>
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo      (foo) int64  10kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int64 888B  0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*


<xarray.DataArray 'foo' (foo: 1200) 10kB>
array([   0,    1,    2, ..., 1197, 1198, 1199])
Coordinates:
  * foo      (foo) int64  10kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199

etienneschalk · 2024-02-04T22:19:40Z

Updated examples after update on the PR. The update is: the size in the header is outside of the <>-delimited header string, and prefixed with Size: :

<xarray.Dataset> Size: 3kB
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo      (foo) int16   2kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int16 222B  0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*

<xarray.DataArray 'foo' (foo: 1200)> Size: 2kB
array([   0,    1,    2, ..., 1197, 1198, 1199], dtype=int16)
Coordinates:
  * foo      (foo) int16   2kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199

As there are many representations scattered accross the code, both in tests and doctests (800+ occurences when looking for the string <xarray. in the codebase`, I would like to get your feedback on this representation before updating all of them.

Summary of the current repr:

For both Dataset and DataArray, the header is appended with a Size: {size} string
There should be at most 3 digits for size as long as we do not exceed "YB" (Yotta, 10**24)
For the inline representation of a Variable of a Dataset, the size occupies a fixed 5-char width (as long as we do not exceed "YB" (Yotta, 10**24) ): Examples:

[999kB]
[  8B ]

This padding may be irrelevant, as it only preserves layout in some specific case (same dimension tuple width and same dtype width). It could make sense to (1) move the size before the dimension tuple and dtype and keeping its fixed width, or (2) keep its current position but remove the padding. (1) allows quick size comparison of variables by eye thanks to fixed width but puts focus on the size, while (2) is more minimalistic but maybe less readable for size comparison


(1)

<xarray.Dataset> Size: 3kB
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo         2kB (foo) int16 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar       222B  (bar) int16 0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*

(2)

<xarray.Dataset> Size: 3kB
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo      (foo) int16 2kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int16 222B 0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*

dcherian · 2024-02-05T03:55:59Z

Just a quick comment: @max-sixty wrote pytest-accept for this kind of thing. It's pretty great :)

I prefer (2) because dimension names are usually what i look for first in the variable repr.

djhoese · 2024-02-19T20:49:11Z

I really wish I would have read your comment @dcherian after getting hit with this change in my own CI and spending an hour tracking down each new difference (and only in 4 small example documents). Sphinx apparently really doesn't want to render docs in a consistent order...or maybe doctest isn't running the tests in the same order.

As someone who has never checked the raw bytes size of an array, I was surprised when I tracked down this change and saw so many people eager to have it. I guess that just shows how many use cases there are for something so "simple" (the repr).

etienneschalk · 2024-02-19T21:55:28Z

Maybe we should add an option to opt-out? Or is it better to have a canonical repr?

djhoese · 2024-02-19T22:18:46Z

Probably just one repr since it seems like enough people want this. If more people come here to complain then maybe revisit the idea?

shoyer · 2024-06-03T16:27:04Z

I'd like to reopen this issue to discuss the repr. My main complaint is that the repr says something like "Size: 24B" but .size on xarray.DataArray refers to the number of elements, not the size in memory!

I also think this information is generally less relevant on individual coordinates/variables inside the repr, and would prefer not to include it there in favor of showing more data values (unless that data values are hidden, e.g., for a lazy array, in which case we could show memory usage instead).

So my suggestion is something like:

<xarray.Dataset nbytes=3k>
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo      (foo) int16 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int16 0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*

max-sixty · 2024-06-03T18:15:48Z

FWIW I find the size per variable quite helpful. I'm often working with datasets with very differently sized variables.

We should do the utilitarian thing, very possibly I'm less representative

My main complaint is that the repr says something like "Size: 24B" but .size on xarray.DataArray refers to the number of elements, not the size in memory!

We don't technically need the term Size there; it's self-explanatory if we have the "15GB", no objection to removing that. Though I also don't think we need nbytes for the same reason...

etienneschalk · 2024-06-07T04:44:06Z

Hello,

About display of individual variables' `nbytes`

Two different feedbacks here:

I also think this information is generally less relevant on individual coordinates/variables inside the repr, and would prefer not to include it there in favor of showing more data values (unless that data values are hidden, e.g., for a lazy array, in which case we could show memory usage instead).

and

FWIW I find the size per variable quite helpful. I'm often working with datasets with very differently sized variables.

I can add mine: I also find the size per variable helpful

Could a global option control this preference maybe? I see on the Configuration doc page that other options already exist to control the repr. A display_size_per_variable option or display_variable_nbytes could be added to suit everyone's needs.

❔ What should be the default for display_variable_nbytes if such a global option is introduced?

➡️ I would suggest True to keep the existing behaviour.

❔ Regarding a different display behaviour according to the laziness of a variable, do you know a reliable attribute in xarray that can answer the question: is the variable lazy?

I had in mind that the current way was to check for the value of the chunks attribute, for Dask arrays at least

https://docs.xarray.dev/en/stable/generated/xarray.Dataset.chunks.html

Mapping from dimension names to block lengths for this dataset’s data, or None if the underlying data is not a dask array.

About the removal of the term `Size`

We don't technically need the term Size there; it's self-explanatory if we have the "15GB", no objection to removing that. Though I also don't think we need nbytes for the same reason...

I agree with that. For instance, dtypes are not prefixed, and the "Size: " prefix is absent from the variables' representations to gain space. We can assume that the suffix solely (B, kB, MB...) is a well-known hint that the number represents a memory footprint.

Proposal of new `reprs`

What do you think of this? ⬇️

For non-lazy arrays:

with xr.set_options(display_variable_nbytes=True):
    print(xds) # Existing behaviour, without the Size: prefix

<xarray.Dataset> 27B 
Dimensions:  (x: 3)
Coordinates:
  * x        (x) float64 24B 10.0 20.0 30.0
Data variables:
    myvar    (x) uint8 3B 11 22 33
    
with xr.set_options(display_variable_nbytes=False):
    print(xds) # Only top-level nbytes is represented

<xarray.Dataset> 27B
Dimensions:  (x: 3)
Coordinates:
  * x        (x) float64 10.0 20.0 30.0
Data variables:
    myvar    (x) uint8 11 22 33

For lazy arrays:

with xr.set_options(display_variable_nbytes=True):
    print(xds) # Existing behaviour, without the Size: prefix

<xarray.Dataset> 27B
Dimensions:  (x: 3)
Coordinates:
  * x        (x) float64 24B 10.0 20.0 30.0
Data variables:
    myvar    (x) uint8 3B dask.array<chunksize=(1,), meta=np.ndarray>
    
with xr.set_options(display_variable_nbytes=False):
    print(xds) # The `myvar` nbytes is still shown because it is a dask array.

<xarray.Dataset> 27B
Dimensions:  (x: 3)
Coordinates:
  * x        (x) float64 10.0 20.0 30.0
Data variables:
    myvar    (x) uint8 3B dask.array<chunksize=(1,), meta=np.ndarray>

❔ Should the nbytes be integrated into the "class name tag" ? <xarray.Dataset 27B> for Dataset and <xarray.DataArray (x: 3) 3B> (size last) or <xarray.DataArray 3B (x: 3)> (size first) for DataArray

In the examples I kept the nbytes out of the class name enclosed in chevrons (<xarray.Dataset> 27B and not <xarray.Dataset 27B>), as in the existing behaviour.

Also, I note that the fact that the nbytes is shown for a variable that is a dask array even when display_variable_nbytes=False could be a bit surprising. It would be less surprising if there was no global option in the first place. However, some might find the printing of nbytes per variable helpful even for non-lazy arrays and this would penalize them 🤔

HCookie · 2024-07-26T00:10:37Z

This nbytes representation can be quite useful,
I'm curious is there a reason as to why it is not included in the html repr?
While dask arrays show their size, pure numpy arrays do not, and both have no top level complete size representation.

max-sixty added the enhancement label Jan 31, 2024

etienneschalk mentioned this issue Feb 4, 2024

Add a simple nbytes representation in DataArrays and Dataset repr #8702

Merged

3 tasks

etienneschalk mentioned this issue Feb 5, 2024

Issue with docstrings containing backslashes at the end of line max-sixty/pytest-accept#146

Closed

max-sixty closed this as completed in #8702 Feb 7, 2024

seisman mentioned this issue Feb 21, 2024

Fix the DataArray repr in the load_tile_map doctest for xarray v2024.02 GenericMappingTools/pygmt#3062

Merged

tomwhite mentioned this issue Mar 4, 2024

Change in Xarray repr is causing doctest failures sgkit-dev/sgkit#1204

Closed

keewis mentioned this issue Apr 17, 2024

RFC: add support for determining the size of arrays in bytes data-apis/array-api#789

Open

shoyer reopened this Jun 3, 2024

etienneschalk linked a pull request Jun 8, 2024 that will close this issue

Remove "Size: " prefix in Dataset and DataArray repr + add option to show nbytes on individual variables #9078

Open

3 tasks

maxrjones mentioned this issue Jul 31, 2024

WIP: Add array SVG image and table to _repr_html_ #9301

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `nbytes` to repr? #8690

Add `nbytes` to repr? #8690

max-sixty commented Jan 31, 2024

TomNicholas commented Feb 1, 2024

max-sixty commented Feb 1, 2024

norlandrhagen commented Feb 1, 2024

etienneschalk commented Feb 4, 2024

etienneschalk commented Feb 4, 2024 •

edited

Loading

dcherian commented Feb 5, 2024

djhoese commented Feb 19, 2024

etienneschalk commented Feb 19, 2024

djhoese commented Feb 19, 2024

shoyer commented Jun 3, 2024

max-sixty commented Jun 3, 2024

etienneschalk commented Jun 7, 2024

HCookie commented Jul 26, 2024

Add nbytes to repr? #8690

Add nbytes to repr? #8690

Comments

max-sixty commented Jan 31, 2024

Is your feature request related to a problem?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

TomNicholas commented Feb 1, 2024

max-sixty commented Feb 1, 2024

norlandrhagen commented Feb 1, 2024

etienneschalk commented Feb 4, 2024

Examples

etienneschalk commented Feb 4, 2024 • edited Loading

dcherian commented Feb 5, 2024

djhoese commented Feb 19, 2024

etienneschalk commented Feb 19, 2024

djhoese commented Feb 19, 2024

shoyer commented Jun 3, 2024

max-sixty commented Jun 3, 2024

etienneschalk commented Jun 7, 2024

About display of individual variables' nbytes

About the removal of the term Size

Proposal of new reprs

HCookie commented Jul 26, 2024

Add `nbytes` to repr? #8690

Add `nbytes` to repr? #8690

etienneschalk commented Feb 4, 2024 •

edited

Loading

About display of individual variables' `nbytes`

About the removal of the term `Size`

Proposal of new `reprs`