xarray 2022.10.0 much slower than 2022.6.0 #7181


Closed
dschwoerer opened this issue Oct 18, 2022 · 17 comments · Fixed by #7209

Comments

@dschwoerer
Contributor

What is your issue?

xbout's test suite finishes in less than an hour with 2022.6.0; with 2022.10.0 it gets aborted after 6 hours.

I haven't managed to debug what the issue is.

Git bisect will not work, as 2022.9.0 is broken due to #7111

@dschwoerer added the needs triage label Oct 18, 2022
@ZedThree
Contributor

Here's the output of pytest --profile* on a single test:

With xarray 2022.10 (re-read the .prof file with pstats module to print more functions):

         317962521 function calls (279203754 primitive calls) in 148.688 seconds

   Ordered by: cumulative time
   List reduced from 2791 to 42 due to restriction <0.015>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  148.690  148.690 runner.py:108(pytest_runtest_protocol)
    30/11    0.000    0.000  148.690   13.517 _hooks.py:244(__call__)
    30/11    0.000    0.000  148.690   13.517 _manager.py:77(_hookexec)
    30/11    0.000    0.000  148.690   13.517 _callers.py:9(_multicall)
        1    0.000    0.000  148.690  148.690 runner.py:116(runtestprotocol)
        3    0.000    0.000  148.690   49.563 runner.py:216(call_and_report)
        3    0.000    0.000  148.689   49.563 runner.py:244(call_runtest_hook)
        3    0.000    0.000  148.689   49.563 runner.py:315(from_call)
        3    0.000    0.000  148.689   49.563 runner.py:259(<lambda>)
        1    0.000    0.000  148.687  148.687 runner.py:157(pytest_runtest_call)
        1    0.000    0.000  148.687  148.687 python.py:1759(runtest)
        1    0.000    0.000  148.687  148.687 python.py:185(pytest_pyfunc_call)
        1    0.002    0.002  148.687  148.687 test_plot.py:18(test_region_disconnecteddoublenull)
31664481/29029   33.752    0.000  140.450    0.005 copy.py:128(deepcopy)
4099751/20042    9.002    0.000  140.195    0.007 copy.py:227(_deepcopy_dict)
       17    0.003    0.000  139.922    8.231 plotfuncs.py:79(plot2d_wrapper)
166283/166220    0.723    0.000  137.501    0.001 copy.py:259(_reconstruct)
650506/181818    4.208    0.000  136.386    0.001 dataarray.py:1179(_copy)
3330522/909322    7.121    0.000  133.892    0.000 variable.py:985(_copy)
   649848    0.463    0.000  124.588    0.000 dataarray.py:1200(__deepcopy__)
   326/86    0.001    0.000  120.356    1.399 boutdataarray.py:205(from_region)
   326/86    0.138    0.000  120.356    1.399 region.py:1713(_from_region)
       17    0.000    0.000  114.522    6.737 utils.py:65(_decompose_regions)
       17    0.000    0.000  114.520    6.736 utils.py:69(<dictcomp>)
      658    0.001    0.000  102.625    0.156 dataarray.py:1109(copy)
        6    0.000    0.000   47.508    7.918 boutdataarray.py:1062(contourf)
        6    0.000    0.000   46.851    7.808 boutdataarray.py:1068(pcolormesh)
        5    0.000    0.000   45.564    9.113 boutdataarray.py:1056(contour)
  3342076    5.816    0.000   44.786    0.000 variable.py:1018(_replace)
  3355730    4.448    0.000   36.445    0.000 variable.py:297(__init__)
     3642    0.115    0.000   34.953    0.010 variable.py:2897(copy)
  326/242    0.005    0.000   33.692    0.139 region.py:1258(_concat_inner_guards)
  326/242    0.003    0.000   33.212    0.137 region.py:1479(_concat_lower_guards)
  326/242    0.003    0.000   32.887    0.136 region.py:1596(_concat_upper_guards)
  326/242    0.004    0.000   31.750    0.131 region.py:1369(_concat_outer_guards)
  3356435   14.551    0.000   23.840    0.000 variable.py:192(as_compatible_data)
     1936    0.005    0.000   23.714    0.012 alignment.py:567(align)
     1936    0.010    0.000   23.673    0.012 alignment.py:549(align)
     1936    0.007    0.000   23.216    0.012 alignment.py:541(reindex_all)
     4769    0.006    0.000   23.208    0.005 alignment.py:542(<genexpr>)
     2833    0.013    0.000   23.203    0.008 alignment.py:521(_reindex_one)
     2833    0.007    0.000   18.699    0.007 alignment.py:498(_get_indexes_and_vars)

xarray 2022.06:

         11730856 function calls (11602239 primitive calls) in 7.233 seconds

   Ordered by: cumulative time
   List reduced from 2759 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.236    7.236 runner.py:108(pytest_runtest_protocol)
    30/11    0.000    0.000    7.235    0.658 _hooks.py:244(__call__)
    30/11    0.000    0.000    7.235    0.658 _manager.py:77(_hookexec)
    30/11    0.000    0.000    7.235    0.658 _callers.py:9(_multicall)
        1    0.000    0.000    7.235    7.235 runner.py:116(runtestprotocol)
        3    0.000    0.000    7.235    2.412 runner.py:216(call_and_report)
        3    0.000    0.000    7.235    2.412 runner.py:244(call_runtest_hook)
        3    0.000    0.000    7.235    2.412 runner.py:315(from_call)
        3    0.000    0.000    7.235    2.412 runner.py:259(<lambda>)
        1    0.000    0.000    7.233    7.233 runner.py:157(pytest_runtest_call)
        1    0.000    0.000    7.233    7.233 python.py:1759(runtest)
        1    0.000    0.000    7.233    7.233 python.py:185(pytest_pyfunc_call)
        1    0.001    0.001    7.232    7.232 test_plot.py:18(test_region_disconnecteddoublenull)
       17    0.001    0.000    4.878    0.287 plotfuncs.py:79(plot2d_wrapper)
   326/86    0.000    0.000    2.644    0.031 boutdataarray.py:205(from_region)
   326/86    0.009    0.000    2.644    0.031 region.py:1713(_from_region)
       17    0.000    0.000    2.604    0.153 utils.py:65(_decompose_regions)
       17    0.000    0.000    2.602    0.153 utils.py:69(<dictcomp>)
     1249    0.017    0.000    1.807    0.001 merge.py:696(merge_core)
        5    0.000    0.000    1.710    0.342 boutdataarray.py:1056(contour)

It looks like it's basically all coming from copying DataArrays -- has there been some change in expected behaviour that means we need to update our code?


* pytest -k test_region_disconnecteddoublenull[lower-disconnected-double-null-0-False-False-guards0] --profile --long --pstats-dir=prof2022.10
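
For reference, the saved profiles can be re-read with the standard-library pstats module to show more rows than pytest-profiling prints by default; a minimal sketch (the .prof path is illustrative):

import pstats

# Load one of the profiles written via --pstats-dir (filename is a guess)
stats = pstats.Stats("prof2022.10/test_region_disconnecteddoublenull.prof")

# Sort by cumulative time and keep the top 1.5% of entries, matching the
# "restriction <0.015>" shown in the listings above
stats.sort_stats("cumulative").print_stats(0.015)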

@headtr1ck
Collaborator

Yes, the behavior of deep copy was changed.
But I don't see why it should affect runtime, since we now only pass on the memo dict, which is required for recursive structures.

Maybe there is some bug in the logic, or maybe before it was not doing true deep copies?
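
For context, the memo dict is what lets Python's copy protocol reuse objects it has already copied instead of recursing into them forever; a minimal sketch of the pattern (not xarray's actual implementation):

import copy

class Node:
    def __init__(self, value, child=None):
        self.value = value
        self.child = child

    def __deepcopy__(self, memo):
        # Register the new object in memo *before* recursing, so that
        # copy.deepcopy reuses it when it encounters self again.
        new = Node(self.value)
        memo[id(self)] = new
        new.value = copy.deepcopy(self.value, memo)
        new.child = copy.deepcopy(self.child, memo)
        return new

a = Node(1)
a.child = Node(2, child=a)   # recursive structure
b = copy.deepcopy(a)
assert b.child.child is b    # cycle preserved without infinite recursion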

@TomNicholas added the topic-performance and regression labels and removed the needs triage label Oct 18, 2022
@ZedThree
Contributor

Our tests all pass, but there is something like a 20x slowdown, and it's basically entirely due to copies. It's plausible we're already making far too many copies ourselves, but this is obviously still concerning.

I tried adding the following asv benchmark, based on the combine benchmark and the deepcopy tests:

import numpy as np
import xarray as xr
from copy import copy, deepcopy


class Copy:
    def setup(self):
        """Create a single dataset with one 3D variable."""

        t_size, x_size, y_size = 50, 450, 400
        t = np.arange(t_size)
        data = np.random.randn(t_size, x_size, y_size)

        self.ds = xr.Dataset(
            {"A": xr.DataArray(data, coords={"T": t}, dims=("T", "X", "Y"))}
        )

    def time_copy(self) -> None:
        copy(self.ds)

    def time_deepcopy(self) -> None:
        deepcopy(self.ds)

    def time_copy_method(self) -> None:
        self.ds.copy(deep=False)

    def time_copy_method_deep(self) -> None:
        self.ds.copy(deep=True)

But I didn't see any regressions between v2022.06.0 and HEAD, so this simplistic case is clearly not enough.

There are a few differences between our test datasets and the one in the benchmark above (a rough sketch of a closer setup follows this list):

  • 4D vs 3D
  • smaller grid: (2, 5, 4, 3) vs (50, 450, 400)
  • more variables: ~20 vs 1
  • variable attributes vs none
  • multiple files read in via open_mfdataset vs entirely in-memory
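
Something closer to the failing case might look like this; the sizes, names and attributes are made up to roughly match the list above (the open_mfdataset part is not reproduced), so treat it as a sketch rather than a benchmark that was actually run:

import numpy as np
import xarray as xr
from copy import deepcopy


class CopySmall4D:
    def setup(self):
        """~20 small 4D variables, each carrying some attributes."""
        t, x, y, z = 2, 5, 4, 3
        self.ds = xr.Dataset(
            {
                f"var{i}": xr.DataArray(
                    np.random.randn(t, x, y, z),
                    dims=("t", "x", "y", "z"),
                    attrs={"cell_location": "CELL_CENTRE", "metadata": {"NXPE": 3}},
                )
                for i in range(20)
            }
        )

    def time_copy_method_deep(self) -> None:
        self.ds.copy(deep=True)

    def time_deepcopy(self) -> None:
        deepcopy(self.ds)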

@keewis
Collaborator

keewis commented Oct 18, 2022

The only difference you should see is in the deep copy of the attrs and encoding dictionaries. Could you check if the attributes in your tests are complex, big, deeply nested, and/or recursive?

@ZedThree
Contributor

Nope, either single values or flat dicts:

In [6]: ds.n
Out[6]: 
<xarray.DataArray 'n' (t: 6, x: 6, y: 12, z: 7)>
dask.array<concatenate, shape=(6, 6, 12, 7), dtype=float64, chunksize=(6, 2, 4, 7), chunktype=numpy.ndarray>
Coordinates:
    dx       (x, y) float64 dask.array<chunksize=(2, 4), meta=np.ndarray>
    dy       (x, y) float64 dask.array<chunksize=(2, 4), meta=np.ndarray>
    dz       float64 0.8976
  * t        (t) float64 0.0 10.0 20.0 30.0 40.0 50.0
  * x        (x) int64 0 1 2 3 4 5
  * y        (y) float64 1.0 3.0 5.0 7.0 9.0 11.0 13.0 15.0 17.0 19.0 21.0 23.0
  * z        (z) float64 0.0 0.8976 1.795 2.693 3.59 4.488 5.386
Attributes:
    direction_y:    Standard
    cell_location:  CELL_CENTRE
    direction_z:    Standard
    metadata:       {'BOUT_VERSION': 4.3, 'use_metric_3d': 0, 'NXPE': 3, 'NYP...
    options:        None
    geometry:

Our coordinates are dask.array though -- is that likely to affect things?

@headtr1ck
Collaborator

You call DataArray.copy 658 times, which seems fine for a test suite, but I don't understand why it calls DataArray.__deepcopy__ 650k times.
There must be something that deep-copies your arrays all over the place...

@dschwoerer
Contributor Author

@headtr1ck
Collaborator

I see, I think the problem is that Variable.to_index_variable is deep-copying.
Probably the fault of deep=True being the default for copy...
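
To illustrate the cost difference implied here using only the public API (not xarray's internals; the size is arbitrary):

import numpy as np
import xarray as xr

var = xr.Variable(("x",), np.arange(1_000_000))

shallow = var.copy(deep=False)  # new Variable wrapping the same data
deep = var.copy()               # deep=True is the default, so the data is duplicated

# The shallow copy shares the underlying array, the deep copy does not:
assert np.shares_memory(shallow.values, var.values)
assert not np.shares_memory(deep.values, var.values)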

@headtr1ck mentioned this issue Oct 24, 2022
@headtr1ck
Collaborator

headtr1ck commented Oct 24, 2022

@dschwoerer could you test whether you see any improvement with this PR: #7209?

The graph is not super helpful, since most calls to deepcopy come indirectly from deepcopy itself. I tried my best to find some points of improvement.

@ZedThree
Contributor

@headtr1ck It does seem to have helped a bit, but it looks like it's still around 15x slower:

         243192363 function calls (213211086 primitive calls) in 118.014 seconds

   Ordered by: cumulative time
   List reduced from 2819 to 42 due to restriction <0.015>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  118.017  118.017 runner.py:108(pytest_runtest_protocol)
    30/11    0.000    0.000  118.016   10.729 _hooks.py:244(__call__)
    30/11    0.000    0.000  118.016   10.729 _manager.py:77(_hookexec)
    30/11    0.000    0.000  118.016   10.729 _callers.py:9(_multicall)
        1    0.000    0.000  118.016  118.016 runner.py:116(runtestprotocol)
        3    0.000    0.000  118.016   39.339 runner.py:216(call_and_report)
        3    0.000    0.000  118.015   39.338 runner.py:244(call_runtest_hook)
        3    0.000    0.000  118.015   39.338 runner.py:315(from_call)
        3    0.000    0.000  118.015   39.338 runner.py:259(<lambda>)
        1    0.000    0.000  118.013  118.013 runner.py:157(pytest_runtest_call)
        1    0.000    0.000  118.013  118.013 python.py:1759(runtest)
        1    0.000    0.000  118.013  118.013 python.py:185(pytest_pyfunc_call)
        1    0.002    0.002  118.012  118.012 test_plot.py:18(test_region_disconnecteddoublenull)
2534922/113722    5.553    0.000  111.083    0.001 variable.py:985(_copy)
23917714/22342   26.196    0.000  109.780    0.005 copy.py:128(deepcopy)
491386/22698    3.290    0.000  109.750    0.005 dataarray.py:1179(_copy)
3096327/15601    6.894    0.000  109.589    0.007 copy.py:227(_deepcopy_dict)
       17    0.003    0.000  109.229    6.425 plotfuncs.py:79(plot2d_wrapper)
   326/86    0.001    0.000  108.467    1.261 boutdataarray.py:205(from_region)
   326/86    0.047    0.000  108.467    1.261 region.py:1713(_from_region)
   125079    0.576    0.000  107.460    0.001 copy.py:259(_reconstruct)
      658    0.001    0.000  105.804    0.161 dataarray.py:1109(copy)
       17    0.000    0.000  102.369    6.022 utils.py:65(_decompose_regions)
       17    0.000    0.000  102.367    6.022 utils.py:69(<dictcomp>)
   490728    0.375    0.000   97.421    0.000 dataarray.py:1200(__deepcopy__)
        6    0.000    0.000   36.768    6.128 boutdataarray.py:1062(contourf)
        6    0.000    0.000   36.750    6.125 boutdataarray.py:1068(pcolormesh)
        5    0.000    0.000   35.711    7.142 boutdataarray.py:1056(contour)
  2546476    4.140    0.000   34.363    0.000 variable.py:1018(_replace)
  326/242    0.004    0.000   28.475    0.118 region.py:1369(_concat_outer_guards)
  2560130    3.442    0.000   28.335    0.000 variable.py:297(__init__)
  326/242    0.003    0.000   28.184    0.116 region.py:1596(_concat_upper_guards)
  326/242    0.005    0.000   28.105    0.116 region.py:1258(_concat_inner_guards)
  326/242    0.003    0.000   27.939    0.115 region.py:1479(_concat_lower_guards)
  2560835   11.470    0.000   18.445    0.000 variable.py:192(as_compatible_data)
   495517    1.986    0.000    7.439    0.000 indexes.py:1275(copy_indexes)
  6174487    4.905    0.000    6.233    0.000 copy.py:243(_keep_alive)
     1936    0.005    0.000    5.797    0.003 alignment.py:567(align)
     1936    0.010    0.000    5.757    0.003 alignment.py:549(align)
   497497    4.988    0.000    5.341    0.000 dataarray.py:893(xindexes)
     1936    0.007    0.000    5.297    0.003 alignment.py:541(reindex_all)
     4769    0.004    0.000    5.290    0.001 alignment.py:542(<genexpr>)

I think previously dataarray.py:1109(copy) was the main culprit, and now it's dataarray.py:1200(__deepcopy__).

@dschwoerer
Contributor Author

Indeed, it does help: in 6 hours the CI completed 50% of the tests, compared to 17% before.

This is however still very much slower than before, when we finished in around half an hour -- so around 24x slower...

@ZedThree
Contributor

Looking at the call tree (sorry, big image!):

[call-graph image: combined]

there are a couple of recursive calls that account for nearly all the time:

  • the tight one from copy:128:deepcopy to copy:227:_deepcopy_dict
  • the larger one from variable:985:_copy to dataarray:1179:_copy

Does this imply that we actually have some sort of complex or recursive structure hiding in our datasets?

@headtr1ck
Collaborator

Every time I look at this graph I'm getting more confused, haha.
I still don't know why copy.deepcopy is called 24 million times.

@headtr1ck
Collaborator

headtr1ck commented Oct 25, 2022

You call DataArray.copy ~650 times, which sounds reasonable (though you could check whether you really need deep copies!).

But then somehow DataArray._copy is called ~500k times. So somewhere you must have some DataArrays in places where there should be none (in attrs, e.g.).

There are some copy calls from alignments, but I don't see how that adds up to this number...

@ZedThree
Contributor

But then somehow DataArray._copy is called ~500k times. So somewhere you must have some DataArrays in places where there should be none (in attrs, e.g.)

I think it has to be something like that. I'll see if I can dig into what these Datasets look like.

I'm sure we can get away with fewer deep copies here. I think what's happening is that we're caching generated Datasets between different test cases, but some tests need to modify them, so they get deep-copied. I guess previously that was fine, but maybe now it's cheaper to just make new instances from scratch.
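
For what it's worth, the caching pattern described above looks roughly like this (a sketch with made-up fixture names and a toy dataset, not xbout's actual conftest):

from copy import deepcopy

import numpy as np
import pytest
import xarray as xr


@pytest.fixture(scope="session")
def cached_dataset():
    # Built once per session and shared between tests
    return xr.Dataset(
        {"n": (("x", "y"), np.random.randn(5, 4))},
        attrs={"metadata": {"NXPE": 3}},
    )


@pytest.fixture
def dataset(cached_dataset):
    # Each test gets its own deep copy so it can mutate freely;
    # with 2022.10.0 this deep copy is what became expensive.
    return deepcopy(cached_dataset)


@pytest.fixture
def fresh_dataset():
    # Alternative: rebuild from scratch per test, if that is now cheaper
    return xr.Dataset({"n": (("x", "y"), np.random.randn(5, 4))})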

@ZedThree
Contributor

Yes, it turns out we do have Datasets with DataArrays in attributes. They're not nested very deep, and they're not cyclic, but it looks like there might be two or three levels of nesting.

Is this nesting now disallowed?

@headtr1ck
Collaborator

Is this nesting now disallowed?

No, it is allowed.
But a deep-copy will also deep-copy the attrs now, and therefore the DataArrays in there.
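
A small illustration of what that means in practice (names are made up; behaviour as of 2022.10.0):

import copy

import numpy as np
import xarray as xr

grid = xr.DataArray(np.arange(12).reshape(3, 4), dims=("x", "y"))
ds = xr.Dataset({"n": (("x", "y"), np.random.randn(3, 4))}, attrs={"grid": grid})

copied = copy.deepcopy(ds)

# With 2022.10.0 the attrs dict is deep-copied as well, so the nested
# DataArray is duplicated rather than shared with the original:
assert copied.attrs["grid"] is not ds.attrs["grid"]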
