
REF: dont set ndarray.data in libreduction #34997


Closed
wants to merge 94 commits

Conversation

jbrockmendel
Member

cc @WillAyd I'm still seeing 88 test failures locally and could use a fresh pair of eyes on this. Any thoughts?

@jbrockmendel
Member Author

Tentatively, it looks like the issue is that this patches _index_data without updating _data.

@WillAyd
Member

WillAyd commented Jul 8, 2020

If it helps I've noticed that changing this line:

chunk = slider.dummy

To chunk = slider.frame[starts[i]:ends[i]].copy() takes the 88 failures down to about 20. Not a real solution, as we don't want the copy, but I think this new design isn't fully compatible with some of the dummy work that the sliders do, so we might need to patch that together.

Seems heading in the right direction though
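The difference the .copy() makes can be sketched with plain NumPy; this is just an illustration of view-vs-copy semantics, not the actual libreduction code:

```python
import numpy as np

# Illustrative sketch, not pandas internals: the sliders rely on chunk
# slices being views over the parent buffer, whereas an explicit .copy()
# gives each chunk its own independent memory.
arr = np.arange(6, dtype=np.float64)

view = arr[0:3]          # shares the parent's buffer
chunk = arr[0:3].copy()  # independent buffer

print(np.shares_memory(view, arr))   # True
print(np.shares_memory(chunk, arr))  # False
```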

@jbrockmendel
Member Author

closing to clear the queue

@WillAyd WillAyd reopened this Jul 22, 2020
Member

@WillAyd WillAyd left a comment


@jbrockmendel made a few updates to get failures down to a small handful. This may or may not be the right way of doing things, but hopefully it continues the conversation.

@WillAyd
Member

WillAyd commented Jul 22, 2020

Of the four remaining failures, one is a result of groupby.tshift, which we may deprecate in #34452.

@jbrockmendel
Member Author

@WillAyd is there cause for optimism here? A solution might have a bearing on one of the test failures in #35417.

@WillAyd
Member

WillAyd commented Jul 28, 2020

Well the optimism is that we are down to a handful of failures :-)

I think, though, that this is uncovering deep-seated issues in other parts of the code base. For instance, this is causing the test that #33439 added to fail, which right now I've traced back to a potential bug somewhere in our hashing code, but I think it's only happenstance that that test works in the first place.

Not sure if the other 4 failures are related to hashing as well, but I will hopefully figure that out soon.

@WillAyd
Member

WillAyd commented Jul 29, 2020

I've noticed that, as an apply slides along the various groups, this line of code looks problematic with the current changes:

values = self._get_index_values()

The problem is that the call to _get_index_values hits a cached property that doesn't get updated with these changes, and it only ever references the index of the first group. So if you try to split a DataFrame with 3 elements into [0] and [1, 2] groupings, a hashtable only ever gets built for the [0] group; the next iteration throws a KeyError when trying to figure out the location of [1, 2], and a whole slew of exception handling routes that off to space from there.

@jbrockmendel any idea how the caching should be handled here? In the Python space the above call traces back to here:

target_values = self._get_engine_target()
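The failure mode can be sketched in plain Python (the names here are hypothetical illustrations, not the actual pandas internals): a cached property that builds a lookup table once keeps serving the first group's values even after the underlying data is swapped out.

```python
from functools import cached_property

# Hypothetical sketch of the stale-engine problem described above: the
# table is built once, from whatever index values the getter sees first.
class Engine:
    def __init__(self, get_values):
        self._get_values = get_values

    @cached_property
    def table(self):
        # built exactly once, from the first group's index values
        return {v: i for i, v in enumerate(self._get_values())}

    def get_loc(self, key):
        return self.table[key]  # raises KeyError for later groups' labels

values = [0]                     # first group's index
engine = Engine(lambda: values)
print(engine.get_loc(0))         # 0

values = [1, 2]                  # slider moves to the next group
try:
    engine.get_loc(1)
except KeyError:
    print("KeyError: cached table never saw the new values")
```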

@WillAyd
Member

WillAyd commented Jul 29, 2020

Here's the MRE I'm using to tackle the above comment.

df = pd.DataFrame({"A": ["S", "W", "W"], "B": [1.0, 1.0, 2.0]}) 
res = df.groupby("A").agg({"B": lambda x: x.get(x.index[-1])})

On master this yields

     B
A     
S  1.0
W  2.0

But with this PR yields

     B
A     
S  1.0
W  NaN

This is because of the aforementioned KeyError when trying to locate elements by index in the second grouping.
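For reference, the expected per-group values can be reproduced by iterating over the groups directly; this is plain pandas, bypassing the slider machinery entirely:

```python
import pandas as pd

df = pd.DataFrame({"A": ["S", "W", "W"], "B": [1.0, 1.0, 2.0]})

# Per group, the lambda returns the value at the group's last index label;
# computing that without the libreduction path gives the master result.
for name, group in df.groupby("A"):
    print(name, group["B"].get(group.index[-1]))
# S 1.0
# W 2.0
```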

@jbrockmendel
Member Author

The problem is that the call to _get_index_values hits a cached property

I'm not sure I follow. In index.pyx this is a call to self.vgetter(), which I think is supposed to return _index_data, which gets patched within libreduction. Am I misunderstanding?

cache_readonlys on the Index object definitely seem like a footgun. In the general case, the only way I can think of to avoid this is to create a fresh Index object, but avoiding that overhead is basically the whole point of the shenanigans in libreduction.

@WillAyd
Member

WillAyd commented Jul 29, 2020

Where is the call to _index_data? That would seem related but I wasn't seeing that; could be the missing link

@jbrockmendel
Member Author

Where is the call to _index_data? That would seem related but I wasn't seeing that; could be the missing link

I expected to see it in Index._engine, but apparently not

@WillAyd
Member

WillAyd commented Aug 4, 2020

/azp run

@azure-pipelines
Contributor

Azure Pipelines successfully started running 1 pipeline(s).

@jbrockmendel jbrockmendel mentioned this pull request Aug 11, 2020
@jbrockmendel
Member Author

Just pushed after applying a patch from #35417. Locally that gets me to 3 test failures (which, now that I reread the thread, may not actually be an improvement).

@jbrockmendel jbrockmendel mentioned this pull request Sep 12, 2020
@jbrockmendel
Member Author

Updated to implement NDFrame._can_use_libreduction to de-duplicate a bunch of similar (and inconsistently strict) checks done elsewhere. (Note an inline comment I'll make about a place where this PR retains a different, inconsistent check.)

This fixes the xfailed test tests.groupby.test_apply.test_apply_with_timezones_aware.

# see test_groupby.test_basic
result = self._aggregate_named(func, *args, **kwargs)
if isinstance(
self._selected_obj.index, (DatetimeIndex, TimedeltaIndex, PeriodIndex)
Member Author


this is the one place where I'm not using self._selected_obj._can_use_libreduction, as doing so would require 2.5 more kludges to get the tests passing:

1. Below on 283, after ret = create_series_with_explicit_dtype, we would need to do

            # Inference in the Series constructor may not infer
            #  custom EA dtypes, so try here
            ret = maybe_cast_result(ret._values, obj, numeric_only=True)
            ret = Series(ret, index=index, name=obj.name)

1.5. In create_series_with_explicit_dtype we would need to change dtype_if_empty=object to dtype_if_empty=obj.dtype (which I'm not 100% sure about).

2. The kludge here would have to be amended from and name in output.index to and (name in output.index or 0 in output.index)

@jbrockmendel jbrockmendel marked this pull request as ready for review September 17, 2020 21:10
@jbrockmendel
Member Author

Closing.

This doesn't do quite what I thought it did, will need a new approach, xref #36459.

@WillAyd
Member

WillAyd commented Sep 19, 2020

This thing is super thorny...nice effort in any case here. We will figure it out one of these days

@jbrockmendel jbrockmendel deleted the ref-libreduction-5 branch March 2, 2021 17:22
Labels
Internals Related to non-user accessible pandas implementation