-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
[ArrayManager] Array version of fillna logic #41104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -964,3 +974,107 @@ def _rolling_window(a: np.ndarray, window: int): | |||
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window) | |||
strides = a.strides + (a.strides[-1],) | |||
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides) | |||
|
|||
|
|||
def _can_hold_element(values, element: Any) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a version of this that makes the existing can_hold_element more accurate instead of implementing another function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not fully sure what I was thinking in April / if that was already the case, but so right now the existing can_hold_element
actually fully covers what I needed here.
It only doesn't do extract_array
compared to Block._can_hold_element
, so kept that here for now. But probably that could be moved to can_hold_element
as well (will check later).
pandas/core/missing.py
Outdated
# We support filling a DatetimeTZ with a `value` whose timezone | ||
# is different by coercing to object. | ||
# TODO: don't special-case td64 | ||
values = values.astype(object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK if you want to keep this copy/pasted version, but i think using coerce_to_target_dtype would be more Technically Correct here
pandas/core/missing.py
Outdated
values, value, limit=limit, inplace=True, downcast=downcast | ||
) | ||
|
||
values = values if inplace else values.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think EA.fillna is never inplace, so this may be making an unnecessary copy (or not being inplace even when the user specifies it) (applicable to status quo too)
Some exposition on ideas I've been looking at for this problem (for Block methods in general, not specific to fillna)
|
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
can you address @jbrockmendel comments |
Started looking at this again, for now only simplified the |
I now shared the implementation for the ExtensionBlock (subclasses), which was essentially just the datetime64 special casing for allowing object fallback (but good to have this in one place). But for sharing the generic Block implementation something like the above would indeed be needed. This splitting into multiple arrays is very much block specific, though, and personally I would prefer to keep that logic on the Blocks itself. |
return values.fillna(value, limit=limit) | ||
|
||
|
||
def fillna_array(values, value, limit=None, inplace: bool = False, downcast=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be used for Block.fillna too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't directly see an easy way to do that. The Block.fillna additionally has the "split" logic, which I personally don't want to put in here. And it's not directly straightforward to split a part of Block.fillna.
I could for example let this function return some sentinel like NotImplemented
or None to signal to Block.fillna that the filling hasn't been done yet, and it should split the block and call fillna again. In which case most of the splitting logic stays in the internals. I could try something like this if you want, but it's also a bit of a hack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Block.fillna additionally has the "split" logic, which I personally don't want to put in here
100% agree
I've been looking at simplifying the downcasting in a way that would make the splitting much easier to separate the splitting from non-splitting (not just for fillna) (xref #44241, #40988). It would be an API change so wouldn't be implement in time to simplify this (and your putmask PR), but would eventually allow for some cleanup.
ignored), except for datetime64 in which case fallback to object dtype | ||
is currently allowed. | ||
""" | ||
if not _can_hold_element(values, value) and values.dtype.kind == "M": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this would change behavior for e.g. Period or Interval?
(which would make us more consistent across dtypes, which i would like to see, just needs to be intentional)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we don't have tests for it in any case, since the tests are passing.
Now, checking for period, it seems to raise for both of array and Series:
In [5]: arr = pd.period_range("2012", periods=3).array
In [6]: arr[0] = None
In [8]: arr.fillna("?")
...
TypeError: value should be a 'Period', 'NaT', or array of those. Got 'str' instead.
In [9]: pd.Series(arr).fillna("?")
...
TypeError: value should be a 'Period', 'NaT', or array of those. Got 'str' instead.
while indeed for DatetimeArray, we coerce to object for Series:
In [10]: arr = pd.date_range("2012", periods=3).array
In [11]: arr[0] = None
In [12]: arr.fillna("?")
..
TypeError: value should be a 'Timestamp', 'NaT', or array of those. Got 'str' instead.
In [13]: pd.Series(arr).fillna("?")
Out[13]:
0 ?
1 2012-01-02 00:00:00
2 2012-01-03 00:00:00
dtype: object
return can_hold_element(values, element) | ||
|
||
|
||
def fillna_ea_array(values, value, limit=None, inplace: bool = False, downcast=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values: ExtensionArray
?
@jorisvandenbossche if you can rebase |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
xref #39146
I was looking into getting rid of the Block fallback (
apply_with_block
) forfillna
in ArrayManager. For that, I started with putting the bits and pieces from the several BlockManager methods (the differentfillna
implementations, maybe_downcast) together in a function that would do the same for plain arrays as input, to see what this would give.The draft result of that can be seen in this PR. This seems to pass all tests for the ArrayManager, so should have the same behaviour as BlockManager.fillna. I only hoped that we could then also use this in the BlockManager to have a shared implementation (as we did for eg
Block.astype
and others). However, that doesn't seem so easy in this case .. So not yet sure how to best proceed here.