-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: ExtensionArray.fillna #19909
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: ExtensionArray.fillna #19909
Changes from 1 commit
74614cb
67a19ba
280ac94
6705712
69a0f9b
4d6846b
f3b81dc
70efbb8
8fc3851
39096e5
9e44f88
e902f18
17126e6
1106ef2
744c381
9a3fa55
2bc43bc
f22fd7b
c26d261
b342efe
a609f48
a35f93c
3f78dec
1160e15
c9c7a48
05fced6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,6 @@ | ||
"""An interface for extending pandas with custom arrays.""" | ||
import itertools | ||
|
||
import numpy as np | ||
|
||
from pandas.errors import AbstractMethodError | ||
|
@@ -216,6 +218,88 @@ def isna(self): | |
""" | ||
raise AbstractMethodError(self) | ||
|
||
def tolist(self): | ||
# type: () -> list | ||
"""Convert the array to a list of scalars.""" | ||
return list(self) | ||
|
||
def fillna(self, value=None, method=None, limit=None): | ||
""" Fill NA/NaN values using the specified method. | ||
|
||
Parameters | ||
---------- | ||
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None | ||
Method to use for filling holes in reindexed Series | ||
pad / ffill: propagate last valid observation forward to next valid | ||
backfill / bfill: use NEXT valid observation to fill gap | ||
value : scalar, array-like | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. value before method ? (order of signature) |
||
If a scalar value is passed it is used to fill all missing values. | ||
Alternatively, an array-like 'value' can be given. It's expected | ||
that the array-like have the same length as 'self'. | ||
limit : int, default None | ||
(Not implemented yet for ExtensionArray!) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this can now maybe be passed to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I had to change things up slightly to use integers and the datetime fillna methods if you want to take a look. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Categorical.fillna can probably do soemthing similar, haven't looked. |
||
If method is specified, this is the maximum number of consecutive | ||
NaN values to forward/backward fill. In other words, if there is | ||
a gap with more than this number of consecutive NaNs, it will only | ||
be partially filled. If method is not specified, this is the | ||
maximum number of entries along the entire axis where NaNs will be | ||
filled. | ||
|
||
Returns | ||
------- | ||
filled : ExtensionArray with NA/NaN filled | ||
""" | ||
from pandas.api.types import is_scalar | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A I'll see what I can do to simplify this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How would that be different with |
||
from pandas.util._validators import validate_fillna_kwargs | ||
|
||
value, method = validate_fillna_kwargs(value, method) | ||
|
||
if not is_scalar(value): | ||
if len(value) != len(self): | ||
raise ValueError("Length of 'value' does not match. Got ({}) " | ||
" expected {}".format(len(value), len(self))) | ||
else: | ||
value = itertools.cycle([value]) | ||
|
||
if limit is not None: | ||
msg = ("Specifying 'limit' for 'fillna' has not been implemented " | ||
"yet for {} typed data".format(self.dtype)) | ||
raise NotImplementedError(msg) | ||
|
||
mask = self.isna() | ||
|
||
if mask.any(): | ||
# ffill / bfill | ||
if method is not None: | ||
if method == 'backfill': | ||
data = reversed(self) | ||
mask = reversed(mask) | ||
last_valid = self[len(self) - 1] | ||
else: | ||
last_valid = self[0] | ||
data = self | ||
|
||
new_values = [] | ||
|
||
for is_na, val in zip(mask, data): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. huh? what is all this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As you can clearly see from the diff, it is an generic implementation of @TomAugspurger do you want this for cyberpandas? Otherwise, we could also start for now with only supporting filling with a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the other hand, we could maybe do this differently (that might be more efficient, without looping through all data): create an array of indices (arange), mask them with the missing values, then fill those indices with the existing fill-method machinery, and then use that to index into the original data and to fill the missing values. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't really care about ffill / bfill, but Categorical supports it so best to be consistent. Here's one alternative based astyping to object and setting. + from pandas.core.missing import pad_1d, backfill_1d
value, method = validate_fillna_kwargs(value, method)
+ mask = self.isna()
+
if not is_scalar(value):
if len(value) != len(self):
raise ValueError("Length of 'value' does not match. Got ({}) "
" expected {}".format(len(value), len(self)))
- else:
- value = itertools.cycle([value])
+ value = value[mask]
if limit is not None:
msg = ("Specifying 'limit' for 'fillna' has not been implemented "
"yet for {} typed data".format(self.dtype))
raise NotImplementedError(msg)
- mask = self.isna()
-
if mask.any():
# ffill / bfill
- if method is not None:
- if method == 'backfill':
- data = reversed(self)
- mask = reversed(mask)
- last_valid = self[len(self) - 1]
- else:
- last_valid = self[0]
- data = self
-
- new_values = []
-
- for is_na, val in zip(mask, data):
- if is_na:
- new_values.append(last_valid)
- else:
- new_values.append(val)
- last_valid = val
-
- if method in {'bfill', 'backfill'}:
- new_values = list(reversed(new_values))
+ if method == 'pad':
+ values = self.astype(object)
+ new_values = pad_1d(values, mask=mask)
+ elif method == 'backfill':
+ values = self.astype(object)
+ new_values = backfill_1d(values, mask=mask)
else:
# fill with value
- new_values = [
- val if is_na else original
- for is_na, original, val in zip(mask, self, value)
- ]
+ new_values = self.copy()
+ new_values[mask] = value
else:
new_values = self
return type(self)(new_values) Timings (n_rows, value/method, seconds): astype
Loopy
So the loopy version (current PR) is ~2-4x faster. Lots of variability there, esp depending on the speed of EA.setitem, but it seems like a good default. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What code did you use for the timings here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. from timeit import default_timer as tic
import pandas as pd
import numpy as np
from pandas.tests.extension.decimal.array import DecimalArray
from decimal import Decimal
Ns = [1000, 1_000_000]
for N in Ns:
arr = np.random.uniform(size=(N,))
arr[arr > .8] = np.nan
arr = DecimalArray([Decimal(x) for x in arr])
start = tic()
arr.fillna(value=Decimal('1.0'))
stop = tic()
print(f'{N}, value, {stop - start:0.2f}')
start = tic()
arr.fillna(method='ffill')
stop = tic()
print(f'{N}, ffill, {stop - start:0.2f}')
start = tic()
arr.fillna(method='bfill')
stop = tic()
print(f'{N}, bfill, {stop - start:0.2f}') I can re-run it / profile things a bit to see why the array[object] version was so much slower. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With a quick dirty check:
And for this 'new' method I did :
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK, didn't see your post before posting mine. So doing the same with the decimal array
I suppose this is because DecimalArray uses an object array under the hood anyhow? So something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we shouldn't care about performance for the default implementation. It'll be too dependent on which optimizations the subclass can implement. I'll go with a simple implementation like your pad_new. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And also, if you care about performance, you probably don't have an object array under the hood but a typed array, and then this will be faster. |
||
if is_na: | ||
new_values.append(last_valid) | ||
else: | ||
new_values.append(val) | ||
last_valid = val | ||
|
||
if method in {'bfill', 'backfill'}: | ||
new_values = list(reversed(new_values)) | ||
else: | ||
# fill with value | ||
new_values = [ | ||
val if is_na else original | ||
for is_na, original, val in zip(mask, self, value) | ||
] | ||
else: | ||
new_values = self | ||
return type(self)(new_values) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wait on #19906 before merging this, so I can update this line. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this can now just be |
||
|
||
# ------------------------------------------------------------------------ | ||
# Indexing methods | ||
# ------------------------------------------------------------------------ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1963,6 +1963,23 @@ def concat_same_type(self, to_concat, placement=None): | |
return self.make_block_same_class(values, ndim=self.ndim, | ||
placement=placement) | ||
|
||
def fillna(self, value, limit=None, inplace=False, downcast=None, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This was moved from Categorical.fillna with changes 1.) Removed check for limit, since that's done in |
||
mgr=None): | ||
values = self.values if inplace else self.values.copy() | ||
values = values.fillna(value=value, limit=limit) | ||
return [self.make_block_same_class(values=values, | ||
placement=self.mgr_locs, | ||
ndim=self.ndim)] | ||
|
||
def interpolate(self, method='pad', axis=0, inplace=False, limit=None, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This was just a move from |
||
fill_value=None, **kwargs): | ||
|
||
values = self.values if inplace else self.values.copy() | ||
return self.make_block_same_class( | ||
values=values.fillna(value=fill_value, method=method, | ||
limit=limit), | ||
placement=self.mgr_locs) | ||
|
||
|
||
class NumericBlock(Block): | ||
__slots__ = () | ||
|
@@ -2522,27 +2539,6 @@ def _try_coerce_result(self, result): | |
|
||
return result | ||
|
||
def fillna(self, value, limit=None, inplace=False, downcast=None, | ||
mgr=None): | ||
# we may need to upcast our fill to match our dtype | ||
if limit is not None: | ||
raise NotImplementedError("specifying a limit for 'fillna' has " | ||
"not been implemented yet") | ||
|
||
values = self.values if inplace else self.values.copy() | ||
values = self._try_coerce_result(values.fillna(value=value, | ||
limit=limit)) | ||
return [self.make_block(values=values)] | ||
|
||
def interpolate(self, method='pad', axis=0, inplace=False, limit=None, | ||
fill_value=None, **kwargs): | ||
|
||
values = self.values if inplace else self.values.copy() | ||
return self.make_block_same_class( | ||
values=values.fillna(fill_value=fill_value, method=method, | ||
limit=limit), | ||
placement=self.mgr_locs) | ||
|
||
def shift(self, periods, axis=0, mgr=None): | ||
return self.make_block_same_class(values=self.values.shift(periods), | ||
placement=self.mgr_locs) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -69,7 +69,10 @@ def test_getitem_scalar(self): | |
|
||
|
||
class TestMissing(base.BaseMissingTests): | ||
pass | ||
|
||
@pytest.mark.skip(reason="Backwards compatability") | ||
def test_fillna_limit_raises(self): | ||
"""Has a different error message.""" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can just up the error message for |
||
|
||
|
||
class TestMethods(base.BaseMethodsTests): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed for fillna?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indirectly, via the default
fillna
.Series.__iter__
callsSeries.tolist
, which callsSeries._values.tolist
. We could modify that to check the type and just calllist(self._values)
for EAs. I don't have a strong preference vs. adding a default.tolist
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or directly call
values.__iter__
? As iterating through values to make a list and then iterate again through that list sounds like a bit of overhead.For
.tolist()
itself, it is of course consistent with Series and numpy arrays, but personally I am not sure what its value is compared tolist(values)
. I don't think you typically can implement a more efficient conversion to lists than whatlist(values)
will do by default? (i.e. iterating through the values)