ENH: ExtensionArray.fillna #19909
Conversation
-------
filled : ExtensionArray with NA/NaN filled
"""
from pandas.api.types import is_scalar
A putmask-type method would be extremely useful here.
I'll see what I can do to simplify this.
How would that be different with array[mask] = values?
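For reference, a minimal numpy-only sketch (illustrative, not from the PR) of how np.putmask differs from plain mask assignment: putmask recycles a short values argument by element position, while arr[mask] = values needs a scalar or one value per masked slot.

import numpy as np

arr = np.array([1.0, np.nan, 3.0, np.nan, np.nan])
mask = np.isnan(arr)

# np.putmask sets arr.flat[n] = values[n % len(values)] wherever mask is True,
# so a short `values` is recycled by position.
np.putmask(arr, mask, [-1.0, -2.0])
print(arr)   # [ 1. -2.  3. -2. -1.]

# Plain mask assignment needs a scalar or exactly one value per True slot.
arr2 = np.array([1.0, np.nan, 3.0, np.nan, np.nan])
arr2[np.isnan(arr2)] = [-1.0, -2.0, -3.0]
print(arr2)  # [ 1. -1.  3. -2. -3.]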
pandas/core/arrays/base.py
]
else:
new_values = self
return type(self)(new_values)
Wait on #19906 before merging this, so I can update this line.
@@ -1963,6 +1963,23 @@ def concat_same_type(self, to_concat, placement=None):
        return self.make_block_same_class(values, ndim=self.ndim,
                                          placement=placement)
def fillna(self, value, limit=None, inplace=False, downcast=None, |
This was moved from Categorical.fillna, with changes:

1.) Removed the check for limit, since that's done in values.fillna.
2.) Removed _try_coerce_result. For CategoricalBlock this was unnecessary, since Categorical.fillna always returns a Categorical.
3.) Used make_block_same_class. This limits ExtensionArray.fillna to not change the type of the array / block, which I think is a good thing.
placement=self.mgr_locs,
ndim=self.ndim)]
def interpolate(self, method='pad', axis=0, inplace=False, limit=None, |
This was just a move from CategoricalBlock to ExtensionBlock, no changes.
@pytest.mark.skip(reason="Backwards compatability")
def test_fillna_limit_raises(self):
    """Has a different error message."""
I can probably just update the error message for Categorical.fillna.
pandas/core/arrays/base.py
def tolist(self):
    # type: () -> list
    """Convert the array to a list of scalars."""
    return list(self)
Is this needed for fillna?
Indirectly, via the default fillna. Series.__iter__ calls Series.tolist, which calls Series._values.tolist. We could modify that to check the type and just call list(self._values) for EAs. I don't have a strong preference vs. adding a default .tolist.
Or directly call values.__iter__? Iterating through the values to make a list and then iterating again through that list sounds like a bit of overhead.

For .tolist() itself, it is of course consistent with Series and numpy arrays, but personally I am not sure what its value is compared to list(values). I don't think you can typically implement a more efficient conversion to a list than what list(values) will do by default (i.e. iterating through the values)?
Codecov Report
@@ Coverage Diff @@
## master #19909 +/- ##
==========================================
- Coverage 91.78% 91.75% -0.03%
==========================================
Files 152 152
Lines 49196 49214 +18
==========================================
+ Hits 45153 45155 +2
- Misses 4043 4059 +16
Continue to review full report at Codecov.
pandas/core/arrays/base.py
new_values = []

for is_na, val in zip(mask, data):
huh? what is all this.
As you can clearly see from the diff, it is a generic implementation of fillna, and the part above is for the method kwarg (forward or backward fill).
If you don't like the implementation (I also don't really like it; it feels very inefficient), it would be more helpful to give a more specific comment than that.
@TomAugspurger do you want this for cyberpandas? Otherwise, we could also start for now with only supporting filling with a value? (That's what we support for now in geopandas.)
On the other hand, we could maybe do this differently (that might be more efficient, without looping through all data): create an array of indices (arange), mask them with the missing values, then fill those indices with the existing fill-method machinery, and then use that to index into the original data and to fill the missing values.
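A rough, self-contained sketch of that idea (a hypothetical helper, not code from the PR), assuming the internals used elsewhere in this thread (pad_1d from pandas.core.missing and its mask keyword), and assuming the first element is valid, since pad_1d leaves leading NaNs in place:

import numpy as np
from pandas.core.missing import pad_1d  # same import the diff below uses

def pad_via_positions(values, mask):
    # Positions of each element; mark the missing slots as NaN so that
    # pad_1d fills them from the preceding valid position.
    idx = np.arange(len(values), dtype=np.float64)
    idx[mask] = np.nan
    idx = pad_1d(idx, mask=mask).astype(np.intp)
    # Each missing slot now points at the last valid position before it,
    # so a single take() produces the forward-filled result.
    return values.take(idx)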
I don't really care about ffill / bfill, but Categorical supports it, so it's best to be consistent.
Here's one alternative based on astyping to object and setting:
+ from pandas.core.missing import pad_1d, backfill_1d
value, method = validate_fillna_kwargs(value, method)
+ mask = self.isna()
+
if not is_scalar(value):
if len(value) != len(self):
raise ValueError("Length of 'value' does not match. Got ({}) "
" expected {}".format(len(value), len(self)))
- else:
- value = itertools.cycle([value])
+ value = value[mask]
if limit is not None:
msg = ("Specifying 'limit' for 'fillna' has not been implemented "
"yet for {} typed data".format(self.dtype))
raise NotImplementedError(msg)
- mask = self.isna()
-
if mask.any():
# ffill / bfill
- if method is not None:
- if method == 'backfill':
- data = reversed(self)
- mask = reversed(mask)
- last_valid = self[len(self) - 1]
- else:
- last_valid = self[0]
- data = self
-
- new_values = []
-
- for is_na, val in zip(mask, data):
- if is_na:
- new_values.append(last_valid)
- else:
- new_values.append(val)
- last_valid = val
-
- if method in {'bfill', 'backfill'}:
- new_values = list(reversed(new_values))
+ if method == 'pad':
+ values = self.astype(object)
+ new_values = pad_1d(values, mask=mask)
+ elif method == 'backfill':
+ values = self.astype(object)
+ new_values = backfill_1d(values, mask=mask)
else:
# fill with value
- new_values = [
- val if is_na else original
- for is_na, original, val in zip(mask, self, value)
- ]
+ new_values = self.copy()
+ new_values[mask] = value
else:
new_values = self
return type(self)(new_values)
Timings (n_rows, value/method, seconds):
astype
1000, value, 0.01
1000, ffill, 0.00
1000, bfill, 0.00
1000000, value, 4.61
1000000, ffill, 2.34
1000000, bfill, 2.36
Loopy
1000, value, 0.00
1000, ffill, 0.00
1000, bfill, 0.00
1000000, value, 1.28
1000000, ffill, 1.37
1000000, bfill, 1.32
So the loopy version (current PR) is ~2-4x faster. Lots of variability there, especially depending on the speed of EA.setitem, but it seems like a good default.
What code did you use for the timings here?
from timeit import default_timer as tic
import pandas as pd
import numpy as np
from pandas.tests.extension.decimal.array import DecimalArray
from decimal import Decimal
Ns = [1000, 1_000_000]

for N in Ns:
    arr = np.random.uniform(size=(N,))
    arr[arr > .8] = np.nan
    arr = DecimalArray([Decimal(x) for x in arr])

    start = tic()
    arr.fillna(value=Decimal('1.0'))
    stop = tic()
    print(f'{N}, value, {stop - start:0.2f}')

    start = tic()
    arr.fillna(method='ffill')
    stop = tic()
    print(f'{N}, ffill, {stop - start:0.2f}')

    start = tic()
    arr.fillna(method='bfill')
    stop = tic()
    print(f'{N}, bfill, {stop - start:0.2f}')
I can re-run it / profile things a bit to see why the array[object] version was so much slower.
With a quick and dirty check:
In [1]: from geopandas.array import from_shapely
...: import shapely.geometry
...: a = np.array([shapely.geometry.Point(i, i) for i in np.random.randn(10000)], dtype=object)
...: a[np.random.randint(0, 10000, 10)] = None
...: ga = from_shapely(a)
...: s = pd.Series(ga)
In [2]: s.head()
Out[2]:
0 POINT (-0.9684702137574125 -0.9684702137574125)
1 POINT (0.419808066032094 0.419808066032094)
2 POINT (0.57207855258413 0.57207855258413)
3 POINT (0.7773915414611537 0.7773915414611537)
4 POINT (2.073261603753559 2.073261603753559)
dtype: geometry
In [3]: %timeit s.fillna(method='pad')
115 ms ± 730 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit s.fillna(method='pad_new')
217 µs ± 3.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: s.fillna(method='pad').equals(s.fillna(method='pad_new'))
Out[5]: True
And for this 'new' method I did:
+ if method == 'pad_new':
+     idx = np.arange(len(self), dtype=float)
+     idx[mask] = np.nan
+     idx = pad_1d(idx)
+     idx = idx.astype('int64')
+     new_values = self.take(idx)
+ else:
+     ...  # existing code in PR
OK, I didn't see your post before posting mine. So doing the same with the decimal array arr (with 10000 values) doesn't show a speed-up, but rather a slowdown:
In [18]: %timeit arr.fillna(method='pad')
17.3 ms ± 772 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [19]: %timeit arr.fillna(method='pad_new')
28.5 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I suppose this is because DecimalArray uses an object array under the hood anyhow? So something like take will not be that efficient (while in geopandas it is a take on an int array).
I think we shouldn't care about performance for the default implementation. It'll be too dependent on which optimizations the subclass can implement.
I'll go with a simple implementation like your pad_new.
And also, if you care about performance, you probably don't have an object array under the hood but a typed array, and then this will be faster.
This reverts commit 67a19ba.
@jreback @jorisvandenbossche do you have a preference for

diff --git a/pandas/core/arrays/base.py b/pandas/core/arrays/base.py
index 8dc4ddbff..bb58fefe5 100644
--- a/pandas/core/arrays/base.py
+++ b/pandas/core/arrays/base.py
@@ -218,11 +218,6 @@ class ExtensionArray(object):
         """
         raise AbstractMethodError(self)

-    def tolist(self):
-        # type: () -> list
-        """Convert the array to a list of scalars."""
-        return list(self)
-
     def fillna(self, value=None, method=None, limit=None):
         """ Fill NA/NaN values using the specified method.

diff --git a/pandas/core/base.py b/pandas/core/base.py
index 280b88497..5ced4ec25 100644
--- a/pandas/core/base.py
+++ b/pandas/core/base.py
@@ -829,6 +829,8 @@ class IndexOpsMixin(object):
         if is_datetimelike(self):
             return [com._maybe_box_datetimelike(x) for x in self._values]
+        elif is_extension_array_dtype(self):
+            return list(self._values)
         else:
             return self._values.tolist()

I don't have a strong preference, but slightly lean towards implementing it (the current state of the PR). It's ugly, but practical in this case.
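If that route were taken, here is a hedged usage sketch of what the IndexOpsMixin branch above would give, reusing the DecimalArray from the pandas test suite that also appears in the timings earlier in this thread (the exact behaviour shown is an assumption, not taken from the PR):

import pandas as pd
from decimal import Decimal
from pandas.tests.extension.decimal.array import DecimalArray

s = pd.Series(DecimalArray([Decimal('1.0'), Decimal('2.0')]))
# Either way (a default ExtensionArray.tolist or the is_extension_array_dtype
# branch above), the goal is a list of plain Python scalars:
assert s.tolist() == [Decimal('1.0'), Decimal('2.0')]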
pandas/core/arrays/base.py
func = pad_1d if method == 'pad' else backfill_1d
idx = np.arange(len(self), dtype=np.float64)
idx[mask] = np.nan
idx = func(idx, mask=mask).astype(np.int64)
This probably needs to be np.intp (or an ensure_platform_int) for Windows?
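For context, a small numpy-only aside (not from the PR) on the distinction being raised: np.intp is the pointer-sized integer numpy uses for indexing, while numpy's default integer on Windows follows the 32-bit C long, so a hard-coded int64 does not always match what platform-int-expecting code wants.

import numpy as np

idx = np.arange(5, dtype=np.float64)

indexer = idx.astype(np.intp)   # pointer-sized: int64 on any 64-bit platform
print(indexer.dtype)

print(np.array([0]).dtype)      # default integer: int32 on Windows, int64 on Linux/macOS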
pandas/core/arrays/base.py
new_values[mask] = value
else:
new_values = self.copy()
return type(self)(new_values)
I think this can now just be new_values?
pandas/core/arrays/base.py
if mask.any():
if method is not None:
# ffill / bfill
func = pad_1d if method == 'pad' else backfill_1d
Is method normalized by this point? (I mean, can it also be 'ffill'?)
Yeah, by validate_fillna_kwargs
pandas/core/arrays/base.py
Method to use for filling holes in reindexed Series
pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use NEXT valid observation to fill gap
value : scalar, array-like
value before method? (order of the signature)
pandas/core/arrays/base.py
Alternatively, an array-like 'value' can be given. It's expected
that the array-like have the same length as 'self'.
limit : int, default None
(Not implemented yet for ExtensionArray!)
This can now maybe be passed to pad_1d / backfill_1d?
Yeah, I had to change things up slightly to use integers and the datetime fillna methods if you want to take a look.
Categorical.fillna can probably do something similar; haven't looked.
Moved to a new method in dtypes.cast.
Updated to remove …
Added a new method …
@@ -53,3 +54,8 @@ def test_is_extension_array_dtype(self, data):
         assert is_extension_array_dtype(data.dtype)
         assert is_extension_array_dtype(pd.Series(data))
         assert isinstance(data.dtype, ExtensionDtype)
+
+    def test_tolist(self, data):
+        result = tolist(data)
Maybe rather test Series[extension].tolist() here, to check it is the same as list(data) (instead of testing this internal method)?
Moved to test_casting
Good call.
This is fine for me. I agree it is a bit more complexity, so if somebody feels strongly about it, I won't object to having a …
This turned up an issue with …
@@ -1712,7 +1711,7 @@ def __len__(self):

     def __iter__(self):
         """Returns an Iterator over the values of this Categorical."""
-        return iter(self.get_values())
+        return iter(self.get_values().tolist())
Is there a reason this does self.get_values().tolist() instead of directly self.tolist()?
I think to avoid an infinite loop. Categorical.tolist calls tolist, which calls Categorical.__iter__. We need get_values to break that cycle / extract the Python scalar types.
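A minimal sketch of the cycle being described, with hypothetical class and method names (not the actual Categorical code):

class Loopy:
    def __init__(self, data):
        self._data = list(data)

    def tolist(self):
        return list(self)            # list() drives __iter__ ...

    def __iter__(self):
        return iter(self.tolist())   # ... which calls tolist() again -> RecursionError


class Unloopy(Loopy):
    def get_values(self):
        # Materialize the underlying values without going through __iter__.
        return self._data

    def __iter__(self):
        # Iterate the materialized values, breaking the tolist/__iter__ cycle.
        return iter(self.get_values())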
It just feels like we are doing too much work. To iterate, we first create an array, iterate through it to create a list, and then iterate through that list.
So get_values either returns an array or a Datetime(like)Index. For the array we can just iterate over it, but I think for the Datetime(like)Index as well. The tolist implementation there is list(self.astype(object)).
Sorry, that previous comment was of course wrong, as that is exactly checking the get_values().tolist()
Ah, yes, therefore we need the tolist. But then moving the logic is maybe not best, as then tolist would consume an iterator coming from a list. It's all just complex with the different unboxing depending on the type :-)
Simplified slightly if you want to take one last look. Basically, we didn't have to worry about the different unboxing for different types, since Categorical.get_values() with datetimes returns a DatetimeIndex, and when we .tolist() on that we get the right unboxing.
Looks good!
This is the whole .get_values() fiasco (which I created a while back :). I think it should be addressed sooner rather than later.
Any issues with the current implementation? Currently there's a tradeoff between how expensive it = iter(Categorical) is vs. next(it).

Somewhere, we have to go from numpy scalars to Python scalars. The fastest way to do that is with tolist(), but that has upfront memory overhead. We could avoid that by doing it in the next, but that has time overhead for

- converting to Python scalars elementwise
- an extra call to CategoricalIndex.categories.getitem per element

so that next(it) becomes much slower. I don't think there's much more to be done right now.
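A minimal numpy-only sketch of that tradeoff (illustrative, not the Categorical code): converting up front with tolist() pays the memory cost once and makes next() cheap, while converting lazily keeps memory flat but makes every next() pay the numpy-scalar-to-Python-scalar cost.

import numpy as np

values = np.arange(1_000_000)

# Eager: one big Python list up front; next() is then just list iteration.
it_eager = iter(values.tolist())

# Lazy: no upfront list, but each next() converts one numpy scalar via .item().
it_lazy = (x.item() for x in values)

print(type(next(it_eager)), type(next(it_lazy)))   # <class 'int'> <class 'int'>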
pandas/core/arrays/base.py
# The basic idea is to create an array of integer positions.
# Internally, we use iNaT and the datetime filling routines
# to avoid floating-point NaN. Once filled, we take on `self`
# to get the actual values.
I am still not clear why you are writing another implementation of fill here. We already have one for object dtypes. I would suggest using that one. Yes I get that you want one for the extension classes, but if that is the case then I would replace the algos we have with THIS one. There is no reason to have 2 floating around. Then this will be fully tested and you can look at performance and such.
I don't really have a preference between these 2 options.
The extension classes implementation MUST be integrated if it's in the base class.
Sorry, but I really disagree that having a good implementation of …
But OK, let's go ahead with this PR; we can always discuss an improved implementation later.
Ah, I opened #20300. I don't really care whether we do it now or elsewhere. That type of filling on integers, followed by a take, seems like a useful strategy in general. I can do a followup implementing it in …
OK, let's merge this one then!
i will have a look later today
Thanks. No rush, I'm probably reviewing docs all day. Just wanted to make sure my PRs were passing on master.
Fixed a merge conflict. This should be ready as well.
Fixed the merge conflict.
All green.
You never answered my question of why you are adding a new implementation and not using the current one.
I removed the new implementation. #20300
Was that your last objection?
Thanks @TomAugspurger; no, I didn't see the removal, but it's looking good now.
xref #19696

This adds ExtensionArray.fillna, and support for NDFrame.fillna with extension arrays.

Like Categorical.fillna, limit is not implemented. This shouldn't make it any harder to support that in the future.