-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Support uncertainties as EA Dtype #53970
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 7 commits
4c04e30
a165278
0b833aa
2a1913b
7d8ba79
8ada662
42ff1a8
e53a62c
b8083a9
93be04f
ed69055
65cf17a
2f92811
e572b3d
2964f44
6644fe0
c9e0de3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -90,6 +90,7 @@ Other enhancements | |
- :meth:`MultiIndex.sortlevel` and :meth:`Index.sortlevel` gained a new keyword ``na_position`` (:issue:`51612`) | ||
- :meth:`arrays.DatetimeArray.map`, :meth:`arrays.TimedeltaArray.map` and :meth:`arrays.PeriodArray.map` can now take a ``na_action`` argument (:issue:`51644`) | ||
- :meth:`arrays.SparseArray.map` now supports ``na_action`` (:issue:`52096`). | ||
- :meth:`arrays.BaseMaskedArray.take` handle non-na float as fill value (triggered by ``ufloat`` NaN from ``uncertainties`` package) (see `PR <https://github.com/pandas-dev/pandas/pull/53970>`_) | ||
- Add :meth:`diff()` and :meth:`round()` for :class:`Index` (:issue:`19708`) | ||
- Add dtype of categories to ``repr`` information of :class:`CategoricalDtype` (:issue:`52179`) | ||
- Added to the escape mode "latex-math" preserving without escaping all characters between "\(" and "\)" in formatter (:issue:`51903`) | ||
|
@@ -107,6 +108,7 @@ Other enhancements | |
- :meth:`Categorical.from_codes` has gotten a ``validate`` parameter (:issue:`50975`) | ||
- :meth:`DataFrame.unstack` gained the ``sort`` keyword to dictate whether the resulting :class:`MultiIndex` levels are sorted (:issue:`15105`) | ||
- :meth:`DataFrameGroupby.agg` and :meth:`DataFrameGroupby.transform` now support grouping by multiple keys when the index is not a :class:`MultiIndex` for ``engine="numba"`` (:issue:`53486`) | ||
- :meth:`GroupBy.first` if series contains only NA values (which might be NaN), return the first NA value, else return np.nan (see `PR <https://github.com/pandas-dev/pandas/pull/53970>`_) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure what this is decribing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is describing a change of behavior of the |
||
- :meth:`Series.explode` now supports pyarrow-backed list types (:issue:`53602`) | ||
- :meth:`Series.str.join` now supports ``ArrowDtype(pa.string())`` (:issue:`53646`) | ||
- :meth:`SeriesGroupby.agg` and :meth:`DataFrameGroupby.agg` now support passing in multiple functions for ``engine="numba"`` (:issue:`53486`) | ||
|
@@ -552,6 +554,7 @@ Other | |
- Bug in :meth:`Series.map` when giving a callable to an empty series, the returned series had ``object`` dtype. It now keeps the original dtype (:issue:`52384`) | ||
- Bug in :meth:`Series.memory_usage` when ``deep=True`` throw an error with Series of objects and the returned value is incorrect, as it does not take into account GC corrections (:issue:`51858`) | ||
- Fixed incorrect ``__name__`` attribute of ``pandas._libs.json`` (:issue:`52898`) | ||
- Change ``~tests/extension/test_boolean.py`` to use pd.NA instead of np.nan (following similar patterns in ``~tests/extension/test_integer.py`` and ``~tests/extension/test_float.py``), updating :func:`make_data`, :func:`data_missing`, :func:`data_missing_for_sorting`, :func:`data_for_grouping`, :func:`_check_op` (see `PR <https://github.com/pandas-dev/pandas/pull/53970>`_) | ||
jbrockmendel marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
.. ***DO NOT USE THIS SECTION*** | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -870,7 +870,9 @@ def take( | |
# we only fill where the indexer is null | ||
# not existing missing values | ||
# TODO(jreback) what if we have a non-na float as a fill value? | ||
if allow_fill and notna(fill_value): | ||
# NaN with uncertainties is scalar but does not register as `isna`, | ||
# so use fact that NaN != NaN | ||
jbrockmendel marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if allow_fill and notna(fill_value) and fill_value == fill_value: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. how would you even get here with uncertainties? is your EA subclassing BaseMaskedArray? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The EA itself is a very vanilla EA. The test suite instantiates the EA in myriad ways for test purposes. In this case the EA gets instantiated as a MaskedArray (which derives from BaseMaskedArray) for testing MaskedArray functionality. I think the test suite has every right to rigorously instantiate EAs in every way imaginable, which it does. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. im still confused here. the EA i see in hgrecco/pint-pandas#140 subclasses ExtensionArray but not BaseMaskedArray, so i don't see how the code change here would affect it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK...I looked more deeply into this when I get back the computer where I did original work and didn't upload some notes to the cloud. I'm pretty sure it's due to some conditions created when trying to merge two DataFrames that I dummied up in my notes. I'll create a test case that can go into the Pandas test suite. It won't depend on uncertainties, but if and when used with uncertainties, it should trigger the path in question. Should be done by Monday. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Well, that was a lot more challenging to reconstruct than I expected, but I re-created the condition. TL;DR: now that I use pd.NA as my NA value for my particular EA, this change that caught your eye is indeed unnecessary and I can revert it. But, since you asked, here's the path I took initially. Starting with
The idea (before settling on pd.NA, which works without this particular change) was to use a NaN to which we could cast any numerical type (we cannot downcast uncertainties to In the test code we set up Here's the
The way the test code is run, because
When we execute
With
In this particular case, because none of the indexes are -1, no filling is going to happen. But should we have things to fill, they'd be filled with the uncertain nan. And because this code really doesn't want to fill add new NaNs to the array, I put this extra defensive code in. And that's how we hit this condition. I'm going to take this extraneous test out ( |
||
fill_mask = np.asarray(indexer) == -1 | ||
result[fill_mask] = fill_value | ||
mask = mask ^ fill_mask | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -587,7 +587,7 @@ def nkeys(self) -> int: | |
|
||
def get_iterator( | ||
self, data: NDFrameT, axis: AxisInt = 0 | ||
) -> Iterator[tuple[Hashable, NDFrameT]]: | ||
) -> Iterator[tuple[Hashable, NDFrameT]]: # Doesn't work with non-hashable EA types | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. pls remove |
||
""" | ||
Groupby iterator | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -28,7 +28,7 @@ | |
|
||
|
||
def make_data(): | ||
return [True, False] * 4 + [np.nan] + [True, False] * 44 + [np.nan] + [True, False] | ||
return [True, False] * 4 + [pd.NA] + [True, False] * 44 + [pd.NA] + [True, False] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why is this necessary? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I made this change to track the identical changes in test_integer.py and test_floating.py. The it does not matter whether the NA values in this case are pd.NA, np.nan, or a mixture. But I did think it better to make the three files consistent. I'm happy to revert this change if you prefer. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed this should be removed. It's ideal to minimize extraneous changes |
||
|
||
|
||
@pytest.fixture | ||
|
@@ -48,7 +48,7 @@ def data_for_twos(dtype): | |
|
||
@pytest.fixture | ||
def data_missing(dtype): | ||
return pd.array([np.nan, True], dtype=dtype) | ||
return pd.array([pd.NA, True], dtype=dtype) | ||
|
||
|
||
@pytest.fixture | ||
|
@@ -58,7 +58,7 @@ def data_for_sorting(dtype): | |
|
||
@pytest.fixture | ||
def data_missing_for_sorting(dtype): | ||
return pd.array([True, np.nan, False], dtype=dtype) | ||
return pd.array([True, pd.NA, False], dtype=dtype) | ||
|
||
|
||
@pytest.fixture | ||
|
@@ -76,7 +76,7 @@ def na_value(): | |
def data_for_grouping(dtype): | ||
b = True | ||
a = False | ||
na = np.nan | ||
na = pd.NA | ||
return pd.array([b, b, na, na, a, a, b], dtype=dtype) | ||
|
||
|
||
|
@@ -147,7 +147,7 @@ def _check_op(self, obj, op, other, op_name, exc=NotImplementedError): | |
expected = expected.astype("Float64") | ||
if op_name == "__rpow__": | ||
# for rpow, combine does not propagate NaN | ||
expected[result.isna()] = np.nan | ||
expected[result.isna()] = pd.NA | ||
self.assert_equal(result, expected) | ||
else: | ||
with pytest.raises(exc): | ||
|
Uh oh!
There was an error while loading. Please reload this page.