
ENH: standardize fill_value behavior across the API #15587

Closed
wants to merge 4 commits

Conversation

@ResidentMario (Contributor, Author):

This is a starting point for #15533. Right now I've only added _is_fillable_values and validate_fill_value methods to the bottom of common.py.

There's way too much magic ATM. Some specific questions:

  • Is there a method for detecting the pandas (non-numpy) time dtypes: Timestamp, Period, Timedelta? AFAIK all of the common.py ops are w.r.t. numpy dtypes (datetime64, etc.).
  • common.py ops accept individual objects or arrays and look at the dtype thereof, so we have to catch numpy and pandas data structures separately. What's a good way of tackling this? Just import all of the names and test them all?
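On the first question, one answer (a sketch; the helper name is mine, not anything in common.py): the pandas scalar time types are ordinary classes rather than numpy dtypes, so they can be caught with a plain isinstance check.

```python
# Sketch: catching pandas (non-numpy) time scalars with isinstance,
# since Timestamp, Period, and Timedelta are regular Python classes.
import pandas as pd

def is_pandas_time_scalar(value):
    """True for the pandas scalar time types asked about above."""
    return isinstance(value, (pd.Timestamp, pd.Period, pd.Timedelta))

print(is_pandas_time_scalar(pd.Timestamp("2015-01-01")))  # True
print(is_pandas_time_scalar(3))                           # False
```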

@@ -491,3 +491,27 @@ def pandas_dtype(dtype):
return dtype

return np.dtype(dtype)


Contributor:

all of this should be in pandas.types.missing

return True


def validate_fill_value(value):
Contributor:

you can just do this in one function

Contributor Author:

I wanted to separate creating the validity boolean from raising a ValueError for it, in case in the future there's a need to do the former without the latter. Fine with it being just one method tho, if you think that's better.

pandas_ts_types = ('Timestamp', 'Period', 'Timedelta')
pandas_block_types = ('Series', 'DataFrame')

if any([isinstance(value, (list, dict)),
Contributor:

so the way to do this is

def validate_fill_value(value):
    def _validate(v):
        # do validation on a scalar
        return boolean
    if is_list_like(value) or is_dict_like(value):
        return all(_validate(v) for v in value)
    return _validate(value)

@ResidentMario (Contributor Author), Mar 6, 2017:

Missed these is_list_like/is_dict_like helpers; they're important, thanks. But why the evaluation that comes afterwards? As I understand it, we should be rejecting list- and dict-type inputs outright.

The former is valid in fillna, though passing a dict isn't implemented in any of the fill_value parameters. Lists, meanwhile, are never a valid fill value.

What I mean is that I think the implementation would be something like:

def validate_fill_value(value):
    def _validate(v):
        # do validation on a scalar
        return boolean
    if is_list_like(value) or is_dict_like(value):
        return False
    else:
        return _validate(value)

Contributor:

The idiom I gave is for using a validation function on a scalar or on each element of a list.

Adapt it to what you need.

(not (isinstance(value, string_types) or
isinstance(value, (int, float, complex, str, None.__class__)) or
is_numeric_dtype(value) or
is_datetime_or_timedelta_dtype(value) or
Contributor:

what you actually need though is to pass in 2 values at the top-level

def validate_fill_value(value, dtype):
    def _validate(v):
        # only a sample
        if is_datetime64_any_dtype(dtype):
            return isinstance(v, (np.datetime64, datetime))
        elif is_numeric_dtype(dtype):
            return is_float(v) or is_integer(v)
        else:
            # string
            return isinstance(v, compat.string_types)
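A runnable version of the sample above, for reference (my sketch: the modern pandas.api.types import path is assumed, and is_float/is_integer are swapped for plain isinstance checks):

```python
# Dtype-aware fill-value validation, combining the two idioms above.
from datetime import datetime
import numpy as np
import pandas as pd
from pandas.api.types import (
    is_datetime64_any_dtype,
    is_dict_like,
    is_list_like,
    is_numeric_dtype,
)

def validate_fill_value(value, dtype):
    def _validate(v):
        if is_datetime64_any_dtype(dtype):
            return isinstance(v, (np.datetime64, datetime))
        elif is_numeric_dtype(dtype):
            return isinstance(v, (int, float, np.number))
        else:
            # fall back to treating the column as object/string
            return isinstance(v, str)

    if is_list_like(value) or is_dict_like(value):
        return all(_validate(v) for v in value)
    return _validate(value)

dt = pd.Series(pd.date_range("2013-01-01", periods=3)).dtype
print(validate_fill_value(pd.Timestamp("2013-02-01"), dt))  # True
print(validate_fill_value(1, dt))                           # False
```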

Contributor Author:

So when I call this method, would I need to do something like:

dtype = value.dtype if hasattr(value, 'dtype') else None

Beforehand?

Not sure I grok this separate parameter.

Contributor:

The dtype must be passed in; otherwise, how do you know the fill_value is the right type?
E.g. an int is not valid if you have a datetime array.

Contributor Author:

Well, I've been following the fillna behavior thus far. Right now fillna would convert that input to a timestamp in ns. Same with float or bool. And upcast the column to object dtype to fit a str fill.

fillna behaves that way AFAIK because it's convenient to propagate a 0 or an np.nan or whatever other out-of-type sentinel value across an entire DataFrame all at once, instead of having to go column-by-column.

The same argument might apply for fill_value, but, I do see it being a far weaker one. So if you think that it's OK for fill_value to have a separate, stricter behavior than fillna, sure.

@jreback added the Missing-data and Error Reporting labels, Mar 6, 2017.
@ResidentMario (Contributor Author) commented Mar 7, 2017:

So with date stuff, we can catch numpy/stdlib datetime/Timestamp using is_datetime64_any_dtype. We can catch Timedelta using is_timedelta64_dtype.

But how do we catch Period? When fed to is_datetime64_any_dtype it returns False. Additionally, the following evaluates to False as well:

is_period_dtype(pd.Series([pd.Period('2015-01-01')]).dtype)

Is this supposed to happen? The Period numpy dtype is just 'O'...

@jreback (Contributor) commented Mar 7, 2017:

Periods are object type when in a Series ATM. They have a specific dtype only in an Index.
There is an is_period_arraylike if you really need inference on an array.
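To illustrate the point: a PeriodIndex carries a real period dtype, which is what dtype checks should target. (Note: later pandas versions also infer a period dtype when constructing a Series from Periods, so the object-dtype Series behavior described here is historical.)

```python
# A PeriodIndex has a dedicated period dtype, unlike a 2017-era
# Series of Periods, which was plain object dtype.
import pandas as pd

idx = pd.period_range("2015-01", periods=3, freq="M")
print(idx.dtype)                              # period[M]
print(isinstance(idx.dtype, pd.PeriodDtype))  # True
```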

@jreback (Contributor) commented Mar 7, 2017:

numpy has pretty much nothing to do with dtypes anymore in pandas (except for some basic types).

@ResidentMario (Contributor Author):

See the method in the new commit.

How it works right now:

  • list_like, dict_like, and callable fill values will always raise a TypeError.
  • isnull fill values will always pass.
  • If the unified dtype (the dtype you get when you cast to a numpy array) of a Series or DataFrame is object, any object excepting the ones in the first bullet point will be accepted.
  • The above includes Period dtype columns. Theoretically Period dtype columns should only accept Period fill values. However, because of the way periods are implemented, with an O dtype, there doesn't seem to be an easy way of conforming to this behavior without changing the method signature somehow. Periods just fall through to the general object case right now.
  • If the unified dtype is datetime64, only datetime types will work.
  • If the unified dtype is timedelta64, only timedelta types will work.

Is this behavior OK?
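A minimal sketch of the behavior in the bullets above (helper names and error messages are illustrative, not the PR's exact code):

```python
# Reject list-like/dict-like/callable fills, always pass null fills,
# and type-check fills against datetime64/timedelta64 dtypes; anything
# else (including Periods, stored as object) falls through.
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
from pandas.api.types import is_dict_like, is_list_like

def validate_fill_value(value, dtype):
    if is_list_like(value) or is_dict_like(value) or callable(value):
        raise TypeError('"fill_value" parameter must be a scalar, but '
                        'you passed a "{0}"'.format(type(value).__name__))
    if pd.isnull(value):
        return  # null fill values always pass
    if dtype == np.dtype("datetime64[ns]"):
        if not isinstance(value, (np.datetime64, datetime)):
            raise TypeError("only datetimes are valid for datetime64 data")
    elif dtype == np.dtype("timedelta64[ns]"):
        if not isinstance(value, (np.timedelta64, timedelta)):
            raise TypeError("only timedeltas are valid for timedelta64 data")
    # object dtype: any remaining scalar (including Period) is accepted
```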

@codecov-io commented Mar 7, 2017:

Codecov Report

Merging #15587 into master will decrease coverage by 0.03%.
The diff coverage is 75%.

@@            Coverage Diff             @@
##           master   #15587      +/-   ##
==========================================
- Coverage   91.06%   91.03%   -0.03%     
==========================================
  Files         137      137              
  Lines       49307    49330      +23     
==========================================
+ Hits        44899    44908       +9     
- Misses       4408     4422      +14
Impacted Files Coverage Δ
pandas/core/reshape.py 99.28% <100%> (ø)
pandas/core/missing.py 84.38% <71.42%> (-0.57%)
pandas/io/gbq.py 25% <0%> (-58.34%)
pandas/tools/merge.py 91.78% <0%> (-0.35%)
pandas/core/frame.py 97.87% <0%> (-0.06%)
pandas/formats/format.py 95.33% <0%> (-0.01%)
pandas/io/excel.py 79.67% <0%> (+0.03%)
pandas/tseries/base.py 96.65% <0%> (+0.06%)
pandas/core/common.py 91.36% <0%> (+0.33%)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 09360d8...9eaa0f2.

@jreback (Contributor) left a review comment:

code looks good!

pls add a bunch of tests! (in pandas.tests.types.missing) to validate the validation function (IOW, go thru all types with some valid and separately some invalid ones). use parametrize.



def validate_fill_value(value, dtype):
if is_list_like(value) or is_dict_like(value) or callable(value):
Contributor:

can you add a doc-string :>



def validate_fill_value(value, dtype):
if is_list_like(value) or is_dict_like(value) or callable(value):
Contributor:

Why don't you check not is_scalar? (That allows strings, datetimes, and all pandas scalars.)
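For reference, here is what a not is_scalar check would accept and reject (my sketch; the modern pandas.api.types import path is assumed):

```python
# pandas treats strings and pandas scalars (e.g. Timestamp) as scalar,
# while lists and dicts are not -- matching the suggested check.
import pandas as pd
from pandas.api.types import is_scalar

print(is_scalar("a"))                         # True
print(is_scalar(pd.Timestamp("2017-03-07")))  # True
print(is_scalar([1, 2]))                      # False
print(is_scalar({"a": 1}))                    # False
```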

'a scalar, but you passed a '
'"{0}"'.format(type(value).__name__))
elif not isnull(value):
from datetime import datetime, timedelta
Contributor:

put imports at the top of the file

@@ -405,6 +405,10 @@ def _slow_pivot(index, columns, values):


def unstack(obj, level, fill_value=None):
if fill_value:
from pandas.core.missing import validate_fill_value
Contributor:

import at the top

@@ -405,6 +406,9 @@ def _slow_pivot(index, columns, values):


def unstack(obj, level, fill_value=None):
if fill_value:
Contributor:

actually I would always pass this (we will make None an acceptable fill_value below)

@@ -405,6 +406,9 @@ def _slow_pivot(index, columns, values):


def unstack(obj, level, fill_value=None):
if fill_value:
validate_fill_value(fill_value, obj.values.dtype)
Contributor:

just pass obj.dtype; we never explicitly call .values

Contributor Author:

obj may be a DataFrame AFAIK. I call values to get the numpy array here to consolidate the dtype (which means that e.g. a DataFrame with columns of mixed type will accept fill_value according to object rules). Is there a way to get this without accessing the underlying array directly?

@@ -301,3 +302,11 @@ def test_na_value_for_dtype():

for dtype in ['O']:
assert np.isnan(na_value_for_dtype(np.dtype(dtype)))


class TestValidateFillValue(tm.TestCase):
Contributor:

don't use a class, just create a function (and use parametrize)


def validate_fill_value(value, dtype):
"""
Make sure the fill value is appropriate for the given dtype.
Contributor:

add in a Parameters, Returns, Raises section

raise TypeError('"fill_value" parameter must be '
'a scalar, but you passed a '
'"{0}"'.format(type(value).__name__))
elif not isnull(value):
Contributor:

actually you already check that value is None is ok (just need a test to check!)

@jreback (Contributor) commented Mar 7, 2017:

that's not right

a fill value will be applied per individual dtype
so best to simply validate at a lower level then

look in internals and put this check in the fillna method

@ResidentMario (Contributor Author):

How about iterating through the sub-series column-by-column? Do isinstance(obj, ABCFrame) and, if True, do a [_validate(col, dtype) for col in obj.columns].

On the design side. Suppose I have a mixed dtype DataFrame, let's say with a str column and a bool column. After some operation I now have a null value in each. fill_value doesn't implement columnar dict input like fillna does, so there's no way of handling these columns separately.

If fill_value is supposed to be a quick substitute for fillna, then this should be OK. If on the other hand the idea is that we want it there strictly to cover cases when we don't upcast the column dtype (which was the original motivation), then validation should be by-column, yeah.
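The column-by-column option floated above could be sketched like this (my names; it assumes some scalar validator validate(fill_value, dtype) already exists):

```python
# Run a per-dtype validator once per DataFrame column, or once for a
# Series, instead of validating against a single consolidated dtype.
import pandas as pd

def validate_columnwise(obj, fill_value, validate):
    if isinstance(obj, pd.DataFrame):
        for dtype in obj.dtypes:
            validate(fill_value, dtype)
    else:
        validate(fill_value, obj.dtype)

df = pd.DataFrame({"A": ["x"], "B": [True]})
seen = []
validate_columnwise(df, 0, lambda v, d: seen.append(str(d)))
print(seen)  # ['object', 'bool']
```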

@jreback (Contributor) commented Mar 8, 2017:

@ResidentMario you can also do columnar. Keep in mind though that we generally will simply skip a non-compat fill-value.

In [1]: df = DataFrame({'A':[1,2,3],'B':pd.date_range('20130101',periods=3)})

In [2]: df
Out[2]: 
   A          B
0  1 2013-01-01
1  2 2013-01-02
2  3 2013-01-03

In [3]: df.iloc[1] = np.nan

In [4]: df
Out[4]: 
     A          B
0  1.0 2013-01-01
1  NaN        NaT
2  3.0 2013-01-03

In [5]: df.fillna(0)
Out[5]: 
     A          B
0  1.0 2013-01-01
1  0.0 1970-01-01
2  3.0 2013-01-03

In [6]: df.fillna(Timestamp('20130201'))
Out[6]: 
                     A          B
0                    1 2013-01-01
1  2013-02-01 00:00:00 2013-02-01
2                    3 2013-01-03

so one could argue that both [5] and [6] are wrong or right. We generally leave this up to the user when having mixed dtypes.

@ResidentMario (Contributor Author):

Exactly, leave it to the user—that's what the current implementation would do. So, do you think fill_value should follow the [5] & [6] case, the (current) "lossy" implementation, or a stricter (suggested) check-each-column implementation? I can do whichever.

@jreback (Contributor) commented Mar 11, 2017:

closing in favor of #15563

normally pls don't create new PR's for the same issue, just push to the same one.

@jreback closed this Mar 11, 2017.
@ResidentMario (Contributor Author):

This is a different PR, though. The goal here is to implement error handling for fill_value parameters to various methods; the goal of PR #15563 is to implement error handling for fillna. Presumably I'd then rebase from there and work the feature out here.

That being said I cherry-picked some of the commits here for that PR. So a totally new PR might just be cleaner anyway.

Sorry, took a while to get to the bottom of this particular molehill.
