
ENH: standardize fill_value behavior across the API #15587

Closed
wants to merge 4 commits

Conversation

@ResidentMario (Contributor, Author):

This is a starting point for #15533. Right now I've only added _is_fillable_values and validate_fill_value methods to the bottom of common.py.

There's way too much magic ATM. Some specific questions:

  • Is there a method for detecting the pandas (non-numpy) time dtypes: Timestamp, Period, Timedelta? AFAIK all of the common.py ops are w.r.t. numpy dtypes (datetime64, etc.).
  • common.py ops accept individual objects or arrays and look at the dtype thereof, so we have to catch numpy and pandas data structures separately. What's a good way of tackling this? Just import all of the names and test them all?
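On the first question, one answer (a sketch; the helper name is mine, not anything in common.py): the pandas scalar time types are ordinary classes rather than numpy dtypes, so they can be caught with a plain isinstance check.

```python
# Sketch: catching pandas (non-numpy) time scalars with isinstance,
# since Timestamp, Period, and Timedelta are regular Python classes.
import pandas as pd

def is_pandas_time_scalar(value):
    """True for the pandas scalar time types asked about above."""
    return isinstance(value, (pd.Timestamp, pd.Period, pd.Timedelta))

print(is_pandas_time_scalar(pd.Timestamp("2015-01-01")))  # True
print(is_pandas_time_scalar(3))                           # False
```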

@@ -491,3 +491,27 @@ def pandas_dtype(dtype):
return dtype

return np.dtype(dtype)


Contributor:

all of this should be in pandas.types.missing

return True


def validate_fill_value(value):
Contributor:

you can just do this in one function

Contributor Author:

I wanted to separate creating the validity boolean from raising a ValueError for it, in case in the future there's a need to do the former without the latter. Fine with it being just one method tho, if you think that's better.

pandas_ts_types = ('Timestamp', 'Period', 'Timedelta')
pandas_block_types = ('Series', 'DataFrame')

if any([isinstance(value, (list, dict)),
Contributor:

so the way to do this is

def validate_fill_value(value):
    def _validate(v):
        # do validation on a scalar
        return boolean
    if is_list_like(value) or is_dict_like(value):
        return all(_validate(v) for v in value)
    return _validate(value)

@ResidentMario (Contributor Author), Mar 6, 2017:

Missed these is_list_like/is_dict_like helpers; they're important, thanks. But why the evaluation that comes afterwards? As I understand it, we should be rejecting list- and dict-type inputs outright.

The former is valid in fillna, though passing a dict isn't implemented in any of the fill_value parameters. Lists, meanwhile, are never a valid fill value.

What I mean is that I think the implementation would be something like:

def validate_fill_value(value):
    def _validate(v):
        # do validation on a scalar
        return boolean
    if is_list_like(value) or is_dict_like(value):
        return False
    else:
        return _validate(value)

Contributor:

The idiom I gave is for using a validation function on a scalar or on each element of a list.

Adapt it to what you need.

(not (isinstance(value, string_types) or
isinstance(value, (int, float, complex, str, None.__class__)) or
is_numeric_dtype(value) or
is_datetime_or_timedelta_dtype(value) or
Contributor:

what you actually need though is to pass in 2 values at the top-level

def validate_fill_value(value, dtype):
    def _validate(v):
        # only a sample
        if is_datetime64_any_dtype(dtype):
            return isinstance(v, (np.datetime64, datetime))
        elif is_numeric_dtype(dtype):
            return is_float(v) or is_integer(v)
        else:
            # string
            return isinstance(v, compat.string_types)
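A runnable version of the sample above, for reference (my sketch: the modern pandas.api.types import path is assumed, and is_float/is_integer are swapped for plain isinstance checks):

```python
# Dtype-aware fill-value validation, combining the two idioms above.
from datetime import datetime
import numpy as np
import pandas as pd
from pandas.api.types import (
    is_datetime64_any_dtype,
    is_dict_like,
    is_list_like,
    is_numeric_dtype,
)

def validate_fill_value(value, dtype):
    def _validate(v):
        if is_datetime64_any_dtype(dtype):
            return isinstance(v, (np.datetime64, datetime))
        elif is_numeric_dtype(dtype):
            return isinstance(v, (int, float, np.number))
        else:
            # fall back to treating the column as object/string
            return isinstance(v, str)

    if is_list_like(value) or is_dict_like(value):
        return all(_validate(v) for v in value)
    return _validate(value)

dt = pd.Series(pd.date_range("2013-01-01", periods=3)).dtype
print(validate_fill_value(pd.Timestamp("2013-02-01"), dt))  # True
print(validate_fill_value(1, dt))                           # False
```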

Contributor Author:

So when I call this method, would I need to do something like:

dtype = value.dtype if hasattr(value, 'dtype') else None

Beforehand?

Not sure I grok this separate parameter.

Contributor:

The dtype must be passed in; otherwise, how do you know the fill_value is the right type?
E.g. an int is not valid if you have a datetime array.

Contributor Author:

Well, I've been following the fillna behavior thus far. Right now fillna would convert that input to a timestamp in ns. Same with float or bool. And upcast the column to object dtype to fit a str fill.

fillna behaves that way AFAIK because it's convenient to propagate a 0 or an np.nan or whatever other out-of-type sentinel value across an entire DataFrame all at once, instead of having to go column-by-column.

The same argument might apply for fill_value, but, I do see it being a far weaker one. So if you think that it's OK for fill_value to have a separate, stricter behavior than fillna, sure.

@jreback added the Missing-data and Error Reporting labels, Mar 6, 2017.
@ResidentMario (Contributor Author) commented Mar 7, 2017:

So with date stuff, we can catch numpy/stdlib datetime/Timestamp using is_datetime64_any_dtype. We can catch Timedelta using is_timedelta64_dtype.

But how do we catch Period? When fed to is_datetime64_any_dtype it returns False. Additionally, the following evaluates to False as well:

is_period_dtype(pd.Series([pd.Period('2015-01-01')]).dtype)

Is this supposed to happen? The Period numpy dtype is just 'O'...

@jreback (Contributor) commented Mar 7, 2017:

Periods are object type when in a Series ATM. They have a specific dtype only in an Index.
There is an is_period_arraylike if you really need inference on an array.
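To illustrate the point: a PeriodIndex carries a real period dtype, which is what dtype checks should target. (Note: later pandas versions also infer a period dtype when constructing a Series from Periods, so the object-dtype Series behavior described here is historical.)

```python
# A PeriodIndex has a dedicated period dtype, unlike a 2017-era
# Series of Periods, which was plain object dtype.
import pandas as pd

idx = pd.period_range("2015-01", periods=3, freq="M")
print(idx.dtype)                              # period[M]
print(isinstance(idx.dtype, pd.PeriodDtype))  # True
```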

@jreback (Contributor) commented Mar 7, 2017:

numpy has pretty much nothing to do with dtypes anymore in pandas (except for some basic types).

@ResidentMario (Contributor Author):

See the method in the new commit.

How it works right now:

  • list_like, dict_like, and callable fill values will always raise a TypeError.
  • isnull fill values will always pass.
  • If the unified dtype (the dtype you get when you cast to a numpy array) of a Series or DataFrame is object, any object excepting the ones in the first bullet point will be accepted.
  • The above includes Period dtype columns. Theoretically Period dtype columns should only accept Period fill values. However, because of the way periods are implemented, with an O dtype, there doesn't seem to be an easy way of conforming to this behavior without changing the method signature somehow. Periods just fall through to the general object case right now.
  • If the unified dtype is datetime64, only datetime types will work.
  • If the unified dtype is timedelta64, only timedelta types will work.

Is this behavior OK?
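A minimal sketch of the behavior in the bullets above (helper names and error messages are illustrative, not the PR's exact code):

```python
# Reject list-like/dict-like/callable fills, always pass null fills,
# and type-check fills against datetime64/timedelta64 dtypes; anything
# else (including Periods, stored as object) falls through.
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
from pandas.api.types import is_dict_like, is_list_like

def validate_fill_value(value, dtype):
    if is_list_like(value) or is_dict_like(value) or callable(value):
        raise TypeError('"fill_value" parameter must be a scalar, but '
                        'you passed a "{0}"'.format(type(value).__name__))
    if pd.isnull(value):
        return  # null fill values always pass
    if dtype == np.dtype("datetime64[ns]"):
        if not isinstance(value, (np.datetime64, datetime)):
            raise TypeError("only datetimes are valid for datetime64 data")
    elif dtype == np.dtype("timedelta64[ns]"):
        if not isinstance(value, (np.timedelta64, timedelta)):
            raise TypeError("only timedeltas are valid for timedelta64 data")
    # object dtype: any remaining scalar (including Period) is accepted
```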

@codecov-io commented Mar 7, 2017:

Codecov Report

Merging #15587 into master will decrease coverage by 0.03%.
The diff coverage is 75%.

@@            Coverage Diff             @@
##           master   #15587      +/-   ##
==========================================
- Coverage   91.06%   91.03%   -0.03%     
==========================================
  Files         137      137              
  Lines       49307    49330      +23     
==========================================
+ Hits        44899    44908       +9     
- Misses       4408     4422      +14
Impacted Files Coverage Δ
pandas/core/reshape.py 99.28% <100%> (ø)
pandas/core/missing.py 84.38% <71.42%> (-0.57%)
pandas/io/gbq.py 25% <0%> (-58.34%)
pandas/tools/merge.py 91.78% <0%> (-0.35%)
pandas/core/frame.py 97.87% <0%> (-0.06%)
pandas/formats/format.py 95.33% <0%> (-0.01%)
pandas/io/excel.py 79.67% <0%> (+0.03%)
pandas/tseries/base.py 96.65% <0%> (+0.06%)
pandas/core/common.py 91.36% <0%> (+0.33%)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 09360d8...9eaa0f2.

@jreback (Contributor) left a review comment:

code looks good!

pls add a bunch of tests! (in pandas.tests.types.missing) to validate the validation function (IOW, go thru all types with some valid and separately some invalid ones). use parametrize.



def validate_fill_value(value, dtype):
if is_list_like(value) or is_dict_like(value) or callable(value):
Contributor:

can you add a doc-string :>



def validate_fill_value(value, dtype):
if is_list_like(value) or is_dict_like(value) or callable(value):
Contributor:

Why don't you check not is_scalar? (That allows strings, datetimes, and all pandas scalars.)
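For reference, here is what a not is_scalar check would accept and reject (my sketch; the modern pandas.api.types import path is assumed):

```python
# pandas treats strings and pandas scalars (e.g. Timestamp) as scalar,
# while lists and dicts are not -- matching the suggested check.
import pandas as pd
from pandas.api.types import is_scalar

print(is_scalar("a"))                         # True
print(is_scalar(pd.Timestamp("2017-03-07")))  # True
print(is_scalar([1, 2]))                      # False
print(is_scalar({"a": 1}))                    # False
```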

'a scalar, but you passed a '
'"{0}"'.format(type(value).__name__))
elif not isnull(value):
from datetime import datetime, timedelta
Contributor:

put imports at the top of the file

@@ -405,6 +405,10 @@ def _slow_pivot(index, columns, values):


def unstack(obj, level, fill_value=None):
if fill_value:
from pandas.core.missing import validate_fill_value
Contributor:

import at the top

@@ -405,6 +406,9 @@ def _slow_pivot(index, columns, values):


def unstack(obj, level, fill_value=None):
if fill_value:
Contributor:

actually I would always pass this (we will make None an acceptable fill_value below)

@@ -405,6 +406,9 @@ def _slow_pivot(index, columns, values):


def unstack(obj, level, fill_value=None):
if fill_value:
validate_fill_value(fill_value, obj.values.dtype)
Contributor:

just pass obj.dtype; we never explicitly call .values

Contributor Author:

obj may be a DataFrame AFAIK. I call values to get the numpy array here to consolidate the dtype (which means that e.g. a DataFrame with columns of mixed type will accept fill_value according to object rules). Is there a way to get this without accessing the underlying array directly?

@@ -301,3 +302,11 @@ def test_na_value_for_dtype():

for dtype in ['O']:
assert np.isnan(na_value_for_dtype(np.dtype(dtype)))


class TestValidateFillValue(tm.TestCase):
Contributor:

don't use a class, just create a function (and use parametrize)


def validate_fill_value(value, dtype):
"""
Make sure the fill value is appropriate for the given dtype.
Contributor:

add in a Parameters, Returns, Raises section

raise TypeError('"fill_value" parameter must be '
'a scalar, but you passed a '
'"{0}"'.format(type(value).__name__))
elif not isnull(value):
Contributor:

actually you already check that value is None is ok (just need a test to check!)

@jreback (Contributor) commented Mar 7, 2017:

that's not right

a fill value will be applied per individual dtype
so best to simply validate at a lower level then

look in internals and put this check in the fillna method

@ResidentMario (Contributor Author):

How about iterating through the sub-series column-by-column? Do isinstance(obj, ABCFrame) and, if True, do a [_validate(col, dtype) for col in obj.columns].

On the design side. Suppose I have a mixed dtype DataFrame, let's say with a str column and a bool column. After some operation I now have a null value in each. fill_value doesn't implement columnar dict input like fillna does, so there's no way of handling these columns separately.

If fill_value is supposed to be a quick substitute for fillna, then this should be OK. If on the other hand the idea is that we want it there strictly to cover cases when we don't upcast the column dtype (which was the original motivation), then validation should be by-column, yeah.
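The column-by-column option floated above could be sketched like this (my names; it assumes some scalar validator validate(fill_value, dtype) already exists):

```python
# Run a per-dtype validator once per DataFrame column, or once for a
# Series, instead of validating against a single consolidated dtype.
import pandas as pd

def validate_columnwise(obj, fill_value, validate):
    if isinstance(obj, pd.DataFrame):
        for dtype in obj.dtypes:
            validate(fill_value, dtype)
    else:
        validate(fill_value, obj.dtype)

df = pd.DataFrame({"A": ["x"], "B": [True]})
seen = []
validate_columnwise(df, 0, lambda v, d: seen.append(str(d)))
print(seen)  # ['object', 'bool']
```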

@jreback (Contributor) commented Mar 8, 2017:

@ResidentMario you can also do columnar. Keep in mind though that we generally will simply skip a non-compat fill-value.

In [1]: df = DataFrame({'A':[1,2,3],'B':pd.date_range('20130101',periods=3)})

In [2]: df
Out[2]: 
   A          B
0  1 2013-01-01
1  2 2013-01-02
2  3 2013-01-03

In [3]: df.iloc[1] = np.nan

In [4]: df
Out[4]: 
     A          B
0  1.0 2013-01-01
1  NaN        NaT
2  3.0 2013-01-03

In [5]: df.fillna(0)
Out[5]: 
     A          B
0  1.0 2013-01-01
1  0.0 1970-01-01
2  3.0 2013-01-03

In [6]: df.fillna(Timestamp('20130201'))
Out[6]: 
                     A          B
0                    1 2013-01-01
1  2013-02-01 00:00:00 2013-02-01
2                    3 2013-01-03

so one could argue that both [5] and [6] are wrong or right. We generally leave this up to the user when having mixed dtypes.

@ResidentMario (Contributor Author):

Exactly, leave it to the user—that's what the current implementation would do. So, do you think fill_value should follow the [5] & [6] case, the (current) "lossy" implementation, or a stricter (suggested) check-each-column implementation? I can do whichever.

@jreback (Contributor) commented Mar 11, 2017:

closing in favor of #15563

normally pls don't create new PR's for the same issue, just push to the same one.

@jreback closed this Mar 11, 2017.
@ResidentMario (Contributor Author):

This is a different PR, though. The goal here is to implement error handling for fill_value parameters to various methods; the goal of PR #15563 is to implement error handling for fillna. Presumably I'd then rebase from there and work the feature out here.

That being said I cherry-picked some of the commits here for that PR. So a totally new PR might just be cleaner anyway.

Sorry, took a while to get to the bottom of this particular molehill.
