-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: standardize fill_value behavior across the API #15587
Conversation
This is a starting point for #15533. Right now I've only added There's way too much magic ATM. Some specific questions:
|
pandas/types/common.py
Outdated
@@ -491,3 +491,27 @@ def pandas_dtype(dtype): | |||
return dtype | |||
|
|||
return np.dtype(dtype) | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of this should be in pandas.types.missing
pandas/types/common.py
Outdated
return True | ||
|
||
|
||
def validate_fill_value(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just do this in one function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to separate creating the validity boolean from raising a ValueError
for it, in case in the future there's a need to do the former without the latter. Fine with it being just one method tho, if you think that's better.
pandas/types/common.py
Outdated
pandas_ts_types = ('Timestamp', 'Period', 'Timedelta') | ||
pandas_block_types = ('Series', 'DataFrame') | ||
|
||
if any([isinstance(value, (list, dict)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the way to do this is
def validate_fill_value(value):
def _validate(v):
# do validation on a scalar
return boolean
if is_list_like(value) or is_dict_like(value):
return all(_validate(v) for v in list(values))
return _validate(value)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed these is_list_like
/is_dict_like
helpers, these are important, thanks. But, why the evaluation that comes afterwards? As I understand it, we should be rejecting list and dict type inputs outright.
The former is valid in fillna
, though passing a dict isn't implemented in any of the fill_value
parameters. Lists, meanwhile, are never a valid fill value.
What I mean is that I think the implementation would be something like:
def validate_fill_value(value):
def _validate(v):
# do validation on a scalar
return boolean
if is_list_like(value) or is_dict_like(value):
return False
else:
return _validate(value)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the idiom i gave us for using s validation function in a scalar or in each element of a list
adapt 2 what u need
pandas/types/common.py
Outdated
(not (isinstance(value, string_types) or | ||
isinstance(value, (int, float, complex, str, None.__class__)) or | ||
is_numeric_dtype(value) or | ||
is_datetime_or_timedelta_dtype(value) or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what you actually need though is to pass in 2 values at the top-level
def validate_fill_value(value, dtype):
def _validate(v):
# only a sample
if is_datetime64_any_dtype(dtype):
return isinstance(value, (np.datetime64, datetime))
elif is_numeric_dtype(dtype):
return is_float(value) or is_integer(value)
else:
# string
return isinstance(value, compat.string_type)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So when I call this method, would I need to do something like:
dtype = value.dtype if hasattr(value, dtype) else None
Beforehand?
Not sure I grok this separate parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the dtype must be passed in otherwise how do unknow is the filll_value is the right type
e.g. an int is not valid if u have a datetime array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I've been following the fillna
behavior thus far. Right now fillna
would convert that input to a timestamp in ns
. Same with float
or bool
. And upcast the column to object
dtype to fit a str
fill.
fillna
behaves that way AFAIK because it's convenient to propagate a 0
or an np.nan
or whatever other out-of-type sentinel value across an entire DataFrame
all at once, instead of having to go column-by-column.
The same argument might apply for fill_value
, but, I do see it being a far weaker one. So if you think that it's OK for fill_value
to have a separate, stricter behavior than fillna
, sure.
So with date stuff, we can catch But how do we catch
Is this supposed to happen? The |
Periods are object type when in a Series ATM. They have a specific dtype only in an Index. |
numpy has pretty much nothing to do with dtypes anymore in pandas (except for some basic types). |
See the method in the new commit. How it works right now:
Is this behavior OK? |
Codecov Report
@@ Coverage Diff @@
## master #15587 +/- ##
==========================================
- Coverage 91.06% 91.03% -0.03%
==========================================
Files 137 137
Lines 49307 49330 +23
==========================================
+ Hits 44899 44908 +9
- Misses 4408 4422 +14
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code looks good!
pls add a bunch of tests! (in pandas.tests.types.missing) to validate the validation function (IOW, go thru all types with some valid and separately some invalid ones). use parametrize.
pandas/core/missing.py
Outdated
|
||
|
||
def validate_fill_value(value, dtype): | ||
if is_list_like(value) or is_dict_like(value) or callable(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a doc-string :>
pandas/core/missing.py
Outdated
|
||
|
||
def validate_fill_value(value, dtype): | ||
if is_list_like(value) or is_dict_like(value) or callable(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why dont' you check not is_scalar
? (which allows strings, datetimes, and all pandas scalars).
pandas/core/missing.py
Outdated
'a scalar, but you passed a ' | ||
'"{0}"'.format(type(value).__name__)) | ||
elif not isnull(value): | ||
from datetime import datetime, timedelta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put imports at the top of the file
pandas/core/reshape.py
Outdated
@@ -405,6 +405,10 @@ def _slow_pivot(index, columns, values): | |||
|
|||
|
|||
def unstack(obj, level, fill_value=None): | |||
if fill_value: | |||
from pandas.core.missing import validate_fill_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import at the top
@@ -405,6 +406,9 @@ def _slow_pivot(index, columns, values): | |||
|
|||
|
|||
def unstack(obj, level, fill_value=None): | |||
if fill_value: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually I would always pass this (we will make None
an acceptable fill_value below)
@@ -405,6 +406,9 @@ def _slow_pivot(index, columns, values): | |||
|
|||
|
|||
def unstack(obj, level, fill_value=None): | |||
if fill_value: | |||
validate_fill_value(fill_value, obj.values.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just pass obj.dtype
, we never explictly call .values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
obj
may be a DataFrame
AFAIK. I call values
to get the numpy
array here to consolidate the dtype
(which means that e.g. a DataFrame
with columns of mixed type will accept fill_value
according to object
rules). Is there a way to get this without accessing the underlying array directly?
@@ -301,3 +302,11 @@ def test_na_value_for_dtype(): | |||
|
|||
for dtype in ['O']: | |||
assert np.isnan(na_value_for_dtype(np.dtype(dtype))) | |||
|
|||
|
|||
class TestValidateFillValue(tm.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use a class, just create a function (and use parametrize)
|
||
def validate_fill_value(value, dtype): | ||
""" | ||
Make sure the fill value is appropriate for the given dtype. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add in a Parameters, Returns, Raises section
raise TypeError('"fill_value" parameter must be ' | ||
'a scalar, but you passed a ' | ||
'"{0}"'.format(type(value).__name__)) | ||
elif not isnull(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually you already check that value is None
is ok (just need a test to check!)
that's not right a fill value will be applied per individual dtype look in internals and put this check in the fillna method |
How about iterating through the sub-series column-by-column? Do On the design side. Suppose I have a mixed dtype DataFrame, let's say with a If |
@ResidentMario you can also do columnar. Keep in mind though that we generally will simply skip a non-compat fill-value.
so one could argue that both [5] and [6] are wrong or right. We generally leave this up to the user when having mixed dtypes. |
Exactly, leave it to the user—that's what the current implementation would do. So, do you think |
closing in favor of #15563 normally pls don't create new PR's for the same issue, just push to the same one. |
This is a different PR, though. The goal here is to implement error handling for That being said I cherry-picked some of the commits here for that PR. So a totally new PR might just be cleaner anyway. Sorry, took a while to get to the bottom of this particular molehill. |
git diff upstream/master | flake8 --diff