Skip to content

BUG: astype with timedelta and datetime string (#22100) #22107

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

fjdiod
Copy link
Contributor

@fjdiod fjdiod commented Jul 28, 2018

@pep8speaks
Copy link

pep8speaks commented Jul 28, 2018

Hello @fjdiod! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 31, 2018 at 10:45 Hours UTC

def test_astype_nansafe(arr, dtype):
# GH #22100
result = astype_nansafe(arr, dtype)
is_dtype_equal(result.dtype, dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be an assertion here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Piggybacking on this I think it would be better to explicitly parametrize the expected value and make assertions around that.

On a somewhat related note how do we want to handle datetime64[ns, tz]? Off the top of my head I can't think of how to do a conversion to that via astype so maybe should still raise for that?

@fjdiod
Copy link
Contributor Author

fjdiod commented Jul 29, 2018

Why do we need this?

if ((PY3 and dtype not in [_INT64_DTYPE, _TD_DTYPE]) or
(not PY3 and dtype != _TD_DTYPE)):
# allow frequency conversions
# we return a float here!
if dtype.kind == 'm':
mask = isna(arr)
result = arr.astype(dtype).astype(np.float64)
result[mask] = np.nan
return result
elif dtype == _TD_DTYPE:
return arr.astype(_TD_DTYPE, copy=copy)

With that we have:

>>> s = pd.Series([None])
>>> s.astype('m8')
0   NaT
dtype: timedelta64[ns]
>>> s.astype('m8').astype('m8')
0   NaN
dtype: float64

which looks strange.
Also, I've noticed:

>>> s = pd.Series([None, 1, 2])
>>> s.astype('timedelta64[ns]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sergey/projects/pandas/pandas/util/_decorators.py", line 177, in wrapper
    return func(*args, **kwargs)
  File "/home/sergey/projects/pandas/pandas/core/generic.py", line 5182, in astype
    **kwargs)
  File "/home/sergey/projects/pandas/pandas/core/internals/managers.py", line 554, in astype
    return self.apply('astype', dtype=dtype, **kwargs)
  File "/home/sergey/projects/pandas/pandas/core/internals/managers.py", line 421, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/sergey/projects/pandas/pandas/core/internals/blocks.py", line 564, in astype
    **kwargs)
  File "/home/sergey/projects/pandas/pandas/core/internals/blocks.py", line 655, in _astype
    values = astype_nansafe(values.ravel(), dtype, copy=True)
  File "/home/sergey/projects/pandas/pandas/core/dtypes/cast.py", line 709, in astype_nansafe
    raise ValueError('Cannot convert non-finite values (NA or inf) to '
ValueError: Cannot convert non-finite values (NA or inf) to integer

Is it expected?

@codecov
Copy link

codecov bot commented Jul 30, 2018

Codecov Report

Merging #22107 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22107      +/-   ##
==========================================
- Coverage   92.07%   92.06%   -0.02%     
==========================================
  Files         170      170              
  Lines       50688    50706      +18     
==========================================
+ Hits        46671    46682      +11     
- Misses       4017     4024       +7
Flag Coverage Δ
#multiple 90.47% <100%> (-0.02%) ⬇️
#single 42.3% <51.56%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/json/json.py 92.47% <ø> (ø) ⬆️
pandas/core/arrays/categorical.py 95.95% <ø> (ø) ⬆️
pandas/core/frame.py 97.26% <ø> (ø) ⬆️
pandas/core/series.py 94.11% <ø> (ø) ⬆️
pandas/io/html.py 89.17% <ø> (ø) ⬆️
pandas/util/testing.py 85.69% <ø> (-0.21%) ⬇️
pandas/core/generic.py 96.47% <ø> (ø) ⬆️
pandas/core/arrays/interval.py 92.55% <ø> (-0.05%) ⬇️
pandas/core/internals/blocks.py 94.45% <100%> (ø) ⬆️
pandas/core/arrays/datetimes.py 95.5% <100%> (+0.04%) ⬆️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 26b3e7d...74e670b. Read the comment docs.

@gfyoung gfyoung added Bug Datetime Datetime data dtype labels Jul 30, 2018
@fjdiod
Copy link
Contributor Author

fjdiod commented Jul 30, 2018

Hello, @jbrockmendel, @WillAyd
Are there any other changes I should make?

@@ -712,7 +712,8 @@ def astype_nansafe(arr, dtype, copy=True):
elif is_object_dtype(arr):

# work around NumPy brokenness, #1987
if np.issubdtype(dtype.type, np.integer):
is_time = is_datetime64_dtype(dtype) or dtype == _TD_DTYPE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_TD_DTYPE is specifically timedelta64[ns]. Are there paths that get here with e.g. timedelta64[D]? (I'm assuming those should be excluded too.

IIRC datetimete64 dtype won't trigger np.issubtype(dtype.type, np.integer), am I remembering incorrectly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, seems I was wrong about datetime64, they worked fine before.

(np.array(['2000'], dtype='object'),
'datetime64', 'datetime64[ns]'),
])
def test_astype_nansafe(arr, dtype, expected):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are datetime64[ns, TZ] dtypes relevant here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI whatever the solution is here would impact the other limitation with orient='table' around timezone data:

# Cannot directly use as_type with timezone data on object; raise for now

I'd be fine with just raising as part of this PR until its explicitly resolved but just bringing that up for visibility

(np.array(['2000'], dtype='object'),
'datetime64[ns]', 'datetime64[ns]'),
(np.array(['2000'], dtype='object'),
'datetime64', 'datetime64[ns]'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can be made less verbose (easier to read if on one line, but not a huge deal) by changing 'object' --> 'O', 'timedelta64[ns]' --> 'm8[ns]', and 'datetime64[ns]'' --> M8[ns]

@@ -712,7 +712,8 @@ def astype_nansafe(arr, dtype, copy=True):
elif is_object_dtype(arr):

# work around NumPy brokenness, #1987
if np.issubdtype(dtype.type, np.integer):
is_time = is_timedelta64_dtype(dtype)
if np.issubdtype(dtype.type, np.integer) and is_time:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use is_integer_dtype and also add a comment here (and can remove the existing work around comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but it breaks:

def test_other_timedelta_unit(self, unit):
# GH 13389
df1 = pd.DataFrame({'entity_id': [101, 102]})
s = pd.Series([None, None], index=[101, 102], name='days')
dtype = "m8[{}]".format(unit)
df2 = s.astype(dtype).to_frame('days')
assert df2['days'].dtype == 'm8[ns]'

Because of:
if ((PY3 and dtype not in [_INT64_DTYPE, _TD_DTYPE]) or
(not PY3 and dtype != _TD_DTYPE)):
# allow frequency conversions
# we return a float here!
if dtype.kind == 'm':
mask = isna(arr)
result = arr.astype(dtype).astype(np.float64)
result[mask] = np.nan
return result

Are this necessary? Here timedelta64 are turned to float64.
Should I change the test or this part of astype_nansafe?
Also, I've notices that conversions like pd.Series([None, 1, 2]).astype('timedelta64') do not work now. Should I open a new issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be missing something but shouldn't it not be going down this branch? i.e. isn't the statement dtype not in [_INT64_DTYPE, _TD_DTYPE] supposed to be false?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is IntNADType handled correctly here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd yah at this point I’m not sure if it’s tangential, but it looks like the TDDTYPE check should go above the (PY3 ...) check in line 688, after which that check is always-true.

@jreback
Copy link
Contributor

jreback commented Nov 1, 2018

closing as stale. if you'd like to continue pls ping.

@jreback jreback closed this Nov 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

astype not converting timedelta and datetime strings
6 participants