API deprecate date_parser, add date_format #50601
The main drawback I can see of switching from a callback to a format string is that the callback enables you to use the other parameters of `to_datetime`. Is there concern that the other parameters of `to_datetime` may end up having to be added to `read_csv`? At any rate, I certainly feel that this would be a pretty sizable improvement to date parsing, not just in terms of performance gains but also in terms of usability. |
Thanks for your input, much appreciated. How about a `to_datetime_kwargs` argument? It would be performant, because in `pandas/pandas/io/parsers/base_parser.py` (lines 1133 to 1135 at 3bc2203) the kwargs could directly go to the `to_datetime` call (lines 1124 to 1130 at 3bc2203).
|
not sure I see what you mean - is this currently supported? do you have an example please? |
Something like this:

```python
import pandas as pd
from io import StringIO

date_euro = pd.date_range(start='1900', end='2000', freq='1D').strftime("%d/%m/%Y %H:%M:%S").to_list()
date_us = pd.date_range(start='1900', end='2000', freq='1D').strftime("%m/%d/%Y %H:%M:%S").to_list()
date_df_pre = pd.DataFrame({
    'date_euro': date_euro,
    'date_us': date_us,
})
csv_buf = StringIO()
date_df_pre.to_csv(csv_buf)
csv_buf.seek(0)
```
`date_df_pre` holds the same dates in two formats (day-first vs month-first). The parser then dispatches on column name:
```python
from typing import Union

def parse_all_dates(col: Union[str, pd.Series]):
    # only parse string-dtype Series; pass anything else through untouched
    if not isinstance(col, pd.Series):
        return col
    if not pd.api.types.is_string_dtype(col.dtype):
        return col
    if col.name == 'date_euro':
        return pd.to_datetime(col, format="%d/%m/%Y %H:%M:%S")
    if col.name == 'date_us':
        return pd.to_datetime(col, format="%m/%d/%Y %H:%M:%S")
    return col

new_df = pd.read_csv(csv_buf, usecols=['date_euro', 'date_us'],
                     parse_dates=['date_euro', 'date_us'], date_parser=parse_all_dates)
new_df
```
A bit contrived, but I'm sure there are real-world use-cases that would need something like this, where different date formats are used per column. |
IIUC that's not currently supported by `date_parser` - if so, let's keep that to a separate discussion (you're welcome to open a new issue and I'll gladly take a look) |
I can't say - I haven't had to use these parameters before and can't tell whether they're actually useful for CSV parsing; I'm just pointing it out for the record, in case there's a use case with these that I'm missing. If the other args for `to_datetime` actually matter for file parsing, then I guess kwargs is probably the most extensible and least "future-intrusive" solution. |
Oh, no, that code snippet actually works and uses the existing `date_parser`. |
Ah I see, sorry, I should've actually tried running it. And if it's possible, then there's definitely someone out there using it 😄

A dict mapping column names to kwargs would allow for that functionality to be kept though, right? Like:

```python
read_csv(data, to_datetime_kwargs={'date_euro': {'dayfirst': True}, 'date_us': {'dayfirst': False}})
```

So, the per-column functionality wouldn't necessarily be lost. |
I can't speak for the ergonomics of such a solution (the list of parameters for `read_csv` is already pretty long and complex), but it would at least preserve the functionality of the callback, as far as I can tell. Looking at the current docs, there are cases where specific use of `to_datetime` is needed (such as mixed-timezone columns), so that wouldn't be lost, at least.

Truth be told, the more I look at the extended use-cases, the less confident I am about having this sort of highly tunable functionality (date-parsing kwargs, that is) baked directly into the `read_csv` function instead of having the user do the conversion as a second pass. Polars, for instance, only has a switch for automatic date parsing, which tries to figure out your date format in an undocumented way from the first value of a column by testing a bunch of big-endian date formats, and its guide says you should leave that switch off and do the string-to-date conversion yourself if you have anything more complicated than that. Leaving only simple knobs in `read_csv` and documenting the second-pass conversion might be the better path here. |
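For reference, the Polars switch being described is, as far as I can tell, the `try_parse_dates` flag (a minimal sketch; the file name is made up):

```python
import polars as pl

# try_parse_dates=True turns on Polars' automatic date inference;
# columns it can't infer are left as strings for the user to convert later
df = pl.read_csv("data.csv", try_parse_dates=True)
```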
Thanks for your thoughts. Yes, I think you're right - I'd be +1 for simple knobs, and a doc change.

So, to summarise:

- deprecate `date_parser`, because it always hurts performance
- add `date_format`, because that can boost performance
- anyone with more complex needs can read in their column as `object`, and then apply their parsing

Performance-wise, this would only be an improvement to the status quo |
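A minimal sketch of that last bullet - read the column in as `object`, then parse in a second pass (data and format are made up for illustration):

```python
import io

import pandas as pd

data = "ts_col,value\n01/12/21,1\n02/12/21,2\n"

# First pass: no date handling in read_csv, the column stays as strings
df = pd.read_csv(io.StringIO(data), dtype={'ts_col': 'object'})

# Second pass: convert with to_datetime, with full access to its keyword arguments
df['ts_col'] = pd.to_datetime(df['ts_col'], format='%d/%m/%y', utc=True)
```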
+1 to this. |
For use cases like the different-behaviour-per-column one, I'd rather tell users to handle that after `read_csv`. `read_csv` is way too complicated already. |
Agreed. I just checked, and if the format can be guessed, then plain `parse_dates` is already fast:

```python
In [2]: %%timeit
   ...: df = pd.read_csv(io.StringIO(data), parse_dates=['ts_col'])
   ...:
   ...:
18.2 ms ± 574 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

So, I'd suggest going further - just deprecate `date_parser`. Since PDEP4, the original issue which motivated adding the `date_parser` argument no longer applies. |
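For context, the `data` and `ts_col` in the timing above aren't shown in the comment; a sketch of the kind of setup presumably being measured:

```python
import io

import pandas as pd

# A CSV with a timestamp column in a guessable (ISO-like) format
dates = pd.date_range('1900', '2000', freq='D').strftime('%Y-%m-%d %H:%M:%S')
data = 'ts_col\n' + '\n'.join(dates) + '\n'

# Guessable format: parse_dates alone is enough, no date_parser needed
df = pd.read_csv(io.StringIO(data), parse_dates=['ts_col'])
```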
Leaving out non-automatic date parsing is something I'd normally be in favor of, but I should point out that manual date conversion gets pretty convoluted (in comparison to the normal case - I'm looking at this from the perspective of a new user or someone migrating code) when it comes to index columns and especially MultiIndex.

```python
# Roughly copied from the SO answer above
# For a MultiIndex
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col=['mydatetime', 'num'])
idx_mydatetime = df.index.get_level_values('mydatetime')
idx_num = df.index.get_level_values('num')
idx_mydatetime = pd.to_datetime(idx_mydatetime, exact=True, cache=True, format='%Y-%m-%d %H:%M:%S')
df.index = pd.MultiIndex.from_arrays([idx_mydatetime, idx_num])
```

There might be simpler ways of handling this in a "post-parsing" manner that I'm not aware of, but that's still roughly what I'd imagine index parsing would look like if `date_parser` went away. |
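One possibly simpler post-parsing variant (a sketch, assuming the hypothetical `date` and `time` columns hold e.g. `2023-01-01` and `12:00:00`): build the datetime before setting the index, so no index surgery is needed.

```python
import pandas as pd

# Read everything as plain columns first
df = pd.read_csv(infile)

# Combine and convert, then set the MultiIndex in one go
df['mydatetime'] = pd.to_datetime(df['date'] + ' ' + df['time'], format='%Y-%m-%d %H:%M:%S')
df = df.drop(columns=['date', 'time']).set_index(['mydatetime', 'num'])
```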
I think that code looks fine, and the case is rare enough that it doesn't need a special place in the documentation |
Time-series data with an unconventional/ambiguous date format and one or more extra key columns (sensor name, histogram bucket, dataset name if there are multiple reports in one file, etc.)? I personally see these quite a lot; ISO 8601 is unfortunately not as universal as I would like. This is CSV we're talking about - people can't even agree whether the "C" stands for "Tab", "Semi-Colon" or "Space". Date formats are anybody's guess. |
Yes, and if it's not a format which pandas can guess, then people can read in the column and then run `to_datetime` on it.

Do you have an example of something that people can currently do with `date_parser` that they couldn't do otherwise?

```python
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col=['mydatetime', 'num'])
```

doesn't use `date_parser`. |
On its own, yes, but I would imagine most people have it written like this instead:

```python
df = pd.read_csv(
    infile, parse_dates={'mydatetime': ['date', 'time']}, index_col=['mydatetime', 'num'],
    date_parser=lambda x: pd.to_datetime(x, format=timestamp_format)
)
```

...and converting it to work without `date_parser` would look like this:

```python
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col=['mydatetime', 'num'])
idx_mydatetime = df.index.get_level_values('mydatetime')
idx_num = df.index.get_level_values('num')
idx_mydatetime = pd.to_datetime(idx_mydatetime, format=timestamp_format)
df.index = pd.MultiIndex.from_arrays([idx_mydatetime, idx_num])
```

Meanwhile, if `date_format` accepted a per-column dict, it could look like:

```python
df = pd.read_csv(
    infile,
    parse_dates={'mydatetime': ['date', 'time']},
    date_format={'mydatetime': timestamp_format},
    index_col=['mydatetime', 'num']
)
```

Which would cover the vast majority of uses of `date_parser`. |
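For what it's worth, the `date_format` argument that eventually landed in pandas 2.0 does accept a dict of column name to format, so a per-column usage along these lines works (file name and formats taken from the earlier euro/us example):

```python
import pandas as pd

# date_format takes either a single format string or a dict of column -> format
df = pd.read_csv(
    'data.csv',
    parse_dates=['date_euro', 'date_us'],
    date_format={'date_euro': '%d/%m/%Y %H:%M:%S', 'date_us': '%m/%d/%Y %H:%M:%S'},
)
```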
Thanks - not sold on this to be honest. In pandas 2.0.0 most common date formats should be guessable, and if anyone has something really unusual and has a MultiIndex in which the date is one of the levels (which itself strikes me as exceedingly rare), then it seems fine to require them to add an extra few lines. I don't think all this complexity is warranted for such a rare use-case |
@MarcoGorelli I am struggling a bit to convert to the pandas 2.0 recommended way. I have the following (full code here):

```python
import datetime

import pandas as pd

def date_parse(value: str) -> datetime.datetime:
    return datetime.datetime.strptime(value.strip(), "%d/%m/%y")

df = pd.read_excel(
    filename,
    usecols="B:K",
    parse_dates=["Data Negócio"],
    date_parser=date_parse,
    skipfooter=4,
    skiprows=10,
)
```

In this case, my final dataframe has a column "Data Negócio" with type `datetime64[ns]`. I converted it to:

```python
import datetime

import pandas as pd

DATE_FORMAT = "%d/%m/%y"

df = pd.read_excel(
    filename,
    usecols="B:K",
    parse_dates=["Data Negócio"],
    date_format=DATE_FORMAT,
    skipfooter=4,
    skiprows=10,
)
```

My column now is an `object` and nothing is formatted. I get values such as ' 01/12/21'. Is it a bug, or what am I missing? |
hi @staticdev - could you share a reproducible example please? |
(and also open a new issue if this is a bug - thanks!) |
There doesn't seem to be a lot of documentation for users that have a working `date_parser` and now need to migrate off it. The issue I am having is that my `designation_date` column contains mixed formats. I am using a custom parser that handles those mixed formats (with UTC timestamps). |
You can do

```python
df = pd.read_csv(data, sep=",", parse_dates=['designation_date'], date_format='mixed')
```

or something like

```python
df = pd.read_csv(data, sep=",")
df['designation_date'] = pd.to_datetime(df['designation_date'], format='mixed', utc=True)
```
|
Thank you for this suggestion - the first one works for me. Now I'm just dealing with NaT values (#11953), which I might be able to deal with as-is. However, I do still think extra documentation, beyond just stating the alternative method within the FutureWarning, would be beneficial. I don't think the FutureWarning message needs to be modified, but there is very little guidance to be found online if your starting point is a working `date_parser` implementation. But I do assume it is probably way easier than I am thinking. |
sure, pull requests to improve the docs would be very welcome https://pandas.pydata.org/docs/dev/development/contributing_docstring.html |
Use `pd.read_csv(..., date_format="ISO8601")` to silence `UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.`. Xref pandas-dev/pandas#50601
current docs: it doesn't say how to declare the format, it doesn't say the type of the output, it doesn't say whether the result is timezone-aware or not, and it doesn't say what the underlying function is. Please don't deprecate functions until there is a valid alternative. |
It's the same as the `format` argument in `to_datetime` - a PR to clarify this would be welcome |
`date_parser` could be used to combine multiple columns into a single datetime (e.g. columns for 'year', 'month', 'day', 'hour', 'minute', 'second' could be turned into a single datetime column). Is this no longer possible? |
@brendan-m-murphy: I believe you can still do this with `parse_dates` given a list of column lists (e.g. `parse_dates=[['year', 'month', 'day']]`). |
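Alternatively, `pd.to_datetime` itself accepts a DataFrame of component columns, so the combination can be done after reading (a sketch with made-up data):

```python
import io

import pandas as pd

data = "year,month,day,hour,minute,second,value\n2023,1,2,3,4,5,10\n"
df = pd.read_csv(io.StringIO(data))

# to_datetime assembles one datetime column from the component columns
df['when'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute', 'second']])
df = df.drop(columns=['year', 'month', 'day', 'hour', 'minute', 'second'])
```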
Would be great if that new parser were able to convert POSIX times (epoch seconds). The referenced documentation for the `format` argument doesn't mention `%s`, even though the Linux strftime(3) manpage documents it (seconds since the Epoch), and SQLite seems to support it too: https://www.sqlite.org/lang_datefunc.html |
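As an aside, pandas can already handle epoch seconds in a second pass via the `unit` argument of `to_datetime` (a sketch with made-up data):

```python
import io

import pandas as pd

data = "ts,value\n1700000000,1\n1700000060,2\n"
df = pd.read_csv(io.StringIO(data))

# unit='s' interprets the integers as seconds since the Unix epoch
df['ts'] = pd.to_datetime(df['ts'], unit='s')
```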
@gsgxnet could you open a new issue please? |
I guess that's not going to hold true in the near future, since it looks like that feature is set to get deprecated in the next Pandas version, as per PR #56569 and issue #55569. |
TLDR: the conversation here goes on for a bit, but to summarise, the suggestion is:

- deprecate `date_parser`, because it always hurts performance (counter examples welcome!)
- add `date_format`, because that can boost performance
- anyone with more complex needs can read in their column as `object`, and then apply their parsing

Performance-wise, this would only be an improvement to the status quo
As far as I can tell, `date_parser` is a net negative and only ever slows things down:

- in the best case, it only results in a slight degradation
- parsing element-by-element is also slower than just using `.apply`
- in the worst case, it results in a 65x performance degradation, see #50586 (comment) (and this gets way worse for larger datasets)
My suggestion is:

- deprecate `date_parser`
- add `date_format`, which would actually deliver a performance improvement

That's over 3 times as fast!
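A sketch of the kind of before/after comparison being described (data and timings are illustrative; it assumes a pandas version where both arguments exist):

```python
import io
import timeit

import pandas as pd

dates = pd.date_range('1900', '2000', freq='D').strftime('%d/%m/%Y')
data = 'ts_col\n' + '\n'.join(dates) + '\n'

# Callback approach (the one proposed for deprecation)
t_parser = timeit.timeit(
    lambda: pd.read_csv(
        io.StringIO(data), parse_dates=['ts_col'],
        date_parser=lambda x: pd.to_datetime(x, format='%d/%m/%Y'),
    ),
    number=10,
)

# Format-string approach (the one proposed for addition)
t_format = timeit.timeit(
    lambda: pd.read_csv(io.StringIO(data), parse_dates=['ts_col'], date_format='%d/%m/%Y'),
    number=10,
)

print(f"date_parser: {t_parser:.3f}s, date_format: {t_format:.3f}s")
```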