DOC: Recommended use of read_csv's date_parser parameter is very slow #35296
Comments
@sm-Fifteen we'd happily accept improvements to the docs. I'm not sure that code sample is appropriate for the user guide (perhaps in the cookbook though). Will you submit a pull request?
Can't reproduce this:

```python
In [1]: format = '%Y-%d-%m %H:%M:%S%z'

In [2]: dates = pd.date_range('1900', '2000').tz_localize('+01:00').strftime(format).tolist()

In [3]: data = 'date\n' + '\n'.join(dates) + '\n'

In [4]: pd.read_csv(io.StringIO(data), date_parser=lambda x: pd.to_datetime(x, format=format), parse_dates=['date'])
Out[4]:
                            date
0      1900-01-01 00:00:00+01:00
1      1900-01-02 00:00:00+01:00
2      1900-01-03 00:00:00+01:00
3      1900-01-04 00:00:00+01:00
4      1900-01-05 00:00:00+01:00
...                          ...
36520  1999-12-28 00:00:00+01:00
36521  1999-12-29 00:00:00+01:00
36522  1999-12-30 00:00:00+01:00
36523  1999-12-31 00:00:00+01:00
36524  2000-01-01 00:00:00+01:00

[36525 rows x 1 columns]

In [5]: %%timeit
   ...: pd.read_csv(io.StringIO(data), date_parser=lambda x: pd.to_datetime(x, format=format), parse_dates=['date'])
   ...:
273 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %%timeit
   ...: df = pd.read_csv(io.StringIO(data))
   ...: df['date'] = pd.to_datetime(df['date'], format=format)
   ...:
240 ms ± 4.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
@MarcoGorelli I would like to work on this, if possible?
go ahead - I'd suggest spending some time reading through PDEP4 to understand the change, and then removing the
take
@MarcoGorelli should I be directly removing the
@MarcoGorelli: The difference is usually pretty small, but there are stress cases where it can get pretty significant. Based on your example code:

```python
import pandas as pd
from tempfile import TemporaryFile

timestamp_format = '%Y-%d-%m %H:%M:%S%z'
date_format = '%Y-%d-%m'
time_format = '%H:%M:%S'
datetime_format = '%Y-%d-%m %H:%M:%S'

csv_file = TemporaryFile()

date_index = pd.date_range(start='1900', end='2000', freq='12H', tz='Europe/Paris')
dates_df = date_index.strftime(timestamp_format).to_frame(name='ts_col')
dates_df['date_only'] = date_index.strftime(date_format)
dates_df['time_only'] = date_index.strftime(time_format)
dates_df.to_csv(csv_file, header=True)
```

Timezone parsing is one such case:

```python
%%timeit
csv_file.seek(0)
pd.read_csv(csv_file, usecols=['ts_col'], date_parser=lambda x: pd.to_datetime(x, format=timestamp_format), parse_dates=['ts_col'])
```

```python
%%timeit
csv_file.seek(0)
new_df = pd.read_csv(csv_file, usecols=['ts_col'])
pd.to_datetime(new_df['ts_col'], exact=True, cache=True, format=timestamp_format)
```

Column merging is another:

```python
%%timeit
csv_file.seek(0)
pd.read_csv(csv_file, usecols=['date_only', 'time_only'], date_parser=lambda x: pd.to_datetime(x, format=datetime_format), parse_dates=[['date_only', 'time_only']])
```

```python
%%timeit
csv_file.seek(0)
new_df = pd.read_csv(csv_file, usecols=['date_only', 'time_only'], parse_dates=[['date_only', 'time_only']])
pd.to_datetime(new_df['date_only_time_only'], exact=True, cache=True, format=datetime_format)
```

(I just realized I ran these on pandas 1.4.3 instead of 1.5.2, but I would imagine these cases are niche enough that the performance wouldn't have improved by that much over the six months between those two releases.)
thanks @sm-Fifteen - I tried running that but got an error, could you please make the example reproducible?
actually, it reproduces with just

thanks for the report, will take a look
OK, got it. In this case, it's due to #40111. So, in pandas/pandas/io/parsers/base_parser.py (lines 1132 to 1147 in 3bc2203), the following happens:

Gosh, I hate all these fallbacks... TBH I think this warrants an API change, and to make the behaviour:
So you're saying that
yeah - so #50586 would solve the performance issue in your particular case, but I'd still like to make an API change, to be honest
I should probably close this by now, since date_parser has been deprecated and replaced with parameters that avoid this performance issue entirely. |
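For anyone landing here later, a minimal sketch of the replacement API (assuming pandas 2.0+, where `read_csv` gained a `date_format` keyword; the sample data below is invented, reusing the non-standard year-day-month format from this thread):

```python
import io

import pandas as pd

# Invented sample: year-day-month ordering with a UTC offset, as above.
data = "date\n2000-15-01 08:30:00+0100\n2000-16-01 09:45:00+0100\n"

# pandas >= 2.0: pass the format string via date_format instead of a
# date_parser callable; the column is then parsed in one vectorized pass.
df = pd.read_csv(
    io.StringIO(data),
    parse_dates=["date"],
    date_format="%Y-%d-%m %H:%M:%S%z",
)
```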
Location of the documentation
The Date parsing functions section of the CSV file parsing section, specifically the recommended use of `date_parser` in cases where the user knows in advance what format the date will be in and/or that format is non-standard and not supported by pandas.

Demonstration of the problem
Based on what the documentation recommends, I tried something like this:
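The original snippet isn't preserved above, but a hedged sketch of the two patterns being compared (file contents, column name and format are invented here; the slow variant is the one the docs recommended at the time):

```python
import io

import pandas as pd

fmt = "%Y-%d-%m %H:%M:%S"  # non-standard year-day-month ordering

# Stand-in for the real CSV file from the report.
stamps = pd.date_range("2000-01-01", periods=500, freq="min").strftime(fmt)
data = "ts_col,value\n" + "\n".join(f"{s},1" for s in stamps) + "\n"

# Documented pattern: a date_parser callable wrapping to_datetime. The
# callable can end up invoked per value rather than once per column.
df_slow = pd.read_csv(
    io.StringIO(data),
    parse_dates=["ts_col"],
    date_parser=lambda x: pd.to_datetime(x, format=fmt),
)

# Alternative: read the column as plain strings, then convert it in a
# single vectorized call after parsing finishes.
df_fast = pd.read_csv(io.StringIO(data))
df_fast["ts_col"] = pd.to_datetime(df_fast["ts_col"], format=fmt)
```

Both produce the same datetime64 column; only the second avoids the per-value call overhead.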
For testing, I limited the amount of rows parsed to 500 time samples (188000 rows/timestamps), about 3.5% of the total file, which takes a surprising 43 seconds to process, mostly due to datetime parsing according to `cProfile`:

Using `date_parser` like this simply does not scale and blocks the entire CSV decoding process. Meanwhile, here's an alternative version that bypasses `date_parser` and converts the datetime column in a single batch after parsing finishes:

This one completes in 6 seconds despite running on 20 times as much data (10000 samples instead of 500, notice the change in `nrows`) as the first example.

Documentation problem
Usage of the `date_parser` parameter tends to be a huge performance cliff, given how it appears to run in a row-wise fashion (if the profiler's `ncalls` metric is to be believed), something the surrounding documentation heavily stresses as well:

The speedup described there isn't from pandas having some sort of inferred-date fast path, but simply because the `date_parser` callback is being called in an extremely inefficient way for most workloads. There is a note above that section that could be considered as hinting at this:

All of this conflicting advice makes the current documentation fairly misleading on that topic, and people who don't profile their code might be led to believe that this is just a problem with pandas being too slow to handle their CSVs, or something along those lines.
Suggested fix for documentation
One subsection of the Date Handling section (either appended to "Date parsing functions" or under a new subtitle) should give concrete examples on how to deal with files that contain non-standard or strange timestamp formats.
This gives clear instructions for users dealing with a use case that's probably not all that uncommon, mentions the alternative and a reason why it is preferable, and gives a code example. The code example itself also shows the interaction with `parse_dates` when combining columns and using manually-specified date formats (the doc does not otherwise mention that combining columns results in a column of space-separated values). I'm not certain how to change the above code example to make it work correctly for data columns, index columns and multi-indexes alike, if such a thing is possible.
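One candidate shape for such a doc example, sketched here with invented column names and an invented year-day-month format:

```python
import io

import pandas as pd

data = "date_only,time_only\n2000-15-01,08:30:00\n2000-16-01,09:45:00\n"

# parse_dates with a nested list makes read_csv concatenate the named
# columns into one space-separated string column; since the format here is
# non-standard, the automatic parse fails and the strings pass through.
df = pd.read_csv(io.StringIO(data), parse_dates=[["date_only", "time_only"]])

# The merged column can then be converted in a single vectorized call
# with the explicit format.
df["date_only_time_only"] = pd.to_datetime(
    df["date_only_time_only"], format="%Y-%d-%m %H:%M:%S"
)
```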