
DOC: Recommended use of read_csv's date_parser parameter is very slow #35296

Closed
sm-Fifteen opened this issue Jul 15, 2020 · 13 comments · Fixed by #50334
Assignees
Labels
Datetime Datetime data dtype Docs good first issue IO CSV read_csv, to_csv Performance Memory or execution speed performance

Comments

@sm-Fifteen

Location of the documentation

The Date parsing functions section of the CSV file parsing documentation, specifically the recommended use of date_parser in cases where the user knows in advance what format the date will be in, and/or that format is non-standard and not supported by pandas.

If you know the format, use pd.to_datetime(): date_parser=lambda x: pd.to_datetime(x, format=...).

Demonstration of the problem

I'm trying to parse a CSV file that contains signal power data keyed by timestamp-frequency pairs in a vertical format (376 frequencies by 14300 samples, so about 5.3 million rows, with only one timestamp per row) with the intent of pivoting it into a columnar format. The time format is rather unpleasant to deal with and is stored as 2 separate columns, but pandas' CSV parser has all the tools I need to reconstruct the timestamps correctly, so I'm not worried there.

Based on the documentation, I tried something like this:

import pandas as pd

def read_csv_slow(in_path: str):
    df = pd.read_csv(
        in_path,
        nrows=376 * 500,
        usecols=['Date', 'Time', 'Frequency', 'Power'],
        index_col=['Date_Time', 'Frequency'],
        dtype={'Frequency': 'int32', 'Power': 'float32'},
        parse_dates=[['Date', 'Time']],
        # Extremely slow
        date_parser=lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M:%S:%f'),  # 06/17/2020 + 11:47:22:746
    )
    return df

For testing, I limited the number of rows parsed to 500 time samples (188,000 rows/timestamps), about 3.5% of the total file, which takes a surprising 43 seconds to process, mostly spent in datetime parsing according to cProfile:

> python -m cProfile -s tottime .\read_csv_slow.py
         61508066 function calls (61501857 primitive calls) in 43.756 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   188000    4.507    0.000    5.728    0.000 {pandas._libs.tslibs.strptime.array_strptime}
 15247920    3.286    0.000    5.884    0.000 {built-in method builtins.isinstance}
   188000    2.358    0.000    2.358    0.000 {pandas._libs.tslibs.parsing._format_is_iso}
  5264920    1.851    0.000    2.432    0.000 generic.py:10(_check)
  1316199    1.828    0.000    3.879    0.000 common.py:1708(_is_dtype_type)
   188001    1.650    0.000   34.982    0.000 datetimes.py:246(_convert_listlike_datetimes)
   376150    1.400    0.000    1.400    0.000 {built-in method numpy.array}
   188006    1.348    0.000    8.870    0.000 datetimes.py:1678(sequence_to_dt64ns)
   188001    1.292    0.000   42.474    0.000 datetimes.py:530(to_datetime)

Using date_parser like this simply does not scale and blocks the entire CSV decoding process.


Meanwhile, here's an alternative version that bypasses date_parser and converts the datetime column in a single batch after parsing finishes:

import pandas as pd

def read_mxflex_csv_fast(in_path: str):
    df = pd.read_csv(
        in_path,
        nrows=376 * 10000,
        usecols=['Date', 'Time', 'Frequency', 'Power'],
        index_col=['Date_Time', 'Frequency'],
        dtype={'Frequency': 'int32', 'Power': 'float32'},
        parse_dates=[['Date', 'Time']],
    )

    date_data = df.index.get_level_values(0)
    date_idx = pd.to_datetime(date_data, exact=True, cache=True, format='%m/%d/%Y %H:%M:%S:%f')  # 06/17/2020 + 11:47:22:746
    freq_idx = df.index.get_level_values(1)
    df.index = pd.MultiIndex.from_arrays([date_idx, freq_idx])
    return df

This one completes in 6 seconds despite running on 20 times as much data (10,000 samples instead of 500; note the change in nrows) as the first example.

> python -m cProfile -s tottime .\read_csv_fast.py
         236167 function calls (229787 primitive calls) in 5.957 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.353    2.353    2.412    2.412 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
       28    0.477    0.017    0.477    0.017 {pandas._libs.lib.infer_dtype}
        2    0.442    0.221    0.442    0.221 {method 'unique' of 'pandas._libs.hashtable.StringHashTable' objects}
        1    0.344    0.344    0.344    0.344 {pandas._libs.tslibs.parsing._concat_date_cols}
        2    0.342    0.171    0.342    0.171 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects}
        1    0.235    0.235    0.235    0.235 {method 'factorize' of 'pandas._libs.hashtable.StringHashTable' objects}
       60    0.179    0.003    0.182    0.003 {built-in method _imp.create_dynamic}
     1667    0.177    0.000    0.177    0.000 {built-in method nt.stat}
        3    0.136    0.045    0.136    0.045 {method 'factorize' of 'pandas._libs.hashtable.Int64HashTable' objects}

Documentation problem

Usage of the date_parser parameter tends to be a huge performance cliff, given how it appears to run in a row-wise fashion (if the profiler's ncalls metric is to be believed), something the surrounding documentation heavily stresses as well:

If you have parse_dates enabled for some or all of your columns, and your datetime strings are all formatted the same way, you may get a large speed up by setting infer_datetime_format=True. If set, pandas will attempt to guess the format of your datetime strings, and then use a faster means of parsing the strings. 5-10x parsing speeds have been observed.

The speedup described there isn't from pandas having some sort of inferred-date fast path, but simply because the date_parser callback is being called in an extremely inefficient way for most workloads. There is a note above that section that could be read as hinting at this:

If a column or index contains an unparsable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use to_datetime() after pd.read_csv.

All of this conflicting advice makes the current documentation fairly misleading on the topic, and people who don't profile their code might be led to believe that pandas is simply too slow to handle their CSVs, or something along those lines.

Suggested fix for documentation

One subsection of the Date Handling section (either appended to "Date parsing functions" or under a new subtitle) should give concrete examples on how to deal with files that contain non-standard or strange timestamp formats.

If your CSV file contains columns with timestamps in an unconventional format, it is usually significantly faster to parse the file completely while leaving your dates as text (rather than specifying a date_parser callback when opening the file) and then convert the entire column in a single operation.

print(open('buzz.csv').read())
Date,Time,Value
11261999,13h57m58s.345,0.0100
11261999,13h57m59s.999,-0.5900

ts_format = "%m%d%Y %Hh%Mm%Ss.%f"  # 11261999 13h57m59s.999

df = pd.read_csv('buzz.csv', parse_dates=[['Date', 'Time']], index_col=['Date_Time'], infer_datetime_format=False)
df.index = pd.to_datetime(df.index, exact=True, cache=True, format=ts_format)

This gives clear instructions for users dealing with a use case that's probably not all that uncommon, mentions the alternative and why it is preferable, and provides a code example. The code example also shows how parse_dates interacts with manually-specified date formats when combining columns (the documentation does not otherwise mention that this results in a column of space-separated values). I'm not certain how to change the above code example so that it works correctly for data columns, index columns and multi-indexes alike, if such a thing is possible.

@sm-Fifteen sm-Fifteen added Docs Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 15, 2020
@TomAugspurger
Contributor

@sm-Fifteen we'd happily accept improvements to the docs. I'm not sure that code sample is appropriate for the user guide (perhaps in the cookbook though). Will you submit a pull request?

@TomAugspurger TomAugspurger added IO CSV read_csv, to_csv Performance Memory or execution speed performance and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 4, 2020
@MarcoGorelli MarcoGorelli added the Datetime Datetime data dtype label Dec 17, 2022
@MarcoGorelli
Member

MarcoGorelli commented Dec 17, 2022

Can't reproduce this:

In [1]: format = '%Y-%d-%m %H:%M:%S%z'

In [2]: dates = pd.date_range('1900', '2000').tz_localize('+01:00').strftime(format).tolist()

In [3]: data = 'date\n' + '\n'.join(dates) + '\n'

In [4]: pd.read_csv(io.StringIO(data), date_parser=lambda x: pd.to_datetime(x, format=format), parse_dates=['date'])
Out[4]: 
                           date
0     1900-01-01 00:00:00+01:00
1     1900-01-02 00:00:00+01:00
2     1900-01-03 00:00:00+01:00
3     1900-01-04 00:00:00+01:00
4     1900-01-05 00:00:00+01:00
...                         ...
36520 1999-12-28 00:00:00+01:00
36521 1999-12-29 00:00:00+01:00
36522 1999-12-30 00:00:00+01:00
36523 1999-12-31 00:00:00+01:00
36524 2000-01-01 00:00:00+01:00

[36525 rows x 1 columns]

In [5]: %%timeit
   ...: pd.read_csv(io.StringIO(data), date_parser=lambda x: pd.to_datetime(x, format=format), parse_dates=['date'])
   ...: 
   ...: 
273 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %%timeit
   ...: df = pd.read_csv(io.StringIO(data))
   ...: df['date'] = pd.to_datetime(df['date'], format=format)
   ...: 
   ...: 
240 ms ± 4.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

infer_datetime_format is obsolete as of PDEP4 though, so we should clarify the docs, thanks for the issue
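For context, under PDEP4 (pandas >= 2.0) `to_datetime` infers the format from the first non-null element and applies it strictly to the whole column by default, which is why the flag became redundant. A minimal sketch of that default behaviour:

```python
import pandas as pd

# With consistently-formatted strings, no format argument (and no
# infer_datetime_format flag) is needed: pandas infers the format once
# and parses the whole column with it.
result = pd.to_datetime(["2020-12-01", "2020-12-02"])
```
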

@ShashwatAgrawal20
Contributor

@MarcoGorelli I'd like to work on this, if possible?

@MarcoGorelli
Member

go ahead - I'd suggest spending some time reading through PDEP4 to understand the change, and then removing the infer_datetime_format suggestion here and making sure the rest still makes sense

@ShashwatAgrawal20
Contributor

take

@ShashwatAgrawal20
Contributor

@MarcoGorelli should I remove infer_datetime_format directly, or show that it's obsolete?

@sm-Fifteen
Author

Can't reproduce this:

infer_datetime_format is obsolete as of PDEP4 though, so we should clarify the docs, thanks for the issue

@MarcoGorelli: The difference is usually pretty small, but there are stress cases where it can get pretty significant. Based on your example code:

import pandas as pd
from io import StringIO
from tempfile import TemporaryFile

timestamp_format = '%Y-%d-%m %H:%M:%S%z'

date_format = '%Y-%d-%m'
time_format = '%H:%M:%S'
datetime_format = '%Y-%d-%m %H:%M:%S'

csv_file: TemporaryFile
if csv_file: csv_file.close()
csv_file = TemporaryFile()

date_index = pd.date_range(start='1900', end='2000', freq='12H', tz='Europe/Paris')

dates_df = date_index.strftime(timestamp_format).to_frame(name='ts_col')
dates_df['date_only'] = date_index.strftime(date_format)
dates_df['time_only'] = date_index.strftime(time_format)
dates_df.to_csv(csv_file, header=True)

Timezone parsing is one such case:

%%timeit
csv_file.seek(0)
pd.read_csv(csv_file, usecols=['ts_col'], date_parser=lambda x: pd.to_datetime(x, format=timestamp_format), parse_dates=['ts_col'])

51.9 s ± 653 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
csv_file.seek(0)
new_df = pd.read_csv(csv_file, usecols=['ts_col'])
pd.to_datetime(new_df['ts_col'], exact=True, cache=True, format=timestamp_format)

1.78 s ± 150 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Column merging is another:

%%timeit
csv_file.seek(0)
pd.read_csv(csv_file, usecols=['date_only', 'time_only'], date_parser=lambda x: pd.to_datetime(x, format=datetime_format), parse_dates=[['date_only', 'time_only']])

13.1 s ± 344 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
csv_file.seek(0)
new_df = pd.read_csv(csv_file, usecols=['date_only', 'time_only'], parse_dates=[['date_only', 'time_only']])
pd.to_datetime(new_df['date_only_time_only'], exact=True, cache=True, format=datetime_format)

338 ms ± 27.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(Just realized I ran these on pandas 1.4.3 instead of 1.5.2, but I would imagine these cases are niche enough that performance wouldn't have improved much over the six months between those two releases.)

@MarcoGorelli
Member

thanks @sm-Fifteen - I tried running that but got

Traceback (most recent call last):
  File "f.py", line 18, in <module>
    if csv_file: csv_file.close()
NameError: name 'csv_file' is not defined

Could you please make the example reproducible?

@MarcoGorelli
Member

actually, it reproduces with just 'tmp.csv'

thanks for the report, will take a look

@MarcoGorelli MarcoGorelli reopened this Jan 4, 2023
@MarcoGorelli
Member

OK, got it. In this case, it's due to #40111

So, in

try:
    result = tools.to_datetime(
        date_parser(*date_cols), errors="ignore", cache=cache_dates
    )
    if isinstance(result, datetime.datetime):
        raise Exception("scalar parser")
    return result
except Exception:
    return tools.to_datetime(
        parsing.try_parse_dates(
            parsing.concat_date_cols(date_cols),
            parser=date_parser,
            dayfirst=dayfirst,
        ),
        errors="ignore",
    )

the following happens:

  • date_parser(*date_cols) succeeds, converting the input to an Index of (mixed-offset) Timestamps
  • to_datetime on that Index fails, because it's not idempotent 🤦 (see linked issue)
  • the except block is reached, and so parsing happens row-by-row, as documented:

Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.
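The fallback chain the docstring describes can be sketched in plain Python (this is an illustrative mock, not pandas source; the name `call_date_parser` is made up here):

```python
import datetime

def call_date_parser(date_parser, date_cols):
    """Mimic the documented fallbacks read_csv applies to date_parser."""
    # 1) Pass the column arrays directly (the fast, vectorized path).
    try:
        result = date_parser(*date_cols)
        if isinstance(result, datetime.datetime):
            # A scalar result means the parser only handles single rows.
            raise TypeError("scalar parser")
        return result
    except Exception:
        pass
    # 2) + 3) Concatenate the columns row-wise and call the parser once
    # per row -- the slow path this issue is about.
    rows = (" ".join(vals) for vals in zip(*date_cols))
    return [date_parser(row) for row in rows]

# A parser that accepts whole columns stays on the fast path:
fast = call_date_parser(lambda col: [s.upper() for s in col], (["a", "b"],))

# A parser that chokes on array input silently falls back to row-by-row calls:
slow = call_date_parser(lambda s: s.upper(), (["a", "b"], ["c", "d"]))
```

The point of the sketch is that any exception on the vectorized attempt, including the non-idempotent to_datetime call from #40111, silently demotes parsing to the per-row path.
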

Gosh, I hate all these fallbacks... TBH I think this warrants an API change, making the behaviour:

  • if no date_parser is provided, use to_datetime with errors='ignore'
  • if a date_parser is provided, just use that. No try/except, no magic fallbacks, if a user wants row-by-row parsing they're free to implement that within their parser

@sm-Fifteen
Author

sm-Fifteen commented Jan 5, 2023

So you're saying that x in date_parser=lambda x: pd.to_datetime(x, format=datetime_format) is actually supposed to be a pandas Series object representing the full column specified by parse_dates, but that the "magic fallback" thing causes it to be called once per row with x as a string value instead? That would explain the massive performance cliff observed. From my initial observations, I thought it was just always being called row-wise.

@MarcoGorelli
Member

yeah - so #50586 would solve the performance issue in your particular case

but I'd still like to make an API change to be honest

@sm-Fifteen
Author

I should probably close this by now, since date_parser has been deprecated and replaced with parameters that avoid this performance issue entirely.
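For anyone landing here later, a minimal sketch of the replacement workflow (the CSV contents below are invented for illustration; the format string matches the one from the issue):

```python
import pandas as pd
from io import StringIO

csv_data = (
    "Date_Time,Power\n"
    "06/17/2020 11:47:22:746,-42.5\n"
    "06/17/2020 11:47:23:746,-41.0\n"
)

# Read the timestamps as text, then convert the whole column in one
# vectorized call. (On pandas >= 2.0 the same format can instead be
# passed at read time via read_csv's date_format parameter, which
# replaces the deprecated date_parser callback.)
df = pd.read_csv(StringIO(csv_data))
df["Date_Time"] = pd.to_datetime(df["Date_Time"], format="%m/%d/%Y %H:%M:%S:%f")
```
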
