
DOC: Recommended use of read_csv's date_parser parameter is very slow #35296

Closed
sm-Fifteen opened this issue Jul 15, 2020 · 13 comments · Fixed by #50334
Assignees
Labels
Datetime Datetime data dtype Docs good first issue IO CSV read_csv, to_csv Performance Memory or execution speed performance

Comments

@sm-Fifteen

Location of the documentation

The Date parsing functions section of the CSV file parsing documentation, specifically the recommended use of date_parser in cases where the user knows in advance what format the date will be in, and/or that format is non-standard and not supported by pandas.

If you know the format, use pd.to_datetime(): date_parser=lambda x: pd.to_datetime(x, format=...).

Demonstration of the problem

I'm trying to parse a CSV file that contains signal power data keyed by timestamp-frequency pairs in a vertical format (376 frequencies by 14300 samples, so about 5.3 million rows, with only one timestamp per row) with the intent of pivoting it into a columnar format. The time format is rather unpleasant to deal with and is stored as 2 separate columns, but pandas' CSV parser has all the tools I need to reconstruct the timestamps correctly, so I'm not worried there.

Based on the documentation, I tried something like this:

import pandas as pd

def read_csv_slow(in_path: str):
    df = pd.read_csv(
        in_path,
        nrows=376 * 500,
        usecols=['Date', 'Time', 'Frequency', 'Power'],
        index_col=['Date_Time', 'Frequency'],
        dtype={'Frequency': 'int32', 'Power': 'float32'},
        parse_dates=[['Date', 'Time']],
        # Extremely slow
        date_parser=lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M:%S:%f'),  # 06/17/2020 + 11:47:22:746
    )
    return df

For testing, I limited the number of rows parsed to 500 time samples (188,000 rows/timestamps), about 3.5% of the total file, which takes a surprising 43 seconds to process, mostly spent in datetime parsing according to cProfile:

> python -m cProfile -s tottime .\read_csv_slow.py
         61508066 function calls (61501857 primitive calls) in 43.756 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   188000    4.507    0.000    5.728    0.000 {pandas._libs.tslibs.strptime.array_strptime}
 15247920    3.286    0.000    5.884    0.000 {built-in method builtins.isinstance}
   188000    2.358    0.000    2.358    0.000 {pandas._libs.tslibs.parsing._format_is_iso}
  5264920    1.851    0.000    2.432    0.000 generic.py:10(_check)
  1316199    1.828    0.000    3.879    0.000 common.py:1708(_is_dtype_type)
   188001    1.650    0.000   34.982    0.000 datetimes.py:246(_convert_listlike_datetimes)
   376150    1.400    0.000    1.400    0.000 {built-in method numpy.array}
   188006    1.348    0.000    8.870    0.000 datetimes.py:1678(sequence_to_dt64ns)
   188001    1.292    0.000   42.474    0.000 datetimes.py:530(to_datetime)

Using date_parser like this simply does not scale and blocks the entire CSV decoding process.


Meanwhile, here's an alternative version that bypasses date_parser and converts the datetime column in a single batch after parsing finishes:

import pandas as pd

def read_mxflex_csv_fast(in_path: str):
    df = pd.read_csv(
        in_path,
        nrows=376 * 10000,
        usecols=['Date', 'Time', 'Frequency', 'Power'],
        index_col=['Date_Time', 'Frequency'],
        dtype={'Frequency': 'int32', 'Power': 'float32'},
        parse_dates=[['Date', 'Time']],
    )

    date_data = df.index.get_level_values(0)
    date_idx = pd.to_datetime(date_data, exact=True, cache=True, format='%m/%d/%Y %H:%M:%S:%f')  # 06/17/2020 + 11:47:22:746
    freq_idx = df.index.get_level_values(1)
    df.index = pd.MultiIndex.from_arrays([date_idx, freq_idx])
    return df

This one completes in 6 seconds despite running on 20 times as much data (10,000 samples instead of 500; note the change in nrows) as the first example.

> python -m cProfile -s tottime .\read_csv_fast.py
         236167 function calls (229787 primitive calls) in 5.957 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.353    2.353    2.412    2.412 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
       28    0.477    0.017    0.477    0.017 {pandas._libs.lib.infer_dtype}
        2    0.442    0.221    0.442    0.221 {method 'unique' of 'pandas._libs.hashtable.StringHashTable' objects}
        1    0.344    0.344    0.344    0.344 {pandas._libs.tslibs.parsing._concat_date_cols}
        2    0.342    0.171    0.342    0.171 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects}
        1    0.235    0.235    0.235    0.235 {method 'factorize' of 'pandas._libs.hashtable.StringHashTable' objects}
       60    0.179    0.003    0.182    0.003 {built-in method _imp.create_dynamic}
     1667    0.177    0.000    0.177    0.000 {built-in method nt.stat}
        3    0.136    0.045    0.136    0.045 {method 'factorize' of 'pandas._libs.hashtable.Int64HashTable' objects}

Documentation problem

Usage of the date_parser parameter tends to be a huge performance cliff, given how it appears to run in a row-wise fashion (if the profiler's ncalls metric is to be believed), something the surrounding documentation heavily stresses as well:

If you have parse_dates enabled for some or all of your columns, and your datetime strings are all formatted the same way, you may get a large speed up by setting infer_datetime_format=True. If set, pandas will attempt to guess the format of your datetime strings, and then use a faster means of parsing the strings. 5-10x parsing speeds have been observed.

The speedup described there isn't from pandas having some sort of inferred-date fast path, but simply because the date_parser callback is being called in an extremely inefficient way for most workloads. There is a note above that section that could be read as hinting at this:

If a column or index contains an unparsable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use to_datetime() after pd.read_csv.

All of this conflicting advice makes the current documentation fairly misleading on the topic, and people who don't profile their code might be led to believe that pandas is simply too slow to handle their CSVs, or something along those lines.

Suggested fix for documentation

One subsection of the Date Handling section (either appended to "Date parsing functions" or under a new subtitle) should give concrete examples on how to deal with files that contain non-standard or strange timestamp formats.

If your CSV file contains columns with timestamps in an unconventional format, it is usually significantly faster to parse the file completely while leaving your dates as text (rather than specifying a date_parser callback when opening the file) and then convert the entire column in a single operation.

print(open('buzz.csv').read())
Date,Time,Value
11261999,13h57m58s.345,0.0100
11261999,13h57m59s.999,-0.5900

ts_format = "%m%d%Y %Hh%Mm%Ss.%f"  # 11261999 13h57m59s.999

df = pd.read_csv('buzz.csv', parse_dates=[['Date', 'Time']], index_col=['Date_Time'], infer_datetime_format=False)
df.index = pd.to_datetime(df.index, exact=True, cache=True, format=ts_format)

This gives clear instructions for users dealing with a use case that's probably not all that uncommon, mentions the alternative and why it is preferable, and provides a code example. The code example also shows how parse_dates interacts with manually-specified date formats when combining columns (the documentation does not otherwise mention that this results in a column of space-separated values). I'm not certain how to change the above code example so that it works correctly for data columns, index columns and multi-indexes alike, if such a thing is possible.

@sm-Fifteen sm-Fifteen added Docs Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 15, 2020
@TomAugspurger
Contributor

@sm-Fifteen we'd happily accept improvements to the docs. I'm not sure that code sample is appropriate for the user guide (perhaps in the cookbook though). Will you submit a pull request?

@TomAugspurger TomAugspurger added IO CSV read_csv, to_csv Performance Memory or execution speed performance and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 4, 2020
@MarcoGorelli MarcoGorelli added the Datetime Datetime data dtype label Dec 17, 2022
@MarcoGorelli
Member

MarcoGorelli commented Dec 17, 2022

Can't reproduce this:

In [1]: format = '%Y-%d-%m %H:%M:%S%z'

In [2]: dates = pd.date_range('1900', '2000').tz_localize('+01:00').strftime(format).tolist()

In [3]: data = 'date\n' + '\n'.join(dates) + '\n'

In [4]: pd.read_csv(io.StringIO(data), date_parser=lambda x: pd.to_datetime(x, format=format), parse_dates=['date'])
Out[4]: 
                           date
0     1900-01-01 00:00:00+01:00
1     1900-01-02 00:00:00+01:00
2     1900-01-03 00:00:00+01:00
3     1900-01-04 00:00:00+01:00
4     1900-01-05 00:00:00+01:00
...                         ...
36520 1999-12-28 00:00:00+01:00
36521 1999-12-29 00:00:00+01:00
36522 1999-12-30 00:00:00+01:00
36523 1999-12-31 00:00:00+01:00
36524 2000-01-01 00:00:00+01:00

[36525 rows x 1 columns]

In [5]: %%timeit
   ...: pd.read_csv(io.StringIO(data), date_parser=lambda x: pd.to_datetime(x, format=format), parse_dates=['date'])
   ...: 
   ...: 
273 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %%timeit
   ...: df = pd.read_csv(io.StringIO(data))
   ...: df['date'] = pd.to_datetime(df['date'], format=format)
   ...: 
   ...: 
240 ms ± 4.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

infer_datetime_format is obsolete as of PDEP4 though, so we should clarify the docs, thanks for the issue
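For context, under PDEP4 (pandas >= 2.0) `to_datetime` infers the format from the first non-null element and applies it strictly to the whole column by default, which is why the flag became redundant. A minimal sketch of that default behaviour:

```python
import pandas as pd

# With consistently-formatted strings, no format argument (and no
# infer_datetime_format flag) is needed: pandas infers the format once
# and parses the whole column with it.
result = pd.to_datetime(["2020-12-01", "2020-12-02"])
```
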

@ShashwatAgrawal20
Contributor

@MarcoGorelli I'd like to work on this, if possible?

@MarcoGorelli
Member

go ahead - I'd suggest spending some time reading through PDEP4 to understand the change, and then removing the infer_datetime_format suggestion here and making sure the rest still makes sense

@ShashwatAgrawal20
Contributor

take

@ShashwatAgrawal20
Contributor

@MarcoGorelli should I remove infer_datetime_format directly, or show that it's obsolete?

@sm-Fifteen
Author

Can't reproduce this:

infer_datetime_format is obsolete as of PDEP4 though, so we should clarify the docs, thanks for the issue

@MarcoGorelli: The difference is usually pretty small, but there are stress cases where it can get pretty significant. Based on your example code:

import pandas as pd
from io import StringIO
from tempfile import TemporaryFile

timestamp_format = '%Y-%d-%m %H:%M:%S%z'

date_format = '%Y-%d-%m'
time_format = '%H:%M:%S'
datetime_format = '%Y-%d-%m %H:%M:%S'

csv_file: TemporaryFile
if csv_file: csv_file.close()
csv_file = TemporaryFile()

date_index = pd.date_range(start='1900', end='2000', freq='12H', tz='Europe/Paris')

dates_df = date_index.strftime(timestamp_format).to_frame(name='ts_col')
dates_df['date_only'] = date_index.strftime(date_format)
dates_df['time_only'] = date_index.strftime(time_format)
dates_df.to_csv(csv_file, header=True)

Timezone parsing is one such case:

%%timeit
csv_file.seek(0)
pd.read_csv(csv_file, usecols=['ts_col'], date_parser=lambda x: pd.to_datetime(x, format=timestamp_format), parse_dates=['ts_col'])

51.9 s ± 653 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
csv_file.seek(0)
new_df = pd.read_csv(csv_file, usecols=['ts_col'])
pd.to_datetime(new_df['ts_col'], exact=True, cache=True, format=timestamp_format)

1.78 s ± 150 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Column merging is another:

%%timeit
csv_file.seek(0)
pd.read_csv(csv_file, usecols=['date_only', 'time_only'], date_parser=lambda x: pd.to_datetime(x, format=datetime_format), parse_dates=[['date_only', 'time_only']])

13.1 s ± 344 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
csv_file.seek(0)
new_df = pd.read_csv(csv_file, usecols=['date_only', 'time_only'], parse_dates=[['date_only', 'time_only']])
pd.to_datetime(new_df['date_only_time_only'], exact=True, cache=True, format=datetime_format)

338 ms ± 27.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(Just realized I ran these on pandas 1.4.3 instead of 1.5.2, but I would imagine these cases are niche enough that performance wouldn't have improved much over the six months between those two releases.)

@MarcoGorelli
Member

thanks @sm-Fifteen - I tried running that but got

Traceback (most recent call last):
  File "f.py", line 18, in <module>
    if csv_file: csv_file.close()
NameError: name 'csv_file' is not defined

Could you please make the example reproducible?

@MarcoGorelli
Member

actually, it reproduces with just 'tmp.csv'

thanks for the report, will take a look

@MarcoGorelli MarcoGorelli reopened this Jan 4, 2023
@MarcoGorelli
Member

OK, got it. In this case, it's due to #40111

So, in

try:
    result = tools.to_datetime(
        date_parser(*date_cols), errors="ignore", cache=cache_dates
    )
    if isinstance(result, datetime.datetime):
        raise Exception("scalar parser")
    return result
except Exception:
    return tools.to_datetime(
        parsing.try_parse_dates(
            parsing.concat_date_cols(date_cols),
            parser=date_parser,
            dayfirst=dayfirst,
        ),
        errors="ignore",
    )

the following happens:

  • date_parser(*date_cols) succeeds, converting the input to an Index of (mixed-offset) Timestamps
  • to_datetime on that Index fails, because it's not idempotent 🤦 (see linked issue)
  • the except block is reached, and so parsing happens row-by-row, as documented:

Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.
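The fallback chain the docstring describes can be sketched in plain Python (this is an illustrative mock, not pandas source; the name `call_date_parser` is made up here):

```python
import datetime

def call_date_parser(date_parser, date_cols):
    """Mimic the documented fallbacks read_csv applies to date_parser."""
    # 1) Pass the column arrays directly (the fast, vectorized path).
    try:
        result = date_parser(*date_cols)
        if isinstance(result, datetime.datetime):
            # A scalar result means the parser only handles single rows.
            raise TypeError("scalar parser")
        return result
    except Exception:
        pass
    # 2) + 3) Concatenate the columns row-wise and call the parser once
    # per row -- the slow path this issue is about.
    rows = (" ".join(vals) for vals in zip(*date_cols))
    return [date_parser(row) for row in rows]

# A parser that accepts whole columns stays on the fast path:
fast = call_date_parser(lambda col: [s.upper() for s in col], (["a", "b"],))

# A parser that chokes on array input silently falls back to row-by-row calls:
slow = call_date_parser(lambda s: s.upper(), (["a", "b"], ["c", "d"]))
```

The point of the sketch is that any exception on the vectorized attempt, including the non-idempotent to_datetime call from #40111, silently demotes parsing to the per-row path.
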

Gosh, I hate all these fallbacks... TBH I think this warrants an API change, making the behaviour:

  • if no date_parser is provided, use to_datetime with errors='ignore'
  • if a date_parser is provided, just use that. No try/except, no magic fallbacks, if a user wants row-by-row parsing they're free to implement that within their parser

@sm-Fifteen
Author

sm-Fifteen commented Jan 5, 2023

So you're saying that x in date_parser=lambda x: pd.to_datetime(x, format=datetime_format) is actually supposed to be a pandas Series object representing the full column specified by parse_dates, but that the "magic fallback" thing causes it to be called once per row with x as a string value instead? That would explain the massive performance cliff observed. From my initial observations, I thought it was just always being called row-wise.

@MarcoGorelli
Member

yeah - so #50586 would solve the performance issue in your particular case

but I'd still like to make an API change to be honest

@sm-Fifteen
Author

I should probably close this by now, since date_parser has been deprecated and replaced with parameters that avoid this performance issue entirely.
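For anyone landing here later, a minimal sketch of the replacement workflow (the CSV contents below are invented for illustration; the format string matches the one from the issue):

```python
import pandas as pd
from io import StringIO

csv_data = (
    "Date_Time,Power\n"
    "06/17/2020 11:47:22:746,-42.5\n"
    "06/17/2020 11:47:23:746,-41.0\n"
)

# Read the timestamps as text, then convert the whole column in one
# vectorized call. (On pandas >= 2.0 the same format can instead be
# passed at read time via read_csv's date_format parameter, which
# replaces the deprecated date_parser callback.)
df = pd.read_csv(StringIO(csv_data))
df["Date_Time"] = pd.to_datetime(df["Date_Time"], format="%m/%d/%Y %H:%M:%S:%f")
```
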
