ENH: Memory usage of read_sql could be significantly reduced to ~25% of current memory usage #40847

itamarst · 2021-04-09T12:19:27Z

Is your feature request related to a problem?

Right now when you do pandas.read_sql without chunking, memory usage appears to be O(4N) where N is number of rows. For example, if the rows can be represented in ~1GB of RAM, approximately 4GB of RAM will be used.

By using chunking + server-side cursors (#35689) internally, this could be changed to O(N), or ~1GB of RAM usage in our example.

In particular, it appears that the rows get represented in 4 different forms as part of constructing the DataFrame, as you can see in the example code + memory usage report in "Iteration #1" section of https://pythonspeed.com/articles/pandas-sql-chunking/.

Describe the solution you'd like

Pandas could internally load the SQL in chunks + streaming/server-side-cursors (see the linked article above for details, or #40796), and construct the dataframe with appending. The 4× would then only be on chunk size, i.e. memory usage would be O(N + 4C).

API breaking implications

This should be completely transparent in most ways, except perhaps for same sort of issues you get with read CSV in terms of guessing dtypes? Although perhaps the SQL can use column types for that?

Describe alternatives you've considered

Users can use chunking themselves, but it's non-obvious that you're getting 4× the RAM usage over what you'd actually need, so people end up both duplicating the idiom resulting in bugs, and might actually be able to do the simpler all-in-RAM thing if Pandas did it right.

In general, memory usage is quite opaque (I had to write a whole new profiler to get the graphs in the article linked above) so putting burden on users is not ideal.

Additional context

It's possible to use the open source Fil memory profiler to measure peak memory usage in order to validate an improvement. To unit test memory usage could maybe look at https://docs.python.org/3/library/resource.html#resource.RLIMIT_RSS of a subprocess.

The text was updated successfully, but these errors were encountered:

itamarst added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 9, 2021

itamarst mentioned this issue Apr 9, 2021

Enable server side cursors #40796

Closed

4 tasks

lithomas1 added IO SQL to_sql, read_sql, read_sql_query and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 10, 2021

lithomas1 added this to the Contributions Welcome milestone Apr 10, 2021

axiomcura mentioned this issue May 13, 2022

Potential memory leak in SingleCell's .merge_single_cells() method cytomining/pycytominer#195

Closed

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Memory usage of read_sql could be significantly reduced to ~25% of current memory usage #40847

ENH: Memory usage of read_sql could be significantly reduced to ~25% of current memory usage #40847

itamarst commented Apr 9, 2021

ENH: Memory usage of read_sql could be significantly reduced to ~25% of current memory usage #40847

ENH: Memory usage of read_sql could be significantly reduced to ~25% of current memory usage #40847

Comments

itamarst commented Apr 9, 2021

Is your feature request related to a problem?

Describe the solution you'd like

API breaking implications

Describe alternatives you've considered

Additional context