Skip to content

ENH: Memory usage of read_sql could be significantly reduced to ~25% of current memory usage #40847

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
itamarst opened this issue Apr 9, 2021 · 0 comments
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query

Comments

@itamarst
Copy link

itamarst commented Apr 9, 2021

Is your feature request related to a problem?

Right now when you do pandas.read_sql without chunking, memory usage appears to be O(4N) where N is number of rows. For example, if the rows can be represented in ~1GB of RAM, approximately 4GB of RAM will be used.

By using chunking + server-side cursors (#35689) internally, this could be changed to O(N), or ~1GB of RAM usage in our example.

In particular, it appears that the rows get represented in 4 different forms as part of constructing the DataFrame, as you can see in the example code + memory usage report in "Iteration #1" section of https://pythonspeed.com/articles/pandas-sql-chunking/.

Describe the solution you'd like

Pandas could internally load the SQL in chunks + streaming/server-side-cursors (see the linked article above for details, or #40796), and construct the dataframe with appending. The 4× would then only be on chunk size, i.e. memory usage would be O(N + 4C).

API breaking implications

This should be completely transparent in most ways, except perhaps for same sort of issues you get with read CSV in terms of guessing dtypes? Although perhaps the SQL can use column types for that?

Describe alternatives you've considered

Users can use chunking themselves, but it's non-obvious that you're getting 4× the RAM usage over what you'd actually need, so people end up both duplicating the idiom resulting in bugs, and might actually be able to do the simpler all-in-RAM thing if Pandas did it right.

In general, memory usage is quite opaque (I had to write a whole new profiler to get the graphs in the article linked above) so putting burden on users is not ideal.

Additional context

It's possible to use the open source Fil memory profiler to measure peak memory usage in order to validate an improvement. To unit test memory usage could maybe look at https://docs.python.org/3/library/resource.html#resource.RLIMIT_RSS of a subprocess.

@itamarst itamarst added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 9, 2021
@lithomas1 lithomas1 added IO SQL to_sql, read_sql, read_sql_query and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 10, 2021
@lithomas1 lithomas1 added this to the Contributions Welcome milestone Apr 10, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

No branches or pull requests

3 participants