Is your feature request related to a problem?
Right now when you do pandas.read_sql without chunking, memory usage appears to be O(4N), where N is the number of rows. For example, if the rows can be represented in ~1GB of RAM, approximately 4GB of RAM will be used.
By using chunking + server-side cursors (#35689) internally, this could be changed to O(N), or ~1GB of RAM usage in our example.
In particular, it appears that the rows get represented in 4 different forms as part of constructing the DataFrame, as you can see in the example code + memory usage report in the "Iteration #1" section of https://pythonspeed.com/articles/pandas-sql-chunking/.
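For concreteness, the all-at-once pattern that exhibits this behavior looks something like the following sketch (the connection string and table name are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and table name; any large result set will do.
engine = create_engine("postgresql:///example_db")

# Peak memory while this line runs can be roughly 4x the size of the
# resulting DataFrame, because the rows pass through several intermediate
# representations during construction.
df = pd.read_sql("SELECT * FROM big_table", engine)
```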
Describe the solution you'd like
Pandas could internally load the SQL results in chunks via streaming/server-side cursors (see the linked article above for details, or #40796), and construct the DataFrame by appending each chunk. The 4× overhead would then apply only to the chunk size C, i.e. memory usage would be O(N + 4C).
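A rough user-level sketch of the idea, assuming SQLAlchemy (the `stream_results` execution option requests a server-side cursor on drivers that support it; the connection string, table name, and chunk size are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql:///example_db")

with engine.connect() as conn:
    # stream_results=True asks SQLAlchemy for a server-side cursor, so
    # the driver does not buffer the entire result set client-side.
    conn = conn.execution_options(stream_results=True)

    # Each 100k-row chunk pays the ~4x construction overhead on its own;
    # only one chunk's worth of intermediate data is alive at a time.
    chunks = pd.read_sql("SELECT * FROM big_table", conn, chunksize=100_000)
    df = pd.concat(chunks, ignore_index=True)
```

Note that an internal implementation could do better than this sketch: the final `pd.concat` still copies all the accumulated chunks into the result frame (so peak usage is closer to O(2N + 4C)), whereas appending directly into preallocated column arrays would keep it to the stated O(N + 4C).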
API breaking implications
This should be mostly transparent, except perhaps for the same sort of dtype-guessing issues you get with read_csv; although perhaps the SQL column types could be used to resolve dtypes up front?
Describe alternatives you've considered
Users can do the chunking themselves, but it's non-obvious that you're getting 4× the RAM usage you actually need. As a result, people end up duplicating the chunking idiom (introducing bugs along the way), when many of them could use the simpler all-in-RAM API if Pandas implemented it efficiently.
In general, memory usage is quite opaque (I had to write a whole new profiler to get the graphs in the article linked above), so putting the burden on users is not ideal.
Additional context
It's possible to use the open source Fil memory profiler to measure peak memory usage in order to validate an improvement. To unit-test memory usage, one could maybe look at the RLIMIT_RSS (https://docs.python.org/3/library/resource.html#resource.RLIMIT_RSS) of a subprocess.
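As a hedged sketch of that unit-testing idea, one could run the load in a child process and check its peak RSS afterwards via `resource.getrusage` (a close cousin of the RLIMIT_RSS suggestion; the script name and threshold below are hypothetical, and the `resource` module is Unix-only):

```python
import resource
import subprocess
import sys

# Hypothetical script that performs the pandas.read_sql load under test.
LOAD_SCRIPT = "load_big_table.py"

# Run the load in a child process so its memory usage is isolated.
subprocess.run([sys.executable, LOAD_SCRIPT], check=True)

# ru_maxrss is the peak RSS across terminated, waited-for children
# (reported in kilobytes on Linux, bytes on macOS).
peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
assert peak < 1_500_000, f"peak child RSS too high: {peak}"
```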