
ENH: enable server side cursors when chunksize is set #56742

Closed
wants to merge 4 commits

Conversation

@skshetry skshetry commented Jan 5, 2024

Please look at the previous attempts in #40796 and #46166. A lot has changed since those PRs. Since SQLTable (or PandasSQL) appears to be polymorphic, I did not want to change the signature of execute(); instead I noticed self.returns_generator, which can be used to decide whether or not to use server-side cursors.
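Not the PR's actual diff, but the chunked-read pattern at issue can be sketched with stdlib sqlite3: when chunksize is set, rows are pulled in batches via fetchmany() rather than fetchall(), so only one batch is materialized at a time. (The helper name read_in_chunks is illustrative, not pandas API.)

```python
import sqlite3

def read_in_chunks(con, query, chunksize):
    # Generator mirroring what a chunksize-driven reader does:
    # fetch `chunksize` rows at a time instead of fetchall(),
    # so only one batch is held in memory at once.
    cur = con.execute(query)
    while True:
        rows = cur.fetchmany(chunksize)
        if not rows:
            return
        yield rows

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(7)])
chunks = [len(c) for c in read_in_chunks(con, "SELECT x FROM t ORDER BY x", 3)]
# chunks == [3, 3, 1]
```

Server-side cursors extend this idea: the unfetched rows also stay on the server instead of being pre-buffered by the driver.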

@skshetry skshetry marked this pull request as ready for review January 5, 2024 13:36
@WillAyd (Member) commented Jan 5, 2024

Does postgres support this? It would be nice if we can add a test that asserts this actually happens.

I suppose it would also be limited to only some of the drivers. I know the ADBC postgres driver does not implement this yet, guessing libpq does?

@WillAyd WillAyd added the IO SQL to_sql, read_sql, read_sql_query label Jan 5, 2024
@skshetry (Author) commented Jan 8, 2024

> Does postgres support this?

Yes, postgres supports this. Please take a look at:

Also note that what we need here is to avoid fetching all the results into memory, so the title is a bit misleading. My understanding is that, even if the backend does not support server-side cursors, stream_results=True indicates to the dialect that results should be “streamed” and not pre-buffered, if possible.

> Server side cursors also imply a wider set of features with relational databases, such as the ability to “scroll” a cursor forwards and backwards. SQLAlchemy does not include any explicit support for these behaviors; within SQLAlchemy itself, the general term “server side cursors” should be considered to mean “unbuffered results” and “client side cursors” means “result rows are buffered into memory before the first row is returned”.

Also note that SQLAlchemy by default buffers up to 1000 rows, which may be less efficient for small results.

> When a Result object delivered using the Connection.execution_options.stream_results option is iterated directly, rows are fetched internally using a default buffering scheme that buffers first a small set of rows, then a larger and larger buffer on each fetch up to a pre-configured limit of 1000 rows. The maximum size of this buffer can be affected using the Connection.execution_options.max_row_buffer execution option:
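As a toy model of the growing-buffer scheme quoted above (the initial size and growth factor here are illustrative assumptions, not SQLAlchemy internals):

```python
def buffer_sizes(total_rows, max_row_buffer=1000, start=10, growth=2):
    # Toy model of the growing-buffer scheme: start with a small
    # batch, grow on each fetch, and cap at max_row_buffer.
    # `start` and `growth` are assumptions for illustration only.
    sizes, size, remaining = [], start, total_rows
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
        size = min(size * growth, max_row_buffer)
    return sizes

sizes = buffer_sizes(5000)
# Fetch sizes grow geometrically, then plateau at max_row_buffer.
```

This is why small result sets can be slightly less efficient with streaming: they are delivered in several small fetches rather than one buffered read.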


> It would be nice if we can add a test that asserts this actually happens.

Since this is mostly a memory optimization, I am not sure how to test this. I'd like to avoid mocking as much as possible. If you have any ideas, I'm happy to add a test.

> I suppose it would also be limited to only some of the drivers.

The SQLAlchemy docs say server-side cursors are supported for the "mysqlclient, PyMySQL, mariadbconnector dialects and may also be available in others" for MySQL, and for the "psycopg2, asyncpg dialects and may also be available in others" for Postgres. Oracle uses server-side cursors by default.

Streaming may be supported by other databases too (Snowflake does, as far as I can see). psycopg internally buffers everything unless server-side cursors are used.

> I know the ADBC postgres driver does not implement this yet, guessing libpq does?

My understanding is that server-side cursors (aka named cursors) in Postgres are created through SQL commands, so I am not sure this needs libpq support. But I am not familiar with libpq and haven't used ADBC at all, so I may be wrong here. Please see Server Side Cursors - psycopg.
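To illustrate the "plain SQL" point: a Postgres named cursor is driven by ordinary DECLARE/FETCH/CLOSE statements. The helper below is a sketch (not driver code); it assumes the identifier is pre-sanitised and that the statements run inside a transaction, as Postgres requires.

```python
def named_cursor_statements(name, query, batch=1000):
    # The SQL a PostgreSQL client issues for a server-side ("named")
    # cursor: declare it, fetch rows in batches, close it. These are
    # ordinary statements, which is why no dedicated libpq API is
    # strictly required. Must run inside a transaction; `name` is
    # assumed to be a safe, pre-sanitised identifier.
    return [
        f"DECLARE {name} CURSOR FOR {query}",
        f"FETCH FORWARD {batch} FROM {name}",
        f"CLOSE {name}",
    ]

stmts = named_cursor_statements("c1", "SELECT * FROM big_table", batch=500)
```

In practice the FETCH statement is repeated until it returns no rows, which is what psycopg's named cursors do under the hood.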

@skshetry (Author) commented:
I have added a simple test to check for execution_options.

@WillAyd (Member) commented Jan 21, 2024

> Since this is mostly a memory optimization, I am not sure how to test this. I'd like to avoid mocking as much as possible. If you have any ideas, I'm happy to add a test.

Probably the best way to assert this actually does something is to add a peakmem benchmark to asv_bench/benchmarks/io/sql.py
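A sketch of what such a benchmark could look like: asv measures the peak memory of any method whose name starts with peakmem_. The class and method names are hypothetical, and sqlite3 stands in for the real pandas read_sql call just to keep the sketch self-contained.

```python
import sqlite3

class ReadSQLChunked:
    # Hypothetical asv benchmark sketch: asv reports the peak memory
    # of methods prefixed "peakmem_". In the real benchmark the body
    # would call pd.read_sql(..., chunksize=...) against a test DB.
    def setup(self):
        self.con = sqlite3.connect(":memory:")
        self.con.execute("CREATE TABLE t (x INTEGER)")
        self.con.executemany(
            "INSERT INTO t VALUES (?)", [(i,) for i in range(10_000)]
        )

    def peakmem_read_chunked(self):
        # With server-side cursors / streamed results, peak memory
        # should stay roughly flat as the table grows.
        cur = self.con.execute("SELECT x FROM t")
        total = 0
        while rows := cur.fetchmany(1000):
            total += len(rows)
        return total
```

Comparing this against an equivalent fetchall()-style benchmark is what would demonstrate the optimization.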

@github-actions (bot) commented:
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Feb 21, 2024
@mroeschke (Member) commented:
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Feb 28, 2024
@skshetry (Author) commented:
I'm so sorry to disappear like this. A lot is happening both personally and at work, so I won't be able to get to this in the short term.

If anyone wants to pick this up, the only thing missing from this PR is a memory benchmark (although I don't think it's essential).

@skshetry skshetry deleted the stream-results branch February 29, 2024 09:09
Successfully merging this pull request may close these issues.

ENH: Support PostgreSQL server-side cursors to prevent memory hog on large datasets