
ENH: enable server side cursors when chunksize is set #56742

Closed
wants to merge 4 commits

Conversation

@skshetry skshetry commented Jan 5, 2024

Please look at the previous attempts in #40796 and #46166. A lot has changed since those PRs. Since SQLTable (or PandasSQL) appears to be polymorphic, I did not want to change the signature of execute(); instead I noticed self.returns_generator, which can be used to decide whether or not to use server-side cursors.
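Not the PR's actual diff, but the chunked-read pattern at issue can be sketched with stdlib sqlite3: when chunksize is set, rows are pulled in batches via fetchmany() rather than fetchall(), so only one batch is materialized at a time. (The helper name read_in_chunks is illustrative, not pandas API.)

```python
import sqlite3

def read_in_chunks(con, query, chunksize):
    # Generator mirroring what a chunksize-driven reader does:
    # fetch `chunksize` rows at a time instead of fetchall(),
    # so only one batch is held in memory at once.
    cur = con.execute(query)
    while True:
        rows = cur.fetchmany(chunksize)
        if not rows:
            return
        yield rows

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(7)])
chunks = [len(c) for c in read_in_chunks(con, "SELECT x FROM t ORDER BY x", 3)]
# chunks == [3, 3, 1]
```

Server-side cursors extend this idea: the unfetched rows also stay on the server instead of being pre-buffered by the driver.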

@skshetry skshetry marked this pull request as ready for review January 5, 2024 13:36
@WillAyd (Member) commented Jan 5, 2024

Does postgres support this? It would be nice if we can add a test that asserts this actually happens.

I suppose it would also be limited to only some of the drivers. I know the ADBC postgres driver does not implement this yet, guessing libpq does?

@WillAyd WillAyd added the IO SQL to_sql, read_sql, read_sql_query label Jan 5, 2024
@skshetry (Author) commented Jan 8, 2024

> Does postgres support this?

Yes, postgres supports this. Please take a look at:

Also note that what we need here is to avoid fetching all the results into memory, so the title is a bit misleading. My understanding is that, even if the backend does not support server-side cursors, stream_results=True indicates to the dialect that results should be “streamed” and not pre-buffered, if possible.

> Server side cursors also imply a wider set of features with relational databases, such as the ability to “scroll” a cursor forwards and backwards. SQLAlchemy does not include any explicit support for these behaviors; within SQLAlchemy itself, the general term “server side cursors” should be considered to mean “unbuffered results” and “client side cursors” means “result rows are buffered into memory before the first row is returned”.

Also note that SQLAlchemy by default buffers up to 1000 rows, which may be less efficient for small results.

> When a Result object delivered using the Connection.execution_options.stream_results option is iterated directly, rows are fetched internally using a default buffering scheme that buffers first a small set of rows, then a larger and larger buffer on each fetch up to a pre-configured limit of 1000 rows. The maximum size of this buffer can be affected using the Connection.execution_options.max_row_buffer execution option:
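As a toy model of the growing-buffer scheme quoted above (the initial size and growth factor here are illustrative assumptions, not SQLAlchemy internals):

```python
def buffer_sizes(total_rows, max_row_buffer=1000, start=10, growth=2):
    # Toy model of the growing-buffer scheme: start with a small
    # batch, grow on each fetch, and cap at max_row_buffer.
    # `start` and `growth` are assumptions for illustration only.
    sizes, size, remaining = [], start, total_rows
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
        size = min(size * growth, max_row_buffer)
    return sizes

sizes = buffer_sizes(5000)
# Fetch sizes grow geometrically, then plateau at max_row_buffer.
```

This is why small result sets can be slightly less efficient with streaming: they are delivered in several small fetches rather than one buffered read.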


> It would be nice if we can add a test that asserts this actually happens.

Since this is mostly a memory optimization, I am not sure how to test this. I'd like to avoid mocking as much as possible. If you have any ideas, I'm happy to add a test.

> I suppose it would also be limited to only some of the drivers.

The SQLAlchemy docs say server-side cursors are supported for the "mysqlclient, PyMySQL, mariadbconnector dialects and may also be available in others" for MySQL, and for the "psycopg2, asyncpg dialects and may also be available in others" for Postgres. Oracle uses server-side cursors by default.

Streaming may be supported by other databases too (Snowflake does, as far as I can see). psycopg internally buffers everything unless server-side cursors are used.

> I know the ADBC postgres driver does not implement this yet, guessing libpq does?

My understanding is that server-side cursors (aka named cursors) in Postgres are created through SQL commands, so I am not sure this needs libpq support. But I am not familiar with libpq and haven't used ADBC at all, so I may be wrong here. Please see Server Side Cursors - psycopg.
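To illustrate the "plain SQL" point: a Postgres named cursor is driven by ordinary DECLARE/FETCH/CLOSE statements. The helper below is a sketch (not driver code); it assumes the identifier is pre-sanitised and that the statements run inside a transaction, as Postgres requires.

```python
def named_cursor_statements(name, query, batch=1000):
    # The SQL a PostgreSQL client issues for a server-side ("named")
    # cursor: declare it, fetch rows in batches, close it. These are
    # ordinary statements, which is why no dedicated libpq API is
    # strictly required. Must run inside a transaction; `name` is
    # assumed to be a safe, pre-sanitised identifier.
    return [
        f"DECLARE {name} CURSOR FOR {query}",
        f"FETCH FORWARD {batch} FROM {name}",
        f"CLOSE {name}",
    ]

stmts = named_cursor_statements("c1", "SELECT * FROM big_table", batch=500)
```

In practice the FETCH statement is repeated until it returns no rows, which is what psycopg's named cursors do under the hood.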

@skshetry (Author) commented:
I have added a simple test to check for execution_options.

@WillAyd (Member) commented Jan 21, 2024

> Since this is mostly a memory optimization, I am not sure how to test this. I'd like to avoid mocking as much as possible. If you have any ideas, I'm happy to add a test.

Probably the best way to assert this actually does something is to add a peakmem benchmark to asv_bench/benchmarks/io/sql.py
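A sketch of what such a benchmark could look like: asv measures the peak memory of any method whose name starts with peakmem_. The class and method names are hypothetical, and sqlite3 stands in for the real pandas read_sql call just to keep the sketch self-contained.

```python
import sqlite3

class ReadSQLChunked:
    # Hypothetical asv benchmark sketch: asv reports the peak memory
    # of methods prefixed "peakmem_". In the real benchmark the body
    # would call pd.read_sql(..., chunksize=...) against a test DB.
    def setup(self):
        self.con = sqlite3.connect(":memory:")
        self.con.execute("CREATE TABLE t (x INTEGER)")
        self.con.executemany(
            "INSERT INTO t VALUES (?)", [(i,) for i in range(10_000)]
        )

    def peakmem_read_chunked(self):
        # With server-side cursors / streamed results, peak memory
        # should stay roughly flat as the table grows.
        cur = self.con.execute("SELECT x FROM t")
        total = 0
        while rows := cur.fetchmany(1000):
            total += len(rows)
        return total
```

Comparing this against an equivalent fetchall()-style benchmark is what would demonstrate the optimization.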

@github-actions (bot) commented:
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Feb 21, 2024
@mroeschke (Member) commented:
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Feb 28, 2024
@skshetry (Author) commented:
I'm so sorry to disappear like this. A lot is happening both personally and at work, so I won't be able to get to this in the short term.

If anyone wants to pick this up, the only thing missing from this PR is a memory benchmark (although I don't think it's essential).

@skshetry skshetry deleted the stream-results branch February 29, 2024 09:09
Successfully merging this pull request may close these issues.

ENH: Support PostgreSQL server-side cursors to prevent memory hog on large datasets