
Enable server side cursors #40796

Closed
wants to merge 12 commits into from

Conversation

@J0 (Author) commented Apr 6, 2021

This pull request attempts to fix #35689. I read an article by @itamarst and decided to look further into the codebase. After noticing that the fix seemed to be simple, I decided to file a PR. The fix is naive and I'm not sure the true solution is as simple as I imagine it to be. There are probably many things I'm missing, as I haven't read the codebase in detail, so do let me know what else needs to be addressed.

@itamarst commented Apr 7, 2021

Thanks for doing this! You could perhaps do the stream_results=True conditionally, only if chunksize is set.
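The suggestion above could be sketched like this (a minimal illustration, not the PR's actual code; the helper name `execution_options_for` is invented for the demo):

```python
# Sketch of itamarst's suggestion: request a server-side cursor
# (SQLAlchemy's stream_results=True) only when the caller asked for
# chunked reads, so the non-chunked path keeps fetching in one go.

def execution_options_for(chunksize):
    """Return the execution options for a read_sql-style call."""
    # stream_results=True asks the driver for a server-side cursor,
    # so rows are fetched lazily instead of being loaded all at once.
    return {"stream_results": True} if chunksize is not None else {}


print(execution_options_for(1000))  # {'stream_results': True}
print(execution_options_for(None))  # {}
```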

@jreback (Contributor) left a comment

this doesn't seem to break anything, but it's certainly possible we are not exercising things. can you a) comment on what that parameter is doing (e.g. put a reference in the code to the article), b) come up with a test that fails currently and passes with this change, and c) add a whatsnew note (perf section) for 1.3?

would we ever need to turn this off?

@itamarst commented Apr 9, 2021

As context, this is a memory usage performance improvement (in some cases). You can see the article here (https://pythonspeed.com/articles/pandas-sql-chunking/) with some memory-usage graphs.

With streaming mode, rows are only fetched on demand. If you're using chunking, this is what you want, because otherwise you might try to load gigabytes of rows into memory and then dole them out in chunks, potentially running out of memory.

If you're not using chunking, though, streaming is probably a runtime performance hit, because there's a bunch more network latency involved, and you already know you want all the data, so why not just get it all in one go? (See #40847 for follow-up work that would use streaming cursors everywhere, but I don't think it should always be on.)
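The memory trade-off described above can be illustrated without a real database (all names here are invented for the demo): a client-side cursor materialises the full result set up front, while a server-side cursor behaves like a generator that hands rows over on demand.

```python
# Client-side fetch: peak memory is proportional to the whole result set.
def client_side_fetch(row_source):
    return list(row_source)


# Streaming fetch: only about `chunksize` rows are resident at a time,
# which is what pandas' chunked read_sql path wants.
def stream_in_chunks(row_source, chunksize):
    chunk = []
    for row in row_source:
        chunk.append(row)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []
    if chunk:  # final partial chunk
        yield chunk


print(list(stream_in_chunks(range(10), 4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```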

@J0 (Author) commented Apr 9, 2021

Hey @itamarst and @jreback,

Thanks for the input! Will do; I'll address the comments and follow up shortly.

Update: I noted the PEP8 style error below and will address it tomorrow when I add:

  1. a reference in sql.py to the article
  2. a test that fails current and passes with the addition of the new parameter
  3. A whatsnew note for 1.3 under the perf section

@pep8speaks commented Apr 11, 2021

Hello @J0! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-02 04:15:17 UTC

@github-actions (bot) commented

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 19, 2021
@J0 (Author) commented May 19, 2021

Yes, I am still interested in working on this and will take a look at this on the weekend

@simonjayhawkins (Member) commented

> Yes, I am still interested in working on this and will take a look at this on the weekend

Thanks @J0 . closing as stale. ping when ready

@J0 (Author) commented Jun 9, 2021

Okay, will do. Sorry -- work got a bit busy. Will try to make time for it this weekend or over an evening. Thanks for your understanding.

@J0 (Author) commented Jun 12, 2021

Hey @simonjayhawkins,

I'm working on this right now -- any chance I could get this re-opened?

I've addressed the earlier comments and would like to see whether all the checks pass.

@J0 (Author) commented Jun 12, 2021

Thanks!

@J0 (Author) commented Jun 12, 2021

Oh dear, looks like the coverage tests aren't passing.

I'm not too sure of the best way to write a failing test for this, since the main change is in the option passed in.

Will think about this, but just wanted to check if @itamarst and @jreback had any suggestions.

@itamarst commented

My guess is the failure is unrelated: it timed out after 60 minutes. I would just push a meaningless change and see if it passes this time.

@J0 (Author) commented Jun 14, 2021

Looks like I need approval to run the workflows D: @simonjayhawkins, could I trouble you again to approve the workflow runs for this branch?

Thank you so much -- I'm aware you're probably quite busy.

@J0 (Author) commented Jun 22, 2021

Hey, does anyone have suggestions on how I might get the tests to pass? A little clueless at this point.

@itamarst commented

In a large project like this, it's possible (a) to have intermittent failures and (b) for things to break in ways that are unrelated to your code. So the general procedure is:

  1. Try rerunning with a meaningless commit.
  2. If you get the exact same failure, and it's clear that it's not related to your code, merge forward or rebase, depending on how the project does it.

@J0 (Author) commented Jul 2, 2021

Workflows :( Any chance of help with enabling the workflows again, @simonjayhawkins?

@J0 J0 requested a review from jreback July 12, 2021 02:59
@@ -1421,7 +1421,13 @@ def run_transaction(self):

     def execute(self, *args, **kwargs):
         """Simple passthrough to SQLAlchemy connectable"""
-        return self.connectable.execution_options().execute(*args, **kwargs)
+        if "chunksize" in kwargs:
@jreback (Contributor) commented on the diff

if you are going to do this, then chunksize should be an actual keyword with a default, and a doc-string.

are there any tests which actually evaluate this path?
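One way the review comment above could be addressed is sketched below. This is an illustration, not the merged code: the wiring through `self.connectable` is assumed from the diff, and the stub used in the usage example is invented.

```python
# Sketch of jreback's suggestion: make chunksize an explicit keyword
# with a default and a doc-string, rather than sniffing it out of
# **kwargs. `self.connectable` is assumed to be the SQLAlchemy
# engine/connection that pandas stores on its SQL wrapper class.

def execute(self, *args, chunksize=None, **kwargs):
    """Simple passthrough to SQLAlchemy connectable.

    Parameters
    ----------
    chunksize : int, optional
        If given, request a server-side cursor (``stream_results=True``)
        so rows are fetched lazily rather than all at once.
    """
    options = {"stream_results": True} if chunksize is not None else {}
    return self.connectable.execution_options(**options).execute(*args, **kwargs)
```

A test for this path could then assert, via a stub connectable, that `stream_results` is passed to `execution_options` exactly when `chunksize` is given.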

@mroeschke (Member) commented

Thanks for the PR, but it appears this PR has gone stale again. If you're interested in continuing further by merging master and addressing the comment, feel free to let the team know and we can reopen.

@mroeschke mroeschke closed this Sep 8, 2021
@J0 (Author) commented Jan 13, 2022

@mroeschke apologies for the delay. I now have time to patch this again -- by any chance could we reopen this? Or would it be better if I opened a new PR?

Lmk!

@mroeschke (Member) commented

@J0 it appears I cannot reopen this pull request, as our master branch was renamed to main recently. I would recommend opening a new PR with the change.

@J0 (Author) commented Jan 14, 2022

@mroeschke sure, sounds good, will open a new PR! :)

@J0 (Author) commented Feb 27, 2022

Re-opened at #46166

Labels: IO SQL (to_sql, read_sql, read_sql_query)

Development: Successfully merging this pull request may close this issue: ENH: Support PostgreSQL server-side cursors to prevent memory hog on large datasets

7 participants