ENH: Support PostgreSQL server-side cursors to prevent memory hog on large datasets #35689
By setting the right SQLAlchemy option, you can get this not just for PostgreSQL but for any database SQLAlchemy knows can do server-side cursors. For example, this uses ~100MB RAM:

```python
import pandas as pd
from sqlalchemy import create_engine

def process_sql_using_pandas():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    for chunk_dataframe in pd.read_sql(
            "SELECT * FROM users", engine, chunksize=1000):
        print(f"Got dataframe with {len(chunk_dataframe)} entries")
        # ... do something with dataframe ...

if __name__ == '__main__':
    process_sql_using_pandas()
```

And this uses ~35MB RAM, most of which is just the imports:

```python
import pandas as pd
from sqlalchemy import create_engine

def process_sql_using_pandas():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    conn = engine.connect().execution_options(stream_results=True)
    for chunk_dataframe in pd.read_sql(
            "SELECT * FROM users", conn, chunksize=1000):
        print(f"Got dataframe with {len(chunk_dataframe)} entries")
        # ... do something with dataframe ...

if __name__ == '__main__':
    process_sql_using_pandas()
```

See the attached SVGs for memory profiles of both programs.
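As a side note, the chunked-iteration pattern used above can be sketched with nothing but the standard-library `sqlite3` module; the in-memory database, `users` table, and `iter_chunks` helper below are invented for illustration:

```python
import sqlite3

def iter_chunks(cursor, chunksize):
    # Yield successive batches of at most `chunksize` rows,
    # pulling from the cursor incrementally via fetchmany().
    while True:
        rows = cursor.fetchmany(chunksize)
        if not rows:
            break
        yield rows

# Tiny in-memory demo database (the table name mirrors the example above).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(10)])

cur = conn.execute("SELECT * FROM users")
chunk_sizes = [len(chunk) for chunk in iter_chunks(cur, 4)]
print(chunk_sizes)  # 10 rows in chunks of at most 4: [4, 4, 2]
```

This only demonstrates batched iteration on the client side; whether rows are also fetched lazily from the server is exactly what `stream_results` / server-side cursors control.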
Is your feature request related to a problem?

`pandas.read_sql_query` supports the Python "generator" pattern when the `chunksize` argument is provided. However, this is not very helpful when working with large datasets, since the whole result set is first retrieved from the database into client-side memory and only then split into separate frames based on `chunksize`. Large datasets will easily run into out-of-memory problems with this approach.

Describe the solution you'd like
Postgres/psycopg2 address this problem with server-side cursors, but pandas does not support them.
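For reference, this is roughly what a server-side cursor looks like when using psycopg2 directly; a minimal sketch, assuming a reachable database with a `users` table — the DSN, the cursor name, and the `stream_users` helper are all invented for illustration:

```python
def stream_users(dsn, chunksize=1000):
    """Sketch: iterate over a large table via a psycopg2 named
    (server-side) cursor, so rows are fetched from the server in
    batches instead of being loaded into client memory all at once."""
    import psycopg2  # imported here so the sketch stays self-contained

    conn = psycopg2.connect(dsn)
    try:
        # Giving the cursor a name makes it a server-side cursor.
        with conn.cursor(name="users_cursor") as cur:
            cur.itersize = chunksize  # rows fetched per network round trip
            cur.execute("SELECT * FROM users")
            for row in cur:
                yield row
    finally:
        conn.close()
```

This is the mechanism the SQLAlchemy `stream_results` option enables under the hood for PostgreSQL.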
API breaking implications

The `is_cursor` argument of `SQLDatabase` or `SQLiteDatabase` is not exposed in `pandas.read_sql_query` or `pandas.read_sql_table`, without any apparent reason. It should be exposed; that way a named (server-side) cursor could be provided.

Describe alternatives you've considered
Instead of doing:

I tried reimplementing it like this:

but it fails because `SQLiteDatabase` tries to access `cursor.description`, which is NULL for some reason with server-side cursors (any idea why?).

Additional references