
ENH: Support PostgreSQL server-side cursors to prevent memory hog on large datasets #35689

Open
cloud-rocket opened this issue Aug 12, 2020 · 1 comment
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query

Comments

cloud-rocket commented Aug 12, 2020

Is your feature request related to a problem?

pandas.read_sql_query supports the Python generator pattern when the chunksize argument is provided. This is not very helpful when working with large datasets, because the whole result is first retrieved from the DB into client-side memory and only later split into separate frames based on chunksize. Large datasets easily run into out-of-memory problems with this approach.

Describe the solution you'd like

Postgres/psycopg2 address this problem with server-side cursors, but pandas does not support them.
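
For reference, the psycopg2 pattern looks roughly like this (a minimal sketch; conn is assumed to be an open psycopg2 connection, and process_row is a hypothetical handler):

curs = conn.cursor(name='my_cursor')  # a named cursor is server-side
curs.itersize = 1000                  # rows fetched per network round trip
curs.execute("SELECT * FROM users")
for row in curs:                      # iterates in itersize-sized batches
    process_row(row)                  # hypothetical per-row handler
curs.close()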

API breaking implications

The is_cursor argument of SQLDatabase / SQLiteDatabase is not exposed in pandas.read_sql_query or pandas.read_sql_table, for no apparent reason. It should be exposed so that a named (server-side) cursor can be provided, as sketched below.
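
Hypothetically, the exposed API could look like this (not current pandas behavior, just the proposed shape):

import pandas as pd

curs = conn.cursor(name='cur_name')   # psycopg2 server-side cursor
curs.itersize = chunksize
chunks = pd.read_sql_query(query, curs,
                           is_cursor=True,       # the argument to expose
                           chunksize=chunksize)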

Describe alternatives you've considered

Instead of doing:

from pandas.io import sql

chunks = sql.read_sql_query(query,   # query is the SQL string
      conn,
      index_col='col1',
      chunksize=chunksize)

I tried reimplementing it like this:

from pandas.io.sql import SQLiteDatabase

curs = conn.cursor(name='cur_name')  # create a server-side (named) cursor
curs.itersize = chunksize            # rows fetched per network round trip

pandas_sql = SQLiteDatabase(curs, is_cursor=True)
chunks = pandas_sql.read_query(
      query,
      index_col='col1',
      chunksize=chunksize)

but it fails because SQLiteDatabase tries to access cursor.description, which is None for some reason with server-side cursors (any idea why?).
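
My unverified guess is that psycopg2 only declares the named cursor on execute(), so no row description comes back from the server until the first fetch; a probe like this could confirm it:

curs = conn.cursor(name='cur_name')
curs.execute("SELECT * FROM users")
print(curs.description)    # None at this point?
curs.fetchmany(10)         # first fetch actually pulls rows from the server
print(curs.description)    # populated now?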

Additional references

@cloud-rocket cloud-rocket added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 12, 2020
@alimcmaster1 alimcmaster1 added IO SQL to_sql, read_sql, read_sql_query and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 22, 2020

itamarst commented Apr 5, 2021

By setting the right SQLAlchemy option, you can get this not just for PostgreSQL but for any database that SQLAlchemy knows can do server-side cursors.

For example, this version, which loads the full result client-side before chunking, uses ~100MB RAM:

import pandas as pd
from sqlalchemy import create_engine

def process_sql_using_pandas():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    for chunk_dataframe in pd.read_sql(
            "SELECT * FROM users", engine, chunksize=1000):
        print(f"Got dataframe with {len(chunk_dataframe)} entries")
        # ... do something with dataframe ...

if __name__ == '__main__':
    process_sql_using_pandas()

And this version, which streams through a server-side cursor, uses only ~35MB RAM, essentially just the imports:

import pandas as pd
from sqlalchemy import create_engine

def process_sql_using_pandas():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    # stream_results=True makes SQLAlchemy use a server-side cursor
    conn = engine.connect().execution_options(stream_results=True)

    for chunk_dataframe in pd.read_sql(
            "SELECT * FROM users", conn, chunksize=1000):
        print(f"Got dataframe with {len(chunk_dataframe)} entries")
        # ... do something with dataframe ...

if __name__ == '__main__':
    process_sql_using_pandas()

See the attached SVGs for memory profiles of both programs.
memory-profiles.zip
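
Side note: with the psycopg2 dialect you can also tune how many rows SQLAlchemy buffers from the server-side cursor via the max_row_buffer execution option (this is from memory of the docs, so treat it as an assumption):

conn = engine.connect().execution_options(
    stream_results=True,
    max_row_buffer=1000,  # rows buffered per fetch from the server-side cursor
)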
