
ENH: Support PostgreSQL server-side cursors to prevent memory hog on large datasets #35689

Open
cloud-rocket opened this issue Aug 12, 2020 · 1 comment
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query

Comments

cloud-rocket commented Aug 12, 2020

Is your feature request related to a problem?

pandas.read_sql_query supports the Python generator pattern when the chunksize argument is provided. This is not very helpful when working with large datasets, because the whole result is first retrieved from the DB into client-side memory and only later split into separate frames based on chunksize. Large datasets easily run into out-of-memory problems with this approach.

Describe the solution you'd like

Postgres/psycopg2 address this problem with server-side cursors, but pandas does not support them.
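
For reference, the psycopg2 pattern looks roughly like this (a minimal sketch; conn is assumed to be an open psycopg2 connection, and process_row is a hypothetical handler):

curs = conn.cursor(name='my_cursor')  # a named cursor is server-side
curs.itersize = 1000                  # rows fetched per network round trip
curs.execute("SELECT * FROM users")
for row in curs:                      # iterates in itersize-sized batches
    process_row(row)                  # hypothetical per-row handler
curs.close()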

API breaking implications

The is_cursor argument of SQLDatabase / SQLiteDatabase is not exposed in pandas.read_sql_query or pandas.read_sql_table, for no apparent reason. It should be exposed so that a named (server-side) cursor can be provided, as sketched below.
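
Hypothetically, the exposed API could look like this (not current pandas behavior, just the proposed shape):

import pandas as pd

curs = conn.cursor(name='cur_name')   # psycopg2 server-side cursor
curs.itersize = chunksize
chunks = pd.read_sql_query(query, curs,
                           is_cursor=True,       # the argument to expose
                           chunksize=chunksize)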

Describe alternatives you've considered

Instead of doing:

from pandas.io import sql

chunks = sql.read_sql_query(query,   # query is the SQL string
      conn,
      index_col='col1',
      chunksize=chunksize)

I tried reimplementing it like this:

from pandas.io.sql import SQLiteDatabase

curs = conn.cursor(name='cur_name')  # create a server-side (named) cursor
curs.itersize = chunksize            # rows fetched per network round trip

pandas_sql = SQLiteDatabase(curs, is_cursor=True)
chunks = pandas_sql.read_query(
      query,
      index_col='col1',
      chunksize=chunksize)

but it fails because SQLiteDatabase tries to access cursor.description, which is None for some reason with server-side cursors (any idea why?).
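
My unverified guess is that psycopg2 only declares the named cursor on execute(), so no row description comes back from the server until the first fetch; a probe like this could confirm it:

curs = conn.cursor(name='cur_name')
curs.execute("SELECT * FROM users")
print(curs.description)    # None at this point?
curs.fetchmany(10)         # first fetch actually pulls rows from the server
print(curs.description)    # populated now?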

Additional references

@cloud-rocket cloud-rocket added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 12, 2020
@alimcmaster1 alimcmaster1 added IO SQL to_sql, read_sql, read_sql_query and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 22, 2020

itamarst commented Apr 5, 2021

By setting the right SQLAlchemy option, you can get this not just for PostgreSQL but for any database that SQLAlchemy knows can do server-side cursors.

For example, this version, which loads the full result client-side before chunking, uses ~100MB RAM:

import pandas as pd
from sqlalchemy import create_engine

def process_sql_using_pandas():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    for chunk_dataframe in pd.read_sql(
            "SELECT * FROM users", engine, chunksize=1000):
        print(f"Got dataframe with {len(chunk_dataframe)} entries")
        # ... do something with dataframe ...

if __name__ == '__main__':
    process_sql_using_pandas()

And this version, which streams through a server-side cursor, uses only ~35MB RAM, essentially just the imports:

import pandas as pd
from sqlalchemy import create_engine

def process_sql_using_pandas():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    # stream_results=True makes SQLAlchemy use a server-side cursor
    conn = engine.connect().execution_options(stream_results=True)

    for chunk_dataframe in pd.read_sql(
            "SELECT * FROM users", conn, chunksize=1000):
        print(f"Got dataframe with {len(chunk_dataframe)} entries")
        # ... do something with dataframe ...

if __name__ == '__main__':
    process_sql_using_pandas()

See the attached SVGs for memory profiles of both programs.
memory-profiles.zip
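
Side note: with the psycopg2 dialect you can also tune how many rows SQLAlchemy buffers from the server-side cursor via the max_row_buffer execution option (this is from memory of the docs, so treat it as an assumption):

conn = engine.connect().execution_options(
    stream_results=True,
    max_row_buffer=1000,  # rows buffered per fetch from the server-side cursor
)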
