Reading table with chunksize still pumps the memory #12265

Closed
klonuo opened this issue Feb 9, 2016 · 14 comments
Labels: IO SQL (to_sql, read_sql, read_sql_query), Usage Question

Comments

@klonuo
Contributor

klonuo commented Feb 9, 2016

I'm trying to migrate database tables from MySQL to SQL Server:

import pandas as pd
from sqlalchemy import create_engine

my_engine = create_engine("mysql+pymysql://root:pass@localhost/gen")
ms_engine = create_engine('mssql+pyodbc://localhost/gen?driver=SQL Server')

for table_name in ['topics', 'fiction', 'compact']:
    for table in pd.read_sql_query('SELECT * FROM %s' % table_name,
                                   my_engine, 
                                   chunksize=100000):

        table.to_sql(name=table_name, con=ms_engine, if_exists='append')

I thought that using chunksize would release the memory, but it just keeps growing.
I also tried the garbage collector, but it had no effect.

Maybe my expectations were wrong?

I'm using Python 3.5.1 with pandas 0.17.1 and all the latest packages. I also tried Python 2.7 with pandas 0.16, with the same results.

@jorisvandenbossche added the IO SQL (to_sql, read_sql, read_sql_query) label Feb 9, 2016
@jorisvandenbossche
Member

Maybe my expectations were wrong?

Possibly. chunksize is in any case a somewhat misleading keyword:

  1. Using chunksize does not necessarily fetch the data from the database into Python in chunks. By default it fetches all the data into memory at once and only returns the data in chunks (so the conversion to a dataframe happens in chunks). Generally, this is a limitation of the database drivers.
    See also Add doc note on memory usage of read_sql with chunksize #10693. In fact, pandas could try to set the appropriate execution_options itself, but until then you can set this yourself for the engine you provide to read_sql (this currently only works with psycopg2); see the sketch after this list.
  2. Python does not necessarily release memory that is no longer used back to the OS, so it will also depend on how you measured the memory usage (as you probably know, since you tried the garbage collector).
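
For reference, a minimal sketch of what point 1 suggests, assuming a PostgreSQL source read through psycopg2 (connection details and table names are placeholders, and this is untested here):

import pandas as pd
from sqlalchemy import create_engine

# stream_results only takes effect for drivers with server-side cursor
# support (per the note above, psycopg2 at the time)
pg_source = create_engine(
    "postgresql+psycopg2://user:password@localhost/source_db",
    execution_options=dict(stream_results=True),
)
ms_target = create_engine("mssql+pyodbc://localhost/gen?driver=SQL Server")

for table_name in ["topics", "fiction", "compact"]:
    for chunk in pd.read_sql_query("SELECT * FROM %s" % table_name,
                                   pg_source, chunksize=100000):
        chunk.to_sql(name=table_name, con=ms_target, if_exists="append")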

Side note: there may be better ways to do a database migration. E.g. the author of SQLAlchemy also has a database migration tool: https://pypi.python.org/pypi/alembic

@klonuo
Contributor Author

klonuo commented Feb 9, 2016

Thanks for your prompt reply.
So I've been misled by chunksize... I remember using it in the past, but perhaps I understood it wrong and used it with tables that fit in memory...

PG is also an option for me instead of MSSQL, but I'll try Alembic first.

Cheers

@klonuo closed this as completed Feb 9, 2016
@jorisvandenbossche added this to the No action milestone Feb 9, 2016
@klonuo
Contributor Author

klonuo commented Feb 9, 2016

PG parameters for this example would be very nice ;)

@jorisvandenbossche
Member

In principle, I think it should be something like this:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("...", execution_options=dict(stream_results=True))

for table in pd.read_sql_query('SELECT * FROM %s' % table_name, engine, chunksize=100000):
    ...

But I never tested this myself. It would be interesting to hear experiences with it.

@klonuo
Contributor Author

klonuo commented Feb 9, 2016

Thanks Joris, you seem like a really nice person

Unfortunately, your snippet doesn't seem to be enough:

import pandas as pd
from sqlalchemy import create_engine

my_engine = create_engine("mysql+pymysql://root:pass@localhost/gen")
pg_engine = create_engine('postgresql://postgres:postgres@localhost:5432/gen',
                          execution_options=dict(stream_results=True))

for table_name in tables.keys():
    for table in pd.read_sql('SELECT * FROM %s' % table_name, 
                             my_engine, 
                             chunksize=10000):

        table.to_sql(name=table_name, con=pg_engine, if_exists='append')

I mean this is just for reference. It would be nice if we could migrate data in chunks to PG just by using pandas.

Anyway, I'm reading up on Alembic and I'll post a simple script here if it's as simple as with pandas :)

@jorisvandenbossche
Member

Ah, but note that the stream_results execution option is for fetching the data. You are fetching the data from MySQL, which does not support this, so the above will still fetch all the data into memory at once, and it is normal that you do not see any difference.

@klonuo
Contributor Author

klonuo commented Feb 14, 2016

Several days later, for reference...

Alembic was too complicated for my concentration. I tried the FME and Navicat apps: while the latter didn't manage to migrate all tables through "Data transfer", the former migrated successfully, but although the MySQL tables were encoded in UTF-8 it didn't use the nvarchar data type on SQL Server, so I got records with garbage characters. On top of that, no indexes were preserved.

So I used Python (^_^):

#!/usr/bin/env python3

import pandas as pd
from sqlalchemy import create_engine

my_engine = create_engine("mysql+pymysql://root:pass@localhost/gen?charset=utf8")
ms_engine = create_engine('mssql+pyodbc://localhost/gen?driver=SQL Server')

chunksize = 10000
for table_name in ['topics', 'fiction', 'compact']:

    row_count = int(pd.read_sql('SELECT COUNT(*) FROM {table_name}'.format(
        table_name=table_name), my_engine).values)

    for i in range(int(row_count / chunksize) + 1):
        query = 'SELECT * FROM {table_name} LIMIT {offset}, {chunksize}'.format(
            table_name=table_name, offset=i * chunksize, chunksize=chunksize)

        pd.read_sql_query(query, con=my_engine).to_sql(
            name=table_name, con=ms_engine, if_exists='append', index=False)

@jorisvandenbossche
Member

@tfurmston I notice you removed your comment. But it was a very useful comment, so if you want, feel free to add it again.

@robertdavidwest
Contributor

Thank you @klonuo, I'm using your solution for myself.

Question: what if the SQLAlchemy engine had another boolean option so that, when selected together with chunksize, this simple loop with limits is done in the background? Curious what others would think. Or is it better to be explicit, as in @klonuo's solution?

@alfonsomhc
Contributor

alfonsomhc commented Jun 28, 2017

I see that server-side cursors are now supported in SQLAlchemy (new in version 1.1.4):
http://docs.sqlalchemy.org/en/latest/dialects/mysql.html#server-side-cursors

I have verified that

engine = create_engine('mysql+pymysql://user:password@domain/database', server_side_cursors=True)
result = engine.execute(sql_query)
result.fetchone()

returns a row immediately (i.e. the client doesn't read the complete table into memory). This should be useful to allow read_sql to read in chunks and avoid memory problems. Passing a chunk size to fetchmany:
result.fetchmany(chunk)
should do the trick?
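
To sketch that idea (untested; SQLAlchemy 1.x style engine.execute, and the table name is a placeholder):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@domain/database',
                       server_side_cursors=True)

result = engine.execute('SELECT * FROM some_table')
columns = list(result.keys())
while True:
    # pull one chunk at a time from the server-side cursor
    rows = result.fetchmany(10000)
    if not rows:
        break
    chunk = pd.DataFrame(rows, columns=columns)
    # ... process or write the chunk here, e.g. chunk.to_sql(...) ...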

@Travis-Barton

I know I'm pretty late to the party, but I use the OFFSET clause in my SQL queries, wrapped inside a for loop, to gather the data in chunks.
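
Something along these lines, I assume (a rough sketch only; the engine URL, table name and chunk size are placeholders, and I use a while loop so it stops when a chunk comes back empty):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@localhost/db')

chunksize = 50000
offset = 0
while True:
    # page through the table with LIMIT/OFFSET
    query = 'SELECT * FROM big_table LIMIT {} OFFSET {}'.format(chunksize, offset)
    chunk = pd.read_sql_query(query, engine)
    if chunk.empty:
        break
    # ... process the chunk (e.g. append it to the target table) ...
    offset += chunksize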

@JakobHenning

@alfonsomhc I have now tried that, and even with result.fetchmany(10) my memory goes above 7 GB (I have also tried with execution_options={"stream_results": True}).

@frbelotto

Guys,
Just to be sure: using chunksize won't save me memory, is that right?
What would be a better approach to read a bigger-than-memory table? (I need to read it and save it locally to a CSV file.)
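
For what it's worth, one approach is to write each chunk to the CSV as it arrives; this only keeps memory bounded if the driver actually streams results, as discussed above (the connection string, table and file names are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost/db',
                       execution_options=dict(stream_results=True))

first = True
for chunk in pd.read_sql_query('SELECT * FROM big_table', engine, chunksize=100000):
    # write the header only for the first chunk, then append
    chunk.to_csv('big_table.csv', mode='w' if first else 'a',
                 header=first, index=False)
    first = False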

@DominicJW

I also had this issue. I was using mysqlconnector, with the connection string f'mysql+mysqlconnector://{username}:{password}@{host}:{port}/{database}'.

Passing the parameter execute_option=dict(stream=True) to the create_engine function did not raise any warning or throw an error, but it made no change to how the data was loaded/fetched.

Removing execute_option=dict(stream=True) and replacing it with server_side_cursors=True did throw an error:
ArgumentError: Dialect <sqlalchemy.dialects.mysql.mysqlconnector.MySQLDialect_mysqlconnector object at 0x7b6454107eb0> does not support server side cursors

I swapped to pymysql, and this time it worked, but I got a deprecation warning from SQLAlchemy:
SADeprecationWarning: The create_engine.server_side_cursors parameter is deprecated and will be removed in a future release. Please use the Connection.execution_options.stream_results parameter

However, the code executed fine with no further warnings.
Taking the warning's advice, I swapped execute_option=dict(stream=True) back in, and this also worked as expected.
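
In other words, the non-deprecated route the warning points to should look roughly like this (untested sketch; credentials and table name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# engine-level execution_options are applied to every connection,
# so stream_results ends up where the deprecation warning asks for it
engine = create_engine(
    'mysql+pymysql://user:password@host:3306/database',
    execution_options={'stream_results': True},
)

for chunk in pd.read_sql_query('SELECT * FROM big_table', engine, chunksize=10000):
    ...  # process each chunk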
