Reading table with chunksize still pumps the memory #12265
Possibly.
Side note, there may be better ways to do database migration. E.g. the author of SQLAlchemy also has a database migration tool: https://pypi.python.org/pypi/alembic |
Thanks for your prompt reply. PG is also an option for me instead of MSSQL, but I'll try alembic first. Cheers |
Example PG parameters for this would be very nice ;) |
In principle, I think it should be something like this:
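(Something along these lines, presumably; the connection string, table name, and chunk size below are placeholders:)

import pandas as pd
from sqlalchemy import create_engine

# stream_results asks the DBAPI for a server-side cursor, so rows are
# fetched as you iterate instead of being buffered in memory up front
engine = create_engine('postgresql://user:pass@localhost:5432/mydb',
                       execution_options=dict(stream_results=True))

for chunk in pd.read_sql('SELECT * FROM my_table', engine, chunksize=10000):
    print(len(chunk))  # do something useful with each DataFrame chunk here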
But I never tested this myself. It would be interesting to hear experiences with it. |
Thanks Joris, you seem like a really nice person, but unfortunately your snippet doesn't seem to be enough:

import pandas as pd
from sqlalchemy import create_engine

my_engine = create_engine("mysql+pymysql://root:pass@localhost/gen")
pg_engine = create_engine('postgresql://postgres:postgres@localhost:5432/gen',
                          execution_options=dict(stream_results=True))

# tables is assumed to be defined elsewhere (e.g. a dict keyed by table name)
for table_name in tables.keys():
    for table in pd.read_sql('SELECT * FROM %s' % table_name,
                             my_engine,
                             chunksize=10000):
        table.to_sql(name=table_name, con=pg_engine, if_exists='append')

I mean this is just for reference. It would be nice if we could migrate data in chunks to PG just by using pandas. Anyway, I'm reading the alembic docs and will post a simple script here if it's as simple as in pandas :) |
Ah, but note that the |
Several days later, for reference... Alembic was too complicated for my concentration. I tried the FME and Navicat apps; while the latter didn't manage to migrate all tables through "Data transfer", the former migrated successfully, but although the MySQL tables were encoded in UTF-8 it didn't use it. So I used Python (^_^):

#!/usr/bin/env python3
import pandas as pd
from sqlalchemy import create_engine

my_engine = create_engine("mysql+pymysql://root:pass@localhost/gen?charset=utf8")
ms_engine = create_engine('mssql+pyodbc://localhost/gen?driver=SQL Server')

chunksize = 10000

for table_name in ['topics', 'fiction', 'compact']:
    # total number of rows in the source table
    row_count = int(pd.read_sql('SELECT COUNT(*) FROM {table_name}'.format(
        table_name=table_name), my_engine).values)
    for i in range(int(row_count / chunksize) + 1):
        # read one chunk from MySQL and append it to SQL Server
        query = 'SELECT * FROM {table_name} LIMIT {offset}, {chunksize}'.format(
            table_name=table_name, offset=i * chunksize, chunksize=chunksize)
        pd.read_sql_query(query, con=my_engine).to_sql(
            name=table_name, con=ms_engine, if_exists='append', index=False)
|
@tfurmston I notice you removed your comment. But, it was a very useful comment, so if you want, feel free to add it again. |
Thank you @klonuo, I'm using your solution myself. Question: what if the sqlalchemy engine had another boolean option whereby, if selected together with chunksize, this simple loop with limits is done in the background? Curious what others would think. Or is it better to be explicit, as in @klonuo's solution? |
I see that server-side cursors are supported in sqlalchemy now (new in version 1.1.4): I have verified that
returns a row immediately (i.e. the client doesn't read the complete table into memory). This should be useful to allow read_sql to read in chunks and avoid memory problems. Passing the parameter chunk to fetchmany: |
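(For illustration, roughly how that can be wired up; the connection string, table name, and chunk size are placeholders, not from the original comment:)

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('mysql+pymysql://user:pass@localhost/mydb')

with engine.connect() as conn:
    # stream_results requests a server-side cursor, so rows are not
    # buffered client-side when the query is executed
    result = conn.execution_options(stream_results=True).execute(
        text('SELECT * FROM my_table'))
    while True:
        rows = result.fetchmany(10000)  # pull one batch of rows at a time
        if not rows:
            break
        df = pd.DataFrame(rows, columns=list(result.keys()))
        # ... process df here ...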
I know I'm pretty late to the party, but I use the OFFSET clause in my SQL queries, wrapped inside a for loop, to gather the data in chunks. |
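(For example, roughly; the engine URL, table, and ordering column are placeholders:)

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:pass@localhost/mydb')
chunksize = 10000
offset = 0

while True:
    # page through the table one chunk at a time using LIMIT/OFFSET
    df = pd.read_sql_query(
        'SELECT * FROM my_table ORDER BY id LIMIT {} OFFSET {}'.format(chunksize, offset),
        engine)
    if df.empty:
        break
    # ... process or write out df here ...
    offset += chunksize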
@alfonsomhc I have now tried that, and even with |
Guys, |
I also had this issue. I was using mysqlconnector in the connection string. Passing the parameter and then removing it, I swapped to pymysql. This time it worked, but I got a deprecation warning from SQLAlchemy. However, the code executed fine with no further warnings. |
I'm trying to migrate database tables from MySQL to SQL Server:
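(Roughly this pattern; the connection strings and table list here are placeholders:)

import pandas as pd
from sqlalchemy import create_engine

my_engine = create_engine('mysql+pymysql://user:pass@localhost/mydb')
ms_engine = create_engine('mssql+pyodbc://localhost/mydb?driver=SQL Server')

for table_name in ['table_a', 'table_b']:
    # read the MySQL table in chunks and append each chunk to SQL Server
    for chunk in pd.read_sql('SELECT * FROM %s' % table_name, my_engine, chunksize=10000):
        chunk.to_sql(name=table_name, con=ms_engine, if_exists='append', index=False)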
I thought that using chunksize would release the memory, but it just keeps growing.
I also tried the garbage collector, but it had no effect.
Maybe my expectations were wrong?
I'm using Python 3.5.1 with pandas 0.17.1 and all the latest packages, although I also tried Python 2.7 with pandas 0.16 with the same results.