
Commit 2ff1ad0

Merge pull request #20 from horatos/docs/19-results-summary-1
Create the first draft of Results & Discussion
2 parents 7156d82 + e3d6f9d commit 2ff1ad0

2 files changed: 243 additions & 0 deletions

README.md

Lines changed: 25 additions & 0 deletions
@@ -43,6 +43,31 @@ The RDBMS and the Python program each run in separate Docker containers
The program used in step 2 implements each hypothesis as a function. Each function is decorated with `@profile` so that a line-by-line memory profile can be collected, and each function is made launchable through the invoke package.
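
As a rough sketch of that setup (not the repository's actual code; the function name, task name, table, and connection URL below are assumptions), a profiled function exposed as an invoke task could look like this:

```python
# tasks.py (hypothetical): a hypothesis implemented as a function, profiled
# line by line with memory_profiler and exposed as an invoke task.
import pandas as pd
from invoke import task
from memory_profiler import profile
from sqlalchemy import create_engine


@profile  # prints a per-line "Mem usage" / "Increment" report when the function runs
def read_all_rows(db_url: str) -> int:
    conn = create_engine(db_url).connect()
    dataframe = pd.read_sql("SELECT * FROM users", conn)
    return len(dataframe)


@task
def experiment_1(c, db_url="mysql+pymysql://user:password@mysql/test"):
    # Launched from the app container as: invoke experiment-1
    print(read_all_rows(db_url))
```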
## Results
The results of running the experiments with 100,000 rows stored in the database and a chunksize of 1000 are saved in docs/results-0.txt.
For experiment 1, memory usage is taken as the Increment of the `pd.read_sql` line. From the recorded run, usage is 45.9 MiB for MySQL 8 and 68.5 MiB for PostgreSQL.
For experiment 2, memory usage is taken as the sum of the Increments of the `pd.read_sql` and `for chunk in it` lines. From the recorded run, usage is 40.6 MiB (40.1 + 0.5) for MySQL 8 and 30.9 MiB (29.5 + 1.4) for PostgreSQL.
For experiment 3, memory usage is taken, as in experiment 1, as the Increment of the `pd.read_sql` line. From the recorded run, usage is 43.4 MiB for MySQL 8 and 71.4 MiB for PostgreSQL.
For experiment 4, memory usage is taken as the sum of the Increments of the `pd.read_sql` and `for chunk in it` lines. From the recorded run, usage is 2.5 MiB (0.4 + 2.1) for MySQL 8 and 3.8 MiB (1.6 + 2.2) for PostgreSQL.
|              |  MySQL 8 | PostgreSQL |
|:------------:|---------:|-----------:|
| Experiment 1 | 45.9 MiB |   68.5 MiB |
| Experiment 2 | 40.6 MiB |   30.9 MiB |
| Experiment 3 | 43.4 MiB |   71.4 MiB |
| Experiment 4 |  2.5 MiB |    3.8 MiB |
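
For reference, the experiment 4 pattern, which produced the smallest footprint, combines a server-side cursor (`stream_results=True`) with a chunked read. A minimal sketch, assuming a placeholder connection URL instead of the repository's `create_db_engine` helper:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder URL; the experiments build their engines via create_db_engine.
engine = create_engine("postgresql+psycopg2://user:password@postgres/test")
conn = engine.connect().execution_options(stream_results=True)  # server-side cursor

# With chunksize, pd.read_sql returns an iterator of DataFrames (up to 1000 rows each),
# so only one chunk needs to be held in memory at a time.
total = 0
for chunk in pd.read_sql("SELECT * FROM users", conn, chunksize=1000):
    total += len(chunk)
```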
## Discussion
Contrary to expectations, it turned out that server-side cursors can be used with both MySQL and PostgreSQL.
The fact that the server is MySQL 8 may be a factor; alternatively, it may simply be that PyMySQL supports them.
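
One way to probe the driver side of that question, independently of SQLAlchemy and pandas, would be PyMySQL's unbuffered cursor class. This is only a sketch with placeholder connection parameters, not part of the experiments:

```python
import pymysql
from pymysql.cursors import SSCursor  # unbuffered cursor: rows are streamed, not pre-fetched

conn = pymysql.connect(host="mysql", user="user", password="password",
                       database="test", cursorclass=SSCursor)
with conn.cursor() as cur:
    cur.execute("SELECT * FROM users")
    for row in cur:  # rows are fetched incrementally from the server
        pass
conn.close()
```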
## References
[^1]: [Loading SQL data into Pandas without running out of memory](https://pythonspeed.com/articles/pandas-sql-chunking/)

docs/results-0.txt

Lines changed: 218 additions & 0 deletions
@@ -0,0 +1,218 @@
#1 [internal] load build definition from Dockerfile
#1 sha256:c2f2ac51cfc1635c596013c9855da173a3818a170154caae19890f352b6abae8
#1 transferring dockerfile: 32B done
#1 DONE 0.0s

#2 [internal] load .dockerignore
#2 sha256:5ff0849c8ee3daa45c339d2a8dae28af9364f4bb8d2db6e0f13dc65674775558
#2 transferring context: 2B done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/library/python:3.10
#3 sha256:c787ec5cc33e1cbee663ba2529b5d51f0293f2c277b40d9bd37129383a68d5ac
#3 DONE 0.8s

#6 [internal] load build context
#6 sha256:326f51692c8872a5c280f97aff8e7f829f6aeb902b6d6e7080184a45acdc7f1f
#6 transferring context: 188B done
#6 DONE 0.0s

#4 [1/5] FROM docker.io/library/python:3.10@sha256:d4685e083565b8d6290e2b19c367a1ad6623129a4968e187c803b12fefb38c0c
#4 sha256:777e175c3abfb2243123bd3d2f662bfcdc7f7b8a73a141b8ff1bf9b1df79aabc
#4 resolve docker.io/library/python:3.10@sha256:d4685e083565b8d6290e2b19c367a1ad6623129a4968e187c803b12fefb38c0c 0.0s done
#4 DONE 0.0s

#8 [4/5] RUN pip install --no-cache-dir -r requirements.txt
#8 sha256:c52d91651a64cc49d7c3035d618dcbfcacf5f56229c8e559429b58c111c9d605
#8 CACHED

#5 [2/5] WORKDIR /usr/src/app
#5 sha256:495bb222b3141b1c79d577574c909f24d8131b5a83248cfc2d783d05a57770e1
#5 CACHED

#7 [3/5] COPY requirements.txt ./
#7 sha256:78a0ee74f4190d52c8fa150aae4699deae42dac0d50df19ff0db4c8a26e70a98
#7 CACHED

#9 [5/5] COPY . .
#9 sha256:cca02203394cf1b36613cb92c7750dafe763cf4a1980078ca219f5b8ffe3982e
#9 CACHED

#10 exporting to image
#10 sha256:e8c613e07b0b7ff33893b694f7759a10d42e180f2b4dc349fb57dc6b71dcab00
#10 exporting layers done
#10 writing image sha256:e57ce2fb86de21881e045e6c21aa0f8af05be6aefb22cf7cf9f9d445e916e6f2 done
#10 naming to docker.io/library/measure-pandas-read-sql_app done
#10 DONE 0.0s
Network measure-pandas-read-sql_default Creating
Network measure-pandas-read-sql_default Created
Container measure-pandas-read-sql-mysql-1 Creating
Container measure-pandas-read-sql-postgres-1 Creating
Container measure-pandas-read-sql-mysql-1 Created
Container measure-pandas-read-sql-postgres-1 Created
Container measure-pandas-read-sql-mysql-1 Starting
Container measure-pandas-read-sql-postgres-1 Starting
Container measure-pandas-read-sql-mysql-1 Started
Container measure-pandas-read-sql-postgres-1 Started
[INFO] init: Start waiting databases initialization
[INFO] init: Done waiting databases initialization
[INFO] initialize_mysql: Start adding 100000 rows
[INFO] initialize_mysql: Added rows count = 100000
[INFO] initialize_postgres: Start adding 100000 rows
[INFO] initialize_postgres: Added rows count = 100000
[INFO] exec_experiment_1: Execute the experiment#1 with db = mysql
[INFO] exec_experiment_1: Got 100000 records
Filename: /usr/src/app/experiments.py

Line # Mem usage Increment Occurrences Line Contents
=============================================================
12 74.8 MiB 74.8 MiB 1 @profile
13 def exec_experiment_1(db: str):
14 74.8 MiB 0.0 MiB 1 logger.info("Execute the experiment#1 with db = %s", db)
15
16 77.6 MiB 2.8 MiB 1 conn = create_db_engine(db).connect()
17 123.6 MiB 45.9 MiB 1 dataframe = pd.read_sql("SELECT * FROM users", conn)
18
19 123.6 MiB 0.0 MiB 1 logger.info("Got %s records", len(dataframe))

[INFO] exec_experiment_2: Execute the experiment#2 with db = mysql, chunksize = 1000
[INFO] exec_experiment_2: Got 100000 records
Filename: /usr/src/app/experiments.py

Line # Mem usage Increment Occurrences Line Contents
=============================================================
22 74.8 MiB 74.8 MiB 1 @profile
23 def exec_experiment_2(db: str, chunksize: int):
24 74.8 MiB 0.0 MiB 1 logger.info("Execute the experiment#2 with db = %s, chunksize = %s", db, chunksize)
25 74.8 MiB 0.0 MiB 1 total = 0
26 74.8 MiB 0.0 MiB 1 chunksize = int(chunksize)
27
28 77.6 MiB 2.8 MiB 1 conn = create_db_engine(db).connect()
29 117.6 MiB 40.1 MiB 1 it = pd.read_sql("SELECT * FROM users", conn, chunksize=chunksize)
30 118.2 MiB 0.5 MiB 101 for chunk in it:
31 118.2 MiB 0.0 MiB 100 total += len(chunk)
32
33 118.2 MiB 0.0 MiB 1 logger.info("Got %s records", total)

[INFO] exec_experiment_3: Execute the experiment#3 with db = mysql
[INFO] exec_experiment_3: Got 100000 records
Filename: /usr/src/app/experiments.py

Line # Mem usage Increment Occurrences Line Contents
=============================================================
36 74.7 MiB 74.7 MiB 1 @profile
37 def exec_experiment_3(db: str):
38 74.7 MiB 0.0 MiB 1 logger.info("Execute the experiment#3 with db = %s", db)
39
40 77.4 MiB 2.8 MiB 1 conn = create_db_engine(db).connect().execution_options(stream_results=True)
41 120.9 MiB 43.4 MiB 1 dataframe = pd.read_sql("SELECT * FROM users", conn)
42
43 120.9 MiB 0.0 MiB 1 logger.info("Got %s records", len(dataframe))

[INFO] exec_experiment_4: Execute the experiment#4 with db = mysql, chunksize = 1000
[INFO] exec_experiment_4: Got 100000 records
Filename: /usr/src/app/experiments.py

Line # Mem usage Increment Occurrences Line Contents
=============================================================
46 74.8 MiB 74.8 MiB 1 @profile
47 def exec_experiment_4(db: str, chunksize: int):
48 74.8 MiB 0.0 MiB 1 logger.info("Execute the experiment#4 with db = %s, chunksize = %s", db, chunksize)
49 74.8 MiB 0.0 MiB 1 total = 0
50 74.8 MiB 0.0 MiB 1 chunksize = int(chunksize)
51
52 77.6 MiB 2.8 MiB 1 conn = create_db_engine(db).connect().execution_options(stream_results=True)
53 78.0 MiB 0.4 MiB 1 it = pd.read_sql("SELECT * FROM users", conn, chunksize=chunksize)
54 80.1 MiB 2.1 MiB 101 for chunk in it:
55 80.1 MiB 0.0 MiB 100 total += len(chunk)
56
57 80.1 MiB 0.0 MiB 1 logger.info("Got %s records", total)

[INFO] exec_experiment_1: Execute the experiment#1 with db = postgres
[INFO] exec_experiment_1: Got 100000 records
Filename: /usr/src/app/experiments.py

Line # Mem usage Increment Occurrences Line Contents
=============================================================
12 74.5 MiB 74.5 MiB 1 @profile
13 def exec_experiment_1(db: str):
14 74.5 MiB 0.0 MiB 1 logger.info("Execute the experiment#1 with db = %s", db)
15
16 78.8 MiB 4.3 MiB 1 conn = create_db_engine(db).connect()
17 147.3 MiB 68.5 MiB 1 dataframe = pd.read_sql("SELECT * FROM users", conn)
18
19 147.3 MiB 0.0 MiB 1 logger.info("Got %s records", len(dataframe))

[INFO] exec_experiment_2: Execute the experiment#2 with db = postgres, chunksize = 1000
[INFO] exec_experiment_2: Got 100000 records
Filename: /usr/src/app/experiments.py

Line # Mem usage Increment Occurrences Line Contents
=============================================================
22 74.6 MiB 74.6 MiB 1 @profile
23 def exec_experiment_2(db: str, chunksize: int):
24 74.6 MiB 0.0 MiB 1 logger.info("Execute the experiment#2 with db = %s, chunksize = %s", db, chunksize)
25 74.6 MiB 0.0 MiB 1 total = 0
26 74.6 MiB 0.0 MiB 1 chunksize = int(chunksize)
27
28 78.9 MiB 4.2 MiB 1 conn = create_db_engine(db).connect()
29 108.4 MiB 29.5 MiB 1 it = pd.read_sql("SELECT * FROM users", conn, chunksize=chunksize)
30 110.2 MiB 1.4 MiB 101 for chunk in it:
31 110.2 MiB 0.0 MiB 100 total += len(chunk)
32
33 109.8 MiB -0.5 MiB 1 logger.info("Got %s records", total)

[INFO] exec_experiment_3: Execute the experiment#3 with db = postgres
[INFO] exec_experiment_3: Got 100000 records
Filename: /usr/src/app/experiments.py

Line # Mem usage Increment Occurrences Line Contents
=============================================================
36 74.8 MiB 74.8 MiB 1 @profile
37 def exec_experiment_3(db: str):
38 74.8 MiB 0.0 MiB 1 logger.info("Execute the experiment#3 with db = %s", db)
39
40 79.0 MiB 4.3 MiB 1 conn = create_db_engine(db).connect().execution_options(stream_results=True)
41 150.5 MiB 71.4 MiB 1 dataframe = pd.read_sql("SELECT * FROM users", conn)
42
43 150.5 MiB 0.0 MiB 1 logger.info("Got %s records", len(dataframe))

[INFO] exec_experiment_4: Execute the experiment#4 with db = postgres, chunksize = 1000
[INFO] exec_experiment_4: Got 100000 records
Filename: /usr/src/app/experiments.py

Line # Mem usage Increment Occurrences Line Contents
=============================================================
46 74.6 MiB 74.6 MiB 1 @profile
47 def exec_experiment_4(db: str, chunksize: int):
48 74.6 MiB 0.0 MiB 1 logger.info("Execute the experiment#4 with db = %s, chunksize = %s", db, chunksize)
49 74.6 MiB 0.0 MiB 1 total = 0
50 74.6 MiB 0.0 MiB 1 chunksize = int(chunksize)
51
52 78.9 MiB 4.3 MiB 1 conn = create_db_engine(db).connect().execution_options(stream_results=True)
53 80.5 MiB 1.6 MiB 1 it = pd.read_sql("SELECT * FROM users", conn, chunksize=chunksize)
54 82.7 MiB 2.2 MiB 101 for chunk in it:
55 82.7 MiB 0.0 MiB 100 total += len(chunk)
56
57 82.7 MiB 0.0 MiB 1 logger.info("Got %s records", total)

Container measure-pandas-read-sql-postgres-1 Stopping
Container measure-pandas-read-sql-mysql-1 Stopping
Container measure-pandas-read-sql-mysql-1 Stopping
Container measure-pandas-read-sql-postgres-1 Stopping
Container measure-pandas-read-sql-postgres-1 Stopped
Container measure-pandas-read-sql-postgres-1 Removing
Container measure-pandas-read-sql-postgres-1 Removed
Container measure-pandas-read-sql-mysql-1 Stopped
Container measure-pandas-read-sql-mysql-1 Removing
Container measure-pandas-read-sql-mysql-1 Removed
Network measure-pandas-read-sql_default Removing
Network measure-pandas-read-sql_default Removed
