@@ -7,7 +7,7 @@ also find us in `#tools-data-diff` in the [Locally Optimistic Slack.][slack]**
 **data-diff** is a command-line tool and Python library to efficiently diff
 rows across two different databases.
 
-* ⇄  Verifies across [many different databases][dbs] (e.g. Postgres -> Snowflake)
+* ⇄  Verifies across [many different databases][dbs] (e.g. PostgreSQL -> Snowflake)
 * 🔍 Outputs [diff of rows](#example-command-and-output) in detail
 * 🚨 Simple CLI/API to create monitoring and alerts
 * 🔥 Verify 25M+ rows in <10s, and 1B+ rows in ~5min.
@@ -28,7 +28,7 @@ comparing every row.
 
 **†:** The implementation for downloading all rows, against which `data-diff`
 and `count(*)` are compared, is not optimal. It is a single Python multi-threaded
-process. The performance is fairly driver-specific, e.g. Postgres' performs 10x
+process. The performance is fairly driver-specific, e.g. PostgreSQL's performs 10x
 better than MySQL.
 
 ## Table of Contents
@@ -45,7 +45,7 @@ better than MySQL.
 ## Common use-cases
 
 * **Verify data migrations.** Verify that all data was copied when doing a
-  critical data migration. For example, migrating from Heroku Postgres to Amazon RDS.
+  critical data migration. For example, migrating from Heroku PostgreSQL to Amazon RDS.
 * **Verifying data pipelines.** Moving data from a relational database to a
   warehouse/data lake with Fivetran, Airbyte, Debezium, or some other pipeline.
 * **Alerting and maintaining data integrity SLOs.** You can create and monitor
@@ -63,13 +63,13 @@ better than MySQL.
 
 ## Example Command and Output
 
-Below we run a comparison with the CLI for 25M rows in Postgres where the
+Below we run a comparison with the CLI for 25M rows in PostgreSQL where the
 right-hand table is missing a single row with `id=12500048`:
 
 ```
 $ data-diff \
-  postgres://postgres:password@localhost/postgres rating \
-  postgres://postgres:password@localhost/postgres rating_del1 \
+  postgresql://user:password@localhost/database rating \
+  postgresql://user:password@localhost/database rating_del1 \
   --bisection-threshold 100000 \ # for readability, try default first
   --bisection-factor 6 \ # for readability, try default first
   --update-column timestamp \
@@ -111,7 +111,7 @@ $ data-diff \
 
 | Database   | Connection string                                                                 | Status |
 |------------|-----------------------------------------------------------------------------------|--------|
-| Postgres   | `postgres://user:password@hostname:5432/database`                                 | 💚     |
+| PostgreSQL | `postgresql://user:password@hostname:5432/database`                               | 💚     |
 | MySQL      | `mysql://user:password@hostname:3306/database`                                    | 💚     |
 | Snowflake  | `snowflake://user:password@account/database/SCHEMA?warehouse=WAREHOUSE&role=role` | 💚     |
 | Oracle     | `oracle://username:password@hostname/database`                                    | 💛     |
@@ -140,9 +140,9 @@ Requires Python 3.7+ with pip.
 
 ```pip install data-diff```
 
-or when you need extras like mysql and postgres
+or when you need extras like mysql and postgresql
 
-```pip install "data-diff[mysql,pgsql]"```
+```pip install "data-diff[mysql,postgresql]"```
 
 # How to use
 
@@ -185,7 +185,7 @@ logging.basicConfig(level=logging.INFO)
 
 from data_diff import connect_to_table, diff_tables
 
-table1 = connect_to_table("postgres:///", "table_name", "id")
+table1 = connect_to_table("postgresql:///", "table_name", "id")
 table2 = connect_to_table("mysql:///", "table_name", "id")
 
 for different_row in diff_tables(table1, table2):
@@ -201,11 +201,11 @@ In this section we'll be doing a walk-through of exactly how **data-diff**
 works, and how to tune `--bisection-factor` and `--bisection-threshold`.
 
 Let's consider a scenario with an `orders` table with 1M rows. Fivetran is
-replicating it contionously from Postgres to Snowflake:
+replicating it continuously from PostgreSQL to Snowflake:
 
 ```
 ┌─────────────┐                ┌─────────────┐
-│  Postgres   │                │  Snowflake  │
+│ PostgreSQL  │                │  Snowflake  │
 ├─────────────┤                ├─────────────┤
 │             │                │             │
 │             │                │             │
@@ -233,7 +233,7 @@ of the table. Then it splits the table into `--bisection-factor=10` segments of
 
 ```
 ┌──────────────────────┐              ┌──────────────────────┐
-│       Postgres       │              │      Snowflake       │
+│      PostgreSQL      │              │      Snowflake       │
 ├──────────────────────┤              ├──────────────────────┤
 │      id=1..100k      │              │      id=1..100k      │
 ├──────────────────────┤              ├──────────────────────┤
@@ -281,7 +281,7 @@ are the same except `id=100k..200k`:
 
 ```
 ┌──────────────────────┐              ┌──────────────────────┐
-│       Postgres       │              │      Snowflake       │
+│      PostgreSQL      │              │      Snowflake       │
 ├──────────────────────┤              ├──────────────────────┤
 │    checksum=0102     │              │    checksum=0102     │
 ├──────────────────────┤  mismatch!   ├──────────────────────┤
@@ -306,7 +306,7 @@ and compare them in memory in **data-diff**.
 
 ```
 ┌──────────────────────┐              ┌──────────────────────┐
-│       Postgres       │              │      Snowflake       │
+│      PostgreSQL      │              │      Snowflake       │
 ├──────────────────────┤              ├──────────────────────┤
 │    id=100k..110k     │              │    id=100k..110k     │
 ├──────────────────────┤              ├──────────────────────┤
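The checksum bisection walked through above can be sketched in plain Python. This is a simplified illustration, not data-diff's actual implementation: the dict-backed "tables", function names, and parameters are hypothetical stand-ins, and each per-segment checksum here would really be a single aggregate query pushed down to the database.

```python
import hashlib

# Toy stand-ins for database tables: {id: row}. In data-diff these are real
# tables, and checksumming a segment is one query run inside the database.
def segment_checksum(table, lo, hi):
    """Checksum all rows whose key falls in [lo, hi)."""
    digest = hashlib.md5()
    for key in sorted(k for k in table if lo <= k < hi):
        digest.update(repr((key, table[key])).encode())
    return digest.hexdigest()

def diff_segment(t1, t2, lo, hi, bisection_factor=10, bisection_threshold=10000):
    """Return keys of differing rows in [lo, hi) via checksum bisection."""
    if segment_checksum(t1, lo, hi) == segment_checksum(t2, lo, hi):
        return []  # segment identical on both sides: nothing to download
    if hi - lo <= bisection_threshold:
        # Segment is small enough: "download" both sides and compare rows.
        keys = {k for k in t1 if lo <= k < hi} | {k for k in t2 if lo <= k < hi}
        return sorted(k for k in keys if t1.get(k) != t2.get(k))
    # Otherwise split into bisection_factor sub-segments and recurse.
    step = max(1, (hi - lo) // bisection_factor)
    diffs = []
    for start in range(lo, hi, step):
        diffs += diff_segment(t1, t2, start, min(start + step, hi),
                              bisection_factor, bisection_threshold)
    return diffs
```

With this shape, segments whose checksums match cost one query per side and are never downloaded; only the sub-segments that actually differ ever reach the row-by-row comparison.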
@@ -337,7 +337,7 @@ If you pass `--stats` you'll see e.g. what % of rows were different.
   queries.
 * Consider increasing the number of simultaneous threads executing
   queries per database with `--threads`. For databases that limit concurrency
-  per query, e.g. Postgres/MySQL, this can improve performance dramatically.
+  per query, e.g. PostgreSQL/MySQL, this can improve performance dramatically.
 * If you are only interested in _whether_ something changed, pass `--limit 1`.
   This can be useful if changes are very rare. This is often faster than doing a
   `count(*)`, for the reason mentioned above.
@@ -419,7 +419,7 @@ Now you can insert it into the testing database(s):
 ```shell-session
 # It's optional to seed more than one to run data-diff(1) against.
 $ poetry run preql -f dev/prepare_db.pql mysql://mysql:Password1@127.0.0.1:3306/mysql
-$ poetry run preql -f dev/prepare_db.pql postgres://postgres:Password1@127.0.0.1:5432/postgres
+$ poetry run preql -f dev/prepare_db.pql postgresql://postgres:Password1@127.0.0.1:5432/postgres
 
 # Cloud databases
 $ poetry run preql -f dev/prepare_db.pql snowflake://<uri>
@@ -430,7 +430,7 @@ $ poetry run preql -f dev/prepare_db.pql bigquery:///<project>
 **5. Run **data-diff** against seeded database**
 
 ```bash
-poetry run python3 -m data_diff postgres://postgres:Password1@localhost/postgres rating postgres://postgres:Password1@localhost/postgres rating_del1 --verbose
+poetry run python3 -m data_diff postgresql://postgres:Password1@localhost/postgres rating postgresql://postgres:Password1@localhost/postgres rating_del1 --verbose
 ```
 
 # License