Stop `data-diff` when maximum time or # different records is exceeded #402
Description
Is your feature request related to a problem? Please describe.
We run `data-diff` for many tables. Sometimes there are a lot of differences between the diffed tables. If so, the data diff for this table pair might take a very long time (multiple hours). I would prefer to skip this diff at a certain point, e.g., when a maximum diff time or # different records is exceeded. For such a diff, I do not care precisely which records differ; it is enough to know that the table is very off.
Describe the solution you'd like
Define a:
- maximum diff time
- OR, a maximum # different records
- OR, a maximum % different records
If this threshold is exceeded, the diff is aborted, with a WARNING or ERROR message, and maybe an Exception.
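To make the requested behaviour concrete, here is a minimal sketch of the three thresholds as a wrapper around any iterable of diff rows. The names `limited_diff`, `DiffLimitExceeded`, and `total_rows` are hypothetical and not part of `data-diff`; the time limit is only checked when a row arrives, and a wrapper like this does not by itself stop background work inside the library.

```python
import logging
import time
from typing import Iterable, Iterator, Optional, Tuple

logger = logging.getLogger(__name__)


class DiffLimitExceeded(Exception):
    """Hypothetical exception: a diff threshold was exceeded."""


def limited_diff(
    diff_rows: Iterable[Tuple[str, tuple]],
    max_seconds: Optional[float] = None,
    max_rows: Optional[int] = None,
    total_rows: Optional[int] = None,   # assumed to be known by the caller
    max_percent: Optional[float] = None,
) -> Iterator[Tuple[str, tuple]]:
    """Yield diff rows until one of the configured thresholds is exceeded."""
    start = time.monotonic()
    count = 0
    for row in diff_rows:
        count += 1
        elapsed = time.monotonic() - start
        if max_seconds is not None and elapsed > max_seconds:
            reason = f"diff time exceeded {max_seconds}s"
        elif max_rows is not None and count > max_rows:
            reason = f"more than {max_rows} different records"
        elif (
            max_percent is not None
            and total_rows
            and 100.0 * count / total_rows > max_percent
        ):
            reason = f"more than {max_percent}% different records"
        else:
            yield row
            continue
        # A threshold was exceeded: warn, then abort the diff.
        logger.warning("Aborting diff: %s", reason)
        raise DiffLimitExceeded(reason)
```

A caller would wrap the diff iterable, for example `for sign, row in limited_diff(rows, max_seconds=600, max_rows=10_000): ...`, and treat `DiffLimitExceeded` as "this table is very off".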
Describe alternatives you've considered
I run `data-diff` programmatically and built this feature myself in the Python script that calls `data-diff`. This did not work as I hoped because `data-diff` uses a `ThreadPool` that continued with the diff after I broke out of the `diff_tables` iterable.
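For illustration, this is roughly the kind of cooperative cancellation that would make breaking out of the iterable actually stop the work. It is a generic sketch using the standard library only, not `data-diff`'s internal code; `diff_segments` and `slow_compare` are hypothetical names, and the workers are assumed to check a shared stop flag between units of work.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Iterator, List


def diff_segments(
    segments: Iterable[int],
    compare: Callable[[int, threading.Event], List[int]],
    max_workers: int = 4,
) -> Iterator[int]:
    """Yield results from worker threads; stop the workers when closed."""
    stop = threading.Event()
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = [pool.submit(compare, seg, stop) for seg in segments]
    try:
        for fut in as_completed(futures):
            yield from fut.result()
    finally:
        # Runs on exhaustion, on an exception, and when the generator is
        # closed (explicitly via .close(), or when it is garbage collected).
        stop.set()                 # ask running workers to return early
        for fut in futures:
            fut.cancel()           # drop work that has not started yet
        pool.shutdown(wait=True)   # wait for running workers to finish


def slow_compare(segment: int, stop: threading.Event) -> List[int]:
    """Hypothetical worker: checks the stop flag between units of work."""
    found = []
    for i in range(5):
        if stop.is_set():
            break
        time.sleep(0.01)           # stands in for a database query
        found.append(segment * 10 + i)
    return found
```

With this shape, a caller that stops early can call `.close()` on the generator (or wrap it in `contextlib.closing`), and no segment comparisons keep running in the background afterwards.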
Additional context