This repository was archived by the owner on May 17, 2024. It is now read-only.

Stop data-diff when maximum time or # different records is exceeded #402

Closed
@JCZuurmond

Description

Is your feature request related to a problem? Please describe.

We run data-diff for many tables. Sometimes the diffed tables differ in a large number of records, and the diff for that table pair can then take a very long time (multiple hours). I would prefer to abort such a diff at a certain point, e.g., when a maximum diff time or a maximum number of different records is exceeded. For such a diff, I do not care precisely which records differ; it is enough to know that the table is very far off.

Describe the solution you'd like

Define one of the following thresholds:

  • a maximum diff time
  • a maximum # of different records
  • a maximum % of different records

If this threshold is exceeded, the diff is aborted, with a WARNING or ERROR message, and maybe an Exception.
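
The thresholds above could be sketched as a wrapper around any iterable of diff records. This is only a minimal sketch and not part of data-diff's API: `limited_diff`, `DiffLimitExceeded`, and all parameter names are hypothetical, and `total_rows` has to be supplied by the caller for the percentage check to apply. The time limit is only checked when a record arrives, so it cannot interrupt a long-running query.

```python
import time


class DiffLimitExceeded(RuntimeError):
    """Raised when a diff exceeds a configured threshold (hypothetical name)."""


def limited_diff(diff_iter, max_seconds=None, max_rows=None,
                 max_fraction=None, total_rows=None, clock=time.monotonic):
    """Yield records from diff_iter until one of the thresholds is exceeded.

    max_fraction is a ratio between 0 and 1 and is only checked when
    total_rows is given. All names here are assumptions for illustration.
    """
    start = clock()
    count = 0
    for record in diff_iter:
        count += 1
        if max_rows is not None and count > max_rows:
            raise DiffLimitExceeded(f"more than {max_rows} different records")
        if max_fraction is not None and total_rows:
            if count / total_rows > max_fraction:
                raise DiffLimitExceeded(
                    f"more than {max_fraction:.0%} of records differ"
                )
        if max_seconds is not None and clock() - start > max_seconds:
            raise DiffLimitExceeded(f"diff took longer than {max_seconds}s")
        yield record
```

Raising from the wrapper only stops the consumer. Any worker threads that feed the wrapped iterable would still need to be stopped separately.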

Describe alternatives you've considered

I run data-diff programmatically and built this feature myself in the Python script that calls data-diff. This did not work as I hoped, because data-diff uses a ThreadPool that kept running the diff after I broke out of the diff_tables iterable.
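
The behaviour described above is typical of a generator backed by a thread pool: work that was already submitted keeps running after the consumer stops iterating. The following is a generic sketch of one possible remedy, not data-diff's actual internals. `diff_segments`, `compare`, and `stop` are hypothetical names, and `cancel_futures` assumes Python 3.9 or newer.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def _guarded(compare, segment, stop):
    """Run compare(segment) unless a stop was requested before it started."""
    if stop.is_set():
        return []
    return list(compare(segment))


def diff_segments(segments, compare, stop, max_workers=4):
    """Yield diff records segment by segment from a thread pool.

    When the generator is closed early, the finally block sets `stop` and
    cancels queued tasks. A task that is already running is not interrupted;
    it only checks `stop` before it starts.
    """
    pool = ThreadPoolExecutor(max_workers=max_workers)
    try:
        futures = [pool.submit(_guarded, compare, seg, stop) for seg in segments]
        for future in futures:
            yield from future.result()
    finally:
        stop.set()  # also set after a normal, complete run
        pool.shutdown(wait=True, cancel_futures=True)
```

A caller that breaks out early would close the generator explicitly, for example with `gen.close()` or `contextlib.closing`, so that the cleanup runs immediately instead of whenever the generator is garbage collected.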

Additional context

Metadata

Assignees

No one assigned

    Labels

    • --dbt: Issues/features related to the dbt integration
    • enhancement: New feature or request
    • non-dbt: Use cases outside of dbt
    • stale_immune: Immunity to stale bot
