
Release 0.21.0

@dhruvesh09 dhruvesh09 released this 21 Jan 19:44


Major Features and Improvements

  • Started depending on the CSV parsing and type inference utilities provided
    by tfx-bsl (since tfx-bsl 0.15.2). This also makes the CSV decoder roughly
    2x faster at decoding; type inference performance is unaffected.
  • Compute bytes statistics for features of BYTES type. Avoid computing top-k
    and uniques for such features.
  • Added LiftStatsGenerator which computes lift between one feature (typically a
    label) and all other categorical features.
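
To make the lift statistic concrete, here is a minimal sketch in plain Python. It assumes the common definition lift(x, y) = P(y | x) / P(y) and is not TFDV's implementation; the function name `lift_table` and its inputs are hypothetical, chosen only for illustration.

```python
from collections import Counter


def lift_table(xs, ys):
    """Return {(x, y): P(y | x) / P(y)} for paired categorical values.

    Hypothetical helper for illustration only; not part of the TFDV API.
    Only (x, y) pairs that actually co-occur in the data are reported.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_counts = Counter(xs)           # occurrences of each feature value
    y_counts = Counter(ys)           # occurrences of each label value
    xy_counts = Counter(zip(xs, ys)) # co-occurrences of (feature, label)
    return {
        (x, y): (count / x_counts[x]) / (y_counts[y] / n)
        for (x, y), count in xy_counts.items()
    }


# Toy data: feature values paired with labels.
features = ["a", "a", "b", "b"]
labels = [1, 1, 0, 1]
table = lift_table(features, labels)
```

A lift above 1 means the label value is more frequent among examples with that feature value than in the dataset overall; a lift below 1 means it is less frequent.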

Bug Fixes and Other Changes

  • Exclude examples in which the entire sparse feature is missing when
    calculating sparse feature statistics.
  • Validate min_examples_count dataset constraint.
  • Document the schema fields, statistics fields, and detection condition for
    each anomaly type that TFDV detects.
  • Handle null array in cross feature stats generator, top-k & uniques combiner
    stats generator, and sklearn mutual information generator.
  • Handle infinity in basic stats generator.
  • Set num_missing and num_examples correctly in the presence of sparse
    features.
  • Compute weighted feature stats for all weighted features declared in schema.
  • Depends on tensorflow-metadata>=0.21.0,<0.22.
  • Depends on pyarrow>=0.15 (removed the upper bound as it is determined by
    tfx-bsl).
  • Depends on tfx-bsl>=0.21.0,<0.22
  • Depends on apache-beam>=2.17,<3

Breaking Changes

  • Changed the behavior regarding statistics over CSV data:

    • Previously, if a CSV column contained a mix of integers and empty
      strings, FLOAT statistics were collected for that column. INT
      statistics are now collected instead.
  • Removed csv_decoder.DecodeCSVToDict, as Dict[str, np.ndarray] has not been
    the internal data representation since 0.14.
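
The CSV change above can be illustrated with a small sketch of column type inference in which empty cells are treated as missing values and do not influence the inferred type. This is an assumption-laden illustration, not the tfx-bsl implementation; the function `infer_column_type` and its string type labels are hypothetical.

```python
def _parses_as(cell, caster):
    """Return True if caster(cell) succeeds (caster is int or float)."""
    try:
        caster(cell)
        return True
    except ValueError:
        return False


def infer_column_type(cells):
    """Infer a type label for a CSV column given its cells as strings.

    Hypothetical helper for illustration only. Empty strings are skipped,
    so a column of integers and empty cells is reported as INT, not FLOAT.
    """
    non_empty = [c for c in cells if c != ""]
    if not non_empty:
        return "UNKNOWN"  # nothing to infer from
    if all(_parses_as(c, int) for c in non_empty):
        return "INT"
    if all(_parses_as(c, float) for c in non_empty):
        return "FLOAT"
    return "STRING"


mixed_int_column = ["1", "", "3"]
inferred = infer_column_type(mixed_int_column)
```

Under this sketch, a column only falls back to FLOAT when at least one non-empty cell fails to parse as an integer but still parses as a float.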

Deprecations