Release 0.21.0

dhruvesh09 released this 21 Jan 19:44

aafb9eb

Major Features and Improvements

Now depends on the CSV parsing and type inference utilities provided by
tfx-bsl (since tfx-bsl 0.15.2). This also makes the CSV decoder roughly 2x
faster at decoding; type inference performance is unaffected.
Compute bytes statistics for features of BYTES type. Avoid computing topk and
uniques for such features.
Added LiftStatsGenerator which computes lift between one feature (typically a
label) and all other categorical features.
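
As an illustration of the quantity involved, lift between a feature value x and a label value y is commonly defined as P(Y=y | X=x) / P(Y=y). The sketch below computes it in plain Python from (x, y) pairs. It does not use LiftStatsGenerator or any TFDV API; the `lift` function and the sample data are hypothetical and only show the arithmetic.

```python
from collections import Counter


def lift(pairs):
    """Return {(x, y): lift} for an iterable of (x, y) pairs.

    Hypothetical helper, not part of TFDV.
    lift(x, y) = P(Y=y | X=x) / P(Y=y)
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        return {}
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    xy_counts = Counter(pairs)

    result = {}
    for (x, y), joint in xy_counts.items():
        p_y_given_x = joint / x_counts[x]  # P(Y=y | X=x)
        p_y = y_counts[y] / n              # P(Y=y)
        result[(x, y)] = p_y_given_x / p_y
    return result


# Made-up data: a categorical feature paired with a binary label.
examples = [
    ("red", 1), ("red", 1), ("red", 0),
    ("blue", 0), ("blue", 0), ("blue", 1),
]
lifts = lift(examples)
```

A lift above 1 means the label value is more likely than average when the feature takes that value; a lift below 1 means it is less likely.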

Bug Fixes and Other Changes

Exclude examples in which the entire sparse feature is missing when
calculating sparse feature statistics.
Validate min_examples_count dataset constraint.
Document the schema fields, statistics fields, and detection condition for
each anomaly type that TFDV detects.
Handle null array in cross feature stats generator, top-k & uniques combiner
stats generator, and sklearn mutual information generator.
Handle infinity in basic stats generator.
Set num_missing and num_examples correctly in the presence of sparse
features.
Compute weighted feature stats for all weighted features declared in schema.
Depends on tensorflow-metadata>=0.21.0,<0.22.
Depends on pyarrow>=0.15 (removed the upper bound as it is determined by
tfx-bsl).
Depends on tfx-bsl>=0.21.0,<0.22
Depends on apache-beam>=2.17,<3

Breaking Changes

Changed how statistics are collected over CSV data:
- Previously, if a CSV column contained a mix of integers and empty strings,
  FLOAT statistics were collected for that column. INT statistics are now
  collected instead.
Removed csv_decoder.DecodeCSVToDict, as Dict[str, np.ndarray] has not been
the internal data representation since 0.14.
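
The CSV change above can be pictured with a small sketch. It assumes empty cells are treated as missing values and ignored during type inference, which is one way to arrive at INT for a column of integers and empty strings. This is not the tfx-bsl implementation; `infer_column_type` and `_cell_type` are hypothetical names used only for illustration.

```python
# Types ordered from narrowest to widest.
_RANK = {"INT": 0, "FLOAT": 1, "BYTES": 2}


def _cell_type(cell):
    """Classify a single non-empty CSV cell (hypothetical helper)."""
    try:
        int(cell)
        return "INT"
    except ValueError:
        pass
    try:
        float(cell)
        return "FLOAT"
    except ValueError:
        return "BYTES"


def infer_column_type(cells):
    """Return the widest type seen in a column, or None if every cell is empty.

    Assumption: an empty string means "missing" and does not widen the type.
    """
    inferred = None
    for cell in cells:
        if cell == "":
            continue  # a missing value does not affect the inferred type
        cell_type = _cell_type(cell)
        if inferred is None or _RANK[cell_type] > _RANK[inferred]:
            inferred = cell_type
    return inferred


int_with_gaps = infer_column_type(["1", "", "3"])
float_with_gaps = infer_column_type(["1", "", "2.5"])
```

Under this assumption the first column is inferred as INT and the second as FLOAT, since only a non-empty, non-integer cell widens the type.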

Deprecations
