Release 0.21.0
Major Features and Improvements
- Started depending on the CSV parsing / type inferring utilities provided by
  tfx-bsl (since tfx-bsl 0.15.2). This also brings performance improvements
  to the CSV decoder (~2x faster in decoding; type inferring performance is
  not affected).
- Compute bytes statistics for features of BYTES type. Avoid computing topk and
  uniques for such features.
- Added LiftStatsGenerator which computes lift between one feature (typically a
  label) and all other categorical features.
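
For readers unfamiliar with lift, the sketch below computes it under the usual definition, lift(x, y) = P(x, y) / (P(x) * P(y)), for pairs of categorical values. It is a plain-Python illustration only, not the LiftStatsGenerator implementation; the function name `lift_table` and its input format are assumptions made for this example, and the real generator may differ in details such as weighting or handling of missing values.

```python
from collections import Counter


def lift_table(pairs):
    """Return {(x, y): lift} for a sequence of (feature_value, label) pairs.

    Hypothetical helper for illustration; not part of TFDV.
    lift(x, y) = P(x, y) / (P(x) * P(y)) = count(x, y) * n / (count(x) * count(y))
    """
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)                  # co-occurrence counts of (x, y)
    x_counts = Counter(x for x, _ in pairs)  # marginal counts of the feature
    y_counts = Counter(y for _, y in pairs)  # marginal counts of the label
    return {
        (x, y): count * n / (x_counts[x] * y_counts[y])
        for (x, y), count in joint.items()
    }


if __name__ == "__main__":
    examples = [("a", 1), ("a", 1), ("b", 0), ("b", 1)]
    for key, value in sorted(lift_table(examples).items()):
        print(key, round(value, 3))
```

A lift above 1 means the feature value and the label value co-occur more often than they would if independent; a lift below 1 means they co-occur less often.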
Bug Fixes and Other Changes
- Exclude examples in which the entire sparse feature is missing when
  calculating sparse feature statistics.
- Validate min_examples_count dataset constraint.
- Document the schema fields, statistics fields, and detection condition for
  each anomaly type that TFDV detects.
- Handle null array in cross feature stats generator, top-k & uniques combiner
  stats generator, and sklearn mutual information generator.
- Handle infinity in basic stats generator.
- Set num_missing and num_examples correctly in the presence of sparse
  features.
- Compute weighted feature stats for all weighted features declared in schema.
- Depends on tensorflow-metadata>=0.21.0,<0.22.
- Depends on pyarrow>=0.15 (removed the upper bound as it is determined by
  tfx-bsl).
- Depends on tfx-bsl>=0.21.0,<0.22.
- Depends on apache-beam>=2.17,<3.
Breaking Changes
- Changed the behavior of statistics over CSV data:
  - Previously, if a CSV column contained a mix of integers and empty strings,
    FLOAT statistics were collected for that column. INT statistics are now
    collected instead.
- Removed csv_decoder.DecodeCSVToDict, as Dict[str, np.ndarray] has not been
  the internal data representation since 0.14.
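
The CSV behavior change above can be pictured with a small type-inference sketch: empty cells are treated as missing values and do not influence the inferred type, so a column of integers and empty strings is inferred as INT. This is a simplified illustration under that assumption, not the tfx-bsl inference code; the function name `infer_column_type` and the INT < FLOAT < STRING widening order are choices made for this example.

```python
# Widening order assumed for this sketch: a column takes the widest type seen.
_RANK = {"INT": 0, "FLOAT": 1, "STRING": 2}


def _cell_type(cell):
    """Classify one non-empty cell using Python's own int/float parsing."""
    try:
        int(cell)
        return "INT"
    except ValueError:
        pass
    try:
        float(cell)
        return "FLOAT"
    except ValueError:
        return "STRING"


def infer_column_type(cells):
    """Return 'INT', 'FLOAT', 'STRING', or None if every cell is empty.

    Hypothetical helper for illustration; not part of TFDV or tfx-bsl.
    """
    inferred = None
    for cell in cells:
        if cell == "":
            continue  # empty cell: a missing value, does not affect the type
        cell_type = _cell_type(cell)
        if inferred is None or _RANK[cell_type] > _RANK[inferred]:
            inferred = cell_type
    return inferred


if __name__ == "__main__":
    print(infer_column_type(["1", "", "3"]))    # integers + empty strings
    print(infer_column_type(["1", "2.5", ""]))  # a float widens the column
```

Under the pre-0.21.0 behavior described above, the first column would have produced FLOAT statistics; pipelines that compare statistics or schemas across versions may need to account for that difference.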