Releases: tensorflow/data-validation

# Version 0.23.0

14 Aug 21:34
ee57dbf

Major Features and Improvements

  • Data validation is now able to handle arbitrarily nested Arrow
    List/LargeList types. Schema entries for features with multiple nest levels
    describe the value count at each level in the value_counts field.
  • Add combiner stats generator to estimate top-K and uniques using Misra-Gries
    and K-Minimum Values sketches.
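
As a rough illustration of the two sketches named above (not TFDV's implementation), the following standalone Python shows how a Misra-Gries summary approximates top-K counts and how a K-Minimum Values sketch estimates the number of unique values. The function names, the SHA-256 based hash, and the parameter k are assumptions made for this example.

```python
import bisect
import hashlib
from typing import Dict, Hashable, Iterable, List


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Misra-Gries summary with at most k - 1 counters (requires k >= 2).

    Each reported count undercounts the true frequency by at most n / k,
    where n is the stream length.
    """
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def _unit_hash(value: str) -> float:
    """Maps a string to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_distinct_estimate(stream: Iterable[str], k: int) -> float:
    """Estimates the number of distinct values from the k smallest hashes."""
    smallest: List[float] = []  # sorted, holds at most k distinct hash values
    for value in stream:
        h = _unit_hash(value)
        if len(smallest) == k and h >= smallest[-1]:
            continue  # not among the k smallest hashes seen so far
        pos = bisect.bisect_left(smallest, h)
        if pos < len(smallest) and smallest[pos] == h:
            continue  # this hash is already tracked
        smallest.insert(pos, h)
        if len(smallest) > k:
            smallest.pop()
    if len(smallest) < k:
        return float(len(smallest))  # fewer than k distinct values: exact
    return (k - 1) / smallest[-1]
```

Both structures use memory bounded by k regardless of stream length, which is the property that makes this kind of sketch attractive for estimating top-K and uniques over large datasets.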

Bug Fixes and Other Changes

  • Validate that enough supported images are present (if
    image_domain.minimum_supported_image_fraction is provided).
  • Stopped requiring avro-python3.
  • Depends on apache-beam[gcp]>=2.23,<3.
  • Depends on pyarrow>=0.17,<0.18.
  • Depends on tensorflow>=1.15.2,!=2.0.*,!=2.1.*,!=2.2.*,<3.
  • Depends on tensorflow-metadata>=0.23,<0.24.
  • Depends on tensorflow-transform>=0.23,<0.24.
  • Depends on tfx-bsl>=0.23,<0.24.

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • N/A

TFDV 0.22.2 Release

29 Jun 22:30

Major Features and Improvements

Bug Fixes and Other Changes

  • Fixed a bug that prevented tfx 0.22.0 from working with TFDV 0.22.1.
  • Depends on 'avro-python3>=1.8.1,<1.9.2' on Python 3.5 + MacOS

Known Issues

Breaking Changes

Deprecations

TFDV 0.22.1 Release

24 Jun 23:31

Major Features and Improvements

  • Statistics generation is now able to handle arbitrarily nested Arrow
    List/LargeList types. Stats about the list elements' presence and valency
    are computed at each nest level, and stored in a newly added field,
    valency_and_presence_stats in CommonStatistics.
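
To make "presence and valency at each nest level" concrete, here is a minimal pure-Python sketch over nested Python lists, where None stands in for a missing (null) list. It is not TFDV code: the function name and the dictionary keys are chosen for this illustration only.

```python
from typing import Dict, List, Optional


def presence_and_valency_by_level(
    column: List[Optional[list]], depth: int
) -> List[Dict[str, int]]:
    """Returns one stats dict per nest level, outermost level first."""
    stats: List[Dict[str, int]] = []
    current = column
    for _ in range(depth):
        # Presence: how many lists at this level are non-null.
        present = [lst for lst in current if lst is not None]
        # Valency: how many elements each present list holds.
        lengths = [len(lst) for lst in present]
        stats.append({
            "num_non_missing": len(present),
            "num_missing": len(current) - len(present),
            "min_num_values": min(lengths, default=0),
            "max_num_values": max(lengths, default=0),
            "tot_num_values": sum(lengths),
        })
        # Flatten one level: elements of present lists are the next level.
        current = [elem for lst in present for elem in lst]
    return stats
```

For a two-level column such as [[[1, 2], [3]], None, [[], None]], the first dict describes the outer lists (two present, one missing) and the second describes the inner lists (three present, one missing).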

Bug Fixes and Other Changes

  • Trigger DATASET_HIGH_NUM_EXAMPLES when a dataset has more than the specified
    limit on number of examples.
  • Fix bug in display_anomalies that prevented dataset-level anomalies from
    being displayed.
  • Trigger anomalies when a feature has a number of unique values that does not
    conform to the specified minimum/maximum.
  • Depends on pandas>=0.24,<2.
  • Depends on tensorflow-metadata>=0.22.2,<0.23.0.
  • Depends on tfx-bsl>=0.22.1,<0.23.0.

Known Issues

Breaking Changes

Deprecations

Version 0.22.0

15 May 23:36
9e29f13

Major Features and Improvements

Bug Fixes and Other Changes

  • Crop values in natural language stats generator.
  • Switch to using PyBind11 instead of SWIG for wrapping C++ libraries.
  • The CSV decoder now supports multivalent columns by using tfx_bsl's decoder.
  • When inferring a schema entry for a feature, do not add a shape with dim = 0
    when min_num_values = 0.
  • Add utility methods tfdv.get_slice_stats to get statistics for a slice and
    tfdv.compare_slices to compare statistics of two slices using Facets.
  • Make tfdv.load_stats_text and tfdv.write_stats_text public.
  • Add PTransforms tfdv.WriteStatisticsToText and
    tfdv.WriteStatisticsToTFRecord to write statistics proto to text and
    tfrecord files respectively.
  • Modify tfdv.load_statistics to handle reading statistics from TFRecord and
    text files.
  • Added an extra requirement group mutual-information. As a result, a
    bare-bones TFDV installation no longer requires scikit-learn.
  • Added an extra requirement group visualization. As a result, a bare-bones
    TFDV installation no longer requires ipython.
  • Added an extra requirement group all that specifies all the extra
    dependencies TFDV needs. Use pip install tensorflow-data-validation[all]
    to pull in those dependencies.
  • Depends on pyarrow>=0.16,<0.17.
  • Depends on apache-beam[gcp]>=2.20,<3.
  • Depends on 'ipython>=7,<8;python_version>="3"'.
  • Depends on 'scikit-learn>=0.18,<0.24'.
  • Depends on tensorflow>=1.15,!=2.0.*,<3.
  • Depends on tensorflow-metadata>=0.22.0,<0.23.
  • Depends on tensorflow-transform>=0.22,<0.23.
  • Depends on tfx-bsl>=0.22,<0.23.

Known Issues

  • (Known issue resolution) It is no longer necessary to use Apache Beam 2.17
    when running TFDV on Windows. The current release of Apache Beam will work.

Breaking Changes

  • tfdv.GenerateStatistics now accepts a PCollection of pa.RecordBatch
    instead of pa.Table.
  • All the TFDV coders now output a PCollection of pa.RecordBatch instead of
    a PCollection of pa.Table.
  • tfdv.validate_instances and
    tfdv.api.validation_api.IdentifyAnomalousExamples now take
    pa.RecordBatch as input instead of pa.Table.
  • The StatsGenerator interface (and all its sub-classes) now takes
    pa.RecordBatch as the input data instead of pa.Table.
  • Custom slicing functions now accept a pa.RecordBatch instead of
    pa.Table as input and should output a tuple (slice_key, record_batch).

Deprecations

  • Deprecating Py2 support.

Release 0.21.5

06 Mar 23:36

Major Features and Improvements

  • Add label_feature to StatsOptions and enable LiftStatsGenerator when
    label_feature and schema are provided.
  • Add JSON serialization support for StatsOptions.

Bug Fixes and Other Changes

  • Only requires avro-python3>=1.8.1,!=1.9.2.*,<2.0.0 on Python 3.5 + MacOS

Breaking Changes

Deprecations

Release 0.21.4

05 Mar 03:00

Major Features and Improvements

  • Support visualizing feature value lift in facets visualization.

Bug Fixes and Other Changes

  • Fix issue writing out string feature values in LiftStatsGenerator.
  • Requires 'apache-beam[gcp]>=2.17,<3'.
  • Requires 'tensorflow-transform>=0.21.1,<0.22'.
  • Requires 'tfx-bsl>=0.21.3,<0.22'.

Breaking Changes

Deprecations

Release 0.21.2

20 Feb 17:28

Major Features and Improvements

Bug Fixes and Other Changes

  • Fix facets visualization.

Breaking Changes

Deprecations

  • tfdv.TFExampleDecoder has been removed. This legacy decoder converts
    serialized tf.Example to a dict of numpy arrays, which is the legacy
    input format (prior to Apache Arrow). TFDV stopped accepting that format
    in 0.14. Use tfdv.DecodeTFExample instead.

Release 0.21.1

11 Feb 21:31

Major Features and Improvements

Bug Fixes and Other Changes

  • Do validation on weighted feature stats.
  • During schema inference, skip features which are missing common stats. This
    makes schema inference work when the input stats are generated from some
    pre-existing, unknown schema.
  • Fix facets visualization in Chrome >=M80.

Known Issues

  • Running TFDV with Apache Beam 2.18 or 2.19 does not work on Windows. If you
    are using TFDV on Windows, use Apache Beam 2.17.

Breaking Changes

Deprecations

Release 0.21.0

21 Jan 19:44

Major Features and Improvements

  • Started depending on the CSV parsing / type inferring utilities provided
    by tfx-bsl (since tfx-bsl 0.15.2). This also makes the CSV decoder about 2x
    faster at decoding; type inferring performance is not affected.
  • Compute bytes statistics for features of BYTES type. Avoid computing topk and
    uniques for such features.
  • Added LiftStatsGenerator which computes lift between one feature (typically a
    label) and all other categorical features.
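
For readers unfamiliar with lift, the quantity can be sketched in a few lines of plain Python. This uses the common definition lift(x, y) = P(x, y) / (P(x) * P(y)) over (feature value, label value) pairs; it illustrates the statistic only and is not LiftStatsGenerator itself. The function name lift_table is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def lift_table(
    pairs: Iterable[Tuple[Hashable, Hashable]]
) -> Dict[Tuple[Hashable, Hashable], float]:
    """Computes lift for every (x, y) pair that occurs at least once.

    Pairs that never co-occur are omitted; their lift would be 0.
    """
    pairs = list(pairs)
    n = len(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    xy_counts = Counter(pairs)
    # P(x, y) / (P(x) * P(y)) simplifies to count(x, y) * n / (count(x) * count(y)).
    return {
        (x, y): (count * n) / (x_counts[x] * y_counts[y])
        for (x, y), count in xy_counts.items()
    }
```

A lift above 1 means the feature value and the label value co-occur more often than they would if they were independent; a lift below 1 means they co-occur less often.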

Bug Fixes and Other Changes

  • Exclude examples in which the entire sparse feature is missing when
    calculating sparse feature statistics.
  • Validate min_examples_count dataset constraint.
  • Document the schema fields, statistics fields, and detection condition for
    each anomaly type that TFDV detects.
  • Handle null array in cross feature stats generator, top-k & uniques combiner
    stats generator, and sklearn mutual information generator.
  • Handle infinity in basic stats generator.
  • Set num_missing and num_examples correctly in the presence of sparse
    features.
  • Compute weighted feature stats for all weighted features declared in schema.
  • Depends on tensorflow-metadata>=0.21.0,<0.22.
  • Depends on pyarrow>=0.15 (removed the upper bound as it is determined by
    tfx-bsl).
  • Depends on tfx-bsl>=0.21.0,<0.22
  • Depends on apache-beam>=2.17,<3

Breaking Changes

  • Changed how statistics are collected over CSV data:

    • Previously, if a CSV column contained a mix of integers and empty
      strings, FLOAT statistics were collected for that column. INT statistics
      are now collected instead.
  • Removed csv_decoder.DecodeCSVToDict, because Dict[str, np.ndarray] has not
    been the internal data representation since 0.14.
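
The CSV behavior change above can be pictured with a simplified type-inference sketch. This is an assumption-laden illustration, not the tfx-bsl implementation: the type names, the widening order, and the function names are invented for the example. The key point is that empty cells are skipped, so they no longer widen an integer column to FLOAT.

```python
from typing import Iterable

# Hypothetical widening order: a column takes the widest type of its cells.
_RANK = {"UNKNOWN": 0, "INT": 1, "FLOAT": 2, "STRING": 3}


def _cell_type(cell: str) -> str:
    """Classifies a single non-empty CSV cell."""
    try:
        int(cell)
        return "INT"
    except ValueError:
        pass
    try:
        float(cell)
        return "FLOAT"
    except ValueError:
        return "STRING"


def infer_column_type(cells: Iterable[str]) -> str:
    """Infers a column type, treating empty strings as missing values."""
    inferred = "UNKNOWN"
    for cell in cells:
        if cell == "":
            continue  # missing value: does not influence the inferred type
        cell_type = _cell_type(cell)
        if _RANK[cell_type] > _RANK[inferred]:
            inferred = cell_type
    return inferred
```

Under this sketch, a column with cells "1", "" and "3" is inferred as INT, while a column that also contains "2.5" is widened to FLOAT.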

Deprecations

Release 0.15.0

23 Oct 02:18

Major Features and Improvements

  • Generate statistics for sparse features.
  • Directly convert a batch of tf.Examples to Arrow tables. Avoids conversion of
    tf.Example to intermediate Dict representation.

Bug Fixes and Other Changes

  • Generate statistics for the weight feature.
  • Support validation and schema inference from sliced statistics that include
    the default slice (validation/inference will be done using the default slice
    statistics).
  • Avoid flattening null arrays.
  • Set weighted_num_examples field in the statistics proto if a weight
    feature is specified.
  • Replace DecodedExamplesToTable with a Python implementation.
  • Building TFDV from source does not need pyarrow anymore.
  • Depends on apache-beam[gcp]>=2.16,<3.
  • Depends on six>=1.12,<2.
  • Depends on scikit-learn>=0.18,<0.22.
  • Depends on tfx-bsl>=0.15,<0.16.
  • Depends on tensorflow-metadata>=0.15,<0.16.
  • Depends on tensorflow-transform>=0.15,<0.16.
  • Depends on tensorflow>=1.15,<3.
    • Starting from 1.15, the tensorflow package comes with GPU support. Users
      won't need to choose between tensorflow and tensorflow-gpu.
    • Caveat: tensorflow 2.0.0 is an exception and does not have GPU
      support. If tensorflow-gpu 2.0.0 is installed before installing
      tensorflow-data-validation, it will be replaced with tensorflow 2.0.0.
      Re-install tensorflow-gpu 2.0.0 if needed.

Breaking Changes

Deprecations