Releases · tensorflow/data-validation
# Version 0.23.0
Major Features and Improvements
- Data validation is now able to handle arbitrarily nested arrow List/LargeList types. Schema entries for features with multiple nest levels describe the value count at each level in the `value_counts` field. (See the sketch after this list.)
- Add combiner stats generator to estimate top-K and uniques using Misra-Gries and K-Minimum Values sketches.
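The note above does not show what such a schema entry looks like. Below is a minimal, hedged sketch of declaring per-nest-level value counts for a hypothetical two-level feature; the `value_counts`/`value_count` field layout is assumed from the `ValueCountList` message in tensorflow-metadata 0.23, and the feature name is a placeholder.

```python
from tensorflow_metadata.proto.v0 import schema_pb2

# Hedged sketch: a feature whose values are lists of lists of ints
# (two nest levels), with a value count declared for each level.
# Field names assume the ValueCountList message in tensorflow-metadata >= 0.23.
feature = schema_pb2.Feature(name="token_ids", type=schema_pb2.INT)  # placeholder name
feature.value_counts.value_count.add(min=1, max=1)    # outer nest level
feature.value_counts.value_count.add(min=1, max=128)  # inner nest level
```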
Bug Fixes and Other Changes
- Validate that enough supported images are present (if `image_domain.minimum_supported_image_fraction` is provided).
- Stopped requiring avro-python3.
- Depends on `apache-beam[gcp]>=2.23,<3`.
- Depends on `pyarrow>=0.17,<0.18`.
- Depends on `tensorflow>=1.15.2,!=2.0.*,!=2.1.*,!=2.2.*,<3`.
- Depends on `tensorflow-metadata>=0.23,<0.24`.
- Depends on `tensorflow-transform>=0.23,<0.24`.
- Depends on `tfx-bsl>=0.23,<0.24`.
Known Issues
- N/A
Breaking Changes
- N/A
Deprecations
- N/A
# TFDV 0.22.2 Release
Major Features and Improvements
Bug Fixes and Other Changes
- Fixed a bug that prevented tfx 0.22.0 from working with TFDV 0.22.1.
- Depends on `avro-python3>=1.8.1,<1.9.2` on Python 3.5 + MacOS.
Known Issues
Breaking Changes
Deprecations
# TFDV 0.22.1 Release
Major Features and Improvements
- Statistics generation is now able to handle arbitrarily nested arrow List/LargeList types. Stats about the list elements' presence and valency are computed at each nest level and stored in a newly added field, `valency_and_presence_stats`, in `CommonStatistics`. (See the sketch after this list.)
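As an illustration of the nested-list support, here is a minimal, hedged sketch that feeds a two-level `pa.list_` array through `tfdv.GenerateStatistics` in a Beam pipeline; the feature name and output path are placeholders.

```python
import apache_beam as beam
import pyarrow as pa
import tensorflow_data_validation as tfdv

# A feature whose values are lists of lists of floats (two nest levels).
nested = pa.array(
    [[[1.0, 2.0], [3.0]], None, [[4.0]]],
    type=pa.list_(pa.list_(pa.float32())),
)
record_batch = pa.RecordBatch.from_arrays([nested], ["nested_feature"])  # placeholder name

with beam.Pipeline() as p:
    _ = (
        p
        | "CreateBatches" >> beam.Create([record_batch])
        # Presence and valency stats for each nest level are recorded in the
        # new valency_and_presence_stats field of CommonStatistics.
        | "GenerateStats" >> tfdv.GenerateStatistics()
        | "WriteStats" >> tfdv.WriteStatisticsToTFRecord("stats.tfrecord")  # placeholder path
    )
```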
Bug Fixes and Other Changes
- Trigger `DATASET_HIGH_NUM_EXAMPLES` when a dataset has more than the specified limit on number of examples.
- Fix bug in `display_anomalies` that prevented dataset-level anomalies from being displayed.
- Trigger anomalies when a feature has a number of unique values that does not conform to the specified minimum/maximum.
- Depends on `pandas>=0.24,<2`.
- Depends on `tensorflow-metadata>=0.22.2,<0.23.0`.
- Depends on `tfx-bsl>=0.22.1,<0.23.0`.
Known Issues
Breaking Changes
Deprecations
# Version 0.22.0
Major Features and Improvements
Bug Fixes and Other Changes
- Crop values in natural language stats generator.
- Switch to using PyBind11 instead of SWIG for wrapping C++ libraries.
- CSV decoder support for multivalent columns by using tfx_bsl's decoder.
- When inferring a schema entry for a feature, do not add a shape with dim = 0 when min_num_values = 0.
- Add utility methods `tfdv.get_slice_stats` to get statistics for a slice and `tfdv.compare_slices` to compare statistics of two slices using Facets. (See the sketch after this list.)
- Make `tfdv.load_stats_text` and `tfdv.write_stats_text` public.
- Add PTransforms `tfdv.WriteStatisticsToText` and `tfdv.WriteStatisticsToTFRecord` to write statistics proto to text and tfrecord files respectively.
- Modify `tfdv.load_statistics` to handle reading statistics from TFRecord and text files.
- Added an extra requirement group `mutual-information`. As a result, barebone TFDV does not require `scikit-learn` any more.
- Added an extra requirement group `visualization`. As a result, barebone TFDV does not require `ipython` any more.
- Added an extra requirement group `all` that specifies all the extra dependencies TFDV needs. Use `pip install tensorflow-data-validation[all]` to pull in those dependencies.
- Depends on `pyarrow>=0.16,<0.17`.
- Depends on `apache-beam[gcp]>=2.20,<3`.
- Depends on `ipython>=7,<8;python_version>="3"`.
- Depends on `scikit-learn>=0.18,<0.24`.
- Depends on `tensorflow>=1.15,!=2.0.*,<3`.
- Depends on `tensorflow-metadata>=0.22.0,<0.23`.
- Depends on `tensorflow-transform>=0.22,<0.23`.
- Depends on `tfx-bsl>=0.22,<0.23`.
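A hedged sketch of how the new slice utilities and the now-public text I/O might be used together; the stats path, slice keys, and keyword argument names are illustrative assumptions.

```python
import tensorflow_data_validation as tfdv

# Load a DatasetFeatureStatisticsList written in text format.
stats = tfdv.load_stats_text("stats.pbtxt")  # placeholder path

# Pull out the statistics for a single slice (slice key is a placeholder)...
us_stats = tfdv.get_slice_stats(stats, slice_key="country_US")

# ...or compare two slices side by side in a Facets visualization.
tfdv.compare_slices(stats, lhs_slice_key="country_US", rhs_slice_key="country_CA")
```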
Known Issues
- (Known issue resolution) It is no longer necessary to use Apache Beam 2.17
when running TFDV on Windows. The current release of Apache Beam will work.
Breaking Changes
- `tfdv.GenerateStatistics` now accepts a PCollection of `pa.RecordBatch` instead of `pa.Table`.
- All the TFDV coders now output a PCollection of `pa.RecordBatch` instead of a PCollection of `pa.Table`.
- `tfdv.validate_instances` and `tfdv.api.validation_api.IdentifyAnomalousExamples` now take `pa.RecordBatch` as input instead of `pa.Table`.
- The `StatsGenerator` interface (and all its sub-classes) now takes `pa.RecordBatch` as the input data instead of `pa.Table`.
- Custom slicing functions now accept a `pa.RecordBatch` instead of `pa.Table` as input and should output a tuple `(slice_key, record_batch)`. (See the sketch after this list.)
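For example, a custom slicing function under the new contract might look like the following sketch; the slicing logic is hypothetical and the `slice_functions` option name is an assumption about how such functions are wired into `StatsOptions`.

```python
import pyarrow as pa
import tensorflow_data_validation as tfdv

def slice_whole_batch(record_batch: pa.RecordBatch):
    # Under the new contract, a slicing function receives a pa.RecordBatch
    # (not a pa.Table) and yields (slice_key, record_batch) tuples.
    # Hypothetical logic: put every example under a single slice key.
    yield ("all_examples", record_batch)

# Assumed wiring: slicing functions are passed via StatsOptions.
options = tfdv.StatsOptions(slice_functions=[slice_whole_batch])
```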
Deprecations
- Deprecating Py2 support.
# Release 0.21.5
Major Features and Improvements
- Add `label_feature` to `StatsOptions` and enable `LiftStatsGenerator` when `label_feature` and `schema` are provided.
- Add JSON serialization support for `StatsOptions`. (See the sketch after this list.)
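A minimal sketch of both additions; the `to_json`/`from_json` method names are assumptions (the note only says JSON serialization was added), and the schema path and label feature name are placeholders.

```python
import tensorflow_data_validation as tfdv

schema = tfdv.load_schema_text("schema.pbtxt")  # placeholder path

# Providing both label_feature and schema enables LiftStatsGenerator.
options = tfdv.StatsOptions(label_feature="label", schema=schema)

# Round-trip StatsOptions through JSON (method names assumed).
options_json = options.to_json()
restored_options = tfdv.StatsOptions.from_json(options_json)
```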
Bug Fixes and Other Changes
- Only requires `avro-python3>=1.8.1,!=1.9.2.*,<2.0.0` on Python 3.5 + MacOS.
Breaking Changes
Deprecations
# Release 0.21.4
Major Features and Improvements
- Support visualizing feature value lift in facets visualization.
Bug Fixes and Other Changes
- Fix issue writing out string feature values in LiftStatsGenerator.
- Requires `apache-beam[gcp]>=2.17,<3`.
- Requires `tensorflow-transform>=0.21.1,<0.22`.
- Requires `tfx-bsl>=0.21.3,<0.22`.
Breaking Changes
Deprecations
# Release 0.21.2
Major Features and Improvements
Bug Fixes and Other Changes
- Fix facets visualization.
Breaking Changes
Deprecations
- `tfdv.TFExampleDecoder` has been removed. This legacy decoder converts serialized `tf.Example` to a dict of numpy arrays, which is the legacy input format (prior to Apache Arrow). TFDV has stopped accepting that format since 0.14. Use `tfdv.DecodeTFExample` instead. (See the sketch below.)
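For reference, a minimal sketch of the recommended replacement in a Beam pipeline; the input path is a placeholder.

```python
import apache_beam as beam
import tensorflow_data_validation as tfdv

with beam.Pipeline() as p:
    _ = (
        p
        | "ReadExamples" >> beam.io.ReadFromTFRecord("examples.tfrecord")  # placeholder path
        # DecodeTFExample parses serialized tf.Example records into the
        # Arrow-based representation that GenerateStatistics expects.
        | "DecodeExamples" >> tfdv.DecodeTFExample()
        | "GenerateStats" >> tfdv.GenerateStatistics()
    )
```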
# Release 0.21.1
Major Features and Improvements
Bug Fixes and Other Changes
- Do validation on weighted feature stats.
- During schema inference, skip features which are missing common stats. This makes schema inference work when the input stats are generated from some pre-existing, unknown schema.
- Fix facets visualization in Chrome >=M80.
Known Issues
- Running TFDV with Apache Beam 2.18 or 2.19 does not work on Windows. If you
are using TFDV on Windows, use Apache Beam 2.17.
Breaking Changes
Deprecations
# Release 0.21.0
Major Features and Improvements
- Started depending on the CSV parsing / type inferring utilities provided by `tfx-bsl` (since tfx-bsl 0.15.2). This also brings performance improvements to the CSV decoder (~2x faster in decoding; type inferring performance is not affected).
- Compute bytes statistics for features of BYTES type. Avoid computing topk and uniques for such features.
- Added LiftStatsGenerator which computes lift between one feature (typically a label) and all other categorical features.
Bug Fixes and Other Changes
- Exclude examples in which the entire sparse feature is missing when calculating sparse feature statistics.
- Validate min_examples_count dataset constraint.
- Document the schema fields, statistics fields, and detection condition for each anomaly type that TFDV detects.
- Handle null array in cross feature stats generator, top-k & uniques combiner stats generator, and sklearn mutual information generator.
- Handle infinity in basic stats generator.
- Set num_missing and num_examples correctly in the presence of sparse features.
- Compute weighted feature stats for all weighted features declared in schema.
- Depends on `tensorflow-metadata>=0.21.0,<0.22`.
- Depends on `pyarrow>=0.15` (removed the upper bound as it is determined by `tfx-bsl`).
- Depends on `tfx-bsl>=0.21.0,<0.22`.
- Depends on `apache-beam>=2.17,<3`.
Breaking Changes
- Changed the behavior regarding statistics over CSV data:
  - Previously, if a CSV column was mixed with integers and empty strings, FLOAT statistics would be collected for that column. A change was made so that INT statistics are collected instead.
- Removed `csv_decoder.DecodeCSVToDict` as `Dict[str, np.ndarray]` has not been the internal data representation since 0.14.
Deprecations
# Release 0.15.0
Major Features and Improvements
- Generate statistics for sparse features.
- Directly convert a batch of tf.Examples to Arrow tables. Avoids conversion of
tf.Example to intermediate Dict representation.
Bug Fixes and Other Changes
- Generate statistics for the weight feature.
- Support validation and schema inference from sliced statistics that include the default slice (validation/inference will be done using the default slice statistics).
- Avoid flattening null arrays.
- Set `weighted_num_examples` field in the statistics proto if a weight feature is specified. (See the sketch after this list.)
- Replace DecodedExamplesToTable with a Python implementation.
- Building TFDV from source does not need pyarrow anymore.
- Depends on `apache-beam[gcp]>=2.16,<3`.
- Depends on `six>=1.12,<2`.
- Depends on `scikit-learn>=0.18,<0.22`.
- Depends on `tfx-bsl>=0.15,<0.16`.
- Depends on `tensorflow-metadata>=0.15,<0.16`.
- Depends on `tensorflow-transform>=0.15,<0.16`.
- Depends on `tensorflow>=1.15,<3`.
  - Starting from 1.15, package `tensorflow` comes with GPU support. Users won't need to choose between `tensorflow` and `tensorflow-gpu`.
  - Caveat: `tensorflow` 2.0.0 is an exception and does not have GPU support. If `tensorflow-gpu` 2.0.0 is installed before installing `tensorflow-data-validation`, it will be replaced with `tensorflow` 2.0.0. Re-install `tensorflow-gpu` 2.0.0 if needed.
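As an illustration of the weight-feature support above, a minimal hedged sketch; the weight feature name and data location are placeholders.

```python
import tensorflow_data_validation as tfdv

# When a weight feature is specified, weighted_num_examples is populated
# in the resulting statistics proto.
options = tfdv.StatsOptions(weight_feature="example_weight")  # placeholder name
stats = tfdv.generate_statistics_from_tfrecord(
    data_location="examples.tfrecord",  # placeholder path
    stats_options=options,
)
```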