Various TensorFlow ops related to text-processing.
keras module: TensorFlow Text layers for the Keras API.
metrics module: TensorFlow text-processing metrics.
tflite_registrar module: Registers TF.Text ops with the TFLite interpreter.
class BertTokenizer: Tokenizer used for BERT.
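
A minimal sketch of typical usage; the vocabulary file and its contents below are hypothetical toys, not a real BERT vocab:

```python
import tensorflow_text as tf_text

# Hypothetical toy vocabulary; a real BERT vocab.txt has ~30k entries.
vocab = ["[UNK]", "[CLS]", "[SEP]", "the", "great", "##est"]
with open("/tmp/toy_vocab.txt", "w") as f:
    f.write("\n".join(vocab))

tokenizer = tf_text.BertTokenizer("/tmp/toy_vocab.txt", lower_case=True)
# Returns a RaggedTensor of wordpiece ids, shape [batch, word, wordpiece];
# "greatest" splits into "great" + "##est".
ids = tokenizer.tokenize(["the greatest"])
```
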
class ByteSplitter: Splits a string tensor into bytes.
class Detokenizer: Base class for detokenizer implementations.
class FastBertNormalizer: Normalizes a tensor of UTF-8 strings.
class FastBertTokenizer: Tokenizer used for BERT, a faster version with TFLite support.
class FastSentencepieceTokenizer: Sentencepiece tokenizer with tf.text interface.
class FastWordpieceTokenizer: Tokenizes a tensor of UTF-8 string tokens into subword pieces.
class FirstNItemSelector: An ItemSelector that selects the first n items in the batch.
class HubModuleSplitter: Splitter that uses a Hub module.
class HubModuleTokenizer: Tokenizer that uses a Hub module.
class LastNItemSelector: An ItemSelector that selects the last n items in the batch.
class MaskValuesChooser: Assigns values to the items chosen for masking.
class PhraseTokenizer: Tokenizes a tensor of UTF-8 string tokens into phrases.
class RandomItemSelector: An ItemSelector implementation that randomly selects items in a batch.
class Reduction: Type of reduction to be done by the n-gram op.
class RegexSplitter: Splits text on a given regular expression.
class RoundRobinTrimmer: A Trimmer that allocates a length budget to segments via round robin.
class SentencepieceTokenizer: Tokenizes a tensor of UTF-8 strings.
class ShrinkLongestTrimmer: A Trimmer that truncates the longest segment.
class SplitMergeFromLogitsTokenizer: Tokenizes a tensor of UTF-8 strings into words according to logits.
class SplitMergeTokenizer: Tokenizes a tensor of UTF-8 strings into words according to labels.
class Splitter: An abstract base class for splitting text.
class SplitterWithOffsets: An abstract base class for splitters that return offsets.
class StateBasedSentenceBreaker: A Splitter that uses a state machine to determine sentence breaks.
class Tokenizer: Base class for tokenizer implementations.
class TokenizerWithOffsets: Base class for tokenizer implementations that return offsets.
class Trimmer: Truncates a list of segments using a pre-determined truncation strategy.
class UnicodeCharTokenizer: Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.
class UnicodeScriptTokenizer: Tokenizes UTF-8 by splitting when there is a change in Unicode script.
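
A minimal sketch; the Latin-to-Cyrillic script change below produces a token boundary even without punctuation:

```python
import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeScriptTokenizer()
# Whitespace is dropped by default; offsets are byte positions into the input.
tokens, starts, ends = tokenizer.tokenize_with_offsets(["hello мир"])
```
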
class WaterfallTrimmer: A Trimmer that allocates a length budget to segments in order.
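
A minimal sketch of the waterfall strategy, assuming a scalar budget; the ids are arbitrary:

```python
import tensorflow as tf
import tensorflow_text as tf_text

seg_a = tf.ragged.constant([[1, 2, 3, 4]])
seg_b = tf.ragged.constant([[5, 6, 7, 8]])
trimmer = tf_text.WaterfallTrimmer(max_seq_length=5)
# The budget of 5 is spent left to right: seg_a keeps all 4 ids,
# seg_b keeps only the 1 remaining.
trimmed_a, trimmed_b = trimmer.trim([seg_a, seg_b])
```
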
class WhitespaceTokenizer: Tokenizes a tensor of UTF-8 strings on whitespace.
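
A minimal sketch of typical usage:

```python
import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
# -> <tf.RaggedTensor [[b'What', b'you', b'know.'], [b'Explain!']]>
tokens = tokenizer.tokenize(["What you know.", "Explain!"])
```
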
class WordShape: Values for the 'pattern' arg of the wordshape op.
class WordpieceTokenizer: Tokenizes a tensor of UTF-8 string tokens into subword pieces.
boise_tags_to_offsets(...): Converts the token offsets and BOISE tags into span offsets and span type.
build_fast_bert_normalizer_model(...): build_fast_bert_normalizer_model(arg0: bool) -> bytes
build_fast_wordpiece_model(...): build_fast_wordpiece_model(arg0: List[str], arg1: int, arg2: str, arg3: str, arg4: bool, arg5: bool) -> bytes
case_fold_utf8(...): Applies case folding to every UTF-8 string in the input.
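
For example:

```python
import tensorflow_text as tf_text

# Case folding is Unicode-aware lowercasing.
folded = tf_text.case_fold_utf8(["The Quick Brown Fox"])
# -> [b'the quick brown fox']
```
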
coerce_to_structurally_valid_utf8(...): Coerce UTF-8 input strings to structurally valid UTF-8.
combine_segments(...): Combine one or more input segments for a model's input sequence.
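
A sketch with hypothetical special-token ids (101/102 standing in for [CLS]/[SEP]; use your vocabulary's values):

```python
import tensorflow as tf
import tensorflow_text as tf_text

seg_a = tf.ragged.constant([[10, 11, 12]])
seg_b = tf.ragged.constant([[20, 21]])
combined, segment_ids = tf_text.combine_segments(
    [seg_a, seg_b], start_of_sequence_id=101, end_of_segment_id=102)
# combined    -> [[101, 10, 11, 12, 102, 20, 21, 102]]
# segment_ids -> [[0, 0, 0, 0, 0, 1, 1, 1]]
```
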
concatenate_segments(...): Concatenate input segments for a model's input sequence.
find_source_offsets(...): Maps the input post-normalized string offsets to pre-normalized offsets.
gather_with_default(...): Gather slices with `indices=-1` mapped to `default`.
greedy_constrained_sequence(...): Performs greedy constrained sequence decoding on a batch of examples.
mask_language_model(...): Applies dynamic language model masking.
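
A sketch wiring the RandomItemSelector and MaskValuesChooser classes listed above into this op; all ids and the vocab size are hypothetical:

```python
import tensorflow as tf
import tensorflow_text as tf_text

ids = tf.ragged.constant([[101, 7, 8, 9, 102]])
selector = tf_text.RandomItemSelector(
    max_selections_per_batch=2,
    selection_rate=0.15,
    unselectable_ids=[101, 102])  # never mask the [CLS]/[SEP] stand-ins
chooser = tf_text.MaskValuesChooser(vocab_size=1000, mask_token=103)
masked_ids, masked_positions, masked_lm_ids = tf_text.mask_language_model(
    ids, item_selector=selector, mask_values_chooser=chooser)
```
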
max_spanning_tree(...): Finds the maximum directed spanning tree of a digraph.
max_spanning_tree_gradient(...): Returns a subgradient of the MaximumSpanningTree op.
ngrams(...): Create a tensor of n-grams based on the input `data`.
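
A minimal sketch building joined string bigrams:

```python
import tensorflow as tf
import tensorflow_text as tf_text

tokens = tf.ragged.constant([["a", "b", "c"], ["d", "e"]])
bigrams = tf_text.ngrams(
    tokens, width=2,
    reduction_type=tf_text.Reduction.STRING_JOIN,
    string_separator=" ")
# -> [[b'a b', b'b c'], [b'd e']]
```
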
normalize_utf8(...): Normalizes each UTF-8 string in the input tensor using the specified rule.
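
For example:

```python
import tensorflow_text as tf_text

# NFKC folds compatibility characters, e.g. the ligature 'ﬁ' becomes 'fi'.
tf_text.normalize_utf8(["ﬁnal"], normalization_form="NFKC")  # -> [b'final']
```
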
normalize_utf8_with_offsets_map(...): Normalizes each UTF-8 string in the input tensor using the specified rule, additionally returning an offsets map.
offsets_to_boise_tags(...): Converts the given tokens and spans in offsets format into BOISE tags.
pad_along_dimension(...): Add padding to the beginning and end of data in a specific dimension.
pad_model_inputs(...): Pad model input and generate corresponding input masks.
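
A minimal sketch; the ids are arbitrary:

```python
import tensorflow as tf
import tensorflow_text as tf_text

ids = tf.ragged.constant([[101, 7, 102], [101, 7, 8, 9, 102]])
padded, mask = tf_text.pad_model_inputs(ids, max_seq_length=6)
# padded -> [[101, 7, 102, 0, 0, 0], [101, 7, 8, 9, 102, 0]]
# mask   -> [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]
```
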
regex_split(...): Split input by delimiters that match a regex pattern.
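
For example:

```python
import tensorflow_text as tf_text

# Split on runs of whitespace, discarding the delimiters.
tf_text.regex_split(["hello there"], delim_regex_pattern=r"\s+")
# -> <tf.RaggedTensor [[b'hello', b'there']]>
```
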
regex_split_with_offsets(...): Split input by delimiters that match a regex pattern; returns offsets.
sentence_fragments(...): Find the sentence fragments in a given text. (deprecated)
sliding_window(...): Builds a sliding window for data with a specified width.
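
For example:

```python
import tensorflow as tf
import tensorflow_text as tf_text

data = tf.constant([1, 2, 3, 4, 5])
windows = tf_text.sliding_window(data, width=3)
# -> [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```
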
span_alignment(...): Return an alignment from a set of source spans to a set of target spans.
span_overlaps(...): Returns a boolean tensor indicating which source and target spans overlap.
utf8_binarize(...): Decode UTF-8 tokens into code points and return their bits.
viterbi_constrained_sequence(...): Performs Viterbi constrained sequence decoding on a batch of examples.
wordshape(...): Determine wordshape features for each input string.
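
A minimal sketch using two of the WordShape patterns listed above:

```python
import tensorflow as tf
import tensorflow_text as tf_text

tokens = tf.constant(["ABC", "Hello", "h3llo"])
tf_text.wordshape(tokens, tf_text.WordShape.IS_UPPERCASE)    # -> [True, False, False]
tf_text.wordshape(tokens, tf_text.WordShape.HAS_SOME_DIGITS)  # -> [False, False, True]
```
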
**version**: `'2.12.0'`