From f59fb1ff5d07be47b45b10444298e45140ffdc42 Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Thu, 5 Mar 2020 07:12:56 -0500 Subject: [PATCH 01/10] [DOCS] Reformat word delimiter graph token filter Makes the following changes to the `word_delimiter_graph` token filter docs: * Updates the Lucene experimental admonition. * Updates description * Adds analyze snippet * Adds custom analyzer and custom filter snippets * Reorganizes and updates parameter list * Expands and updates section re: differences between `word_delimiter` and `word_delimiter_graph` --- .../tokenfilters/trim-tokenfilter.asciidoc | 7 +- .../word-delimiter-graph-tokenfilter.asciidoc | 427 ++++++++++++++---- 2 files changed, 352 insertions(+), 82 deletions(-) diff --git a/docs/reference/analysis/tokenfilters/trim-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/trim-tokenfilter.asciidoc index 19d47f203afb8..ff0775ae39915 100644 --- a/docs/reference/analysis/tokenfilters/trim-tokenfilter.asciidoc +++ b/docs/reference/analysis/tokenfilters/trim-tokenfilter.asciidoc @@ -4,7 +4,9 @@ Trim ++++ -Removes leading and trailing whitespace from each token in a stream. +Removes leading and trailing whitespace from each token in a stream. While this +can change the length of a token, the `trim` filter does _not_ change a token's +offsets. The `trim` filter uses Lucene's https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html[TrimFilter]. @@ -69,7 +71,8 @@ GET _analyze ---- The API returns the following response. The returned `fox` token does not -include any leading or trailing whitespace. +include any leading or trailing whitespace. Note that despite changing the +token's length, the `start_offset` and `end_offset` remain the same. [source,console-result] ---- diff --git a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc index 65cdd2575be20..b4c803dbb7fd9 100644 --- a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc +++ b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc @@ -4,105 +4,372 @@ Word delimiter graph ++++ -experimental[This functionality is marked as experimental in Lucene] +experimental::["The `word_delimiter_graph` filter uses Lucene's {lucene-analysis-docs}/miscellaneous/WordDelimiterGraphFilter.html[WordDelimiterGraphFilter], which is marked as experimental in Lucene."] -Named `word_delimiter_graph`, it splits words into subwords and performs -optional transformations on subword groups. Words are split into -subwords with the following rules: +Splits tokens at non-alphanumeric characters. The `word_delimiter_graph` filter +also performs optional token normalization based on a set of rules. By default, +the filter uses the following rules: -* split on intra-word delimiters (by default, all non alpha-numeric -characters). -* "Wi-Fi" -> "Wi", "Fi" -* split on case transitions: "PowerShot" -> "Power", "Shot" -* split on letter-number transitions: "SD500" -> "SD", "500" -* leading and trailing intra-word delimiters on each subword are -ignored: "//hello---there, 'dude'" -> "hello", "there", "dude" -* trailing "'s" are removed for each subword: "O'Neil's" -> "O", "Neil" +* Split tokens at non-alphanumeric characters. + The filter uses these characters as delimiters. + For example: `Wi-Fi` -> `Wi`, `Fi` +* Remove leading or trailing delimiters from each token. 
+ For example: `hello---there, 'dude'` -> `hello`, `there`, `dude` +* Split tokens at letter case transitions. + For example: `PowerShot` -> `Power`, `Shot` +* Split tokens at letter-number transitions. + For example: `SD500` -> `SD`, `500` +* Remove the English possessive (`'s`) from the end of each token. + For example: `Neil's` -> `Neil` -Unlike the `word_delimiter`, this token filter correctly handles positions for -multi terms expansion at search-time when any of the following options -are set to true: +[[analysis-word-delimiter-graph-tokenfilter-analyze-ex]] +==== Example - * `preserve_original` - * `catenate_numbers` - * `catenate_words` - * `catenate_all` +The following analyze API request uses the `word_delimiter_graph` filter to +split `Neil's Wi-Fi-enabled PowerShot SD500` into normalized tokens using the +filter's default rules: -Parameters include: +[source,console] +---- +GET /_analyze +{ + "tokenizer": "whitespace", + "filter": [ "word_delimiter_graph" ], + "text": "Neil's Wi-Fi-enabled PowerShot SD500" +} +---- -`generate_word_parts`:: - If `true` causes parts of words to be - generated: "PowerShot" -> "Power" "Shot". Defaults to `true`. +The filter produces the following tokens: -`generate_number_parts`:: - If `true` causes number subwords to be - generated: "500-42" -> "500" "42". Defaults to `true`. +[source,txt] +---- +[ Neil, Wi, Fi, enabled, Power, Shot, SD, 500 ] +---- -`catenate_words`:: - If `true` causes maximum runs of word parts to be - catenated: "wi-fi" -> "wifi". Defaults to `false`. +//// +[source,console-result] +---- +{ + "tokens" : [ + { + "token" : "Neil", + "start_offset" : 0, + "end_offset" : 4, + "type" : "word", + "position" : 0 + }, + { + "token" : "Wi", + "start_offset" : 7, + "end_offset" : 9, + "type" : "word", + "position" : 1 + }, + { + "token" : "Fi", + "start_offset" : 10, + "end_offset" : 12, + "type" : "word", + "position" : 2 + }, + { + "token" : "enabled", + "start_offset" : 13, + "end_offset" : 20, + "type" : "word", + "position" : 3 + }, + { + "token" : "Power", + "start_offset" : 21, + "end_offset" : 26, + "type" : "word", + "position" : 4 + }, + { + "token" : "Shot", + "start_offset" : 26, + "end_offset" : 30, + "type" : "word", + "position" : 5 + }, + { + "token" : "SD", + "start_offset" : 31, + "end_offset" : 33, + "type" : "word", + "position" : 6 + }, + { + "token" : "500", + "start_offset" : 33, + "end_offset" : 36, + "type" : "word", + "position" : 7 + } + ] +} +---- +//// -`catenate_numbers`:: - If `true` causes maximum runs of number parts to - be catenated: "500-42" -> "50042". Defaults to `false`. +[analysis-word-delimiter-tokenfilter-analyzer-ex]] +==== Add to an analyzer + +The following <> request uses the +`word_delimiter_graph` filter to configure a new +<>. + +[source,console] +---- +PUT /my_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_analyzer": { + "tokenizer": "whitespace", + "filter": [ "word_delimiter_graph" ] + } + } + } + } +} +---- + +[WARNING] +==== +Avoid using the `word_delimiter_graph` filter with tokenizers that remove +punctuation, such as the <> tokenizer. +This could prevent the `word_delimiter_graph` filter from splitting tokens +correctly. It can also interfere with the filter's configurable parameters, such +as <> or +<>. We +recommend using the <> tokenizer +instead. 
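
A quick way to see this is to run the example text through the `standard`
tokenizer and compare the result with the earlier `whitespace` example. The
following request is only an illustrative sketch; because the `standard`
tokenizer has already stripped the hyphens, a parameter such as `catenate_all`
has nothing left to join:

[source,console]
----
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ { "type": "word_delimiter_graph", "catenate_all": true } ],
  "text": "Neil's Wi-Fi-enabled PowerShot SD500"
}
----
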
+==== +[[word-delimiter-graph-tokenfilter-configure-parms]] +==== Configurable parameters + +[[word-delimiter-graph-tokenfilter-adjust-offsets]] +`adjust_offsets`:: ++ +-- +(Optional, boolean) +If `true`, adjust the offsets of split or catenated tokens to better reflect +their actual position in the token stream. Defaults to `true`. + +[WARNING] +==== +Set `adjust_offsets` to `false` if your analyzer uses filters, such as the +<> filter, that change the length of tokens +without changing their offsets. Otherwise, the `word_delimiter_graph` filter +could produce tokens with illegal offsets. +==== +-- + +[[word-delimiter-graph-tokenfilter-catenate-all]] `catenate_all`:: - If `true` causes all subword parts to be catenated: - "wi-fi-4000" -> "wifi4000". Defaults to `false`. +(Optional, boolean) +If `true`, produce catenated tokens for chains of alphanumeric characters +separated by non-alphabetic delimiters. For example: `wi-fi-232-enabled` -> [ +**`wifi232enabled`**, `wi`, `fi`, `232`, `enabled` ]. Defaults to `false`. -`split_on_case_change`:: - If `true` causes "PowerShot" to be two tokens; - ("Power-Shot" remains two parts regards). Defaults to `true`. +[[word-delimiter-graph-tokenfilter-catenate-numbers]] +`catenate_numbers`:: +(Optional, boolean) +If `true`, produce additional catenated tokens for chains of numeric characters +separated by non-alphabetic delimiters. For example: `01-02-03` -> +[**`010203`**, `01`, `02`, `03` ]. Defaults to `false`. +[[word-delimiter-graph-tokenfilter-catenate-words]] +`catenate_words`:: +(Optional, boolean) +If `true`, produce catenated tokens for chains of alphabetical characters +separated by non-alphabetic delimiters. For example: `wi-fi-enabled` -> [ +**`wifienabled`**, `wi`, `fi`, `enabled`]. Defaults to `false`. + +`generate_number_parts`:: +(Optional, boolean) +If `true`, include tokens consisting of only numeric characters in the output. +If `false`, exclude these tokens from the output. Defaults to `true`. + +`generate_word_parts`:: +(Optional, boolean) +If `true`, include tokens consisting of only alphabetical characters in the +output. If `false`, exclude these tokens from the output. Defaults to `true`. + +[[word-delimiter-graph-tokenfilter-preserve-original]] `preserve_original`:: - If `true` includes original words in subwords: - "500-42" -> "500-42" "500" "42". Defaults to `false`. +(Optional, boolean) +If `true`, include the original version of any split tokens in the output. This +original version includes non-alphanumeric delimiters. For example: `wi-fi-232` +-> [**`wi-fi-232`**, `wi`, `fi`, `232` ]. Defaults to `false`. + +`protected_words`:: +(Optional, array of strings) +Array of tokens the filter won't split. + +`protected_words_path`:: ++ +-- +(Optional, string) +Path to a file that contains a list of tokens the filter won't split. + +This path must be absolute or relative to the `config` location, and the file +must be UTF-8 encoded. Each token in the file must be separated by a line +break. +-- + +`split_on_case_change`:: +(Optional, boolean) +If `true`, split tokens at letter case transitions. For example: `camelCase` -> +[ `camel`, `Case`]. Defaults to `true`. `split_on_numerics`:: - If `true` causes "j2se" to be three tokens; "j" - "2" "se". Defaults to `true`. +(Optional, boolean) +If `true`, split tokens at letter-number transitions. For example: `j2se` -> +[ `j`, `2`, `se` ]. Defaults to `true`. `stem_english_possessive`:: - If `true` causes trailing "'s" to be - removed for each subword: "O'Neil's" -> "O", "Neil". 
Defaults to `true`. +(Optional, boolean) +If `true`, remove the English possessive (`'s`) from the end of each token. +For example: `O'Neil's` -> `[ `O`, `Neil` ]. Defaults to `true`. -Advance settings include: +`type_table`:: ++ +-- +(Optional, array of strings) +Array of custom type mappings for characters. This allows you to map +non-alphanumeric characters as numeric or alphanumeric to avoid splitting on +those characters. -`protected_words`:: - A list of protected words from being delimiter. - Either an array, or also can set `protected_words_path` which resolved - to a file configured with protected words (one on each line). - Automatically resolves to `config/` based location if exists. +For example, the following array maps the plus (`+`) and hyphen (`-`) characters +as alphanumeric, which means they won't be treated as delimiters: -`adjust_offsets`:: - By default, the filter tries to output subtokens with adjusted offsets - to reflect their actual position in the token stream. However, when - used in combination with other filters that alter the length or starting - position of tokens without changing their offsets - (e.g. <>) this can cause tokens with - illegal offsets to be emitted. Setting `adjust_offsets` to false will - stop `word_delimiter_graph` from adjusting these internal offsets. +`["+ => ALPHA", "- => ALPHA"]` -`type_table`:: - A custom type mapping table, for example (when configured - using `type_table_path`): - -[source,type_table] --------------------------------------------------- - # Map the $, %, '.', and ',' characters to DIGIT - # This might be useful for financial data. - $ => DIGIT - % => DIGIT - . => DIGIT - \\u002C => DIGIT - - # in some cases you might not want to split on ZWJ - # this also tests the case where we need a bigger byte[] - # see http://en.wikipedia.org/wiki/Zero-width_joiner - \\u200D => ALPHANUM --------------------------------------------------- - -NOTE: Using a tokenizer like the `standard` tokenizer may interfere with -the `catenate_*` and `preserve_original` parameters, as the original -string may already have lost punctuation during tokenization. Instead, -you may want to use the `whitespace` tokenizer. +Supported types include: + +* `ALPHA` (Alphabetical) +* `ALPHANUM` (Alphanumeric) +* `DIGIT` (Numeric) +* `LOWER` (Lowercase alphabetical) +* `SUBWORD_DELIM` (Non-alphanumeric delimiter) +* `UPPER` (Uppercase alphabetical) +-- + +`type_table_path`:: ++ +-- +(Optional, string) +Path to a file that contains custom type mappings for characters. This allows +you to map non-alphanumeric characters as numeric or alphanumeric to avoid +splitting on those characters. + +For example, the contents of this file may contain the following: + +[source,txt] +---- +# Map the $, %, '.', and ',' characters to DIGIT +# This might be useful for financial data. +$ => DIGIT +% => DIGIT +. => DIGIT +\\u002C => DIGIT + +# in some cases you might not want to split on ZWJ +# this also tests the case where we need a bigger byte[] +# see http://en.wikipedia.org/wiki/Zero-width_joiner +\\u200D => ALPHANUM +---- + +Supported types include: + +* `ALPHA` (Alphabetical) +* `ALPHANUM` (Alphanumeric) +* `DIGIT` (Numeric) +* `LOWER` (Lowercase alphabetical) +* `SUBWORD_DELIM` (Non-alphanumeric delimiter) +* `UPPER` (Uppercase alphabetical) + +This file path must be absolute or relative to the `config` location, and the +file must be UTF-8 encoded. Each mapping in the file must be separated by a line +break. 
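
For illustration only, a custom filter might reference such a file as follows.
The `analysis/wdg_type_table.txt` path here is a hypothetical example; as noted
above, it is resolved relative to the `config` location unless absolute:

[source,console]
----
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter_graph": {
          "type": "word_delimiter_graph",
          "type_table_path": "analysis/wdg_type_table.txt"
        }
      }
    }
  }
}
----
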
+-- + +[[analysis-word-delimiter-graph-tokenfilter-customize]] +==== Customize + +To customize the `word_delimiter_graph` filter, duplicate it to create the basis +for a new custom token filter. You can modify the filter using its configurable +parameters. + +For example, the following request creates a `word_delimiter_graph` +filter that uses the following rules: + +* Split tokens at non-alphanumeric characters, _except_ the hyphen (`-`) + character. +* Remove leading or trailing delimiters from each token. +* Do _not_ split tokens at letter case transitions. +* Do _not_ split tokens at letter-number transitions. +* Remove the English possessive (`'s`) from the end of each token. + +[source,console] +---- +PUT /my_index +{ + "settings": { + "analysis": { + "analyzer": { + "default": { + "tokenizer": "whitespace", + "filter": [ "my_custom_word_delimiter_graph_filter" ] + } + }, + "filter": { + "my_custom_word_delimiter_graph_filter": { + "type": "word_delimiter_graph", + "type_table": [ "- => ALPHA" ], + "split_on_case_change": false, + "split_on_numerics": false, + "stem_english_possessive": true + } + } + } + } +} +---- + +[[analysis-word-delimiter-graph-differences]] +==== Differences between `word_delimiter` and `word_delimiter_graph` + +Both the <> and +`word_delimiter_graph` token filters can produce catenated tokens when any of +the following parameters are `true`: + + * <> + * <> + * <> + +When adding these new tokens to a stream, the `word_delimiter` filter places +catenated tokens _after_ the first delimited token. For example, with +`catenate_words` set to `true`, the `word_delimiter` filter changes [ `the`, +`wi-fi`, `is`, `enabled`] to [`the`, `wi`, **`wifi`**, `fi`, `is`, `enabled` ]. + +This can cause issues for the <> +query and other queries that rely on the sequence of token streams for matching. + +The `word_delimiter_graph` filter places catenated tokens _before_ the first +delimited token. For example, with `catenate_words` set to `true`, the +`word_delimiter_graph` filter changes [ `the`, `wi-fi`, `is`, `enabled` ] to +[ `the`, **`wifi`**, `wi`, `fi`, `is`, `enabled` ]. + +This better preserves the token stream's original sequence and doesn't usually +interfere with `match_phrase` or similar queries. + +The `word_delimiter_graph` also supports the +<> parameter, +which adjusts the offsets of split or catenated tokens to reflect their actual +position in the token stream. The `adjust_offsets` parameter is not supported by +the `word_delimiter` filter. \ No newline at end of file From d48caa6fa0732510b8ba1ccb4b1ebfea812d5464 Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Thu, 5 Mar 2020 15:47:14 -0500 Subject: [PATCH 02/10] Tweak wording for boolean parms --- .../word-delimiter-graph-tokenfilter.asciidoc | 47 ++++++++++--------- 1 file changed, 25 insertions(+), 22 deletions(-) diff --git a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc index b4c803dbb7fd9..bd6917a87022b 100644 --- a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc +++ b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc @@ -156,8 +156,8 @@ instead. + -- (Optional, boolean) -If `true`, adjust the offsets of split or catenated tokens to better reflect -their actual position in the token stream. Defaults to `true`. 
+If `true`, the filter adjusts the offsets of split or catenated tokens to better +reflect their actual position in the token stream. Defaults to `true`. [WARNING] ==== @@ -171,40 +171,43 @@ could produce tokens with illegal offsets. [[word-delimiter-graph-tokenfilter-catenate-all]] `catenate_all`:: (Optional, boolean) -If `true`, produce catenated tokens for chains of alphanumeric characters -separated by non-alphabetic delimiters. For example: `wi-fi-232-enabled` -> [ -**`wifi232enabled`**, `wi`, `fi`, `232`, `enabled` ]. Defaults to `false`. +If `true`, the filter produces catenated tokens for chains of alphanumeric +characters separated by non-alphabetic delimiters. For example: +`wi-fi-232-enabled` -> [**`wifi232enabled`**, `wi`, `fi`, `232`, `enabled` ]. +Defaults to `false`. [[word-delimiter-graph-tokenfilter-catenate-numbers]] `catenate_numbers`:: (Optional, boolean) -If `true`, produce additional catenated tokens for chains of numeric characters +If `true`, the filter produces catenated tokens for chains of numeric characters separated by non-alphabetic delimiters. For example: `01-02-03` -> [**`010203`**, `01`, `02`, `03` ]. Defaults to `false`. [[word-delimiter-graph-tokenfilter-catenate-words]] `catenate_words`:: (Optional, boolean) -If `true`, produce catenated tokens for chains of alphabetical characters -separated by non-alphabetic delimiters. For example: `wi-fi-enabled` -> [ -**`wifienabled`**, `wi`, `fi`, `enabled`]. Defaults to `false`. +If `true`, the filter produces catenated tokens for chains of alphabetical +characters separated by non-alphabetic delimiters. For example: `wi-fi-enabled` +-> [**`wifienabled`**, `wi`, `fi`, `enabled`]. Defaults to `false`. `generate_number_parts`:: (Optional, boolean) -If `true`, include tokens consisting of only numeric characters in the output. -If `false`, exclude these tokens from the output. Defaults to `true`. +If `true`, the filter includes tokens consisting of only numeric characters in +the output. If `false`, the filter excludes these tokens from the output. +Defaults to `true`. `generate_word_parts`:: (Optional, boolean) -If `true`, include tokens consisting of only alphabetical characters in the -output. If `false`, exclude these tokens from the output. Defaults to `true`. +If `true`, the filter includes tokens consisting of only alphabetical characters +in the output. If `false`, the filter excludes these tokens from the output. +Defaults to `true`. [[word-delimiter-graph-tokenfilter-preserve-original]] `preserve_original`:: (Optional, boolean) -If `true`, include the original version of any split tokens in the output. This -original version includes non-alphanumeric delimiters. For example: `wi-fi-232` --> [**`wi-fi-232`**, `wi`, `fi`, `232` ]. Defaults to `false`. +If `true`, the filter includes the original version of any split tokens in the +output. This original version includes non-alphanumeric delimiters. For example: +`wi-fi-232` -> [**`wi-fi-232`**, `wi`, `fi`, `232` ]. Defaults to `false`. `protected_words`:: (Optional, array of strings) @@ -223,18 +226,18 @@ break. `split_on_case_change`:: (Optional, boolean) -If `true`, split tokens at letter case transitions. For example: `camelCase` -> -[ `camel`, `Case`]. Defaults to `true`. +If `true`, the filter splits tokens at letter case transitions. For example: +`camelCase` -> [ `camel`, `Case`]. Defaults to `true`. `split_on_numerics`:: (Optional, boolean) -If `true`, split tokens at letter-number transitions. For example: `j2se` -> -[ `j`, `2`, `se` ]. 
Defaults to `true`. +If `true`, the filter splits tokens at letter-number transitions. For example: +`j2se` -> [ `j`, `2`, `se` ]. Defaults to `true`. `stem_english_possessive`:: (Optional, boolean) -If `true`, remove the English possessive (`'s`) from the end of each token. -For example: `O'Neil's` -> `[ `O`, `Neil` ]. Defaults to `true`. +If `true`, the filter removes the English possessive (`'s`) from the end of each +token. For example: `O'Neil's` -> `[ `O`, `Neil` ]. Defaults to `true`. `type_table`:: + From aafad4f0a91272a2cefc74e1361dcc100d82b82a Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Fri, 6 Mar 2020 05:40:16 -0500 Subject: [PATCH 03/10] Remove experimental flag per #53217 --- .../tokenfilters/word-delimiter-graph-tokenfilter.asciidoc | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc index bd6917a87022b..6922741843aea 100644 --- a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc +++ b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc @@ -4,8 +4,6 @@ Word delimiter graph ++++ -experimental::["The `word_delimiter_graph` filter uses Lucene's {lucene-analysis-docs}/miscellaneous/WordDelimiterGraphFilter.html[WordDelimiterGraphFilter], which is marked as experimental in Lucene."] - Splits tokens at non-alphanumeric characters. The `word_delimiter_graph` filter also performs optional token normalization based on a set of rules. By default, the filter uses the following rules: @@ -22,6 +20,9 @@ the filter uses the following rules: * Remove the English possessive (`'s`) from the end of each token. For example: `Neil's` -> `Neil` +The `word_delimiter_graph` filter uses Lucene's +{lucene-analysis-docs}/miscellaneous/WordDelimiterGraphFilter.html[WordDelimiterGraphFilter] + [[analysis-word-delimiter-graph-tokenfilter-analyze-ex]] ==== Example From a15173a7402a1399cebe6fb64e884618bc14f493 Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Fri, 6 Mar 2020 05:40:42 -0500 Subject: [PATCH 04/10] Add missing period --- .../tokenfilters/word-delimiter-graph-tokenfilter.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc index 6922741843aea..c546c832e3bf1 100644 --- a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc +++ b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc @@ -21,7 +21,7 @@ the filter uses the following rules: For example: `Neil's` -> `Neil` The `word_delimiter_graph` filter uses Lucene's -{lucene-analysis-docs}/miscellaneous/WordDelimiterGraphFilter.html[WordDelimiterGraphFilter] +{lucene-analysis-docs}/miscellaneous/WordDelimiterGraphFilter.html[WordDelimiterGraphFilter]. 
[[analysis-word-delimiter-graph-tokenfilter-analyze-ex]] ==== Example From ae34f1e1843ed71a1323cb6dd054a13f16c506e8 Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Fri, 6 Mar 2020 06:00:01 -0500 Subject: [PATCH 05/10] Add analyze API link --- .../tokenfilters/word-delimiter-graph-tokenfilter.asciidoc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc index c546c832e3bf1..08ee67a61b8ac 100644 --- a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc +++ b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc @@ -26,9 +26,9 @@ The `word_delimiter_graph` filter uses Lucene's [[analysis-word-delimiter-graph-tokenfilter-analyze-ex]] ==== Example -The following analyze API request uses the `word_delimiter_graph` filter to -split `Neil's Wi-Fi-enabled PowerShot SD500` into normalized tokens using the -filter's default rules: +The following <> request uses the +`word_delimiter_graph` filter to split `Neil's Wi-Fi-enabled PowerShot SD500` +into normalized tokens using the filter's default rules: [source,console] ---- From acf8127265c78eab62ee86454058868b7cdc60a8 Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Fri, 6 Mar 2020 06:44:09 -0500 Subject: [PATCH 06/10] Reset trim filter changes --- .../analysis/tokenfilters/trim-tokenfilter.asciidoc | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/docs/reference/analysis/tokenfilters/trim-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/trim-tokenfilter.asciidoc index ff0775ae39915..19d47f203afb8 100644 --- a/docs/reference/analysis/tokenfilters/trim-tokenfilter.asciidoc +++ b/docs/reference/analysis/tokenfilters/trim-tokenfilter.asciidoc @@ -4,9 +4,7 @@ Trim ++++ -Removes leading and trailing whitespace from each token in a stream. While this -can change the length of a token, the `trim` filter does _not_ change a token's -offsets. +Removes leading and trailing whitespace from each token in a stream. The `trim` filter uses Lucene's https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html[TrimFilter]. @@ -71,8 +69,7 @@ GET _analyze ---- The API returns the following response. The returned `fox` token does not -include any leading or trailing whitespace. Note that despite changing the -token's length, the `start_offset` and `end_offset` remain the same. +include any leading or trailing whitespace. 
[source,console-result] ---- From e3c1144bdcb72501a93f88ed81f3783e02ce1e70 Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Fri, 6 Mar 2020 15:58:04 -0500 Subject: [PATCH 07/10] Address review feedback --- .../word-delimiter-graph-tokenfilter.asciidoc | 217 ++++++++++++++---- .../images/analysis/token-graph-basic.svg | 44 ++++ .../images/analysis/token-graph-wd.svg | 52 +++++ .../images/analysis/token-graph-wdg.svg | 53 +++++ 4 files changed, 317 insertions(+), 49 deletions(-) create mode 100644 docs/reference/images/analysis/token-graph-basic.svg create mode 100644 docs/reference/images/analysis/token-graph-wd.svg create mode 100644 docs/reference/images/analysis/token-graph-wdg.svg diff --git a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc index 08ee67a61b8ac..1a135c6b69388 100644 --- a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc +++ b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc @@ -10,24 +10,37 @@ the filter uses the following rules: * Split tokens at non-alphanumeric characters. The filter uses these characters as delimiters. - For example: `Wi-Fi` -> `Wi`, `Fi` + For example: `Super-Duper` -> `Super`, `Duper` * Remove leading or trailing delimiters from each token. - For example: `hello---there, 'dude'` -> `hello`, `there`, `dude` + For example: `XL---42+'Autocoder'` -> `XL`, `42`, `Autocoder` * Split tokens at letter case transitions. For example: `PowerShot` -> `Power`, `Shot` * Split tokens at letter-number transitions. - For example: `SD500` -> `SD`, `500` + For example: `XL500` -> `XL`, `500` * Remove the English possessive (`'s`) from the end of each token. For example: `Neil's` -> `Neil` The `word_delimiter_graph` filter uses Lucene's {lucene-analysis-docs}/miscellaneous/WordDelimiterGraphFilter.html[WordDelimiterGraphFilter]. +[TIP] +==== +The `word_delimiter_graph` filter was designed to remove punctuation from +complex identifiers, such as product IDs or part numbers. For these use cases, +we recommend using the `word_delimiter_graph` filter with the +<> tokenizer. + +Avoid using the `word_delimiter_graph` filter to split hyphenated words, such as +`wi-fi`. Because users often search for these words both with and without +hyphens, we recommend using the +<> filter instead. 
+==== + [[analysis-word-delimiter-graph-tokenfilter-analyze-ex]] ==== Example The following <> request uses the -`word_delimiter_graph` filter to split `Neil's Wi-Fi-enabled PowerShot SD500` +`word_delimiter_graph` filter to split `Neil's Super-Duper-XL500--42+AutoCoder` into normalized tokens using the filter's default rules: [source,console] @@ -36,7 +49,7 @@ GET /_analyze { "tokenizer": "whitespace", "filter": [ "word_delimiter_graph" ], - "text": "Neil's Wi-Fi-enabled PowerShot SD500" + "text": "Neil's Super-Duper-XL500--42+AutoCoder" } ---- @@ -44,7 +57,7 @@ The filter produces the following tokens: [source,txt] ---- -[ Neil, Wi, Fi, enabled, Power, Shot, SD, 500 ] +[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ] ---- //// @@ -60,51 +73,51 @@ The filter produces the following tokens: "position" : 0 }, { - "token" : "Wi", + "token" : "Super", "start_offset" : 7, - "end_offset" : 9, + "end_offset" : 12, "type" : "word", "position" : 1 }, { - "token" : "Fi", - "start_offset" : 10, - "end_offset" : 12, + "token" : "Duper", + "start_offset" : 13, + "end_offset" : 18, "type" : "word", "position" : 2 }, { - "token" : "enabled", - "start_offset" : 13, - "end_offset" : 20, + "token" : "XL", + "start_offset" : 19, + "end_offset" : 21, "type" : "word", "position" : 3 }, { - "token" : "Power", + "token" : "500", "start_offset" : 21, - "end_offset" : 26, + "end_offset" : 24, "type" : "word", "position" : 4 }, { - "token" : "Shot", + "token" : "42", "start_offset" : 26, - "end_offset" : 30, + "end_offset" : 28, "type" : "word", "position" : 5 }, { - "token" : "SD", - "start_offset" : 31, + "token" : "Auto", + "start_offset" : 29, "end_offset" : 33, "type" : "word", "position" : 6 }, { - "token" : "500", + "token" : "Coder", "start_offset" : 33, - "end_offset" : 36, + "end_offset" : 38, "type" : "word", "position" : 7 } @@ -145,8 +158,8 @@ This could prevent the `word_delimiter_graph` filter from splitting tokens correctly. It can also interfere with the filter's configurable parameters, such as <> or <>. We -recommend using the <> tokenizer -instead. +recommend using the <> or +<> tokenizer instead. ==== [[word-delimiter-graph-tokenfilter-configure-parms]] @@ -171,25 +184,79 @@ could produce tokens with illegal offsets. [[word-delimiter-graph-tokenfilter-catenate-all]] `catenate_all`:: ++ +-- (Optional, boolean) If `true`, the filter produces catenated tokens for chains of alphanumeric characters separated by non-alphabetic delimiters. For example: -`wi-fi-232-enabled` -> [**`wifi232enabled`**, `wi`, `fi`, `232`, `enabled` ]. +`super-duper-xl-500` -> [**`superduperxl500`**, `super`, `duper`, `xl`, `500` ]. Defaults to `false`. +[WARNING] +==== +Setting this parameter to `true` produces multi-position tokens, which are not +supported by indexing. + +If this parameter is `true`, avoid using this filter in an index analyzer or +use the <> filter after +this filter to make the token stream suitable for indexing. + +When used for search analysis, catenated tokens can cause problems for the +<> query and other queries that +rely on token position for matching. Avoid setting this parameter to `true` if +you plan to use these queries. +==== +-- + [[word-delimiter-graph-tokenfilter-catenate-numbers]] `catenate_numbers`:: ++ +-- (Optional, boolean) If `true`, the filter produces catenated tokens for chains of numeric characters separated by non-alphabetic delimiters. For example: `01-02-03` -> [**`010203`**, `01`, `02`, `03` ]. Defaults to `false`. 
+[WARNING] +==== +Setting this parameter to `true` produces multi-position tokens, which are not +supported by indexing. + +If this parameter is `true`, avoid using this filter in an index analyzer or +use the <> filter after +this filter to make the token stream suitable for indexing. + +When used for search analysis, catenated tokens can cause problems for the +<> query and other queries that +rely on token position for matching. Avoid setting this parameter to `true` if +you plan to use these queries. +==== +-- + [[word-delimiter-graph-tokenfilter-catenate-words]] `catenate_words`:: ++ +-- (Optional, boolean) If `true`, the filter produces catenated tokens for chains of alphabetical -characters separated by non-alphabetic delimiters. For example: `wi-fi-enabled` --> [**`wifienabled`**, `wi`, `fi`, `enabled`]. Defaults to `false`. +characters separated by non-alphabetic delimiters. For example: `super-duper-xl` +-> [**`superduperxl`**, `super`, `duper`, `xl`]. Defaults to `false`. + +[WARNING] +==== +Setting this parameter to `true` produces multi-position tokens, which are not +supported by indexing. + +If this parameter is `true`, avoid using this filter in an index analyzer or +use the <> filter after +this filter to make the token stream suitable for indexing. + +When used for search analysis, catenated tokens can cause problems for the +<> query and other queries that +rely on token position for matching. Avoid setting this parameter to `true` if +you plan to use these queries. +==== +-- `generate_number_parts`:: (Optional, boolean) @@ -205,10 +272,24 @@ Defaults to `true`. [[word-delimiter-graph-tokenfilter-preserve-original]] `preserve_original`:: ++ +-- (Optional, boolean) If `true`, the filter includes the original version of any split tokens in the output. This original version includes non-alphanumeric delimiters. For example: -`wi-fi-232` -> [**`wi-fi-232`**, `wi`, `fi`, `232` ]. Defaults to `false`. +`super-duper-xl-500` -> [**`super-duper-xl-500`**, `super`, `duper`, `xl`, `500` +]. Defaults to `false`. + +[WARNING] +==== +Setting this parameter to `true` produces multi-position tokens, which are not +supported by indexing. + +If this parameter is `true`, avoid using this filter in an index analyzer or +use the <> filter after +this filter to make the token stream suitable for indexing. +==== +-- `protected_words`:: (Optional, array of strings) @@ -326,7 +407,7 @@ PUT /my_index "settings": { "analysis": { "analyzer": { - "default": { + "my_analyzer": { "tokenizer": "whitespace", "filter": [ "my_custom_word_delimiter_graph_filter" ] } @@ -348,32 +429,70 @@ PUT /my_index [[analysis-word-delimiter-graph-differences]] ==== Differences between `word_delimiter` and `word_delimiter_graph` -Both the <> and -`word_delimiter_graph` token filters can produce catenated tokens when any of -the following parameters are `true`: +Both the <> and +`word_delimiter_graph` filters produce tokens that span multiple positions when +any of the following parameters are `true`: * <> * <> * <> + * <> + +However, only the `word_delimiter_graph` filter assigns multi-position tokens a +`positionLength` attribute, which indicates the number of positions a token +spans. This ensures the `word_delimiter_graph` filter always produces valid token +https://en.wikipedia.org/wiki/Directed_acyclic_graph[graphs]. + +The `word_delimiter` filter does not assign multi-position tokens a +`positionLength` attribute. This means it produces invalid graphs for streams +including these tokens. 
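
If you want to inspect the `positionLength` values yourself, an analyze request
along the following lines should surface them. This is only a sketch; the
`explain` output is verbose and its exact layout may vary:

[source,console]
----
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [ { "type": "word_delimiter_graph", "catenate_words": true } ],
  "text": "PowerShot2000",
  "explain": true,
  "attributes": [ "positionLength" ]
}
----
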
+ +While indexing does not support token graphs containing multi-position tokens, +queries, such as the <> query, can +use these graphs to generate multiple sub-queries from a single query string. -When adding these new tokens to a stream, the `word_delimiter` filter places -catenated tokens _after_ the first delimited token. For example, with -`catenate_words` set to `true`, the `word_delimiter` filter changes [ `the`, -`wi-fi`, `is`, `enabled`] to [`the`, `wi`, **`wifi`**, `fi`, `is`, `enabled` ]. +To see how token graphs produced by the `word_delimiter` and +`word_delimiter_graph` filters differ, check out the following example. + +.*Example* +[%collapsible] +==== + +[[analysis-word-delimiter-graph-basic-token-graph]] +*Basic token graph* + +Both the `word_delimiter` and `word_delimiter_graph` produce the following token +graph for `PowerShot2000` when the following parameters are `false`: + + * <> + * <> + * <> + * <> + +This graph does not contain multi-position tokens. All tokens span only one +position. + +image::images/analysis/token-graph-basic.svg[align="center"] + +[[analysis-word-delimiter-graph-wdg-token-graph]] +*`word_delimiter_graph` graph with a multi-position token* + +The `word_delimiter_graph` filter produces the following token graph for +`PowerShot2000` when `catenate_words` is `true`. + +image::images/analysis/token-graph-wdg.svg[align="center"] + +This graph correctly indicates the catenated `PowerShot` token spans two +positions. +==== -This can cause issues for the <> -query and other queries that rely on the sequence of token streams for matching. +[[analysis-word-delimiter-graph-wd-token-graph]] +*`word_delimiter` graph with a multi-position token* -The `word_delimiter_graph` filter places catenated tokens _before_ the first -delimited token. For example, with `catenate_words` set to `true`, the -`word_delimiter_graph` filter changes [ `the`, `wi-fi`, `is`, `enabled` ] to -[ `the`, **`wifi`**, `wi`, `fi`, `is`, `enabled` ]. +When `catenate_words` is `true`, the `word_delimiter` filter produces +the following token graph for `PowerShot2000`. -This better preserves the token stream's original sequence and doesn't usually -interfere with `match_phrase` or similar queries. +Note that the catenated `PowerShot` token should span two positions but only +spans one in the token graph, making it invalid. -The `word_delimiter_graph` also supports the -<> parameter, -which adjusts the offsets of split or catenated tokens to reflect their actual -position in the token stream. The `adjust_offsets` parameter is not supported by -the `word_delimiter` filter. \ No newline at end of file +image::images/analysis/token-graph-wd.svg[align="center"] \ No newline at end of file diff --git a/docs/reference/images/analysis/token-graph-basic.svg b/docs/reference/images/analysis/token-graph-basic.svg new file mode 100644 index 0000000000000..b4a2bcc70bab9 --- /dev/null +++ b/docs/reference/images/analysis/token-graph-basic.svg @@ -0,0 +1,44 @@ + + + + Slice 1 + Created with Sketch. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/reference/images/analysis/token-graph-wd.svg b/docs/reference/images/analysis/token-graph-wd.svg new file mode 100644 index 0000000000000..cdbbfb8a0845c --- /dev/null +++ b/docs/reference/images/analysis/token-graph-wd.svg @@ -0,0 +1,52 @@ + + + + Slice 1 + Created with Sketch. 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/reference/images/analysis/token-graph-wdg.svg b/docs/reference/images/analysis/token-graph-wdg.svg new file mode 100644 index 0000000000000..992637bd668d5 --- /dev/null +++ b/docs/reference/images/analysis/token-graph-wdg.svg @@ -0,0 +1,53 @@ + + + + Slice 1 + Created with Sketch. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file From 97a13e950a6d765e5d99c5ec586e0fd581fd5c87 Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Mon, 9 Mar 2020 05:54:58 -0400 Subject: [PATCH 08/10] Fix formatting --- .../word-delimiter-graph-tokenfilter.asciidoc | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc index 1a135c6b69388..aa8c2f3267533 100644 --- a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc +++ b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc @@ -478,14 +478,11 @@ image::images/analysis/token-graph-basic.svg[align="center"] *`word_delimiter_graph` graph with a multi-position token* The `word_delimiter_graph` filter produces the following token graph for -`PowerShot2000` when `catenate_words` is `true`. +`PowerShot2000` when `catenate_words` is `true`.This graph correctly indicates +the catenated `PowerShot` token spans two positions. image::images/analysis/token-graph-wdg.svg[align="center"] -This graph correctly indicates the catenated `PowerShot` token spans two -positions. -==== - [[analysis-word-delimiter-graph-wd-token-graph]] *`word_delimiter` graph with a multi-position token* @@ -495,4 +492,6 @@ the following token graph for `PowerShot2000`. Note that the catenated `PowerShot` token should span two positions but only spans one in the token graph, making it invalid. -image::images/analysis/token-graph-wd.svg[align="center"] \ No newline at end of file +image::images/analysis/token-graph-wd.svg[align="center"] + +==== \ No newline at end of file From 932eb892834f7c32d4e963bb4c5629914fbe01bd Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Mon, 9 Mar 2020 05:57:37 -0400 Subject: [PATCH 09/10] Another formatting fix --- .../tokenfilters/word-delimiter-graph-tokenfilter.asciidoc | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc index aa8c2f3267533..1b6a40c9a3ae7 100644 --- a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc +++ b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc @@ -478,8 +478,10 @@ image::images/analysis/token-graph-basic.svg[align="center"] *`word_delimiter_graph` graph with a multi-position token* The `word_delimiter_graph` filter produces the following token graph for -`PowerShot2000` when `catenate_words` is `true`.This graph correctly indicates -the catenated `PowerShot` token spans two positions. +`PowerShot2000` when `catenate_words` is `true`. + +This graph correctly indicates the catenated `PowerShot` token spans two +positions. 
image::images/analysis/token-graph-wdg.svg[align="center"] From 1885f0218c7a8d9df8895c7ac2239e35cd26d7c9 Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Mon, 9 Mar 2020 05:58:54 -0400 Subject: [PATCH 10/10] Change heading order --- .../word-delimiter-graph-tokenfilter.asciidoc | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc index 1b6a40c9a3ae7..08cf0789a4ec1 100644 --- a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc +++ b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc @@ -427,11 +427,11 @@ PUT /my_index ---- [[analysis-word-delimiter-graph-differences]] -==== Differences between `word_delimiter` and `word_delimiter_graph` +==== Differences between `word_delimiter_graph` and `word_delimiter` -Both the <> and -`word_delimiter_graph` filters produce tokens that span multiple positions when -any of the following parameters are `true`: +Both the `word_delimiter_graph` and +<> filters produce tokens +that span multiple positions when any of the following parameters are `true`: * <> * <>
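
When any of these parameters are `true` in an index analyzer, the warnings above
recommend adding a graph-flattening filter after this one; in Elasticsearch that
is the `flatten_graph` token filter. A minimal sketch of such an analyzer, with
illustrative names only:

[source,console]
----
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_index_time_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "my_word_delimiter_graph", "flatten_graph" ]
        }
      },
      "filter": {
        "my_word_delimiter_graph": {
          "type": "word_delimiter_graph",
          "catenate_words": true
        }
      }
    }
  }
}
----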