
Commit 2fe9ba5

[DOCS] Note limitations of max_gram param in edge_ngram tokenizer for index analyzers (#49007)
The `edge_ngram` tokenizer limits tokens to the `max_gram` character length. Autocomplete searches for terms longer than this limit return no results. To prevent this, you can use the `truncate` token filter to truncate tokens to the `max_gram` character length. However, this could return irrelevant results. This commit adds some advisory text to make users aware of this limitation and outline the tradeoffs for each approach. Closes #48956.
1 parent 9159af5 commit 2fe9ba5

File tree

1 file changed: +36 -5 lines changed


docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc

Lines changed: 36 additions & 5 deletions
@@ -72,12 +72,16 @@ configure the `edge_ngram` before using it.
 
 The `edge_ngram` tokenizer accepts the following parameters:
 
-[horizontal]
 `min_gram`::
 Minimum length of characters in a gram. Defaults to `1`.
 
 `max_gram`::
-Maximum length of characters in a gram. Defaults to `2`.
++
+--
+Maximum length of characters in a gram. Defaults to `2`.
+
+See <<max-gram-limits>>.
+--
 
 `token_chars`::
 
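For context, a transient `_analyze` request is a quick way to see how `min_gram` and `max_gram` bound the emitted grams. This sketch is illustrative only and is not part of the committed file; the `min_gram` and `max_gram` values match the documented defaults:

[source,console]
-----------------------------------
POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 2
  },
  "text": "Quick"
}
-----------------------------------

With `min_gram` 1 and `max_gram` 2, the input `Quick` yields only the grams `Q` and `Qu`; nothing longer than `max_gram` characters is emitted, which is the behavior the new `max-gram-limits` section documents.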
@@ -93,6 +97,29 @@ Character classes may be any of the following:
 * `punctuation` -- for example `!` or `"`
 * `symbol` -- for example `$` or `√`
 
+[[max-gram-limits]]
+=== Limitations of the `max_gram` parameter
+
+The `edge_ngram` tokenizer's `max_gram` value limits the character length of
+tokens. When the `edge_ngram` tokenizer is used with an index analyzer, this
+means search terms longer than the `max_gram` length may not match any indexed
+terms.
+
+For example, if the `max_gram` is `3`, searches for `apple` won't match the
+indexed term `app`.
+
+To account for this, you can use the <<analysis-truncate-tokenfilter,`truncate`
+token filter>> with a search analyzer to shorten search terms to the
+`max_gram` character length. However, this could return irrelevant results.
+
+For example, if the `max_gram` is `3` and search terms are truncated to three
+characters, the search term `apple` is shortened to `app`. This means searches
+for `apple` return any indexed terms matching `app`, such as `apply`, `snapped`,
+and `apple`.
+
+We recommend testing both approaches to see which best fits your
+use case and desired search experience.
+
 [float]
 === Example configuration
 
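As a sketch of the `truncate` filter approach described in the new section (the index, analyzer, and filter names below are placeholders, not part of this commit), the search analyzer shortens query terms to the same length as the index-time `max_gram`:

[source,console]
-----------------------------------
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete_edge",
          "filter": [ "lowercase" ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase",
          "filter": [ "truncate_to_gram" ]
        }
      },
      "tokenizer": {
        "autocomplete_edge": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 3,
          "token_chars": [ "letter" ]
        }
      },
      "filter": {
        "truncate_to_gram": {
          "type": "truncate",
          "length": 3
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
-----------------------------------

Here a search for `apple` is analyzed to `app`, so it matches the indexed edge n-grams, but so does any other term whose first three letters produce `app`; that is the relevance tradeoff the section warns about.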
@@ -209,12 +236,16 @@ The above example produces the following terms:
 ---------------------------
 
 Usually we recommend using the same `analyzer` at index time and at search
-time. In the case of the `edge_ngram` tokenizer, the advice is different. It
+time. In the case of the `edge_ngram` tokenizer, the advice is different. It
 only makes sense to use the `edge_ngram` tokenizer at index time, to ensure
-that partial words are available for matching in the index. At search time,
+that partial words are available for matching in the index. At search time,
 just search for the terms the user has typed in, for instance: `Quick Fo`.
 
-Below is an example of how to set up a field for _search-as-you-type_:
+Below is an example of how to set up a field for _search-as-you-type_.
+
+Note that the `max_gram` value for the index analyzer is `10`, which limits
+indexed terms to 10 characters. Search terms are not truncated, meaning that
+search terms longer than 10 characters may not match any indexed terms.
 
 [source,console]
 -----------------------------------
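The `[source,console]` block that follows in the file is cut off in this view. A sketch of a comparable _search-as-you-type_ setup, with placeholder names, an index-time `max_gram` of `10`, and no truncation of search terms, might look like this:

[source,console]
-----------------------------------
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [ "lowercase" ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [ "letter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

GET my-index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick Fo",
        "operator": "and"
      }
    }
  }
}
-----------------------------------

Indexed grams are capped at 10 characters while the query `Quick Fo` is searched as typed; a query term longer than 10 characters would find no matching grams, which is the limitation the added note calls out.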
