[DOCS] Reformat `word_delimiter_graph` token filter #53170

Conversation
Makes the following changes to the `word_delimiter_graph` token filter docs:

* Updates the Lucene experimental admonition
* Updates the description
* Adds an analyze snippet
* Adds custom analyzer and custom filter snippets
* Reorganizes and updates the parameter list
* Expands and updates the section on differences between `word_delimiter` and `word_delimiter_graph`
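For context, the kind of analyze snippet added here looks roughly like the following sketch (the sample text is illustrative, not necessarily the exact one used in the docs):

```
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [ "word_delimiter_graph" ],
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}
```

With default settings the filter splits on non-alphanumeric characters, case changes, and letter-number transitions, producing tokens like `Neil`, `Super`, `Duper`, `XL`, `500`, `42`, `Auto`, `Coder`.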
Pinging @elastic/es-search (:Search/Analysis)
Pinging @elastic/es-docs (>docs)
I left a bit of a rambly comment on when to use this filter. It's very easy to misuse it - @jimczi may have an opinion here as well.
[ `the`, **`wifi`**, `wi`, `fi`, `is`, `enabled` ].

This better preserves the token stream's original sequence and doesn't usually interfere with `match_phrase` or similar queries.
I don't think this is true? `catenate_X` parameters break phrase searching in general - for example, searching for the exact phrase `the wifi is enabled` won't match against the token stream above because `fi` introduces an extra position, so `is` is indexed as if it were two positions away from `wifi`. This is a hard problem in Lucene - we don't want to start indexing position lengths because that would make phrase queries much slower.

The advantage of the `_graph` variant is that it produces graphs which can be used at query time to generate several queries, so a query for `the wi-fi is enabled` will produce two phrase queries, `the wi fi is enabled` and `the wifi is enabled`. All good if you've indexed the phrase `the wi-fi is enabled`, as the first query will match. However, searching for `the wifi is enabled` won't match - it's all lowercase in the query, so the filter doesn't recognise the need to break it up, and in the index `wifi` is two positions away from `is`.
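To make the position gap concrete, here is a minimal `_analyze` sketch (the inline filter definition and the `catenate_words` parameter are real API features; the example text mirrors the one discussed above):

```
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "catenate_words": true
    }
  ],
  "text": "the wi-fi is enabled"
}
```

In the response, `wifi` is emitted at position 1 while `is` lands at position 3, so an indexed document analyzed this way can't satisfy a phrase query for `the wifi is enabled`.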
Breaking up hyphenated words is tricky because of the possibility that people will try to search for the word without the hyphen; I think these are probably better dealt with via synonyms. A better use case for removing punctuation is things like part numbers, where you only really want phrase searching within a multi-part token, so you use WDGF with a keyword tokenizer.
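A minimal sketch of that part-number setup (the index, analyzer, and field names here are hypothetical):

```
PUT /my-parts
{
  "settings": {
    "analysis": {
      "analyzer": {
        "part_number": {
          "tokenizer": "keyword",
          "filter": [ "word_delimiter_graph" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "part": { "type": "text", "analyzer": "part_number" }
    }
  }
}
```

Because the `keyword` tokenizer passes the whole input through as a single token, the filter only splits within the part number itself, so phrase matching stays scoped to the multi-part token.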
Aha! The query time bit makes sense to me. I can also add some warnings to the catenate parameters so users know they'll break phrase searches. I'll also amend the intro a bit to cover the product/part number use case.
Thanks again for your feedback @romseygeek. I've made some adjustments throughout the page. Here are the changes:
To better check out the diagrams, you can use this preview link:
Thanks @jrodewig , this looks much better - the token graph diagrams especially are very helpful.
Thanks @romseygeek!
Also updates the `trim` filter docs to note that the `trim` filter does not change token offsets. (Moved to #53220.)

Preview: http://elasticsearch_53170.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/analysis-word-delimiter-graph-tokenfilter.html