
[DOCS] Reformat word_delimiter_graph token filter #53170


Merged · 10 commits · Mar 9, 2020

Conversation

@jrodewig (Contributor) commented Mar 5, 2020

Makes the following changes to the word_delimiter_graph token filter docs:

  • Updates the Lucene experimental admonition.
  • Updates description
  • Adds detailed analyze snippet
  • Adds custom analyzer and custom filter snippets
  • Reorganizes and updates parameter documentation
  • Expands and updates section re: differences between word_delimiter and word_delimiter_graph

Also updates the trim filter docs to note that the trim filter does not change token offsets. (Moved to #53220.)

Preview

http://elasticsearch_53170.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/analysis-word-delimiter-graph-tokenfilter.html

@jrodewig jrodewig added >docs General docs changes :Search Relevance/Analysis How text is split into tokens labels Mar 5, 2020
@elasticmachine (Collaborator): Pinging @elastic/es-search (:Search/Analysis)

@elasticmachine (Collaborator): Pinging @elastic/es-docs (>docs)

@romseygeek (Contributor) left a comment:
I left a bit of a rambly comment on when to use this filter. It's very easy to misuse it - @jimczi may have an opinion here as well.

[ `the`, **`wifi`**, `wi`, `fi`, `is`, `enabled` ].

This better preserves the token stream's original sequence and doesn't usually
interfere with `match_phrase` or similar queries.
Contributor

I don't think this is true? catenate_X parameters break phrase searching in general. For example, searching for the exact phrase the wifi is enabled won't match against the token stream above, because fi introduces an extra position, so is is indexed as if it were two positions away from wifi. This is a hard problem in Lucene: we don't want to start indexing position lengths because that would make phrase queries much slower.
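The position arithmetic above can be sketched with a toy model (plain Python, not Lucene or Elasticsearch internals; the `indexed` stream and `phrase_match` helper are invented for illustration): each token is stored as a (term, position) pair, position lengths are discarded, and an exact phrase only matches if its terms sit at strictly consecutive positions.

```python
# Toy model of the indexed token stream for "the wi-fi is enabled"
# with word_delimiter_graph and catenation enabled. Lucene does not
# store position lengths, so the catenated "wifi" sits at a single
# position even though it spans "wi" and "fi".
indexed = [("the", 0), ("wifi", 1), ("wi", 1), ("fi", 2),
           ("is", 3), ("enabled", 4)]

def phrase_match(stream, phrase):
    """Naive exact-phrase check: each phrase term must appear at
    strictly consecutive positions in the stream."""
    positions = {}
    for term, pos in stream:
        positions.setdefault(term, set()).add(pos)
    for start in positions.get(phrase[0], set()):
        if all(pos in positions.get(term, set())
               for pos, term in enumerate(phrase[1:], start + 1)):
            return True
    return False

# "fi" occupies an extra position, so "is" ends up two positions away
# from "wifi" and the exact phrase no longer matches:
print(phrase_match(indexed, ["the", "wifi", "is", "enabled"]))      # False
print(phrase_match(indexed, ["the", "wi", "fi", "is", "enabled"]))  # True
```

This is exactly the failure mode described above: the split tokens still line up, but the catenated token does not.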

The advantage of the _graph variant is that it produces graphs which can be used at query time to generate several queries, so a query for the wi-fi is enabled will produce two phrase queries, the wi fi is enabled and the wifi is enabled. All good if you've indexed the phrase the wi-fi is enabled, as the first query will match. However, searching for the wifi is enabled won't match - it's all lowercase in the query, so the filter doesn't recognise the need to break it up, and in the index wifi is two positions away from is.

Breaking up words with hyphens in them is tricky because of the possibility that people will try to search for the word without the hyphen; I think these are probably better dealt with via synonyms. A better use case for removing punctuation is things like part numbers, where you only really want phrase searching within a multi-part token, so you use word_delimiter_graph with a keyword tokenizer.
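A minimal sketch of that part-number setup (the index name `parts-example` and analyzer name `part_number_analyzer` are invented for illustration; the keyword tokenizer plus `word_delimiter_graph` filter combination is the one described above):

```console
PUT /parts-example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "part_number_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "word_delimiter_graph" ]
        }
      }
    }
  }
}

GET /parts-example/_analyze
{
  "analyzer": "part_number_analyzer",
  "text": "PN-1000-XT"
}
```

Because the keyword tokenizer emits the whole input as one token, the graph filter only splits within that single token, which keeps phrase-style matching confined to the part number itself.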

Contributor Author

Aha! The query time bit makes sense to me. I can also add some warnings to the catenate parameters so users know they'll break phrase searches. I'll also amend the intro a bit to cover the product/part number use case.

@jrodewig (Contributor, Author) commented Mar 9, 2020

Thanks again for your feedback @romseygeek.

I've made some adjustments throughout the page. Here are the changes:

  • Updated the examples in the description to better fit identifiers, such as part numbers
  • Added a tip to the description re: identifiers
  • Updated the analyze example to fit identifiers
  • Added a warning re: indexing and multi-position tokens to the catenate_* and preserve_original parameters
  • Added a warning re: match_phrase to the catenate_* parameters
  • Rewrote the differences section to focus on token graph output, including some token graph diagrams.

To view the rendered diagrams, you can use this preview link:
http://elasticsearch_53170.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/analysis-word-delimiter-graph-tokenfilter.html

@jrodewig jrodewig requested a review from romseygeek March 9, 2020 10:11
@romseygeek (Contributor) left a comment:
Thanks @jrodewig, this looks much better. The token graph diagrams especially are very helpful.

@jrodewig (Contributor, Author) commented Mar 9, 2020

Thanks @romseygeek!

@jrodewig jrodewig merged commit 1c8ab01 into elastic:master Mar 9, 2020
@jrodewig jrodewig deleted the docs__reformat-word-delimiter-graph-tokenfilter branch March 9, 2020 10:27
jrodewig added a commit that referenced this pull request Mar 9, 2020

jrodewig added a commit that referenced this pull request Mar 9, 2020
@jrodewig (Contributor, Author) commented Mar 9, 2020

Backport commits:

  • master: 1c8ab01
  • 7.x: 28cb4a1
  • 7.6: 8e3c1cb

Labels
>docs General docs changes :Search Relevance/Analysis How text is split into tokens v7.6.2 v7.7.0 v8.0.0-alpha1
4 participants