Update Edge NGram Tokenizer documentation #48956

Closed
Stefanqn opened this issue Nov 11, 2019 · 6 comments · Fixed by #49007
Assignees
Labels
>docs (General docs changes), :Search Relevance/Analysis (How text is split into tokens), Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments

@Stefanqn

Elasticsearch version (bin/elasticsearch --version): current

Plugins installed: []

JVM version (java -version): -

OS version (uname -a if on a Unix-like system): -

Description of the problem including expected versus actual behavior: the documented 'Example configuration' with a separate search_analyzer returns no hits when the search term is longer than max_gram.

Steps to reproduce:

  1. Read the Elasticsearch documentation for the Edge NGram tokenizer and use its 'Example configuration' with a split search_analyzer and analyzer.
  2. This results in unexpected behavior once a search term exceeds the max_gram length, e.g. with "max_gram": 3, searching for an existing, indexed "aaaa" returns an empty result set.
  3. Please update the documentation, e.g. with the following configuration:
 "settings": {
        "analysis": {
            "filter": {
                "truncate_search": {
                    "type": "truncate",
                    "length": 10
                }
            },
            "analyzer": {
                "autocomplete": {
                    "tokenizer": "autocomplete",
                    "filter": [
                        "lowercase"
                    ]
                },
                "autocomplete_search": {
                    "tokenizer": "lowercase",
                    "filter": "truncate_search"
                }
            },
            "tokenizer": {
                "autocomplete": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": [
                        "letter"
                    ]
                }
            }
        }
    }

This shows the need for a truncate filter on the search analyzer; a complete index-creation request built from these settings is sketched below.
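For reference, a complete index-creation request that combines these settings with a field mapping could look as follows. The index name my_index and the title field are assumptions, chosen to match the reproduction snippets later in this thread; this is only a sketch of how the split analyzers would be wired up, not an official docs example.

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "truncate_search": {
          "type": "truncate",
          "length": 10
        }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [ "lowercase" ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase",
          "filter": [ "truncate_search" ]
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [ "letter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

With this setup, index-time tokens are edge n-grams of 1–10 characters, while the search analyzer lowercases the query and truncates it to 10 characters so it can still match the longest indexed prefix.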

Provide logs (if relevant): -

@alpar-t alpar-t added :Docs :Search Relevance/Analysis How text is split into tokens labels Nov 12, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Analysis)

@elasticmachine
Collaborator

Pinging @elastic/es-docs (:Docs)

@cbuescher cbuescher added >docs General docs changes and removed :Docs labels Nov 12, 2019
@jrodewig jrodewig self-assigned this Nov 12, 2019
@jrodewig
Contributor

Hi @Stefanqn

Thanks for reaching out. Unfortunately, I wasn't able to reproduce your reported problem using the current snippet setup. I've outlined my steps below.

Can you highlight the expected behavior or where your steps differed?

  1. Create an index with split search and index analyzers.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
  2. Index a document with a title of aaaa.
PUT my_index/_doc/1
{
  "title": "aaaa" 
}
  3. Refresh the index.
POST my_index/_refresh
  4. Search for aaaa in the title field.

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "aaaa", 
        "operator": "and"
      }
    }
  }
}

Response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "title" : "aaaa"
        }
      }
    ]
  }
}
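As an aside, the _analyze API shows why this particular search matches. With the index above (min_gram 2, max_gram 10), the index analyzer produces the tokens aa, aaa, and aaaa for the title, while the search analyzer produces the single token aaaa, so the match query finds the document. This is only an illustration against the analyzers defined above, not an extra step from the original report:

GET my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "aaaa"
}

GET my_index/_analyze
{
  "analyzer": "autocomplete_search",
  "text": "aaaa"
}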

@Stefanqn
Author

Please set "max_gram": 3, or index and search for 11 x 'a' = "aaaaaaaaaaa" (one character longer than "max_gram": 10).
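For clarity, a sketch of that second reproduction path, keeping "max_gram": 10 from the index created above (the document ID 2 is arbitrary):

PUT my_index/_doc/2
{
  "title": "aaaaaaaaaaa"
}

POST my_index/_refresh

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "aaaaaaaaaaa",
        "operator": "and"
      }
    }
  }
}

The longest edge n-gram indexed for this document is its 10-character prefix, but the search analyzer emits the full 11-character token, so the search returns no hits.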

@jrodewig jrodewig assigned jrodewig and unassigned jrodewig Nov 12, 2019
@jrodewig
Contributor

Thanks @Stefanqn. I was able to reproduce. I'll work on getting this fixed in our docs.

Thanks for reporting.

@Stefanqn
Author

thanks!

jrodewig added a commit that referenced this issue Nov 13, 2019
…for index analyzers (#49007)

The `edge_ngram` tokenizer limits tokens to the `max_gram` character
length. Autocomplete searches for terms longer than this limit return
no results.

To prevent this, you can use the `truncate` token filter to truncate
tokens to the `max_gram` character length. However, this could return irrelevant results.

This commit adds some advisory text to make users aware of this limitation and outline the tradeoffs for each approach.

Closes #48956.
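To illustrate the truncate approach the commit describes, the _analyze API can be called with an ad-hoc tokenizer and filter definition (no index required). The length value 10 is an assumption here, matching the max_gram used earlier in this thread; the 11-character term is cut back to 10 characters, which matches the longest edge n-gram the index analyzer can produce:

GET _analyze
{
  "tokenizer": "lowercase",
  "filter": [
    {
      "type": "truncate",
      "length": 10
    }
  ],
  "text": "aaaaaaaaaaa"
}

The commit's caveat still applies: a truncated query can match documents whose full terms differ beyond the tenth character, which may be irrelevant for some use cases.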
@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 16, 2024