Update Edge NGram Tokenizer documentation #48956

Closed
Stefanqn opened this issue Nov 11, 2019 · 6 comments · Fixed by #49007
Assignees
Labels
>docs (General docs changes), :Search Relevance/Analysis (How text is split into tokens), Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments

@Stefanqn

Elasticsearch version (bin/elasticsearch --version): current

Plugins installed: []

JVM version (java -version): -

OS version (uname -a if on a Unix-like system): -

Description of the problem including expected versus actual behavior: the documented 'Example configuration' with a separate search_analyzer returns no hits when the search term is longer than max_gram.

Steps to reproduce:

  1. Read the Elasticsearch documentation for the Edge NGram tokenizer and use its 'Example configuration' with a split search_analyzer and analyzer.
  2. This results in unexpected behavior once a search term exceeds the max_gram length, e.g. with "max_gram": 3, searching for an existing, indexed "aaaa" returns an empty result set.
  3. Please update the documentation, e.g. with the following configuration:
 "settings": {
        "analysis": {
            "filter": {
                "truncate_search": {
                    "type": "truncate",
                    "length": 10
                }
            },
            "analyzer": {
                "autocomplete": {
                    "tokenizer": "autocomplete",
                    "filter": [
                        "lowercase"
                    ]
                },
                "autocomplete_search": {
                    "tokenizer": "lowercase",
                    "filter": "truncate_search"
                }
            },
            "tokenizer": {
                "autocomplete": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": [
                        "letter"
                    ]
                }
            }
        }
    }

This shows the need for a truncate filter on the search analyzer; a complete index-creation request built from these settings is sketched below.
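For reference, a complete index-creation request that combines these settings with a field mapping could look as follows. The index name my_index and the title field are assumptions, chosen to match the reproduction snippets later in this thread; this is only a sketch of how the split analyzers would be wired up, not an official docs example.

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "truncate_search": {
          "type": "truncate",
          "length": 10
        }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [ "lowercase" ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase",
          "filter": [ "truncate_search" ]
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [ "letter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

With this setup, index-time tokens are edge n-grams of 1–10 characters, while the search analyzer lowercases the query and truncates it to 10 characters so it can still match the longest indexed prefix.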

Provide logs (if relevant): -

@alpar-t alpar-t added :Docs :Search Relevance/Analysis How text is split into tokens labels Nov 12, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Analysis)

@elasticmachine
Collaborator

Pinging @elastic/es-docs (:Docs)

@cbuescher cbuescher added >docs General docs changes and removed :Docs labels Nov 12, 2019
@jrodewig jrodewig self-assigned this Nov 12, 2019
@jrodewig
Contributor

Hi @Stefanqn

Thanks for reaching out. Unfortunately, I wasn't able to reproduce your reported problem using the current snippet setup. I've outlined my steps below.

Can you highlight the expected behavior or where your steps differed?

  1. Create an index with split search and index analyzers.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
  2. Index a document with a title of aaaa.
PUT my_index/_doc/1
{
  "title": "aaaa" 
}
  3. Refresh the index.
POST my_index/_refresh
  4. Search for aaaa in the title field.

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "aaaa", 
        "operator": "and"
      }
    }
  }
}

Response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "title" : "aaaa"
        }
      }
    ]
  }
}
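As an aside, the _analyze API shows why this particular search matches. With the index above (min_gram 2, max_gram 10), the index analyzer produces the tokens aa, aaa, and aaaa for the title, while the search analyzer produces the single token aaaa, so the match query finds the document. This is only an illustration against the analyzers defined above, not an extra step from the original report:

GET my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "aaaa"
}

GET my_index/_analyze
{
  "analyzer": "autocomplete_search",
  "text": "aaaa"
}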

@Stefanqn
Author

Please set "max_gram": 3, or index and search for 11 x 'a' = "aaaaaaaaaaa" (one character longer than "max_gram": 10).
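For clarity, a sketch of that second reproduction path, keeping "max_gram": 10 from the index created above (the document ID 2 is arbitrary):

PUT my_index/_doc/2
{
  "title": "aaaaaaaaaaa"
}

POST my_index/_refresh

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "aaaaaaaaaaa",
        "operator": "and"
      }
    }
  }
}

The longest edge n-gram indexed for this document is its 10-character prefix, but the search analyzer emits the full 11-character token, so the search returns no hits.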

@jrodewig jrodewig assigned jrodewig and unassigned jrodewig Nov 12, 2019
@jrodewig
Contributor

Thanks @Stefanqn. I was able to reproduce. I'll work on getting this fixed in our docs.

Thanks for reporting.

@Stefanqn
Author

thanks!

jrodewig added a commit that referenced this issue Nov 13, 2019
…for index analyzers (#49007)

The `edge_ngram` tokenizer limits tokens to the `max_gram` character
length. Autocomplete searches for terms longer than this limit return
no results.

To prevent this, you can use the `truncate` token filter to truncate
tokens to the `max_gram` character length. However, this could return irrelevant results.

This commit adds some advisory text to make users aware of this limitation and outline the tradeoffs for each approach.

Closes #48956.
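To illustrate the truncate approach the commit describes, the _analyze API can be called with an ad-hoc tokenizer and filter definition (no index required). The length value 10 is an assumption here, matching the max_gram used earlier in this thread; the 11-character term is cut back to 10 characters, which matches the longest edge n-gram the index analyzer can produce:

GET _analyze
{
  "tokenizer": "lowercase",
  "filter": [
    {
      "type": "truncate",
      "length": 10
    }
  ],
  "text": "aaaaaaaaaaa"
}

The commit's caveat still applies: a truncated query can match documents whose full terms differ beyond the tenth character, which may be irrelevant for some use cases.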
@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 16, 2024