Prefixed tokens (ending with wildcard) with only filtered characters don't get removed from query #31702

Closed
@rezecib

Description

Elasticsearch version (bin/elasticsearch --version):

Version: 6.3.0, Build: default/tar/424e937/2018-06-11T23:38:03.357887Z, JVM: 10.0.1

Plugins installed:

analysis-icu
ingest-geoip
ingest-user-agent
mapper-size
repository-gcs

JVM version (java -version):

openjdk version "10.0.1" 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10)
OpenJDK 64-Bit Server VM (build 10.0.1+10, mixed mode)

OS version (uname -a if on a Unix-like system): Linux es-master-0 4.4.111+ #1 SMP Sat May 5 12:48:47 PDT 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior: When searching an index, if a prefixed token contains only filtered characters (e.g. @*), Elasticsearch 5.5 previously filtered that token out of the query entirely (the expected behavior). In 6.3.0, this token is preserved, causing the query to match nothing if the same token character filtering is applied at indexing time.
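To make the expected behavior concrete, here is a minimal Python sketch of the 5.5-style query rewrite. The `analyze` function is a crude, hypothetical stand-in for the `icu_analyzer` (it just keeps alphanumeric runs; the real ICU tokenizer is far more sophisticated), and `rewrite_query_tokens` is an illustration of the rule, not Elasticsearch's actual implementation: a wildcard token whose prefix analyzes to nothing should be dropped from the query.

```python
import re

def analyze(text):
    """Crude stand-in for the icu_analyzer: keep only alphanumeric runs
    (hypothetical; the real ICU tokenizer is far more sophisticated)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def rewrite_query_tokens(tokens):
    """Expected 5.5-style behavior: for a token ending in '*', analyze the
    prefix; drop the token entirely if nothing survives analysis."""
    out = []
    for tok in tokens:
        if tok.endswith("*"):
            parts = analyze(tok[:-1])
            if parts:                      # e.g. "@bar*" -> "bar*"
                out.append(parts[-1] + "*")
            # "@*" analyzes to nothing -> token is removed entirely
        else:
            out.extend(analyze(tok))
    return out

print(rewrite_query_tokens("foo @* @bar* baz@*".split()))
# -> ['foo', 'bar*', 'baz*'], matching the 5.5 explanation below
```

Under this rule, the query `foo @* @bar* baz@*` reduces to `foo bar* baz*`; 6.3.0 instead keeps `@*` as a literal term that can never match any indexed token.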

Steps to reproduce:

  1. Create the index:
curl -X PUT localhost:9200/punct-wildcard-test -H 'Content-Type: application/json' -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "icu_analyzer": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer"
                }
            }
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "txt": {
                    "type": "text",
                    "analyzer": "icu_analyzer"
                }
            }
        }
    }
}'
  2. Analyze text (what would happen during indexing):
curl -X POST localhost:9200/punct-wildcard-test/_analyze -H 'Content-Type: application/json' -d '{
    "text": ["foo @bar baz@qux"],
    "tokenizer": "icu_tokenizer"
}'

Result:

{
    "tokens": [
        {
            "token": "foo",
            "start_offset": 0,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "bar",
            "start_offset": 5,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "baz",
            "start_offset": 9,
            "end_offset": 12,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "qux",
            "start_offset": 13,
            "end_offset": 16,
            "type": "<ALPHANUM>",
            "position": 3
        }
    ]
}
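The token stream above (including the start/end offsets) can be approximated with a short Python sketch. The regex below is a hypothetical simplification of the `icu_tokenizer`, which for this plain-ASCII input happens to produce the same four tokens: the `@` characters are discarded at analysis time, which is exactly why no `@`-bearing term ever reaches the index.

```python
import re

def mimic_icu_tokenize(text):
    """Hypothetical approximation of icu_tokenizer for ASCII input:
    emit alphanumeric runs with their character offsets and positions."""
    return [
        {"token": m.group(), "start_offset": m.start(),
         "end_offset": m.end(), "position": i}
        for i, m in enumerate(re.finditer(r"[A-Za-z0-9]+", text))
    ]

for t in mimic_icu_tokenize("foo @bar baz@qux"):
    print(t)
# tokens: foo (0-3), bar (5-8), baz (9-12), qux (13-16)
```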
  3. Validate/explain a problem query:
curl -X POST "localhost:9200/punct-wildcard-test/_validate/query?explain" -H 'Content-Type: application/json' -d '{
    "query": {
        "query_string": {
            "query": "foo @* @bar* baz@*",
            "analyzer": "icu_analyzer",
            "default_field": "txt",
            "analyze_wildcard": true
        }
    }
}'

Elasticsearch 5.5 Response:

{
    "valid": true,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "explanations": [
        {
            "index": "punct-wildcard-test",
            "valid": true,
            "explanation": "txt:foo txt:bar* txt:baz*"
        }
    ]
}

Elasticsearch 6.3.0 Response:

{
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "valid": true,
    "explanations": [
        {
            "index": "punct-wildcard-test",
            "valid": true,
            "explanation": "txt:foo txt:@* txt:bar* txt:baz*"
        }
    ]
}
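Diffing the two `explanation` strings isolates the regression to a single term:

```python
# Explanation strings taken verbatim from the two responses above.
expected = "txt:foo txt:bar* txt:baz*"          # Elasticsearch 5.5
actual = "txt:foo txt:@* txt:bar* txt:baz*"     # Elasticsearch 6.3.0

extra = set(actual.split()) - set(expected.split())
print(extra)
# -> {'txt:@*'}: the leftover wildcard term; since "@" is stripped at
# index time, this term can never match, so the whole query returns nothing
# (query_string combines terms with the default operator, here OR, but any
# must-match use of this term fails).
```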
