Match query does not match tokens from path_hierarchy tokenizer #67225

Closed
rgov opened this issue Jan 10, 2021 · 3 comments

@rgov

rgov commented Jan 10, 2021

This was posted to the discussion forum for one month without finding any resolution, so I am assuming it is a bug and posting here.

Elasticsearch version: 7.5.1

Plugins installed: []

JVM version (java -version): (whatever is in the Docker container)

OS version (uname -a if on a Unix-like system): host is 4.15.0-96-generic #97-Ubuntu SMP Wed Apr 1 03:25:46 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

I have a field, source.file, containing a file path. It is tokenized using a path_hierarchy tokenizer that operates in "reverse" mode so that I can query by the file's base name (the last path component).

For instance, if I have a document with source.file set to /foo/bar/some_boring_filename, I should be able to match the field against just some_boring_filename.
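
One way to check which tokens the tokenizer emits, without indexing anything, is the _analyze API, which accepts an inline tokenizer definition. The following is a minimal sketch using the tokenizer settings from this issue. ES_URL is a hypothetical environment variable introduced here for illustration, and the request is only sent when it is set.

```shell
# Request body for the _analyze API: an inline path_hierarchy tokenizer
# with the same settings as in this issue (delimiter "/", reverse mode).
analyze_body='{
  "tokenizer": {
    "type": "path_hierarchy",
    "delimiter": "/",
    "reverse": true
  },
  "text": "/foo/bar/some_boring_filename"
}'

# Print the body so it can be inspected or pasted into Kibana Dev Tools.
printf '%s\n' "$analyze_body"

# ES_URL is a hypothetical variable, e.g. ES_URL=http://localhost:9200.
# The request is sent only when it is set.
if [ -n "${ES_URL:-}" ]; then
  curl -H 'Content-Type: application/json' -XPOST "$ES_URL/_analyze?pretty" -d "$analyze_body"
fi
```

In reverse mode the tokenizer should emit one token per suffix of the path, so the base name some_boring_filename is expected to appear as a token of its own.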

If I ask Elasticsearch what the terms for this document are,

Request to get _termvectors
curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/my_index/_doc/gDEhSHYBoWN8sy6cFN7j/_termvectors' -d '{
  "fields" : ["source.file"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}'

I find indeed that one of the tokens is some_boring_filename:

        "some_boring_filename" : {
          "doc_freq" : 1,
          "ttf" : 1,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 66,
              "end_offset" : 92
            }
          ]
        },

Yet when I query it,

curl -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/my_index/_search?pretty' -d '{
  "query": {
    "match": {
      "source.file": "some_boring_filename"
    }
  }
}'

there are no hits.

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Steps to reproduce:

Create the index according to:

curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/foo?pretty' -d '{
  "mappings": {
    "properties": {
      "source.file": {
        "properties": {
          "name": {
            "type": "text"
          },
          "path": {
            "analyzer": "path_tree_rev",
            "type": "text"
          }
        },
        "type": "object"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "path_tree_rev": {
          "tokenizer": "path_tree_rev_tokenizer",
          "type": "custom"
        }
      },
      "tokenizer": {
        "path_tree_rev_tokenizer": {
          "delimiter": "/",
          "reverse": true,
          "type": "path_hierarchy"
        }
      }
    }
  }
}'

I use the bulk API to insert lots of documents. I give ES plenty of time to index them; in fact, this problem manifests with documents that were inserted months ago.

Run the query above and try to match a file based on its base name.

rgov added the >bug and needs:triage labels Jan 10, 2021
@cbuescher
Member

This was posted to the discussion forum for one month without finding any resolution, so I am assuming it is a bug and posting here.

Sorry this didn't get resolved in the forums; however, I believe it is a general usage question that belongs there. I quickly tried this on 7.10 with your examples, indexing a document with "source.file.path" : "/foo/bar/some_boring_filename" and retrieving it with

"match": {
      "source.file.path": "some_boring_filename"
    }

since that is the field the "path_tree_rev" analyzer is applied to in your mapping. That retrieved the document for me. I will close this issue, but please let me know if you want to discuss this further in the forums.

cbuescher removed the >bug and needs:triage labels Jan 11, 2021
@rgov
Author

rgov commented Jan 11, 2021

I'm sorry, I bungled my reduced test case, since in my real application there are many other fields.

Here is the actual index creation request; you can see that the field is indeed source.file:

curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/holo?pretty' -d '{
  "mappings": {
    "properties": {
      ...
      "source": {
        "properties": {
          "file": {
            "analyzer": "path_tree_rev",
            "type": "text"
          },
          "kind": {
            "type": "keyword"
          }
        },
        "type": "nested"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "path_tree_rev": {
          "tokenizer": "path_tree_rev_tokenizer",
          "type": "custom"
        }
      },
      "tokenizer": {
        "path_tree_rev_tokenizer": {
          "delimiter": "/",
          "reverse": true,
          "type": "path_hierarchy"
        }
      }
    }
  }
}'

Here's inserting an entry:

curl -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/_bulk?pretty' -d '{"index":{"_index":"holo"}}
{"source":{"kind":"holo","file":"<redacted>/20180813/20180813_0000/H.20180812.2343008.757.tif"},
...

You can see it in Kibana:

[Image: screenshot of the document in Kibana]

The termvectors API for this same document shows the path basename under terms, suggesting the tokenizer is working as expected:

"H.20180812.2343008.757.tif" : {
  "doc_freq" : 1,
  "ttf" : 1,
  "term_freq" : 1,
  "tokens" : [
    {
      "position" : 0,
      "start_offset" : 66,
      "end_offset" : 92
    }
  ]
},

Yet a search on this field for this term fails:

curl -H 'Content-Type: application/json' -XPOST 'http://staging.otz-db.whoi.edu:9200/holo/_search?pretty' -d '{
  "query": {
    "match": {
      "source.file": "H.20180812.2343008.757.tif"
    }
  }
}'

Kibana also shows no results for the KQL query source.file : H.20180812.2343008.757.tif. In fact, a match query for the entire source.file value, exactly as it appears in the document, also finds nothing.

I hope this does a better job demonstrating that this could be a bug, rather than user error, though this is my first time working with tokenizers.

@cbuescher
Member

"type": "nested"

"source" is a nested object. You cannot directly query fields inside a nested object; you need to wrap the query in a "nested" query. Again, the forum would be the right place to discuss this.
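
The suggestion above can be sketched as follows: the match clause is wrapped in a nested query whose path names the nested object. The index name holo and the field source.file come from this issue. ES_URL is a hypothetical environment variable introduced here for illustration, and the request is only sent when it is set.

```shell
# Search body: the match on source.file is wrapped in a nested query,
# because "source" is mapped with "type": "nested".
nested_body='{
  "query": {
    "nested": {
      "path": "source",
      "query": {
        "match": {
          "source.file": "H.20180812.2343008.757.tif"
        }
      }
    }
  }
}'

# Print the body so it can be inspected or pasted into Kibana Dev Tools.
printf '%s\n' "$nested_body"

# ES_URL is a hypothetical variable, e.g. ES_URL=http://localhost:9200.
# The request is sent only when it is set.
if [ -n "${ES_URL:-}" ]; then
  curl -H 'Content-Type: application/json' -XPOST "$ES_URL/holo/_search?pretty" -d "$nested_body"
fi
```

Field names inside the nested query stay fully qualified, so the match clause still refers to source.file rather than file.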
