Match query does not match tokens from path_hierarchy tokenizer #67225

rgov · 2021-01-10T23:29:05Z

This was posted to the discussion forum for one month without finding any resolution, so I am assuming it is a bug and posting here.

Elasticsearch version: 7.5.1

Plugins installed: []

JVM version (java -version): (whatever is in the Docker container)

OS version (uname -a if on a Unix-like system): host is 4.15.0-96-generic #97-Ubuntu SMP Wed Apr 1 03:25:46 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

I have a field, source.file, containing a file path. It is tokenized with a path_hierarchy tokenizer operating in "reverse" mode so that I can query by the file's base name (the last path component).

For instance, if I have a document with source.file set to /foo/bar/some_boring_filename, I should be able to match the field against just some_boring_filename.
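One way to see which tokens the analyzer emits for a given path is the _analyze API. The sketch below assumes a node at localhost:9200 and an index that defines the path_tree_rev analyzer (such as the one created in the reproduction steps). The helpers analyze_body and analyze_path are hypothetical names introduced for illustration, and the body builder does no JSON escaping, so it only suits simple values.

```shell
# Hypothetical helper: print an _analyze request body for a named analyzer.
# Assumption: neither argument contains characters that need JSON escaping.
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}\n' "$1" "$2"
}

# Hypothetical helper: run a path through path_tree_rev on the given index.
# Assumes a node at localhost:9200; defined here but not invoked.
analyze_path() {
  curl -s -H 'Content-Type: application/json' \
    -XPOST "http://localhost:9200/$1/_analyze?pretty" \
    -d "$(analyze_body path_tree_rev "$2")"
}
```

Calling analyze_path foo /foo/bar/some_boring_filename should list one token per trailing suffix of the path, the shortest being the base name.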

If I ask Elasticsearch what the terms for this document are,

Request to get _termvectors

curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/my_index/_doc/gDEhSHYBoWN8sy6cFN7j/_termvectors' -d '{
  "fields" : ["source.file"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}'

I find indeed that one of the tokens is some_boring_filename:

        "some_boring_filename" : {
          "doc_freq" : 1,
          "ttf" : 1,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 66,
              "end_offset" : 92
            }
          ]
        },

Yet when I query it,

curl -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/my_index/_search?pretty' -d '{
  "query": {
    "match": {
      "source.file": "some_boring_filename"
    }
  }
}'

there are no hits.

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Steps to reproduce:

Create the index according to:

curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/foo?pretty' -d '{
  "mappings": {
    "properties": {
      "source.file": {
        "properties": {
          "name": {
            "type": "text"
          },
          "path": {
            "analyzer": "path_tree_rev",
            "type": "text"
          }
        },
        "type": "object"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "path_tree_rev": {
          "tokenizer": "path_tree_rev_tokenizer",
          "type": "custom"
        }
      },
      "tokenizer": {
        "path_tree_rev_tokenizer": {
          "delimiter": "/",
          "reverse": true,
          "type": "path_hierarchy"
        }
      }
    }
  }
}'

I use the bulk API to insert lots of documents. I give ES plenty of time to index them; in fact, the problem shows up with documents that were inserted months ago.

Run the query above and try to match a file based on its base name.

cbuescher · 2021-01-11T11:42:16Z

> This was posted to the discussion forum for one month without finding any resolution, so I am assuming it is a bug and posting here.

Sorry this didn't get resolved in the forums; however, I believe it is a general usage question that belongs there. I quickly tried this on 7.10 with your examples, indexing a document with "source.file.path" : "/foo/bar/some_boring_filename" and retrieving it with

"match": {
      "source.file.path": "some_boring_filename"
    }

since that is the field the "path_tree_rev" analyzer is applied to in your mapping. That retrieved the document for me. I will close this issue, but please let me know if you want to discuss this further in the forums.
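The check described above can be sketched roughly as follows. It assumes the foo index from the reproduction steps already exists on a node at localhost:9200. The document ID 1, the refresh=true parameter, and the helper names doc_body, match_body and reproduce are illustrative choices, not taken from the thread, and the body builders do no JSON escaping.

```shell
# Hypothetical helper: document body that sets source.file.path to the given value.
doc_body() {
  printf '{"source.file.path":"%s"}\n' "$1"
}

# Hypothetical helper: match query body for a field and a search text.
match_body() {
  printf '{"query":{"match":{"%s":"%s"}}}\n' "$1" "$2"
}

# Index one document (refreshing so it is searchable right away), then
# search for it by base name. Assumes localhost:9200; not invoked here.
reproduce() {
  curl -s -H 'Content-Type: application/json' \
    -XPUT 'http://localhost:9200/foo/_doc/1?refresh=true' \
    -d "$(doc_body /foo/bar/some_boring_filename)"
  curl -s -H 'Content-Type: application/json' \
    -XPOST 'http://localhost:9200/foo/_search?pretty' \
    -d "$(match_body source.file.path some_boring_filename)"
}
```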

rgov · 2021-01-11T15:31:01Z

I'm sorry, I bungled my reduced test case, since in my real application there are many other fields.

Here is the actual index creation request; you can see that the field is indeed source.file:

curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/holo?pretty' -d '{
  "mappings": {
    "properties": {
      ...
      "source": {
        "properties": {
          "file": {
            "analyzer": "path_tree_rev",
            "type": "text"
          },
          "kind": {
            "type": "keyword"
          }
        },
        "type": "nested"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "path_tree_rev": {
          "tokenizer": "path_tree_rev_tokenizer",
          "type": "custom"
        }
      },
      "tokenizer": {
        "path_tree_rev_tokenizer": {
          "delimiter": "/",
          "reverse": true,
          "type": "path_hierarchy"
        }
      }
    }
  }
}'

Here's inserting an entry:

curl -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/_bulk?pretty' -d '{"index":{"_index":"holo"}}
{"source":{"kind":"holo","file":"<redacted>/20180813/20180813_0000/H.20180812.2343008.757.tif"},
...

You can see it in Kibana:

The termvectors API for this same document shows the path basename under terms, suggesting the tokenizer is working as expected:

"H.20180812.2343008.757.tif" : {
  "doc_freq" : 1,
  "ttf" : 1,
  "term_freq" : 1,
  "tokens" : [
    {
      "position" : 0,
      "start_offset" : 66,
      "end_offset" : 92
    }
  ]
},

Yet a search on this field for this term fails:

curl -H 'Content-Type: application/json' -XPOST 'http://staging.otz-db.whoi.edu:9200/holo/_search?pretty' -d '{
  "query": {
    "match": {
      "source.file": "H.20180812.2343008.757.tif"
    }
  }
}'

Kibana also shows no results for the KQL query source.file : H.20180812.2343008.757.tif. In fact, a match query for the full source.file value, exactly as it appears in the document, also finds nothing.

I hope this does a better job demonstrating that this could be a bug, rather than user error, though this is my first time working with tokenizers.

cbuescher · 2021-01-11T15:49:36Z

> "type": "nested"

"source" is a nested object. You cannot query fields inside a nested object directly; you need to wrap the query in a "nested" query. Again, the forum would be the right place to discuss this.
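A minimal sketch of such a wrapped query, assuming the holo mapping shown earlier (nested path source, field source.file) and a node at localhost:9200. The helpers nested_match_body and search_nested are hypothetical names, and the body builder does no JSON escaping.

```shell
# Hypothetical helper: wrap a match query in a nested query.
# Arguments: nested path, full field name, search text.
nested_match_body() {
  printf '{"query":{"nested":{"path":"%s","query":{"match":{"%s":"%s"}}}}}\n' "$1" "$2" "$3"
}

# Hypothetical helper: search an index for a file base name under source.file.
# Assumes a node at localhost:9200; defined here but not invoked.
search_nested() {
  curl -s -H 'Content-Type: application/json' \
    -XPOST "http://localhost:9200/$1/_search?pretty" \
    -d "$(nested_match_body source source.file "$2")"
}
```

For example, search_nested holo H.20180812.2343008.757.tif would send the base-name match inside a nested query on the source path.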

rgov added the >bug and needs:triage labels Jan 10, 2021

cbuescher closed this as completed Jan 11, 2021

cbuescher removed the >bug and needs:triage labels Jan 11, 2021