
Commit 296fbba

backport of elastic#39630

1 parent b09ba08

File tree

3 files changed: +199 -0 lines changed


docs/reference/analysis/tokenizers.asciidoc

Lines changed: 4 additions & 0 deletions

@@ -155,3 +155,7 @@ include::tokenizers/simplepattern-tokenizer.asciidoc[]
 include::tokenizers/simplepatternsplit-tokenizer.asciidoc[]
 
 include::tokenizers/pathhierarchy-tokenizer.asciidoc[]
+
+include::tokenizers/pathhierarchy-tokenizer-examples.asciidoc[]
+
+
Lines changed: 191 additions & 0 deletions

@@ -0,0 +1,191 @@
[[analysis-pathhierarchy-tokenizer-examples]]
=== Path Hierarchy Tokenizer Examples

A common use case for the `path_hierarchy` tokenizer is filtering results by
file path. If a file path is indexed along with the data, analyzing the path
with the `path_hierarchy` tokenizer allows the results to be filtered by
different parts of the file path string.

This example configures an index with two custom analyzers and applies them
to multifields of the `file_path` text field that will store file paths. One
of the two analyzers uses reverse tokenization. Some sample documents are then
indexed, representing file paths for photos inside the photo folders of two
different users.

[source,js]
--------------------------------------------------
PUT file-path-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_path_tree": {
          "tokenizer": "custom_hierarchy"
        },
        "custom_path_tree_reversed": {
          "tokenizer": "custom_hierarchy_reversed"
        }
      },
      "tokenizer": {
        "custom_hierarchy": {
          "type": "path_hierarchy",
          "delimiter": "/"
        },
        "custom_hierarchy_reversed": {
          "type": "path_hierarchy",
          "delimiter": "/",
          "reverse": "true"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_path": {
        "type": "text",
        "fields": {
          "tree": {
            "type": "text",
            "analyzer": "custom_path_tree"
          },
          "tree_reversed": {
            "type": "text",
            "analyzer": "custom_path_tree_reversed"
          }
        }
      }
    }
  }
}

POST file-path-test/_doc/1
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

POST file-path-test/_doc/2
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo2.jpg"
}

POST file-path-test/_doc/3
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo3.jpg"
}

POST file-path-test/_doc/4
{
  "file_path": "/User/alice/photos/2017/05/15/my_photo1.jpg"
}

POST file-path-test/_doc/5
{
  "file_path": "/User/bob/photos/2017/05/16/my_photo1.jpg"
}
--------------------------------------------------
// CONSOLE
// TESTSETUP

A search for a particular file path string against the text field matches all
the example documents. Bob's document ranks highest because `bob` is also one
of the terms created by the standard analyzer, which boosts its relevance.

[source,js]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "match": {
      "file_path": "/User/bob/photos/2017/05"
    }
  }
}
--------------------------------------------------
// CONSOLE

It's simple to match or filter documents whose file paths fall within a
particular directory using the `file_path.tree` field.

[source,js]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "term": {
      "file_path.tree": "/User/alice/photos/2017/05/16"
    }
  }
}
--------------------------------------------------
// CONSOLE

With the `reverse` parameter, it's also possible to match from the other end
of the file path, such as an individual file name or a deeply nested
subdirectory. The following example searches for all files named
`my_photo1.jpg` within any directory via the `file_path.tree_reversed` field,
which is configured in the mapping to use the reverse tokenizer.

[source,js]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "term": {
      "file_path.tree_reversed": {
        "value": "my_photo1.jpg"
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

Comparing the tokens generated by the forward and reverse analyzers for the
same file path value is instructive.

[source,js]
--------------------------------------------------
POST file-path-test/_analyze
{
  "analyzer": "custom_path_tree",
  "text": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

POST file-path-test/_analyze
{
  "analyzer": "custom_path_tree_reversed",
  "text": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}
--------------------------------------------------
// CONSOLE
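The hierarchy logic behind those `_analyze` calls can be sketched in a few
lines of Python. This is an illustration only, not the Lucene implementation
(the real tokenizer also supports `replacement`, `buffer_size`, and `skip`
settings), so verify the exact token sets with `_analyze` against your
cluster:

```python
def path_hierarchy_tokens(path, delimiter="/", reverse=False):
    """Sketch of path_hierarchy tokenization: one token per level.

    Forward emits the prefixes ending before each delimiter; reverse
    emits the suffixes starting after each delimiter.
    """
    if not reverse:
        tokens = []
        idx = path.find(delimiter, 1)  # skip a leading delimiter
        while idx != -1:
            tokens.append(path[:idx])
            idx = path.find(delimiter, idx + 1)
        tokens.append(path)  # the full path is always a token
        return tokens
    tokens = [path]
    idx = path.find(delimiter)
    while idx != -1:
        suffix = path[idx + 1:]
        if suffix:
            tokens.append(suffix)
        idx = path.find(delimiter, idx + 1)
    return tokens

path = "/User/alice/photos/2017/05/16/my_photo1.jpg"
print(path_hierarchy_tokens(path))                # "/User", "/User/alice", ..., full path
print(path_hierarchy_tokens(path, reverse=True))  # full path, ..., "my_photo1.jpg"
```

The last reverse token is the bare file name, which is why the
`file_path.tree_reversed` term query above can match `my_photo1.jpg` exactly.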
Filtering by file path is also useful in combination with other kinds of
searches, as in this example, which looks for any file paths containing `16`
that must also be inside Alice's photo directory.

[source,js]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "bool" : {
      "must" : {
        "match" : { "file_path" : "16" }
      },
      "filter": {
        "term" : { "file_path.tree" : "/User/alice" }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

docs/reference/analysis/tokenizers/pathhierarchy-tokenizer.asciidoc

Lines changed: 4 additions & 0 deletions

@@ -170,3 +170,7 @@ If we were to set `reverse` to `true`, it would produce the following:
 ---------------------------
 [ one/two/three/, two/three/, three/ ]
 ---------------------------
+
+[float]
+=== Detailed Examples
+See <<analysis-pathhierarchy-tokenizer-examples, detailed examples here>>.
