<titleabbrev>MinHash</titleabbrev>
++++

- The `min_hash` token filter hashes each token of the token stream and divides
- the resulting hashes into buckets, keeping the lowest-valued hashes per
- bucket. It then returns these hashes as tokens.
+ Uses the https://en.wikipedia.org/wiki/MinHash[MinHash] technique to produce a
+ signature for a token stream. You can use MinHash signatures to estimate the
+ similarity of documents. See <<analysis-minhash-tokenfilter-similarity-search>>.

- The following are settings that can be set for a `min_hash` token filter.
+ The `min_hash` filter performs the following operations on a token stream in
+ order:

- [cols="<,<", options="header",]
- |=======================================================================
- |Setting |Description
- |`hash_count` |The number of hashes to hash the token stream with. Defaults to `1`.
+ . Hashes each token in the stream.
+ . Assigns the hashes to buckets, keeping only the smallest hashes of each
+ bucket.
+ . Outputs the smallest hash from each bucket as a token stream.

- |`bucket_count` |The number of buckets to divide the minhashes into. Defaults to `512`.
+ This filter uses Lucene's
+ {lucene-analysis-docs}/minhash/MinHashFilter.html[MinHashFilter].
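The three operations above can be sketched in plain Python. This is an illustrative sketch only: the hash function and bucket assignment below are assumptions, not Lucene's actual `MinHashFilter` implementation.

```python
import hashlib

def min_hash_tokens(tokens, bucket_count=512, with_rotation=True):
    """Sketch of the min_hash steps: hash each token, keep the smallest
    hash per bucket, then optionally fill empty buckets by rotation.
    (Illustrative only; not Lucene's MinHashFilter.)"""
    buckets = [None] * bucket_count
    for token in tokens:
        # Step 1: hash the token to a 128-bit value.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        # Step 2: assign the hash to a bucket, keeping only the smallest.
        i = h % bucket_count
        if buckets[i] is None or h < buckets[i]:
            buckets[i] = h
    if with_rotation and any(h is not None for h in buckets):
        # Fill each empty bucket with the value of the first non-empty
        # bucket to its circular right (the `with_rotation` behavior).
        filled = list(buckets)
        for i in range(bucket_count):
            j = i
            while buckets[j] is None:
                j = (j + 1) % bucket_count
            filled[i] = buckets[j]
        buckets = filled
    # Step 3: the surviving smallest hashes form the output token stream.
    return buckets

sig = min_hash_tokens(["quick brown fox jumps over",
                       "brown fox jumps over the"], bucket_count=8)
```

With rotation enabled, every position of the result carries a hash value, so documents with few tokens still produce full-length signatures.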

- |`hash_set_size` |The number of minhashes to keep per bucket. Defaults to `1`.
+ [[analysis-minhash-tokenfilter-configure-parms]]
+ ==== Configurable parameters

- |`with_rotation` |Whether or not to fill empty buckets with the value of the first non-empty
- bucket to its circular right. Only takes effect if hash_set_size is equal to one.
- Defaults to `true` if bucket_count is greater than one, else `false`.
- |=======================================================================
+ `bucket_count`::
+ (Optional, integer)
+ Number of buckets to which hashes are assigned. Defaults to `512`.

- Some points to consider while setting up a `min_hash` filter:
+ `hash_count`::
+ (Optional, integer)
+ Number of ways to hash each token in the stream. Defaults to `1`.
+
+ `hash_set_size`::
+ (Optional, integer)
+ Number of hashes to keep from each bucket. Defaults to `1`.
+ +
+ Hashes are retained by ascending size, starting with the bucket's smallest hash
+ first.
+
+ `with_rotation`::
+ (Optional, Boolean)
+ If `true`, the filter fills empty buckets with the value of the first non-empty
+ bucket to its circular right if `hash_set_size` is `1`. If the
+ `bucket_count` argument is greater than `1`, this parameter defaults to `true`.
+ Otherwise, this parameter defaults to `false`.
+
+ [[analysis-minhash-tokenfilter-configuration-tips]]
+ ==== Tips for configuring the `min_hash` filter

* `min_hash` filter input tokens should typically be k-words shingles produced
- from <<analysis-shingle-tokenfilter,shingle token filter>>.  You should
+ from <<analysis-shingle-tokenfilter,shingle token filter>>. You should
choose `k` large enough so that the probability of any given shingle
- occurring in a document is low.  At the same time, as
+ occurring in a document is low. At the same time, as
internally each shingle is hashed into a 128-bit hash, you should choose
`k` small enough so that all possible
different k-words shingles can be hashed to a 128-bit hash with
minimal collision.

- * choosing the right settings for `hash_count`, `bucket_count` and
- `hash_set_size` needs some experimentation.
- ** to improve the precision, you should increase `bucket_count` or
- `hash_set_size`. Higher values of `bucket_count` or `hash_set_size`
- will provide a higher guarantee that different tokens are
- indexed to different buckets.
- ** to improve the recall,
- you should increase `hash_count` parameter. For example,
- setting `hash_count=2`, will make each token to be hashed in
- two different ways, thus increasing the number of potential
- candidates for search.
-
- * the default settings makes the `min_hash` filter to produce for
- each document 512 `min_hash` tokens, each is of size 16 bytes.
- Thus, each document's size will be increased by around 8Kb.
-
- * `min_hash` filter is used to hash for Jaccard similarity. This means
+ * We recommend you test different arguments for the `hash_count`, `bucket_count`, and
+ `hash_set_size` parameters:
+
+ ** To improve precision, increase the `bucket_count` or
+ `hash_set_size` arguments. Higher `bucket_count` and `hash_set_size` values
+ increase the likelihood that different tokens are indexed to different
+ buckets.
+
+ ** To improve recall, increase the value of the `hash_count` argument. For
+ example, setting `hash_count` to `2` hashes each token in two different ways,
+ increasing the number of potential candidates for search.
+
+ * By default, the `min_hash` filter produces 512 tokens for each document. Each
+ token is 16 bytes in size. This means each document's size will be increased by
+ around 8kB.
+
+ * The `min_hash` filter is used for Jaccard similarity. This means
that it doesn't matter how many times a document contains a certain token,
only whether it contains it or not.
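Because each signature position keeps the smallest hash of one bucket, the fraction of positions on which two signatures agree estimates the Jaccard similarity of the underlying shingle sets. A minimal sketch of that estimate follows; the function names and hash choice are illustrative assumptions, not Elasticsearch's implementation.

```python
import hashlib

def signature(shingles, bucket_count=512):
    # Keep the smallest 128-bit hash per bucket; None marks an empty bucket.
    buckets = [None] * bucket_count
    for s in shingles:
        h = int.from_bytes(hashlib.md5(s.encode("utf-8")).digest(), "big")
        i = h % bucket_count
        if buckets[i] is None or h < buckets[i]:
            buckets[i] = h
    return buckets

def estimated_jaccard(sig_a, sig_b):
    # Agreeing non-empty positions, divided by positions used by either set.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a is not None and a == b)
    used = sum(1 for a, b in zip(sig_a, sig_b) if a is not None or b is not None)
    return matches / used if used else 0.0

doc_a = {"the quick brown fox jumps", "quick brown fox jumps over"}
doc_b = {"the quick brown fox jumps", "a lazy dog sleeps right here"}
estimate = estimated_jaccard(signature(doc_a), signature(doc_b))
```

Note that the shingles are held in sets, mirroring the Jaccard property above: how many times a document repeats a shingle does not affect the signature.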

- ==== Theory
- MinHash token filter allows you to hash documents for similarity search.
+ [[analysis-minhash-tokenfilter-similarity-search]]
+ ==== Using the `min_hash` token filter for similarity search
+
+ The `min_hash` token filter allows you to hash documents for similarity search.
Similarity search, or nearest neighbor search, is a complex problem.
A naive solution requires an exhaustive pairwise comparison between a query
document and every document in an index. This is a prohibitive operation

@@ -88,29 +111,44 @@ document's tokens and chooses the minimum hash code among them.
The minimum hash codes from all hash functions are combined
to form a signature for the document.
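The classic multi-hash scheme described above can be sketched as follows. Deriving each hash function from a seed prefix is an assumption made for illustration; the point is that the signature is the per-function minimum over the document's tokens.

```python
import hashlib

def minhash_signature(tokens, num_hashes=4):
    # Simulate several hash functions by seeding one hash with a prefix;
    # the signature is the minimum hash code under each function.
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest(), "big")
            for t in tokens)
        for seed in range(num_hashes)
    ]

sig = minhash_signature(["apple", "banana", "cherry"])
```

Because each minimum depends only on the set of tokens, documents with heavily overlapping token sets are likely to share many signature components, which is what makes the signatures comparable.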

+ [[analysis-minhash-tokenfilter-customize]]
+ ==== Customize and add to an analyzer
+
+ To customize the `min_hash` filter, duplicate it to create the basis for a new
+ custom token filter. You can modify the filter using its configurable
+ parameters.

- ==== Example of setting MinHash Token Filter in Elasticsearch
- Here is an example of setting up a `min_hash` filter:
+ For example, the following <<indices-create-index,create index API>> request
+ uses two custom token filters to configure a new
+ <<analysis-custom-analyzer,custom analyzer>>:

- [source,js]
- --------------------------------------------------
- POST /index1
+ * `my_shingle_filter`, a custom <<analysis-shingle-tokenfilter,`shingle`
+ filter>>. `my_shingle_filter` only outputs five-word shingles.
+ * `my_minhash_filter`, a custom `min_hash` filter. `my_minhash_filter` hashes
+ each five-word shingle once. It then assigns the hashes into 512 buckets,
+ keeping only the smallest hash from each bucket.
+
+ The request also assigns the custom analyzer to the `fingerprint` field mapping.
+
+ [source,console]
+ ----
+ PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
-       "my_shingle_filter": {    <1>
+       "my_shingle_filter": {        <1>
          "type": "shingle",
          "min_shingle_size": 5,
          "max_shingle_size": 5,
          "output_unigrams": false
        },
        "my_minhash_filter": {
          "type": "min_hash",
-         "hash_count": 1,    <2>
-         "bucket_count": 512,    <3>
-         "hash_set_size": 1,    <4>
-         "with_rotation": true    <5>
+         "hash_count": 1,          <2>
+         "bucket_count": 512,      <3>
+         "hash_set_size": 1,       <4>
+         "with_rotation": true     <5>
        }
      },
      "analyzer": {
@@ -133,10 +171,10 @@ POST /index1
    }
  }
}
- --------------------------------------------------
- // NOTCONSOLE
- <1> setting a shingle filter with 5-word shingles
- <2> setting min_hash filter to hash with 1 hash
- <3> setting min_hash filter to hash tokens into 512 buckets
- <4> setting min_hash filter to keep only a single smallest hash in each bucket
- <5> setting min_hash filter to fill empty buckets with values from neighboring buckets
+ ----
+
+ <1> Configures a custom shingle filter to output only five-word shingles.
+ <2> Each five-word shingle in the stream is hashed once.
+ <3> The hashes are assigned to 512 buckets.
+ <4> Only the smallest hash in each bucket is retained.
+ <5> The filter fills empty buckets with the values of neighboring buckets.
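For reference, the five-word shingles that `my_shingle_filter` feeds into `my_minhash_filter` look like the output of this plain-Python illustration of the configured shingle settings (not the actual Lucene `shingle` filter):

```python
def five_word_shingles(text):
    # min_shingle_size = max_shingle_size = 5 with output_unigrams = false
    # means every output token is a run of exactly five consecutive words.
    words = text.split()
    return [" ".join(words[i:i + 5]) for i in range(len(words) - 4)]

shingles = five_word_shingles("the quick brown fox jumps over the lazy dog")
```

A nine-word input therefore yields five overlapping shingles, each of which `my_minhash_filter` then hashes once and assigns to one of 512 buckets.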