Commit 10d033b

[DOCS] Reformat min_hash token filter docs (#57181) (#57248)

Changes:

* Rewrites description and adds a Lucene link
* Reformats the configurable parameters as a definition list
* Changes the `Theory` heading to `Using the min_hash token filter for similarity search`
* Adds some additional detail to the analyzer example

1 parent edc0bd9 · commit 10d033b

File tree

1 file changed (+91, -53 lines)


docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc

@@ -4,59 +4,82 @@
 <titleabbrev>MinHash</titleabbrev>
 ++++
 
-The `min_hash` token filter hashes each token of the token stream and divides
-the resulting hashes into buckets, keeping the lowest-valued hashes per
-bucket. It then returns these hashes as tokens.
+Uses the https://en.wikipedia.org/wiki/MinHash[MinHash] technique to produce a
+signature for a token stream. You can use MinHash signatures to estimate the
+similarity of documents. See <<analysis-minhash-tokenfilter-similarity-search>>.
 
-The following are settings that can be set for a `min_hash` token filter.
+The `min_hash` filter performs the following operations on a token stream in
+order:
 
-[cols="<,<", options="header",]
-|=======================================================================
-|Setting |Description
-|`hash_count` |The number of hashes to hash the token stream with. Defaults to `1`.
+. Hashes each token in the stream.
+. Assigns the hashes to buckets, keeping only the smallest hashes of each
+bucket.
+. Outputs the smallest hash from each bucket as a token stream.
 
-|`bucket_count` |The number of buckets to divide the minhashes into. Defaults to `512`.
+This filter uses Lucene's
+{lucene-analysis-docs}/minhash/MinHashFilter.html[MinHashFilter].
 
-|`hash_set_size` |The number of minhashes to keep per bucket. Defaults to `1`.
+[[analysis-minhash-tokenfilter-configure-parms]]
+==== Configurable parameters
 
-|`with_rotation` |Whether or not to fill empty buckets with the value of the first non-empty
-bucket to its circular right. Only takes effect if hash_set_size is equal to one.
-Defaults to `true` if bucket_count is greater than one, else `false`.
-|=======================================================================
+`bucket_count`::
+(Optional, integer)
+Number of buckets to which hashes are assigned. Defaults to `512`.
 
-Some points to consider while setting up a `min_hash` filter:
+`hash_count`::
+(Optional, integer)
+Number of ways to hash each token in the stream. Defaults to `1`.
+
+`hash_set_size`::
+(Optional, integer)
+Number of hashes to keep from each bucket. Defaults to `1`.
++
+Hashes are retained by ascending size, starting with the bucket's smallest hash
+first.
+
+`with_rotation`::
+(Optional, boolean)
+If `true`, the filter fills empty buckets with the value of the first non-empty
+bucket to its circular right if the `hash_set_size` is `1`. If the
+`bucket_count` argument is greater than `1`, this parameter defaults to `true`.
+Otherwise, this parameter defaults to `false`.
+
+[[analysis-minhash-tokenfilter-configuration-tips]]
+==== Tips for configuring the `min_hash` filter
 
 * `min_hash` filter input tokens should typically be k-words shingles produced
-from <<analysis-shingle-tokenfilter,shingle token filter>>.  You should
+from <<analysis-shingle-tokenfilter,shingle token filter>>. You should
 choose `k` large enough so that the probability of any given shingle
-occurring in a document is low.  At the same time, as
+occurring in a document is low. At the same time, as
 internally each shingle is hashed into to 128-bit hash, you should choose
 `k` small enough so that all possible
 different k-words shingles can be hashed to 128-bit hash with
 minimal collision.
 
-* choosing the right settings for `hash_count`, `bucket_count` and
-`hash_set_size` needs some experimentation.
-** to improve the precision, you should increase `bucket_count` or
-`hash_set_size`. Higher values of `bucket_count` or `hash_set_size`
-will provide a higher guarantee that different tokens are
-indexed to different buckets.
-** to improve the recall,
-you should increase `hash_count` parameter. For example,
-setting `hash_count=2`, will make each token to be hashed in
-two different ways, thus increasing the number of potential
-candidates for search.
-
-* the default settings makes the `min_hash` filter to produce for
-each document 512 `min_hash` tokens, each is of size 16 bytes.
-Thus, each document's size will be increased by around 8Kb.
-
-* `min_hash` filter is used to hash for Jaccard similarity. This means
+* We recommend you test different arguments for the `hash_count`, `bucket_count` and
+`hash_set_size` parameters:
+
+** To improve precision, increase the `bucket_count` or
+`hash_set_size` arguments. Higher `bucket_count` and `hash_set_size` values
+increase the likelihood that different tokens are indexed to different
+buckets.
+
+** To improve the recall, increase the value of the `hash_count` argument. For
+example, setting `hash_count` to `2` hashes each token in two different ways,
+increasing the number of potential candidates for search.
+
+* By default, the `min_hash` filter produces 512 tokens for each document. Each
+token is 16 bytes in size. This means each document's size will be increased by
+around 8Kb.
+
+* The `min_hash` filter is used for Jaccard similarity. This means
 that it doesn't matter how many times a document contains a certain token,
 only that if it contains it or not.
 
-==== Theory
-MinHash token filter allows you to hash documents for similarity search.
+[[analysis-minhash-tokenfilter-similarity-search]]
+==== Using the `min_hash` token filter for similarity search
+
+The `min_hash` token filter allows you to hash documents for similarity search.
 Similarity search, or nearest neighbor search is a complex problem.
 A naive solution requires an exhaustive pairwise comparison between a query
 document and every document in an index. This is a prohibitive operation
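The three-step operation the rewritten description gives (hash each token, assign the hashes to buckets, keep only the smallest hash per bucket) can be sketched in Python. This is an illustrative model only, not Lucene's `MinHashFilter` implementation; the seed-prefixed SHA-1 hash truncated to 64 bits and the function name are assumptions for the sketch:

```python
import hashlib


def min_hash_tokens(tokens, hash_count=1, bucket_count=512,
                    hash_set_size=1, with_rotation=True):
    """Model of the min_hash filter: hash each token hash_count ways,
    bucket the hashes, keep the hash_set_size smallest hashes per
    bucket, and optionally fill empty buckets by rotation."""
    buckets = [[] for _ in range(bucket_count)]
    for seed in range(hash_count):
        for token in tokens:
            digest = hashlib.sha1(f"{seed}:{token}".encode()).digest()
            h = int.from_bytes(digest[:8], "big")
            buckets[h % bucket_count].append(h)
    # Retain hashes by ascending size, smallest first.
    kept = [sorted(bucket)[:hash_set_size] for bucket in buckets]
    if with_rotation and hash_set_size == 1 and any(kept):
        # Fill each empty bucket with the value of the first
        # non-empty bucket to its circular right.
        for i in range(bucket_count):
            if not kept[i]:
                j = (i + 1) % bucket_count
                while not kept[j]:
                    j = (j + 1) % bucket_count
                kept[i] = list(kept[j])
    return kept
```

With the defaults (`hash_count=1`, `bucket_count=512`, `hash_set_size=1`, `with_rotation=true`), rotation leaves every bucket holding exactly one hash, which matches the docs' note that each document yields 512 tokens.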
@@ -88,29 +111,44 @@ document's tokens and chooses the minimum hash code among them.
 The minimum hash codes from all hash functions are combined
 to form a signature for the document.
 
+[[analysis-minhash-tokenfilter-customize]]
+==== Customize and add to an analyzer
+
+To customize the `min_hash` filter, duplicate it to create the basis for a new
+custom token filter. You can modify the filter using its configurable
+parameters.
 
-==== Example of setting MinHash Token Filter in Elasticsearch
-Here is an example of setting up a `min_hash` filter:
+For example, the following <<indices-create-index,create index API>> request
+uses the following custom token filters to configure a new
+<<analysis-custom-analyzer,custom analyzer>>:
 
-[source,js]
---------------------------------------------------
-POST /index1
+* `my_shingle_filter`, a custom <<analysis-shingle-tokenfilter,`shingle`
+filter>>. `my_shingle_filter` only outputs five-word shingles.
+* `my_minhash_filter`, a custom `min_hash` filter. `my_minhash_filter` hashes
+each five-word shingle once. It then assigns the hashes into 512 buckets,
+keeping only the smallest hash from each bucket.
+
+The request also assigns the custom analyzer to the `fingerprint` field mapping.
+
+[source,console]
+----
+PUT /my_index
 {
   "settings": {
     "analysis": {
       "filter": {
-        "my_shingle_filter": { <1>
+        "my_shingle_filter": {      <1>
           "type": "shingle",
           "min_shingle_size": 5,
           "max_shingle_size": 5,
           "output_unigrams": false
         },
         "my_minhash_filter": {
           "type": "min_hash",
-          "hash_count": 1, <2>
-          "bucket_count": 512, <3>
-          "hash_set_size": 1, <4>
-          "with_rotation": true <5>
+          "hash_count": 1,          <2>
+          "bucket_count": 512,      <3>
+          "hash_set_size": 1,       <4>
+          "with_rotation": true     <5>
         }
       },
       "analyzer": {
@@ -133,10 +171,10 @@ POST /index1
     }
   }
 }
---------------------------------------------------
-// NOTCONSOLE
-<1> setting a shingle filter with 5-word shingles
-<2> setting min_hash filter to hash with 1 hash
-<3> setting min_hash filter to hash tokens into 512 buckets
-<4> setting min_hash filter to keep only a single smallest hash in each bucket
-<5> setting min_hash filter to fill empty buckets with values from neighboring buckets
+----
+
+<1> Configures a custom shingle filter to output only five-word shingles.
+<2> Each five-word shingle in the stream is hashed once.
+<3> The hashes are assigned to 512 buckets.
+<4> Only the smallest hash in each bucket is retained.
+<5> The filter fills empty buckets with the values of neighboring buckets.
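Since the docs note that `min_hash` targets Jaccard similarity (set overlap, ignoring how often a token occurs), the way bucket-wise signatures approximate it can be sketched as follows. `signature` and `estimated_jaccard` are hypothetical helpers for illustration, not Elasticsearch or Lucene APIs:

```python
import hashlib


def signature(shingles, bucket_count=512):
    """One-hash-per-bucket MinHash signature: the smallest hash
    assigned to each bucket, or None for an empty bucket."""
    mins = [None] * bucket_count
    for s in shingles:
        h = int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")
        b = h % bucket_count
        if mins[b] is None or h < mins[b]:
            mins[b] = h
    return mins


def estimated_jaccard(sig_a, sig_b):
    """Fraction of occupied buckets where both signatures hold the
    same minimum hash; approximates the Jaccard similarity of the
    underlying shingle sets."""
    matches = sum(1 for a, b in zip(sig_a, sig_b)
                  if a is not None and a == b)
    compared = sum(1 for a, b in zip(sig_a, sig_b)
                   if a is not None or b is not None)
    return matches / compared if compared else 0.0
```

Two documents that share most of their five-word shingles will agree on the minimum hash in most buckets, so their signatures can be compared bucket by bucket instead of comparing the full shingle sets.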
