Commit d855970
[DOCS] Reformat fingerprint token filter docs (#49311)
1 parent 0b7d0b3

1 file changed: +130 additions, -20 deletions

[[analysis-fingerprint-tokenfilter]]
=== Fingerprint token filter
++++
<titleabbrev>Fingerprint</titleabbrev>
++++

Sorts and removes duplicate tokens from a token stream, then concatenates the
stream into a single output token.

For example, this filter changes the `[ the, fox, was, very, very, quick ]`
token stream as follows:

. Sorts the tokens alphabetically to `[ fox, quick, the, very, very, was ]`

. Removes a duplicate instance of the `very` token.

. Concatenates the token stream to a single output token: `[ fox quick the very was ]`
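
You can reproduce these steps with the <<indices-analyze,analyze API>>; the
following request should return the single token `fox quick the very was`:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["fingerprint"],
  "text" : "the fox was very very quick"
}
--------------------------------------------------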

Output tokens produced by this filter are useful for
fingerprinting and clustering a body of text as described in the
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[OpenRefine
project].

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html[FingerprintFilter].

[[analysis-fingerprint-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the `fingerprint`
filter to create a single output token for the text `zebra jumps over resting
resting dog`:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["fingerprint"],
  "text" : "zebra jumps over resting resting dog"
}
--------------------------------------------------

The filter produces the following token:

[source,text]
--------------------------------------------------
[ dog jumps over resting zebra ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "dog jumps over resting zebra",
      "start_offset" : 0,
      "end_offset" : 36,
      "type" : "fingerprint",
      "position" : 0
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-fingerprint-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
`fingerprint` filter to configure a new <<analysis-custom-analyzer,custom
analyzer>>.

[source,console]
--------------------------------------------------
PUT fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_fingerprint": {
          "tokenizer": "whitespace",
          "filter": [ "fingerprint" ]
        }
      }
    }
  }
}
--------------------------------------------------
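
A quick way to confirm the new analyzer behaves as expected is to run the
analyze API against the index. Assuming the index above was created, the
following request should return the single token `dog jumps over resting
zebra`:

[source,console]
--------------------------------------------------
GET fingerprint_example/_analyze
{
  "analyzer" : "whitespace_fingerprint",
  "text" : "zebra jumps over resting resting dog"
}
--------------------------------------------------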

[[analysis-fingerprint-tokenfilter-configure-parms]]
==== Configurable parameters

[[analysis-fingerprint-tokenfilter-max-size]]
`max_output_size`::
(Optional, integer)
Maximum character length, including whitespace, of the output token. Defaults to
`255`. If the concatenated token exceeds this length, the filter emits no token.

`separator`::
(Optional, string)
Character used to concatenate the token stream input. Defaults to a space.
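
To experiment with these parameters without creating an index, you can pass an
anonymous `fingerprint` filter definition directly to the analyze API. The
request below is a minimal sketch: it sets a deliberately small
`max_output_size` of `12`, so the concatenated token
`dog+jumps+over+resting+zebra` (28 characters) exceeds the limit and the
response should contain no tokens:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "whitespace",
  "filter" : [
    {
      "type" : "fingerprint",
      "separator" : "+",
      "max_output_size" : 12
    }
  ],
  "text" : "zebra jumps over resting resting dog"
}
--------------------------------------------------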

[[analysis-fingerprint-tokenfilter-customize]]
==== Customize

To customize the `fingerprint` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

For example, the following request creates a custom `fingerprint` filter that
uses `+` to concatenate token streams. The filter also limits output tokens to
`100` characters or fewer.

[source,console]
--------------------------------------------------
PUT custom_fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_": {
          "tokenizer": "whitespace",
          "filter": [ "fingerprint_plus_concat" ]
        }
      },
      "filter": {
        "fingerprint_plus_concat": {
          "type": "fingerprint",
          "max_output_size": 100,
          "separator": "+"
        }
      }
    }
  }
}
--------------------------------------------------
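
As with the earlier example, you can verify the custom filter by analyzing
sample text against the new index. With the configuration above, this request
should return the single `+`-separated token `dog+jumps+over+resting+zebra`:

[source,console]
--------------------------------------------------
GET custom_fingerprint_example/_analyze
{
  "analyzer" : "whitespace_",
  "text" : "zebra jumps over resting resting dog"
}
--------------------------------------------------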
