Commit 0595778

Update README.md
1 parent 880c864 commit 0595778

1 file changed, +26 -0 lines changed

README.md

Lines changed: 26 additions & 0 deletions
@@ -72,6 +72,32 @@ Folding of unicode characters based on `UTR#30`. It registers itself under `icu_
         }
     }

+ICU Filtering
+-------------
+
+The folding can be restricted to a set of Unicode characters with the `unicodeSetFilter` parameter. This is useful for a search engine that targets a single language and needs to retain national characters that are primary letters in that language. See the [UnicodeSet syntax](http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html).
+
+The following example exempts Swedish characters from the folding. Note that the exempted characters are NOT lowercased, which is why the `lowercase` filter is added after `my_icu_folding`.
+
+    {
+        "index" : {
+            "analysis" : {
+                "analyzer" : {
+                    "folding" : {
+                        "tokenizer" : "standard",
+                        "filter" : ["my_icu_folding", "lowercase"]
+                    }
+                },
+                "filter" : {
+                    "my_icu_folding" : {
+                        "type" : "icu_folding",
+                        "unicodeSetFilter" : "[^åäöÅÄÖ]"
+                    }
+                }
+            }
+        }
+    }
+
 ICU Collation
 -------------

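To illustrate why the `lowercase` filter follows `my_icu_folding`, here is a minimal Python sketch of the filtering behaviour. It is not the ICU implementation: `fold_char`, `fold_with_filter` and `analyze` are hypothetical helpers, and the folding is only approximated by stripping combining marks and case folding with the standard `unicodedata` module. The set `[^åäöÅÄÖ]` is a complement set, so folding applies to every character except the six listed.

```python
import unicodedata

# Characters excluded from folding, mirroring the complement set "[^åäöÅÄÖ]".
EXEMPT = set("\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6")  # å ä ö Å Ä Ö


def fold_char(ch):
    """Rough stand-in for ICU folding (assumption): drop combining marks, fold case."""
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def fold_with_filter(token):
    """Fold every character except the exempted ones, which pass through unchanged."""
    token = unicodedata.normalize("NFC", token)  # make sure letters are precomposed
    return "".join(ch if ch in EXEMPT else fold_char(ch) for ch in token)


def analyze(token):
    """Mimic the filter chain ["my_icu_folding", "lowercase"] for a single token."""
    return fold_with_filter(token).lower()
```

In this sketch, `fold_with_filter` leaves an exempted uppercase letter such as `Ä` untouched while still reducing `é` to `e`, and the trailing `lower()` in `analyze` is what finally lowercases the exempted letter.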