Commit 90a45d2

[DOCS] Reformat html_strip charfilter (#57764) (#57811)
Changes:

* Converts title to sentence case
* Adds a title abbreviation
* Adds Lucene link to description
* Reformats sections
1 parent ee98bcb commit 90a45d2

1 file changed: +80 -77 lines changed

@@ -1,29 +1,44 @@
 [[analysis-htmlstrip-charfilter]]
-=== HTML Strip Char Filter
+=== HTML strip character filter
+++++
+<titleabbrev>HTML strip</titleabbrev>
+++++
 
-The `html_strip` character filter strips HTML elements from the text and
-replaces HTML entities with their decoded value (e.g. replacing `&amp;` with
-`&`).
+Strips HTML elements from text and replaces HTML entities with their decoded
+value (e.g., replaces `&amp;` with `&`).
 
-[float]
-=== Example output
+The `html_strip` filter uses Lucene's
+{lucene-analysis-docs}/charfilter/HTMLStripCharFilter.html[HTMLStripCharFilter].
+
+[[analysis-htmlstrip-charfilter-analyze-ex]]
+==== Example
+
+The following <<indices-analyze,analyze API>> request uses the
+`html_strip` filter to change the text `<p>I&apos;m so <b>happy</b>!</p>` to
+`\nI'm so happy!\n`.
 
 [source,console]
----------------------------
-POST _analyze
+----
+GET /_analyze
 {
-  "tokenizer": "keyword", <1>
-  "char_filter": [ "html_strip" ],
+  "tokenizer": "keyword",
+  "char_filter": [
+    "html_strip"
+  ],
   "text": "<p>I&apos;m so <b>happy</b>!</p>"
 }
----------------------------
+----
 
-<1> The <<analysis-keyword-tokenizer,`keyword` tokenizer>> returns a single term.
+The filter produces the following text:
 
-/////////////////////
+[source,text]
+----
+[ \nI'm so happy!\n ]
+----
 
+////
 [source,console-result]
-----------------------------
+----
 {
   "tokens": [
     {
@@ -35,93 +50,81 @@ POST _analyze
     }
   ]
 }
-----------------------------
-
-/////////////////////
-
+----
+////
 
-The above example returns the term:
+[[analysis-htmlstrip-charfilter-analyzer-ex]]
+==== Add to an analyzer
 
-[source,text]
----------------------------
-[ \nI'm so happy!\n ]
----------------------------
-
-The same example with the `standard` tokenizer would return the following terms:
+The following <<indices-create-index,create index API>> request uses the
+`html_strip` filter to configure a new
+<<analysis-custom-analyzer,custom analyzer>>.
 
-[source,text]
----------------------------
-[ I'm, so, happy ]
---------------------------- 
-
-[float]
-=== Configuration
+[source,console]
+----
+PUT /my_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_analyzer": {
+          "tokenizer": "keyword",
+          "char_filter": [
+            "html_strip"
+          ]
+        }
+      }
+    }
+  }
+}
+----
 
-The `html_strip` character filter accepts the following parameter:
+[[analysis-htmlstrip-charfilter-configure-parms]]
+==== Configurable parameters
 
-[horizontal]
 `escaped_tags`::
+(Optional, array of strings)
+Array of HTML elements without enclosing angle brackets (`< >`). The filter
+skips these HTML elements when stripping HTML from the text. For example, a
+value of `[ "p" ]` skips the `<p>` HTML element.
 
-An array of HTML tags which should not be stripped from the original text.
+[[analysis-htmlstrip-charfilter-customize]]
+==== Customize
 
-[float]
-=== Example configuration
+To customize the `html_strip` filter, duplicate it to create the basis
+for a new custom character filter. You can modify the filter using its
+configurable parameters.
 
-In this example, we configure the `html_strip` character filter to leave `<b>`
-tags in place:
+The following <<indices-create-index,create index API>> request
+configures a new <<analysis-custom-analyzer,custom analyzer>> using a custom
+`html_strip` filter, `my_custom_html_strip_char_filter`.
+
+The `my_custom_html_strip_char_filter` filter skips the removal of the `<b>`
+HTML element.
 
 [source,console]
-----------------------------
+----
 PUT my_index
 {
   "settings": {
     "analysis": {
       "analyzer": {
         "my_analyzer": {
           "tokenizer": "keyword",
-          "char_filter": ["my_char_filter"]
+          "char_filter": [
+            "my_custom_html_strip_char_filter"
+          ]
         }
       },
       "char_filter": {
-        "my_char_filter": {
+        "my_custom_html_strip_char_filter": {
           "type": "html_strip",
-          "escaped_tags": ["b"]
+          "escaped_tags": [
+            "b"
+          ]
         }
       }
     }
  }
 }
-
-POST my_index/_analyze
-{
-  "analyzer": "my_analyzer",
-  "text": "<p>I&apos;m so <b>happy</b>!</p>"
-}
-----------------------------
-
-/////////////////////
-
-[source,console-result]
-----------------------------
-{
-  "tokens": [
-    {
-      "token": "\nI'm so <b>happy</b>!\n",
-      "start_offset": 0,
-      "end_offset": 32,
-      "type": "word",
-      "position": 0
-    }
-  ]
-}
-----------------------------
-
-/////////////////////
-
-
-The above example produces the following term:
-
-[source,text]
----------------------------
-[ \nI'm so <b>happy</b>!\n ]
----------------------------
+----
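For a quick sanity check of the `escaped_tags` configuration above, the custom analyzer can be exercised with an analyze request against the new index. The sketch below mirrors the verification request that appears in the removed lines of this diff; the `my_index` and `my_analyzer` names are the ones defined in the customize snippet.

[source,console]
----
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
----

Because `b` is listed in `escaped_tags`, the returned token should keep the `<b>` element while the `<p>` tags are stripped and `&apos;` is decoded, giving roughly `\nI'm so <b>happy</b>!\n`.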
