[[analysis-htmlstrip-charfilter]]
=== HTML strip character filter
++++
<titleabbrev>HTML strip</titleabbrev>
++++

Strips HTML elements from a text and replaces HTML entities with their decoded
value (e.g., replaces `&amp;` with `&`).

The `html_strip` filter uses Lucene's
{lucene-analysis-docs}/charfilter/HTMLStripCharFilter.html[HTMLStripCharFilter].

[[analysis-htmlstrip-charfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the
`html_strip` filter to change the text `<p>I'm so <b>happy</b>!</p>` to
`\nI'm so happy!\n`.

[source,console]
----
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>I'm so <b>happy</b>!</p>"
}
----

The filter produces the following text:

[source,text]
----
[ \nI'm so happy!\n ]
----

////
[source,console-result]
----
{
  "tokens": [
    {
      "token": "\nI'm so happy!\n",
      "start_offset": 0,
      "end_offset": 32,
      "type": "word",
      "position": 0
    }
  ]
}
----
////

[[analysis-htmlstrip-charfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
`html_strip` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

[source,console]
----
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  }
}
----
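
As a quick check of the analyzer defined above, you can run it through the
analyze API; a sketch, assuming the `my_index` request above has been applied:

[source,console]
----
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm so <b>happy</b>!</p>"
}
----

Because `my_analyzer` uses the same `keyword` tokenizer and `html_strip`
character filter as the first example, it should produce the same
`\nI'm so happy!\n` text.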

[[analysis-htmlstrip-charfilter-configure-parms]]
==== Configurable parameters

`escaped_tags`::
(Optional, array of strings)
Array of HTML elements without enclosing angle brackets (`< >`). The filter
skips these HTML elements when stripping HTML from the text. For example, a
value of `[ "p" ]` skips the `<p>` HTML element.

[[analysis-htmlstrip-charfilter-customize]]
==== Customize

To customize the `html_strip` filter, duplicate it to create the basis
for a new custom character filter. You can modify the filter using its
configurable parameters.

The following <<indices-create-index,create index API>> request
configures a new <<analysis-custom-analyzer,custom analyzer>> using a custom
`html_strip` filter, `my_custom_html_strip_char_filter`.

The `my_custom_html_strip_char_filter` filter skips the removal of the `<b>`
HTML element.

[source,console]
----
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_custom_html_strip_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_custom_html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "b"
          ]
        }
      }
    }
  }
}
----
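
To confirm that `<b>` tags survive the custom filter, the analyzer can be
exercised directly; a sketch, assuming the `my_index` settings above:

[source,console]
----
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm so <b>happy</b>!</p>"
}
----

The filter produces the following text:

[source,text]
----
[ \nI'm so <b>happy</b>!\n ]
----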