Based on PR elastic#45, this adds a new language detection option that uses the language detection feature available in Tika:
https://tika.apache.org/1.4/detection.html#Language_Detection
By default, language detection is disabled (`false`) because it can add processing cost at index time.
This default can be changed with the `index.mapping.attachment.detect_language` setting.
It can also be set per indexed document using the `_detect_language` parameter.
Closes elastic#45.
Closes elastic#44.
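As a rough illustration of the two ways to turn detection on, here is a minimal Python sketch that only builds the JSON bodies. The setting name, the `_detect_language` parameter and the `my_attachment` field name come from this PR and the README. The object form of the attachment field with a `_content` key holding base64 data is an assumption about the plugin's per-document format, and `build_attachment_doc` is a hypothetical helper, not part of the plugin.

```python
import base64
import json

# Index-level default, using the setting name given in this PR.
INDEX_SETTINGS = {"index.mapping.attachment.detect_language": True}


def build_attachment_doc(raw_bytes, field="my_attachment", detect_language=True):
    """Hypothetical helper: build a document body that requests language
    detection for this one document via `_detect_language`.

    Assumption: the attachment field accepts an object whose `_content`
    key holds the base64-encoded file.
    """
    return {
        field: {
            "_detect_language": detect_language,
            "_content": base64.b64encode(raw_bytes).decode("ascii"),
        }
    }


doc = build_attachment_doc(b"Bonjour tout le monde")
print(json.dumps(INDEX_SETTINGS, indent=2))
print(json.dumps(doc, indent=2))
```

The per-document flag is kept as an explicit argument so a caller can leave the index default at `false` and opt in only for documents where the cost is acceptable.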
README.md (+14 -2: 14 additions & 2 deletions)
@@ -63,6 +63,7 @@ The metadata supported are:
 * `keywords`
 * `content_type`
 * `content_length` is the original content_length before text extraction (aka file size)
+* `language`
 
 They can be queried using the "dot notation", for example: `my_attachment.author`.
 
@@ -81,7 +82,8 @@ Both the meta data and the actual content are simple core type mappers (string,
             "author": {"analyzer":"myAnalyzer"},
             "keywords": {store :"yes"},
             "content_type": {store :"yes"},
-            "content_length": {store :"yes"}
+            "content_length": {store :"yes"},
+            "language": {store :"yes"}
         }
     }
 }
@@ -96,7 +98,7 @@ Indexed Characters
 
 By default, `100000` characters are extracted when indexing the content. This default value can be changed by setting the `index.mapping.attachment.indexed_chars` setting. It can also be provided on a per document indexed using the `_indexed_chars` parameter. `-1` can be set to extract all text, but note that all the text needs to be allowed to be represented in memory.
 
-Note, this feature is support since `1.3.0` version.
+Note, this feature is supported since `1.3.0` version.
 
 Metadata parsing error handling
 -------------------------------
@@ -106,6 +108,16 @@ Since version `1.9.0`, parsing errors are ignored so your document is indexed.
 
 You can disable this feature by setting the `index.mapping.attachment.ignore_errors` setting to `false`.
 
+Language Detection
+------------------
+
+By default, language detection is disabled (`false`) as it could come with a cost.
+This default value can be changed by setting the `index.mapping.attachment.detect_language` setting.
+It can also be provided on a per document indexed using the `_detect_language` parameter.
+
+Note, this feature is supported since `2.0.0` version.