@@ -8,14 +8,6 @@ A logs data stream is a data stream type that stores log data more efficiently.
In benchmarks, log data stored in a logs data stream used ~2.5 times less disk space than a regular data
stream. The exact impact will vary depending on your data set.

- The following features are enabled in a logs data stream:
-
- * <<synthetic-source,Synthetic source>>, which omits storing the `_source` field. When the document source is requested, it is synthesized from document fields upon retrieval.
-
- * Index sorting. This yields a lower storage footprint. By default indices are sorted by `host.name` and `@timestamp` fields at index time.
-
- * More space efficient compression for fields with <<doc-values,`doc_values`>> enabled.
-
[discrete]
[[how-to-use-logsds]]
=== Create a logs data stream
@@ -50,3 +42,175 @@ DELETE _index_template/my-index-template
----
// TEST[continued]
////
+
+ [[logsdb-default-settings]]
+
+ [discrete]
+ [[logsdb-synthtic-source]]
+ === Synthetic source
+
+ By default, `logsdb` mode uses <<synthetic-source,synthetic source>>, which omits storing the original `_source`
+ field and synthesizes it from doc values or stored fields upon document retrieval. Synthetic source comes with a few
+ restrictions, which you can read more about in the <<synthetic-source,documentation>> section dedicated to it.
+
+ NOTE: When dealing with multi-value fields, the `index.mapping.synthetic_source_keep` setting controls how field values
+ are preserved for <<synthetic-source,synthetic source>> reconstruction. In `logsdb`, the default value is `arrays`,
+ which retains both duplicate values and the order of entries, but not necessarily the exact structure of
+ array elements or objects. Preserving duplicates and ordering can be critical for some log fields. This is the
+ case, for instance, for DNS A records, HTTP headers, or log entries that represent sequential or repeated events.
+
+ For more details on this setting and ways to refine or bypass it, check out <<synthetic-source-keep,this section>>.
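As a sketch of overriding that default (the index name `my-logsdb-index` is illustrative), the setting can be changed back to `none` at index creation time:

[source,console]
----
PUT my-logsdb-index
{
  "settings": {
    "index.mode": "logsdb",
    "index.mapping.synthetic_source_keep": "none" <1>
  }
}
----
<1> Overrides the `logsdb` default of `arrays`; duplicate values and ordering in multi-value fields may then be lost on reconstruction.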
+
+ [discrete]
+ [[logsdb-sort-settings]]
+ === Index sort settings
+
+ The following index sort settings are applied by default in `logsdb` mode:
+
+ * `index.sort.field`: `["host.name", "@timestamp"]`
+ In `logsdb` mode, indices are sorted by the `host.name` and `@timestamp` fields by default. For data streams, the
+ `@timestamp` field is automatically injected if it is not present.
+
+ * `index.sort.order`: `["desc", "desc"]`
+ The default sort order for both fields is descending (`desc`), prioritizing the latest data.
+
+ * `index.sort.mode`: `["min", "min"]`
+ The default sort mode is `min`, ensuring that indices are sorted by the minimum value of multi-value fields.
+
+ * `index.sort.missing`: `["_first", "_first"]`
+ Missing values are sorted to appear first (`_first`) in `logsdb` index mode.
+
+ `logsdb` index mode allows you to override the default sort settings. For instance, you can specify your own fields
+ and order for sorting by modifying `index.sort.field` and `index.sort.order`.
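As a hedged sketch of such an override (the template name, index pattern, and `service.name` sort field are all illustrative), a data stream template might set custom sort fields like this. Because custom sort settings do not automatically include `@timestamp` in the sort field list, it is listed explicitly here:

[source,console]
----
PUT _index_template/my-logsdb-template
{
  "index_patterns": ["logs-custom-*"],
  "data_stream": {},
  "priority": 200,
  "template": {
    "settings": {
      "index.mode": "logsdb",
      "index.sort.field": ["service.name", "@timestamp"],
      "index.sort.order": ["asc", "desc"]
    }
  }
}
----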
+
+ When using the default sort settings, the `host.name` field is automatically injected into the mappings of the
+ index as a `keyword` field to ensure that sorting can be applied. This guarantees that logs are efficiently sorted and
+ retrieved based on the `host.name` and `@timestamp` fields.
+
+ NOTE: If `subobjects` is set to `true` (the default), the `host.name` field is mapped as an object field
+ named `host` containing a `name` child field of type `keyword`. If `subobjects` is set to `false`,
+ a single `host.name` field is mapped as a `keyword` field.
+
+ Once an index is created, its sort settings are immutable and cannot be modified. To apply different sort settings,
+ a new index must be created with the desired configuration. For data streams, this can be achieved via an index
+ rollover after updating the relevant (component) templates.
+
+ If the default sort settings are not suitable for your use case, consider modifying them. Keep in mind that sort
+ settings can influence indexing throughput and query latency, and may affect compression efficiency due to the way data
+ is organized after sorting. For more details, refer to our documentation on
+ <<index-modules-index-sorting,index sorting>>.
+
+ NOTE: For <<data-streams,data streams>>, the `@timestamp` field is automatically injected if not already present.
+ However, if custom sort settings are applied, the `@timestamp` field is injected into the mappings but is not
+ automatically added to the list of sort fields.
+
+ [discrete]
+ [[logsdb-specialized-codecs]]
+ === Specialized codecs
+
+ `logsdb` index mode uses the `best_compression` <<index-codec,codec>> by default, which applies {wikipedia}/Zstd[ZSTD]
+ compression to stored fields. You can override this and switch to the `default` codec for faster compression
+ at the expense of a slightly larger storage footprint.
+
+ `logsdb` index mode also adopts specialized codecs for numeric doc values that are crafted to optimize storage usage.
+ You can rely on these specialized codecs being applied by default when using `logsdb` index mode.
+
+ Doc values encoding for numeric fields in `logsdb` follows a static sequence of codecs, applying each one in the
+ following order: delta encoding, offset encoding, Greatest Common Divisor (GCD) encoding, and finally Frame Of Reference
+ (FOR) encoding. The decision to apply each encoding is based on heuristics determined by the data distribution.
+ For example, before applying delta encoding, the algorithm checks whether the data is monotonically non-decreasing or
+ non-increasing. If the data fits this pattern, delta encoding is applied; otherwise, the next encoding is considered.
+
+ The encoding is specific to each Lucene segment and is re-applied at segment merge time. The merged Lucene segment
+ may use a different encoding than the original segments, based on the characteristics of the merged data.
+
+ The following methods are applied sequentially:
+
+ * **Delta encoding**:
+ a compression method that stores the difference between consecutive values instead of the actual values.
+
+ * **Offset encoding**:
+ a compression method that stores the difference from a base value rather than between consecutive values.
+
+ * **Greatest Common Divisor (GCD) encoding**:
+ a compression method that finds the greatest common divisor of a set of values and stores the differences
+ as multiples of the GCD.
+
+ * **Frame Of Reference (FOR) encoding**:
+ a compression method that determines the smallest number of bits required to encode a block of values and uses
+ bit-packing to fit those values into larger 64-bit blocks.
+
+ For keyword fields, **Run Length Encoding (RLE)** is applied to the ordinals, which represent positions in the Lucene
+ segment-level keyword dictionary. This compression is used when multiple consecutive documents share the same keyword.
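To make the encoding heuristics above concrete, here is a toy Python sketch (not Lucene's actual implementation; all function names are illustrative) of the monotonicity check behind delta encoding, the divisor behind GCD encoding, and the bit-width computation behind FOR encoding:

```python
from math import gcd
from functools import reduce

def is_monotonic(values):
    # Heuristic checked before applying delta encoding: the data must be
    # monotonically non-decreasing or non-increasing.
    pairs = list(zip(values, values[1:]))
    return all(b >= a for a, b in pairs) or all(b <= a for a, b in pairs)

def delta_encode(values):
    # Delta encoding: keep the first value, then store the difference
    # between each pair of consecutive values.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def gcd_divisor(values):
    # GCD encoding: differences from a base value (here, the minimum) are
    # stored as multiples of their greatest common divisor.
    base = min(values)
    return reduce(gcd, (v - base for v in values))

def for_bit_width(values):
    # FOR encoding: bits needed per value once the block minimum is
    # subtracted; values are then bit-packed into larger 64-bit blocks.
    base = min(values)
    return max(v - base for v in values).bit_length()
```

For example, a block of millisecond timestamps is typically monotonic (so delta encoding applies), and the small deltas that remain need only a few bits each under FOR.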
+
+ [discrete]
+ [[logsdb-ignored-settings]]
+ === `ignore_malformed`, `ignore_above`, `ignore_dynamic_beyond_limit`
+
+ By default, `logsdb` index mode sets `ignore_malformed` to `true`. This setting allows documents with malformed fields
+ to be indexed without causing indexing failures, ensuring that log data ingestion continues smoothly even when some
+ fields contain invalid or improperly formatted data.
+
+ You can override this behavior by setting `index.mapping.ignore_malformed` to `false`. However, this is not recommended,
+ as it might result in documents with malformed fields being rejected and not indexed at all.
+
+ In `logsdb` index mode, the `index.mapping.ignore_above` setting is applied by default at the index level to ensure
+ efficient storage and indexing of large keyword fields. The index-level default for `ignore_above` is 8191
+ **characters**. Because UTF-8 characters can occupy up to four bytes, this corresponds to a limit of up to 32764 bytes.
+ The mapping-level `ignore_above` setting still takes precedence: if a specific field has an `ignore_above` value
+ defined in its mapping, that value overrides the index-level `index.mapping.ignore_above` value. This default
+ behavior helps to optimize indexing performance by preventing excessively large string values from being indexed, while
+ still allowing you to customize the limit, either by overriding it at the mapping level or by changing the index-level
+ default setting.
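As a sketch of this precedence rule (the index and field names are illustrative), the following request sets a field-level `ignore_above` that takes precedence over the index-level default:

[source,console]
----
PUT my-logsdb-index
{
  "settings": {
    "index.mode": "logsdb"
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "keyword",
        "ignore_above": 1024 <1>
      }
    }
  }
}
----
<1> Values longer than 1024 characters are ignored for this field, overriding the index-level default of 8191.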
+
+ In `logsdb` index mode, the setting `index.mapping.total_fields.ignore_dynamic_beyond_limit` is set to `true` by
+ default. This allows dynamically mapped fields to be added on top of statically defined fields without causing document
+ rejection, even after the total number of fields exceeds the limit defined by `index.mapping.total_fields.limit`. The
+ `index.mapping.total_fields.limit` setting specifies the maximum number of fields an index can have (static, dynamic,
+ and runtime). When the limit is reached, new dynamically mapped fields are ignored instead of failing document
+ indexing, ensuring continued log ingestion without errors.
+
+ NOTE: When automatically injected, `host.name` and `@timestamp` count toward the limit of mapped fields. When
+ `host.name` is mapped with `subobjects: true`, it counts as two fields. When `host.name` is mapped with
+ `subobjects: false`, it counts as only one field.
+
+ [discrete]
+ [[logsdb-nodocvalue-fields]]
+ === Fields without doc values
+
+ When `logsdb` index mode uses synthetic `_source` and `doc_values` are disabled for a field in the mapping,
+ Elasticsearch may set the `store` setting to `true` for that field as a last-resort option to ensure that the field's
+ data is still available for reconstructing the document's source when retrieving it via
+ <<synthetic-source,synthetic source>>.
+
+ For example, this happens with text fields when `store` is `false` and no suitable multi-field is available to
+ reconstruct the original value in <<synthetic-source,synthetic source>>.
+
+ This automatic adjustment allows synthetic source to work correctly, even when doc values are not enabled for certain
+ fields.
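A minimal illustration of this situation (the index and field names are hypothetical): with `doc_values` disabled on a `keyword` field, the field's value must still be recoverable for synthetic `_source`, so it may be implicitly stored:

[source,console]
----
PUT my-logsdb-index
{
  "settings": {
    "index.mode": "logsdb"
  },
  "mappings": {
    "properties": {
      "trace.id": {
        "type": "keyword",
        "doc_values": false <1>
      }
    }
  }
}
----
<1> With doc values disabled, Elasticsearch may fall back to storing the field (`store: true`) so its value remains available for synthetic `_source` reconstruction.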
+
+ [discrete]
+ [[logsdb-settings-summary]]
+ === LogsDB settings summary
+
+ The following is a summary of key settings that apply when using `logsdb` index mode in Elasticsearch:
+
+ * **`index.mode`**: `"logsdb"`
+
+ * **`index.mapping.synthetic_source_keep`**: `"arrays"`
+
+ * **`index.sort.field`**: `["host.name", "@timestamp"]`
+
+ * **`index.sort.order`**: `["desc", "desc"]`
+
+ * **`index.sort.mode`**: `["min", "min"]`
+
+ * **`index.sort.missing`**: `["_first", "_first"]`
+
+ * **`index.codec`**: `"best_compression"`
+
+ * **`index.mapping.ignore_malformed`**: `true`
+
+ * **`index.mapping.ignore_above`**: `8191`
+
+ * **`index.mapping.total_fields.ignore_dynamic_beyond_limit`**: `true`
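To inspect the values in effect on a concrete index, including defaults that are not explicitly set, you can query the get settings API (the index name is illustrative):

[source,console]
----
GET my-logsdb-index/_settings?include_defaults=true
----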