Commit ebec1a2

Improve Logsdb docs including default values (elastic#115205)
This PR adds detailed documentation for `logsdb` mode, covering several key aspects of its default behavior and configuration options. It includes:

- default settings for index sorting (`index.sort.field`, `index.sort.order`, etc.).
- usage of synthetic `_source` by default.
- information about specialized codecs and how users can override them.
- default behavior for `ignore_malformed` and `ignore_above` settings, including precedence rules.
- explanation of how fields without `doc_values` are handled and what we do if they are missing.
1 parent c64226c commit ebec1a2

1 file changed

+172
-8
lines changed

docs/reference/data-streams/logs.asciidoc

@@ -8,14 +8,6 @@ A logs data stream is a data stream type that stores log data more efficiently.
88
In benchmarks, log data stored in a logs data stream used ~2.5 times less disk space than a regular data
99
stream. The exact impact will vary depending on your data set.
1010

11-
The following features are enabled in a logs data stream:
12-
13-
* <<synthetic-source,Synthetic source>>, which omits storing the `_source` field. When the document source is requested, it is synthesized from document fields upon retrieval.
14-
15-
* Index sorting. This yields a lower storage footprint. By default indices are sorted by `host.name` and `@timestamp` fields at index time.
16-
17-
* More space efficient compression for fields with <<doc-values,`doc_values`>> enabled.
18-
1911
[discrete]
2012
[[how-to-use-logsds]]
2113
=== Create a logs data stream
@@ -50,3 +42,175 @@ DELETE _index_template/my-index-template
5042
----
5143
// TEST[continued]
5244
////
45+
46+
[[logsdb-default-settings]]
47+
48+
[discrete]
49+
[[logsdb-synthtic-source]]
50+
=== Synthetic source
51+
52+
By default, `logsdb` mode uses <<synthetic-source,synthetic source>>, which omits storing the original `_source`
53+
field and synthesizes it from doc values or stored fields upon document retrieval. Synthetic source comes with a few
54+
restrictions which you can read more about in the <<synthetic-source,documentation>> section dedicated to it.
55+
56+
NOTE: When dealing with multi-value fields, the `index.mapping.synthetic_source_keep` setting controls how field values
57+
are preserved for <<synthetic-source,synthetic source>> reconstruction. In `logsdb`, the default value is `arrays`,
58+
which retains both duplicate values and the order of entries but not necessarily the exact structure when it comes to
59+
array elements or objects. Preserving duplicates and ordering could be critical for some log fields. This could be the
60+
case, for instance, for DNS A records, HTTP headers, or log entries that represent sequential or repeated events.
61+
62+
For more details on this setting and ways to refine or bypass it, check out <<synthetic-source-keep, this section>>.
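
To make the effect of `arrays` concrete, here is a minimal Python sketch. It is not Elasticsearch code: both function names are hypothetical, and the "sorted, de-duplicated" behaviour is an assumption that models a field rebuilt purely from doc values.

```python
# Illustrative only: contrasts two ways a multi-value field could be rebuilt.

def rebuild_from_doc_values(values):
    """Assumed behaviour without keeping arrays: sorted, de-duplicated values."""
    return sorted(set(values))

def rebuild_keeping_arrays(values):
    """With `arrays`: duplicates and the original order are retained."""
    return list(values)

# Hypothetical DNS A records as they appeared in the original log event.
dns_a_records = ["10.0.0.7", "10.0.0.2", "10.0.0.7"]

print(rebuild_from_doc_values(dns_a_records))  # order and duplicate are lost
print(rebuild_keeping_arrays(dns_a_records))   # original sequence is preserved
```

In the first case the repeated answer and its position are gone, which is why the default matters for fields whose order or repetition carries meaning.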
63+
64+
[discrete]
65+
[[logsdb-sort-settings]]
66+
=== Index sort settings
67+
68+
The following index sort settings are applied by default in `logsdb` mode:
69+
70+
* `index.sort.field`: `["host.name", "@timestamp"]`
71+
In `logsdb` mode, indices are sorted by `host.name` and `@timestamp` fields by default. For data streams, the
72+
`@timestamp` field is automatically injected if it is not present.
73+
74+
* `index.sort.order`: `["desc", "desc"]`
75+
The default sort order for both fields is descending (`desc`), prioritizing the latest data.
76+
77+
* `index.sort.mode`: `["min", "min"]`
78+
The default sort mode is `min`, ensuring that indices are sorted by the minimum value of multi-value fields.
79+
80+
* `index.sort.missing`: `["_first", "_first"]`
81+
Missing values are sorted to appear first (`_first`) in `logsdb` index mode.
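
The combined effect of these four defaults can be sketched in plain Python. This only models the resulting document order; it is not how Lucene implements index sorting, and the sample documents (with integer timestamps) are hypothetical.

```python
def sort_key(doc):
    # Sort mode `min`: a multi-value field is represented by its smallest value.
    def min_value(field):
        value = doc.get(field)
        if isinstance(value, list):
            value = min(value) if value else None
        return value

    host = min_value("host.name")
    timestamp = min_value("@timestamp")
    # (is_missing, value) pairs: with reverse=True, missing values come first
    # (`_first`), followed by present values in descending order (`desc`).
    return ((host is None, host or ""), (timestamp is None, timestamp or 0))

docs = [
    {"host.name": "web-1", "@timestamp": 100},
    {"host.name": "web-2", "@timestamp": 50},
    {"@timestamp": 75},  # no host.name: sorted first
    {"host.name": "web-1", "@timestamp": 200},
    {"host.name": ["web-3", "web-0"], "@timestamp": 10},  # sorts as "web-0"
]

ordered = sorted(docs, key=sort_key, reverse=True)
for doc in ordered:
    print(doc)
```

Documents from the same host end up next to each other, newest first, which is the layout the default settings aim for.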
82+
83+
`logsdb` index mode allows users to override the default sort settings. For instance, users can specify their own fields
84+
and order for sorting by modifying the `index.sort.field` and `index.sort.order` settings.
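
As a sketch of such an override, the Python snippet below builds the JSON body for an index template that sorts by a different field. The index pattern, the priority, and the `service.name` field are hypothetical; only the setting names come from this page. `@timestamp` is listed explicitly because it is not added to custom sort fields automatically.

```python
import json

# Body for a request such as `PUT _index_template/<name>` (name not shown here).
template_body = {
    "index_patterns": ["logs-myapp-*"],  # hypothetical pattern
    "data_stream": {},
    "priority": 500,  # hypothetical priority
    "template": {
        "settings": {
            "index.mode": "logsdb",
            # Custom sort: one order entry per sort field.
            "index.sort.field": ["service.name", "@timestamp"],
            "index.sort.order": ["asc", "desc"],
        },
        "mappings": {
            "properties": {
                "service.name": {"type": "keyword"},  # hypothetical field
                "@timestamp": {"type": "date"},
            }
        },
    },
}

print(json.dumps(template_body, indent=2))
```

Each sort field needs a mapping with doc values, so the example maps both fields explicitly rather than relying on dynamic mapping.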
85+
86+
When using default sort settings, the `host.name` field is automatically injected into the mappings of the
87+
index as a `keyword` field to ensure that sorting can be applied. This guarantees that logs are efficiently sorted and
88+
retrieved based on the `host.name` and `@timestamp` fields.
89+
90+
NOTE: If `subobjects` is set to `true` (which is the default), the `host.name` field will be mapped as an object field
91+
named `host`, containing a `name` child field of type `keyword`. On the other hand, if `subobjects` is set to `false`,
92+
a single `host.name` field will be mapped as a `keyword` field.
93+
94+
Once an index is created, the sort settings are immutable and cannot be modified. To apply different sort settings,
95+
a new index must be created with the desired configuration. For data streams, this can be achieved by means of an index
96+
rollover after updating relevant (component) templates.
97+
98+
If the default sort settings are not suitable for your use case, consider modifying them. Keep in mind that sort
99+
settings can influence indexing throughput and query latency, and may affect compression efficiency due to the way data
100+
is organized after sorting. For more details, refer to our documentation on
101+
<<index-modules-index-sorting,index sorting>>.
102+
103+
NOTE: For <<data-streams, data streams>>, the `@timestamp` field is automatically injected if not already present.
104+
However, if custom sort settings are applied, the `@timestamp` field is injected into the mappings, but it is not
105+
automatically added to the list of sort fields.
106+
107+
[discrete]
108+
[[logsdb-specialized-codecs]]
109+
=== Specialized codecs
110+
111+
`logsdb` index mode uses the `best_compression` <<index-codec,codec>> by default, which applies {wikipedia}/Zstd[ZSTD]
112+
compression to stored fields. Users are allowed to override it and switch to the `default` codec for faster compression
113+
at the expense of a slightly larger storage footprint.
114+
115+
`logsdb` index mode also adopts specialized codecs for numeric doc values that are crafted to optimize storage usage.
116+
Users can rely on these specialized codecs being applied by default when using `logsdb` index mode.
117+
118+
Doc values encoding for numeric fields in `logsdb` follows a static sequence of codecs, applying each one in the
119+
following order: delta encoding, offset encoding, Greatest Common Divisor (GCD) encoding, and finally Frame Of Reference
120+
(FOR) encoding. The decision to apply each encoding is based on heuristics determined by the data distribution.
121+
For example, before applying delta encoding, the algorithm checks if the data is monotonically non-decreasing or
122+
non-increasing. If the data fits this pattern, delta encoding is applied; otherwise, the next encoding is considered.
123+
124+
The encoding is specific to each Lucene segment and is also re-applied at segment merging time. The merged Lucene segment
125+
may use a different encoding compared to the original Lucene segments, based on the characteristics of the merged data.
126+
127+
The following methods are applied sequentially:
128+
129+
* **Delta encoding**:
130+
a compression method that stores the difference between consecutive values instead of the actual values.
131+
132+
* **Offset encoding**:
133+
a compression method that stores the difference from a base value rather than between consecutive values.
134+
135+
* **Greatest Common Divisor (GCD) encoding**:
136+
a compression method that finds the greatest common divisor of a set of values and stores the differences
137+
as multiples of the GCD.
138+
139+
* **Frame Of Reference (FOR) encoding**:
140+
a compression method that determines the smallest number of bits required to encode a block of values and uses
141+
bit-packing to fit such values into larger 64-bit blocks.
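
The sequence above can be sketched on a small block of values. This is a simplified Python illustration, not the Lucene or Elasticsearch implementation: the function names are hypothetical, the monotonicity check stands in for the real heuristics, and the final step only computes the bit width rather than packing bits.

```python
from functools import reduce
from itertools import accumulate
from math import gcd

def is_monotonic(values):
    """True if values are non-decreasing or non-increasing."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

def delta_encode(values):
    """Return the first value and the differences between consecutive values."""
    return values[0], [b - a for a, b in zip(values, values[1:])]

def offset_encode(values):
    """Subtract a base (here the minimum) from every value."""
    base = min(values)
    return base, [v - base for v in values]

def gcd_encode(values):
    """Divide every value by the greatest common divisor of the block."""
    divisor = reduce(gcd, values) or 1
    return divisor, [v // divisor for v in values]

def bits_per_value(values):
    """Smallest bit width that can hold every non-negative value (FOR step)."""
    return max(v.bit_length() for v in values)

block = [1000, 1010, 1020, 1040, 1050]  # hypothetical sorted timestamps
assert is_monotonic(block)              # heuristic that enables delta encoding

first, deltas = delta_encode(block)     # 1000, [10, 10, 20, 10]
base, offsets = offset_encode(deltas)   # 10, [0, 0, 10, 0]
divisor, quotients = gcd_encode(offsets)  # 10, [0, 0, 1, 0]
width = bits_per_value(quotients)

print(max(block).bit_length(), "bits per raw value ->", width, "bit per encoded value")

# Decoding reverses each step, so the transformation is lossless.
restored_deltas = [q * divisor + base for q in quotients]
restored = list(accumulate([first] + restored_deltas))
```

In this example the raw values need 11 bits each, while the encoded block needs 1 bit per value plus a few constants (`first`, `base`, `divisor`).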
142+
143+
For keyword fields, **Run Length Encoding (RLE)** is applied to the ordinals, which represent positions in the Lucene
144+
segment-level keyword dictionary. This compression is used when multiple consecutive documents share the same keyword.
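
A minimal Python sketch of run-length encoding over ordinals follows. The dictionary here is just a sorted list of distinct values, which is a simplification of the segment-level dictionary, and the host names are hypothetical.

```python
from itertools import groupby

def run_length_encode(ordinals):
    """Collapse consecutive repeats into (ordinal, run_length) pairs."""
    return [(ordinal, sum(1 for _ in run)) for ordinal, run in groupby(ordinals)]

# Keyword values of consecutive documents in a segment.
host_names = ["web-1", "web-1", "web-1", "web-2", "web-2", "web-3"]

# Simplified dictionary: each distinct keyword gets its position as ordinal.
dictionary = sorted(set(host_names))
ordinals = [dictionary.index(name) for name in host_names]

runs = run_length_encode(ordinals)
print(ordinals)  # one ordinal per document
print(runs)      # one pair per run of identical keywords
```

Runs are longest when documents with the same keyword are adjacent, which is what sorting the index by that field tends to produce.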
145+
146+
[discrete]
147+
[[logsdb-ignored-settings]]
148+
=== `ignore_malformed`, `ignore_above`, `ignore_dynamic_beyond_limit`
149+
150+
By default, `logsdb` index mode sets `ignore_malformed` to `true`. This setting allows documents with malformed fields
151+
to be indexed without causing indexing failures, ensuring that log data ingestion continues smoothly even when some
152+
fields contain invalid or improperly formatted data.
153+
154+
Users can override this default by setting `index.mapping.ignore_malformed` to `false`. However, this is not recommended
155+
as it might result in documents with malformed fields being rejected and not indexed at all.
156+
157+
In `logsdb` index mode, the `index.mapping.ignore_above` setting is applied by default at the index level to ensure
158+
efficient storage and indexing of large keyword fields. The index-level default for `ignore_above` is set to 8191
159+
**characters**. With UTF-8 encoding, where a character takes at most 4 bytes, this corresponds to a limit of at most 32764 bytes.
160+
The mapping-level `ignore_above` setting still takes precedence. If a specific field has an `ignore_above` value
161+
defined in its mapping, that value will override the index-level `index.mapping.ignore_above` value. This default
162+
behavior helps to optimize indexing performance by preventing excessively large string values from being indexed, while
163+
still allowing users to customize the limit, overriding it at the mapping level or changing the index level default
164+
setting.
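
The byte limit and the precedence rule can be checked with a short Python sketch. The helper name and the sample mappings are hypothetical; the sketch only models which value wins, not how Elasticsearch resolves settings internally.

```python
INDEX_LEVEL_IGNORE_ABOVE = 8191  # logsdb index-level default, in characters
MAX_UTF8_BYTES_PER_CHAR = 4      # upper bound for a single UTF-8 character

# Worst-case size in bytes of a value that is exactly at the limit.
max_bytes = INDEX_LEVEL_IGNORE_ABOVE * MAX_UTF8_BYTES_PER_CHAR

def effective_ignore_above(field_mapping, index_default=INDEX_LEVEL_IGNORE_ABOVE):
    """A mapping-level `ignore_above` overrides the index-level default."""
    return field_mapping.get("ignore_above", index_default)

message_field = {"type": "keyword"}                     # falls back to the index default
short_field = {"type": "keyword", "ignore_above": 256}  # mapping-level override

print(max_bytes)
print(effective_ignore_above(message_field))
print(effective_ignore_above(short_field))
```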
165+
166+
In `logsdb` index mode, the setting `index.mapping.total_fields.ignore_dynamic_beyond_limit` is set to `true` by
167+
default. This allows dynamically mapped fields to be added on top of statically defined fields without causing document
168+
rejection, even after the total number of fields exceeds the limit defined by `index.mapping.total_fields.limit`. The
169+
`index.mapping.total_fields.limit` setting specifies the maximum number of fields an index can have (static, dynamic
170+
and runtime). When the limit is reached, new dynamically mapped fields will be ignored instead of failing the document
171+
indexing, ensuring continued log ingestion without errors.
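
The behaviour described above can be modelled with a small Python sketch. The function is hypothetical and deliberately simplified: it treats every field name as one mapped field and ignores runtime fields, objects, and multi-fields.

```python
def apply_dynamic_fields(mapped_fields, doc_fields, limit,
                         ignore_dynamic_beyond_limit=True):
    """Return (resulting mapped fields, ignored fields) for one document."""
    mapped = set(mapped_fields)
    ignored = []
    for field in doc_fields:
        if field in mapped:
            continue  # already mapped, nothing to add
        if len(mapped) < limit:
            mapped.add(field)      # room left: map the new field dynamically
        elif ignore_dynamic_beyond_limit:
            ignored.append(field)  # limit reached: skip the field, keep the document
        else:
            raise ValueError(f"limit of {limit} total fields exceeded by [{field}]")
    return mapped, ignored

# Assumes `host.name` counts as a single field (as with `subobjects: false`).
existing = {"@timestamp", "host.name"}
incoming = ["@timestamp", "message", "trace.id"]

mapped, ignored = apply_dynamic_fields(existing, incoming, limit=3)
print(sorted(mapped))  # "message" fits under the limit
print(ignored)         # "trace.id" is ignored instead of failing the document
```

With `ignore_dynamic_beyond_limit=False`, the same call raises an error, which models the document being rejected.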
172+
173+
NOTE: When automatically injected, `host.name` and `@timestamp` contribute to the limit of mapped fields. When
174+
`host.name` is mapped with `subobjects: true` it consists of two fields. When `host.name` is mapped with
175+
`subobjects: false` it only consists of one field.
176+
177+
[discrete]
178+
[[logsdb-nodocvalue-fields]]
179+
=== Fields without doc values
180+
181+
When `logsdb` index mode uses synthetic `_source`, and `doc_values` are disabled for a field in the mapping,
182+
Elasticsearch may set the `store` setting to `true` for that field as a last resort option to ensure that the field's
183+
data is still available for reconstructing the document’s source when retrieving it via
184+
<<synthetic-source,synthetic source>>.
185+
186+
For example, this happens with text fields when `store` is `false` and there is no suitable multi-field available to
187+
reconstruct the original value in <<synthetic-source,synthetic source>>.
188+
189+
This automatic adjustment allows synthetic source to work correctly, even when doc values are not enabled for certain
190+
fields.
191+
192+
[discrete]
193+
[[logsdb-settings-summary]]
194+
=== LogsDB settings summary
195+
196+
The following is a summary of key settings that apply when using `logsdb` index mode in Elasticsearch:
197+
198+
* **`index.mode`**: `"logsdb"`
199+
200+
* **`index.mapping.synthetic_source_keep`**: `"arrays"`
201+
202+
* **`index.sort.field`**: `["host.name", "@timestamp"]`
203+
204+
* **`index.sort.order`**: `["desc", "desc"]`
205+
206+
* **`index.sort.mode`**: `["min", "min"]`
207+
208+
* **`index.sort.missing`**: `["_first", "_first"]`
209+
210+
* **`index.codec`**: `"best_compression"`
211+
212+
* **`index.mapping.ignore_malformed`**: `true`
213+
214+
* **`index.mapping.ignore_above`**: `8191`
215+
216+
* **`index.mapping.total_fields.ignore_dynamic_beyond_limit`**: `true`
