Commit 0400d82

jpountz authored and jasontedor committed
Docs: remove notes on sparsity. (#30905)
Sparsity is less of a concern since 6.0. Closes #30833
1 parent e48db6f commit 0400d82

1 file changed: +0 -91 lines changed

docs/reference/how-to/general.asciidoc

@@ -40,94 +40,3 @@ better. For instance if a user searches for two words `foo` and `bar`, a match
 across different chapters is probably very poor, while a match within the same
 paragraph is likely good.

-[float]
-[[sparsity]]
-=== Avoid sparsity
-
-The data structures behind Lucene, which Elasticsearch relies on in order to
-index and store data, work best with dense data, i.e. when all documents have the
-same fields. This is especially true for fields that have norms enabled (which
-is the case for `text` fields by default) or doc values enabled (which is the
-case for numerics, `date`, `ip` and `keyword` by default).
-
-The reason is that Lucene internally identifies documents with so-called doc
-ids, which are integers between 0 and the total number of documents in the
-index. These doc ids are used for communication between the internal APIs of
-Lucene: for instance, searching on a term with a `match` query produces an
-iterator of doc ids, and these doc ids are then used to retrieve the value of
-the `norm` in order to compute a score for these documents. This `norm` lookup
-is currently implemented by reserving one byte for each document.
-The `norm` value for a given doc id can then be retrieved by reading the
-byte at index `doc_id`. While this is very efficient and helps Lucene quickly
-have access to the `norm` values of every document, this has the drawback that
-documents that do not have a value will also require one byte of storage.
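
The byte-per-document layout described in the removed paragraph can be modelled in a few lines of Python. This is only an illustrative sketch, not Lucene's actual code: the class name `DenseNorms` and its methods are hypothetical.

```python
class DenseNorms:
    """Toy model of a dense norms store: one byte per document, indexed by doc id."""

    def __init__(self, max_doc: int) -> None:
        # One byte is reserved for every doc id, whether or not the
        # document has a value for this field.
        self._bytes = bytearray(max_doc)

    def set(self, doc_id: int, norm: int) -> None:
        # The norm must fit in a single byte (0..255).
        self._bytes[doc_id] = norm

    def get(self, doc_id: int) -> int:
        # The lookup is a single array read at index doc_id.
        return self._bytes[doc_id]

    def size_in_bytes(self) -> int:
        return len(self._bytes)


norms = DenseNorms(max_doc=1000)
norms.set(7, 42)  # only one document out of 1000 has the field
```

In this model `norms.size_in_bytes()` stays at 1000 even though a single document carries a value, which is the drawback the paragraph points out.
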
-
-In practice, this means that if an index has `M` documents, norms will require
-`M` bytes of storage *per field*, even for fields that only appear in a small
-fraction of the documents of the index. The situation is slightly more complex
-with doc values, because they can be encoded in several ways depending on the
-type of field and on the actual data that the field stores, but the problem is
-very similar. In case you are wondering: `fielddata`, which was
-used in Elasticsearch pre-2.0 before being replaced with doc values, also
-suffered from this issue, except that the impact was only on the memory
-footprint since `fielddata` was not explicitly materialized on disk.
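
To put rough numbers on the "`M` bytes per field" statement, here is a back-of-the-envelope sketch. It assumes the one-byte-per-document norms layout described in the removed text; the function names and the example figures are hypothetical.

```python
def norms_storage_bytes(num_docs: int, num_fields_with_norms: int) -> int:
    """Dense layout: one byte per document per field, however few docs have the field."""
    return num_docs * num_fields_with_norms


def wasted_fraction(num_docs: int, docs_with_field: int) -> float:
    """Share of the reserved bytes that belong to documents without the field."""
    return (num_docs - docs_with_field) / num_docs


# Hypothetical index: 10 million documents, 50 fields with norms enabled.
total = norms_storage_bytes(10_000_000, 50)
# Hypothetical sparse field present in only 100,000 of those documents.
waste = wasted_fraction(10_000_000, 100_000)
```

Under these assumptions the norms take 500,000,000 bytes in total, and 99% of the bytes reserved for the sparse field belong to documents that do not have it.
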
-
-Note that even though the most notable impact of sparsity is on storage
-requirements, it also has an impact on indexing speed and search speed since
-these bytes for documents that do not have a field still need to be written
-at index time and skipped over at search time.
-
-It is totally fine to have a minority of sparse fields in an index. But beware
-that if sparsity becomes the rule rather than the exception, then the index
-will not be as efficient as it could be.
-
-This section mostly focused on `norms` and `doc values` because those are the
-two features that are most affected by sparsity. Sparsity also affects the
-efficiency of the inverted index (used to index `text`/`keyword` fields) and
-dimensional points (used to index `geo_point` and numerics) but to a lesser
-extent.
-
-Here are some recommendations that can help avoid sparsity:
-
-[float]
-==== Avoid putting unrelated data in the same index
-
-You should avoid putting documents that have totally different structures into
-the same index in order to avoid sparsity. It is often better to put these
-documents into different indices. You could also consider giving fewer shards
-to these smaller indices since they will contain fewer documents overall.
-
-Note that this advice does not apply if you need to use
-parent/child relations between your documents since this feature is only
-supported on documents that live in the same index.
-
-[float]
-==== Normalize document structures
-
-Even if you really need to put different kinds of documents in the same index,
-maybe there are opportunities to reduce sparsity. For instance if all documents
-in the index have a timestamp field but some call it `timestamp` and others
-call it `creation_date`, it would help to rename it so that all documents have
-the same field name for the same data.
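
The renaming advice can be applied on the client side before indexing. The sketch below is one possible approach, not something prescribed by the removed text: the function `normalize_timestamp` and its parameters are hypothetical, and only the field names `timestamp` and `creation_date` come from the paragraph above.

```python
def normalize_timestamp(doc: dict, canonical: str = "timestamp",
                        aliases: tuple = ("creation_date",)) -> dict:
    """Return a copy of doc in which any alias field is renamed to the canonical name."""
    out = dict(doc)
    for alias in aliases:
        if alias in out:
            value = out.pop(alias)
            # Keep an existing canonical value if both fields are present.
            out.setdefault(canonical, value)
    return out


doc = normalize_timestamp({"creation_date": "2018-05-28", "user": "alice"})
```

After this step every document exposes the same `timestamp` field, so the index holds one dense field instead of two sparse ones.
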
-
-[float]
-==== Avoid types
-
-Types might sound like a good way to store multiple tenants in a single index.
-They are not: given that types store everything in a single index, having
-multiple types that have different fields in a single index will also cause
-problems due to sparsity as described above. If your types do not have very
-similar mappings, you might want to consider moving them to a dedicated index.
-
-[float]
-==== Disable `norms` and `doc_values` on sparse fields
-
-If none of the above recommendations apply in your case, you might want to
-check whether you actually need `norms` and `doc_values` on your sparse fields.
-`norms` can be disabled if producing scores is not necessary on a field; this is
-typically true for fields that are only used for filtering. `doc_values` can be
-disabled on fields that are neither used for sorting nor for aggregations.
-Beware that this decision should not be made lightly since these parameters
-cannot be changed on a live index, so you would have to reindex if you realize
-that you need `norms` or `doc_values`.
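
As a hedged illustration of the last recommendation, the following sketch builds a mapping fragment that turns off `norms` on a `text` field and `doc_values` on a `keyword` field. The field names `rare_comment` and `rare_tag` are hypothetical, and the exact shape of the request body around `properties` depends on the Elasticsearch version, so treat this as a fragment rather than a complete request.

```python
import json

# Hypothetical sparse fields: `rare_comment` is only filtered on, never scored;
# `rare_tag` is never used for sorting or aggregations.
properties = {
    "rare_comment": {"type": "text", "norms": False},
    "rare_tag": {"type": "keyword", "doc_values": False},
}

# Serialize the fragment as it would appear inside a mapping definition.
mapping_fragment = json.dumps({"properties": properties}, indent=2, sort_keys=True)
print(mapping_fragment)
```

Because the removed text states that these parameters cannot be changed on a live index, a fragment like this would have to be supplied when the index is created.
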
-
