Commit 530089f

Merge remote-tracking branch 'es/master' into ccr
* es/master:
  Take into account the return value of TcpTransport.readMessageLength(...) in Netty4SizeHeaderFrameDecoder
  Move caching of the size of a directory to `StoreDirectory`. (#30581)
  Clarify docs about boolean operator precedence. (#30808)
  Docs: remove notes on sparsity. (#30905)
  Fix MatchPhrasePrefixQueryBuilderTests#testPhraseOnFieldWithNoTerms
  run overflow forecast a 2nd time as regression test for elastic/ml-cpp#110 (#30969)
  Improve documentation of dynamic mappings. (#30952)
  Decouple MultiValueMode. (#31075)
  Docs: Clarify constraints on scripted similarities. (#31076)
  Update get.asciidoc (#31084)
2 parents a76dcaf + 0fad7cc commit 530089f

29 files changed, +623 -338 lines changed

docs/reference/docs/get.asciidoc

+1 -1
@@ -167,7 +167,7 @@ The result of the above get operation is:
 // TESTRESPONSE


-Field values fetched from the document it self are always returned as an array.
+Field values fetched from the document itself are always returned as an array.
 Since the `counter` field is not stored the get request simply ignores it when trying to get the `stored_fields.`

 It is also possible to retrieve metadata fields like the `_routing` field:

docs/reference/how-to/general.asciidoc

-91
@@ -40,94 +40,3 @@ better. For instance if a user searches for two words `foo` and `bar`, a match
 across different chapters is probably very poor, while a match within the same
 paragraph is likely good.

-[float]
-[[sparsity]]
-=== Avoid sparsity
-
-The data-structures behind Lucene, which Elasticsearch relies on in order to
-index and store data, work best with dense data, ie. when all documents have the
-same fields. This is especially true for fields that have norms enabled (which
-is the case for `text` fields by default) or doc values enabled (which is the
-case for numerics, `date`, `ip` and `keyword` by default).
-
-The reason is that Lucene internally identifies documents with so-called doc
-ids, which are integers between 0 and the total number of documents in the
-index. These doc ids are used for communication between the internal APIs of
-Lucene: for instance searching on a term with a `match` query produces an
-iterator of doc ids, and these doc ids are then used to retrieve the value of
-the `norm` in order to compute a score for these documents. The way this `norm`
-lookup is implemented currently is by reserving one byte for each document.
-The `norm` value for a given doc id can then be retrieved by reading the
-byte at index `doc_id`. While this is very efficient and helps Lucene quickly
-have access to the `norm` values of every document, this has the drawback that
-documents that do not have a value will also require one byte of storage.
-
-In practice, this means that if an index has `M` documents, norms will require
-`M` bytes of storage *per field*, even for fields that only appear in a small
-fraction of the documents of the index. Although slightly more complex with doc
-values due to the fact that doc values have multiple ways that they can be
-encoded depending on the type of field and on the actual data that the field
-stores, the problem is very similar. In case you wonder: `fielddata`, which was
-used in Elasticsearch pre-2.0 before being replaced with doc values, also
-suffered from this issue, except that the impact was only on the memory
-footprint since `fielddata` was not explicitly materialized on disk.
-
-Note that even though the most notable impact of sparsity is on storage
-requirements, it also has an impact on indexing speed and search speed since
-these bytes for documents that do not have a field still need to be written
-at index time and skipped over at search time.
-
-It is totally fine to have a minority of sparse fields in an index. But beware
-that if sparsity becomes the rule rather than the exception, then the index
-will not be as efficient as it could be.
-
-This section mostly focused on `norms` and `doc values` because those are the
-two features that are most affected by sparsity. Sparsity also affect the
-efficiency of the inverted index (used to index `text`/`keyword` fields) and
-dimensional points (used to index `geo_point` and numerics) but to a lesser
-extent.
-
-Here are some recommendations that can help avoid sparsity:
-
-[float]
-==== Avoid putting unrelated data in the same index
-
-You should avoid putting documents that have totally different structures into
-the same index in order to avoid sparsity. It is often better to put these
-documents into different indices, you could also consider giving fewer shards
-to these smaller indices since they will contain fewer documents overall.
-
-Note that this advice does not apply in the case that you need to use
-parent/child relations between your documents since this feature is only
-supported on documents that live in the same index.
-
-[float]
-==== Normalize document structures
-
-Even if you really need to put different kinds of documents in the same index,
-maybe there are opportunities to reduce sparsity. For instance if all documents
-in the index have a timestamp field but some call it `timestamp` and others
-call it `creation_date`, it would help to rename it so that all documents have
-the same field name for the same data.
-
-[float]
-==== Avoid types
-
-Types might sound like a good way to store multiple tenants in a single index.
-They are not: given that types store everything in a single index, having
-multiple types that have different fields in a single index will also cause
-problems due to sparsity as described above. If your types do not have very
-similar mappings, you might want to consider moving them to a dedicated index.
-
-[float]
-==== Disable `norms` and `doc_values` on sparse fields
-
-If none of the above recommendations apply in your case, you might want to
-check whether you actually need `norms` and `doc_values` on your sparse fields.
-`norms` can be disabled if producing scores is not necessary on a field, this is
-typically true for fields that are only used for filtering. `doc_values` can be
-disabled on fields that are neither used for sorting nor for aggregations.
-Beware that this decision should not be made lightly since these parameters
-cannot be changed on a live index, so you would have to reindex if you realize
-that you need `norms` or `doc_values`.
-

docs/reference/index-modules/similarity.asciidoc

+12 -2
@@ -341,7 +341,18 @@ Which yields:
 // TESTRESPONSE[s/"took": 12/"took" : $body.took/]
 // TESTRESPONSE[s/OzrdjxNtQGaqs4DmioFw9A/$body.hits.hits.0._node/]

-You might have noticed that a significant part of the script depends on
+WARNING: While scripted similarities provide a lot of flexibility, there is
+a set of rules that they need to satisfy. Failing to do so could make
+Elasticsearch silently return wrong top hits or fail with internal errors at
+search time:
+
+- Returned scores must be positive.
+- All other variables remaining equal, scores must not decrease when
+`doc.freq` increases.
+- All other variables remaining equal, scores must not increase when
+`doc.length` increases.
+
+You might have noticed that a significant part of the above script depends on
 statistics that are the same for every document. It is possible to make the
 above slightly more efficient by providing an `weight_script` which will
 compute the document-independent part of the score and will be available
@@ -506,7 +517,6 @@ GET /index/_search?explain=true

 ////////////////////

-
 Type name: `scripted`

 [float]
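
Review note: the three rules in the new WARNING can be spot-checked numerically before a scoring formula is turned into a script. The sketch below is only an illustration in plain Java, not Elasticsearch or Painless code; `SimilarityRuleCheck`, `exampleScore` and `satisfiesRules` are hypothetical names, and passing the check over a finite grid is evidence, not proof, that a formula obeys the rules.

```java
import java.util.function.DoubleBinaryOperator;

/** Hypothetical helper for spot-checking a scoring formula; not part of Elasticsearch. */
final class SimilarityRuleCheck {

    /** Illustrative formula only: term frequency dampened by document length. */
    static double exampleScore(double freq, double length) {
        return freq / (freq + 1.0 + 0.5 * length);
    }

    /**
     * Checks the three rules on a grid of (freq, length) pairs:
     * scores are positive, do not decrease when freq grows,
     * and do not increase when length grows.
     */
    static boolean satisfiesRules(DoubleBinaryOperator score, int maxFreq, int maxLength) {
        for (int freq = 1; freq <= maxFreq; freq++) {
            for (int length = 1; length <= maxLength; length++) {
                double s = score.applyAsDouble(freq, length);
                if (!(s > 0)) {
                    return false; // rule 1: positive scores
                }
                if (score.applyAsDouble(freq + 1, length) < s) {
                    return false; // rule 2: non-decreasing in freq
                }
                if (score.applyAsDouble(freq, length + 1) > s) {
                    return false; // rule 3: non-increasing in length
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(satisfiesRules(SimilarityRuleCheck::exampleScore, 20, 50));
        // A formula that grows with document length violates rule 3.
        System.out.println(satisfiesRules((freq, length) -> freq * length, 20, 50));
    }
}
```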

docs/reference/mapping/dynamic/field-mapping.asciidoc

+1 -1
@@ -135,6 +135,6 @@ PUT my_index/_doc/1
 }
 --------------------------------------------------
 // CONSOLE
-<1> The `my_float` field is added as a <<number,`double`>> field.
+<1> The `my_float` field is added as a <<number,`float`>> field.
 <2> The `my_integer` field is added as a <<number,`long`>> field.


docs/reference/mapping/dynamic/templates.asciidoc

+16 -5
@@ -46,11 +46,22 @@ name as an existing template, it will replace the old version.
 [[match-mapping-type]]
 ==== `match_mapping_type`

-The `match_mapping_type` matches on the datatype detected by
-<<dynamic-field-mapping,dynamic field mapping>>, in other words, the datatype
-that Elasticsearch thinks the field should have. Only the following datatypes
-can be automatically detected: `boolean`, `date`, `double`, `long`, `object`,
-`string`. It also accepts `*` to match all datatypes.
+The `match_mapping_type` is the datatype detected by the json parser. Since
+JSON doesn't allow to distinguish a `long` from an `integer` or a `double` from
+a `float`, it will always choose the wider datatype, ie. `long` for integers
+and `double` for floating-point numbers.
+
+The following datatypes may be automatically detected:
+
+- `boolean` when `true` or `false` are encountered.
+- `date` when <<date-detection,date detection>> is enabled and a string is
+found that matches any of the configured date formats.
+- `double` for numbers with a decimal part.
+- `long` for numbers without a decimal part.
+- `object` for objects, also called hashes.
+- `string` for character strings.
+
+`*` may also be used in order to match all datatypes.

 For example, if we wanted to map all integer fields as `integer` instead of
 `long`, and all `string` fields as both `text` and `keyword`, we
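
Review note: the widening rule described in the new text can be pictured as a function from an already-parsed JSON value to a `match_mapping_type` name. The sketch below is a rough illustration under stated assumptions, not the Elasticsearch implementation: `DynamicTypeSketch` and `detect` are hypothetical names, JSON values are assumed to arrive as plain Java objects, and date detection is left out because it depends on the configured date formats.

```java
import java.util.Map;

/** Hypothetical illustration of the detection rules listed above; not Elasticsearch code. */
final class DynamicTypeSketch {

    /** Maps a parsed JSON value to the datatype name it would be matched as. */
    static String detect(Object jsonValue) {
        if (jsonValue instanceof Boolean) {
            return "boolean";
        }
        // JSON cannot tell an integer from a long, so the wider type is chosen.
        if (jsonValue instanceof Integer || jsonValue instanceof Long) {
            return "long";
        }
        // Likewise for float versus double.
        if (jsonValue instanceof Float || jsonValue instanceof Double) {
            return "double";
        }
        if (jsonValue instanceof Map) {
            return "object";
        }
        if (jsonValue instanceof String) {
            // Date detection is intentionally omitted in this sketch.
            return "string";
        }
        throw new IllegalArgumentException("unsupported value: " + jsonValue);
    }

    public static void main(String[] args) {
        System.out.println(detect(5));
        System.out.println(detect(1.5f));
        System.out.println(detect(Map.of("key", "value")));
    }
}
```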

docs/reference/query-dsl/query-string-syntax.asciidoc

+4 -20
@@ -233,26 +233,10 @@ states that:
 * `news` must not be present
 * `quick` and `brown` are optional -- their presence increases the relevance

-The familiar operators `AND`, `OR` and `NOT` (also written `&&`, `||` and `!`)
-are also supported. However, the effects of these operators can be more
-complicated than is obvious at first glance. `NOT` takes precedence over
-`AND`, which takes precedence over `OR`. While the `+` and `-` only affect
-the term to the right of the operator, `AND` and `OR` can affect the terms to
-the left and right.
-
-****
-Rewriting the above query using `AND`, `OR` and `NOT` demonstrates the
-complexity:
-
-`quick OR brown AND fox AND NOT news`::
-
-This is incorrect, because `brown` is now a required term.
-
-`(quick OR brown) AND fox AND NOT news`::
-
-This is incorrect because at least one of `quick` or `brown` is now required
-and the search for those terms would be scored differently from the original
-query.
+The familiar boolean operators `AND`, `OR` and `NOT` (also written `&&`, `||`
+and `!`) are also supported but beware that they do not honor the usual
+precedence rules, so parentheses should be used whenever multiple operators are
+used together. For instance the previous query could be rewritten as:

 `((quick AND fox) OR (brown AND fox) OR fox) AND NOT news`::


modules/aggs-matrix-stats/src/main/java/org/elasticsearch/search/aggregations/support/MultiValuesSource.java

+1 -1
@@ -47,7 +47,7 @@ public NumericDoubleValues getField(final int ordinal, LeafReaderContext ctx) th
             if (ordinal > names.length) {
                 throw new IndexOutOfBoundsException("ValuesSource array index " + ordinal + " out of bounds");
             }
-            return multiValueMode.select(values[ordinal].doubleValues(ctx), Double.NEGATIVE_INFINITY);
+            return multiValueMode.select(values[ordinal].doubleValues(ctx));
         }
     }


modules/lang-expression/src/main/java/org/elasticsearch/script/expression/DateMethodValueSource.java

+1 -1
@@ -54,7 +54,7 @@ class DateMethodValueSource extends FieldDataValueSource {
     public FunctionValues getValues(Map context, LeafReaderContext leaf) throws IOException {
         AtomicNumericFieldData leafData = (AtomicNumericFieldData) fieldData.load(leaf);
         final Calendar calendar = Calendar.getInstance(TimeZone.getTimeZone("UTC"), Locale.ROOT);
-        NumericDoubleValues docValues = multiValueMode.select(leafData.getDoubleValues(), 0d);
+        NumericDoubleValues docValues = multiValueMode.select(leafData.getDoubleValues());
         return new DoubleDocValues(this) {
             @Override
             public double doubleVal(int docId) throws IOException {

modules/lang-expression/src/main/java/org/elasticsearch/script/expression/DateObjectValueSource.java

+1 -1
@@ -56,7 +56,7 @@ class DateObjectValueSource extends FieldDataValueSource {
     public FunctionValues getValues(Map context, LeafReaderContext leaf) throws IOException {
         AtomicNumericFieldData leafData = (AtomicNumericFieldData) fieldData.load(leaf);
         MutableDateTime joda = new MutableDateTime(0, DateTimeZone.UTC);
-        NumericDoubleValues docValues = multiValueMode.select(leafData.getDoubleValues(), 0d);
+        NumericDoubleValues docValues = multiValueMode.select(leafData.getDoubleValues());
         return new DoubleDocValues(this) {
             @Override
             public double doubleVal(int docId) throws IOException {

modules/lang-expression/src/main/java/org/elasticsearch/script/expression/FieldDataValueSource.java

+1 -1
@@ -68,7 +68,7 @@ public int hashCode() {
     @SuppressWarnings("rawtypes") // ValueSource uses a rawtype
     public FunctionValues getValues(Map context, LeafReaderContext leaf) throws IOException {
         AtomicNumericFieldData leafData = (AtomicNumericFieldData) fieldData.load(leaf);
-        NumericDoubleValues docValues = multiValueMode.select(leafData.getDoubleValues(), 0d);
+        NumericDoubleValues docValues = multiValueMode.select(leafData.getDoubleValues());
         return new DoubleDocValues(this) {
             @Override
             public double doubleVal(int doc) throws IOException {

modules/transport-netty4/src/main/java/org/elasticsearch/transport/netty4/Netty4SizeHeaderFrameDecoder.java

+11 -9
@@ -37,17 +37,19 @@ final class Netty4SizeHeaderFrameDecoder extends ByteToMessageDecoder {
     protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) throws Exception {
         try {
             BytesReference networkBytes = Netty4Utils.toBytesReference(in);
-            int messageLength = TcpTransport.readMessageLength(networkBytes) + HEADER_SIZE;
-            // If the message length is -1, we have not read a complete header. If the message length is
-            // greater than the network bytes available, we have not read a complete frame.
-            if (messageLength != -1 && messageLength <= networkBytes.length()) {
-                final ByteBuf message = in.skipBytes(HEADER_SIZE);
-                // 6 bytes would mean it is a ping. And we should ignore.
-                if (messageLength != 6) {
-                    out.add(message);
+            int messageLength = TcpTransport.readMessageLength(networkBytes);
+            // If the message length is -1, we have not read a complete header.
+            if (messageLength != -1) {
+                int messageLengthWithHeader = messageLength + HEADER_SIZE;
+                // If the message length is greater than the network bytes available, we have not read a complete frame.
+                if (messageLengthWithHeader <= networkBytes.length()) {
+                    final ByteBuf message = in.skipBytes(HEADER_SIZE);
+                    // 6 bytes would mean it is a ping. And we should ignore.
+                    if (messageLengthWithHeader != 6) {
+                        out.add(message);
+                    }
                 }
             }
-
         } catch (IllegalArgumentException ex) {
             throw new TooLongFrameException(ex);
         }
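
Review note: the removed code added `HEADER_SIZE` to the return value before comparing it with the `-1` sentinel, so the "incomplete header" case could never match. The sketch below isolates that arithmetic; it is a simplified stand-in rather than the decoder itself. `FrameLengthCheck`, `oldAccepts` and `newAccepts` are hypothetical names, and the value `6` for `HEADER_SIZE` is an assumption inferred from the ping comment in the diff.

```java
/** Simplified stand-in for the length check in the diff above; not the real decoder. */
final class FrameLengthCheck {

    // Assumed header size, inferred from the "6 bytes would mean it is a ping" comment.
    static final int HEADER_SIZE = 6;

    // Sentinel returned by the length reader when the header is incomplete.
    static final int INCOMPLETE = -1;

    /** Old shape: the sentinel is shifted by HEADER_SIZE before it is tested. */
    static boolean oldAccepts(int readLength, int availableBytes) {
        int messageLength = readLength + HEADER_SIZE;
        return messageLength != INCOMPLETE && messageLength <= availableBytes;
    }

    /** New shape: the sentinel is tested first, then the header size is added. */
    static boolean newAccepts(int readLength, int availableBytes) {
        if (readLength == INCOMPLETE) {
            return false;
        }
        int messageLengthWithHeader = readLength + HEADER_SIZE;
        return messageLengthWithHeader <= availableBytes;
    }

    public static void main(String[] args) {
        // With 5 bytes buffered the header is incomplete, yet -1 + 6 = 5 passes the old check.
        System.out.println(oldAccepts(INCOMPLETE, 5)); // true
        System.out.println(newAccepts(INCOMPLETE, 5)); // false
    }
}
```

In the old shape, a buffer holding exactly five bytes would be treated as a complete frame, and the subsequent attempt to skip the header would read past the available data.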

server/src/main/java/org/elasticsearch/common/util/SingleObjectCache.java

+5
@@ -64,6 +64,11 @@ public T getOrRefresh() {
         return cached;
     }

+    /** Return the potentially stale cached entry. */
+    protected final T getNoRefresh() {
+        return cached;
+    }
+
     /**
      * Returns a new instance to cache
      */

server/src/main/java/org/elasticsearch/index/fielddata/FieldData.java

+59 -32
@@ -291,38 +291,6 @@ public static boolean isMultiValued(SortedSetDocValues values) {
         return DocValues.unwrapSingleton(values) == null;
     }

-    /**
-     * Returns whether the provided values *might* be multi-valued. There is no
-     * guarantee that this method will return {@code false} in the single-valued case.
-     */
-    public static boolean isMultiValued(SortedNumericDocValues values) {
-        return DocValues.unwrapSingleton(values) == null;
-    }
-
-    /**
-     * Returns whether the provided values *might* be multi-valued. There is no
-     * guarantee that this method will return {@code false} in the single-valued case.
-     */
-    public static boolean isMultiValued(SortedNumericDoubleValues values) {
-        return unwrapSingleton(values) == null;
-    }
-
-    /**
-     * Returns whether the provided values *might* be multi-valued. There is no
-     * guarantee that this method will return {@code false} in the single-valued case.
-     */
-    public static boolean isMultiValued(SortedBinaryDocValues values) {
-        return unwrapSingleton(values) != null;
-    }
-
-    /**
-     * Returns whether the provided values *might* be multi-valued. There is no
-     * guarantee that this method will return {@code false} in the single-valued case.
-     */
-    public static boolean isMultiValued(MultiGeoPointValues values) {
-        return unwrapSingleton(values) == null;
-    }
-
     /**
      * Return a {@link String} representation of the provided values. That is
      * typically used for scripts or for the `map` execution mode of terms aggs.
@@ -555,4 +523,63 @@ public long nextValue() throws IOException {
         }

     }
+
+    /**
+     * Return a {@link NumericDocValues} instance that has a value for every
+     * document, returns the same value as {@code values} if there is a value
+     * for the current document and {@code missing} otherwise.
+     */
+    public static NumericDocValues replaceMissing(NumericDocValues values, long missing) {
+        return new AbstractNumericDocValues() {
+
+            private long value;
+
+            @Override
+            public int docID() {
+                return values.docID();
+            }
+
+            @Override
+            public boolean advanceExact(int target) throws IOException {
+                if (values.advanceExact(target)) {
+                    value = values.longValue();
+                } else {
+                    value = missing;
+                }
+                return true;
+            }
+
+            @Override
+            public long longValue() throws IOException {
+                return value;
+            }
+        };
+    }
+
+    /**
+     * Return a {@link NumericDoubleValues} instance that has a value for every
+     * document, returns the same value as {@code values} if there is a value
+     * for the current document and {@code missing} otherwise.
+     */
+    public static NumericDoubleValues replaceMissing(NumericDoubleValues values, double missing) {
+        return new NumericDoubleValues() {
+
+            private double value;
+
+            @Override
+            public boolean advanceExact(int target) throws IOException {
+                if (values.advanceExact(target)) {
+                    value = values.doubleValue();
+                } else {
+                    value = missing;
+                }
+                return true;
+            }
+
+            @Override
+            public double doubleValue() throws IOException {
+                return value;
+            }
+        };
+    }
 }
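
Review note: `replaceMissing` is a small decorator: it wraps an iterator that may lack a value for some documents and always reports a value, falling back to `missing`. The sketch below shows the same pattern against a hypothetical, simplified `DoubleValues` interface so that it runs without Lucene or Elasticsearch on the classpath; `ReplaceMissingSketch`, `DoubleValues` and `fromMap` are illustrative names only.

```java
import java.util.Map;

/** Standalone illustration of the replaceMissing decorator; not the Elasticsearch classes. */
final class ReplaceMissingSketch {

    /** Hypothetical, simplified stand-in for a per-document numeric values iterator. */
    interface DoubleValues {
        /** Positions on {@code doc}; returns whether that document has a value. */
        boolean advanceExact(int doc);

        /** The value of the current document; only valid after advanceExact returned true. */
        double doubleValue();
    }

    /** Test fixture: values backed by a map from doc id to value. */
    static DoubleValues fromMap(Map<Integer, Double> byDoc) {
        return new DoubleValues() {
            private double current;

            @Override
            public boolean advanceExact(int doc) {
                Double v = byDoc.get(doc);
                if (v == null) {
                    return false;
                }
                current = v;
                return true;
            }

            @Override
            public double doubleValue() {
                return current;
            }
        };
    }

    /** Same shape as the method in the diff: every document now reports a value. */
    static DoubleValues replaceMissing(DoubleValues values, double missing) {
        return new DoubleValues() {
            private double value;

            @Override
            public boolean advanceExact(int doc) {
                value = values.advanceExact(doc) ? values.doubleValue() : missing;
                return true;
            }

            @Override
            public double doubleValue() {
                return value;
            }
        };
    }

    public static void main(String[] args) {
        DoubleValues wrapped = replaceMissing(fromMap(Map.of(0, 1.5)), -1.0);
        wrapped.advanceExact(0);
        System.out.println(wrapped.doubleValue()); // 1.5
        wrapped.advanceExact(1);
        System.out.println(wrapped.doubleValue()); // -1.0
    }
}
```

Keeping the fallback in a wrapper, as this commit does, lets `select` stay unaware of missing values; callers that need a default, such as the sort comparator, opt in explicitly.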

server/src/main/java/org/elasticsearch/index/fielddata/fieldcomparator/DoubleValuesComparatorSource.java

+2 -1
@@ -27,6 +27,7 @@
 import org.apache.lucene.search.SortField;
 import org.apache.lucene.util.BitSet;
 import org.elasticsearch.common.Nullable;
+import org.elasticsearch.index.fielddata.FieldData;
 import org.elasticsearch.index.fielddata.IndexFieldData;
 import org.elasticsearch.index.fielddata.IndexNumericFieldData;
 import org.elasticsearch.index.fielddata.NumericDoubleValues;
@@ -71,7 +72,7 @@ protected NumericDocValues getNumericDocValues(LeafReaderContext context, String
                 final SortedNumericDoubleValues values = getValues(context);
                 final NumericDoubleValues selectedValues;
                 if (nested == null) {
-                    selectedValues = sortMode.select(values, dMissingValue);
+                    selectedValues = FieldData.replaceMissing(sortMode.select(values), dMissingValue);
                 } else {
                     final BitSet rootDocs = nested.rootDocs(context);
                     final DocIdSetIterator innerDocs = nested.innerDocs(context);
