Commit d326beb

Introduce combined_fields query (elastic#71213)
This PR introduces a new query called `combined_fields` for searching multiple text fields. It takes a term-centric view, first analyzing the query string into individual terms, then searching for each term in any of the fields as though they were one combined field. It is based on Lucene's `CombinedFieldQuery`, which takes a principled approach to scoring based on the BM25F formula. This query provides an alternative to the `cross_fields` `multi_match` mode. It has simpler behavior and a more robust approach to scoring. Addresses elastic#41106.
1 parent c1385fb commit d326beb

35 files changed: +2345 -136 lines changed
Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
1+
[[query-dsl-combined-fields-query]]
2+
=== Combined fields
3+
++++
4+
<titleabbrev>Combined fields</titleabbrev>
5+
++++
6+
7+
The `combined_fields` query supports searching multiple text fields as if their
8+
contents had been indexed into one combined field. It takes a term-centric
9+
view of the query: first it analyzes the query string into individual terms,
10+
then looks for each term in any of the fields. This query is particularly
11+
useful when a match could span multiple text fields, for example the `title`,
12+
`abstract` and `body` of an article:
13+
14+
[source,console]
15+
--------------------------------------------------
16+
GET /_search
17+
{
18+
"query": {
19+
"combined_fields" : {
20+
"query": "database systems",
21+
"fields": [ "title", "abstract", "body"],
22+
"operator": "and"
23+
}
24+
}
25+
}
26+
--------------------------------------------------
27+
28+
The `combined_fields` query takes a principled approach to scoring based on the
29+
simple BM25F formula described in
30+
http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf[The Probabilistic Relevance Framework: BM25 and Beyond].
31+
When scoring matches, the query combines term and collection statistics across
32+
fields. This allows it to score each match as if the specified fields had been
33+
indexed into a single combined field. (Note that this is a best attempt --
34+
`combined_fields` makes some approximations and scores will not obey this
35+
model perfectly.)
36+
37+
[WARNING]
38+
.Field number limit
39+
===================================================
40+
There is a limit on the number of fields that can be queried at once. It is
41+
defined by the `indices.query.bool.max_clause_count` <<search-settings>>
42+
which defaults to 1024.
43+
===================================================
44+
45+
==== Per-field boosting
46+
47+
Individual fields can be boosted with the caret (`^`) notation:
48+
49+
[source,console]
50+
--------------------------------------------------
51+
GET /_search
52+
{
53+
"query": {
54+
"combined_fields" : {
55+
"query" : "distributed consensus",
56+
"fields" : [ "title^2", "body" ] <1>
57+
}
58+
}
59+
}
60+
--------------------------------------------------
61+
62+
Field boosts are interpreted according to the combined field model. For example,
63+
if the `title` field has a boost of 2, the score is calculated as if each term
64+
in the title appeared twice in the synthetic combined field.
65+
66+
NOTE: The `combined_fields` query requires that field boosts are greater than
67+
or equal to 1.0. Field boosts are allowed to be fractional.
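
The boost interpretation above can be illustrated with a minimal Python sketch. It is not the Lucene or Elasticsearch implementation: the function name `combined_field_stats` is hypothetical, whitespace splitting stands in for the real analyzer, and the sketch assumes that field lengths are weighted by the boost in the same way as term frequencies, which is what "each term in the title appeared twice" implies.

```python
from collections import Counter


def combined_field_stats(doc, boosts, term):
    """Return (term frequency, length) of a synthetic combined field.

    `doc` maps field names to text and `boosts` maps field names to boosts.
    Tokenization is plain whitespace splitting, an assumption made only
    for illustration; the real query uses the fields' search analyzer.
    """
    tf = 0.0
    length = 0.0
    for field, boost in boosts.items():
        if boost < 1.0:
            # Mirrors the documented restriction on field boosts.
            raise ValueError("field boosts must be >= 1.0")
        tokens = doc.get(field, "").lower().split()
        # A boost of 2 counts every token of the field twice.
        tf += boost * Counter(tokens)[term]
        length += boost * len(tokens)
    return tf, length


doc = {
    "title": "distributed consensus",
    "body": "consensus in a distributed system",
}
tf, length = combined_field_stats(doc, {"title": 2.0, "body": 1.0}, "consensus")
print(tf, length)
```

With `title^2`, the single occurrence of `consensus` in the title contributes 2 to the combined term frequency and the two title tokens contribute 4 to the combined length. A BM25-style scorer would then consume these combined statistics in place of per-field ones.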
68+
69+
[[combined-field-top-level-params]]
70+
==== Top-level parameters for `combined_fields`
71+
72+
`fields`::
73+
(Required, array of strings) List of fields to search. Field wildcard patterns
74+
are allowed. Only <<text,`text`>> fields are supported, and they must all have
75+
the same search <<analyzer,`analyzer`>>.
76+
77+
`query`::
78+
+
79+
--
80+
(Required, string) Text to search for in the provided `<fields>`.
81+
82+
The `combined_fields` query <<analysis,analyzes>> the provided text before
83+
performing a search.
84+
--
85+
86+
`auto_generate_synonyms_phrase_query`::
87+
+
88+
--
89+
(Optional, Boolean) If `true`, <<query-dsl-match-query-phrase,match phrase>>
90+
queries are automatically created for multi-term synonyms. Defaults to `true`.
91+
92+
See <<query-dsl-match-query-synonyms,Use synonyms with match query>> for an
93+
example.
94+
--
95+
96+
`operator`::
97+
+
98+
--
99+
(Optional, string) Boolean logic used to interpret text in the `query` value.
100+
Valid values are:
101+
102+
`or` (Default)::
103+
For example, a `query` value of `database systems` is interpreted as `database
104+
OR systems`.
105+
106+
`and`::
107+
For example, a `query` value of `database systems` is interpreted as `database
108+
AND systems`.
109+
--
110+
111+
`minimum_should_match`::
112+
+
113+
--
114+
(Optional, string) Minimum number of clauses that must match for a document to
115+
be returned. See the <<query-dsl-minimum-should-match, `minimum_should_match`
116+
parameter>> for valid values and more information.
117+
--
118+
119+
`zero_terms_query`::
120+
+
121+
--
122+
(Optional, string) Indicates whether no documents are returned if the `analyzer`
123+
removes all tokens, such as when using a `stop` filter. Valid values are:
124+
125+
`none` (Default)::
126+
No documents are returned if the `analyzer` removes all tokens.
127+
128+
`all`::
129+
Returns all documents, similar to a <<query-dsl-match-all-query,`match_all`>>
130+
query.
131+
132+
See <<query-dsl-match-query-zero>> for an example.
133+
--
134+
135+
===== Comparison to `multi_match` query
136+
137+
The `combined_fields` query provides a principled way of matching and scoring
138+
across multiple <<text, `text`>> fields. To support this, it requires that all
139+
fields have the same search <<analyzer,`analyzer`>>.
140+
141+
If you want a single query that handles fields of different types like
142+
keywords or numbers, then the <<query-dsl-multi-match-query,`multi_match`>>
143+
query may be a better fit. It supports both text and non-text fields, and
144+
accepts text fields that do not share the same analyzer.
145+
146+
The main `multi_match` modes `best_fields` and `most_fields` take a
147+
field-centric view of the query. In contrast, `combined_fields` is
148+
term-centric: `operator` and `minimum_should_match` are applied per-term,
149+
instead of per-field. Concretely, a query like
150+
151+
[source,console]
152+
--------------------------------------------------
153+
GET /_search
154+
{
155+
"query": {
156+
"combined_fields" : {
157+
"query": "database systems",
158+
"fields": [ "title", "abstract"],
159+
"operator": "and"
160+
}
161+
}
162+
}
163+
--------------------------------------------------
164+
165+
is executed as
166+
167+
+(combined("database", fields:["title", "abstract"]))
168+
+(combined("systems", fields:["title", "abstract"]))
169+
170+
In other words, each term must be present in at least one field for a
171+
document to match.
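
The difference between term-centric and field-centric matching can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, whitespace splitting stands in for the analyzer, and scoring is ignored entirely.

```python
def _tokens(doc, field):
    # Whitespace tokenization is an assumption made for illustration.
    return doc.get(field, "").lower().split()


def matches_term_centric(doc, fields, query, operator="or"):
    """Each term may be satisfied by any field (combined_fields behavior)."""
    terms = query.lower().split()
    if not terms:
        # Corresponds to the default zero_terms_query value of `none`.
        return False
    hits = [any(t in _tokens(doc, f) for f in fields) for t in terms]
    return all(hits) if operator == "and" else any(hits)


def matches_field_centric(doc, fields, query, operator="or"):
    """The operator is applied within each field, then fields are OR-ed."""
    terms = query.lower().split()
    if not terms:
        return False
    combine = all if operator == "and" else any
    return any(combine(t in _tokens(doc, f) for t in terms) for f in fields)


doc = {"title": "database design", "abstract": "distributed systems"}
fields = ["title", "abstract"]
print(matches_term_centric(doc, fields, "database systems", "and"))
print(matches_field_centric(doc, fields, "database systems", "and"))
```

In this example `database` appears only in the title and `systems` only in the abstract, so the term-centric check matches while the field-centric check does not, because no single field contains both terms.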
172+
173+
The `cross_fields` `multi_match` mode also takes a term-centric approach and
174+
applies `operator` and `minimum_should_match` per-term. The main advantage of
175+
`combined_fields` over `cross_fields` is its robust and interpretable approach
176+
to scoring based on the BM25F algorithm.
177+
178+
[NOTE]
179+
.Custom similarities
180+
===================================================
181+
The `combined_fields` query currently only supports the `BM25` similarity
182+
(which is the default unless a <<index-modules-similarity, custom similarity>>
183+
is configured). <<similarity, Per-field similarities>> are also not allowed.
184+
Using `combined_fields` in either of these cases will result in an error.
185+
===================================================

docs/reference/query-dsl/full-text-queries.asciidoc

Lines changed: 10 additions & 9 deletions
@@ -1,9 +1,9 @@
11
[[full-text-queries]]
22
== Full text queries
33

4-
The full text queries enable you to search <<analysis,analyzed text fields>> such as the
5-
body of an email. The query string is processed using the same analyzer that was applied to
6-
the field during indexing.
4+
The full text queries enable you to search <<analysis,analyzed text fields>> such as the
5+
body of an email. The query string is processed using the same analyzer that was applied to
6+
the field during indexing.
77

88
The queries in this group are:
99

@@ -21,16 +21,15 @@ the last term, which is matched as a `prefix` query
2121

2222
<<query-dsl-match-query-phrase,`match_phrase` query>>::
2323
Like the `match` query but used for matching exact phrases or word proximity matches.
24-
24+
2525
<<query-dsl-match-query-phrase-prefix,`match_phrase_prefix` query>>::
2626
Like the `match_phrase` query, but does a wildcard search on the final word.
27-
27+
2828
<<query-dsl-multi-match-query,`multi_match` query>>::
2929
The multi-field version of the `match` query.
3030

31-
<<query-dsl-common-terms-query,`common` terms query>>::
32-
33-
A more specialized query which gives more preference to uncommon words.
31+
<<query-dsl-combined-fields-query,`combined_fields` query>>::
32+
Matches over multiple fields as if they had been indexed into one combined field.
3433

3534
<<query-dsl-query-string-query,`query_string` query>>::
3635
Supports the compact Lucene <<query-string-syntax,query string syntax>>,
@@ -52,10 +51,12 @@ include::match-phrase-query.asciidoc[]
5251

5352
include::match-phrase-prefix-query.asciidoc[]
5453

54+
include::combined-fields-query.asciidoc[]
55+
5556
include::multi-match-query.asciidoc[]
5657

5758
include::common-terms-query.asciidoc[]
5859

5960
include::query-string-query.asciidoc[]
6061

61-
include::simple-query-string-query.asciidoc[]
62+
include::simple-query-string-query.asciidoc[]

docs/reference/query-dsl/multi-match-query.asciidoc

Lines changed: 10 additions & 1 deletion
@@ -192,7 +192,10 @@ This query is executed as:
192192
In other words, *all terms* must be present *in a single field* for a document
193193
to match.
194194
195-
See <<type-cross-fields>> for a better solution.
195+
The <<query-dsl-combined-fields-query, `combined_fields`>> query offers a
196+
term-centric approach that handles `operator` and `minimum_should_match` on a
197+
per-term basis. The other multi-match mode <<type-cross-fields>> also
198+
addresses this issue.
196199
197200
===================================================
198201

@@ -388,6 +391,12 @@ Also, accepts `analyzer`, `boost`, `operator`, `minimum_should_match`,
388391
`lenient`, `zero_terms_query` and `cutoff_frequency`, as explained in
389392
<<query-dsl-match-query, match query>>.
390393

394+
WARNING: The `cross_fields` type blends field statistics in a way that does
395+
not always produce well-formed scores (for example scores can become
396+
negative). As an alternative, you can consider the
397+
<<query-dsl-combined-fields-query,`combined_fields`>> query, which is also
398+
term-centric but combines field statistics in a more robust way.
399+
391400
[[cross-field-analysis]]
392401
===== `cross_field` and analysis
393402

rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search.highlight/10_unified.yml

Lines changed: 19 additions & 3 deletions
@@ -24,11 +24,27 @@ setup:
2424
indices.refresh: {}
2525

2626
---
27-
"Basic":
27+
"Basic multi_match query":
2828
- do:
2929
search:
30-
rest_total_hits_as_int: true
31-
body: { "query" : {"multi_match" : { "query" : "quick brown fox", "fields" : [ "text*"] } }, "highlight" : { "type" : "unified", "fields" : { "*" : {} } } }
30+
body: {
31+
"query" : { "multi_match" : { "query" : "quick brown fox", "fields" : [ "text*"] } },
32+
"highlight" : { "type" : "unified", "fields" : { "*" : {} } } }
33+
34+
- match: {hits.hits.0.highlight.text.0: "The <em>quick</em> <em>brown</em> <em>fox</em> is <em>brown</em>."}
35+
- match: {hits.hits.0.highlight.text\.fvh.0: "The <em>quick</em> <em>brown</em> <em>fox</em> is <em>brown</em>."}
36+
- match: {hits.hits.0.highlight.text\.postings.0: "The <em>quick</em> <em>brown</em> <em>fox</em> is <em>brown</em>."}
37+
38+
---
39+
"Basic combined_fields query":
40+
- skip:
41+
version: " - 7.99.99"
42+
reason: "combined fields query is not yet backported"
43+
- do:
44+
search:
45+
body: {
46+
"query" : { "combined_fields" : { "query" : "quick brown fox", "fields" : [ "text*"] } },
47+
"highlight" : { "type" : "unified", "fields" : { "*" : {} } } }
3248

3349
- match: {hits.hits.0.highlight.text.0: "The <em>quick</em> <em>brown</em> <em>fox</em> is <em>brown</em>."}
3450
- match: {hits.hits.0.highlight.text\.fvh.0: "The <em>quick</em> <em>brown</em> <em>fox</em> is <em>brown</em>."}
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
1+
setup:
2+
- do:
3+
indices.create:
4+
index: test
5+
body:
6+
mappings:
7+
properties:
8+
title:
9+
type: text
10+
abstract:
11+
type: text
12+
body:
13+
type: text
14+
15+
- do:
16+
index:
17+
index: test
18+
id: 1
19+
body:
20+
title: "Time, Clocks and the Ordering of Events in a Distributed System"
21+
abstract: "The concept of one event happening before another..."
22+
body: "The concept of time is fundamental to our way of thinking..."
23+
refresh: true
24+
25+
---
26+
"Test combined_fields query":
27+
- skip:
28+
version: " - 7.99.99"
29+
reason: "combined fields query is not yet backported"
30+
- do:
31+
search:
32+
index: test
33+
body:
34+
query:
35+
combined_fields:
36+
query: "time event"
37+
fields: ["abstract", "body"]
38+
operator: "and"
39+
40+
- match: { hits.total.value: 1 }
41+
- match: { hits.hits.0._id: "1" }
42+

server/src/internalClusterTest/java/org/elasticsearch/index/search/MatchPhraseQueryIT.java

Lines changed: 3 additions & 3 deletions
@@ -13,7 +13,7 @@
1313
import org.elasticsearch.action.search.SearchResponse;
1414
import org.elasticsearch.common.settings.Settings;
1515
import org.elasticsearch.index.query.MatchPhraseQueryBuilder;
16-
import org.elasticsearch.index.search.MatchQueryParser.ZeroTermsQuery;
16+
import org.elasticsearch.index.query.ZeroTermsQueryOption;
1717
import org.elasticsearch.test.ESIntegTestCase;
1818
import org.junit.Before;
1919

@@ -47,11 +47,11 @@ public void testZeroTermsQuery() throws ExecutionException, InterruptedException
4747
MatchPhraseQueryBuilder baseQuery = matchPhraseQuery("name", "the who")
4848
.analyzer("standard_stopwords");
4949

50-
MatchPhraseQueryBuilder matchNoneQuery = baseQuery.zeroTermsQuery(ZeroTermsQuery.NONE);
50+
MatchPhraseQueryBuilder matchNoneQuery = baseQuery.zeroTermsQuery(ZeroTermsQueryOption.NONE);
5151
SearchResponse matchNoneResponse = client().prepareSearch(INDEX).setQuery(matchNoneQuery).get();
5252
assertHitCount(matchNoneResponse, 0L);
5353

54-
MatchPhraseQueryBuilder matchAllQuery = baseQuery.zeroTermsQuery(ZeroTermsQuery.ALL);
54+
MatchPhraseQueryBuilder matchAllQuery = baseQuery.zeroTermsQuery(ZeroTermsQueryOption.ALL);
5555
SearchResponse matchAllResponse = client().prepareSearch(INDEX).setQuery(matchAllQuery).get();
5656
assertHitCount(matchAllResponse, 2L);
5757
}
