Commit 458bca1

Add a feature_vector field. (#31102)
This field is similar to the `feature` field but is better suited to index sparse feature vectors. A use-case for this field could be to record topics associated with every document alongside a metric that quantifies how well the topic is connected to this document, and then boost queries based on the topics that the logged-in user is interested in. Relates #27552
1 parent 75a676c commit 458bca1

12 files changed: +633 -46 lines changed

docs/reference/mapping/types.asciidoc

Lines changed: 3 additions & 1 deletion
@@ -42,6 +42,8 @@ string:: <<text,`text`>> and <<keyword,`keyword`>>
 
 <<feature>>:: Record numeric features to boost hits at query time.
 
+<<feature-vector>>:: Record numeric feature vectors to boost hits at query time.
+
 [float]
 === Multi-fields
 
@@ -90,4 +92,4 @@ include::types/parent-join.asciidoc[]
 
 include::types/feature.asciidoc[]
 
-
+include::types/feature-vector.asciidoc[]
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+[[feature-vector]]
+=== Feature vector datatype
+
+A `feature_vector` field can index numeric feature vectors, so that they can
+later be used to boost documents in queries with a
+<<query-dsl-feature-query,`feature`>> query.
+
+It is analogous to the <<feature,`feature`>> datatype but is better suited
+when the list of features is sparse so that it wouldn't be reasonable to add
+one field to the mappings for each of them.
+
+[source,js]
+--------------------------------------------------
+PUT my_index
+{
+  "mappings": {
+    "_doc": {
+      "properties": {
+        "topics": {
+          "type": "feature_vector" <1>
+        }
+      }
+    }
+  }
+}
+
+PUT my_index/_doc/1
+{
+  "topics": { <2>
+    "politics": 20,
+    "economics": 50.8
+  }
+}
+
+PUT my_index/_doc/2
+{
+  "topics": {
+    "politics": 5.2,
+    "sports": 80.1
+  }
+}
+
+GET my_index/_search
+{
+  "query": {
+    "feature": {
+      "field": "topics.politics"
+    }
+  }
+}
+--------------------------------------------------
+// CONSOLE
+<1> Feature vector fields must use the `feature_vector` field type
+<2> Feature vector fields must be a hash with string keys and strictly positive numeric values
+
+NOTE: `feature_vector` fields only support single-valued features and strictly
+positive values. Multi-valued fields and zero or negative values will be rejected.
+
+NOTE: `feature_vector` fields do not support sorting or aggregating and may
+only be queried using <<query-dsl-feature-query,`feature`>> queries.
+
+NOTE: `feature_vector` fields only preserve 9 significant bits for the
+precision, which translates to a relative error of about 0.4%.
+
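The constraints in the notes above can be illustrated with a small Python sketch. It is not part of the commit: `validate_feature_vector` and `truncate_to_9_significant_bits` are hypothetical helper names, and the truncation assumes that the value is stored as a 32-bit float whose low 15 mantissa bits are dropped (8 explicit mantissa bits plus the implicit leading bit give 9 significant bits). The actual encoding is not shown in this diff, so treat the sketch only as a way to see where the "about 0.4%" figure (2^-8 ≈ 0.39%) could come from.

```python
import struct


def validate_feature_vector(vector):
    """Client-side check mirroring the documented constraints (hypothetical helper)."""
    if not isinstance(vector, dict):
        raise ValueError("a feature vector must be a hash of feature name -> value")
    for name, value in vector.items():
        if not isinstance(name, str):
            raise ValueError(f"feature names must be strings, got {name!r}")
        # Lists would be multi-valued features, which the field rejects.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"feature [{name}] must be a single number")
        if value <= 0:
            raise ValueError(f"feature [{name}] must be strictly positive")


def truncate_to_9_significant_bits(value):
    """Assumed encoding: keep sign, exponent and the top 8 mantissa bits of a float32."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    kept = (bits >> 15) << 15  # drop the 15 lowest mantissa bits
    return struct.unpack(">f", struct.pack(">I", kept))[0]


# Upper bound on the relative error under the assumption above: 2^-8, about 0.39%.
MAX_RELATIVE_ERROR = 2 ** -8

if __name__ == "__main__":
    doc = {"politics": 20, "economics": 50.8}
    validate_feature_vector(doc)
    for name, value in doc.items():
        stored = truncate_to_9_significant_bits(value)
        print(name, value, stored, abs(stored - value) / value)
```

Values that already fit in 9 significant bits, such as 20, survive unchanged; others lose at most a fraction 2^-8 of their magnitude under this assumed scheme.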

docs/reference/query-dsl/feature-query.asciidoc

Lines changed: 72 additions & 18 deletions
@@ -2,9 +2,10 @@
 === Feature Query
 
 The `feature` query is a specialized query that only works on
-<<feature,`feature`>> fields. Its goal is to boost the score of documents based
-on the values of numeric features. It is typically put in a `should` clause of
-a <<query-dsl-bool-query,`bool`>> query so that its score is added to the score
+<<feature,`feature`>> fields and <<feature-vector,`feature_vector`>> fields.
+Its goal is to boost the score of documents based on the values of numeric
+features. It is typically put in a `should` clause of a
+<<query-dsl-bool-query,`bool`>> query so that its score is added to the score
 of the query.
 
 Compared to using <<query-dsl-function-score-query,`function_score`>> or other
@@ -13,7 +14,16 @@ efficiently skip non-competitive hits when
 <<search-uri-request,`track_total_hits`>> is set to `false`. Speedups may be
 spectacular.
 
-Here is an example:
+Here is an example that indexes various features:
+- https://en.wikipedia.org/wiki/PageRank[`pagerank`], a measure of the
+  importance of a website,
+- `url_length`, the length of the url, which typically correlates negatively
+  with relevance,
+- `topics`, which associates a list of topics with every document alongside a
+  measure of how well the document is connected to this topic.
+
+Then the example includes an example query that searches for `"2016"` and boosts
+based on `pagerank`, `url_length` and the `sports` topic.
 
 [source,js]
 --------------------------------------------------
@@ -28,6 +38,9 @@ PUT test
         "url_length": {
           "type": "feature",
           "positive_score_impact": false
+        },
+        "topics": {
+          "type": "feature_vector"
         }
       }
     }
@@ -36,32 +49,73 @@ PUT test
 
 PUT test/_doc/1
 {
-  "pagerank": 10,
-  "url_length": 50
+  "url": "http://en.wikipedia.org/wiki/2016_Summer_Olympics",
+  "content": "Rio 2016",
+  "pagerank": 50.3,
+  "url_length": 42,
+  "topics": {
+    "sports": 50,
+    "brazil": 30
+  }
 }
 
 PUT test/_doc/2
 {
-  "pagerank": 100,
-  "url_length": 20
+  "url": "http://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
+  "content": "Formula One motor race held on 13 November 2016 at the Autódromo José Carlos Pace in São Paulo, Brazil",
+  "pagerank": 50.3,
+  "url_length": 47,
+  "topics": {
+    "sports": 35,
+    "formula one": 65,
+    "brazil": 20
+  }
 }
 
-POST test/_refresh
-
-GET test/_search
+PUT test/_doc/3
 {
-  "query": {
-    "feature": {
-      "field": "pagerank"
-    }
+  "url": "http://en.wikipedia.org/wiki/Deadpool_(film)",
+  "content": "Deadpool is a 2016 American superhero film",
+  "pagerank": 50.3,
+  "url_length": 37,
+  "topics": {
+    "movies": 60,
+    "super hero": 65
   }
 }
 
-GET test/_search
+POST test/_refresh
+
+GET test/_search
 {
   "query": {
-    "feature": {
-      "field": "url_length"
+    "bool": {
+      "must": [
+        {
+          "match": {
+            "content": "2016"
+          }
+        }
+      ],
+      "should": [
+        {
+          "feature": {
+            "field": "pagerank"
+          }
+        },
+        {
+          "feature": {
+            "field": "url_length",
+            "boost": 0.1
+          }
+        },
+        {
+          "feature": {
+            "field": "topics.sports",
+            "boost": 0.4
+          }
+        }
+      ]
     }
   }
 }
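As a rough illustration of the use-case from the commit message (boosting by the topics a user is interested in), the `bool` query above could be generated per user. The Python sketch below is not part of the commit: `build_topic_boost_query` is a hypothetical helper, and the `content` and `topics` field names are taken from the documentation example. It only builds the request body as a dictionary and sends nothing to a cluster.

```python
import json


def build_topic_boost_query(text, topic_boosts, vector_field="topics", content_field="content"):
    """Build a search body: a mandatory match plus one `feature` clause per topic.

    `topic_boosts` maps a topic name to the boost applied to its `feature` clause;
    each topic is addressed as `<vector_field>.<topic>`, as in `topics.sports`.
    """
    should = [
        {"feature": {"field": f"{vector_field}.{topic}", "boost": boost}}
        for topic, boost in topic_boosts.items()
    ]
    return {
        "query": {
            "bool": {
                "must": [{"match": {content_field: text}}],
                "should": should,
            }
        }
    }


if __name__ == "__main__":
    # A user interested in sports and, to a lesser degree, Brazil (illustrative boosts).
    body = build_topic_boost_query("2016", {"sports": 0.4, "brazil": 0.2})
    print(json.dumps(body, indent=2))
```

Because the clauses sit in `should`, documents that lack a given topic still match; they simply receive no score contribution from that clause.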

modules/mapper-extras/src/main/java/org/elasticsearch/index/mapper/FeatureFieldMapper.java

Lines changed: 1 addition & 6 deletions
@@ -165,8 +165,7 @@ public Query existsQuery(QueryShardContext context) {
 
         @Override
         public IndexFieldData.Builder fielddataBuilder(String fullyQualifiedIndexName) {
-            failIfNoDocValues();
-            return new DocValuesIndexFieldData.Builder();
+            throw new UnsupportedOperationException("[feature] fields do not support sorting, scripting or aggregating");
         }
 
         @Override
@@ -229,10 +228,6 @@ protected String contentType() {
     protected void doXContentBody(XContentBuilder builder, boolean includeDefaults, Params params) throws IOException {
         super.doXContentBody(builder, includeDefaults, params);
 
-        if (includeDefaults || fieldType().nullValue() != null) {
-            builder.field("null_value", fieldType().nullValue());
-        }
-
         if (includeDefaults || fieldType().positiveScoreImpact() == false) {
             builder.field("positive_score_impact", fieldType().positiveScoreImpact());
         }
