Commit 5478fff

Deprecate the sparse_vector field type. (#48315)
We have not seen much adoption of this experimental field type, and don't see a clear use case as it's currently designed. This PR deprecates the field type in 7.x. It will be removed from 8.0 in a follow-up PR.
1 parent 8c13cf7 commit 5478fff

10 files changed, +193 -78 lines changed

docs/reference/mapping/types/sparse-vector.asciidoc

+6 -1
@@ -6,6 +6,7 @@
 <titleabbrev>Sparse vector</titleabbrev>
 ++++

+deprecated[7.6, The `sparse_vector` type is deprecated and will be removed in 8.0.]
 experimental[]

 A `sparse_vector` field stores sparse vectors of float values.
@@ -38,7 +39,11 @@ PUT my_index
 }
 }
 }
+--------------------------------------------------
+// TEST[warning:The [sparse_vector] field type is deprecated and will be removed in 8.0.]

+[source,console]
+--------------------------------------------------
 PUT my_index/_doc/1
 {
 "my_text" : "text1",
@@ -50,8 +55,8 @@ PUT my_index/_doc/2
 "my_text" : "text2",
 "my_vector" : {"103": 0.5, "4": -0.5, "5": 1, "11" : 1.2}
 }
-
 --------------------------------------------------
+// TEST[continued]

 Internally, each document's sparse vector is encoded as a binary
 doc value. Its size in bytes is equal to

docs/reference/vectors/vector-functions.asciidoc

+118 -72
@@ -5,16 +5,14 @@

 experimental[]

-These functions are used for
-for <<dense-vector,`dense_vector`>> and
-<<sparse-vector,`sparse_vector`>> fields.
-
 NOTE: During vector functions' calculation, all matched documents are
-linearly scanned. Thus, expect the query time grow linearly
+linearly scanned. Thus, expect the query time grow linearly
 with the number of matched documents. For this reason, we recommend
 to limit the number of matched documents with a `query` parameter.

-Let's create an index with the following mapping and index a couple
+====== `dense_vector` functions
+
+Let's create an index with a `dense_vector` mapping and index a couple
 of documents into it.

 [source,console]
@@ -27,9 +25,6 @@ PUT my_index
 "type": "dense_vector",
 "dims": 3
 },
-"my_sparse_vector" : {
-"type" : "sparse_vector"
-},
 "status" : {
 "type" : "keyword"
 }
@@ -40,21 +35,21 @@ PUT my_index
 PUT my_index/_doc/1
 {
 "my_dense_vector": [0.5, 10, 6],
-"my_sparse_vector": {"2": 1.5, "15" : 2, "50": -1.1, "4545": 1.1},
 "status" : "published"
 }

 PUT my_index/_doc/2
 {
 "my_dense_vector": [-0.5, 10, 10],
-"my_sparse_vector": {"2": 2.5, "10" : 1.3, "55": -2.3, "113": 1.6},
 "status" : "published"
 }

+POST my_index/_refresh
+
 --------------------------------------------------
 // TESTSETUP

-For dense_vector fields, `cosineSimilarity` calculates the measure of
+The `cosineSimilarity` function calculates the measure of
 cosine similarity between a given query vector and document vectors.

 [source,console]
@@ -90,8 +85,8 @@ GET my_index/_search
 NOTE: If a document's dense vector field has a number of dimensions
 different from the query's vector, an error will be thrown.

-Similarly, for sparse_vector fields, `cosineSimilaritySparse` calculates cosine similarity
-between a given query vector and document vectors.
+The `dotProduct` function calculates the measure of
+dot product between a given query vector and document vectors.

 [source,console]
 --------------------------------------------------
@@ -109,18 +104,24 @@ GET my_index/_search
 }
 },
 "script": {
-"source": "cosineSimilaritySparse(params.query_vector, doc['my_sparse_vector']) + 1.0",
+"source": """
+double value = dotProduct(params.query_vector, doc['my_dense_vector']);
+return sigmoid(1, Math.E, -value); <1>
+""",
 "params": {
-"query_vector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
+"query_vector": [4, 3.4, -0.2]
 }
 }
 }
 }
 }
 --------------------------------------------------

-For dense_vector fields, `dotProduct` calculates the measure of
-dot product between a given query vector and document vectors.
+<1> Using the standard sigmoid function prevents scores from being negative.
+
+The `l1norm` function calculates L^1^ distance
+(Manhattan distance) between a given query vector and
+document vectors.

 [source,console]
 --------------------------------------------------
@@ -138,23 +139,28 @@ GET my_index/_search
 }
 },
 "script": {
-"source": """
-double value = dotProduct(params.query_vector, doc['my_dense_vector']);
-return sigmoid(1, Math.E, -value); <1>
-""",
+"source": "1 / (1 + l1norm(params.queryVector, doc['my_dense_vector']))", <1>
 "params": {
-"query_vector": [4, 3.4, -0.2]
+"queryVector": [4, 3.4, -0.2]
 }
 }
 }
 }
 }
 --------------------------------------------------

-<1> Using the standard sigmoid function prevents scores from being negative.
+<1> Unlike `cosineSimilarity` that represent similarity, `l1norm` and
+`l2norm` shown below represent distances or differences. This means, that
+the more similar the vectors are, the lower the scores will be that are
+produced by the `l1norm` and `l2norm` functions.
+Thus, as we need more similar vectors to score higher,
+we reversed the output from `l1norm` and `l2norm`. Also, to avoid
+division by 0 when a document vector matches the query exactly,
+we added `1` in the denominator.

-Similarly, for sparse_vector fields, `dotProductSparse` calculates dot product
-between a given query vector and document vectors.
+The `l2norm` function calculates L^2^ distance
+(Euclidean distance) between a given query vector and
+document vectors.

 [source,console]
 --------------------------------------------------
@@ -172,26 +178,77 @@ GET my_index/_search
 }
 },
 "script": {
-"source": """
-double value = dotProductSparse(params.query_vector, doc['my_sparse_vector']);
-return sigmoid(1, Math.E, -value);
-""",
-"params": {
-"query_vector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
+"source": "1 / (1 + l2norm(params.queryVector, doc['my_dense_vector']))",
+"params": {
+"queryVector": [4, 3.4, -0.2]
 }
 }
 }
 }
 }
 --------------------------------------------------

-For dense_vector fields, `l1norm` calculates L^1^ distance
-(Manhattan distance) between a given query vector and
-document vectors.
+NOTE: If a document doesn't have a value for a vector field on which
+a vector function is executed, an error will be thrown.
+
+You can check if a document has a value for the field `my_vector` by
+`doc['my_vector'].size() == 0`. Your overall script can look like this:
+
+[source,js]
+--------------------------------------------------
+"source": "doc['my_vector'].size() == 0 ? 0 : cosineSimilarity(params.queryVector, doc['my_vector'])"
+--------------------------------------------------
+// NOTCONSOLE
+
+====== `sparse_vector` functions
+
+deprecated[7.6, The `sparse_vector` type is deprecated and will be removed in 8.0.]
+
+Let's create an index with a `sparse_vector` mapping and index a couple
+of documents into it.

 [source,console]
 --------------------------------------------------
-GET my_index/_search
+PUT my_sparse_index
+{
+"mappings": {
+"properties": {
+"my_sparse_vector": {
+"type": "sparse_vector"
+},
+"status" : {
+"type" : "keyword"
+}
+}
+}
+}
+--------------------------------------------------
+// TEST[warning:The [sparse_vector] field type is deprecated and will be removed in 8.0.]
+
+[source,console]
+--------------------------------------------------
+PUT my_sparse_index/_doc/1
+{
+"my_sparse_vector": {"2": 1.5, "15" : 2, "50": -1.1, "4545": 1.1},
+"status" : "published"
+}
+
+PUT my_sparse_index/_doc/2
+{
+"my_sparse_vector": {"2": 2.5, "10" : 1.3, "55": -2.3, "113": 1.6},
+"status" : "published"
+}
+
+POST my_sparse_index/_refresh
+--------------------------------------------------
+// TEST[continued]
+
+The `cosineSimilaritySparse` function calculates cosine similarity
+between a given query vector and document vectors.
+
+[source,console]
+--------------------------------------------------
+GET my_sparse_index/_search
 {
 "query": {
 "script_score": {
@@ -205,31 +262,24 @@ GET my_index/_search
 }
 },
 "script": {
-"source": "1 / (1 + l1norm(params.queryVector, doc['my_dense_vector']))", <1>
+"source": "cosineSimilaritySparse(params.query_vector, doc['my_sparse_vector']) + 1.0",
 "params": {
-"queryVector": [4, 3.4, -0.2]
+"query_vector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
 }
 }
 }
 }
 }
 --------------------------------------------------
+// TEST[continued]
+// TEST[warning:The [sparse_vector] field type is deprecated and will be removed in 8.0.]

-<1> Unlike `cosineSimilarity` that represent similarity, `l1norm` and
-`l2norm` shown below represent distances or differences. This means, that
-the more similar the vectors are, the lower the scores will be that are
-produced by the `l1norm` and `l2norm` functions.
-Thus, as we need more similar vectors to score higher,
-we reversed the output from `l1norm` and `l2norm`. Also, to avoid
-division by 0 when a document vector matches the query exactly,
-we added `1` in the denominator.
-
-For sparse_vector fields, `l1normSparse` calculates L^1^ distance
+The `dotProductSparse` function calculates dot product
 between a given query vector and document vectors.

 [source,console]
 --------------------------------------------------
-GET my_index/_search
+GET my_sparse_index/_search
 {
 "query": {
 "script_score": {
@@ -243,23 +293,27 @@ GET my_index/_search
 }
 },
 "script": {
-"source": "1 / (1 + l1normSparse(params.queryVector, doc['my_sparse_vector']))",
-"params": {
-"queryVector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
+"source": """
+double value = dotProductSparse(params.query_vector, doc['my_sparse_vector']);
+return sigmoid(1, Math.E, -value);
+""",
+"params": {
+"query_vector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
 }
 }
 }
 }
 }
 --------------------------------------------------
+// TEST[continued]
+// TEST[warning:The [sparse_vector] field type is deprecated and will be removed in 8.0.]

-For dense_vector fields, `l2norm` calculates L^2^ distance
-(Euclidean distance) between a given query vector and
-document vectors.
+The `l1normSparse` function calculates L^1^ distance
+between a given query vector and document vectors.

 [source,console]
 --------------------------------------------------
-GET my_index/_search
+GET my_sparse_index/_search
 {
 "query": {
 "script_score": {
@@ -273,22 +327,24 @@ GET my_index/_search
 }
 },
 "script": {
-"source": "1 / (1 + l2norm(params.queryVector, doc['my_dense_vector']))",
+"source": "1 / (1 + l1normSparse(params.queryVector, doc['my_sparse_vector']))",
 "params": {
-"queryVector": [4, 3.4, -0.2]
+"queryVector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
 }
 }
 }
 }
 }
 --------------------------------------------------
+// TEST[continued]
+// TEST[warning:The [sparse_vector] field type is deprecated and will be removed in 8.0.]

-Similarly, for sparse_vector fields, `l2normSparse` calculates L^2^ distance
+The `l2normSparse` function calculates L^2^ distance
 between a given query vector and document vectors.

 [source,console]
 --------------------------------------------------
-GET my_index/_search
+GET my_sparse_index/_search
 {
 "query": {
 "script_score": {
@@ -311,15 +367,5 @@ GET my_index/_search
 }
 }
 --------------------------------------------------
-
-NOTE: If a document doesn't have a value for a vector field on which
-a vector function is executed, an error will be thrown.
-
-You can check if a document has a value for the field `my_vector` by
-`doc['my_vector'].size() == 0`. Your overall script can look like this:
-
-[source,js]
---------------------------------------------------
-"source": "doc['my_vector'].size() == 0 ? 0 : cosineSimilarity(params.queryVector, doc['my_vector'])"
---------------------------------------------------
-// NOTCONSOLE
+// TEST[continued]
+// TEST[warning:The [sparse_vector] field type is deprecated and will be removed in 8.0.]
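
Reviewer aid (not part of the commit): the scoring expressions in the `vector-functions.asciidoc` examples can be checked outside Elasticsearch with a small Python sketch. It is only an approximation under stated assumptions. The helper names are hypothetical. The three-argument `sigmoid` is assumed to compute `value^a / (k^a + value^a)`. The `+ 1.0` shift on cosine similarity is copied from the `cosineSimilaritySparse` example. Absent dimensions of a sparse vector are assumed to be zero.

```python
import math

# Dense sample vectors and query vector copied from the documentation examples.
DOC_VECTORS = {"1": [0.5, 10, 6], "2": [-0.5, 10, 10]}
QUERY_VECTOR = [4, 3.4, -0.2]


def dot_product(query, doc):
    """Sum of element-wise products."""
    return sum(q * d for q, d in zip(query, doc))


def cosine_similarity(query, doc):
    """Dot product divided by the product of the two vector lengths."""
    lengths = math.sqrt(dot_product(query, query)) * math.sqrt(dot_product(doc, doc))
    return dot_product(query, doc) / lengths


def l1norm(query, doc):
    """Manhattan distance: sum of absolute element-wise differences."""
    return sum(abs(q - d) for q, d in zip(query, doc))


def l2norm(query, doc):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((q - d) ** 2 for q, d in zip(query, doc)))


def sigmoid(value, k, a):
    # Assumption: same shape as the scripting helper, value^a / (k^a + value^a).
    return value ** a / (k ** a + value ** a)


def script_scores(query, doc):
    """Non-negative scores mirroring the script sources shown in the examples."""
    return {
        "cosineSimilarity": cosine_similarity(query, doc) + 1.0,
        "dotProduct": sigmoid(1, math.e, -dot_product(query, doc)),
        "l1norm": 1 / (1 + l1norm(query, doc)),
        "l2norm": 1 / (1 + l2norm(query, doc)),
    }


def align_sparse(query, doc):
    """Expand two sparse {dimension: value} maps onto their shared dimensions.

    Assumption: a dimension absent from a sparse vector counts as 0.0.
    """
    dims = sorted(set(query) | set(doc), key=int)
    return [query.get(d, 0.0) for d in dims], [doc.get(d, 0.0) for d in dims]


if __name__ == "__main__":
    for doc_id, vector in DOC_VECTORS.items():
        print(doc_id, script_scores(QUERY_VECTOR, vector))
```

The distance-based scores use `1 / (1 + distance)`, so an exact match scores 1 and larger distances approach 0, which is the reversal described in the `l1norm` callout. For the sparse variants, `align_sparse` can turn two maps into aligned lists before calling the same helpers.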
