Distance measures for dense and sparse vectors #37947
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -74,6 +74,108 @@ to be the most efficient by using the internal mechanisms. | |
-------------------------------------------------- | ||
// NOTCONSOLE | ||
|
||
[[vector-functions]] | ||
===== Distance functions for vector fields | ||
These functions are used to calculate distances | ||
Let's maybe avoid mentioning "distance", since e.g. `cosineSimilarity` measures the similarity between two vectors rather than their distance? |
||
for <<dense-vector,`dense_vector`>> and | ||
<<sparse-vector,`sparse_vector`>> fields. | ||
|
||
For dense_vector fields, `cosineSimilarity` calculates the measure of | ||
cosine similarity between a given query vector and document vectors. | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
{ | ||
"query": { | ||
"script_score": { | ||
"query": { | ||
"match_all": {} | ||
}, | ||
"script": { | ||
"source": "cosineSimilarity(params.queryVector, doc['my_dense_vector'])", | ||
"params": { | ||
"queryVector": [4, 3.4, -1.2] <1> | ||
} | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// NOTCONSOLE | ||
<1> To take advantage of the script optimizations, supply a query vector in script parameters. | ||
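For reference, the two dense-vector measures documented here can be sketched in plain Java. This is only an illustration of the math, not the code added by this PR: the class name `DenseVectorMath` is hypothetical, and throwing on mismatched dimensions is an assumption made for the sketch.

```java
// Hypothetical helper illustrating the math behind dotProduct and cosineSimilarity.
final class DenseVectorMath {
    private DenseVectorMath() { }

    // Sum of element-wise products of two vectors of equal length.
    public static double dotProduct(float[] a, float[] b) {
        if (a.length != b.length) {
            // Assumption for this sketch: mismatched dimensions are rejected.
            throw new IllegalArgumentException("vectors must have the same number of dimensions");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    // Dot product divided by the product of the two vector magnitudes.
    // Yields NaN if either vector has zero magnitude.
    public static double cosineSimilarity(float[] a, float[] b) {
        double magnitudes = Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b));
        return dotProduct(a, b) / magnitudes;
    }
}
```

Cosine similarity ignores vector length, so a document vector that is a positive multiple of the query vector scores the same as the query vector itself, while the dot product grows with magnitude.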
|
||
Similarly, for sparse_vector fields, `cosineSimilaritySparse` calculates cosine similarity | ||
between a given query vector and document vectors. | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
{ | ||
"query": { | ||
"script_score": { | ||
"query": { | ||
"match_all": {} | ||
}, | ||
"script": { | ||
"source": "cosineSimilaritySparse(params.queryVector, doc['my_sparse_vector'])", | ||
"params": { | ||
"queryVector": {"2": -0.5, "10" : 111.3, "50": -13.0, "113": 14.8, "4545": -156.0} | ||
} | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// NOTCONSOLE | ||
|
||
For dense_vector fields, `dotProduct` calculates the measure of | ||
dot product between a given query vector and document vectors. | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
{ | ||
"query": { | ||
"script_score": { | ||
"query": { | ||
"match_all": {} | ||
}, | ||
"script": { | ||
"source": "dotProduct(params.queryVector, doc['my_dense_vector'])", | ||
"params": { | ||
"queryVector": [4, 3.4, -1.2] | ||
} | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// NOTCONSOLE | ||
|
||
Similarly, for sparse_vector fields, `dotProductSparse` calculates dot product | ||
between a given query vector and document vectors. | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
{ | ||
"query": { | ||
"script_score": { | ||
"query": { | ||
"match_all": {} | ||
}, | ||
"script": { | ||
"source": "dotProductSparse(params.queryVector, doc['my_sparse_vector'])", | ||
"params": { | ||
"queryVector": {"2": -0.5, "10" : 111.3, "50": -13.0, "113": 14.8, "4545": -156.0} | ||
} | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// NOTCONSOLE | ||
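One way to compute a sparse dot product is a two-pointer merge over dimension arrays sorted in ascending order, so that only dimensions present in both vectors contribute. The sketch below is a minimal illustration under that assumption; `SparseVectorMath` and its parameter names are hypothetical, and this is not necessarily how `dotProductSparse` is implemented in this PR.

```java
// Hypothetical helper: dot product of two sparse vectors given as parallel
// (dims, values) arrays, where each dims array is sorted in ascending order.
final class SparseVectorMath {
    private SparseVectorMath() { }

    public static double dotProductSparse(int[] queryDims, float[] queryValues,
                                          int[] docDims, float[] docValues) {
        double sum = 0.0;
        int i = 0;
        int j = 0;
        while (i < queryDims.length && j < docDims.length) {
            if (queryDims[i] == docDims[j]) {
                // Dimension present in both vectors: accumulate the product.
                sum += (double) queryValues[i] * docValues[j];
                i++;
                j++;
            } else if (queryDims[i] < docDims[j]) {
                i++; // dimension only in the query vector
            } else {
                j++; // dimension only in the document vector
            }
        }
        return sum;
    }
}
```

Because both arrays are traversed once, the cost is linear in the number of non-zero dimensions rather than in the size of the full vector space.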
|
||
NOTE: If a document doesn't have a value for a vector field on which | ||
a distance function is executed, 0 will be returned as a result. | ||
Let's also clarify what happens for dense vectors if they don't have the same number of dimensions? |
||
|
||
|
||
[[random-functions]] | ||
===== Random functions | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -23,7 +23,7 @@ | |
import org.apache.lucene.util.InPlaceMergeSorter; | ||
|
||
// static utility functions for encoding and decoding dense_vector and sparse_vector fields | ||
final class VectorEncoderDecoder { | ||
public final class VectorEncoderDecoder { | ||
static final byte INT_BYTES = 4; | ||
static final byte SHORT_BYTES = 2; | ||
|
||
|
@@ -34,7 +34,8 @@ private VectorEncoderDecoder() { } | |
* BytesRef: int[] floats encoded as integers values, 2 bytes for each dimension | ||
* @param values - values of the sparse array | ||
* @param dims - dims of the sparse array | ||
* @param dimCount - number of the dimension | ||
* @param dimCount - number of the dimensions, necessary as values and dims are dynamically created arrays, | ||
* and may be over-allocated | ||
* @return BytesRef | ||
*/ | ||
static BytesRef encodeSparseVector(int[] dims, float[] values, int dimCount) { | ||
|
@@ -66,9 +67,12 @@ static BytesRef encodeSparseVector(int[] dims, float[] values, int dimCount) { | |
|
||
/** | ||
* Decodes the first part of BytesRef into sparse vector dimensions | ||
* @param vectorBR - vector decoded in BytesRef | ||
* @param vectorBR - sparse vector encoded in BytesRef | ||
*/ | ||
static int[] decodeSparseVectorDims(BytesRef vectorBR) { | ||
public static int[] decodeSparseVectorDims(BytesRef vectorBR) { | ||
if (vectorBR == null) { | ||
throw new IllegalStateException("A document doesn't have a value for a vector field!"); | ||
} | ||
Shouldn't this be an `IllegalArgumentException`? |
||
int dimCount = vectorBR.length / (INT_BYTES + SHORT_BYTES); | ||
int[] dims = new int[dimCount]; | ||
int offset = vectorBR.offset; | ||
|
@@ -81,9 +85,12 @@ static int[] decodeSparseVectorDims(BytesRef vectorBR) { | |
|
||
/** | ||
* Decodes the second part of the BytesRef into sparse vector values | ||
* @param vectorBR - vector decoded in BytesRef | ||
* @param vectorBR - sparse vector encoded in BytesRef | ||
*/ | ||
static float[] decodeSparseVector(BytesRef vectorBR) { | ||
public static float[] decodeSparseVector(BytesRef vectorBR) { | ||
if (vectorBR == null) { | ||
throw new IllegalStateException("A document doesn't have a value for a vector field!"); | ||
} | ||
int dimCount = vectorBR.length / (INT_BYTES + SHORT_BYTES); | ||
int offset = vectorBR.offset + SHORT_BYTES * dimCount; //calculate the offset from where values are encoded | ||
float[] vector = new float[dimCount]; | ||
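The offsets in the two sparse decoders imply a layout of `dimCount` two-byte dimensions followed by `dimCount` four-byte float bit patterns. The following sketch round-trips that layout using a plain `byte[]` in place of Lucene's `BytesRef`. The class `SparseVectorCodecSketch`, the big-endian byte order, and reading dimensions as unsigned shorts are assumptions made for illustration, not details confirmed by the diff.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the sparse encode/decode pair, using byte[] instead of BytesRef.
final class SparseVectorCodecSketch {
    static final int INT_BYTES = 4;
    static final int SHORT_BYTES = 2;

    private SparseVectorCodecSketch() { }

    // Writes all dimensions first (2 bytes each), then all values (4 bytes each).
    // dimCount is passed explicitly because dims and values may be over-allocated.
    public static byte[] encode(int[] dims, float[] values, int dimCount) {
        ByteBuffer buf = ByteBuffer.allocate(dimCount * (INT_BYTES + SHORT_BYTES));
        for (int i = 0; i < dimCount; i++) {
            buf.putShort((short) dims[i]);
        }
        for (int i = 0; i < dimCount; i++) {
            buf.putInt(Float.floatToIntBits(values[i]));
        }
        return buf.array();
    }

    // Reads the first part of the buffer: the dimensions.
    public static int[] decodeDims(byte[] bytes) {
        int dimCount = bytes.length / (INT_BYTES + SHORT_BYTES);
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int[] dims = new int[dimCount];
        for (int i = 0; i < dimCount; i++) {
            dims[i] = buf.getShort() & 0xFFFF; // assumption: dimensions are unsigned
        }
        return dims;
    }

    // Skips the dimensions and reads the second part of the buffer: the values.
    public static float[] decodeValues(byte[] bytes) {
        int dimCount = bytes.length / (INT_BYTES + SHORT_BYTES);
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        buf.position(SHORT_BYTES * dimCount);
        float[] values = new float[dimCount];
        for (int i = 0; i < dimCount; i++) {
            values[i] = Float.intBitsToFloat(buf.getInt());
        }
        return values;
    }
}
```

Storing the two parts back to back lets each decoder derive `dimCount` from the total length alone, without a separate length prefix.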
|
@@ -100,10 +107,14 @@ static float[] decodeSparseVector(BytesRef vectorBR) { | |
|
||
|
||
/** | ||
Sort dimensions in the ascending order and | ||
sort values in the same order as their corresponding dimensions | ||
**/ | ||
static void sortSparseDimsValues(int[] dims, float[] values, int n) { | ||
* Sorts dimensions in the ascending order and | ||
* sorts values in the same order as their corresponding dimensions | ||
* | ||
* @param dims - dimensions of the sparse query vector | ||
* @param values - values for the sparse query vector | ||
* @param n - number of dimensions | ||
*/ | ||
public static void sortSparseDimsValues(int[] dims, float[] values, int n) { | ||
new InPlaceMergeSorter() { | ||
@Override | ||
public int compare(int i, int j) { | ||
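The contract of `sortSparseDimsValues` (sort the first `n` dimensions in ascending order and move each value along with its dimension) can be illustrated without Lucene. The sketch below swaps in a plain insertion sort for `InPlaceMergeSorter`, purely to show the parallel-array bookkeeping; the class name `ParallelSortSketch` is hypothetical.

```java
// Hypothetical illustration: sorts dims[0..n) ascending and keeps values aligned.
// Uses insertion sort instead of Lucene's InPlaceMergeSorter.
final class ParallelSortSketch {
    private ParallelSortSketch() { }

    public static void sortSparseDimsValues(int[] dims, float[] values, int n) {
        for (int i = 1; i < n; i++) {
            int dim = dims[i];
            float value = values[i];
            int j = i - 1;
            // Shift larger dimensions, and their values, one slot to the right.
            while (j >= 0 && dims[j] > dim) {
                dims[j + 1] = dims[j];
                values[j + 1] = values[j];
                j--;
            }
            dims[j + 1] = dim;
            values[j + 1] = value;
        }
    }
}
```

Only the first `n` entries are touched, which matters when the arrays are over-allocated.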
|
@@ -123,8 +134,14 @@ public void swap(int i, int j) { | |
}.sort(0, n); | ||
} | ||
|
||
// Decodes a BytesRef into an array of floats | ||
static float[] decodeDenseVector(BytesRef vectorBR) { | ||
/** | ||
* Decodes a BytesRef into an array of floats | ||
* @param vectorBR - dense vector encoded in BytesRef | ||
*/ | ||
public static float[] decodeDenseVector(BytesRef vectorBR) { | ||
if (vectorBR == null) { | ||
throw new IllegalStateException("A document doesn't have a value for a vector field!"); | ||
} | ||
int dimCount = vectorBR.length / INT_BYTES; | ||
float[] vector = new float[dimCount]; | ||
int offset = vectorBR.offset; | ||
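The dense decoder divides the byte length by `INT_BYTES`, which implies that each dimension is stored as the four-byte integer bit pattern of a float. A minimal round-trip sketch of that idea follows, using `byte[]` instead of `BytesRef`. The class `DenseVectorCodecSketch` and the big-endian byte order are assumptions, and the null check throws `IllegalArgumentException` as the reviewer suggests rather than the `IllegalStateException` used in the diff.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for dense vector encoding/decoding, using byte[] instead of BytesRef.
final class DenseVectorCodecSketch {
    static final int INT_BYTES = 4;

    private DenseVectorCodecSketch() { }

    // Stores each float as its 4-byte IEEE 754 bit pattern.
    public static byte[] encode(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * INT_BYTES);
        for (float v : vector) {
            buf.putInt(Float.floatToIntBits(v));
        }
        return buf.array();
    }

    // Recovers the floats; the dimension count follows from the byte length.
    public static float[] decode(byte[] bytes) {
        if (bytes == null) {
            throw new IllegalArgumentException("A document doesn't have a value for a vector field!");
        }
        int dimCount = bytes.length / INT_BYTES;
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        float[] vector = new float[dimCount];
        for (int i = 0; i < dimCount; i++) {
            vector[i] = Float.intBitsToFloat(buf.getInt());
        }
        return vector;
    }
}
```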
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
/* | ||
* Licensed to Elasticsearch under one or more contributor | ||
* license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright | ||
* ownership. Elasticsearch licenses this file to you under | ||
* the Apache License, Version 2.0 (the "License"); you may | ||
* not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, | ||
* software distributed under the License is distributed on an | ||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
* KIND, either express or implied. See the License for the | ||
* specific language governing permissions and limitations | ||
* under the License. | ||
*/ | ||
|
||
package org.elasticsearch.index.query; | ||
|
||
|
||
import org.elasticsearch.painless.spi.PainlessExtension; | ||
import org.elasticsearch.painless.spi.Whitelist; | ||
import org.elasticsearch.painless.spi.WhitelistLoader; | ||
import org.elasticsearch.script.ScoreScript; | ||
import org.elasticsearch.script.ScriptContext; | ||
|
||
import java.util.Collections; | ||
import java.util.List; | ||
import java.util.Map; | ||
|
||
public class DocValuesWhitelistExtension implements PainlessExtension { | ||
|
||
private static final Whitelist WHITELIST = | ||
WhitelistLoader.loadFromResourceFiles(DocValuesWhitelistExtension.class, "docvalues_whitelist.txt"); | ||
|
||
@Override | ||
public Map<ScriptContext<?>, List<Whitelist>> getContextWhitelists() { | ||
return Collections.singletonMap(ScoreScript.CONTEXT, Collections.singletonList(WHITELIST)); | ||
} | ||
} |
Is there a reason not to use an internal link, e.g. `<<vector-functions,document scoring>>`?
Thanks Adrien. I think we can use internal links only to reference sections within the same document. What I wanted to do here is reference a section of an external document.
I'm still a bit confused, this is the same document, isn't it?
@jpountz Sorry Adrien, I meant that inside one asciidoc doc, `dense-vector.asciidoc`, we want to reference a section of another asciidoc doc, `script-score-query.asciidoc`. We can indeed use an easier format: `<<query-dsl-script-score-query,document_scoring>>`, but this will link to the whole document. As I understood after talking with the documentation team, the only way to link to a section of another doc is to use the full HTML link.
ok
@jpountz Sorry Adrien, please disregard my previous comments. I have followed your advice to use internal links and it looks like documentation CI passed.