Distance measures for dense and sparse vectors #37947

Merged
merged 11 commits into from
Feb 20, 2019
Changes from 4 commits
3 changes: 2 additions & 1 deletion docs/reference/mapping/types/dense-vector.asciidoc
@@ -9,7 +9,8 @@ not exceed 500. The number of dimensions can be
different across documents. A `dense_vector` field is
a single-valued field.

These vectors can be used for document scoring.
These vectors can be used for
{ref}/query-dsl-script-score-query.html#vector-functions[document scoring].
Contributor

Is there a reason not to use an internal link, e.g. <<vector-functions,document scoring>>?

Contributor Author

Thanks Adrien. I think internal links can only reference targets within the same document. What I wanted to do here is reference a section of an external document.

Contributor

I'm still a bit confused, this is the same document, isn't it?

Contributor Author

@jpountz Sorry Adrien, I meant that inside one asciidoc file, dense-vector.asciidoc, we want to reference a section of another asciidoc file, script-score-query.asciidoc.

We can indeed use a simpler format: <<query-dsl-script-score-query,document_scoring>>, but this will link to the whole document. As I understood after talking with the documentation team, the only way to link to a section of another doc is to use this full HTML link.

Contributor

OK

Contributor Author

@jpountz Sorry Adrien, please disregard my previous comments. I have followed your advice to use internal links, and it looks like the documentation CI passed.

For example, a document score can represent a distance between
a given query vector and the indexed document vector.

3 changes: 2 additions & 1 deletion docs/reference/mapping/types/sparse-vector.asciidoc
@@ -9,7 +9,8 @@ not exceed 500. The number of dimensions can be
different across documents. A `sparse_vector` field is
a single-valued field.

These vectors can be used for document scoring.
These vectors can be used for
{ref}/query-dsl-script-score-query.html#vector-functions[document scoring].
For example, a document score can represent a distance between
a given query vector and the indexed document vector.

102 changes: 102 additions & 0 deletions docs/reference/query-dsl/script-score-query.asciidoc
@@ -74,6 +74,108 @@ to be the most efficient by using the internal mechanisms.
--------------------------------------------------
// NOTCONSOLE

[[vector-functions]]
===== Distance functions for vector fields
These functions are used to calculate distances
Contributor

Let's maybe avoid mentioning "distance", since e.g. cosineSimilarity measures the similarity between two vectors rather than their distance?

for <<dense-vector,`dense_vector`>> and
<<sparse-vector,`sparse_vector`>> fields.

For dense_vector fields, `cosineSimilarity` calculates the measure of
cosine similarity between a given query vector and document vectors.

[source,js]
--------------------------------------------------
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['my_dense_vector'])",
        "params": {
          "queryVector": [4, 3.4, -1.2] <1>
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> To take advantage of the script optimizations, supply a query vector in script parameters.
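
As a rough illustration of what the two dense-vector functions documented in this section compute, here is a minimal Java sketch. It is not the PR's implementation: the class name `DenseVectorSimilaritySketch` is hypothetical, and rejecting mismatched dimensions and returning 0 for a zero-magnitude vector are assumptions made only for this example.

```java
/** Hypothetical helper for illustration; not the code added by this PR. */
final class DenseVectorSimilaritySketch {

    private DenseVectorSimilaritySketch() { }

    /** Sum of the element-wise products of two vectors of equal length. */
    static double dotProduct(float[] query, float[] doc) {
        if (query.length != doc.length) {
            // Assumption: this sketch simply rejects mismatched dimensions.
            throw new IllegalArgumentException(
                "dimension mismatch: " + query.length + " vs " + doc.length);
        }
        double sum = 0.0;
        for (int i = 0; i < query.length; i++) {
            sum += (double) query[i] * doc[i];
        }
        return sum;
    }

    /** Dot product divided by the product of the two vector magnitudes. */
    static double cosineSimilarity(float[] query, float[] doc) {
        double dot = dotProduct(query, doc);
        double norms = Math.sqrt(dotProduct(query, query)) * Math.sqrt(dotProduct(doc, doc));
        // Assumption: a zero-magnitude vector yields 0 instead of NaN.
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    public static void main(String[] args) {
        float[] query = {4f, 3.4f, -1.2f};   // the query vector from the docs example
        float[] doc = {8f, 6.8f, -2.4f};     // a made-up document vector, parallel to the query
        System.out.println(cosineSimilarity(query, doc));
        System.out.println(dotProduct(query, doc));
    }
}
```

Cosine similarity depends only on the angle between the vectors, while the dot product also grows with their magnitudes, which is why the two are documented as separate functions.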

Similarly, for sparse_vector fields, `cosineSimilaritySparse` calculates cosine similarity
between a given query vector and document vectors.

[source,js]
--------------------------------------------------
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilaritySparse(params.queryVector, doc['my_sparse_vector'])",
        "params": {
          "queryVector": {"2": -0.5, "10": 111.3, "50": -13.0, "113": 14.8, "4545": -156.0}
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE
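
For sparse vectors, only dimensions present in both the query and the document contribute to the result. The sketch below shows one way to compute a sparse dot product by walking both vectors in ascending dimension order. It assumes the document's dimensions are already sorted ascending (the encoder in this PR sorts them before encoding); the class and method names are hypothetical, and the query is modelled as a `Map<String, Float>` only to mirror the JSON above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration; not the code added by this PR. */
final class SparseDotProductSketch {

    private SparseDotProductSketch() { }

    /**
     * Dot product between a sparse query vector (dimension -> value) and a
     * sparse document vector given as parallel arrays sorted by dimension.
     */
    static double dotProductSparse(Map<String, Float> queryVector, int[] docDims, float[] docValues) {
        // Sort the query dimensions numerically so both sides can be merged in one pass.
        TreeMap<Integer, Float> sortedQuery = new TreeMap<>();
        for (Map.Entry<String, Float> entry : queryVector.entrySet()) {
            sortedQuery.put(Integer.parseInt(entry.getKey()), entry.getValue());
        }
        double sum = 0.0;
        int d = 0; // cursor into the document arrays
        for (Map.Entry<Integer, Float> entry : sortedQuery.entrySet()) {
            int queryDim = entry.getKey();
            while (d < docDims.length && docDims[d] < queryDim) {
                d++; // skip document dimensions absent from the query
            }
            if (d == docDims.length) {
                break; // no document dimensions left to match
            }
            if (docDims[d] == queryDim) {
                sum += (double) entry.getValue() * docValues[d];
                d++;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Float> query = new HashMap<>();
        query.put("2", -0.5f);
        query.put("10", 111.3f);
        query.put("50", -13.0f);
        int[] docDims = {2, 50, 4545};       // made-up document vector
        float[] docValues = {2f, 1f, 3f};
        System.out.println(dotProductSparse(query, docDims, docValues));
    }
}
```

A sparse cosine similarity can be built on top of this by dividing by the two magnitudes, each of which is the square root of a vector's dot product with itself.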

For dense_vector fields, `dotProduct` calculates the measure of
dot product between a given query vector and document vectors.

[source,js]
--------------------------------------------------
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "dotProduct(params.queryVector, doc['my_dense_vector'])",
        "params": {
          "queryVector": [4, 3.4, -1.2]
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

Similarly, for sparse_vector fields, `dotProductSparse` calculates dot product
between a given query vector and document vectors.

[source,js]
--------------------------------------------------
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "dotProductSparse(params.queryVector, doc['my_sparse_vector'])",
        "params": {
          "queryVector": {"2": -0.5, "10": 111.3, "50": -13.0, "113": 14.8, "4545": -156.0}
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

NOTE: If a document doesn't have a value for a vector field on which
a distance function is executed, 0 will be returned as a result.
Contributor

Let's also clarify what happens for dense vectors if they don't have the same number of dimensions?


[[random-functions]]
===== Random functions

9 changes: 9 additions & 0 deletions modules/mapper-extras/build.gradle
@@ -20,4 +20,13 @@
esplugin {
    description 'Adds advanced field mappers'
    classname 'org.elasticsearch.index.mapper.MapperExtrasPlugin'
    extendedPlugins = ['lang-painless']
}

dependencies {
    compileOnly project(':modules:lang-painless')
}

integTestCluster {
    module project(':modules:lang-painless')
}
Original file line number Diff line number Diff line change
@@ -30,6 +30,7 @@
import org.elasticsearch.common.xcontent.XContentParser.Token;
import org.elasticsearch.index.fielddata.IndexFieldData;
import org.elasticsearch.index.query.QueryShardContext;
import org.elasticsearch.index.query.VectorDVIndexFieldData;
import org.elasticsearch.search.DocValueFormat;

import java.io.IOException;
@@ -119,8 +120,7 @@ public Query existsQuery(QueryShardContext context) {

@Override
public IndexFieldData.Builder fielddataBuilder(String fullyQualifiedIndexName) {
throw new UnsupportedOperationException(
"Field [" + name() + "] of type [" + typeName() + "] doesn't support sorting, scripting or aggregating");
return new VectorDVIndexFieldData.Builder();
}

@Override
Original file line number Diff line number Diff line change
@@ -30,6 +30,7 @@
import org.elasticsearch.common.xcontent.XContentParser.Token;
import org.elasticsearch.index.fielddata.IndexFieldData;
import org.elasticsearch.index.query.QueryShardContext;
import org.elasticsearch.index.query.VectorDVIndexFieldData;
import org.elasticsearch.search.DocValueFormat;

import java.io.IOException;
@@ -119,8 +120,7 @@ public Query existsQuery(QueryShardContext context) {

@Override
public IndexFieldData.Builder fielddataBuilder(String fullyQualifiedIndexName) {
throw new UnsupportedOperationException(
"Field [" + name() + "] of type [" + typeName() + "] doesn't support sorting, scripting or aggregating");
return new VectorDVIndexFieldData.Builder();
}

@Override
Original file line number Diff line number Diff line change
@@ -23,7 +23,7 @@
import org.apache.lucene.util.InPlaceMergeSorter;

// static utility functions for encoding and decoding dense_vector and sparse_vector fields
final class VectorEncoderDecoder {
public final class VectorEncoderDecoder {
static final byte INT_BYTES = 4;
static final byte SHORT_BYTES = 2;

@@ -34,7 +34,8 @@ private VectorEncoderDecoder() { }
* BytesRef: int[] floats encoded as integers values, 2 bytes for each dimension
* @param values - values of the sparse array
* @param dims - dims of the sparse array
* @param dimCount - number of the dimension
* @param dimCount - number of the dimensions, necessary as values and dims are dynamically created arrays,
* and may be over-allocated
* @return BytesRef
*/
static BytesRef encodeSparseVector(int[] dims, float[] values, int dimCount) {
@@ -66,9 +67,12 @@ static BytesRef encodeSparseVector(int[] dims, float[] values, int dimCount) {

/**
* Decodes the first part of BytesRef into sparse vector dimensions
* @param vectorBR - vector decoded in BytesRef
* @param vectorBR - sparse vector encoded in BytesRef
*/
static int[] decodeSparseVectorDims(BytesRef vectorBR) {
public static int[] decodeSparseVectorDims(BytesRef vectorBR) {
if (vectorBR == null) {
throw new IllegalStateException("A document doesn't have a value for a vector field!");
}
Contributor

Shouldn't this be an illegal argument exception?

int dimCount = vectorBR.length / (INT_BYTES + SHORT_BYTES);
int[] dims = new int[dimCount];
int offset = vectorBR.offset;
@@ -81,9 +85,12 @@ static int[] decodeSparseVectorDims(BytesRef vectorBR) {

/**
* Decodes the second part of the BytesRef into sparse vector values
* @param vectorBR - vector decoded in BytesRef
* @param vectorBR - sparse vector encoded in BytesRef
*/
static float[] decodeSparseVector(BytesRef vectorBR) {
public static float[] decodeSparseVector(BytesRef vectorBR) {
if (vectorBR == null) {
throw new IllegalStateException("A document doesn't have a value for a vector field!");
}
int dimCount = vectorBR.length / (INT_BYTES + SHORT_BYTES);
int offset = vectorBR.offset + SHORT_BYTES * dimCount; //calculate the offset from where values are encoded
float[] vector = new float[dimCount];
@@ -100,10 +107,14 @@ static float[] decodeSparseVector(BytesRef vectorBR) {


/**
Sort dimensions in the ascending order and
sort values in the same order as their corresponding dimensions
**/
static void sortSparseDimsValues(int[] dims, float[] values, int n) {
* Sorts dimensions in the ascending order and
* sorts values in the same order as their corresponding dimensions
*
* @param dims - dimensions of the sparse query vector
* @param values - values for the sparse query vector
* @param n - number of dimensions
*/
public static void sortSparseDimsValues(int[] dims, float[] values, int n) {
new InPlaceMergeSorter() {
@Override
public int compare(int i, int j) {
@@ -123,8 +134,14 @@ public void swap(int i, int j) {
}.sort(0, n);
}
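
The idea behind `sortSparseDimsValues` (reorder `dims` ascending while keeping each value paired with its dimension) can be sketched without Lucene. The version below is a swapped-in technique, not the PR's code: it sorts an index permutation with `Arrays.sort` instead of using `InPlaceMergeSorter`, and the class name is hypothetical. As in the PR's Javadoc, only the first `n` entries are considered, since the arrays may be over-allocated.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical stand-in for illustration; the PR itself uses Lucene's InPlaceMergeSorter. */
final class ParallelSortSketch {

    private ParallelSortSketch() { }

    /** Sorts the first n entries of dims ascending and reorders values to match. */
    static void sortSparseDimsValues(int[] dims, float[] values, int n) {
        // Build a permutation of indices ordered by the dimension they point at.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingInt(i -> dims[i]));

        // Apply the permutation to both arrays through temporary copies.
        int[] sortedDims = new int[n];
        float[] sortedValues = new float[n];
        for (int i = 0; i < n; i++) {
            sortedDims[i] = dims[order[i]];
            sortedValues[i] = values[order[i]];
        }
        System.arraycopy(sortedDims, 0, dims, 0, n);
        System.arraycopy(sortedValues, 0, values, 0, n);
    }

    public static void main(String[] args) {
        int[] dims = {10, 2, 50, 0};          // last slot is unused over-allocation
        float[] values = {1f, 2f, 3f, 9f};
        sortSparseDimsValues(dims, values, 3);
        System.out.println(Arrays.toString(dims));
        System.out.println(Arrays.toString(values));
    }
}
```

Unlike an in-place sorter, this variant allocates temporary arrays, which trades some memory for not depending on Lucene.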

// Decodes a BytesRef into an array of floats
static float[] decodeDenseVector(BytesRef vectorBR) {
/**
* Decodes a BytesRef into an array of floats
* @param vectorBR - dense vector encoded in BytesRef
*/
public static float[] decodeDenseVector(BytesRef vectorBR) {
if (vectorBR == null) {
throw new IllegalStateException("A document doesn't have a value for a vector field!");
}
int dimCount = vectorBR.length / INT_BYTES;
float[] vector = new float[dimCount];
int offset = vectorBR.offset;
Expand Down
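
The decode methods in this diff imply a simple byte layout: a dense vector uses 4 bytes per value, and a sparse vector stores all dimensions first (2 bytes each) followed by all values (4 bytes each). The following sketch reproduces that layout with `java.nio.ByteBuffer` on plain `byte[]` arrays rather than Lucene's `BytesRef`. The class name, the big-endian byte order, and reading dimensions as unsigned 16-bit numbers are assumptions of this example, not facts taken from the PR.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical round-trip sketch of the layout; not the PR's VectorEncoderDecoder. */
final class VectorLayoutSketch {
    static final int INT_BYTES = 4;
    static final int SHORT_BYTES = 2;

    private VectorLayoutSketch() { }

    /** Dense layout: each float stored as its 4-byte integer bit pattern. */
    static byte[] encodeDense(float[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * INT_BYTES);
        for (float v : values) {
            buf.putInt(Float.floatToIntBits(v));
        }
        return buf.array();
    }

    static float[] decodeDense(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        float[] vector = new float[bytes.length / INT_BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = Float.intBitsToFloat(buf.getInt());
        }
        return vector;
    }

    /** Sparse layout: dimCount dimensions (2 bytes each), then dimCount values (4 bytes each). */
    static byte[] encodeSparse(int[] dims, float[] values, int dimCount) {
        ByteBuffer buf = ByteBuffer.allocate(dimCount * (SHORT_BYTES + INT_BYTES));
        for (int i = 0; i < dimCount; i++) {
            buf.putShort((short) dims[i]);
        }
        for (int i = 0; i < dimCount; i++) {
            buf.putInt(Float.floatToIntBits(values[i]));
        }
        return buf.array();
    }

    static int[] decodeSparseDims(byte[] bytes) {
        int dimCount = bytes.length / (INT_BYTES + SHORT_BYTES);
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int[] dims = new int[dimCount];
        for (int i = 0; i < dimCount; i++) {
            dims[i] = buf.getShort() & 0xFFFF; // assumption: dimensions are unsigned
        }
        return dims;
    }

    static float[] decodeSparseValues(byte[] bytes) {
        int dimCount = bytes.length / (INT_BYTES + SHORT_BYTES);
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        buf.position(SHORT_BYTES * dimCount); // values start right after the dimensions
        float[] values = new float[dimCount];
        for (int i = 0; i < dimCount; i++) {
            values[i] = Float.intBitsToFloat(buf.getInt());
        }
        return values;
    }

    public static void main(String[] args) {
        byte[] dense = encodeDense(new float[]{4f, 3.4f, -1.2f});
        System.out.println(Arrays.toString(decodeDense(dense)));
        byte[] sparse = encodeSparse(new int[]{2, 10, 4545}, new float[]{-0.5f, 111.3f, -156f}, 3);
        System.out.println(Arrays.toString(decodeSparseDims(sparse)));
        System.out.println(Arrays.toString(decodeSparseValues(sparse)));
    }
}
```

Keeping the dimensions in one contiguous run lets a caller decode only the dimensions or only the values, which matches the two separate sparse decode methods in the diff.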
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.elasticsearch.index.query;


import org.elasticsearch.painless.spi.PainlessExtension;
import org.elasticsearch.painless.spi.Whitelist;
import org.elasticsearch.painless.spi.WhitelistLoader;
import org.elasticsearch.script.ScoreScript;
import org.elasticsearch.script.ScriptContext;

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class DocValuesWhitelistExtension implements PainlessExtension {

    private static final Whitelist WHITELIST =
        WhitelistLoader.loadFromResourceFiles(DocValuesWhitelistExtension.class, "docvalues_whitelist.txt");

    @Override
    public Map<ScriptContext<?>, List<Whitelist>> getContextWhitelists() {
        return Collections.singletonMap(ScoreScript.CONTEXT, Collections.singletonList(WHITELIST));
    }
}