Add dims parameter to dense_vector mapping #43444


Merged
Changes from 3 commits
14 changes: 7 additions & 7 deletions docs/reference/mapping/types/dense-vector.asciidoc
@@ -7,9 +7,7 @@ experimental[]

A `dense_vector` field stores dense vectors of float values.
The maximum number of dimensions that can be in a vector should
-not exceed 1024. The number of dimensions can be
-different across documents. A `dense_vector` field is
-a single-valued field.
+not exceed 1024. A `dense_vector` field is a single-valued field.

These vectors can be used for <<vector-functions,document scoring>>.
For example, a document score can represent a distance between
@@ -24,7 +22,8 @@ PUT my_index
"mappings": {
"properties": {
"my_vector": {
-        "type": "dense_vector"
+        "type": "dense_vector", <1>
Review comment (Contributor):
I think this callout <1> should go on the next line.

"dims": 3
},
"my_text" : {
"type" : "keyword"
@@ -42,13 +41,14 @@ PUT my_index/_doc/1
PUT my_index/_doc/2
{
"my_text" : "text2",
-  "my_vector" : [-0.5, 10, 10, 4]
+  "my_vector" : [-0.5, 10, 10]
}

--------------------------------------------------
// CONSOLE

<1> `dims`: the number of dimensions in the vector; this parameter is required.

Internally, each document's dense vector is encoded as a binary
doc value. Its size in bytes is equal to
-`4 * NUMBER_OF_DIMENSIONS`, where `NUMBER_OF_DIMENSIONS` -
-number of the vector's dimensions.
+`4 * dims`, where `dims` is the number of the vector's dimensions.
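The byte layout just described can be sketched as a stand-alone helper. This is an illustrative re-implementation under the stated layout (big-endian IEEE-754 bit patterns, `4 * dims` bytes total); the class and method names are hypothetical, not part of Elasticsearch:

```java
// Illustrative sketch: encode a float vector the way the dense_vector
// doc value is described above -- each float becomes its IEEE-754 bit
// pattern written big-endian, so the result is exactly 4 * dims bytes.
public class DenseVectorEncodingSketch {
    static final int INT_BYTES = 4;

    static byte[] encode(float[] vector) {
        byte[] buf = new byte[vector.length * INT_BYTES];
        int offset = 0;
        for (float value : vector) {
            int bits = Float.floatToIntBits(value);
            buf[offset++] = (byte) (bits >> 24);
            buf[offset++] = (byte) (bits >> 16);
            buf[offset++] = (byte) (bits >> 8);
            buf[offset++] = (byte) bits;
        }
        return buf;
    }

    public static void main(String[] args) {
        // The 3-dimensional example vector from the docs above.
        byte[] encoded = encode(new float[] {-0.5f, 10f, 10f});
        System.out.println(encoded.length); // 3 dims * 4 bytes = 12
    }
}
```

For a mapping with `dims: 3`, the stored doc value is twelve bytes regardless of the float values themselves.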
@@ -1,8 +1,8 @@
setup:
- skip:
features: headers
-    version: " - 7.2.99"
-    reason: "dense_vector functions were introduced in 7.3.0"
+    version: " - 7.99.99" # TODO: change to 7.2.99 after backport
+    reason: "dense_vector dims parameter was added from 8.0"

- do:
indices.create:
@@ -15,6 +15,7 @@ setup:
properties:
my_dense_vector:
type: dense_vector
dims: 5
- do:
index:
index: test-index
@@ -1,8 +1,8 @@
setup:
- skip:
features: headers
-    version: " - 7.2.99"
-    reason: "dense_vector functions were introduced in 7.3.0"
+    version: " - 7.99.99" # TODO: change to 7.2.99 after backport
+    reason: "dense_vector dims parameter was added from 8.0"

- do:
indices.create:
@@ -17,31 +17,36 @@ setup:
properties:
my_dense_vector:
type: dense_vector
dims: 3


---
-"Vectors of different dimensions and data types":
-  # document vectors of different dimensions
+"Indexing of Dense vectors should error when dims don't match defined in the mapping":

- do:
catch: bad_request
index:
index: test-index
id: 1
body:
-        my_dense_vector: [10]
+        my_dense_vector: [10, 2]
- match: { error.type: "mapper_parsing_exception" }

---
"Vectors of mixed integers and floats":
- do:
index:
index: test-index
-        id: 2
+        id: 1
body:
-        my_dense_vector: [10, 10.5]
+        my_dense_vector: [10, 10, 10]

- do:
index:
index: test-index
-        id: 3
+        id: 2
body:
-        my_dense_vector: [10, 10.5, 100.5]
+        my_dense_vector: [10.9, 10.9, 10.9]

- do:
indices.refresh: {}
@@ -59,14 +64,13 @@ setup:
script:
source: "cosineSimilarity(params.query_vector, doc['my_dense_vector'])"
params:
-            query_vector: [10]
+            query_vector: [10, 10, 10]

-  - match: {hits.total: 3}
+  - match: {hits.total: 2}
- match: {hits.hits.0._id: "1"}
- match: {hits.hits.1._id: "2"}
-  - match: {hits.hits.2._id: "3"}

-  # query vector of type double
+  # query vector of type float
- do:
headers:
Content-Type: application/json
@@ -79,12 +83,52 @@ setup:
script:
source: "cosineSimilarity(params.query_vector, doc['my_dense_vector'])"
params:
-            query_vector: [10.0]
+            query_vector: [10.0, 10.0, 10.0]

-  - match: {hits.total: 3}
+  - match: {hits.total: 2}
- match: {hits.hits.0._id: "1"}
- match: {hits.hits.1._id: "2"}
-  - match: {hits.hits.2._id: "3"}


---
"Functions with query vectors with dims different from docs vectors should error":
- do:
index:
index: test-index
id: 1
body:
my_dense_vector: [1, 2, 3]

- do:
indices.refresh: {}

- do:
catch: bad_request
search:
rest_total_hits_as_int: true
body:
query:
script_score:
query: {match_all: {} }
script:
source: "cosineSimilarity(params.query_vector, doc['my_dense_vector'])"
params:
query_vector: [1, 2, 3, 4]
- match: { error.root_cause.0.type: "script_exception" }

- do:
catch: bad_request
search:
rest_total_hits_as_int: true
body:
query:
script_score:
query: {match_all: {} }
script:
source: "dotProduct(params.query_vector, doc['my_dense_vector'])"
params:
query_vector: [1, 2, 3, 4]
- match: { error.root_cause.0.type: "script_exception" }
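The reason both functions reject a mismatched query vector can be seen from the arithmetic itself: a similarity pairs dimensions one-to-one, so the lengths must agree before any computation happens. The sketch below is an illustrative re-implementation of that check, not the actual Painless script functions:

```java
// Illustrative sketch of why cosineSimilarity / dotProduct must reject
// query vectors whose length differs from the document vector's dims.
public class SimilaritySketch {
    static double dotProduct(float[] query, float[] doc) {
        if (query.length != doc.length) {
            throw new IllegalArgumentException("query vector has [" + query.length +
                "] dimensions but the document vector has [" + doc.length + "]");
        }
        double dot = 0;
        for (int i = 0; i < query.length; i++) {
            dot += query[i] * doc[i];
        }
        return dot;
    }

    public static void main(String[] args) {
        float[] doc = {1f, 2f, 3f};
        System.out.println(dotProduct(new float[] {1f, 2f, 3f}, doc)); // 1 + 4 + 9 = 14.0
        try {
            dotProduct(new float[] {1f, 2f, 3f, 4f}, doc); // one dimension too many
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In the real tests above the same mismatch surfaces as a `script_exception` wrapping the underlying error.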

---
"Distance functions for documents missing vector field should return 0":
@@ -93,7 +137,7 @@ setup:
index: test-index
id: 1
body:
-        my_dense_vector: [10]
+        my_dense_vector: [10, 10, 10]

- do:
index:
@@ -117,7 +161,7 @@ setup:
script:
source: "cosineSimilarity(params.query_vector, doc['my_dense_vector'])"
params:
-            query_vector: [10.0]
+            query_vector: [10.0, 10.0, 10.0]

- match: {hits.total: 2}
- match: {hits.hits.0._id: "1"}
@@ -149,5 +193,5 @@ setup:
script:
source: "dotProductSparse(params.query_vector, doc['my_dense_vector'])"
params:
-            query_vector: {"2": 0.5, "10" : 111.3}
+            query_vector: {"2": 0.5, "10" : 111.3, "3": 44}
- match: { error.root_cause.0.type: "script_exception" }
@@ -12,10 +12,11 @@
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.search.DocValuesFieldExistsQuery;
import org.apache.lucene.search.Query;
-import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentParser.Token;
import org.elasticsearch.common.xcontent.support.XContentMapValues;
import org.elasticsearch.index.fielddata.IndexFieldData;
import org.elasticsearch.index.mapper.ArrayValueMapperParser;
import org.elasticsearch.index.mapper.FieldMapper;
@@ -56,12 +57,28 @@ public static class Defaults {
}

public static class Builder extends FieldMapper.Builder<Builder, DenseVectorFieldMapper> {
private int dims = 0;

public Builder(String name) {
super(name, Defaults.FIELD_TYPE, Defaults.FIELD_TYPE);
builder = this;
}

public Builder dims(int dims) {
if ((dims > MAX_DIMS_COUNT) || (dims < 1)) {
throw new MapperParsingException("The number of dimensions for field [" + name +
"] should be in the range [1, " + MAX_DIMS_COUNT + "]");
}
this.dims = dims;
return this;
}

@Override
protected void setupFieldType(BuilderContext context) {
super.setupFieldType(context);
fieldType().setDims(dims);
}

@Override
public DenseVectorFieldType fieldType() {
return (DenseVectorFieldType) super.fieldType();
@@ -80,11 +97,17 @@ public static class TypeParser implements Mapper.TypeParser {
@Override
public Mapper.Builder<?,?> parse(String name, Map<String, Object> node, ParserContext parserContext) throws MapperParsingException {
DenseVectorFieldMapper.Builder builder = new DenseVectorFieldMapper.Builder(name);
-            return builder;
Object dimsField = node.remove("dims");
if (dimsField == null) {
Review comment (Contributor):
Using the same reasoning as above, it would be nice to move this check to DenseVectorFieldMapper.Builder#build.

Reply (Contributor, author):
@jtibshirani Thanks, but we need this check here to ensure we don't get a NullPointerException on the line that follows:

int dims = XContentMapValues.nodeIntegerValue(dimsField);

Also, I was thinking that public Builder dims(int dims) can accept dims parameter as a primitive integer.

Review comment (jtibshirani, Contributor, Jul 2, 2019):
One approach would be to check if the node is null, and only parse it and call Builder#dims if it is not null. Then Builder#build could complain if Builder#dims has never been called. Anyways, this is a small comment, the change looks good to me!
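The alternative sketched in this comment could look roughly like the following; `Builder` here is a stripped-down hypothetical stand-in, not the real DenseVectorFieldMapper.Builder:

```java
// Sketch of the suggested flow: the parser only calls dims() when the
// property is present, and build() complains if dims() was never called.
public class DimsBuilderSketch {
    static final int MAX_DIMS_COUNT = 1024;

    static class Builder {
        private final String name;
        private Integer dims; // null until dims() is called

        Builder(String name) { this.name = name; }

        Builder dims(int dims) {
            if (dims > MAX_DIMS_COUNT || dims < 1) {
                throw new IllegalArgumentException("The number of dimensions for field [" + name +
                    "] should be in the range [1, " + MAX_DIMS_COUNT + "]");
            }
            this.dims = dims;
            return this;
        }

        int build() {
            if (dims == null) {
                throw new IllegalArgumentException(
                    "The [dims] property must be specified for field [" + name + "].");
            }
            return dims;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Builder("my_vector").dims(3).build()); // 3
        try {
            new Builder("my_vector").build(); // dims never set
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Either way the validation happens exactly once; the difference is only whether the "missing dims" complaint lives in the parser or in build().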

throw new MapperParsingException("The [dims] property must be specified for field [" + name + "].");
}
int dims = XContentMapValues.nodeIntegerValue(dimsField);
return builder.dims(dims);
}
}

public static final class DenseVectorFieldType extends MappedFieldType {
private int dims;

public DenseVectorFieldType() {}

@@ -96,6 +119,14 @@ public DenseVectorFieldType clone() {
return new DenseVectorFieldType(this);
}

int dims() {
return dims;
}

void setDims(int dims) {
this.dims = dims;
}

@Override
public String typeName() {
return CONTENT_TYPE;
@@ -143,39 +174,47 @@ public DenseVectorFieldType fieldType() {
@Override
public void parse(ParseContext context) throws IOException {
if (context.externalValueSet()) {
-            throw new IllegalArgumentException("Field [" + name() + "] of type [" + typeName() + "] can't be used in multi-fields");
+            throw new MapperParsingException("Field [" + name() + "] of type [" + typeName() + "] can't be used in multi-fields");
}
int dims = fieldType().dims(); //number of vector dimensions

// encode array of floats as array of integers and store into buf
        // this code is here and not in the VectorEncoderDecoder so as not to create extra arrays
-        byte[] buf = new byte[0];
+        byte[] buf = new byte[dims * INT_BYTES];
int offset = 0;
int dim = 0;
for (Token token = context.parser().nextToken(); token != Token.END_ARRAY; token = context.parser().nextToken()) {
if (dim++ >= dims) {
throw new MapperParsingException("Field [" + name() + "] of type [" + typeName() + "] of doc [" +
Review comment (Contributor):
It seems like IllegalArgumentException would make more sense here, since we are parsing a document and not a mapping.

Reply (mayya-sharipova, Contributor Author, Jun 28, 2019):

@jtibshirani I agree with you. This should be IllegalArgumentException, but these exceptions will be caught by DocumentParser::parseDocument and wrapped into a MapperParsingException, so they eventually surface as MapperParsingExceptions with an IllegalArgumentException as the cause underneath.
That's why I thought it made sense to throw MapperParsingException from the beginning.
What do you think: does it still make sense to have IllegalArgumentException underneath MapperParsingException?
Actually, I would really benefit from more clarification around exceptions; it may be worth discussing with the team.

context.sourceToParse().id() + "] has exceeded the number of dimensions [" + dims + "] defined in mapping");
}
ensureExpectedToken(Token.VALUE_NUMBER, token, context.parser()::getTokenLocation);
float value = context.parser().floatValue(true);
-            if (buf.length < (offset + INT_BYTES)) {
-                buf = ArrayUtil.grow(buf, (offset + INT_BYTES));
-            }
int intValue = Float.floatToIntBits(value);
-            buf[offset] = (byte) (intValue >> 24);
-            buf[offset+1] = (byte) (intValue >> 16);
-            buf[offset+2] = (byte) (intValue >> 8);
-            buf[offset+3] = (byte) intValue;
-            offset += INT_BYTES;
-            if (dim++ >= MAX_DIMS_COUNT) {
-                throw new IllegalArgumentException("Field [" + name() + "] of type [" + typeName() +
-                    "] has exceeded the maximum allowed number of dimensions of [" + MAX_DIMS_COUNT + "]");
-            }
+            buf[offset++] = (byte) (intValue >> 24);
+            buf[offset++] = (byte) (intValue >> 16);
+            buf[offset++] = (byte) (intValue >> 8);
+            buf[offset++] = (byte) intValue;
}
if (dim != dims) {
throw new MapperParsingException("Field [" + name() + "] of type [" + typeName() + "] of doc [" +
context.sourceToParse().id() + "] has number of dimensions [" + dim +
"] less than defined in the mapping [" + dims +"]");
}
BinaryDocValuesField field = new BinaryDocValuesField(fieldType().name(), new BytesRef(buf, 0, offset));
if (context.doc().getByKey(fieldType().name()) != null) {
-            throw new IllegalArgumentException("Field [" + name() + "] of type [" + typeName() +
+            throw new MapperParsingException("Field [" + name() + "] of type [" + typeName() +
                "] doesn't support indexing multiple values for the same field in the same document");
}
context.doc().addWithKey(fieldType().name(), field);
}

@Override
protected void doXContentBody(XContentBuilder builder, boolean includeDefaults, Params params) throws IOException {
super.doXContentBody(builder, includeDefaults, params);
builder.field("dims", fieldType().dims());
}

@Override
protected void parseCreateField(ParseContext context, List<IndexableField> fields) {
throw new AssertionError("parse is implemented directly");
@@ -162,12 +162,11 @@ public static float[] decodeDenseVector(BytesRef vectorBR) {
float[] vector = new float[dimCount];
int offset = vectorBR.offset;
for (int dim = 0; dim < dimCount; dim++) {
-            int intValue = ((vectorBR.bytes[offset] & 0xFF) << 24) |
-                ((vectorBR.bytes[offset+1] & 0xFF) << 16) |
-                ((vectorBR.bytes[offset+2] & 0xFF) << 8) |
-                (vectorBR.bytes[offset+3] & 0xFF);
+            int intValue = ((vectorBR.bytes[offset++] & 0xFF) << 24) |
+                ((vectorBR.bytes[offset++] & 0xFF) << 16) |
+                ((vectorBR.bytes[offset++] & 0xFF) << 8) |
+                (vectorBR.bytes[offset++] & 0xFF);
            vector[dim] = Float.intBitsToFloat(intValue);
-            offset = offset + INT_BYTES;
}
return vector;
}
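A round-trip check ties this decoder to the encoding done in the mapper; `encode` and `decode` below are hypothetical free-standing versions operating on a plain byte[] rather than a BytesRef:

```java
import java.util.Arrays;

// Round-trip sketch: floatToIntBits/intBitsToFloat are exact inverses
// for ordinary float values, so decode(encode(v)) reproduces v.
public class VectorRoundTripSketch {
    static final int INT_BYTES = 4;

    static byte[] encode(float[] vector) {
        byte[] buf = new byte[vector.length * INT_BYTES];
        int offset = 0;
        for (float value : vector) {
            int bits = Float.floatToIntBits(value);
            buf[offset++] = (byte) (bits >> 24);
            buf[offset++] = (byte) (bits >> 16);
            buf[offset++] = (byte) (bits >> 8);
            buf[offset++] = (byte) bits;
        }
        return buf;
    }

    static float[] decode(byte[] bytes) {
        int dimCount = bytes.length / INT_BYTES;
        float[] vector = new float[dimCount];
        int offset = 0;
        for (int dim = 0; dim < dimCount; dim++) {
            int intValue = ((bytes[offset++] & 0xFF) << 24) |
                ((bytes[offset++] & 0xFF) << 16) |
                ((bytes[offset++] & 0xFF) << 8) |
                (bytes[offset++] & 0xFF);
            vector[dim] = Float.intBitsToFloat(intValue);
        }
        return vector;
    }

    public static void main(String[] args) {
        float[] original = {10.9f, -0.5f, 100.5f};
        System.out.println(Arrays.equals(original, decode(encode(original)))); // true
    }
}
```

The `offset++` rewrite in this hunk changes only how the cursor advances; the decoded values are byte-for-byte identical to the previous version.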