-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Add dims parameter to dense_vector mapping #43444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add dims parameter to dense_vector mapping #43444
Conversation
Pinging @elastic/es-search |
@elasticmachine run elasticsearch-ci/1 |
@elasticmachine run elasticsearch-ci/bwc |
@elasticmachine run elasticsearch-ci/default-distro |
Typically, dense vectors of both documents and queries must have the same number of dimensions. Different number of dimensions among documents or query vector indicate an error. This PR enforces that all vectors for the same field have the same number of dimensions. It also enforces that query vectors have the same number of dimensions.
3179461
to
bac0c34
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mayya-sharipova, this change makes sense to me. For the kinds of use cases we're targeting with vector fields, it does seem that having vectors with mixed dimensions almost always indicates an error, and it is helpful to be alerted to it. I can't think of a case where this behavior would be overly restrictive (and couldn't be worked around easily).
While reviewing this change, I noticed that for the cosineSimilarity
and dotProduct
functions, we accept documents without the field and always return 0. This behavior isn't very intuitive to me -- I would expect for the similarity function to error, and that the user should explicitly handle documents without the field in the script. The current behavior could mask errors in the data (the type of error we're trying to avoid with the current change).
...gin/vectors/src/main/java/org/elasticsearch/xpack/vectors/mapper/DenseVectorFieldMapper.java
Show resolved
Hide resolved
int offset = 0; | ||
int dim = 0; | ||
for (Token token = context.parser().nextToken(); token != Token.END_ARRAY; token = context.parser().nextToken()) { | ||
if (dim++ >= dims) { | ||
throw new MapperParsingException("Field [" + name() + "] of type [" + typeName() + "] of doc [" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like IllegalArgumentException
would make more sense here, since we are parsing a document and not a mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtibshirani I agree with you. This should be IllegalArgumentException
, but these exceptions will be caught by DocumentParser::parseDocument and wrapped into the MapperParsingException
, which eventually come up as MapperParsingExceptions
with a cause underneath it as IllegalArgumentException
.
That's why I thought it makes sense to throw MapperParsingException
from the beginning.
What do you think - does it still makes sense to have IllegalArgumentException
underneath MapperParsingException
?
Actually, I would really benefit from more clarification around exceptions, may be worth discussion with the team.
...lugin/vectors/src/main/java/org/elasticsearch/xpack/vectors/mapper/VectorEncoderDecoder.java
Show resolved
Hide resolved
x-pack/plugin/vectors/src/main/java/org/elasticsearch/xpack/vectors/query/ScoreScriptUtils.java
Show resolved
Hide resolved
x-pack/plugin/vectors/src/main/java/org/elasticsearch/xpack/vectors/query/ScoreScriptUtils.java
Show resolved
Hide resolved
@jtibshirani Thanks for your feedback, the next commit in the PR tries to address it.
The way the API and As there is not way for users to skip documents without a vector field, I chose to have a default value |
…s_to_dense_vector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exceptions. I was wondering if it makes difference for you to have a single-level MapperParsingException or MapperParsingException with wrapped IllegalArgumentException?
Thanks @mayya-sharipova, I didn't see that MapperParsingException is also used when parsing documents. I don't have a strong preference, I think I'd still lean towards IllegalArgumentException since it's more commonly used in this context (I verified this through a quick code search).
The way the API and VectorScriptDocValues are designed right now it is NOT possible to check if a document has a value for a vector field.
I wonder if it'd at least make sense to support size()
? In any case, it is not too relevant to this PR and I agree we can wait for more feedback to see if we need to rethink the VectorScriptDocValues
API.
The last comment I had is that we should probably update the line "
If a document’s dense vector field has a number of dimensions different from the query’s vector..." from this documentation: https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-script-score-query.html#vector-functions
@@ -24,7 +22,8 @@ PUT my_index | |||
"mappings": { | |||
"properties": { | |||
"my_vector": { | |||
"type": "dense_vector" | |||
"type": "dense_vector", <1> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this callout <1>
should go on the next line.
@@ -80,11 +97,17 @@ public DenseVectorFieldMapper build(BuilderContext context) { | |||
@Override | |||
public Mapper.Builder<?,?> parse(String name, Map<String, Object> node, ParserContext parserContext) throws MapperParsingException { | |||
DenseVectorFieldMapper.Builder builder = new DenseVectorFieldMapper.Builder(name); | |||
return builder; | |||
Object dimsField = node.remove("dims"); | |||
if (dimsField == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the same reasoning as above, it would be nice to move this check to DenseVectorFieldMapper.Builder#build
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtibshirani Thanks, but we need this check here to ensure we don't get NullPointerException
on the line that follows after that:
int dims = XContentMapValues.nodeIntegerValue(dimsField);
Also, I was thinking that public Builder dims(int dims)
can accept dims
parameter as a primitive integer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One approach would be to check if the node is null, and only parse it and call Builder#dims
if it is not null. Then Builder#build
could complain if Builder#dims
has never been called. Anyways, this is a small comment, the change looks good to me!
...lugin/vectors/src/main/java/org/elasticsearch/xpack/vectors/mapper/VectorEncoderDecoder.java
Show resolved
Hide resolved
@elasticmachine test this please |
@jtibshirani Thanks Julie, I have addressed your 2nd feedback in the last commit.
Right, let's do it in a separate PR. |
@elasticmachine test this please |
…s_to_dense_vector
@elasticmachine run elasticsearch-ci/packaging-sample |
1 similar comment
@elasticmachine run elasticsearch-ci/packaging-sample |
…s_to_dense_vector
@elasticmachine test this please |
Typically, dense vectors of both documents and queries must have the same number of dimensions. Different number of dimensions among documents or query vector indicate an error. This PR enforces that all vectors for the same field have the same number of dimensions. It also enforces that query vectors have the same number of dimensions.
Typically, dense vectors of both documents and queries must have the same number of dimensions. Different number of dimensions among documents or query vector indicate an error. This PR enforces that all vectors for the same field have the same number of dimensions. It also enforces that query vectors have the same number of dimensions.
After backport to 7.3 adjust yml tests Relates to #43444
Typically, dense vectors of both documents and queries must have the same
number of dimensions. Different number of dimensions among documents
or query vector indicate an error. This PR enforces that all vectors
for the same field have the same number of dimensions. It also enforces
that query vectors have the same number of dimensions as doc vectors.