Synthetic _source: support dense_vector #89840

Merged
5 changes: 5 additions & 0 deletions docs/changelog/89840.yaml
@@ -0,0 +1,5 @@
pr: 89840
summary: "Synthetic _source: support `dense_vector`"
area: Vector Search
type: feature
issues: []
1 change: 1 addition & 0 deletions docs/reference/mapping/fields/synthetic-source.asciidoc
@@ -31,6 +31,7 @@ types:
** <<aggregate-metric-double-synthetic-source, `aggregate_metric_double`>>
** <<boolean-synthetic-source,`boolean`>>
** <<numeric-synthetic-source,`byte`>>
** <<dense-vector-synthetic-source,`dense_vector`>>
** <<numeric-synthetic-source,`double`>>
** <<numeric-synthetic-source,`float`>>
** <<geo-point-synthetic-source,`geo_point`>>
4 changes: 4 additions & 0 deletions docs/reference/mapping/types/dense-vector.asciidoc
@@ -178,3 +178,7 @@ Defaults to `16`.
The number of candidates to track while assembling the list of nearest
neighbors for each new node. Defaults to `100`.
====

[[dense-vector-synthetic-source]]
==== Synthetic source preview:[]
`dense_vector` fields support <<synthetic-source,synthetic `_source`>>.
@@ -454,3 +454,90 @@ stored keyword with ignore_above:
- short
- jumped over the lazy dog # fields saved by ignore_above are returned after doc values fields
- is_false: fields

---
indexed dense vectors:
  - skip:
      version: " - 8.4.99"
      reason: introduced in 8.5.0

  - do:
      indices.create:
        index: test
        body:
          mappings:
            _source:
              mode: synthetic
            properties:
              name:
                type: keyword
              vector:
                type: dense_vector
                dims: 5
                index: true
                similarity: l2_norm

  - do:
      index:
        index: test
        id: 1
        body:
          name: cow.jpg
          vector: [ 230.0, 300.33, -34.8988, 15.555, -200.0 ]

  - do:
      get:
        index: test
        id: 1
  - match: {_index: "test"}
  - match: {_id: "1"}
  - match: {_version: 1}
  - match: {found: true}
  - match:
      _source:
        name: cow.jpg
        vector: [ 230.0, 300.33, -34.8988, 15.555, -200.0 ]
  - is_false: fields

---
non-indexed dense vectors:
  - skip:
      version: " - 8.4.99"
      reason: introduced in 8.5.0

  - do:
      indices.create:
        index: test
        body:
          mappings:
            _source:
              mode: synthetic
            properties:
              name:
                type: keyword
              vector:
                type: dense_vector
                dims: 5
                index: false

  - do:
      index:
        index: test
        id: 1
        body:
          name: cow.jpg
          vector: [ 230.0, 300.33, -34.8988, 15.555, -200.0 ]

  - do:
      get:
        index: test
        id: 1
  - match: {_index: "test"}
  - match: {_id: "1"}
  - match: {_version: 1}
  - match: {found: true}
  - match:
      _source:
        name: cow.jpg
        vector: [ 230.0, 300.33, -34.8988, 15.555, -200.0 ]
  - is_false: fields
@@ -13,7 +13,10 @@
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.index.VectorValues;
import org.apache.lucene.search.FieldExistsQuery;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.Query;
@@ -31,6 +34,7 @@
import org.elasticsearch.index.mapper.MappingLookup;
import org.elasticsearch.index.mapper.MappingParser;
import org.elasticsearch.index.mapper.SimpleMappedFieldType;
import org.elasticsearch.index.mapper.SourceLoader;
import org.elasticsearch.index.mapper.TextSearchInfo;
import org.elasticsearch.index.mapper.ValueFetcher;
import org.elasticsearch.index.query.SearchExecutionContext;
@@ -45,6 +49,7 @@
import java.time.ZoneId;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Stream;

import static org.elasticsearch.common.xcontent.XContentParserUtils.ensureExpectedToken;

@@ -525,4 +530,97 @@ public KnnVectorsFormat getKnnVectorsFormatForField() {
            return new Lucene94HnswVectorsFormat(hnswIndexOptions.m, hnswIndexOptions.efConstruction);
        }
    }

    @Override
    public SourceLoader.SyntheticFieldLoader syntheticFieldLoader() {
        if (copyTo.copyToFields().isEmpty() != true) {
            throw new IllegalArgumentException(
                "field [" + name() + "] of type [" + typeName() + "] doesn't support synthetic source because it declares copy_to"
            );
        }
        if (indexed) {
            return new IndexedSyntheticFieldLoader();
        }
        return new DocValuesSyntheticFieldLoader();
    }

    private class IndexedSyntheticFieldLoader implements SourceLoader.SyntheticFieldLoader {
        private VectorValues values;

Review comment (Member Author): We've been using the Lucene APIs directly for SyntheticFieldLoader subclasses. I believe it saves allocating a float[] on the doc values version, which is nice but not a huge thing. But I did it just to line up with the other implementations.

        private boolean hasValue;

        @Override
        public Stream<Map.Entry<String, StoredFieldLoader>> storedFieldLoaders() {
            return Stream.of();
        }

        @Override
        public DocValuesLoader docValuesLoader(LeafReader leafReader, int[] docIdsInLeaf) throws IOException {
            values = leafReader.getVectorValues(name());
            if (values == null) {
                return null;
            }
            return docId -> {
                hasValue = docId == values.advance(docId);
                return hasValue;
            };
        }

        @Override
        public boolean hasValue() {
            return hasValue;
        }

        @Override
        public void write(XContentBuilder b) throws IOException {
            if (false == hasValue) {
                return;
            }
            b.startArray(simpleName());
            for (float v : values.vectorValue()) {
                b.value(v);
            }
            b.endArray();
        }
    }

    private class DocValuesSyntheticFieldLoader implements SourceLoader.SyntheticFieldLoader {
        private BinaryDocValues values;
        private boolean hasValue;

        @Override
        public Stream<Map.Entry<String, StoredFieldLoader>> storedFieldLoaders() {
            return Stream.of();
        }

        @Override
        public DocValuesLoader docValuesLoader(LeafReader leafReader, int[] docIdsInLeaf) throws IOException {
            values = leafReader.getBinaryDocValues(name());
            if (values == null) {

Review comment (Contributor): This needs to be leafReader.getBinary(name()) because DocValues.getBinary() never returns null.

                return null;
            }
            return docId -> {
                hasValue = docId == values.advance(docId);
                return hasValue;
            };
        }

        @Override
        public boolean hasValue() {
            return hasValue;
        }

        @Override
        public void write(XContentBuilder b) throws IOException {
            if (false == hasValue) {
                return;
            }
            b.startArray(simpleName());
            BytesRef ref = values.binaryValue();
            ByteBuffer byteBuffer = ByteBuffer.wrap(ref.bytes, ref.offset, ref.length);
            for (int dim = 0; dim < dims; dim++) {
                b.value(byteBuffer.getFloat());
            }
            b.endArray();
        }
    }
}
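
The `write` method of `DocValuesSyntheticFieldLoader` above reads `dims` floats from a byte slice via `ByteBuffer.wrap(...).getFloat()`. The standalone sketch below illustrates that decoding step with plain JDK classes. It assumes the doc value is simply one 4-byte big-endian float per dimension, laid out back to back; that layout is inferred from the reading loop, and whether the real encoding carries anything beyond the floats is not shown in this diff. `VectorBytesSketch`, `encode`, and `decode` are hypothetical names used only for illustration and are not part of the PR.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the PR.
final class VectorBytesSketch {

    // Assumed layout: one 4-byte float per dimension, back to back.
    // ByteBuffer defaults to big-endian byte order.
    static byte[] encode(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES);
        for (float v : vector) {
            buffer.putFloat(v);
        }
        return buffer.array();
    }

    // Mirrors the loop in write(): wrap the (bytes, offset, length) slice
    // and read exactly `dims` floats from it.
    static float[] decode(byte[] bytes, int offset, int length, int dims) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes, offset, length);
        float[] vector = new float[dims];
        for (int dim = 0; dim < dims; dim++) {
            vector[dim] = buffer.getFloat();
        }
        return vector;
    }

    public static void main(String[] args) {
        // Same values as the vector used in the YAML tests.
        float[] original = { 230.0f, 300.33f, -34.8988f, 15.555f, -200.0f };
        byte[] encoded = encode(original);
        float[] decoded = decode(encoded, 0, encoded.length, original.length);
        System.out.println(Arrays.equals(original, decoded));
    }
}
```

Because the floats are copied bit for bit, the round trip is exact, which is consistent with the YAML tests expecting the same vector values back in `_source`.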
@@ -32,6 +32,7 @@
import org.elasticsearch.index.mapper.ParsedDocument;
import org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapper.DenseVectorFieldType;
import org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapper.VectorSimilarity;
import org.elasticsearch.test.ESTestCase;
import org.elasticsearch.xcontent.XContentBuilder;
import org.junit.AssumptionViolatedException;

@@ -465,12 +466,50 @@ public void testKnnVectorsFormat() throws IOException {
    }

    @Override
-   protected SyntheticSourceSupport syntheticSourceSupport() {
    protected IngestScriptSupport ingestScriptSupport() {
        throw new AssumptionViolatedException("not supported");
    }

    @Override
-   protected IngestScriptSupport ingestScriptSupport() {
-       throw new AssumptionViolatedException("not supported");
    protected SyntheticSourceSupport syntheticSourceSupport() {
        return new DenseVectorSyntheticSourceSupport();
    }

    @Override
    protected boolean supportsEmptyInputArray() {
        return false;
    }

    private static class DenseVectorSyntheticSourceSupport implements SyntheticSourceSupport {
        private final int dims = between(5, 1000);
        private final boolean indexed = randomBoolean();
        private final boolean indexOptionsSet = indexed && randomBoolean();

        @Override
        public SyntheticSourceExample example(int maxValues) throws IOException {
            List<Float> value = randomList(dims, dims, ESTestCase::randomFloat);
            return new SyntheticSourceExample(value, value, this::mapping);
        }

        private void mapping(XContentBuilder b) throws IOException {
            b.field("type", "dense_vector");
            b.field("dims", dims);
            if (indexed) {
                b.field("index", true);
                b.field("similarity", "l2_norm");
                if (indexOptionsSet) {
                    b.startObject("index_options");
                    b.field("type", "hnsw");
                    b.field("m", 5);
                    b.field("ef_construction", 50);
                    b.endObject();
                }
            }
        }

        @Override
        public List<SyntheticSourceInvalidExample> invalidExample() throws IOException {
            return List.of();

Review comment (Member Author): It looks to me like there is nothing like ignore_above, ignore_malformed, or doc_values: false on this field type.

        }
    }
}
@@ -914,6 +914,19 @@ public final void testSyntheticEmptyList() throws IOException {

    public final void testSyntheticEmptyListNoDocValuesLoader() throws IOException {
        assumeTrue("Field does not support [] as input", supportsEmptyInputArray());
        assertNoDocValueLoader(b -> b.startArray("field").endArray());
    }

    public final void testEmptyDocumentNoDocValueLoader() throws IOException {
        assumeFalse("Field will add values even if no fields are supplied", addsValueWhenNotSupplied());
        assertNoDocValueLoader(b -> {});
    }

    protected boolean addsValueWhenNotSupplied() {
        return false;
    }

    private void assertNoDocValueLoader(CheckedConsumer<XContentBuilder, IOException> doc) throws IOException {
        SyntheticSourceExample syntheticSourceExample = syntheticSourceSupport().example(5);
        DocumentMapper mapper = createDocumentMapper(syntheticSourceMapping(b -> {
            b.startObject("field");
@@ -922,8 +935,7 @@ public final void testSyntheticEmptyListNoDocValuesLoader() throws IOException {
        }));
        try (Directory directory = newDirectory()) {
            RandomIndexWriter iw = new RandomIndexWriter(random(), directory);
-           LuceneDocument doc = mapper.parse(source(b -> b.startArray("field").endArray())).rootDoc();
-           iw.addDocument(doc);
            iw.addDocument(mapper.parse(source(doc)).rootDoc());
            iw.close();
            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                LeafReader leafReader = getOnlyLeafReader(reader);
@@ -243,4 +243,9 @@ public void testNullValueSyntheticSource() throws IOException {
    protected boolean supportsEmptyInputArray() {
        return false;
    }

    @Override
    protected boolean addsValueWhenNotSupplied() {
        return true;
    }
}