From e9921b44e53e33afc61afcd9dc7c4b386fde6509 Mon Sep 17 00:00:00 2001 From: Alex Ksikes Date: Fri, 29 Aug 2014 14:32:22 +0200 Subject: [PATCH 1/8] Term Vectors: Support for artificial documents This adds the ability to the Term Vector API to generate term vectors for artifical documents, that is for documents not present in the index. Following a similar syntax to the Percolator API, a new 'doc' parameter is used, instead of '_id', that specifies the document of interest. The parameters '_index' and '_type' determine the mapping and therefore analyzers to apply to each value field. --- .../reference/docs/multi-termvectors.asciidoc | 35 ++++++- docs/reference/docs/termvectors.asciidoc | 41 +++++++-- .../termvector/MultiTermVectorsRequest.java | 1 - .../action/termvector/TermVectorRequest.java | 49 +++++++++- .../termvector/TermVectorRequestBuilder.java | 21 +++++ .../action/termvector/TermVectorResponse.java | 25 +++-- .../action/termvector/TermVectorWriter.java | 1 - .../java/org/elasticsearch/client/Client.java | 7 +- .../client/support/AbstractClient.java | 5 + .../index/mapper/ParseContext.java | 26 ++++++ .../termvectors/ShardTermVectorService.java | 91 +++++++++++++++---- .../termvector/RestTermVectorAction.java | 2 + .../action/termvector/GetTermVectorTests.java | 79 +++++++++++++--- 13 files changed, 322 insertions(+), 61 deletions(-) diff --git a/docs/reference/docs/multi-termvectors.asciidoc b/docs/reference/docs/multi-termvectors.asciidoc index a3a9aa051b5ec..0ad1209dc42f1 100644 --- a/docs/reference/docs/multi-termvectors.asciidoc +++ b/docs/reference/docs/multi-termvectors.asciidoc @@ -1,7 +1,10 @@ [[docs-multi-termvectors]] == Multi termvectors API -Multi termvectors API allows to get multiple termvectors based on an index, type and id. The response includes a `docs` +Multi termvectors API allows to get multiple termvectors at once. The +documents from which to retrieve the term vectors are specified by an index, +type and id. But the documents could also be artificially provided coming[1.4.0]. +The response includes a `docs` array with all the fetched termvectors, each element having the structure provided by the <> API. Here is an example: @@ -89,4 +92,32 @@ curl 'localhost:9200/testidx/test/_mtermvectors' -d '{ }' -------------------------------------------------- -Parameters can also be set by passing them as uri parameters (see <>). uri parameters are the default parameters and are overwritten by any parameter setting defined in the body. +Additionally coming[1.4.0], just like for the <> +API, term vectors could be generated for user provided documents. The syntax +is similar to the <> API. The mapping used is +determined by `_index` and `_type`. Term statistics are disallowed as they +depend on the shard processing the request. + +[source,js] +-------------------------------------------------- +curl 'localhost:9200/_mtermvectors' -d '{ + "docs": [ + { + "_index": "testidx", + "_type": "test", + "doc" : { + "fullname" : "John Doe", + "text" : "twitter test test test" + } + }, + { + "_index": "testidx", + "_type": "test", + "doc" : { + "fullname" : "Jane Doe", + "text" : "Another twitter test ..." + } + } + ] +}' +-------------------------------------------------- diff --git a/docs/reference/docs/termvectors.asciidoc b/docs/reference/docs/termvectors.asciidoc index f0c16572f12b2..1160eb155a974 100644 --- a/docs/reference/docs/termvectors.asciidoc +++ b/docs/reference/docs/termvectors.asciidoc @@ -3,10 +3,11 @@ added[1.0.0.Beta1] -Returns information and statistics on terms in the fields of a -particular document as stored in the index. Note that this is a -near realtime API as the term vectors are not available until the -next refresh. +Returns information and statistics on terms in the fields of a particular +document. The document could be stored in the index or artificially provided +by the user coming[1.4.0]. Note that for documents stored in the index, this +is a near realtime API as the term vectors are not available until the next +refresh. [source,js] -------------------------------------------------- @@ -41,10 +42,10 @@ statistics are returned for all fields but no term statistics. * term payloads (`payloads` : true), as base64 encoded bytes If the requested information wasn't stored in the index, it will be -computed on the fly if possible. See <> -for how to configure your index to store term vectors. +computed on the fly if possible. Additionally, term vectors could be computed +for documents not even existing in the index, but instead provided by the user. -coming[1.4.0,The ability to computed term vectors on the fly is only available from 1.4.0 onwards (see below)] +coming[1.4.0,The ability to computed term vectors on the fly as well as support for artificial documents is only available from 1.4.0 onwards (see below example 2 and 3 respectively)] [WARNING] ====== @@ -86,7 +87,8 @@ The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute -numbers have no meaning in this context. +numbers have no meaning in this context. The term statistics are disabled +when using artificial documents. [float] === Example 1 @@ -231,7 +233,7 @@ Response: [float] === Example 2 coming[1.4.0] -Additionally, term vectors which are not explicitly stored in the index are automatically +Term vectors which are not explicitly stored in the index are automatically computed on the fly. The following request returns all information and statistics for the fields in document `1`, even though the terms haven't been explicitly stored in the index. Note that for the field `text`, the terms are not re-generated. @@ -246,3 +248,24 @@ curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{ "field_statistics" : true }' -------------------------------------------------- + +[float] +=== Example 3 coming[1.4.0] + +Additionally, term vectors can also be generated for artificial documents, +that is for documents not present in the index. The syntax is similar to the +<> API. For example, the following request would +return the same results as in example 1. The mapping used is determined by the +`index` and `type`. Term statistics are disallowed as they depend on the shard +processing the request. + +[source,js] +-------------------------------------------------- +curl -XGET 'http://localhost:9200/twitter/tweet/_termvector' -d '{ + "doc" : { + "fullname" : "John Doe", + "text" : "twitter test test test" + } +}' +-------------------------------------------------- + diff --git a/src/main/java/org/elasticsearch/action/termvector/MultiTermVectorsRequest.java b/src/main/java/org/elasticsearch/action/termvector/MultiTermVectorsRequest.java index 9c2aa515e4c32..8de48f746b18a 100644 --- a/src/main/java/org/elasticsearch/action/termvector/MultiTermVectorsRequest.java +++ b/src/main/java/org/elasticsearch/action/termvector/MultiTermVectorsRequest.java @@ -90,7 +90,6 @@ public void add(TermVectorRequest template, BytesReference data) throws Exceptio if (token == XContentParser.Token.FIELD_NAME) { currentFieldName = parser.currentName(); } else if (token == XContentParser.Token.START_ARRAY) { - if ("docs".equals(currentFieldName)) { while ((token = parser.nextToken()) != XContentParser.Token.END_ARRAY) { if (token != XContentParser.Token.START_OBJECT) { diff --git a/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java b/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java index 9b1ccdc287a84..627db4be31a01 100644 --- a/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java +++ b/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java @@ -26,12 +26,17 @@ import org.elasticsearch.action.ValidateActions; import org.elasticsearch.action.get.MultiGetRequest; import org.elasticsearch.action.support.single.shard.SingleShardOperationRequest; +import org.elasticsearch.common.bytes.BytesReference; import org.elasticsearch.common.io.stream.StreamInput; import org.elasticsearch.common.io.stream.StreamOutput; +import org.elasticsearch.common.xcontent.XContentBuilder; import org.elasticsearch.common.xcontent.XContentParser; import java.io.IOException; import java.util.*; +import java.util.concurrent.atomic.AtomicInteger; + +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder; /** * Request returning the term vector (doc frequency, positions, offsets) for a @@ -46,10 +51,14 @@ public class TermVectorRequest extends SingleShardOperationRequest selectedFields; @@ -129,6 +138,23 @@ public TermVectorRequest id(String id) { return this; } + /** + * Returns the artificial document from which term vectors are requested for. + */ + public BytesReference doc() { + return doc; + } + + /** + * Sets an artificial document from which term vectors are requested for. + */ + public TermVectorRequest doc(XContentBuilder documentBuilder) { + // assign a random id to this artificial document, for routing + this.id(String.valueOf(randomInt.getAndAdd(1))); + this.doc = documentBuilder.bytes(); + return this; + } + /** * @return The routing for this request. */ @@ -281,8 +307,8 @@ public ActionRequestValidationException validate() { if (type == null) { validationException = ValidateActions.addValidationError("type is missing", validationException); } - if (id == null) { - validationException = ValidateActions.addValidationError("id is missing", validationException); + if (id == null && doc == null) { + validationException = ValidateActions.addValidationError("id or doc is missing", validationException); } return validationException; } @@ -303,6 +329,9 @@ public void readFrom(StreamInput in) throws IOException { } type = in.readString(); id = in.readString(); + if (in.readBoolean()) { + doc = in.readBytesReference(); + } routing = in.readOptionalString(); preference = in.readOptionalString(); long flags = in.readVLong(); @@ -331,6 +360,10 @@ public void writeTo(StreamOutput out) throws IOException { } out.writeString(type); out.writeString(id); + out.writeBoolean(doc != null); + if (doc != null) { + out.writeBytesReference(doc); + } out.writeOptionalString(routing); out.writeOptionalString(preference); long longFlags = 0; @@ -381,6 +414,9 @@ public static void parseRequest(TermVectorRequest termVectorRequest, XContentPar } else if (currentFieldName.equals("payloads")) { termVectorRequest.payloads(parser.booleanValue()); } else if (currentFieldName.equals("term_statistics") || currentFieldName.equals("termStatistics")) { + if (parser.booleanValue() && termVectorRequest.doc != null) { + throw new ElasticsearchParseException("Term statistics are not supported with artificial documents."); + } termVectorRequest.termStatistics(parser.booleanValue()); } else if (currentFieldName.equals("field_statistics") || currentFieldName.equals("fieldStatistics")) { termVectorRequest.fieldStatistics(parser.booleanValue()); @@ -389,7 +425,15 @@ public static void parseRequest(TermVectorRequest termVectorRequest, XContentPar } else if ("_type".equals(currentFieldName)) { termVectorRequest.type = parser.text(); } else if ("_id".equals(currentFieldName)) { + if (termVectorRequest.doc != null) { + throw new ElasticsearchParseException("Either \"id\" or \"doc\" can be specified, but not both!"); + } termVectorRequest.id = parser.text(); + } else if ("doc".equals(currentFieldName)) { + if (termVectorRequest.id != null) { + throw new ElasticsearchParseException("Either \"id\" or \"doc\" can be specified, but not both!"); + } + termVectorRequest.doc(jsonBuilder().copyCurrentStructure(parser)); } else if ("_routing".equals(currentFieldName) || "routing".equals(currentFieldName)) { termVectorRequest.routing = parser.text(); } else { @@ -398,7 +442,6 @@ public static void parseRequest(TermVectorRequest termVectorRequest, XContentPar } } } - if (fields.size() > 0) { String[] fieldsAsArray = new String[fields.size()]; termVectorRequest.selectedFields(fields.toArray(fieldsAsArray)); diff --git a/src/main/java/org/elasticsearch/action/termvector/TermVectorRequestBuilder.java b/src/main/java/org/elasticsearch/action/termvector/TermVectorRequestBuilder.java index 5a91d0e4e6e7c..d0f0921dea7ba 100644 --- a/src/main/java/org/elasticsearch/action/termvector/TermVectorRequestBuilder.java +++ b/src/main/java/org/elasticsearch/action/termvector/TermVectorRequestBuilder.java @@ -22,6 +22,7 @@ import org.elasticsearch.action.ActionListener; import org.elasticsearch.action.ActionRequestBuilder; import org.elasticsearch.client.Client; +import org.elasticsearch.common.xcontent.XContentBuilder; /** */ @@ -35,6 +36,26 @@ public TermVectorRequestBuilder(Client client, String index, String type, String super(client, new TermVectorRequest(index, type, id)); } + public TermVectorRequestBuilder setIndex(String index) { + request.index(index); + return this; + } + + public TermVectorRequestBuilder setId(String id) { + request.id(id); + return this; + } + + public TermVectorRequestBuilder setType(String type) { + request.type(type); + return this; + } + + public TermVectorRequestBuilder setDoc(XContentBuilder xContent) { + request.doc(xContent); + return this; + } + /** * Sets the routing. Required if routing isn't id based. */ diff --git a/src/main/java/org/elasticsearch/action/termvector/TermVectorResponse.java b/src/main/java/org/elasticsearch/action/termvector/TermVectorResponse.java index 912b184da5397..55aa2127ccbd0 100644 --- a/src/main/java/org/elasticsearch/action/termvector/TermVectorResponse.java +++ b/src/main/java/org/elasticsearch/action/termvector/TermVectorResponse.java @@ -81,10 +81,11 @@ private static class FieldStrings { private String id; private long docVersion; private boolean exists = false; + private boolean artificial = false; private boolean sourceCopied = false; - int[] curentPositions = new int[0]; + int[] currentPositions = new int[0]; int[] currentStartOffset = new int[0]; int[] currentEndOffset = new int[0]; BytesReference[] currentPayloads = new BytesReference[0]; @@ -156,7 +157,6 @@ public int size() { } }; } - } @Override @@ -166,7 +166,9 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws assert id != null; builder.field(FieldStrings._INDEX, index); builder.field(FieldStrings._TYPE, type); - builder.field(FieldStrings._ID, id); + if (!isArtificial()) { + builder.field(FieldStrings._ID, id); + } builder.field(FieldStrings._VERSION, docVersion); builder.field(FieldStrings.FOUND, isExists()); if (!isExists()) { @@ -181,7 +183,6 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws } builder.endObject(); return builder; - } private void buildField(XContentBuilder builder, final CharsRef spare, Fields theFields, Iterator fieldIter) throws IOException { @@ -237,7 +238,7 @@ private void buildValues(XContentBuilder builder, Terms curTerms, int termFreq) for (int i = 0; i < termFreq; i++) { builder.startObject(); if (curTerms.hasPositions()) { - builder.field(FieldStrings.POS, curentPositions[i]); + builder.field(FieldStrings.POS, currentPositions[i]); } if (curTerms.hasOffsets()) { builder.field(FieldStrings.START_OFFSET, currentStartOffset[i]); @@ -249,14 +250,13 @@ private void buildValues(XContentBuilder builder, Terms curTerms, int termFreq) builder.endObject(); } builder.endArray(); - } private void initValues(Terms curTerms, DocsAndPositionsEnum posEnum, int termFreq) throws IOException { for (int j = 0; j < termFreq; j++) { int nextPos = posEnum.nextPosition(); if (curTerms.hasPositions()) { - curentPositions[j] = nextPos; + currentPositions[j] = nextPos; } if (curTerms.hasOffsets()) { currentStartOffset[j] = posEnum.startOffset(); @@ -269,7 +269,6 @@ private void initValues(Terms curTerms, DocsAndPositionsEnum posEnum, int termFr } else { currentPayloads[j] = null; } - } } } @@ -277,7 +276,7 @@ private void initValues(Terms curTerms, DocsAndPositionsEnum posEnum, int termFr private void initMemory(Terms curTerms, int termFreq) { // init memory for performance reasons if (curTerms.hasPositions()) { - curentPositions = ArrayUtil.grow(curentPositions, termFreq); + currentPositions = ArrayUtil.grow(currentPositions, termFreq); } if (curTerms.hasOffsets()) { currentStartOffset = ArrayUtil.grow(currentStartOffset, termFreq); @@ -336,7 +335,6 @@ public void setTermVectorField(BytesStreamOutput output) { public void setHeader(BytesReference header) { headerRef = header; - } public void setDocVersion(long version) { @@ -356,4 +354,11 @@ public String getId() { return id; } + public boolean isArtificial() { + return artificial; + } + + public void setArtificial(boolean artificial) { + this.artificial = artificial; + } } diff --git a/src/main/java/org/elasticsearch/action/termvector/TermVectorWriter.java b/src/main/java/org/elasticsearch/action/termvector/TermVectorWriter.java index 10509662cbc3c..9bbf142508b9d 100644 --- a/src/main/java/org/elasticsearch/action/termvector/TermVectorWriter.java +++ b/src/main/java/org/elasticsearch/action/termvector/TermVectorWriter.java @@ -46,7 +46,6 @@ final class TermVectorWriter { } void setFields(Fields termVectorsByField, Set selectedFields, EnumSet flags, Fields topLevelFields) throws IOException { - int numFieldsWritten = 0; TermsEnum iterator = null; DocsAndPositionsEnum docsAndPosEnum = null; diff --git a/src/main/java/org/elasticsearch/client/Client.java b/src/main/java/org/elasticsearch/client/Client.java index c88d2d77f2318..d30d09f226549 100644 --- a/src/main/java/org/elasticsearch/client/Client.java +++ b/src/main/java/org/elasticsearch/client/Client.java @@ -533,7 +533,6 @@ public interface Client extends ElasticsearchClient, Releasable { */ MoreLikeThisRequestBuilder prepareMoreLikeThis(String index, String type, String id); - /** * An action that returns the term vectors for a specific document. * @@ -550,6 +549,10 @@ public interface Client extends ElasticsearchClient, Releasable { */ void termVector(TermVectorRequest request, ActionListener listener); + /** + * Builder for the term vector request. + */ + TermVectorRequestBuilder prepareTermVector(); /** * Builder for the term vector request. @@ -560,7 +563,6 @@ public interface Client extends ElasticsearchClient, Releasable { */ TermVectorRequestBuilder prepareTermVector(String index, String type, String id); - /** * Multi get term vectors. */ @@ -576,7 +578,6 @@ public interface Client extends ElasticsearchClient, Releasable { */ MultiTermVectorsRequestBuilder prepareMultiTermVectors(); - /** * Percolates a request returning the matches documents. */ diff --git a/src/main/java/org/elasticsearch/client/support/AbstractClient.java b/src/main/java/org/elasticsearch/client/support/AbstractClient.java index 22b5e5e99291a..b728795779059 100644 --- a/src/main/java/org/elasticsearch/client/support/AbstractClient.java +++ b/src/main/java/org/elasticsearch/client/support/AbstractClient.java @@ -441,6 +441,11 @@ public void termVector(final TermVectorRequest request, final ActionListenerString[] of field values + */ + public final String[] getValues(String name) { + List result = new ArrayList<>(); + + for (IndexableField field : fields) { + if (field.name().equals(name) && field.stringValue() != null) { + result.add(field.stringValue()); + } + } + + if (result.size() == 0) { + return new String[0]; + } + + return result.toArray(new String[result.size()]); + } + public IndexableField getField(String name) { for (IndexableField field : fields) { if (field.name().equals(name)) { diff --git a/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java b/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java index a05fdc88d302d..bfe3d4ef18422 100644 --- a/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java +++ b/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java @@ -28,12 +28,13 @@ import org.elasticsearch.common.Strings; import org.elasticsearch.common.inject.Inject; import org.elasticsearch.common.lucene.uid.Versions; -import org.elasticsearch.common.regex.Regex; import org.elasticsearch.common.settings.Settings; import org.elasticsearch.index.engine.Engine; import org.elasticsearch.index.get.GetField; import org.elasticsearch.index.get.GetResult; +import org.elasticsearch.index.mapper.DocumentMapper; import org.elasticsearch.index.mapper.FieldMapper; +import org.elasticsearch.index.mapper.ParseContext; import org.elasticsearch.index.mapper.Uid; import org.elasticsearch.index.mapper.core.StringFieldMapper; import org.elasticsearch.index.mapper.internal.UidFieldMapper; @@ -45,6 +46,8 @@ import java.io.IOException; import java.util.*; +import static org.elasticsearch.index.mapper.SourceToParse.source; + /** */ @@ -67,23 +70,35 @@ public TermVectorResponse getTermVector(TermVectorRequest request, String concre final Engine.Searcher searcher = indexShard.acquireSearcher("term_vector"); IndexReader topLevelReader = searcher.reader(); final TermVectorResponse termVectorResponse = new TermVectorResponse(concreteIndex, request.type(), request.id()); - final Term uidTerm = new Term(UidFieldMapper.NAME, Uid.createUidAsBytes(request.type(), request.id())); + + // handle potential wildcards in fields + if (request.selectedFields() != null) { + handleFieldWildcards(request); + } + try { Fields topLevelFields = MultiFields.getFields(topLevelReader); + // from an artificial document + if (request.doc() != null) { + Fields termVectorsByField = generateTermVectorsFromDoc(request, topLevelFields); + termVectorResponse.setFields(termVectorsByField, request.selectedFields(), request.getFlags(), topLevelFields); + termVectorResponse.setExists(true); + termVectorResponse.setArtificial(true); + return termVectorResponse; + } + // or from an existing document + final Term uidTerm = new Term(UidFieldMapper.NAME, Uid.createUidAsBytes(request.type(), request.id())); Versions.DocIdAndVersion docIdAndVersion = Versions.loadDocIdAndVersion(topLevelReader, uidTerm); if (docIdAndVersion != null) { - /* handle potential wildcards in fields */ - if (request.selectedFields() != null) { - handleFieldWildcards(request); - } - /* generate term vectors if not available */ + // fields with stored term vectors Fields termVectorsByField = docIdAndVersion.context.reader().getTermVectors(docIdAndVersion.docId); + // fields without term vectors if (request.selectedFields() != null) { - termVectorsByField = generateTermVectorsIfNeeded(termVectorsByField, request, uidTerm, false); + termVectorsByField = addGeneratedTermVectors(termVectorsByField, request, uidTerm, false); } termVectorResponse.setFields(termVectorsByField, request.selectedFields(), request.getFlags(), topLevelFields); - termVectorResponse.setExists(true); termVectorResponse.setDocVersion(docIdAndVersion.version); + termVectorResponse.setExists(true); } else { termVectorResponse.setExists(false); } @@ -103,39 +118,52 @@ private void handleFieldWildcards(TermVectorRequest request) { request.selectedFields(fieldNames.toArray(Strings.EMPTY_ARRAY)); } - private Fields generateTermVectorsIfNeeded(Fields termVectorsByField, TermVectorRequest request, Term uidTerm, boolean realTime) throws IOException { - List validFields = new ArrayList<>(); + private boolean isValidField(FieldMapper field) { + // must be a string + if (!(field instanceof StringFieldMapper)) { + return false; + } + // and must be indexed + if (!field.fieldType().indexed()) { + return false; + } + return true; + } + + private Fields addGeneratedTermVectors(Fields termVectorsByField, TermVectorRequest request, Term uidTerm, boolean realTime) throws IOException { + /* only keep valid fields */ + Set validFields = new HashSet<>(); for (String field : request.selectedFields()) { FieldMapper fieldMapper = indexShard.mapperService().smartNameFieldMapper(field); - if (!(fieldMapper instanceof StringFieldMapper)) { + if (!isValidField(fieldMapper)) { continue; } + // already retrieved if (fieldMapper.fieldType().storeTermVectors()) { continue; } - // only disallow fields which are not indexed - if (!fieldMapper.fieldType().indexed()) { - continue; - } validFields.add(field); } + if (validFields.isEmpty()) { return termVectorsByField; } + /* generate term vectors from fetched document fields */ Engine.GetResult get = indexShard.get(new Engine.Get(realTime, uidTerm)); Fields generatedTermVectors; try { if (!get.exists()) { return termVectorsByField; } - // TODO: support for fetchSourceContext? GetResult getResult = indexShard.getService().get( get, request.id(), request.type(), validFields.toArray(Strings.EMPTY_ARRAY), null, false); generatedTermVectors = generateTermVectors(getResult.getFields().values(), request.offsets()); } finally { get.release(); } + + /* merge with existing Fields */ if (termVectorsByField == null) { return generatedTermVectors; } else { @@ -160,6 +188,35 @@ private Fields generateTermVectors(Collection getFields, boolean withO return MultiFields.getFields(index.createSearcher().getIndexReader()); } + private Fields generateTermVectorsFromDoc(TermVectorRequest request, Fields topLevelFields) throws IOException { + DocumentMapper docMapper = indexShard.mapperService().documentMapper(request.type()); + ParseContext.Document doc = docMapper.parse(source(request.doc()).type(request.type()).id(request.id())).rootDoc(); + + Collection seenFields = new HashSet<>(); + Collection getFields = new HashSet<>(); + for (IndexableField field : doc.getFields()) { + FieldMapper fieldMapper = indexShard.mapperService().smartNameFieldMapper(field.name()); + if (seenFields.contains(field.name())) { + continue; + } + else { + seenFields.add(field.name()); + } + if (topLevelFields.terms(field.name()) == null) { + continue; + } + if (!isValidField(fieldMapper)) { + continue; + } + if (request.selectedFields() != null && !request.selectedFields().contains(field.name())) { + continue; + } + String[] values = doc.getValues(field.name()); + getFields.add(new GetField(field.name(), Arrays.asList((Object[]) values))); + } + return generateTermVectors(getFields, request.offsets()); + } + private Fields mergeFields(String[] fieldNames, Fields... fieldsObject) throws IOException { ParallelFields parallelFields = new ParallelFields(); for (Fields fieldObject : fieldsObject) { diff --git a/src/main/java/org/elasticsearch/rest/action/termvector/RestTermVectorAction.java b/src/main/java/org/elasticsearch/rest/action/termvector/RestTermVectorAction.java index 54a752213590c..d12f7a76505fb 100644 --- a/src/main/java/org/elasticsearch/rest/action/termvector/RestTermVectorAction.java +++ b/src/main/java/org/elasticsearch/rest/action/termvector/RestTermVectorAction.java @@ -48,6 +48,8 @@ public class RestTermVectorAction extends BaseRestHandler { @Inject public RestTermVectorAction(Settings settings, Client client, RestController controller) { super(settings, client); + controller.registerHandler(GET, "/{index}/{type}/_termvector", this); + controller.registerHandler(POST, "/{index}/{type}/_termvector", this); controller.registerHandler(GET, "/{index}/{type}/{id}/_termvector", this); controller.registerHandler(POST, "/{index}/{type}/{id}/_termvector", this); } diff --git a/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java b/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java index 6f83af8c7544c..4a066ffbccf85 100644 --- a/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java +++ b/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java @@ -31,7 +31,6 @@ import org.elasticsearch.common.settings.ImmutableSettings; import org.elasticsearch.common.xcontent.ToXContent; import org.elasticsearch.common.xcontent.XContentBuilder; -import org.elasticsearch.common.xcontent.XContentFactory; import org.elasticsearch.index.mapper.core.AbstractFieldMapper; import org.junit.Test; @@ -43,6 +42,7 @@ import java.util.concurrent.ExecutionException; import static org.elasticsearch.common.settings.ImmutableSettings.settingsBuilder; +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder; import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked; import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertThrows; import static org.hamcrest.Matchers.*; @@ -51,7 +51,7 @@ public class GetTermVectorTests extends AbstractTermVectorTests { @Test public void testNoSuchDoc() throws Exception { - XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1") + XContentBuilder mapping = jsonBuilder().startObject().startObject("type1") .startObject("properties") .startObject("field") .field("type", "string") @@ -72,13 +72,13 @@ public void testNoSuchDoc() throws Exception { assertThat(actionGet.getIndex(), equalTo("test")); assertThat(actionGet.isExists(), equalTo(false)); // check response is nevertheless serializable to json - actionGet.toXContent(XContentFactory.jsonBuilder(), ToXContent.EMPTY_PARAMS); + actionGet.toXContent(jsonBuilder(), ToXContent.EMPTY_PARAMS); } } @Test public void testExistingFieldWithNoTermVectorsNoNPE() throws Exception { - XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1") + XContentBuilder mapping = jsonBuilder().startObject().startObject("type1") .startObject("properties") .startObject("existingfield") .field("type", "string") @@ -107,7 +107,7 @@ public void testExistingFieldWithNoTermVectorsNoNPE() throws Exception { @Test public void testExistingFieldButNotInDocNPE() throws Exception { - XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1") + XContentBuilder mapping = jsonBuilder().startObject().startObject("type1") .startObject("properties") .startObject("existingfield") .field("type", "string") @@ -179,7 +179,7 @@ public void testNotIndexedField() throws Exception { @Test public void testSimpleTermVectors() throws ElasticsearchException, IOException { - XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1") + XContentBuilder mapping = jsonBuilder().startObject().startObject("type1") .startObject("properties") .startObject("field") .field("type", "string") @@ -197,7 +197,7 @@ public void testSimpleTermVectors() throws ElasticsearchException, IOException { ensureYellow(); for (int i = 0; i < 10; i++) { client().prepareIndex("test", "type1", Integer.toString(i)) - .setSource(XContentFactory.jsonBuilder().startObject().field("field", "the quick brown fox jumps over the lazy dog") + .setSource(jsonBuilder().startObject().field("field", "the quick brown fox jumps over the lazy dog") // 0the3 4quick9 10brown15 16fox19 20jumps25 26over30 // 31the34 35lazy39 40dog43 .endObject()).execute().actionGet(); @@ -268,7 +268,7 @@ public void testRandomSingleTermVectors() throws ElasticsearchException, IOExcep ft.setStoreTermVectorPositions(storePositions); String optionString = AbstractFieldMapper.termVectorOptionsToString(ft); - XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1") + XContentBuilder mapping = jsonBuilder().startObject().startObject("type1") .startObject("properties") .startObject("field") .field("type", "string") @@ -284,7 +284,7 @@ public void testRandomSingleTermVectors() throws ElasticsearchException, IOExcep ensureYellow(); for (int i = 0; i < 10; i++) { client().prepareIndex("test", "type1", Integer.toString(i)) - .setSource(XContentFactory.jsonBuilder().startObject().field("field", "the quick brown fox jumps over the lazy dog") + .setSource(jsonBuilder().startObject().field("field", "the quick brown fox jumps over the lazy dog") // 0the3 4quick9 10brown15 16fox19 20jumps25 26over30 // 31the34 35lazy39 40dog43 .endObject()).execute().actionGet(); @@ -423,7 +423,7 @@ public void testRandomPayloadWithDelimitedPayloadTokenFilter() throws Elasticsea String delimiter = createRandomDelimiter(tokens); String queryString = createString(tokens, payloads, encoding, delimiter.charAt(0)); //create the mapping - XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1").startObject("properties") + XContentBuilder mapping = jsonBuilder().startObject().startObject("type1").startObject("properties") .startObject("field").field("type", "string").field("term_vector", "with_positions_offsets_payloads") .field("analyzer", "payload_test").endObject().endObject().endObject().endObject(); assertAcked(prepareCreate("test").addMapping("type1", mapping).setSettings( @@ -437,7 +437,7 @@ public void testRandomPayloadWithDelimitedPayloadTokenFilter() throws Elasticsea ensureYellow(); client().prepareIndex("test", "type1", Integer.toString(1)) - .setSource(XContentFactory.jsonBuilder().startObject().field("field", queryString).endObject()).execute().actionGet(); + .setSource(jsonBuilder().startObject().field("field", queryString).endObject()).execute().actionGet(); refresh(); TermVectorRequestBuilder resp = client().prepareTermVector("test", "type1", Integer.toString(1)).setPayloads(true).setOffsets(true) .setPositions(true).setSelectedFields(); @@ -579,8 +579,8 @@ public void testSimpleTermVectorsWithGenerate() throws ElasticsearchException, I fieldNames[i] = "field" + String.valueOf(i); } - XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1").startObject("properties"); - XContentBuilder source = XContentFactory.jsonBuilder().startObject(); + XContentBuilder mapping = jsonBuilder().startObject().startObject("type1").startObject("properties"); + XContentBuilder source = jsonBuilder().startObject(); for (String field : fieldNames) { mapping.startObject(field) .field("type", "string") @@ -764,8 +764,8 @@ private void compareTermVectors(String fieldName, Fields fields0, Fields fields1 public void testSimpleWildCards() throws ElasticsearchException, IOException { int numFields = 25; - XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1").startObject("properties"); - XContentBuilder source = XContentFactory.jsonBuilder().startObject(); + XContentBuilder mapping = jsonBuilder().startObject().startObject("type1").startObject("properties"); + XContentBuilder source = jsonBuilder().startObject(); for (int i = 0; i < numFields; i++) { mapping.startObject("field" + i) .field("type", "string") @@ -788,6 +788,55 @@ public void testSimpleWildCards() throws ElasticsearchException, IOException { assertThat("All term vectors should have been generated", response.getFields().size(), equalTo(numFields)); } + @Test + public void testExistingVsArtificial() throws ElasticsearchException, IOException, ExecutionException, InterruptedException { + // setup indices + ImmutableSettings.Builder settings = settingsBuilder() + .put(indexSettings()) + .put("index.analysis.analyzer", "standard"); + assertAcked(prepareCreate("test") + .setSettings(settings) + .addMapping("type1", "field1", "type=string,term_vector=with_positions_offsets")); + ensureGreen(); + + // index documents existing document + String text = "Generating a random permutation of a sequence (such as when shuffling cards)."; + List indexBuilders = new ArrayList<>(); + indexBuilders.add(client().prepareIndex() + .setIndex("test") + .setType("type1") + .setId("existing") + .setSource("field1", text)); + indexRandom(true, indexBuilders); + + // request tvs from existing document + TermVectorResponse respExisting = client().prepareTermVector("test", "type1", "existing") + .setOffsets(true) + .setPositions(true) + .setSelectedFields("field1") + .setFieldStatistics(false) // field statistics will be slightly different so ignore + .get(); + assertThat("doc with index: test, type1 and id: existing", respExisting.isExists(), equalTo(true)); + + // request tvs from artificial document + TermVectorResponse respArtificial = client().prepareTermVector() + .setIndex("test") + .setType("type1") + .setDoc(jsonBuilder() + .startObject() + .field("field1", text) + .endObject()) + .setOffsets(true) + .setPositions(true) + .setSelectedFields("field1") + .setFieldStatistics(false) // field statistics will be slightly different so ignore + .get(); + assertThat("doc with index: test, type1 and id: existing", respArtificial.isExists(), equalTo(true)); + + // compare existing tvs with artificial + compareTermVectors("field1", respExisting.getFields(), respArtificial.getFields()); + } + private static String indexOrAlias() { return randomBoolean() ? "test" : "alias"; } From d033e5915d5b68bee9fcd1de7901abe0d642023a Mon Sep 17 00:00:00 2001 From: Alex Ksikes Date: Tue, 2 Sep 2014 10:23:36 +0200 Subject: [PATCH 2/8] flyweight instead of specifying an id --- .../elasticsearch/index/termvectors/ShardTermVectorService.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java b/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java index bfe3d4ef18422..7a22541659008 100644 --- a/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java +++ b/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java @@ -190,7 +190,7 @@ private Fields generateTermVectors(Collection getFields, boolean withO private Fields generateTermVectorsFromDoc(TermVectorRequest request, Fields topLevelFields) throws IOException { DocumentMapper docMapper = indexShard.mapperService().documentMapper(request.type()); - ParseContext.Document doc = docMapper.parse(source(request.doc()).type(request.type()).id(request.id())).rootDoc(); + ParseContext.Document doc = docMapper.parse(source(request.doc()).type(request.type()).flyweight(true)).rootDoc(); Collection seenFields = new HashSet<>(); Collection getFields = new HashSet<>(); From 9556a275213788e675bd46d4ff78bd5772c9590c Mon Sep 17 00:00:00 2001 From: Alex Ksikes Date: Tue, 2 Sep 2014 15:34:34 +0200 Subject: [PATCH 3/8] fix for when shard has no doc indexed --- .../termvectors/ShardTermVectorService.java | 21 +++++------ .../action/termvector/GetTermVectorTests.java | 36 ++++++++++++++++--- 2 files changed, 42 insertions(+), 15 deletions(-) diff --git a/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java b/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java index 7a22541659008..600a8dcff06d4 100644 --- a/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java +++ b/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java @@ -71,22 +71,26 @@ public TermVectorResponse getTermVector(TermVectorRequest request, String concre IndexReader topLevelReader = searcher.reader(); final TermVectorResponse termVectorResponse = new TermVectorResponse(concreteIndex, request.type(), request.id()); - // handle potential wildcards in fields + /* handle potential wildcards in fields */ if (request.selectedFields() != null) { handleFieldWildcards(request); } try { Fields topLevelFields = MultiFields.getFields(topLevelReader); - // from an artificial document + //* from an artificial document */ if (request.doc() != null) { - Fields termVectorsByField = generateTermVectorsFromDoc(request, topLevelFields); + Fields termVectorsByField = generateTermVectorsFromDoc(request); + // if no document indexed in shard, take the queried document itself for stats + if (topLevelFields == null) { + topLevelFields = termVectorsByField; + } termVectorResponse.setFields(termVectorsByField, request.selectedFields(), request.getFlags(), topLevelFields); termVectorResponse.setExists(true); termVectorResponse.setArtificial(true); return termVectorResponse; } - // or from an existing document + /* or from an existing document */ final Term uidTerm = new Term(UidFieldMapper.NAME, Uid.createUidAsBytes(request.type(), request.id())); Versions.DocIdAndVersion docIdAndVersion = Versions.loadDocIdAndVersion(topLevelReader, uidTerm); if (docIdAndVersion != null) { @@ -172,7 +176,7 @@ private Fields addGeneratedTermVectors(Fields termVectorsByField, TermVectorRequ } private Fields generateTermVectors(Collection getFields, boolean withOffsets) throws IOException { - // store document in memory index + /* store document in memory index */ MemoryIndex index = new MemoryIndex(withOffsets); for (GetField getField : getFields) { String field = getField.getName(); @@ -184,11 +188,11 @@ private Fields generateTermVectors(Collection getFields, boolean withO index.addField(field, text.toString(), analyzer); } } - // and read vectors from it + /* and read vectors from it */ return MultiFields.getFields(index.createSearcher().getIndexReader()); } - private Fields generateTermVectorsFromDoc(TermVectorRequest request, Fields topLevelFields) throws IOException { + private Fields generateTermVectorsFromDoc(TermVectorRequest request) throws IOException { DocumentMapper docMapper = indexShard.mapperService().documentMapper(request.type()); ParseContext.Document doc = docMapper.parse(source(request.doc()).type(request.type()).flyweight(true)).rootDoc(); @@ -202,9 +206,6 @@ private Fields generateTermVectorsFromDoc(TermVectorRequest request, Fields topL else { seenFields.add(field.name()); } - if (topLevelFields.terms(field.name()) == null) { - continue; - } if (!isValidField(fieldMapper)) { continue; } diff --git a/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java b/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java index 4a066ffbccf85..a25514b4e17da 100644 --- a/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java +++ b/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java @@ -789,7 +789,7 @@ public void testSimpleWildCards() throws ElasticsearchException, IOException { } @Test - public void testExistingVsArtificial() throws ElasticsearchException, IOException, ExecutionException, InterruptedException { + public void testArtificialVsExisting() throws ElasticsearchException, ExecutionException, InterruptedException, IOException { // setup indices ImmutableSettings.Builder settings = settingsBuilder() .put(indexSettings()) @@ -813,8 +813,7 @@ public void testExistingVsArtificial() throws ElasticsearchException, IOExceptio TermVectorResponse respExisting = client().prepareTermVector("test", "type1", "existing") .setOffsets(true) .setPositions(true) - .setSelectedFields("field1") - .setFieldStatistics(false) // field statistics will be slightly different so ignore + .setFieldStatistics(true) .get(); assertThat("doc with index: test, type1 and id: existing", respExisting.isExists(), equalTo(true)); @@ -828,8 +827,7 @@ public void testExistingVsArtificial() throws ElasticsearchException, IOExceptio .endObject()) .setOffsets(true) .setPositions(true) - .setSelectedFields("field1") - .setFieldStatistics(false) // field statistics will be slightly different so ignore + .setFieldStatistics(true) .get(); assertThat("doc with index: test, type1 and id: existing", respArtificial.isExists(), equalTo(true)); @@ -837,6 +835,34 @@ public void testExistingVsArtificial() throws ElasticsearchException, IOExceptio compareTermVectors("field1", respExisting.getFields(), respArtificial.getFields()); } + @Test + public void testArtificialNoDoc() throws IOException { + // setup indices + ImmutableSettings.Builder settings = settingsBuilder() + .put(indexSettings()) + .put("index.analysis.analyzer", "standard"); + assertAcked(prepareCreate("test") + .setSettings(settings) + .addMapping("type1", "field1", "type=string")); + ensureGreen(); + + // request tvs from artificial document + String text = "the quick brown fox jumps over the lazy dog"; + TermVectorResponse respArtificial = client().prepareTermVector() + .setIndex("test") + .setType("type1") + .setDoc(jsonBuilder() + .startObject() + .field("field1", text) + .endObject()) + .setOffsets(true) + .setPositions(true) + .setFieldStatistics(true) + .get(); + assertThat("doc with index: test, type1 and id: existing", respArtificial.isExists(), equalTo(true)); + checkBrownFoxTermVector(respArtificial.getFields(), "field1", false); + } + private static String indexOrAlias() { return randomBoolean() ? "test" : "alias"; } From 4ec2acc0c90b28a8c60871a198ab0876320b4a47 Mon Sep 17 00:00:00 2001 From: Alex Ksikes Date: Tue, 2 Sep 2014 16:35:13 +0200 Subject: [PATCH 4/8] allow for term stats and mention to use routing --- .../reference/docs/multi-termvectors.asciidoc | 3 +- docs/reference/docs/termvectors.asciidoc | 8 +-- .../action/termvector/TermVectorRequest.java | 3 - .../action/termvector/GetTermVectorTests.java | 69 +++++++++++-------- 4 files changed, 46 insertions(+), 37 deletions(-) diff --git a/docs/reference/docs/multi-termvectors.asciidoc b/docs/reference/docs/multi-termvectors.asciidoc index 0ad1209dc42f1..2cfb1012ff5ec 100644 --- a/docs/reference/docs/multi-termvectors.asciidoc +++ b/docs/reference/docs/multi-termvectors.asciidoc @@ -95,8 +95,7 @@ curl 'localhost:9200/testidx/test/_mtermvectors' -d '{ Additionally coming[1.4.0], just like for the <> API, term vectors could be generated for user provided documents. The syntax is similar to the <> API. The mapping used is -determined by `_index` and `_type`. Term statistics are disallowed as they -depend on the shard processing the request. +determined by `_index` and `_type`. [source,js] -------------------------------------------------- diff --git a/docs/reference/docs/termvectors.asciidoc b/docs/reference/docs/termvectors.asciidoc index 1160eb155a974..1ddd6d1b77fb1 100644 --- a/docs/reference/docs/termvectors.asciidoc +++ b/docs/reference/docs/termvectors.asciidoc @@ -87,8 +87,9 @@ The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute -numbers have no meaning in this context. The term statistics are disabled -when using artificial documents. +numbers have no meaning in this context. By default, when requesting +term vectors of artificial documents, a shard to get the statistics from +is randomly selected. Use `routing` only to hit a particular shard. [float] === Example 1 @@ -256,8 +257,7 @@ Additionally, term vectors can also be generated for artificial documents, that is for documents not present in the index. The syntax is similar to the <> API. For example, the following request would return the same results as in example 1. The mapping used is determined by the -`index` and `type`. Term statistics are disallowed as they depend on the shard -processing the request. +`index` and `type`. [source,js] -------------------------------------------------- diff --git a/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java b/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java index 627db4be31a01..705ff7e8ed75a 100644 --- a/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java +++ b/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java @@ -414,9 +414,6 @@ public static void parseRequest(TermVectorRequest termVectorRequest, XContentPar } else if (currentFieldName.equals("payloads")) { termVectorRequest.payloads(parser.booleanValue()); } else if (currentFieldName.equals("term_statistics") || currentFieldName.equals("termStatistics")) { - if (parser.booleanValue() && termVectorRequest.doc != null) { - throw new ElasticsearchParseException("Term statistics are not supported with artificial documents."); - } termVectorRequest.termStatistics(parser.booleanValue()); } else if (currentFieldName.equals("field_statistics") || currentFieldName.equals("fieldStatistics")) { termVectorRequest.fieldStatistics(parser.booleanValue()); diff --git a/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java b/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java index a25514b4e17da..d483c1eda59e7 100644 --- a/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java +++ b/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java @@ -800,39 +800,51 @@ public void testArtificialVsExisting() throws ElasticsearchException, ExecutionE ensureGreen(); // index documents existing document - String text = "Generating a random permutation of a sequence (such as when shuffling cards)."; + String[] content = new String[]{ + "Generating a random permutation of a sequence (such as when shuffling cards).", + "Selecting a random sample of a population (important in statistical sampling).", + "Allocating experimental units via random assignment to a treatment or control condition.", + "Generating random numbers: see Random number generation."}; + List indexBuilders = new ArrayList<>(); - indexBuilders.add(client().prepareIndex() - .setIndex("test") - .setType("type1") - .setId("existing") - .setSource("field1", text)); + for (int i = 0; i < content.length; i++) { + indexBuilders.add(client().prepareIndex() + .setIndex("test") + .setType("type1") + .setId(String.valueOf(i)) + .setSource("field1", content[i])); + } indexRandom(true, indexBuilders); - // request tvs from existing document - TermVectorResponse respExisting = client().prepareTermVector("test", "type1", "existing") - .setOffsets(true) - .setPositions(true) - .setFieldStatistics(true) - .get(); - assertThat("doc with index: test, type1 and id: existing", respExisting.isExists(), equalTo(true)); + for (int i = 0; i < content.length; i++) { + // request tvs from existing document + TermVectorResponse respExisting = client().prepareTermVector("test", "type1", String.valueOf(i)) + .setOffsets(true) + .setPositions(true) + .setFieldStatistics(true) + .setTermStatistics(true) + .get(); + assertThat("doc with index: test, type1 and id: existing", respExisting.isExists(), equalTo(true)); - // request tvs from artificial document - TermVectorResponse respArtificial = client().prepareTermVector() - .setIndex("test") - .setType("type1") - .setDoc(jsonBuilder() - .startObject() - .field("field1", text) - .endObject()) - .setOffsets(true) - .setPositions(true) - .setFieldStatistics(true) - .get(); - assertThat("doc with index: test, type1 and id: existing", respArtificial.isExists(), equalTo(true)); + // request tvs from artificial document + TermVectorResponse respArtificial = client().prepareTermVector() + .setIndex("test") + .setType("type1") + .setRouting(String.valueOf(i)) // ensure we get the stats from the same shard as existing doc + .setDoc(jsonBuilder() + .startObject() + .field("field1", content[i]) + .endObject()) + .setOffsets(true) + .setPositions(true) + .setFieldStatistics(true) + .setTermStatistics(true) + .get(); + assertThat("doc with index: test, type1 and id: existing", respArtificial.isExists(), equalTo(true)); - // compare existing tvs with artificial - compareTermVectors("field1", respExisting.getFields(), respArtificial.getFields()); + // compare existing tvs with artificial + compareTermVectors("field1", respExisting.getFields(), respArtificial.getFields()); + } } @Test @@ -858,6 +870,7 @@ public void testArtificialNoDoc() throws IOException { .setOffsets(true) .setPositions(true) .setFieldStatistics(true) + .setTermStatistics(true) .get(); assertThat("doc with index: test, type1 and id: existing", respArtificial.isExists(), equalTo(true)); checkBrownFoxTermVector(respArtificial.getFields(), "field1", false); From 13640dd0216e77244f99bd4fb3f8147cd3e36dea Mon Sep 17 00:00:00 2001 From: Alex Ksikes Date: Tue, 2 Sep 2014 19:38:47 +0200 Subject: [PATCH 5/8] fix and test for field not present in mapping --- .../action/termvector/TermVectorWriter.java | 1 - .../termvectors/ShardTermVectorService.java | 8 ++- .../action/termvector/GetTermVectorTests.java | 55 +++++++++++++++++-- 3 files changed, 57 insertions(+), 7 deletions(-) diff --git a/src/main/java/org/elasticsearch/action/termvector/TermVectorWriter.java b/src/main/java/org/elasticsearch/action/termvector/TermVectorWriter.java index 9bbf142508b9d..1b22304c72a26 100644 --- a/src/main/java/org/elasticsearch/action/termvector/TermVectorWriter.java +++ b/src/main/java/org/elasticsearch/action/termvector/TermVectorWriter.java @@ -74,7 +74,6 @@ void setFields(Fields termVectorsByField, Set selectedFields, EnumSet getFields, boolean withO return MultiFields.getFields(index.createSearcher().getIndexReader()); } - private Fields generateTermVectorsFromDoc(TermVectorRequest request) throws IOException { + private Fields generateTermVectorsFromDoc(TermVectorRequest request, Fields topLevelFields) throws IOException { DocumentMapper docMapper = indexShard.mapperService().documentMapper(request.type()); ParseContext.Document doc = docMapper.parse(source(request.doc()).type(request.type()).flyweight(true)).rootDoc(); @@ -206,6 +206,10 @@ private Fields generateTermVectorsFromDoc(TermVectorRequest request) throws IOEx else { seenFields.add(field.name()); } + // skip fields not in the mapping + if (topLevelFields != null && topLevelFields.terms(field.name()) == null) { + continue; + } if (!isValidField(fieldMapper)) { continue; } diff --git a/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java b/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java index d483c1eda59e7..3dac7af69855b 100644 --- a/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java +++ b/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java @@ -840,7 +840,7 @@ public void testArtificialVsExisting() throws ElasticsearchException, ExecutionE .setFieldStatistics(true) .setTermStatistics(true) .get(); - assertThat("doc with index: test, type1 and id: existing", respArtificial.isExists(), equalTo(true)); + assertThat("doc with index: test, type1 and id: " + String.valueOf(i), respArtificial.isExists(), equalTo(true)); // compare existing tvs with artificial compareTermVectors("field1", respExisting.getFields(), respArtificial.getFields()); @@ -860,7 +860,7 @@ public void testArtificialNoDoc() throws IOException { // request tvs from artificial document String text = "the quick brown fox jumps over the lazy dog"; - TermVectorResponse respArtificial = client().prepareTermVector() + TermVectorResponse resp = client().prepareTermVector() .setIndex("test") .setType("type1") .setDoc(jsonBuilder() @@ -872,8 +872,55 @@ public void testArtificialNoDoc() throws IOException { .setFieldStatistics(true) .setTermStatistics(true) .get(); - assertThat("doc with index: test, type1 and id: existing", respArtificial.isExists(), equalTo(true)); - checkBrownFoxTermVector(respArtificial.getFields(), "field1", false); + assertThat(resp.isExists(), equalTo(true)); + checkBrownFoxTermVector(resp.getFields(), "field1", false); + } + + @Test + public void testArtificialNonExistingField() throws IOException, ExecutionException, InterruptedException { + // setup indices + ImmutableSettings.Builder settings = settingsBuilder() + .put(indexSettings()) + .put("index.analysis.analyzer", "standard"); + assertAcked(prepareCreate("test") + .setSettings(settings) + .addMapping("type1", "field1", "type=string")); + ensureGreen(); + + // index just one doc + List indexBuilders = new ArrayList<>(); + indexBuilders.add(client().prepareIndex() + .setIndex("test") + .setType("type1") + .setId("1") + .setRouting("1") + .setSource("field1", "some text")); + indexRandom(true, indexBuilders); + + // request tvs from artificial document (without a doc in a shard and with) + String text = "the quick brown fox jumps over the lazy dog"; + for (int i = 0; i < 2; i++) { + TermVectorResponse resp = client().prepareTermVector() + .setIndex("test") + .setType("type1") + .setDoc(jsonBuilder() + .startObject() + .field("field1", text) + .endObject() + .startObject() + .field("non_existing", "some text that should not be retrieved") + .endObject()) + .setRouting(String.valueOf(i)) + .setOffsets(true) + .setPositions(true) + .setFieldStatistics(true) + .setTermStatistics(true) + .get(); + assertThat(resp.isExists(), equalTo(true)); + assertNotNull(resp.getFields().terms("field1")); + assertNull(resp.getFields().terms("non_existing")); + checkBrownFoxTermVector(resp.getFields(), "field1", false); + } } private static String indexOrAlias() { From 35cd69ab9d34127a8acec3a53e8356e9f5bc3121 Mon Sep 17 00:00:00 2001 From: Alex Ksikes Date: Wed, 3 Sep 2014 14:38:10 +0200 Subject: [PATCH 6/8] addressed comments --- .../action/termvector/TermVectorRequest.java | 16 +++++++--- .../index/mapper/ParseContext.java | 6 ---- .../termvectors/ShardTermVectorService.java | 10 +++--- .../action/termvector/GetTermVectorTests.java | 31 +++++++++---------- 4 files changed, 30 insertions(+), 33 deletions(-) diff --git a/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java b/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java index 705ff7e8ed75a..8839b4480a601 100644 --- a/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java +++ b/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java @@ -329,8 +329,11 @@ public void readFrom(StreamInput in) throws IOException { } type = in.readString(); id = in.readString(); - if (in.readBoolean()) { - doc = in.readBytesReference(); + + if (in.getVersion().after(Version.V_1_4_0)) { + if (in.readBoolean()) { + doc = in.readBytesReference(); + } } routing = in.readOptionalString(); preference = in.readOptionalString(); @@ -360,9 +363,12 @@ public void writeTo(StreamOutput out) throws IOException { } out.writeString(type); out.writeString(id); - out.writeBoolean(doc != null); - if (doc != null) { - out.writeBytesReference(doc); + + if (out.getVersion().after(Version.V_1_4_0)) { + out.writeBoolean(doc != null); + if (doc != null) { + out.writeBytesReference(doc); + } } out.writeOptionalString(routing); out.writeOptionalString(preference); diff --git a/src/main/java/org/elasticsearch/index/mapper/ParseContext.java b/src/main/java/org/elasticsearch/index/mapper/ParseContext.java index 3876cea9b71cf..6a4b6ad439715 100644 --- a/src/main/java/org/elasticsearch/index/mapper/ParseContext.java +++ b/src/main/java/org/elasticsearch/index/mapper/ParseContext.java @@ -138,17 +138,11 @@ public IndexableField[] getFields(String name) { */ public final String[] getValues(String name) { List result = new ArrayList<>(); - for (IndexableField field : fields) { if (field.name().equals(name) && field.stringValue() != null) { result.add(field.stringValue()); } } - - if (result.size() == 0) { - return new String[0]; - } - return result.toArray(new String[result.size()]); } diff --git a/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java b/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java index ee0bd43bdebf3..c0a1a9db77ec9 100644 --- a/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java +++ b/src/main/java/org/elasticsearch/index/termvectors/ShardTermVectorService.java @@ -32,16 +32,15 @@ import org.elasticsearch.index.engine.Engine; import org.elasticsearch.index.get.GetField; import org.elasticsearch.index.get.GetResult; -import org.elasticsearch.index.mapper.DocumentMapper; -import org.elasticsearch.index.mapper.FieldMapper; -import org.elasticsearch.index.mapper.ParseContext; -import org.elasticsearch.index.mapper.Uid; +import org.elasticsearch.index.mapper.*; import org.elasticsearch.index.mapper.core.StringFieldMapper; import org.elasticsearch.index.mapper.internal.UidFieldMapper; +import org.elasticsearch.index.service.IndexService; import org.elasticsearch.index.settings.IndexSettings; import org.elasticsearch.index.shard.AbstractIndexShardComponent; import org.elasticsearch.index.shard.ShardId; import org.elasticsearch.index.shard.service.IndexShard; +import org.elasticsearch.indices.IndicesService; import java.io.IOException; import java.util.*; @@ -55,6 +54,7 @@ public class ShardTermVectorService extends AbstractIndexShardComponent { private IndexShard indexShard; + @Inject public ShardTermVectorService(ShardId shardId, @IndexSettings Settings indexSettings) { super(shardId, indexSettings); @@ -78,7 +78,7 @@ public TermVectorResponse getTermVector(TermVectorRequest request, String concre try { Fields topLevelFields = MultiFields.getFields(topLevelReader); - //* from an artificial document */ + /* from an artificial document */ if (request.doc() != null) { Fields termVectorsByField = generateTermVectorsFromDoc(request, topLevelFields); // if no document indexed in shard, take the queried document itself for stats diff --git a/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java b/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java index 3dac7af69855b..28a152198795f 100644 --- a/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java +++ b/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java @@ -897,30 +897,27 @@ public void testArtificialNonExistingField() throws IOException, ExecutionExcept .setSource("field1", "some text")); indexRandom(true, indexBuilders); - // request tvs from artificial document (without a doc in a shard and with) - String text = "the quick brown fox jumps over the lazy dog"; - for (int i = 0; i < 2; i++) { - TermVectorResponse resp = client().prepareTermVector() + // request tvs from artificial document + XContentBuilder doc = jsonBuilder() + .startObject() + .field("field1", "the quick brown fox jumps over the lazy dog") + .field("non_existing", "some text that should not be retrieved") + .endObject(); + + TermVectorResponse resp = client().prepareTermVector() .setIndex("test") .setType("type1") - .setDoc(jsonBuilder() - .startObject() - .field("field1", text) - .endObject() - .startObject() - .field("non_existing", "some text that should not be retrieved") - .endObject()) - .setRouting(String.valueOf(i)) + .setDoc(doc) + .setRouting(String.valueOf("1")) .setOffsets(true) .setPositions(true) .setFieldStatistics(true) .setTermStatistics(true) .get(); - assertThat(resp.isExists(), equalTo(true)); - assertNotNull(resp.getFields().terms("field1")); - assertNull(resp.getFields().terms("non_existing")); - checkBrownFoxTermVector(resp.getFields(), "field1", false); - } + assertThat(resp.isExists(), equalTo(true)); + assertNotNull(resp.getFields().terms("field1")); + assertNull(resp.getFields().terms("non_existing")); + checkBrownFoxTermVector(resp.getFields(), "field1", false); } private static String indexOrAlias() { From 688f9b52a7046629399c4dfeaebffb63ee28632b Mon Sep 17 00:00:00 2001 From: Alex Ksikes Date: Thu, 4 Sep 2014 14:25:44 +0200 Subject: [PATCH 7/8] addressed comments and settled for dynamic mapping --- .../action/termvector/TermVectorRequest.java | 2 +- .../termvector/TermVectorRequestBuilder.java | 20 ++++++++--- .../action/termvector/TermVectorWriter.java | 5 +++ .../termvectors/ShardTermVectorService.java | 36 +++++++++++++------ .../action/termvector/GetTermVectorTests.java | 22 +++++++----- 5 files changed, 61 insertions(+), 24 deletions(-) diff --git a/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java b/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java index 8839b4480a601..b2e7a2816f87b 100644 --- a/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java +++ b/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java @@ -57,7 +57,7 @@ public class TermVectorRequest extends SingleShardOperationRequest selectedFields; diff --git a/src/main/java/org/elasticsearch/action/termvector/TermVectorRequestBuilder.java b/src/main/java/org/elasticsearch/action/termvector/TermVectorRequestBuilder.java index d0f0921dea7ba..3a43912ecc2b5 100644 --- a/src/main/java/org/elasticsearch/action/termvector/TermVectorRequestBuilder.java +++ b/src/main/java/org/elasticsearch/action/termvector/TermVectorRequestBuilder.java @@ -36,21 +36,33 @@ public TermVectorRequestBuilder(Client client, String index, String type, String super(client, new TermVectorRequest(index, type, id)); } + /** + * Sets the index where the document is located. + */ public TermVectorRequestBuilder setIndex(String index) { request.index(index); return this; } - public TermVectorRequestBuilder setId(String id) { - request.id(id); + /** + * Sets the type of the document. + */ + public TermVectorRequestBuilder setType(String type) { + request.type(type); return this; } - public TermVectorRequestBuilder setType(String type) { - request.type(type); + /** + * Sets the id of the document. + */ + public TermVectorRequestBuilder setId(String id) { + request.id(id); return this; } + /** + * Sets the artificial document from which to generate term vectors. + */ public TermVectorRequestBuilder setDoc(XContentBuilder xContent) { request.doc(xContent); return this; diff --git a/src/main/java/org/elasticsearch/action/termvector/TermVectorWriter.java b/src/main/java/org/elasticsearch/action/termvector/TermVectorWriter.java index 1b22304c72a26..62b7b21927291 100644 --- a/src/main/java/org/elasticsearch/action/termvector/TermVectorWriter.java +++ b/src/main/java/org/elasticsearch/action/termvector/TermVectorWriter.java @@ -59,6 +59,11 @@ void setFields(Fields termVectorsByField, Set selectedFields, EnumSet getFields, boolean withO return MultiFields.getFields(index.createSearcher().getIndexReader()); } - private Fields generateTermVectorsFromDoc(TermVectorRequest request, Fields topLevelFields) throws IOException { - DocumentMapper docMapper = indexShard.mapperService().documentMapper(request.type()); - ParseContext.Document doc = docMapper.parse(source(request.doc()).type(request.type()).flyweight(true)).rootDoc(); + private Fields generateTermVectorsFromDoc(TermVectorRequest request) throws IOException { + // parse the document, at the moment we do update the mapping, just like percolate + ParsedDocument parsedDocument = parseDocument(indexShard.shardId().getIndex(), request.type(), request.doc()); + // select the right fields and generate term vectors + ParseContext.Document doc = parsedDocument.rootDoc(); Collection seenFields = new HashSet<>(); Collection getFields = new HashSet<>(); for (IndexableField field : doc.getFields()) { @@ -206,10 +211,6 @@ private Fields generateTermVectorsFromDoc(TermVectorRequest request, Fields topL else { seenFields.add(field.name()); } - // skip fields not in the mapping - if (topLevelFields != null && topLevelFields.terms(field.name()) == null) { - continue; - } if (!isValidField(fieldMapper)) { continue; } @@ -222,6 +223,19 @@ private Fields generateTermVectorsFromDoc(TermVectorRequest request, Fields topL return generateTermVectors(getFields, request.offsets()); } + private ParsedDocument parseDocument(String index, String type, BytesReference doc) { + MapperService mapperService = indexShard.mapperService(); + IndexService indexService = indexShard.indexService(); + + // TODO: make parsing not dynamically create fields not in the original mapping + Tuple docMapper = mapperService.documentMapperWithAutoCreate(type); + ParsedDocument parsedDocument = docMapper.v1().parse(source(doc).type(type).flyweight(true)).setMappingsModified(docMapper); + if (parsedDocument.mappingsModified()) { + mappingUpdatedAction.updateMappingOnMaster(index, docMapper.v1(), indexService.indexUUID()); + } + return parsedDocument; + } + private Fields mergeFields(String[] fieldNames, Fields... fieldsObject) throws IOException { ParallelFields parallelFields = new ParallelFields(); for (Fields fieldObject : fieldsObject) { diff --git a/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java b/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java index 28a152198795f..8766a341dce45 100644 --- a/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java +++ b/src/test/java/org/elasticsearch/action/termvector/GetTermVectorTests.java @@ -32,6 +32,8 @@ import org.elasticsearch.common.xcontent.ToXContent; import org.elasticsearch.common.xcontent.XContentBuilder; import org.elasticsearch.index.mapper.core.AbstractFieldMapper; +import org.elasticsearch.index.service.IndexService; +import org.elasticsearch.indices.IndicesService; import org.junit.Test; import java.io.IOException; @@ -877,7 +879,7 @@ public void testArtificialNoDoc() throws IOException { } @Test - public void testArtificialNonExistingField() throws IOException, ExecutionException, InterruptedException { + public void testArtificialNonExistingField() throws Exception { // setup indices ImmutableSettings.Builder settings = settingsBuilder() .put(indexSettings()) @@ -901,23 +903,27 @@ public void testArtificialNonExistingField() throws IOException, ExecutionExcept XContentBuilder doc = jsonBuilder() .startObject() .field("field1", "the quick brown fox jumps over the lazy dog") - .field("non_existing", "some text that should not be retrieved") + .field("non_existing", "the quick brown fox jumps over the lazy dog") .endObject(); - TermVectorResponse resp = client().prepareTermVector() + for (int i = 0; i < 2; i++) { + TermVectorResponse resp = client().prepareTermVector() .setIndex("test") .setType("type1") .setDoc(doc) - .setRouting(String.valueOf("1")) + .setRouting("" + i) .setOffsets(true) .setPositions(true) .setFieldStatistics(true) .setTermStatistics(true) .get(); - assertThat(resp.isExists(), equalTo(true)); - assertNotNull(resp.getFields().terms("field1")); - assertNull(resp.getFields().terms("non_existing")); - checkBrownFoxTermVector(resp.getFields(), "field1", false); + assertThat(resp.isExists(), equalTo(true)); + checkBrownFoxTermVector(resp.getFields(), "field1", false); + // we should have created a mapping for this field + waitForMappingOnMaster("test", "type1", "non_existing"); + // and return the generated term vectors + checkBrownFoxTermVector(resp.getFields(), "non_existing", false); + } } private static String indexOrAlias() { From 33c6364a940c67c52477dabc5c1aff3efc2cd73c Mon Sep 17 00:00:00 2001 From: Alex Ksikes Date: Thu, 4 Sep 2014 18:54:46 +0200 Subject: [PATCH 8/8] onOrAfter and note on dynamic mapping --- docs/reference/docs/termvectors.asciidoc | 6 ++++++ .../elasticsearch/action/termvector/TermVectorRequest.java | 4 ++-- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/docs/reference/docs/termvectors.asciidoc b/docs/reference/docs/termvectors.asciidoc index 1ddd6d1b77fb1..de00f78f66940 100644 --- a/docs/reference/docs/termvectors.asciidoc +++ b/docs/reference/docs/termvectors.asciidoc @@ -259,6 +259,12 @@ that is for documents not present in the index. The syntax is similar to the return the same results as in example 1. The mapping used is determined by the `index` and `type`. +[WARNING] +====== +If dynamic mapping is turned on (default), the document fields not in the original +mapping will be dynamically created. +====== + [source,js] -------------------------------------------------- curl -XGET 'http://localhost:9200/twitter/tweet/_termvector' -d '{ diff --git a/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java b/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java index b2e7a2816f87b..daba0a8daeb54 100644 --- a/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java +++ b/src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java @@ -330,7 +330,7 @@ public void readFrom(StreamInput in) throws IOException { type = in.readString(); id = in.readString(); - if (in.getVersion().after(Version.V_1_4_0)) { + if (in.getVersion().onOrAfter(Version.V_1_4_0)) { if (in.readBoolean()) { doc = in.readBytesReference(); } @@ -364,7 +364,7 @@ public void writeTo(StreamOutput out) throws IOException { out.writeString(type); out.writeString(id); - if (out.getVersion().after(Version.V_1_4_0)) { + if (out.getVersion().onOrAfter(Version.V_1_4_0)) { out.writeBoolean(doc != null); if (doc != null) { out.writeBytesReference(doc);