Commit c2ec379

Term Vectors: Support for artificial documents
This adds the ability to the Term Vector API to generate term vectors for artificial documents, that is, for documents not present in the index. Following a syntax similar to the Percolator API, a new 'doc' parameter is used instead of '_id' to specify the document of interest. The parameters '_index' and '_type' determine the mapping, and therefore the analyzers to apply to each field value. Closes #7530
1 parent ebd4007 commit c2ec379

13 files changed

+454
-67
lines changed

docs/reference/docs/multi-termvectors.asciidoc

+32-2
@@ -1,7 +1,10 @@
 [[docs-multi-termvectors]]
 == Multi termvectors API
 
-Multi termvectors API allows to get multiple termvectors based on an index, type and id. The response includes a `docs`
+Multi termvectors API allows to get multiple termvectors at once. The
+documents from which to retrieve the term vectors are specified by an index,
+type and id. But the documents could also be artificially provided coming[1.4.0].
+The response includes a `docs`
 array with all the fetched termvectors, each element having the structure
 provided by the <<docs-termvectors,termvectors>>
 API. Here is an example:
@@ -89,4 +92,31 @@ curl 'localhost:9200/testidx/test/_mtermvectors' -d '{
 }'
 --------------------------------------------------
 
-Parameters can also be set by passing them as uri parameters (see <<docs-termvectors,termvectors>>). uri parameters are the default parameters and are overwritten by any parameter setting defined in the body.
+Additionally coming[1.4.0], just like for the <<docs-termvectors,termvectors>>
+API, term vectors could be generated for user provided documents. The syntax
+is similar to the <<search-percolate,percolator>> API. The mapping used is
+determined by `_index` and `_type`.
+
+[source,js]
+--------------------------------------------------
+curl 'localhost:9200/_mtermvectors' -d '{
+"docs": [
+{
+"_index": "testidx",
+"_type": "test",
+"doc" : {
+"fullname" : "John Doe",
+"text" : "twitter test test test"
+}
+},
+{
+"_index": "testidx",
+"_type": "test",
+"doc" : {
+"fullname" : "Jane Doe",
+"text" : "Another twitter test ..."
+}
+}
+]
+}'
+--------------------------------------------------

docs/reference/docs/termvectors.asciidoc

+38-9
@@ -3,10 +3,11 @@
 
 added[1.0.0.Beta1]
 
-Returns information and statistics on terms in the fields of a
-particular document as stored in the index. Note that this is a
-near realtime API as the term vectors are not available until the
-next refresh.
+Returns information and statistics on terms in the fields of a particular
+document. The document could be stored in the index or artificially provided
+by the user coming[1.4.0]. Note that for documents stored in the index, this
+is a near realtime API as the term vectors are not available until the next
+refresh.
 
 [source,js]
 --------------------------------------------------
@@ -41,10 +42,10 @@ statistics are returned for all fields but no term statistics.
 * term payloads (`payloads` : true), as base64 encoded bytes
 
 If the requested information wasn't stored in the index, it will be
-computed on the fly if possible. See <<mapping-types,type mapping>>
-for how to configure your index to store term vectors.
+computed on the fly if possible. Additionally, term vectors could be computed
+for documents not even existing in the index, but instead provided by the user.
 
-coming[1.4.0,The ability to computed term vectors on the fly is only available from 1.4.0 onwards (see below)]
+coming[1.4.0,The ability to computed term vectors on the fly as well as support for artificial documents is only available from 1.4.0 onwards (see below example 2 and 3 respectively)]
 
 [WARNING]
 ======
@@ -86,7 +87,9 @@ The term and field statistics are not accurate. Deleted documents
 are not taken into account. The information is only retrieved for the
 shard the requested document resides in. The term and field statistics
 are therefore only useful as relative measures whereas the absolute
-numbers have no meaning in this context.
+numbers have no meaning in this context. By default, when requesting
+term vectors of artificial documents, a shard to get the statistics from
+is randomly selected. Use `routing` only to hit a particular shard.
 
 [float]
 === Example 1
@@ -231,7 +234,7 @@ Response:
 [float]
 === Example 2 coming[1.4.0]
 
-Additionally, term vectors which are not explicitly stored in the index are automatically
+Term vectors which are not explicitly stored in the index are automatically
 computed on the fly. The following request returns all information and statistics for the
 fields in document `1`, even though the terms haven't been explicitly stored in the index.
 Note that for the field `text`, the terms are not re-generated.
@@ -246,3 +249,29 @@ curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{
 "field_statistics" : true
 }'
 --------------------------------------------------
+
+[float]
+=== Example 3 coming[1.4.0]
+
+Additionally, term vectors can also be generated for artificial documents,
+that is for documents not present in the index. The syntax is similar to the
+<<search-percolate,percolator>> API. For example, the following request would
+return the same results as in example 1. The mapping used is determined by the
+`index` and `type`.
+
+[WARNING]
+======
+If dynamic mapping is turned on (default), the document fields not in the original
+mapping will be dynamically created.
+======
+
+[source,js]
+--------------------------------------------------
+curl -XGET 'http://localhost:9200/twitter/tweet/_termvector' -d '{
+"doc" : {
+"fullname" : "John Doe",
+"text" : "twitter test test test"
+}
+}'
+--------------------------------------------------
+

src/main/java/org/elasticsearch/action/termvector/MultiTermVectorsRequest.java

-1
@@ -90,7 +90,6 @@ public void add(TermVectorRequest template, BytesReference data) throws Exceptio
 if (token == XContentParser.Token.FIELD_NAME) {
 currentFieldName = parser.currentName();
 } else if (token == XContentParser.Token.START_ARRAY) {
-
 if ("docs".equals(currentFieldName)) {
 while ((token = parser.nextToken()) != XContentParser.Token.END_ARRAY) {
 if (token != XContentParser.Token.START_OBJECT) {

src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java

+49-3
@@ -26,12 +26,17 @@
 import org.elasticsearch.action.ValidateActions;
 import org.elasticsearch.action.get.MultiGetRequest;
 import org.elasticsearch.action.support.single.shard.SingleShardOperationRequest;
+import org.elasticsearch.common.bytes.BytesReference;
 import org.elasticsearch.common.io.stream.StreamInput;
 import org.elasticsearch.common.io.stream.StreamOutput;
+import org.elasticsearch.common.xcontent.XContentBuilder;
 import org.elasticsearch.common.xcontent.XContentParser;
 
 import java.io.IOException;
 import java.util.*;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
 
 /**
 * Request returning the term vector (doc frequency, positions, offsets) for a
@@ -46,10 +51,14 @@ public class TermVectorRequest extends SingleShardOperationRequest<TermVectorReq
 
 private String id;
 
+private BytesReference doc;
+
 private String routing;
 
 protected String preference;
 
+private static final AtomicInteger randomInt = new AtomicInteger(0);
+
 // TODO: change to String[]
 private Set<String> selectedFields;
 
@@ -129,6 +138,23 @@ public TermVectorRequest id(String id) {
 return this;
 }
 
+/**
+* Returns the artificial document from which term vectors are requested for.
+*/
+public BytesReference doc() {
+return doc;
+}
+
+/**
+* Sets an artificial document from which term vectors are requested for.
+*/
+public TermVectorRequest doc(XContentBuilder documentBuilder) {
+// assign a random id to this artificial document, for routing
+this.id(String.valueOf(randomInt.getAndAdd(1)));
+this.doc = documentBuilder.bytes();
+return this;
+}
+
 /**
 * @return The routing for this request.
 */
@@ -281,8 +307,8 @@ public ActionRequestValidationException validate() {
 if (type == null) {
 validationException = ValidateActions.addValidationError("type is missing", validationException);
 }
-if (id == null) {
-validationException = ValidateActions.addValidationError("id is missing", validationException);
+if (id == null && doc == null) {
+validationException = ValidateActions.addValidationError("id or doc is missing", validationException);
 }
 return validationException;
 }
@@ -303,6 +329,12 @@ public void readFrom(StreamInput in) throws IOException {
 }
 type = in.readString();
 id = in.readString();
+
+if (in.getVersion().onOrAfter(Version.V_1_4_0)) {
+if (in.readBoolean()) {
+doc = in.readBytesReference();
+}
+}
 routing = in.readOptionalString();
 preference = in.readOptionalString();
 long flags = in.readVLong();
@@ -331,6 +363,13 @@ public void writeTo(StreamOutput out) throws IOException {
 }
 out.writeString(type);
 out.writeString(id);
+
+if (out.getVersion().onOrAfter(Version.V_1_4_0)) {
+out.writeBoolean(doc != null);
+if (doc != null) {
+out.writeBytesReference(doc);
+}
+}
 out.writeOptionalString(routing);
 out.writeOptionalString(preference);
 long longFlags = 0;
@@ -389,7 +428,15 @@ public static void parseRequest(TermVectorRequest termVectorRequest, XContentPar
 } else if ("_type".equals(currentFieldName)) {
 termVectorRequest.type = parser.text();
 } else if ("_id".equals(currentFieldName)) {
+if (termVectorRequest.doc != null) {
+throw new ElasticsearchParseException("Either \"id\" or \"doc\" can be specified, but not both!");
+}
 termVectorRequest.id = parser.text();
+} else if ("doc".equals(currentFieldName)) {
+if (termVectorRequest.id != null) {
+throw new ElasticsearchParseException("Either \"id\" or \"doc\" can be specified, but not both!");
+}
+termVectorRequest.doc(jsonBuilder().copyCurrentStructure(parser));
 } else if ("_routing".equals(currentFieldName) || "routing".equals(currentFieldName)) {
 termVectorRequest.routing = parser.text();
 } else {
@@ -398,7 +445,6 @@ public static void parseRequest(TermVectorRequest termVectorRequest, XContentPar
 }
 }
 }
-
 if (fields.size() > 0) {
 String[] fieldsAsArray = new String[fields.size()];
 termVectorRequest.selectedFields(fields.toArray(fieldsAsArray));
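
Reviewer note: the readFrom/writeTo hunks above gate the new `doc` field on the peer's version and prefix it with a boolean presence flag. The following is a minimal standalone sketch of that pattern, not the Elasticsearch implementation. It assumes plain `java.io` data streams in place of `StreamInput`/`StreamOutput`, and the class name `OptionalDocWireSketch`, the integer `DOC_SUPPORT_VERSION`, and the length-prefixed byte layout are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a request with an optional artificial document.
class OptionalDocWireSketch {
    // Assumed integer encoding of "1.4.0"; stands in for Version.V_1_4_0.
    static final int DOC_SUPPORT_VERSION = 10400;

    final String id;
    final byte[] doc; // null when the request targets an indexed document

    OptionalDocWireSketch(String id, byte[] doc) {
        this.id = id;
        this.doc = doc;
    }

    // Serializes the request; the doc section is only written for new peers.
    byte[] write(int peerVersion) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(id);
            if (peerVersion >= DOC_SUPPORT_VERSION) {
                out.writeBoolean(doc != null); // presence flag
                if (doc != null) {
                    out.writeInt(doc.length);
                    out.write(doc);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Mirrors write(): reads the flag first, then the bytes only if present.
    static OptionalDocWireSketch read(byte[] wire, int peerVersion) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            String id = in.readUTF();
            byte[] doc = null;
            if (peerVersion >= DOC_SUPPORT_VERSION && in.readBoolean()) {
                doc = new byte[in.readInt()];
                in.readFully(doc);
            }
            return new OptionalDocWireSketch(id, doc);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch both sides must agree on the version they pass in. A peer below the gate never sees the flag, so the document is silently dropped rather than misread as the next field.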

src/main/java/org/elasticsearch/action/termvector/TermVectorRequestBuilder.java

+33
@@ -22,6 +22,7 @@
 import org.elasticsearch.action.ActionListener;
 import org.elasticsearch.action.ActionRequestBuilder;
 import org.elasticsearch.client.Client;
+import org.elasticsearch.common.xcontent.XContentBuilder;
 
 /**
 */
@@ -35,6 +36,38 @@ public TermVectorRequestBuilder(Client client, String index, String type, String
 super(client, new TermVectorRequest(index, type, id));
 }
 
+/**
+* Sets the index where the document is located.
+*/
+public TermVectorRequestBuilder setIndex(String index) {
+request.index(index);
+return this;
+}
+
+/**
+* Sets the type of the document.
+*/
+public TermVectorRequestBuilder setType(String type) {
+request.type(type);
+return this;
+}
+
+/**
+* Sets the id of the document.
+*/
+public TermVectorRequestBuilder setId(String id) {
+request.id(id);
+return this;
+}
+
+/**
+* Sets the artificial document from which to generate term vectors.
+*/
+public TermVectorRequestBuilder setDoc(XContentBuilder xContent) {
+request.doc(xContent);
+return this;
+}
+
 /**
 * Sets the routing. Required if routing isn't id based.
 */
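
Reviewer note: `setDoc` delegates to `TermVectorRequest.doc(...)`, which per the diff also assigns a counter-based id, so a request carrying only a `doc` still passes the "id or doc is missing" check. The sketch below illustrates that interaction with hypothetical stand-in classes (`SketchRequest`, `SketchRequestBuilder`) and a plain `String` in place of `XContentBuilder`. It is an illustration under those assumptions, not the committed code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for TermVectorRequest; only the fields relevant here.
class SketchRequest {
    // Shared counter, as in the diff; the ids are sequential rather than random.
    private static final AtomicInteger NEXT_ID = new AtomicInteger(0);

    String id;
    String doc; // JSON source of the artificial document, or null

    SketchRequest id(String id) {
        this.id = id;
        return this;
    }

    // Setting a doc also assigns a synthetic id, which the diff uses for routing.
    SketchRequest doc(String json) {
        this.id = String.valueOf(NEXT_ID.getAndAdd(1));
        this.doc = json;
        return this;
    }

    // Returns an error message, or null when the request is valid.
    String validate() {
        return (id == null && doc == null) ? "id or doc is missing" : null;
    }
}

// Hypothetical stand-in for TermVectorRequestBuilder: fluent setters that delegate.
class SketchRequestBuilder {
    final SketchRequest request = new SketchRequest();

    SketchRequestBuilder setId(String id) {
        request.id(id);
        return this;
    }

    SketchRequestBuilder setDoc(String json) {
        request.doc(json);
        return this;
    }
}
```

A call such as `new SketchRequestBuilder().setDoc("{\"text\":\"twitter test\"}")` yields a request that validates without an explicit id. Because the counter is process-wide, two artificial documents built in the same JVM receive different ids.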
