Support for artificial documents #7530
Changes from 5 commits
e9921b4
d033e59
9556a27
4ec2acc
13640dd
35cd69a
688f9b5
33c6364
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,10 +3,11 @@ | |
|
||
added[1.0.0.Beta1] | ||
|
||
Returns information and statistics on terms in the fields of a | ||
particular document as stored in the index. Note that this is a | ||
near realtime API as the term vectors are not available until the | ||
next refresh. | ||
Returns information and statistics on terms in the fields of a particular | ||
document. The document could be stored in the index or artificially provided | ||
by the user coming[1.4.0]. Note that for documents stored in the index, this | ||
is a near realtime API as the term vectors are not available until the next | ||
refresh. | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
|
@@ -41,10 +42,10 @@ statistics are returned for all fields but no term statistics. | |
* term payloads (`payloads` : true), as base64 encoded bytes | ||
|
||
If the requested information wasn't stored in the index, it will be | ||
computed on the fly if possible. See <<mapping-types,type mapping>> | ||
for how to configure your index to store term vectors. | ||
computed on the fly if possible. Additionally, term vectors could be computed | ||
for documents not even existing in the index, but instead provided by the user. | ||
|
||
coming[1.4.0,The ability to computed term vectors on the fly is only available from 1.4.0 onwards (see below)] | ||
coming[1.4.0,The ability to compute term vectors on the fly as well as support for artificial documents is only available from 1.4.0 onwards (see examples 2 and 3 below, respectively)] | ||
|
||
[WARNING] | ||
====== | ||
|
@@ -86,7 +87,9 @@ The term and field statistics are not accurate. Deleted documents | |
are not taken into account. The information is only retrieved for the | ||
shard the requested document resides in. The term and field statistics | ||
are therefore only useful as relative measures whereas the absolute | ||
numbers have no meaning in this context. | ||
numbers have no meaning in this context. By default, when requesting | ||
term vectors of artificial documents, a shard to get the statistics from | ||
is randomly selected. Set `routing` only when a particular shard must be hit. | ||
|
||
[float] | ||
=== Example 1 | ||
|
@@ -231,7 +234,7 @@ Response: | |
[float] | ||
=== Example 2 coming[1.4.0] | ||
|
||
Additionally, term vectors which are not explicitly stored in the index are automatically | ||
Term vectors which are not explicitly stored in the index are automatically | ||
computed on the fly. The following request returns all information and statistics for the | ||
fields in document `1`, even though the terms haven't been explicitly stored in the index. | ||
Note that for the field `text`, the terms are not re-generated. | ||
|
@@ -246,3 +249,23 @@ curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{ | |
"field_statistics" : true | ||
}' | ||
-------------------------------------------------- | ||
|
||
[float] | ||
=== Example 3 coming[1.4.0] | ||
|
||
Additionally, term vectors can also be generated for artificial documents, | ||
that is for documents not present in the index. The syntax is similar to the | ||
<<search-percolate,percolator>> API. For example, the following request would | ||
return the same results as in example 1. The mapping used is determined by the | ||
`index` and `type`. | ||
Review comment: Can you leave a note about the fact that it can introduce new mappings?
||
|
||
[source,js] | ||
-------------------------------------------------- | ||
curl -XGET 'http://localhost:9200/twitter/tweet/_termvector' -d '{ | ||
"doc" : { | ||
"fullname" : "John Doe", | ||
"text" : "twitter test test test" | ||
} | ||
}' | ||
-------------------------------------------------- | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,12 +26,17 @@ | |
import org.elasticsearch.action.ValidateActions; | ||
import org.elasticsearch.action.get.MultiGetRequest; | ||
import org.elasticsearch.action.support.single.shard.SingleShardOperationRequest; | ||
import org.elasticsearch.common.bytes.BytesReference; | ||
import org.elasticsearch.common.io.stream.StreamInput; | ||
import org.elasticsearch.common.io.stream.StreamOutput; | ||
import org.elasticsearch.common.xcontent.XContentBuilder; | ||
import org.elasticsearch.common.xcontent.XContentParser; | ||
|
||
import java.io.IOException; | ||
import java.util.*; | ||
import java.util.concurrent.atomic.AtomicInteger; | ||
|
||
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder; | ||
|
||
/** | ||
* Request returning the term vector (doc frequency, positions, offsets) for a | ||
|
@@ -46,10 +51,14 @@ public class TermVectorRequest extends SingleShardOperationRequest<TermVectorReq | |
|
||
private String id; | ||
|
||
private BytesReference doc; | ||
|
||
private String routing; | ||
|
||
protected String preference; | ||
|
||
private static AtomicInteger randomInt = new AtomicInteger(0); | ||
Review comment: Can you make it final?
||
|
||
// TODO: change to String[] | ||
private Set<String> selectedFields; | ||
|
||
|
@@ -129,6 +138,23 @@ public TermVectorRequest id(String id) { | |
return this; | ||
} | ||
|
||
/** | ||
* Returns the artificial document for which term vectors are requested. | ||
*/ | ||
public BytesReference doc() { | ||
return doc; | ||
} | ||
|
||
/** | ||
* Sets an artificial document for which term vectors are requested. | ||
*/ | ||
public TermVectorRequest doc(XContentBuilder documentBuilder) { | ||
// assign a random id to this artificial document, for routing | ||
this.id(String.valueOf(randomInt.getAndAdd(1))); | ||
this.doc = documentBuilder.bytes(); | ||
return this; | ||
} | ||
|
||
/** | ||
* @return The routing for this request. | ||
*/ | ||
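The `doc(...)` setter above assigns a synthetic id drawn from a shared counter so that the artificial document can be routed. A minimal sketch of that idea follows, with the counter declared `final` as the review comment asks. The class name `ArtificialIdSketch` and the use of a plain `String` for the document body are assumptions made for illustration; they are not part of the PR. Note that, despite the field name `randomInt` in the diff, a counter that starts at 0 and is incremented by 1 yields sequential ids, not random ones.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the request class; only the id assignment is sketched. */
final class ArtificialIdSketch {
    // Declared final: the reference to the shared counter never changes.
    private static final AtomicInteger COUNTER = new AtomicInteger(0);

    private String id;
    private String doc;

    /** Stores the artificial document and gives it a synthetic id for routing. */
    ArtificialIdSketch doc(String json) {
        // getAndIncrement() is atomic, so concurrent callers never share an id.
        this.id = String.valueOf(COUNTER.getAndIncrement());
        this.doc = json;
        return this;
    }

    String id() {
        return id;
    }

    String doc() {
        return doc;
    }
}
```

Because the ids are sequential, consecutive artificial documents would be spread across shards by whatever id-based routing is in effect, rather than chosen at random.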
|
@@ -281,8 +307,8 @@ public ActionRequestValidationException validate() { | |
if (type == null) { | ||
validationException = ValidateActions.addValidationError("type is missing", validationException); | ||
} | ||
if (id == null) { | ||
validationException = ValidateActions.addValidationError("id is missing", validationException); | ||
if (id == null && doc == null) { | ||
validationException = ValidateActions.addValidationError("id or doc is missing", validationException); | ||
} | ||
return validationException; | ||
} | ||
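The relaxed check in this hunk accepts a request as long as either an `id` or a `doc` is present. The following is a small self-contained sketch of that rule; the class `IdOrDocValidationSketch` and the list-of-strings error format are assumptions for illustration and do not mirror `ActionRequestValidationException`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper mirroring the "id or doc" rule from the hunk above. */
final class IdOrDocValidationSketch {

    /** Returns the validation errors; an empty list means the request is valid. */
    static List<String> validate(String type, String id, String doc) {
        List<String> errors = new ArrayList<>();
        if (type == null) {
            errors.add("type is missing");
        }
        // Either a stored document id or an artificial document must be given.
        if (id == null && doc == null) {
            errors.add("id or doc is missing");
        }
        return errors;
    }
}
```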
|
@@ -303,6 +329,9 @@ public void readFrom(StreamInput in) throws IOException { | |
} | ||
type = in.readString(); | ||
id = in.readString(); | ||
if (in.readBoolean()) { | ||
Review comment: I think you need to check the stream version?
||
doc = in.readBytesReference(); | ||
} | ||
routing = in.readOptionalString(); | ||
preference = in.readOptionalString(); | ||
long flags = in.readVLong(); | ||
|
@@ -331,6 +360,10 @@ public void writeTo(StreamOutput out) throws IOException { | |
} | ||
out.writeString(type); | ||
out.writeString(id); | ||
out.writeBoolean(doc != null); | ||
if (doc != null) { | ||
out.writeBytesReference(doc); | ||
} | ||
out.writeOptionalString(routing); | ||
out.writeOptionalString(preference); | ||
long longFlags = 0; | ||
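The review comment on `readFrom` asks for a stream version check, because a node that predates the optional `doc` field would neither write nor expect the extra boolean. The sketch below illustrates version-gated serialization with plain `java.io` streams. The integer `peerVersion`, the constant `DOC_FIELD_VERSION`, and the class name are hypothetical stand-ins for the stream version the reviewer refers to; this is not the Elasticsearch `StreamInput`/`StreamOutput` API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical illustration of gating an optional field on the peer's version. */
final class VersionGatedDocSketch {
    // Assumed wire version at which the optional doc field was introduced.
    static final int DOC_FIELD_VERSION = 2;

    /** Serializes id and, only for new enough peers, the optional doc. */
    static byte[] write(int peerVersion, String id, byte[] doc) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(id);
            if (peerVersion >= DOC_FIELD_VERSION) {
                // Presence flag first, then the length-prefixed payload.
                out.writeBoolean(doc != null);
                if (doc != null) {
                    out.writeInt(doc.length);
                    out.write(doc);
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the doc bytes, or null if absent or the peer predates the field. */
    static byte[] readDoc(int peerVersion, byte[] wire) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
            in.readUTF(); // skip the id
            // Old peers never wrote the flag, so it must not be read for them.
            if (peerVersion >= DOC_FIELD_VERSION && in.readBoolean()) {
                byte[] doc = new byte[in.readInt()];
                in.readFully(doc);
                return doc;
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The reader and writer must apply the same version test; otherwise the bytes that follow the optional field (routing, preference, flags in the hunks above) would be misread.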
|
@@ -389,7 +422,15 @@ public static void parseRequest(TermVectorRequest termVectorRequest, XContentPar | |
} else if ("_type".equals(currentFieldName)) { | ||
termVectorRequest.type = parser.text(); | ||
} else if ("_id".equals(currentFieldName)) { | ||
if (termVectorRequest.doc != null) { | ||
throw new ElasticsearchParseException("Either \"id\" or \"doc\" can be specified, but not both!"); | ||
} | ||
termVectorRequest.id = parser.text(); | ||
} else if ("doc".equals(currentFieldName)) { | ||
if (termVectorRequest.id != null) { | ||
throw new ElasticsearchParseException("Either \"id\" or \"doc\" can be specified, but not both!"); | ||
} | ||
termVectorRequest.doc(jsonBuilder().copyCurrentStructure(parser)); | ||
} else if ("_routing".equals(currentFieldName) || "routing".equals(currentFieldName)) { | ||
termVectorRequest.routing = parser.text(); | ||
} else { | ||
|
@@ -398,7 +439,6 @@ public static void parseRequest(TermVectorRequest termVectorRequest, XContentPar | |
} | ||
} | ||
} | ||
|
||
if (fields.size() > 0) { | ||
String[] fieldsAsArray = new String[fields.size()]; | ||
termVectorRequest.selectedFields(fields.toArray(fieldsAsArray)); | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,6 +22,7 @@ | |
import org.elasticsearch.action.ActionListener; | ||
import org.elasticsearch.action.ActionRequestBuilder; | ||
import org.elasticsearch.client.Client; | ||
import org.elasticsearch.common.xcontent.XContentBuilder; | ||
|
||
/** | ||
*/ | ||
|
@@ -35,6 +36,26 @@ public TermVectorRequestBuilder(Client client, String index, String type, String | |
super(client, new TermVectorRequest(index, type, id)); | ||
} | ||
|
||
public TermVectorRequestBuilder setIndex(String index) { | ||
request.index(index); | ||
return this; | ||
} | ||
|
||
public TermVectorRequestBuilder setId(String id) { | ||
request.id(id); | ||
return this; | ||
} | ||
|
||
public TermVectorRequestBuilder setType(String type) { | ||
request.type(type); | ||
return this; | ||
} | ||
|
||
public TermVectorRequestBuilder setDoc(XContentBuilder xContent) { | ||
request.doc(xContent); | ||
return this; | ||
} | ||
|
||
Review comment: Can you add javadocs since there are user-facing APIs?
||
/** | ||
* Sets the routing. Required if routing isn't id based. | ||
*/ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -81,10 +81,11 @@ private static class FieldStrings { | |
private String id; | ||
private long docVersion; | ||
private boolean exists = false; | ||
private boolean artificial = false; | ||
|
||
private boolean sourceCopied = false; | ||
|
||
int[] curentPositions = new int[0]; | ||
int[] currentPositions = new int[0]; | ||
int[] currentStartOffset = new int[0]; | ||
int[] currentEndOffset = new int[0]; | ||
BytesReference[] currentPayloads = new BytesReference[0]; | ||
|
@@ -156,7 +157,6 @@ public int size() { | |
} | ||
}; | ||
} | ||
|
||
} | ||
|
||
@Override | ||
|
@@ -166,7 +166,9 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws | |
assert id != null; | ||
builder.field(FieldStrings._INDEX, index); | ||
builder.field(FieldStrings._TYPE, type); | ||
builder.field(FieldStrings._ID, id); | ||
if (!isArtificial()) { | ||
builder.field(FieldStrings._ID, id); | ||
} | ||
builder.field(FieldStrings._VERSION, docVersion); | ||
builder.field(FieldStrings.FOUND, isExists()); | ||
if (!isExists()) { | ||
|
@@ -181,7 +183,6 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws | |
} | ||
builder.endObject(); | ||
return builder; | ||
|
||
} | ||
|
||
private void buildField(XContentBuilder builder, final CharsRef spare, Fields theFields, Iterator<String> fieldIter) throws IOException { | ||
|
@@ -237,7 +238,7 @@ private void buildValues(XContentBuilder builder, Terms curTerms, int termFreq) | |
for (int i = 0; i < termFreq; i++) { | ||
builder.startObject(); | ||
if (curTerms.hasPositions()) { | ||
builder.field(FieldStrings.POS, curentPositions[i]); | ||
builder.field(FieldStrings.POS, currentPositions[i]); | ||
} | ||
if (curTerms.hasOffsets()) { | ||
builder.field(FieldStrings.START_OFFSET, currentStartOffset[i]); | ||
|
@@ -249,14 +250,13 @@ private void buildValues(XContentBuilder builder, Terms curTerms, int termFreq) | |
builder.endObject(); | ||
} | ||
builder.endArray(); | ||
|
||
} | ||
|
||
private void initValues(Terms curTerms, DocsAndPositionsEnum posEnum, int termFreq) throws IOException { | ||
for (int j = 0; j < termFreq; j++) { | ||
int nextPos = posEnum.nextPosition(); | ||
if (curTerms.hasPositions()) { | ||
curentPositions[j] = nextPos; | ||
currentPositions[j] = nextPos; | ||
} | ||
if (curTerms.hasOffsets()) { | ||
currentStartOffset[j] = posEnum.startOffset(); | ||
|
@@ -269,15 +269,14 @@ private void initValues(Terms curTerms, DocsAndPositionsEnum posEnum, int termFr | |
} else { | ||
currentPayloads[j] = null; | ||
} | ||
|
||
} | ||
} | ||
} | ||
|
||
private void initMemory(Terms curTerms, int termFreq) { | ||
// init memory for performance reasons | ||
if (curTerms.hasPositions()) { | ||
curentPositions = ArrayUtil.grow(curentPositions, termFreq); | ||
currentPositions = ArrayUtil.grow(currentPositions, termFreq); | ||
} | ||
if (curTerms.hasOffsets()) { | ||
currentStartOffset = ArrayUtil.grow(currentStartOffset, termFreq); | ||
|
@@ -336,7 +335,6 @@ public void setTermVectorField(BytesStreamOutput output) { | |
|
||
public void setHeader(BytesReference header) { | ||
headerRef = header; | ||
|
||
} | ||
|
||
public void setDocVersion(long version) { | ||
|
@@ -356,4 +354,11 @@ public String getId() { | |
return id; | ||
} | ||
|
||
public boolean isArtificial() { | ||
return artificial; | ||
} | ||
|
||
public void setArtificial(boolean artificial) { | ||
this.artificial = artificial; | ||
} | ||
Review comment: I don't think this method should be public?
Review comment: I think it should be, just like isExists() and setExists, no?
Review comment: Hmm, I had not noticed the other fields had the same issue. I would rather like these setters to not exist on TermVectorsResponse like GetResponse. Maybe we can just worry about setArtificial here and remove the other setters in a different PR?
Review comment: Shouldn't we have something similar to a
Review comment: ok
||
} |
Review comment: Doesn't the term vectors API currently return statistics that are aggregated across all shards? Documentation suggests so?
Review comment: Nope it does not. "The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in."
Review comment: ok