Provide API to lookup documents from external sources #67

sbtourist · 2010-03-18T09:45:03Z

When the _source field is disabled, the source document is neither stored nor returned in ES queries.

It would be great to provide an API that users can implement to look up the source document from an external source, probably keyed by index name and document id.
This way, the document source could be used inside ES for search features that may need it, as well as transparently returned to the client inside the ES response.

Thoughts?

timrobertson100 · 2010-04-05T14:38:07Z

This was my intended approach for using Lucene with HBase, and I have done preliminary tests, but I outgrew the single-machine capabilities of Lucene (e.g. indexes >300G, which led me to look at ES). On single-machine indices it worked very well, but I never got to test it properly while HBase performed region splits (I believe the search layer might suffer timeouts while HBase does this).

I'd be interested in looking at the HBase integration if I can find time. I would write a MapReduce job for building the ES index, and also implement the interface. Taking it a bit further... do you think it would be wise to allow ES to push content down as well? E.g. use ES as the web service in front of the storage layer?

deinspanjer · 2010-05-25T13:36:34Z

I'm evaluating integrating ES with HBase for Mozilla's Socorro project. We are storing billions of JSON objects, and we are sizing our Hadoop cluster to give us adequate storage of this data. If _source were not disabled, it would mean that I need to account for an additional copy of the raw JSON in addition to the unavoidable overhead of the index data that is generated.

The most typical use cases for our search are going to be returning a list of "name" fields that would be hyperlinked to display the document data, and returning a faceted set of results to allow users to drill in on subsets.
In both these cases, an API to allow transparent retrieval of the document from HBase would be a welcome addition.

imotov · 2012-02-13T20:20:09Z

Shay,

We really need this feature. If you have no objections, I would like to implement it. My current thinking is to make the external source provider pluggable on the index level. By default, such a source provider will store and retrieve sources in the Lucene _source field the same way it's done today. I will implement a file-system based source provider as a simple (but not practical) example. However, most users will use it with their own custom source providers (using HBase, MySQL, S3, etc. to retrieve the source).

An external source provider will be configured on the index level similar to the way analyzers are configured today. I envision something like this:

{
    "tweet" : {
        "_source" : {
            "enabled" : true,
            "type" : "file",
            "root_directory" : "/data",
            "source_ref" : "filepath"
        }
    }
}

This configuration will instantiate a file-based external source provider, org.elasticsearch.index.source.file.FileSourceProvider, which will retrieve the source from files specified in the "filepath" field and located under the "/data" directory.
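To make the file-based lookup concrete, here is a minimal sketch of how a value from the "filepath" field could be resolved against "root_directory". It is not the proposed FileSourceProvider (which was never merged); the class name `FileSourceLookup` and the method `readSource` are hypothetical, and the path-containment check is an added assumption rather than part of the proposal.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only; not an Elasticsearch class.
final class FileSourceLookup {
    private final Path rootDirectory;

    FileSourceLookup(String rootDirectory) {
        // Normalize once so the containment check below compares like with like.
        this.rootDirectory = Paths.get(rootDirectory).toAbsolutePath().normalize();
    }

    /** Reads the source bytes for the relative path held in the document's "filepath" field. */
    byte[] readSource(String filepathFieldValue) {
        Path resolved = rootDirectory.resolve(filepathFieldValue).normalize();
        // Reject values such as "../secret" that would escape the root directory.
        if (!resolved.startsWith(rootDirectory)) {
            throw new IllegalArgumentException("path escapes root directory: " + filepathFieldValue);
        }
        try {
            return Files.readAllBytes(resolved);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the configuration above, `new FileSourceLookup("/data").readSource("tweets/1.json")` would read `/data/tweets/1.json`, where `tweets/1.json` is an invented example value.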

External source providers will implement two methods:

void parseSource(ParseContext context) that will be called by SourceFieldMapper.parse()
byte[] extractSource(Document doc) that will be called by ShardGetService.extractSource() and FetchPhase.extractSource()
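The two methods above could be expressed as an interface roughly like the following. This is a sketch under stated assumptions: `DocStub` is a hypothetical stand-in for both `ParseContext` and Lucene's `Document` so that the example compiles on its own, and `InMemorySourceProvider` is an invented toy implementation that uses a map in place of an external store such as HBase.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for ParseContext / Lucene Document, so the sketch is self-contained.
final class DocStub {
    final Map<String, String> fields = new HashMap<>();
    byte[] rawSource;
}

// Shape of the proposed provider: one hook at index time, one at fetch time.
interface SourceProvider {
    void parseSource(DocStub context);   // would be called from SourceFieldMapper.parse()
    byte[] extractSource(DocStub doc);   // would be called from the get and fetch paths
}

// Toy provider: the map plays the role of the external system.
final class InMemorySourceProvider implements SourceProvider {
    private final Map<String, byte[]> externalStore = new HashMap<>();

    @Override
    public void parseSource(DocStub context) {
        // Assumption: the document id is the lookup key for the external copy.
        externalStore.put(context.fields.get("id"), context.rawSource);
        context.rawSource = null; // nothing is kept alongside the index itself
    }

    @Override
    public byte[] extractSource(DocStub doc) {
        return externalStore.get(doc.fields.get("id"));
    }
}
```

Whether `parseSource` should write to the external store or only record a reference to data that already lives there is a design choice the proposal leaves open; the sketch picks the former purely for brevity.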

Proper caching of sources, connections and other resources will be the responsibility of custom source providers.
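As one illustration of provider-side caching, a custom provider could wrap its slow lookup in a small LRU cache. The sketch below uses only the JDK; `SourceLoader` and `CachingSourceLoader` are hypothetical names, and keying the cache by document id is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical abstraction over "fetch the source for this id from somewhere slow".
interface SourceLoader {
    byte[] load(String documentId);
}

// Decorator that keeps the most recently used sources in memory.
final class CachingSourceLoader implements SourceLoader {
    private final SourceLoader delegate;
    private final Map<String, byte[]> cache;

    CachingSourceLoader(SourceLoader delegate, final int maxEntries) {
        this.delegate = delegate;
        // accessOrder=true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxEntries; // evict once the cap is exceeded
            }
        };
    }

    @Override
    public synchronized byte[] load(String documentId) {
        byte[] cached = cache.get(documentId);
        if (cached == null) {
            cached = delegate.load(documentId);
            if (cached != null) {
                cache.put(documentId, cached); // misses (null) are not cached
            }
        }
        return cached;
    }
}
```

A single `synchronized` method keeps the sketch short; a provider serving many concurrent fetches would likely want a concurrent cache and some invalidation strategy for updated documents, neither of which is shown here.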

Does it make sense?

clintongormley · 2014-07-18T07:39:56Z

After much discussion, we have decided we're against adding this feature. It is very complex, has too many corner cases, and the latency will be terrible. While this could be implemented as a plugin, we highly recommend not trying to do this. It is much better to store your data directly in Elasticsearch and use the native functionality.

imotov mentioned this issue Mar 21, 2012

add plugin mechanism for handling source storage and retrieval #1798

Closed

clintongormley added the discuss label Jul 8, 2014

clintongormley closed this as completed Jul 18, 2014

bluelu mentioned this issue Dec 11, 2014

Node is not responsive after the end of a big merge for close to 10 minutes #8905

Closed

makeyang mentioned this issue Sep 7, 2015

es 0.90.2 plus jdk6.0_25-b06 crashed on production #13368

Closed

rmuir pushed a commit to rmuir/elasticsearch that referenced this issue Nov 8, 2015

Update to elasticsearch 1.3.0

7c1c201

Closes elastic#67. (cherry picked from commit d3eaac9)

rmuir pushed a commit to rmuir/elasticsearch that referenced this issue Nov 8, 2015

Update to Lucene 4.9.0

1d1225b

Update to elasticsearch 1.3.0 Move to java 1.7 Related to elastic#67. Closed elastic#76. (cherry picked from commit 2303932)
