Provide API to lookup documents from external sources #67

Closed
sbtourist opened this issue Mar 18, 2010 · 4 comments

@sbtourist

When the _source field is disabled, the source document is neither stored nor returned in ES queries.

It would be great to provide an API that can be implemented to look up the source document from external sources, probably keyed by index name and document id.
This way, the document source could be used inside ES for search features that may need it, as well as transparently returned to the client inside the ES response.
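
One possible shape for such a lookup API, sketched in Java under loud assumptions: the names `ExternalSourceLookup`, `InMemorySourceLookup`, `lookup`, and `put` are hypothetical and not part of Elasticsearch, and an in-memory map stands in for a real external store.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical contract (not an Elasticsearch API): resolve a document's
// source from an external store, keyed by index name and document id.
interface ExternalSourceLookup {
    Optional<byte[]> lookup(String index, String id);
}

// Toy implementation backed by an in-memory map, standing in for a real
// external store such as a database or a distributed file system.
final class InMemorySourceLookup implements ExternalSourceLookup {
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();

    private static String key(String index, String id) {
        // Length-prefix the index name so ("a", "b/c") and ("a/b", "c") cannot collide.
        return index.length() + ":" + index + "/" + id;
    }

    void put(String index, String id, String json) {
        store.put(key(index, id), json.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public Optional<byte[]> lookup(String index, String id) {
        // An empty Optional means the external store has no such document.
        return Optional.ofNullable(store.get(key(index, id)));
    }
}
```

Returning an `Optional` makes the "document missing in the external store" case explicit, which a search response would have to handle somehow.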

Thoughts?

@timrobertson100

This was my intended approach for using Lucene with HBase, and I have done preliminary tests, but I outgrew the single-machine capabilities of Lucene (e.g. indexes >300G, which led me to look at ES). On single-machine indices it worked very well, but I never got to test it properly while HBase performed region splits (I believe you might suffer timeouts up in the search layer when that happens).

I'd be interested in looking at the HBase integration if I can find time. I would write a MapReduce job to build the ES index, and also implement the interface. Taking it a bit further: would you think it wise to allow ES to push content down as well, e.g. to use ES as the web service in front of the storage layer?

@deinspanjer

I'm evaluating integrating ES with HBase for Mozilla's Socorro project. We are storing billions of JSON objects, and we are sizing our Hadoop cluster to give us adequate storage of this data. If _source were not disabled, it would mean that I need to account for an additional copy of the raw JSON in addition to the unavoidable overhead of the index data that is generated.

The most typical use cases for our search are going to be returning a list of "name" fields that would be hyperlinked to display the document data, and returning a faceted set of results to allow users to drill in on subsets.
In both these cases, an API to allow transparent retrieval of the document from HBase would be a welcome addition.

@imotov
Contributor

imotov commented Feb 13, 2012

Shay,

We really need this feature. If you have no objections, I would like to implement it. My current thinking is to make the external source provider pluggable at the index level. The default source provider will store and retrieve sources in the Lucene _source field, the same way it's done today. I will implement a file-system-based source provider as a simple (but not practical) example. However, most users will use it with their own custom source providers (using HBase, MySQL, S3, etc. to retrieve the source).

An external source provider will be configured on the index level similar to the way analyzers are configured today. I envision something like this:

{
    "tweet" : {
        "_source" : {
            "enabled" : true,
            "type" : "file",
            "root_directory" : "/data",
            "source_ref" : "filepath"
        }
    }
}

This configuration will instantiate a file-based external source provider, org.elasticsearch.index.source.file.FileSourceProvider, that will retrieve the source from files named in the "filepath" field and located in the "/data" directory.

External source providers will implement two methods:

  • void parseSource(ParseContext context) that will be called by SourceFieldMapper.parse()
  • byte[] extractSource(Document doc) that will be called by ShardGetService.extractSource() and FetchPhase.extractSource()

Proper caching of sources, connections and other resources will be the responsibility of custom source providers.
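
To make the two-method contract concrete, here is a rough Java sketch of a file-based provider. It is not the proposed implementation: `SourceProviderSketch` and `FileSourceProviderSketch` are hypothetical names, and plain maps stand in for `ParseContext` and the Lucene `Document` (an assumption) so that the example compiles without Elasticsearch on the classpath. It does no caching.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

// Simplified stand-in for the proposed contract. The proposal takes a
// ParseContext and a Lucene Document; plain maps are used here (an assumption).
interface SourceProviderSketch {
    // Index time: decide what to keep in the index instead of the full source.
    void parseSource(Map<String, Object> incomingDoc, Map<String, String> storedFields);

    // Fetch time: rebuild the source bytes from what was stored in the index.
    byte[] extractSource(Map<String, String> storedFields) throws IOException;
}

final class FileSourceProviderSketch implements SourceProviderSketch {
    private final Path rootDirectory; // corresponds to "root_directory" in the mapping
    private final String sourceRef;   // corresponds to "source_ref" in the mapping

    FileSourceProviderSketch(String rootDirectory, String sourceRef) {
        this.rootDirectory = Paths.get(rootDirectory).toAbsolutePath().normalize();
        this.sourceRef = sourceRef;
    }

    @Override
    public void parseSource(Map<String, Object> incomingDoc, Map<String, String> storedFields) {
        Object ref = incomingDoc.get(sourceRef);
        if (ref == null) {
            throw new IllegalArgumentException("missing source reference field: " + sourceRef);
        }
        // Store only the reference, not the document body.
        storedFields.put(sourceRef, ref.toString());
    }

    @Override
    public byte[] extractSource(Map<String, String> storedFields) throws IOException {
        String ref = storedFields.get(sourceRef);
        if (ref == null) {
            return null; // no reference stored, so there is no source to return
        }
        Path file = rootDirectory.resolve(ref).normalize();
        // Reject references such as "../x" that would escape the root directory.
        if (!file.startsWith(rootDirectory)) {
            throw new IOException("source reference escapes root directory: " + ref);
        }
        return Files.readAllBytes(file);
    }
}
```

With the mapping above, this would be constructed as `new FileSourceProviderSketch("/data", "filepath")`. Every fetch reads the file from disk, which illustrates why caching is left to the provider.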

Does it make sense?

@clintongormley
Contributor

After much discussion, we have decided we're against adding this feature. It is very complex, has too many corner cases, and the latency will be terrible. While this could be implemented as a plugin, we highly recommend not trying to do this. It is much better to store your data directly in Elasticsearch and use the native functionality.

rmuir pushed a commit to rmuir/elasticsearch that referenced this issue Nov 8, 2015
Closes elastic#67.
(cherry picked from commit d3eaac9)
rmuir pushed a commit to rmuir/elasticsearch that referenced this issue Nov 8, 2015
Update to elasticsearch 1.3.0
Move to java 1.7

Related to elastic#67.
Closed elastic#76.

(cherry picked from commit 2303932)
henningandersen pushed a commit to henningandersen/elasticsearch that referenced this issue Jun 4, 2020
Since standard rally-tracks (https://github.com/elastic/rally-tracks)
use `number_of_shards` as the track parameter for changing the number of
primary shards, this commit applies this convention to this track as
well.
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Oct 2, 2023
With this commit we add an index-only benchmark on three nodes based on
the new NIO transport. We also ensure that these benchmarks only run
when x-pack is disabled (currently TLS is not supported) and we also do
not run them for releases prior to 7.0 as this transport is new in
Elasticsearch 7.0.

Closes elastic#42
Relates elastic#67

5 participants