Provide API to lookup documents from external sources #67
Comments
This was my intended approach for using Lucene with HBase. I have done preliminary tests, but outgrew the single-machine capabilities of Lucene (e.g. indexes >300G), which led me to look at ES. On single-machine indices it worked very well, but I never got to test it properly while HBase performed region splits (I believe the search layer might suffer timeouts when HBase does so). I'd be interested in looking at the HBase integration if I can find time. I would write a MapReduce job for building the ES index, and also implement the interface. Taking it a bit further: would you think it wise to allow ES to push down content as well, e.g. use ES as the web service in front of the storage layer?
I'm evaluating integrating ES with HBase for Mozilla's Socorro project. We are storing billions of JSON objects, and we are sizing our Hadoop cluster to give us adequate storage for this data. If _source were not disabled, I would need to account for an additional copy of the raw JSON on top of the unavoidable overhead of the generated index data. The most typical use cases for our search are going to be returning a list of "name" fields, hyperlinked to display the document data, and returning a faceted set of results so users can drill into subsets.
Shay, we really need this feature. If you have no objections, I would like to implement it. My current thinking is to make the external source provider pluggable at the index level. By default, the source provider will store and retrieve sources in the Lucene _source field, the same way it's done today. I will implement a file-system-based source provider as a simple (but not practical) example. However, most users will use it with their own custom source providers (using HBase, MySQL, S3, etc. to retrieve the source). An external source provider will be configured at the index level, similar to the way analyzers are configured today. I envision something like this:
{
    "tweet" : {
        "_source" : {
            "enabled" : true,
            "type" : "file",
            "root_directory" : "/data",
            "source_ref" : "filepath"
        }
    }
}
This configuration will instantiate a file-based external source provider, org.elasticsearch.index.source.file.FileSourceProvider, that will retrieve the source from files specified in the "filepath" field and located in the "/data" directory. External source providers will implement two methods:
Proper caching of sources, connections and other resources will be the responsibility of custom source providers. Does it make sense?
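To make the proposal above more concrete, here is a minimal sketch in Python (chosen purely for illustration) of how a file-based provider driven by that configuration might behave. Nothing here is Elasticsearch code: the class mirrors only the name FileSourceProvider from the comment, and `load_source` and `provider_from_mapping` are hypothetical names. The two methods mentioned in the comment are not listed in the thread, so the single lookup method below is an assumption.

```python
import json
from pathlib import Path


class FileSourceProvider:
    """Hypothetical file-based source provider (illustration only)."""

    def __init__(self, root_directory, source_ref):
        # root_directory and source_ref correspond to the keys in the
        # proposed "_source" configuration.
        self.root = Path(root_directory).resolve()
        self.source_ref = source_ref

    def load_source(self, stored_fields):
        """Return the parsed JSON source for a document.

        stored_fields is assumed to be a dict of the document's stored
        fields, one of which (named by source_ref) holds a relative path.
        """
        path = (self.root / stored_fields[self.source_ref]).resolve()
        # Refuse paths that escape the configured root directory.
        if self.root not in path.parents:
            raise ValueError(f"path escapes root directory: {path}")
        with path.open(encoding="utf-8") as handle:
            return json.load(handle)


def provider_from_mapping(source_mapping):
    """Build a provider from the "_source" section of a type mapping."""
    if source_mapping.get("type") != "file":
        raise ValueError("only the 'file' provider is sketched here")
    return FileSourceProvider(
        source_mapping["root_directory"], source_mapping["source_ref"]
    )
```

With the configuration from the comment, `provider_from_mapping(mapping["tweet"]["_source"])` would yield a provider that reads files under "/data". Caching is deliberately left out, in line with the comment's point that it would be the provider's own responsibility.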
After much discussion, we have decided we're against adding this feature. It is very complex, has too many corner cases, and the latency will be terrible. While this could be implemented as a plugin, we highly recommend not trying to do this. It is much better to store your data directly in Elasticsearch and use the native functionality.
When the _source field is disabled, the source document is neither stored nor returned in ES queries.
It would be great to provide an API that can be implemented to look up the source document from external sources, probably keyed by index name and document id.
This way, the document source could be used inside ES for search features that may need it, as well as transparently returned to the client inside the ES response.
Thoughts?
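As a rough illustration of the idea, the sketch below (Python, for illustration only) shows a lookup keyed by index name and document id being used to fill in missing sources on search hits. The `attach_sources` function and the `lookup` callable are hypothetical and not part of any Elasticsearch API; only the `_index`, `_id` and `_source` hit keys follow the shape of ES search responses.

```python
from typing import Callable, List, Optional

# Hypothetical lookup signature: (index name, document id) -> source or None.
SourceLookup = Callable[[str, str], Optional[dict]]


def attach_sources(hits: List[dict], lookup: SourceLookup) -> List[dict]:
    """Return copies of the hits with "_source" filled in where possible."""
    enriched = []
    for hit in hits:
        merged = dict(hit)  # shallow copy, so the caller's hits stay untouched
        if "_source" not in merged:
            source = lookup(hit["_index"], hit["_id"])
            if source is not None:
                merged["_source"] = source
        enriched.append(merged)
    return enriched
```

A lookup backed by HBase or another store would replace the callable. Each external fetch adds latency to the response, which is one of the concerns raised in the thread.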