Reindex from remote #18585
Conversation
Along those lines:
```java
import static org.elasticsearch.common.xcontent.ConstructingObjectParser.constructorArg;

public class RemoteScrollableHitSource extends ScrollableHitSource {
    private final CloseableHttpClient httpClient = HttpClients.createMinimal(); // NORELEASE replace me with the Elasticsearch client
```
happy to see this //NORELEASE. A PR is coming soon for the low level http java client. It already cleans up duplications around http clients in our classpath/code. Can we wait with this PR so it can use directly the proper java client?
I suspect so. Worst case we merge this first and rip all the duplication out when we cut over, but I'd like to avoid it.
having done it already once on my branch, I'd rather not do it again. I will try to speed up my work and send the PR asap
I don't understand - this stuff is relatively self contained and doesn't rely on any of the test stuff you've already replaced. Either I replace it with the http client during the PR or in a separate PR, but it won't make merging the http client PR harder if this is merged first, right?
it will cause a few merge conflicts, but what I don't like is the size of this PR, where stuff taken from REST tests becomes production code by copying it (HttpDeleteWithEntity and some Response bits). Let's just wait one/two weeks and this code won't be needed. Also the original code will get removed from rest tests and improved.
> stuff taken from REST tests becomes production code by copying it
Yeah, I wasn't particularly happy with that either but I felt like it was ok to use as a crutch while waiting on the client.
> just wait one/two weeks
I'd love to iterate on the non-stolen-from-rest-tests parts in the meantime. I certainly wouldn't be surprised if it takes a week or two to get this reviewed sensibly.
I'm not sure we need this. What are you trying to protect against by adding a whitelist?

It is to prevent Elasticsearch from being a jumping-off point in the case of a partially compromised network. I don't trust us to prevent a sufficiently determined individual from figuring out how to do something nasty with the http calls that Elasticsearch issues on their behalf. I don't think of reindex-from-remote as being anything like dynamic scripting in terms of exploitability, but I really like the idea of limiting it to a whitelist anyway, just for extra paranoia.
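A host whitelist like the one proposed could be sketched roughly as below. The class name, the simple `*` glob syntax, and the matching rules are all assumptions for illustration, not the PR's actual code:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch of a remote-host whitelist check, not the PR's
// implementation. Patterns use simple '*' wildcards against "host:port".
public class RemoteWhitelist {
    private final List<Pattern> allowed;

    public RemoteWhitelist(List<String> patterns) {
        // Turn globs like "*.internal:9200" into anchored regexes by
        // quoting the literal parts and letting '*' match any run.
        this.allowed = patterns.stream()
                .map(glob -> Pattern.compile(Pattern.quote(glob).replace("*", "\\E.*\\Q")))
                .collect(Collectors.toList());
    }

    /** True if the host:port string matches any whitelisted pattern. */
    public boolean isAllowed(String hostAndPort) {
        return allowed.stream().anyMatch(p -> p.matcher(hostAndPort).matches());
    }
}
```

The point of the full-match (`matches()` rather than `find()`) check is that a pattern can never be satisfied by a substring of a hostile host name.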
(force-pushed 5158dda to 42e092f)
Yesterday I ran some tests that reindexed from Elasticsearch nodes running on my laptop rather than the system under test. I found and fixed two issues:
With those two things solved I was able to verify that this works against the following versions:
I hope no one has to use this so far back, but, at least with the basic test that I did, it works.

I've played with this a little more locally. I indexed 10,000 documents into the same cluster and then ran a few normal reindexes and a few remote ones. The first runs were slow, taking around a second. After that the "local" reindex hovers around ~270ms and the remote one around ~420ms. I'd like to play with it some more and see where the bottleneck is. It might just be the network, http, and json overhead incurred by going remote. The "local" reindex can take advantage of lots of nice work Elasticsearch does to make requests more efficient, like local requests skipping the transport layer entirely.
```java
if (mainRequest.getSearchRequest().source().sorts() == null || mainRequest.getSearchRequest().source().sorts().isEmpty()) {
    mainRequest.getSearchRequest().source().sort(fieldSort("_doc"));
}
```
For readability, could we have

```java
List<SortBuilder<?>> sorts = mainRequest.getSearchRequest().source().sorts();
if (sorts == null || sorts.isEmpty()) {
    mainRequest.getSearchRequest().source().sort(fieldSort("_doc"));
}
```
Sure.
I had a first look and it looks good! Most of my comments are really minor and concern naming. I do like how it fits into the current reindex infra, but I'm wondering if we could reuse SearchHit/InternalSearchHit/SearchResponse/InternalSearchResponse instead of the ScrollableHitSource.Hit, Response, BasicHit, ClientHit stuff... It looks like a big part of the code declares these objects and converts them around only because of the remote reindex use case. I'd rather see remote reindex make use of the current objects, even partially. It would also help later once the built-in HTTP client is available. What do you think?
I really don't like the objects we expose though! I think they are way overcomplicated to build, and all kinds of weird internals leak out of them. I don't think the java http client will help much with this, but I'm happy to wait and see!
Most likely not for the first round, but something to consider for future iterations is parallelizing the scrolls. I often did this when manually reindexing prior to 2.3: find an incremental ID field or date field in your data and break it up evenly by filtering on that field, so that you can have four (or however many) scrolls happening in parallel.
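The manual slicing trick described above can be sketched like this. The helper name and the half-open bounds convention are hypothetical; each resulting slice would become a range filter run by its own scroll:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch, not part of the PR: split an incrementing numeric
// range into n contiguous half-open slices for parallel scrolls.
public class ScrollSlicer {
    /** Returns [lower, upper) bounds for each of n slices covering [min, max). */
    public static List<long[]> slices(long min, long max, int n) {
        List<long[]> out = new ArrayList<>();
        long span = max - min;
        for (int i = 0; i < n; i++) {
            // Integer division spreads any remainder evenly across slices.
            long lo = min + span * i / n;
            long hi = min + span * (i + 1) / n;
            out.add(new long[] { lo, hi });
        }
        return out;
    }
}
```

Each `[lo, hi)` pair would then map to a query like `{"range": {"id": {"gte": lo, "lt": hi}}}` (field name `id` is hypothetical), and the slices are contiguous and non-overlapping so every document is visited exactly once.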
```
@@ -56,11 +56,27 @@ public void testReindexRequest() throws IOException {
        randomRequest(reindex);
        reindex.getDestination().version(randomFrom(Versions.MATCH_ANY, Versions.MATCH_DELETED, 12L, 1L, 123124L, 12L));
        reindex.getDestination().index("test");
        if (true) {
```
`if (true)`?
Sorry! That is a leftover from a serialization bug I was hunting down. I'll push a fix.
Left comments, this is an exciting feature @nik9000!

Thanks for all the review @dakrone! I'll go through the comments one more time and push an update.

@dakrone Ok - I believe I've either fixed or replied to all of your comments. Thanks for slogging through this one.
How about changing this so the remote option supports …

Sure!

@dakrone, I pushed a commit that handles scheme. I'd still prefer to do the auth stuff in another PR, but this is nicer.
And I'll push an update to the docs in a second.... |
```java
        }
        String scheme = hostMatcher.group("scheme");
        if (scheme == null) {
            scheme = password == null ? "http" : "https";
```
I don't think we should force `https` if a password is used. It's unfortunate, but someone might want to use auth without encryption.
They can still get `http` if they send the host as `http://somehost` or `http://something:9200`.
We should definitely document this then, it's hidden behavior otherwise
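The defaulting rule under discussion amounts to something like the sketch below. The regex is a simplified stand-in for the PR's `hostMatcher`, and the class and method names are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the scheme-defaulting rule: an explicit scheme in
// the host string always wins; otherwise supplying a password defaults the
// scheme to "https", and no password defaults it to "http".
public class RemoteHost {
    private static final Pattern HOST = Pattern.compile(
            "(?:(?<scheme>https?)://)?(?<host>[^:/]+)(?::(?<port>\\d+))?");

    public static String scheme(String hostString, String password) {
        Matcher m = HOST.matcher(hostString);
        if (!m.matches()) {
            throw new IllegalArgumentException("bad host: " + hostString);
        }
        String scheme = m.group("scheme");
        if (scheme == null) {
            // The "hidden behavior" flagged in the review: auth implies
            // https unless the user spells out http:// explicitly.
            scheme = password == null ? "http" : "https";
        }
        return scheme;
    }
}
```

So `somehost:9200` with a password silently becomes https, while `http://somehost:9200` stays http even with a password — exactly the behavior the review asks to document.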
@dakrone I pushed another round of docs.
@nik9000 I've only looked at the docs. Instead of having a default port, scheme, etc, why not just require a FQDN? It removes all the ambiguity about what scheme to use, what port, etc. The user should just specify the node they want to connect to. Also, I'd add a note about the query and search params being passed directly to the remote cluster, so they should use the syntax accepted by that version of Elasticsearch.
I could require it every time. I'm ok either way. I picked this way because it felt the most true to your original proposal on the issue.
In defense of the way I have it working now: the vast majority of the time you'll never have to specify a scheme or a port. It'll just "do the right thing" with the scheme, and if you want to do something whacky like use https without a password or http with a password, you can make it do that without too much trouble. But I'm ambivalent. So long as the scheme, host, and port are all available I'm ok with any API.
This would be my preference
OK - I've got that implemented. The rest test infrastructure doesn't support it so I've reached out to @javanna to talk about the right way to get it into the test infrastructure. |
Okay, on the code side I think this LGTM. I played with it reindexing 40k docs from a 1.7.5 cluster running locally and it took ~35 seconds, so pretty nice!

Hurray! I'll work on getting the rest tests sane and we should be ready!
(force-pushed be28b83 to 9523cd1)
This adds a remote option to reindex that looks like

```
curl -POST 'localhost:9200/_reindex?pretty' -d'{
  "source": {
    "remote": {
      "host": "http://otherhost:9200"
    },
    "index": "target",
    "query": {
      "match": {
        "foo": "bar"
      }
    }
  },
  "dest": {
    "index": "target"
  }
}'
```

This reindex has all of the features of local reindex:
* Using queries to filter what is copied
* Retry on rejection
* Throttle/rethrottle

The big advantage of this version is that it goes over the HTTP API which can be made backwards compatible.

Some things are different: The query field is sent directly to the other node rather than parsed on the coordinating node. This should allow it to support constructs that are invalid on the coordinating node but are valid on the target node. Mostly, that means old syntax.
(force-pushed 9523cd1 to b3c015e)
Thanks for all the reviews everyone! I've just merged this after resolving some "fun" with the rest tests and docs. There are still a few …
This adds a `remote` option to reindex that looks like

```
curl -POST 'localhost:9200/_reindex?pretty' -d'{
  "source": {
    "remote": {
      "host": "http://otherhost:9200"
    },
    "index": "target",
    "query": {
      "match": {
        "foo": "bar"
      }
    }
  },
  "dest": {
    "index": "target"
  }
}'
```

This reindex has all of the features of local reindex:
* Using queries to filter what is copied
* Retry on rejection
* Throttle/rethrottle

The big advantage of this version is that it goes over the HTTP API which can be made backwards compatible. I have yet to test it against any version of Elasticsearch other than the modern one but it should be fairly easy to retrofit it to work with anything.

Some things are different:
* The `query` field is sent directly to the other node rather than parsed on the coordinating node. This should allow it to support constructs that are invalid on the coordinating node for whatever reason.

Lots of things still need to be sorted out:
* Right now you can point it at any `host`. This doesn't turn Elasticsearch into an arbitrary `curl`ing machine because reindex will only allow three commands - the one to start the scroll, the one to pull the next scroll, and the one to clear the scroll. The URLs aren't arbitrary and neither are the methods, and it only does them in that order. But, still, I'm going to have to add a whitelist.
* More?
Closes #17447
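The "three commands" constraint described in the to-do list can be made concrete with a small sketch. The exact paths and the scroll timeout below are illustrative of the scroll API rather than copied from the PR:

```java
// Illustrative sketch, not the PR's code: the only three (method, path)
// pairs a remote reindex would issue, always in this order - start the
// scroll, pull the next batch, clear the scroll when done.
public class RemoteScrollCalls {
    /** Opens the scrolled search against a fixed, caller-supplied index. */
    public static String[] start(String index) {
        return new String[] { "POST", "/" + index + "/_search?scroll=5m" };
    }

    /** Fetches the next batch; the scroll id travels in the request body. */
    public static String[] next() {
        return new String[] { "POST", "/_search/scroll" };
    }

    /** Frees server-side scroll resources once the copy is finished. */
    public static String[] clear() {
        return new String[] { "DELETE", "/_search/scroll" };
    }
}
```

Because the method and path of every call are fixed by the code, a compromised request body can't redirect Elasticsearch to an arbitrary endpoint; the remaining exposure is the host itself, which is what the whitelist is meant to close.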