Reindex from remote #18585

Merged: nik9000 merged 1 commit into elastic:master on Jul 5, 2016

Conversation

@nik9000 (Member) commented May 26, 2016:

This adds a remote option to reindex that looks like

```
curl -XPOST 'localhost:9200/_reindex?pretty' -d'{
  "source": {
    "remote": {
      "host": "otherhost:9200"
    },
    "index": "target",
    "query": {
      "match": {
        "foo": "bar"
      }
    }
  },
  "dest": {
    "index": "target"
  }
}'
```

This reindex has all of the features of local reindex:

  • Using queries to filter what is copied
  • Retry on rejection
  • Throttle/rethrottle

The big advantage of this version is that it goes over the HTTP API, which can be made backwards compatible. I have yet to test it against any version of Elasticsearch other than the modern one, but it should be fairly easy to retrofit it to work with anything.

Some things are different:

  • The query field is sent directly to the other node rather than parsed on the coordinating node. This should allow it to support constructs that are invalid on the coordinating node for whatever reason.
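
A minimal sketch of that pass-through, assuming the era's XContentBuilder.rawField(String, BytesReference) accepts the unparsed bytes; the helper class and method names here are hypothetical:

```java
import java.io.IOException;

import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.xcontent.XContentBuilder;

// Hypothetical helper illustrating the pass-through: the query bytes are
// copied verbatim into the search body sent to the remote cluster, so they
// are never parsed (and so never rejected) on the coordinating node.
public class RemoteQueryPassThrough {
    static void addQuery(XContentBuilder remoteSearchBody, BytesReference unparsedQuery) throws IOException {
        remoteSearchBody.rawField("query", unparsedQuery);
    }
}
```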

Lots of things still need to be sorted out:

  • A bunch of NOCOMMITs
  • This uses HttpClient directly and doesn't support TLS or basic auth at all. I'd like to replace it with Elasticsearch's currently-being-built HTTP-based Java client.
  • At this point there isn't a whitelist for the host (see the sketch after this list). This doesn't turn Elasticsearch into an arbitrary curling machine because reindex will only allow three commands: the one to start the scroll, the one to pull the next scroll, and the one to clear the scroll. The URLs aren't arbitrary and neither are the methods, and it only does them in that order. But, still, I'm going to have to add a whitelist.
  • I have to test it against more Elasticsearch versions.
  • We have to decide if we're going to integration test against real Elasticsearch versions or just use mocks of their returned JSON and call that good enough. It is obviously much more work to set up the integration tests and much slower to run them: 50ms per test vs 30 seconds per test kind of slower.
  • Building this involved a lot of refactoring. That refactoring needs to inform some unit test changes. Probably.
  • More?
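
Sketching the whitelist idea from the list above: a check that rejects any remote host the node administrator hasn't explicitly allowed. The setting eventually shipped as reindex.remote.whitelist; everything below is an illustrative sketch, not the shipped code.

```java
import java.util.Set;

// Illustrative sketch only: reject any remote host:port pair the node
// administrator has not explicitly allowed before reindex opens a connection.
public class RemoteHostWhitelist {
    private final Set<String> allowedHosts; // e.g. parsed from a node setting

    public RemoteHostWhitelist(Set<String> allowedHosts) {
        this.allowedHosts = allowedHosts;
    }

    public void check(String host, int port) {
        String hostAndPort = host + ":" + port;
        if (allowedHosts.contains(hostAndPort) == false) {
            throw new IllegalArgumentException(
                    "[" + hostAndPort + "] was not whitelisted for reindex from remote");
        }
    }
}
```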

Closes #17447

@nik9000 (Member Author) commented May 26, 2016:

This uses HttpClient directly and doesn't support TLS or basic auth at all. I'd like to replace it with Elasticsearch's currently-being-built HTTP-based Java client.

Along those lines:

  • I'm just slamming strings together to build the URLs to send to Elasticsearch, which is almost certainly an SQL-injection-style goldmine. My hope is that the fancy new HTTP-based Java client for Elasticsearch will have a pattern I can use for that.
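
A sketch of the safer pattern being hoped for here: percent-encode each user-supplied path segment before splicing it into the URL. The names are illustrative, and URLEncoder does form encoding (spaces become '+'), so real path-segment escaping would differ:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative sketch: encode each user-supplied path segment instead of
// concatenating raw strings. URLEncoder targets query strings, so a real
// client would want proper path-segment escaping instead.
public class RemoteSearchUrl {
    static String initialSearchPath(String index, String type) throws UnsupportedEncodingException {
        return "/" + URLEncoder.encode(index, "UTF-8")
                + "/" + URLEncoder.encode(type, "UTF-8")
                + "/_search?scroll=5m";
    }
}
```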

```java
import static org.elasticsearch.common.xcontent.ConstructingObjectParser.constructorArg;

public class RemoteScrollableHitSource extends ScrollableHitSource {
    private final CloseableHttpClient httpClient = HttpClients.createMinimal(); // NORELEASE replace me with the Elasticsearch client
```

@javanna (Member) commented May 26, 2016:

Happy to see this // NORELEASE. A PR is coming soon for the low-level HTTP Java client. It already cleans up duplication around HTTP clients in our classpath/code. Can we hold this PR so that it can use the proper Java client directly?

@nik9000 (Member Author) commented:

I suspect so. Worst case we merge this first and rip all the duplication out when we cut over, but I'd like to avoid it.

@javanna (Member) commented May 26, 2016:

Having done it already once on my branch, I'd rather not do it again. I will try to speed up my work and send the PR ASAP.

@nik9000 (Member Author) commented:

I don't understand: this stuff is relatively self-contained and doesn't rely on any of the test stuff you've already replaced. Either I replace it with the HTTP client during this PR or in a separate PR, but it won't make merging the HTTP client PR harder if this is merged first, right?

@javanna (Member) commented:

It will cause a few merge conflicts, but what I don't like is the size of this PR, where stuff taken from REST tests becomes production code by copying it (HttpDeleteWithEntity and some Response bits). Let's just wait one or two weeks and this code won't be needed. Also, the original code will get removed from the REST tests and improved.

@nik9000 (Member Author) commented:

stuff taken from REST tests becomes production code by copying it

Yeah, I wasn't particularly happy with that either, but I felt like it was OK to use as a crutch while waiting on the client.

just wait one or two weeks

I'd love to iterate on the non-stolen-from-REST-tests parts in the meantime. I certainly wouldn't be surprised if it takes a week or two to get this reviewed sensibly.

@clintongormley (Contributor) commented:

At this point there isn't a whitelist for the host. This doesn't turn Elasticsearch into an arbitrary curling machine because reindex will only allow three commands: the one to start the scroll, the one to pull the next scroll, and the one to clear the scroll. The URLs aren't arbitrary and neither are the methods, and it only does them in that order. But, still, I'm going to have to add a whitelist.

I'm not sure we need this. What are you trying to protect against by adding a whitelist?

@nik9000 (Member Author) commented May 26, 2016:

What are you trying to protect against by adding a whitelist?

It is to prevent Elasticsearch from being a jumping-off point in the case of a partially compromised network. I don't trust us to prevent a sufficiently determined individual from figuring out how to do something nasty with the HTTP calls that Elasticsearch issues on their behalf. I don't think of reindex-from-remote as being anything like dynamic scripting in terms of exploitability, but I really like the idea of limiting it to a whitelist anyway, just for extra paranoia.

@nik9000 force-pushed the reindex_from_remote branch from 5158dda to 42e092f on May 26, 2016 at 16:55
@nik9000 (Member Author) commented May 27, 2016:

Yesterday I ran some tests that reindexed from Elasticsearch nodes running on my laptop rather than the system under test. I found and fixed two issues:

  • I was checking whether the source index existed in the destination cluster even when we were pulling from remote. I never noticed because of the way I was testing.
  • Before 2.1.0 Elasticsearch doesn't support sort=_doc, which is the default sort used by reindex because it makes the search process more efficient. This required another round trip to the source cluster to check its version, and required that we handle scan's habit of not returning any results in the first round (sketched below).
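
A rough sketch of that version check, with plain version numbers standing in for Elasticsearch's Version type; the parameter strings are illustrative:

```java
// Illustrative sketch: choose scroll parameters based on the remote version
// fetched in the extra round trip. sort=_doc is only understood by 2.1.0+.
public class RemoteScrollParams {
    static String scrollParams(int major, int minor) {
        if (major > 2 || (major == 2 && minor >= 1)) {
            return "scroll=5m&sort=_doc:asc";
        }
        // Older versions fall back to scan, whose first response carries no
        // hits, so the response handling must tolerate an empty first batch.
        return "scroll=5m&search_type=scan";
    }
}
```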

With those two things solved I was able to verify that this works against the following versions:

  • master
  • 2.3.3
  • 1.7.5
  • 0.90.13
  • 0.20.5

I hope no one has to reindex from versions that far back, but, at least with the basic test that I did, it works.

@nik9000 (Member Author) commented May 30, 2016:

I've played with this a little more locally. I indexed 10,000 documents into the same cluster and then did a few normal and a few remote reindexes. The first runs were slow, like a second each. After that the "local" reindex hovers around ~270ms and the remote one around ~420ms. I'd like to play with it some more and see where the bottleneck is. It might just be the network, HTTP, and JSON overhead incurred by going remote. The "local" reindex can take advantage of lots of nice work Elasticsearch does to make requests more efficient, like local requests skipping the transport layer entirely.

```java
     */
    if (mainRequest.getSearchRequest().source().sorts() == null
            || mainRequest.getSearchRequest().source().sorts().isEmpty()) {
        mainRequest.getSearchRequest().source().sort(fieldSort("_doc"));
    }
```

(Member) commented:

For readability, could we have

```java
List<SortBuilder<?>> sorts = mainRequest.getSearchRequest().source().sorts();
if (sorts == null || sorts.isEmpty()) {
    mainRequest.getSearchRequest().source().sort(fieldSort("_doc"));
}
```

@nik9000 (Member Author) commented:

Sure.

@tlrx (Member) commented May 31, 2016:

I had a first look and it looks good! Most of my comments are really minor and concern naming.

I do like how it fits into the current reindex infra, but I'm wondering if we could reuse SearchHit/InternalSearchHit/SearchResponse/InternalSearchResponse instead of using the ScrollableHitSource.Hit, Response, BasicHit, ClientHit stuff... It looks like a big part of the code is declaring these objects and converting them around only because of the remote reindex use case. I'd rather see the remote reindex make use of the current objects, even partially. It would also help later once the built-in HTTP client is available. What do you think?

@nik9000 (Member Author) commented May 31, 2016:

I'd rather see the remote reindex make use of the current objects, even partially.

I really don't like the objects we expose, though! I think they are way overcomplicated to build. All kinds of weird internals leak out of them, like type being a Text. I think it makes the reindex testing simpler not to build them. ClientHit, with its handling of source, parent, routing, ttl, and timestamp, simplifies the rest of the code in nice ways, I think.

I don't think the Java HTTP client will help much with this, but I'm happy to wait and see!
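
As a rough sketch of the kind of minimal abstraction being defended here (the name follows the PR's ScrollableHitSource.Hit; the exact members are illustrative, not the merged interface):

```java
// Illustrative approximation of the minimal hit abstraction: just the fields
// reindex needs to recreate a document, with none of the search internals.
public interface Hit {
    String getIndex();
    String getType();
    String getId();
    long getVersion();
    String getSource();  // the raw _source, kept unparsed where possible

    // Optional metadata; null when the source document didn't set it.
    String getParent();
    String getRouting();
    Long getTTL();
    Long getTimestamp();
}
```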

@djschny (Contributor) commented Jun 5, 2016:

Most likely not for the first round, but something to consider for future iterations is parallelizing the scrolls. I would often do this when manually reindexing prior to 2.3: find an incremental ID field or date field in your data, then break it up evenly by filtering on that field so that you can have 4 or however many scrolls happening in parallel.
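
For concreteness, that manual split could be approximated with this PR's API by running several reindex requests with complementary range filters concurrently; in this sketch the incremental seq field and the index names are hypothetical:

```
curl -XPOST 'localhost:9200/_reindex?pretty' -d'{
  "source": {
    "remote": { "host": "http://otherhost:9200" },
    "index": "source",
    "query": { "range": { "seq": { "lt": 5000 } } }
  },
  "dest": { "index": "dest" }
}'
curl -XPOST 'localhost:9200/_reindex?pretty' -d'{
  "source": {
    "remote": { "host": "http://otherhost:9200" },
    "index": "source",
    "query": { "range": { "seq": { "gte": 5000 } } }
  },
  "dest": { "index": "dest" }
}'
```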

```java
@@ -56,11 +56,27 @@ public void testReindexRequest() throws IOException {
        randomRequest(reindex);
        reindex.getDestination().version(randomFrom(Versions.MATCH_ANY, Versions.MATCH_DELETED, 12L, 1L, 123124L, 12L));
        reindex.getDestination().index("test");
        if (true) {
```

(Member) commented:

if (true) ?

@nik9000 (Member Author) commented:

Sorry! That is a leftover from a serialization bug I was hunting down. I'll push a fix.

@dakrone (Member) commented Jun 28, 2016:

Left comments; this is an exciting feature, @nik9000!

@nik9000 (Member Author) commented Jun 28, 2016:

Thanks for all the review, @dakrone! I'll go through the comments one more time and push an update.

@nik9000 (Member Author) commented Jun 28, 2016:

@dakrone OK, I believe I've either fixed or replied to all of your comments. Thanks for slogging through this one.

@dakrone (Member) commented Jun 28, 2016:

I'd like to wait and implement anything but non-authorized http in another PR.

How about changing this now so that the remote option supports a scheme, which can only be http for the moment? Right now, if someone passes http://localhost:9200 as the URL (and I imagine a lot of people will use that format), they'll get an error like "[host] can either be of the form [host] or [host:port] but was [http]", which is a confusing message.

@nik9000 (Member Author) commented Jun 29, 2016:

How about changing this now so that the remote option supports a scheme, which can only be http for the moment? Right now, if someone passes http://localhost:9200 as the URL (and I imagine a lot of people will use that format), they'll get an error like "[host] can either be of the form [host] or [host:port] but was [http]", which is a confusing message.

Sure!

@nik9000 (Member Author) commented Jun 29, 2016:

@dakrone, I pushed a commit that handles scheme. I'd still prefer to do the auth stuff in another PR, but this is nicer.

@nik9000 (Member Author) commented Jun 29, 2016:

And I'll push an update to the docs in a second....

```java
        }
        String scheme = hostMatcher.group("scheme");
        if (scheme == null) {
            scheme = password == null ? "http" : "https";
```

(Member) commented:

I don't think we should force https if a password is used. It's unfortunate, but someone might want to use auth without encryption.

@nik9000 (Member Author) commented:

They can still get http if they send the host as http://somehost or http://something:9200.

(Member) commented:

We should definitely document this then; it's hidden behavior otherwise.
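
To make the behavior under discussion concrete, here is an illustrative sketch of the parsing (the pattern, the port default, and the names are approximations rather than the merged code; the thread below ends up requiring the full http://host:port form instead):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the parsing under discussion: an optional scheme,
// a host, and an optional port, with the scheme default driven by whether
// a password was supplied.
public class RemoteHostParsing {
    private static final Pattern HOST_PATTERN =
            Pattern.compile("(?:(?<scheme>https?)://)?(?<host>[^:/]+)(?::(?<port>\\d+))?");

    static String parse(String hostField, String password) {
        Matcher m = HOST_PATTERN.matcher(hostField);
        if (m.matches() == false) {
            throw new IllegalArgumentException(
                    "[host] can either be of the form [host] or [host:port] but was [" + hostField + "]");
        }
        String scheme = m.group("scheme");
        if (scheme == null) {
            scheme = password == null ? "http" : "https"; // the default debated above
        }
        String port = m.group("port") == null ? "9200" : m.group("port");
        return scheme + "://" + m.group("host") + ":" + port;
    }
}
```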

@nik9000 (Member Author) commented Jun 29, 2016:

@dakrone I pushed another round of docs.

@clintongormley (Contributor) commented:

@nik9000 I've only looked at the docs. Instead of having a default port, scheme, etc., why not just require an FQDN? It removes all the ambiguity about what scheme to use, what port, etc. The user should just specify the node they want to connect to.

Also, I'd add a note about the query and search params being passed directly to the remote cluster, so they should use the syntax accepted by that version of Elasticsearch.

@nik9000 (Member Author) commented Jun 30, 2016:

Instead of having a default port, scheme, etc., why not just require an FQDN?

I could require http://otherhost:9200 every time. I could also require the following every time:

```
{
  "remote": {
    "host": "otherhost",
    "scheme": "http",
    "port": 9200
  }
}
```

I'm OK either way. I picked this way because it felt the most true to your original proposal on the issue.

@nik9000 (Member Author) commented Jun 30, 2016:

I picked this way because it felt the most true to your original proposal on the issue.

In defense of the way I have it working now: the vast majority of the time you'll never have to specify a scheme or a port. It'll just "do the right thing" with the scheme, and if you want to do something wacky like use https without a password or http with a password, you can make it do so without too much trouble.

But I'm ambivalent. So long as the scheme, host, and port are all available, I'm OK with any API.

@clintongormley (Contributor) commented:

I could require http://otherhost:9200 every time.

This would be my preference.

@nik9000 (Member Author) commented Jun 30, 2016:

OK, I've got that implemented. The REST test infrastructure doesn't support it, so I've reached out to @javanna to talk about the right way to get it into the test infrastructure.

@dakrone (Member) commented Jun 30, 2016:

Okay, on the code side I think this LGTM. I played with it reindexing 40k docs from a 1.7.5 cluster running locally and it took ~35 seconds, so pretty nice!

@nik9000 (Member Author) commented Jun 30, 2016:

Okay, on the code side I think this LGTM. I played with it reindexing 40k docs from a 1.7.5 cluster running locally and it took ~35 seconds, so pretty nice!

Hurray! I'll work on getting the REST tests sane and we should be ready!

@nik9000 force-pushed the reindex_from_remote branch 4 times, most recently from be28b83 to 9523cd1 on July 5, 2016 at 20:09
This adds a remote option to reindex that looks like

```
curl -XPOST 'localhost:9200/_reindex?pretty' -d'{
  "source": {
    "remote": {
      "host": "http://otherhost:9200"
    },
    "index": "target",
    "query": {
      "match": {
        "foo": "bar"
      }
    }
  },
  "dest": {
    "index": "target"
  }
}'
```

This reindex has all of the features of local reindex:
* Using queries to filter what is copied
* Retry on rejection
* Throttle/rethrottle

The big advantage of this version is that it goes over the HTTP API,
which can be made backwards compatible.

Some things are different:

The query field is sent directly to the other node rather than parsed
on the coordinating node. This should allow it to support constructs
that are invalid on the coordinating node but are valid on the target
node. Mostly, that means old syntax.
@nik9000 force-pushed the reindex_from_remote branch from 9523cd1 to b3c015e on July 5, 2016 at 20:14
@nik9000 merged commit b3c015e into elastic:master on Jul 5, 2016
@nik9000 (Member Author) commented Jul 5, 2016:

Thanks for all the reviews, everyone! I've just merged this after resolving some "fun" with the REST tests and docs.

There are still a few // NORELEASEs left and I'll start work on them soon.

@lcawl added the :Distributed Indexing/CRUD label and removed the :Reindex API label on Feb 13, 2018
Labels: :Distributed Indexing/CRUD, >feature, release highlight, v5.0.0-alpha5