Reindex from remote #18585

Merged: nik9000 merged 1 commit into elastic:master on Jul 5, 2016

Conversation

@nik9000 (Member) commented May 26, 2016:

This adds a remote option to reindex that looks like

```
curl -XPOST 'localhost:9200/_reindex?pretty' -d'{
  "source": {
    "remote": {
      "host": "otherhost:9200"
    },
    "index": "target",
    "query": {
      "match": {
        "foo": "bar"
      }
    }
  },
  "dest": {
    "index": "target"
  }
}'
```

This reindex has all of the features of local reindex:

  • Using queries to filter what is copied
  • Retry on rejection
  • Throttle/rethrottle

The big advantage of this version is that it goes over the HTTP API, which can be made backwards compatible. I have yet to test it against any version of Elasticsearch other than the modern one, but it should be fairly easy to retrofit it to work with anything.

Some things are different:

  • The query field is sent directly to the other node rather than parsed on the coordinating node. This should allow it to support constructs that are invalid on the coordinating node for whatever reason.
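
A minimal sketch of that pass-through, assuming the era's XContentBuilder.rawField(String, BytesReference) accepts the unparsed bytes; the helper class and method names here are hypothetical:

```java
import java.io.IOException;

import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.xcontent.XContentBuilder;

// Hypothetical helper illustrating the pass-through: the query bytes are
// copied verbatim into the search body sent to the remote cluster, so they
// are never parsed (and so never rejected) on the coordinating node.
public class RemoteQueryPassThrough {
    static void addQuery(XContentBuilder remoteSearchBody, BytesReference unparsedQuery) throws IOException {
        remoteSearchBody.rawField("query", unparsedQuery);
    }
}
```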

Lots of things still need to be sorted out:

  • A bunch of NOCOMMITs
  • This uses HttpClient directly and doesn't support TLS or basic auth at all. I'd like to replace it with Elasticsearch's currently-being-built HTTP-based Java client.
  • At this point there isn't a whitelist for the host (see the sketch after this list). This doesn't turn Elasticsearch into an arbitrary curling machine because reindex will only allow three commands: the one to start the scroll, the one to pull the next scroll, and the one to clear the scroll. The URLs aren't arbitrary and neither are the methods, and it only does them in that order. But, still, I'm going to have to add a whitelist.
  • I have to test it against more Elasticsearch versions.
  • We have to decide if we're going to integration test against real Elasticsearch versions or just use mocks of their returned JSON and call that good enough. It is obviously much more work to set up the integration tests and much slower to run them: 50ms per test vs 30 seconds per test kind of slower.
  • Building this involved a lot of refactoring. That refactoring needs to inform some unit test changes. Probably.
  • More?
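
Sketching the whitelist idea from the list above: a check that rejects any remote host the node administrator hasn't explicitly allowed. The setting eventually shipped as reindex.remote.whitelist; everything below is an illustrative sketch, not the shipped code.

```java
import java.util.Set;

// Illustrative sketch only: reject any remote host:port pair the node
// administrator has not explicitly allowed before reindex opens a connection.
public class RemoteHostWhitelist {
    private final Set<String> allowedHosts; // e.g. parsed from a node setting

    public RemoteHostWhitelist(Set<String> allowedHosts) {
        this.allowedHosts = allowedHosts;
    }

    public void check(String host, int port) {
        String hostAndPort = host + ":" + port;
        if (allowedHosts.contains(hostAndPort) == false) {
            throw new IllegalArgumentException(
                    "[" + hostAndPort + "] was not whitelisted for reindex from remote");
        }
    }
}
```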

Closes #17447

@nik9000 (Member Author) commented May 26, 2016:

This uses HttpClient directly and doesn't support TLS or basic auth at all. I'd like to replace it with Elasticsearch's currently-being-built HTTP-based Java client.

Along those lines:

  • I'm just slamming strings together to build the URLs to send to Elasticsearch, which is almost certainly an SQL-injection-style goldmine. My hope is that the fancy new HTTP-based Java client for Elasticsearch will have a pattern I can use for that.
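
A sketch of the safer pattern being hoped for here: percent-encode each user-supplied path segment before splicing it into the URL. The names are illustrative, and URLEncoder does form encoding (spaces become '+'), so real path-segment escaping would differ:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative sketch: encode each user-supplied path segment instead of
// concatenating raw strings. URLEncoder targets query strings, so a real
// client would want proper path-segment escaping instead.
public class RemoteSearchUrl {
    static String initialSearchPath(String index, String type) throws UnsupportedEncodingException {
        return "/" + URLEncoder.encode(index, "UTF-8")
                + "/" + URLEncoder.encode(type, "UTF-8")
                + "/_search?scroll=5m";
    }
}
```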

```java
import static org.elasticsearch.common.xcontent.ConstructingObjectParser.constructorArg;

public class RemoteScrollableHitSource extends ScrollableHitSource {
    private final CloseableHttpClient httpClient = HttpClients.createMinimal(); // NORELEASE replace me with the Elasticsearch client
```

@javanna (Member) commented May 26, 2016:

Happy to see this // NORELEASE. A PR is coming soon for the low-level HTTP Java client. It already cleans up duplication around HTTP clients in our classpath/code. Can we hold this PR so that it can use the proper Java client directly?

@nik9000 (Member Author) commented:

I suspect so. Worst case we merge this first and rip all the duplication out when we cut over, but I'd like to avoid it.

@javanna (Member) commented May 26, 2016:

Having done it already once on my branch, I'd rather not do it again. I will try to speed up my work and send the PR ASAP.

@nik9000 (Member Author) commented:

I don't understand: this stuff is relatively self-contained and doesn't rely on any of the test stuff you've already replaced. Either I replace it with the HTTP client during this PR or in a separate PR, but it won't make merging the HTTP client PR harder if this is merged first, right?

@javanna (Member) commented:

It will cause a few merge conflicts, but what I don't like is the size of this PR, where stuff taken from REST tests becomes production code by copying it (HttpDeleteWithEntity and some Response bits). Let's just wait one or two weeks and this code won't be needed. Also, the original code will get removed from the REST tests and improved.

@nik9000 (Member Author) commented:

stuff taken from REST tests becomes production code by copying it

Yeah, I wasn't particularly happy with that either, but I felt like it was OK to use as a crutch while waiting on the client.

just wait one or two weeks

I'd love to iterate on the non-stolen-from-REST-tests parts in the meantime. I certainly wouldn't be surprised if it takes a week or two to get this reviewed sensibly.

@clintongormley (Contributor) commented:

At this point there isn't a whitelist for the host. This doesn't turn Elasticsearch into an arbitrary curling machine because reindex will only allow three commands: the one to start the scroll, the one to pull the next scroll, and the one to clear the scroll. The URLs aren't arbitrary and neither are the methods, and it only does them in that order. But, still, I'm going to have to add a whitelist.

I'm not sure we need this. What are you trying to protect against by adding a whitelist?

@nik9000 (Member Author) commented May 26, 2016:

What are you trying to protect against by adding a whitelist?

It is to prevent Elasticsearch from being a jumping-off point in the case of a partially compromised network. I don't trust us to prevent a sufficiently determined individual from figuring out how to do something nasty with the HTTP calls that Elasticsearch issues on their behalf. I don't think of reindex-from-remote as being anything like dynamic scripting in terms of exploitability, but I really like the idea of limiting it to a whitelist anyway, just for extra paranoia.

@nik9000 force-pushed the reindex_from_remote branch from 5158dda to 42e092f on May 26, 2016 at 16:55
@nik9000 (Member Author) commented May 27, 2016:

Yesterday I ran some tests that reindexed from Elasticsearch nodes running on my laptop rather than the system under test. I found and fixed two issues:

  • I was checking whether the source index existed in the destination cluster even when we were pulling from remote. I never noticed because of the way I was testing.
  • Before 2.1.0 Elasticsearch doesn't support sort=_doc, which is the default sort used by reindex because it makes the search process more efficient. This required another round trip to the source cluster to check its version, and required that we handle scan's habit of not returning any results in the first round (sketched below).
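
A rough sketch of that version check, with plain version numbers standing in for Elasticsearch's Version type; the parameter strings are illustrative:

```java
// Illustrative sketch: choose scroll parameters based on the remote version
// fetched in the extra round trip. sort=_doc is only understood by 2.1.0+.
public class RemoteScrollParams {
    static String scrollParams(int major, int minor) {
        if (major > 2 || (major == 2 && minor >= 1)) {
            return "scroll=5m&sort=_doc:asc";
        }
        // Older versions fall back to scan, whose first response carries no
        // hits, so the response handling must tolerate an empty first batch.
        return "scroll=5m&search_type=scan";
    }
}
```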

With those two things solved I was able to verify that this works against the following versions:

  • master
  • 2.3.3
  • 1.7.5
  • 0.90.13
  • 0.20.5

I hope no one has to reindex from versions that far back, but, at least with the basic test that I did, it works.

@nik9000 (Member Author) commented May 30, 2016:

I've played with this a little more locally. I indexed 10,000 documents into the same cluster and then did a few normal and a few remote reindexes. The first runs were slow, like a second each. After that the "local" reindex hovers around ~270ms and the remote one around ~420ms. I'd like to play with it some more and see where the bottleneck is. It might just be the network, HTTP, and JSON overhead incurred by going remote. The "local" reindex can take advantage of lots of nice work Elasticsearch does to make requests more efficient, like local requests skipping the transport layer entirely.

```java
     */
    if (mainRequest.getSearchRequest().source().sorts() == null
            || mainRequest.getSearchRequest().source().sorts().isEmpty()) {
        mainRequest.getSearchRequest().source().sort(fieldSort("_doc"));
    }
```

(Member) commented:

For readability, could we have

```java
List<SortBuilder<?>> sorts = mainRequest.getSearchRequest().source().sorts();
if (sorts == null || sorts.isEmpty()) {
    mainRequest.getSearchRequest().source().sort(fieldSort("_doc"));
}
```

@nik9000 (Member Author) commented:

Sure.

@tlrx (Member) commented May 31, 2016:

I had a first look and it looks good! Most of my comments are really minor and concern naming.

I do like how it fits into the current reindex infra, but I'm wondering if we could reuse SearchHit/InternalSearchHit/SearchResponse/InternalSearchResponse instead of using the ScrollableHitSource.Hit, Response, BasicHit, ClientHit stuff... It looks like a big part of the code is declaring these objects and converting them around only because of the remote reindex use case. I'd rather see the remote reindex make use of the current objects, even partially. It would also help later once the built-in HTTP client is available. What do you think?

@nik9000 (Member Author) commented May 31, 2016:

I'd rather see the remote reindex make use of the current objects, even partially.

I really don't like the objects we expose, though! I think they are way overcomplicated to build. All kinds of weird internals leak out of them, like type being a Text. I think it makes the reindex testing simpler not to build them. ClientHit, with its handling of source, parent, routing, ttl, and timestamp, simplifies the rest of the code in nice ways, I think.

I don't think the Java HTTP client will help much with this, but I'm happy to wait and see!
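
As a rough sketch of the kind of minimal abstraction being defended here (the name follows the PR's ScrollableHitSource.Hit; the exact members are illustrative, not the merged interface):

```java
// Illustrative approximation of the minimal hit abstraction: just the fields
// reindex needs to recreate a document, with none of the search internals.
public interface Hit {
    String getIndex();
    String getType();
    String getId();
    long getVersion();
    String getSource();  // the raw _source, kept unparsed where possible

    // Optional metadata; null when the source document didn't set it.
    String getParent();
    String getRouting();
    Long getTTL();
    Long getTimestamp();
}
```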

@djschny (Contributor) commented Jun 5, 2016:

Most likely not for the first round, but something to consider for future iterations is parallelizing the scrolls. I would often do this when manually reindexing prior to 2.3: find an incremental ID field or date field in your data, then break it up evenly by filtering on that field so that you can have 4 or however many scrolls happening in parallel.
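
For concreteness, that manual split could be approximated with this PR's API by running several reindex requests with complementary range filters concurrently; in this sketch the incremental seq field and the index names are hypothetical:

```
curl -XPOST 'localhost:9200/_reindex?pretty' -d'{
  "source": {
    "remote": { "host": "http://otherhost:9200" },
    "index": "source",
    "query": { "range": { "seq": { "lt": 5000 } } }
  },
  "dest": { "index": "dest" }
}'
curl -XPOST 'localhost:9200/_reindex?pretty' -d'{
  "source": {
    "remote": { "host": "http://otherhost:9200" },
    "index": "source",
    "query": { "range": { "seq": { "gte": 5000 } } }
  },
  "dest": { "index": "dest" }
}'
```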

```java
@@ -56,11 +56,27 @@ public void testReindexRequest() throws IOException {
        randomRequest(reindex);
        reindex.getDestination().version(randomFrom(Versions.MATCH_ANY, Versions.MATCH_DELETED, 12L, 1L, 123124L, 12L));
        reindex.getDestination().index("test");
        if (true) {
```

(Member) commented:

if (true) ?

@nik9000 (Member Author) commented:

Sorry! That is a leftover from a serialization bug I was hunting down. I'll push a fix.

@dakrone (Member) commented Jun 28, 2016:

Left comments; this is an exciting feature, @nik9000!

@nik9000 (Member Author) commented Jun 28, 2016:

Thanks for all the review, @dakrone! I'll go through the comments one more time and push an update.

@nik9000 (Member Author) commented Jun 28, 2016:

@dakrone OK, I believe I've either fixed or replied to all of your comments. Thanks for slogging through this one.

@dakrone (Member) commented Jun 28, 2016:

I'd like to wait and implement anything but non-authorized http in another PR.

How about changing this now so that the remote option supports a scheme, which can only be http for the moment? Right now, if someone passes http://localhost:9200 as the URL (and I imagine a lot of people will use that format), they'll get an error like "[host] can either be of the form [host] or [host:port] but was [http]", which is a confusing message.

@nik9000 (Member Author) commented Jun 29, 2016:

How about changing this now so that the remote option supports a scheme, which can only be http for the moment? Right now, if someone passes http://localhost:9200 as the URL (and I imagine a lot of people will use that format), they'll get an error like "[host] can either be of the form [host] or [host:port] but was [http]", which is a confusing message.

Sure!

@nik9000 (Member Author) commented Jun 29, 2016:

@dakrone, I pushed a commit that handles scheme. I'd still prefer to do the auth stuff in another PR, but this is nicer.

@nik9000 (Member Author) commented Jun 29, 2016:

And I'll push an update to the docs in a second....

```java
        }
        String scheme = hostMatcher.group("scheme");
        if (scheme == null) {
            scheme = password == null ? "http" : "https";
```

(Member) commented:

I don't think we should force https if a password is used. It's unfortunate, but someone might want to use auth without encryption.

@nik9000 (Member Author) commented:

They can still get http if they send the host as http://somehost or http://something:9200.

(Member) commented:

We should definitely document this then; it's hidden behavior otherwise.
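
To make the behavior under discussion concrete, here is an illustrative sketch of the parsing (the pattern, the port default, and the names are approximations rather than the merged code; the thread below ends up requiring the full http://host:port form instead):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the parsing under discussion: an optional scheme,
// a host, and an optional port, with the scheme default driven by whether
// a password was supplied.
public class RemoteHostParsing {
    private static final Pattern HOST_PATTERN =
            Pattern.compile("(?:(?<scheme>https?)://)?(?<host>[^:/]+)(?::(?<port>\\d+))?");

    static String parse(String hostField, String password) {
        Matcher m = HOST_PATTERN.matcher(hostField);
        if (m.matches() == false) {
            throw new IllegalArgumentException(
                    "[host] can either be of the form [host] or [host:port] but was [" + hostField + "]");
        }
        String scheme = m.group("scheme");
        if (scheme == null) {
            scheme = password == null ? "http" : "https"; // the default debated above
        }
        String port = m.group("port") == null ? "9200" : m.group("port");
        return scheme + "://" + m.group("host") + ":" + port;
    }
}
```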

@nik9000 (Member Author) commented Jun 29, 2016:

@dakrone I pushed another round of docs.

@clintongormley (Contributor) commented:

@nik9000 I've only looked at the docs. Instead of having a default port, scheme, etc., why not just require an FQDN? It removes all the ambiguity about what scheme to use, what port, etc. The user should just specify the node they want to connect to.

Also, I'd add a note about the query and search params being passed directly to the remote cluster, so they should use the syntax accepted by that version of Elasticsearch.

@nik9000 (Member Author) commented Jun 30, 2016:

Instead of having a default port, scheme, etc., why not just require an FQDN?

I could require http://otherhost:9200 every time. I could also require the following every time:

```
{
  "remote": {
    "host": "otherhost",
    "scheme": "http",
    "port": 9200
  }
}
```

I'm OK either way. I picked this way because it felt the most true to your original proposal on the issue.

@nik9000 (Member Author) commented Jun 30, 2016:

I picked this way because it felt the most true to your original proposal on the issue.

In defense of the way I have it working now: the vast majority of the time you'll never have to specify a scheme or a port. It'll just "do the right thing" with the scheme, and if you want to do something wacky like use https without a password or http with a password, you can make it do so without too much trouble.

But I'm ambivalent. So long as the scheme, host, and port are all available, I'm OK with any API.

@clintongormley (Contributor) commented:

I could require http://otherhost:9200 every time.

This would be my preference.

@nik9000 (Member Author) commented Jun 30, 2016:

OK, I've got that implemented. The REST test infrastructure doesn't support it, so I've reached out to @javanna to talk about the right way to get it into the test infrastructure.

@dakrone (Member) commented Jun 30, 2016:

Okay, on the code side I think this LGTM. I played with it reindexing 40k docs from a 1.7.5 cluster running locally and it took ~35 seconds, so pretty nice!

@nik9000 (Member Author) commented Jun 30, 2016:

Okay, on the code side I think this LGTM. I played with it reindexing 40k docs from a 1.7.5 cluster running locally and it took ~35 seconds, so pretty nice!

Hurray! I'll work on getting the REST tests sane and we should be ready!

@nik9000 force-pushed the reindex_from_remote branch 4 times, most recently from be28b83 to 9523cd1 on July 5, 2016 at 20:09
This adds a remote option to reindex that looks like

```
curl -XPOST 'localhost:9200/_reindex?pretty' -d'{
  "source": {
    "remote": {
      "host": "http://otherhost:9200"
    },
    "index": "target",
    "query": {
      "match": {
        "foo": "bar"
      }
    }
  },
  "dest": {
    "index": "target"
  }
}'
```

This reindex has all of the features of local reindex:
* Using queries to filter what is copied
* Retry on rejection
* Throttle/rethrottle

The big advantage of this version is that it goes over the HTTP API,
which can be made backwards compatible.

Some things are different:

The query field is sent directly to the other node rather than parsed
on the coordinating node. This should allow it to support constructs
that are invalid on the coordinating node but are valid on the target
node. Mostly, that means old syntax.
@nik9000 force-pushed the reindex_from_remote branch from 9523cd1 to b3c015e on July 5, 2016 at 20:14
@nik9000 merged commit b3c015e into elastic:master on Jul 5, 2016
@nik9000 (Member Author) commented Jul 5, 2016:

Thanks for all the reviews, everyone! I've just merged this after resolving some "fun" with the REST tests and docs.

There are still a few // NORELEASEs left and I'll start work on them soon.

@lcawl added the :Distributed Indexing/CRUD label and removed the :Reindex API label on Feb 13, 2018
Labels: :Distributed Indexing/CRUD, >feature, release highlight, v5.0.0-alpha5