Delete by query should not silently refresh index #3593
Comments
Thanks for writing this down! Patches welcome :)
Any update on this issue? We are facing a similar problem.
Depends on #7052
DBQ is moved to a plugin in ES 2.0.
s1monw added a commit to s1monw/elasticsearch that referenced this issue on Oct 11, 2017:
…er to disk" Today, when ES detects it's using too much heap vs the configured indexing buffer (default 10% of JVM heap) it opens a new searcher to force Lucene to move the bytes to disk, clear version map, etc. But this has the unexpected side effect of making newly indexed/deleted documents visible to future searches, which is not nice for users who are trying to prevent that, e.g. elastic#3593. This is also an indirect spinoff from elastic#26802 where we potentially pay a big price on rebuilding caches etc. when updates / realtime-get is used. We are refreshing the internal reader for realtime gets which causes for instance global ords to be rebuild. I think we can gain quite a bit if we'd use a reader that is only used for GETs and not for searches etc. that way we can also solve problems of searchers being refreshed unexpectedly aside of replica recovery / relocation. Closes elastic#15768 Closes elastic#26912
s1monw added a commit that referenced this issue on Oct 12, 2017:
…er to disk (#26972) Today, when ES detects it's using too much heap vs the configured indexing buffer (default 10% of JVM heap) it opens a new searcher to force Lucene to move the bytes to disk, clear version map, etc. But this has the unexpected side effect of making newly indexed/deleted documents visible to future searches, which is not nice for users who are trying to prevent that, e.g. #3593. This is also an indirect spinoff from #26802 where we potentially pay a big price on rebuilding caches etc. when updates / realtime-get is used. We are refreshing the internal reader for realtime gets which causes for instance global ords to be rebuild. I think we can gain quite a bit if we'd use a reader that is only used for GETs and not for searches etc. that way we can also solve problems of searchers being refreshed unexpectedly aside of replica recovery / relocation. Closes #15768 Closes #26912
s1monw added a commit that referenced this issue on Oct 13, 2017, with the same message.
Hi, this issue caused a lot of trouble for us because it was not clear why it happened. I had some index updates that use a fairly common approach:
I have to update a bulk of documents that share some higher-level group key (not the _uid). The code that updates this group of documents does not know the real _id values of the documents already in the index (it only knows that the whole group changes), so it first deletes all documents with deleteByQuery on the group key, and then reindexes all documents in the group (with possibly different new _id values).
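This delete-then-reindex flow can be sketched by building the two request bodies involved. A minimal sketch, assuming a hypothetical `group_key` field and index name (both illustrative, not taken from the issue):

```python
import json

def delete_by_query_body(group_key):
    # Query body for delete-by-query: remove every document in the group.
    return {"query": {"term": {"group_key": group_key}}}

def bulk_reindex_payload(index, docs):
    # NDJSON bulk payload that reindexes the group's documents,
    # possibly under new _id values. `docs` is a list of (doc_id, source).
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"
```

The point of the sketch is the ordering: the delete-by-query body is sent first to clear the group, then the bulk payload reindexes it, which is exactly the window in which an unexpected refresh makes the group vanish from searches.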
If you don't disable index refreshing, the whole group would disappear and then reappear for a short time. So to make the whole group reindex "atomic", you disable index refreshing before the operation and re-enable it afterwards (or rely on manual refreshing only, which is what I do for this index anyway).
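The disable/re-enable dance can be wrapped in a small helper. A sketch, assuming `put_settings` stands in for whatever client call updates index settings (setting `refresh_interval` to `-1` disables automatic refresh in Elasticsearch):

```python
from contextlib import contextmanager

@contextmanager
def refresh_disabled(put_settings, restore_interval="1s"):
    # Disable automatic refresh for the duration of a reindex,
    # then restore it even if the reindex fails.
    put_settings({"index": {"refresh_interval": "-1"}})
    try:
        yield
    finally:
        put_settings({"index": {"refresh_interval": restore_interval}})
```

The `finally` clause matters: if the bulk reindex raises, the index is not left with refresh permanently disabled.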
Unfortunately, deleteByQuery forcefully refreshes the index, which is hard to understand because it is not documented. There is just a comment in the code saying that the refresh is needed, although it is heavy, because when Elasticsearch executes a Lucene IndexWriter deleteByQuery it does not know which documents were actually deleted, so all of its internal tracking breaks (it cannot update version consistency, ...).
I was discussing this with Martijn on IRC (not even he was aware that deleteByQuery does not work with refreshing disabled). He suggested that the query could instead be executed inside Elasticsearch itself, which would then issue a bulk of _uid deletes (this is also one possible workaround in our case, if the number of deletes is small).
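That workaround (run the query first, then delete the matching documents individually by id) can be sketched as follows. `search` and `bulk_delete` are placeholders for the real client calls, not actual Elasticsearch API signatures:

```python
def delete_group_by_id(search, bulk_delete, index, group_key):
    # Instead of delete-by-query (which forces a refresh), fetch the
    # matching document ids with a normal query, then delete them by id
    # in a bulk request. Individual deletes by id do not need a refresh.
    hits = search(index, {"term": {"group_key": group_key}})
    ids = [h["_id"] for h in hits]
    if ids:
        bulk_delete(index, ids)
    return ids
```

As the comment notes, this only stays cheap when the number of matching documents is small; for large groups the id list itself becomes the bottleneck.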
In my opinion, the better approach would be to do it like Apache Solr does: Solr keeps two different IndexReaders open. One is used for searching the index (this is the one that gets refreshed periodically); the second is an NRT reader on the IndexWriter that is used to update internal data structures after the IndexWriter has written something. So the ES-internal bookkeeping should be done with a separate NRT reader, not the one used for searching.
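The two-reader idea can be illustrated with a toy model (this is not real Elasticsearch or Solr code, just a sketch of the visibility semantics being proposed): writes are always visible to the internal NRT reader used for bookkeeping and realtime gets, while the search reader only advances on an explicit refresh.

```python
class ToyEngine:
    # Toy model of keeping two reader views over one writer:
    # an internal "NRT" view refreshed on every write, and a
    # search view refreshed only by an explicit refresh().
    def __init__(self):
        self._writer = {}    # current writer state (simplified to a dict)
        self._internal = {}  # internal reader: always tracks the writer
        self._search = {}    # search reader: stale until refresh()

    def index(self, doc_id, doc):
        self._writer[doc_id] = doc
        self._internal = dict(self._writer)  # bookkeeping sees the write

    def delete(self, doc_id):
        self._writer.pop(doc_id, None)
        self._internal = dict(self._writer)

    def get(self, doc_id):
        # Realtime get goes through the internal reader.
        return self._internal.get(doc_id)

    def search(self, doc_id):
        # Searches only see what the last refresh exposed.
        return self._search.get(doc_id)

    def refresh(self):
        self._search = dict(self._writer)
```

With this split, a delete-by-query could update its internal tracking via the internal view without ever touching the search view, which is the behavior the comment is asking for.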