ES 5.4.3 unexpectedly removes indexes #26669
Do you have any logs from this attempt/timeout/failure? Stacktraces would be great to see where it timed out.
This sounds suspiciously like something like ... Any other relevant logs would be much appreciated; there should be logs at least of which indices were deleted and when.
Lee,
Aside from one log entry reporting a failed forcemerge (with a bunch of indices disappearing right after it), I have found no logs with anything interesting that correlates with the disappearance of the indexes. I _think_ the first time I observed this issue (the silent index deletion) was on Aug 23rd; filesystem timestamps indicate the directories were removed around 7:40am that morning. Here is a link to a tarball of the logs from the past 30 days from the warm node (where the forcemerge failed, and where the filesystem problems were happening): https://mesa5.coloradomesa.edu/~dan/0f48740bc7251ae2c2a661a3f928f05760c1dc31.tar.gz
I have yet to see any Elasticsearch log file mention the removal of an index. If you're curious or interested, here are the logs from one of the cold ('crypt') nodes for the past 30 days. This node would have held shards from at least 10 of the indices that disappeared on the day that most of the indexes went missing: https://mesa5.coloradomesa.edu/~dan/10d7714affb5cc21747601320fb15cbfb49a0b1a.tar.gz
I wondered whether curator might be the issue as well, but I'm 99% certain it isn't: the index disappearances didn't stop until I fixed the filesystem problem on the warm node (they were happening daily for about two weeks). Also, for the days that I checked closely, the curator delete did not correlate with the time the indices disappeared: the curator delete runs in the evening (~7pm), but the indexes always disappeared around 12 hours later, give or take 1-3 hours. I'm the only person who manages this system, so it wouldn't be anyone else running curator or deleting indices, and the firewall rules are restrictive enough that only a few people can even access the system. FWIW: the curator program is run from a master node, not the warm node.
While the forcemerge failure might have triggered the initial deletion of the indices, I stopped running that process (which is done via curator_cli) a few days later and the index disappearances continued. In the daily script that deletes old indexes and forcemerges, the forcemerge is one of the last steps; the only other things that might have run afterwards were a couple of index routing changes. Before I commented out the forcemerge entirely, it was the final step in that script.
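For reference, a curator_cli forcemerge step of that kind looks roughly like the following (a sketch only; the segment count and filter values here are illustrative assumptions, not my exact command):
curator_cli forcemerge --max_num_segments 1 --ignore_empty_list --filter_list '[
{"filtertype":"pattern","kind":"prefix","value":"logstash-"},
{"filtertype":"age","source":"name","direction":"older","timestring":"%Y.%m.%d","unit":"days","unit_count":2}]'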
Here are the relevant lines from the script for the curator deletes:
DAYSTOKEEP=29
echo "Deleting indicies older than ${DAYSTOKEEP} days... (`date`)"
curator_cli delete_indices --ignore_empty_list --filter_list '[
{"filtertype":"pattern","kind":"prefix","value":"logstash-"},
{"filtertype":"age", "source":"name", "direction":"older", "timestring":"%Y.%m.%d", "unit":"days", "unit_count":'${DAYSTOKEEP}'},
{"filtertype":"none"}]'
The current version of curator_cli is 5.2.0. When the index disappearances first happened I was running Elasticsearch 2.4.1 and an older version of curator_cli; I upgraded ES to 5.4.3 on Sep 6th (and to curator_cli 5.2.0 a few days later). The upgrade was prompted by the forcemerge failures I was seeing in the ES logfile, since I thought the forcemerge errors and the index disappearances might be related. I didn't realize the warm node's filesystem itself was the problem until later (a full fsck -f run on the volume came back clean); only when I saw drive timeout messages in 'dmesg' did I get to the bottom of the filesystem issue.
I realize this is a corner case that Elasticsearch isn't handling well, but the way it reacts to the issue is serious enough to warrant the bug report.
Thanks,
- Daniel
Okay, egg on my face on this issue: my test environment had a cron script to trim indices, and after an IP shuffle it was sending commands to the production system instead of to itself.
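One cheap safeguard against that kind of mix-up (a hypothetical sketch, not part of the original cron script) is to have the script verify the cluster name before issuing any deletes; EXPECTED_CLUSTER here is an assumed placeholder:
# Abort if the node we are talking to is not the expected (test) cluster.
EXPECTED_CLUSTER="es-test"
ACTUAL_CLUSTER=$(curl -s 'http://localhost:9200/' | grep -o '"cluster_name" *: *"[^"]*"' | cut -d'"' -f4)
if [ "${ACTUAL_CLUSTER}" != "${EXPECTED_CLUSTER}" ]; then
  echo "Refusing to run: connected to cluster '${ACTUAL_CLUSTER}', expected '${EXPECTED_CLUSTER}'" >&2
  exit 1
fi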
Sorry for the noise and time wasted on this!
- Daniel
ES: Version: 5.4.3, Build: eed30a8/2017-06-22T00:34:03.743Z, JVM: 1.8.0_144
Plugins: none
Java: 1.8.0_144
OS: Linux es-archival1 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
I have experienced an issue with two versions of Elasticsearch (5.4.3 and 2.4.1) where Elasticsearch deletes indices that it should not delete. My environment is a multi-node cluster with Logstash indexes being routed to three different groups: ingestion, warm, and cold (the last is named 'crypt' in my environment). The logstash indexes (logstash-YYYY.MM.DD) are created on the ingestion nodes, routed/moved to the 'warm' node a day later, and then routed/moved to the cold nodes about three days after that.
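For completeness, the intermediate move to the warm node uses the same allocation-routing setting shown in the commands further below, just with the 'warm' category; the index name and date here are illustrative:
curl -XPUT 'http://localhost:9200/logstash-2017.09.13/_settings' -H 'Content-Type: application/json' -d '{ "index.routing.allocation.include.category": "warm" }'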
The 'warm' node (there is only one in my environment) developed a filesystem/SAN issue where some blocks became inaccessible (ES would time out trying to read them), and this caused Elasticsearch on that node to hang when trying to route/move shards to the cold nodes. When this happened, after an attempt/timeout/failure of roughly 12 hours, Elasticsearch would remove ALL logstash indexes at or older than the index it was trying to move to the cold nodes, i.e. all logstash indexes that were in good shape on the cold nodes (about 30 indexes in my case), in addition to the logstash index that it couldn't move because of the filesystem errors.
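When a relocation hangs like that, one way to see why a shard is stuck (a diagnostic sketch; I did not capture this output at the time) is the allocation explain API, e.g. for primary shard 0 of the affected index:
curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '{ "index": "logstash-2017.09.10", "shard": 0, "primary": true }'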
Fixing the filesystem error resolved the issue where indexes were being unexpectedly deleted.
None of the Elasticsearch logfiles (on any of the systems that I spot-checked) gave any indication of why the extra indexes were removed. Interestingly enough, other indexes on the cold nodes that didn't follow the logstash-YYYY.MM.DD naming format were not affected.
My issues are resolved now (the filesystem/device issues are fixed), but the unexpected deletion of the other indexes (particularly since none of their shards were stored on this server any more) is why I'm reporting this issue.
This is how I create my indices:
curl -XPUT 'http://localhost:9200/logstash-2017.09.16' -H 'Content-Type: application/json' -d '{ "settings" : { "index.routing.allocation.include.category": "ingestion", "index.mapping.total_fields.limit": 3000, "number_of_shards": 6, "number_of_replicas": 0, "index.unassigned.node_left.delayed_timeout": "5m" } }'
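The settings that actually got applied can be confirmed afterwards with a GET against the same index:
curl -XGET 'http://localhost:9200/logstash-2017.09.16/_settings?pretty'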
This is the command that would (eventually) result in the loss of all Logstash indexes prior to Sep 10:
curl -XPUT 'http://localhost:9200/logstash-2017.09.10/_settings' -H 'Content-Type: application/json' -d '{ "index.routing.allocation.include.category": "crypt", "number_of_replicas": 1 }'
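Relocation progress after that settings change can be watched with the cat APIs (a monitoring sketch; the output columns vary a bit between versions):
curl -XGET 'http://localhost:9200/_cat/shards/logstash-2017.09.10?v'
curl -XGET 'http://localhost:9200/_cat/recovery/logstash-2017.09.10?v&active_only=true'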
Thanks,