Disk decider can allocate more data than the node can handle #7753
Comments
Hi, we have hit the same issue multiple times (version 1.0.2). It should be easy to reproduce by restarting a cluster. Our disk load is about 70% on average, with 3 shards per node, before we restart the cluster. After a restart, Elasticsearch sometimes allocates 2-3 more shards to some of these nodes, even though their disks are already nearly full (our limits are set to 65% "don't allocate" and 95% "move away"). In the end these nodes won't have more than 4 shards "active", but there are still "unused" shards on the node which serve as backup copies in case recovery from the primary fails (since we restarted the cluster). The main issue is that new allocation does not seem to take the expected shard size of the primary into consideration. After a few minutes, nodes start running completely full and we have to manually abort the relocations to these nodes.
It only considers the expected disk usage after the shard in question has completed relocation to the node. The size is considered independently for each shard; decisions for other in-flight relocations are not combined.
Disk usage and shard allocation are a bit more complex, because in order to determine the free disk space we have to poll for the amount of disk used at an interval; the cluster info is only refreshed periodically. So here's what can happen if you're not careful:
This is a worst-case scenario. There are a few ways to address this. First, the interval for the cluster info update can be shortened, so that filesystem information is gathered more frequently. Second, recovery can be slowed down so that the disk usage information has time to catch up (more on this below). For future debugging, to log what ES thinks the current sizes are, you can enable TRACE logging for the allocation deciders. I will also try to think of a better way to prevent this situation from happening in the future.
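To make that failure mode concrete, here is a minimal, self-contained Java sketch. It is not Elasticsearch code and all names and numbers in it are made up; it only illustrates how several independent allocation decisions made against the same stale disk-usage figure can over-commit a node.

```java
// Hypothetical sketch: disk usage is only refreshed at a fixed interval, so every
// allocation decision made between refreshes sees the same stale free-space figure.
public class StaleDiskInfoSketch {

    static final long NODE_CAPACITY_BYTES = 100L * 1024 * 1024 * 1024; // 100gb disk
    static final double ALLOCATION_WATERMARK = 0.95; // simplified: above this, no new shards

    public static void main(String[] args) {
        long usedBytesAtLastPoll = 70L * 1024 * 1024 * 1024; // 70% used when last polled
        long shardSizeBytes = 10L * 1024 * 1024 * 1024;      // each incoming shard is ~10gb

        long actualUsed = usedBytesAtLastPoll;
        for (int shard = 1; shard <= 5; shard++) {
            // Each decision adds only ONE shard to the *stale* usage figure,
            // ignoring the shards that were accepted moments earlier.
            double projected = (double) (usedBytesAtLastPoll + shardSizeBytes) / NODE_CAPACITY_BYTES;
            boolean allowed = projected < ALLOCATION_WATERMARK;
            if (allowed) {
                actualUsed += shardSizeBytes; // data really starts arriving on disk
            }
            System.out.printf("shard %d: decider projects %.0f%% -> %s, real usage now %.0f%%%n",
                    shard, projected * 100, allowed ? "ALLOW" : "DENY",
                    (double) actualUsed / NODE_CAPACITY_BYTES * 100);
        }
        // Every shard is allowed (70gb + 10gb = 80% < 95%), yet the node ends up
        // committed to 120% of its capacity once all relocations complete.
    }
}
```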
Could it be that when the master allocates a shard, it only takes into consideration disk usage and the predicted size of that shard, but not the shards which haven't yet recovered/initialised and for which it made the same decision just before? In our case relocation will never finish. I'm not sure, but I don't think cluster_concurrent_rebalance has any effect when you do a cluster restart, as primaries and replicas need to be allocated. That would explain why we see this behaviour at cluster restart.
It does take disk usage into account (through the ClusterInfoService) and the predicted size of that shard; however, it does not take into account the final, total size of shards that are currently relocating to the node. So the decider looking at a node with 0% disk usage evaluating a shard that's 5gb will see that the node will end up with 5gb of used space, even if there are other shards already relocating to that node whose final sizes are not counted.
I think it might be possible to get the list of other shards currently relocating to a node and factor their sizes into the final disk usage total, if that solution sounds like it would be useful for you @bluelu @grantr. However, it would be good to first confirm the source of the issue (that it is indeed multiple relocations being evaluated independently). Turning on the logging I mentioned above would be helpful if you see the issue again!
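A minimal sketch of the idea being proposed, using hypothetical types rather than the real DiskThresholdDecider: before checking the watermark, add the full declared size of every shard already relocating to the target node to its used-disk figure.

```java
import java.util.List;

// Hypothetical sketch of "factor in-flight relocations into the disk check".
public class IncludeRelocationsSketch {

    record IncomingShard(String id, long sizeBytes) {}

    static boolean canAllocate(long usedBytes, long capacityBytes,
                               long newShardBytes, List<IncomingShard> relocatingToNode,
                               double watermark) {
        long pendingBytes = relocatingToNode.stream().mapToLong(IncomingShard::sizeBytes).sum();
        double projected = (double) (usedBytes + pendingBytes + newShardBytes) / capacityBytes;
        return projected < watermark;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        List<IncomingShard> inFlight = List.of(
                new IncomingShard("index-a[0]", 10 * gb),
                new IncomingShard("index-b[2]", 10 * gb));

        // Ignoring in-flight relocations: 70gb + 10gb = 80% of 100gb -> allowed.
        System.out.println(canAllocate(70 * gb, 100 * gb, 10 * gb, List.of(), 0.95));
        // Counting them: 70gb + 20gb + 10gb = 100% -> denied.
        System.out.println(canAllocate(70 * gb, 100 * gb, 10 * gb, inFlight, 0.95));
    }
}
```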
Sounds like the exact solution for the problem we have. |
Yes, we run into this if we don't manually intervene like we do now. |
@dakrone we have run into this situation before, where the available disk space calculation did not take in-progress relocations into account. When @drewr was helping us recover a cluster, he recommended setting the number of concurrent relocations to 1 in order to prevent this from happening. I can drop more details in here when I'm back in front of a computer.
"You can also lower the" ... what? Inquiring minds want to know ;)
Whoops, sorry! I meant to say: lower the speed at which the shard is recovered (through throttling), to give the disk usage information time to catch up before more data lands on the node.
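A back-of-the-envelope sketch of why throttling helps: between two disk-usage polls, at most (recovery bandwidth × poll interval) bytes can land on a node without the decider knowing about them. The numbers below are illustrative only; the setting names in the comments (`indices.recovery.max_bytes_per_sec` and `cluster.info.update.interval`) are the commonly used ones, but verify them against your Elasticsearch version.

```java
// Hypothetical arithmetic sketch: how much data can arrive "unseen" between polls.
public class ThrottlingHeadroomSketch {
    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long pollIntervalSeconds = 30;                     // e.g. cluster.info.update.interval
        long[] recoveryBytesPerSec = {200 * mb, 20 * mb};  // roughly unthrottled vs. throttled

        for (long rate : recoveryBytesPerSec) {            // e.g. indices.recovery.max_bytes_per_sec
            long unaccounted = rate * pollIntervalSeconds; // worst-case data invisible to the decider
            System.out.printf("at %d mb/s, up to %d mb can arrive unseen between polls%n",
                    rate / mb, unaccounted / mb);
        }
        // 200 mb/s -> ~6000 mb unseen per 30s window; 20 mb/s -> ~600 mb.
        // Slower recovery keeps the gap between real and reported disk usage small.
    }
}
```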
When using the DiskThresholdDecider, it's possible that shards could already be marked as relocating to the node being evaluated. This commit adds a new setting `cluster.routing.allocation.disk.include_relocations`, which adds the size of the shards currently being relocated to this node to the node's used disk space. This new option defaults to `true`; however, it's possible to over-estimate the usage for a node if the relocation is already partially complete. For instance, a node with a 10gb shard that's 45% of the way through a relocation would add 10gb + (0.45 * 10gb) = 14.5gb to the node's disk usage before examining the watermarks to see if a new shard can be allocated. Fixes elastic#7753. Relates to elastic#6168.
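A small worked example of the over-estimation described in that commit message. The code is hypothetical and only mirrors the arithmetic: a partially completed relocation is counted twice, because its bytes already on disk show up in the node's used space and its full size is added again as an in-flight shard.

```java
// Hypothetical sketch of the 10gb-shard-at-45% example from the commit message.
public class OverEstimateSketch {
    public static void main(String[] args) {
        double shardGb = 10.0;
        double fractionCopied = 0.45;

        double alreadyOnDisk = fractionCopied * shardGb;   // 4.5gb already counted via FS stats
        double addedForRelocation = shardGb;               // full 10gb added for the in-flight shard
        double countedAgainstWatermark = alreadyOnDisk + addedForRelocation;

        System.out.printf("counted: %.1fgb, final real usage: %.1fgb (over-estimate of %.1fgb)%n",
                countedAgainstWatermark, shardGb, countedAgainstWatermark - shardGb);
        // counted: 14.5gb, final real usage: 10.0gb, over-estimate: 4.5gb.
        // The over-estimate only makes the decider more conservative, never less.
    }
}
```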
We had a disk full event recently that exposed a potentially dangerous behavior of the disk-based shard allocation.
It appears that the disk-based allocation algorithm checks to see whether shards will fit on a node and disallows shards that would increase the usage past the high watermark. That's good. But in our case, the disks filled up anyway.
We had a cluster where every node's data partition was close to full. When a node (we'll call it node A) ran out of space, most of the shards allocated to it failed and were deleted. This took it very close to the low watermark, but not quite under. Later, an unknown event (possibly a merge, possibly a human doing something) freed more space and brought disk usage back under the low watermark. Elasticsearch allocated a few of the failed shards from before back to the same node. However, recovery of those shards failed due to disk full errors.
At roughly the same time, another node in the cluster (node B) ran out of disk and failed a bunch of shards.
I believe the recovery failed because two events triggered allocation at roughly the same time. The first caused the disk-based allocator to allocate some shards to the node. While those shards were initializing, the second event caused another instance of the allocator to allocate even more shards to the same node.
Does the disk-based allocator consider the expected disk usage after current recoveries are finished, or does it ignore current recoveries?
Unfortunately I don't have logs of allocation decisions, so I don't know exactly which shards were allocated where. I know that all the shards that failed recovery were originally allocated to node A. It's possible that none of the shards from node B were actually allocated to node A.
Regardless of what actually happened, I'm hoping that someone can explain to me what the disk-based allocator would be expected to do in the above case where there are two allocation events in a short time.