
Index translogs can get stuck after node failure, don't get smaller when flushed anymore #29488


Closed
danopia opened this issue Apr 12, 2018 · 16 comments
Labels: >bug, :Distributed Indexing/Engine, feedback_needed

danopia commented Apr 12, 2018

Elasticsearch version: 6.1.2, from https://artifacts.elastic.co/packages/6.x/yum

Plugins installed: discovery-ec2, repository-s3

JVM version:

java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)

OS version: Amazon Linux 2. Linux ip-10-2-45-153.ec2.internal 4.9.76-38.79.amzn2.x86_64 #1 SMP Mon Jan 15 23:35:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior: When a cluster isn't green, translogs have a tendency to stop getting trimmed. Flushes do work, and the translog's uncommitted_size goes down, but the translog as a whole sits there, growing into hundreds of gigabytes. I've tried the _flush API, the synced-flush API, and every other API I can find. Expected behavior: the translog size should match the uncommitted_size after a flush, unless there is an ongoing recovery on that shard.
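For reference, the checks and flush attempts described above correspond to standard 6.x REST calls along these lines (the index name is hypothetical):

GET /events-example/_stats/translog      # compare translog.size_in_bytes with uncommitted_size_in_bytes
POST /events-example/_flush              # regular flush; commits Lucene and should allow old translog generations to be trimmed
POST /events-example/_flush/synced       # synced flush; only succeeds when the shard has no pending operations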

Relevant stats from an index I haven't written to in at least 12 hours:

      "store": {
        "size": "110.1gb",
        "size_in_bytes": 118257807651
      },
      "translog": {
        "operations": 79068513,
        "size": "214.5gb",
        "size_in_bytes": 230343762719,
        "uncommitted_operations": 0,
        "uncommitted_size": "387b",
        "uncommitted_size_in_bytes": 387
      },
      "recovery": {
        "current_as_source": 0,
        "current_as_target": 0,
        "throttle_time": "7.7s",
        "throttle_time_in_millis": 7719
      }

Full stats from that index: https://gist.github.com/danopia/7390f5bbb898bfe4c2b3709a503539ec

Steps to reproduce:

Pretty hard for this issue (I haven't deliberately tried to cause this problem), but here are some general steps.

  1. Set up an EC2 cluster and index a lot of data into indexes with these settings (an index-creation sketch follows this list). I indexed a few dozen terabytes, 50-200GB per shard.
      "index": {
        "refresh_interval": "60s",
        "number_of_shards": "18",
        "translog": {
          "flush_threshold_size": "5gb"
        },
        "number_of_replicas": "1"
      }
  2. Wait for an Elasticsearch node to die. I had two clusters exhibit this behavior: one initially caused by ES running out of file descriptors (since addressed), the other by an EC2-level instance failure.

  3. Create a new fully-allocated index and continue writing into it.

  4. Observe that while the existing data recovery is ongoing, the new indexes can get translogs that grow unbounded.

  5. Once an index has a translog in this state, no combination of API calls can fix it.
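As a sketch of step 1, the settings above would be applied at index creation roughly like this (the index name is hypothetical; the values are the ones quoted above):

PUT /events-example
{
  "settings": {
    "index": {
      "refresh_interval": "60s",
      "number_of_shards": 18,
      "number_of_replicas": 1,
      "translog": {
        "flush_threshold_size": "5gb"
      }
    }
  }
}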

The worst part about the stuck translogs is that even after I've moved on from indexing into those indexes, they're still in the cluster and take AGES to recover. I'm talking multiple hours per shard recovery, because the entire translog must be read in order to recover from the existing store, and then read AGAIN to recover a replica. So I'm not aware of any options other than deleting the impacted indexes or living with them.


Some impacted indexes have "Too many open files" logs; others have no mention at all in the ES error logs. So it's not a prerequisite, but I'm including some such log lines for completeness.

[2018-04-11T11:15:30,421][WARN ][o.e.i.s.IndexShard       ] [i-0352bafb7594c1b75] [events-2018.04.10b][3] failed to flush index
[2018-04-11T18:26:16,055][WARN ][o.e.i.c.IndicesClusterStateService] [i-0c1f657db93d5c6e3] [[events-2018.04.10b][6]] marking and sending shard failed due to [shard failure, reason [lucene commit failed]]
[2018-04-11T18:26:16,056][WARN ][o.e.c.a.s.ShardStateAction] [i-07d34a9aad3760a6c] [events-2018.04.10b][6] received shard failed for shard id [[events-2018.04.10b][6]], allocation id [OWI1rj9WRUiRsHg2cRy96Q], primary term [0], message [shard failure, reason [lucene commit failed]], failure [FileSystemException[/mnt/data/nodes/0/indices/cpYZBHd3TSSMy4qxZXKEHA/6/index/_3vx.cfs: Too many open files]]
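For what it's worth, file-descriptor pressure like this can be watched via the nodes process stats; a minimal sketch (filter_path only trims the response):

GET /_nodes/stats/process?filter_path=nodes.*.process.open_file_descriptors,nodes.*.process.max_file_descriptors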

With this issue I'm looking for reasons why I can't fix the translogs once they grow unbounded. I'm not entirely sure of the root cause of the unbounded growth, and it's less important as long as I can fix it afterwards... but the root cause really does seem to be related to recoveries and the number of file descriptors that recoveries require.

Here's a graph of a totally healthy cluster getting hit by an EC2 failure, and then spiraling into craziness. Wednesday around 12:00 I performed a full-cluster restart to 10x the file descriptor limit.

[screenshot: 2018-04-11 at 5:31:01 PM]

dnhatn (Member) commented Apr 12, 2018

@danopia Would you mind sharing the shard-level stats with us? This can be retrieved via GET /_stats?level=shards. Thank you!

dnhatn added the :Distributed Indexing/Engine label Apr 12, 2018
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

danopia (Author) commented Apr 12, 2018

Sure. I have a hundred indexes, so hopefully stats from just one impacted index will be useful.

Index: yux-2018.04.11d

It's been an hour since I wrote into this index, and repeated synced-flush attempts say things like "pending operations" and "commit has changed" for many shards. Most recent synced-flush output: https://gist.github.com/danopia/79cf1394b9806cf374c6ea16c90f9623

Shard-level stats for that index: https://gist.github.com/danopia/af7b5df82b3d939e6e99df87cabb8b01

Thanks for the fast response :)

bleskes (Contributor) commented Apr 12, 2018

Thanks @danopia

I want to double check a few things:

  1. When you have an outage, I presume you continue indexing. Correct? How fast do you index? i.e., roughly how many bytes per second?
  2. Why did you set flush_threshold_size to 5gb?
  3. I noticed that the index has 14685147 documents but the maximum sequence number is 15253725.
    3.1 Are you using auto-generated IDs or supplying your own?
    3.2 Once indexed, do you update or delete documents? If so, using which API (delete/update by query?)
    3.3 Do you use versioned operations?

The worst part about the stuck translogs is that even after I've moved on from indexing into those indexes, they're still in the cluster and take AGES to recover. I'm talking multiple hours per shard recovery, because the entire translog must be read in order to recover from the existing store, and then read AGAIN to recover a replica.

Can you clarify a bit more about what you mean? Did you stop indexing while replicas were still recovering or were you trying to move the indices to another node?

danopia (Author) commented Apr 12, 2018

  1. The indexers don't automatically stop, no. I have replication over two racks with restricted allocation, so one node's failure effectively means some shards no longer have a replica, but everything keeps working. Otherwise the indexers would see indexing errors. The cluster I've provided stats from takes 1-2 million documents per minute, most of that going into one index at a time. One day of data is ~2TB in ES, including replicas.
    1. I already knew that heavy indexing into a recovering shard is bad. My strategy is to create a new index with healthy allocations and switch indexing over to that. Because of the issue I'm reporting here, though, I don't have a way to clean up the nasty indexes later.
  2. I set flush_threshold_size on my previous ES 2.2 clusters and copied over the index templates 😒 Since I've had these issues on the new ES version, I've been stepping down to 1gb, but these indexes were created using the 5gb figure. The goal was to let the translog go longer without being flushed, because I'm on spinning disks (in RAID 0).
  3. Good catch: I am supplying my own IDs (a bulk-request sketch follows this list). What you're seeing is unrelated data duplication (reprocessing, retries, etc.) being deduplicated by Elasticsearch at indexing time. This is actually a nice feature for us, because even though the data in ES is perfectly deduplicated, the _version attribute lets us check whether a certain event was duplicated originally.
    1. The IDs are generated upstream. They are prefixed with the document's ISO-8601 timestamp, followed by a hash. Most timestamps are from the past 5 minutes, but the IDs are otherwise not very related to each other.
    2. Yes, but only by including the same document (or a slightly different document) in another bulk-index call. I never call the delete/update APIs.
    3. Versioned operations don't sound familiar. My only regular load is _bulk calls, plus Kibana usage and _search.
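A minimal sketch of the kind of bulk request described in point 3 (the index name, type, ID, and document body are all hypothetical; the ID is a timestamp plus hash as described):

POST /_bulk
{ "index": { "_index": "events-example", "_type": "doc", "_id": "2018-04-12T10:05:03Z-9f2c1a" } }
{ "@timestamp": "2018-04-12T10:05:03Z", "message": "example event" }

Re-sending the same ID overwrites the existing document and bumps its _version, which matches the deduplication behaviour described above.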

Sure. When I see translogs not being cleared, I manually stop all indexing until I see fully-allocated hot indexes. Usually I just make new empty indexes, wait for them to allocate, then start indexing with the new indexes. If the new index translogs aren't stuck, this is all good.

(One time, I made a new index, then performed a full cluster restart to fix a systemd config, and none of the translogs on that new index worked once I started putting data in)

Anyway, a pain point here is that any future recoveries that involve a broken shard can take hours to complete. At worst, I had a full index of translogs with 67 million entries each, and each recovery took upwards of 10 hours. ES tried to rebalance and used up all the file descriptors, crashing multiple nodes. Then the primary shards had to read the entire translog in order to recover from the existing store, so my cluster was red for that entire time. Replicas also took just as long to do a peer recovery. Here's a screenshot of those recoveries:

[screenshot: 2018-03-24 at 1:31:19 AM (cropped)]

Basically, once I have these broken indexes, any sort of node restart (config changes) or node failure starts the whole recovery ordeal over again.

Going by #28205 it sounds like Elasticsearch doesn't have the capability of seeking into the translog, so I assume it was reading all of those entries just looking for the end.

danopia (Author) commented Apr 12, 2018

Maybe this is related? Doesn't read like the same trigger, but I'm seeing the same behavior with # of open files, around the same time. #29097

bleskes (Contributor) commented Apr 13, 2018

Thanks @danopia . Let's first try to reduce the number of moving parts.

The goal was to let the translog go longer without being flushed, because I'm on spinning disks (in RAID 0).

I'm not sure whether this achieves what you want - every operation written to the translog is fsynced at the end of each bulk request. The flush parameter controls how often a Lucene commit point is made, not how many documents are held in memory before the translog is written to disk.

Since recovery first copies over Lucene files, it uses the commit point. If the commit is very old (i.e., 5gb behind), you need to re-index a lot of data and recovery will be slow. Another side effect is that while a recovery is ongoing we can't trim the translog. So if recovery takes a long time and you index really quickly, the translog can get very large.

I suggest you go back to using defaults and see what effect it has.

It's been an hour since I wrote into this index, and repeated synced-flush attempts say things like "pending operations" and "commit has changed" for many shards.

It sounds like you were doing this while the indexers were running. If so, that's expected. Synced flush only works when no indexing is going on.

Going by #28205 it sounds like Elasticsearch doesn't have the capability of seeking into the translog, so I assume it was reading all of those entries just looking for the end.

We can random-access into the translog with a granularity of 64MB. That's good enough for recovery, but not good enough for other things (hence the issue you linked to).

Maybe this is related? Doesn't read like the same trigger, but I'm seeing the same behavior with # of open files, around the same time. #29097

Your symptoms are similar because your translog is huge. I think the cause is different.

Can you please reduce the translog flush size to the default (set it to null; it's a dynamic setting) and see how things go? I don't expect the problem to be solved, but I think it will be easier to figure out.
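For reference, resetting the dynamic setting to its default looks roughly like this (the index name is hypothetical; setting a dynamic index setting to null reverts it to the default):

PUT /events-example/_settings
{
  "index.translog.flush_threshold_size": null
}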

danopia (Author) commented Apr 26, 2018

Sorry for the radio silence. I removed the translog size override from all my clusters, thanks for the heads-up. The metrics are smoother now.

After my last comment I upgraded the aforementioned cluster to a 6.2.4-SNAPSHOT (using this tree) based on a belief that #29097 was at play. The upgrade didn't fix the translogs (as you expected), but it does fix the out-of-files crashes; the open-files metric has been beautiful ever since. Running out of file descriptors is what was actually crashing nodes and destabilizing clusters, so this means I can take my time with the translog investigation.

(The fun part was losing visibility while 6.2.4 was fixing the number-of-files. The node's stats were completely unresponsive for the hour or so that ES needed to clean up all the files, for each cleanup that happened.)


Anyway, after the upgrade was completed: four translogs (2 shards out of 9, plus replicas) got stuck on the same day without a known cause. They just got stuck while in sync. ~12 hours later, at 17:00 PT, the indexes rolled over and the new indexes cleanly took the load.

[screenshot: 2018-04-25 at 6:16:38 PM]

"POST /dice.prod-2018.04.17/_flush/synced"
{
  "_shards": {
    "total": 18,
    "successful": 18,
    "failed": 0
  },
  "dice.prod-2018.04.17": {
    "total": 18,
    "successful": 18,
    "failed": 0
  }
}

A week later, I purposely caused other translogs to inflate by starting a recovery on a hot index. Once the index was rolled over, the recoveries completed and the translogs were properly trimmed in a reasonable timeframe. But the stuck translogs are still there. In the same cluster!

[screenshot: 2018-04-25 at 7:14:07 PM]

Hopefully running a SNAPSHOT doesn't invalidate all this information. But if it does, I have another cluster running the real 6.2.4 release with multiple instances of the stuck-translog symptom. I'm waiting to see if the cluster going green helps that one.

dnhatn (Member) commented May 2, 2018

@danopia It would be helpful if you could provide the following information:

  • Did you use dynamic mapping updates?
  • Did you use the bulk index API? If so, were there any failures in the bulk responses? (See the response sketch below.)

Thank you!
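On the second point, a sketch of what a bulk failure looks like in the response: the top-level errors flag is true and the affected items carry an error object (all values here are elided or hypothetical):

{
  "took": 30,
  "errors": true,
  "items": [
    {
      "index": {
        "_index": "events-example",
        "_id": "...",
        "error": { "type": "...", "reason": "..." }
      }
    }
  ]
}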

danopia (Author) commented May 7, 2018

  • Yes, I have a common template across every index.
  • Yes, everything is bulked, currently set to a 20MB max size.
  • Looking at a few recent cases, there's definitely a couple hundred documents that were rejected for some reason or another right around the index rollover times. The # of rejects seems related to how many shards get stuck. I let indexes get created an hour early and misaligned data seeds them a little, but it must not be enough to have the mapping filled out.

[screenshot: 2018-05-07 at 10:18:58 AM]

Do you have a possible explanation in mind for how bulk errors could turn into translog poison pills of some kind? It sounds like a lead...

dnhatn (Member) commented May 7, 2018

Looking at a few recent cases, there's definitely a couple hundred documents that were rejected for some reason or another right around the index rollover times. The # of rejects seems related to how many shards get stuck.

@danopia Thanks again. I think we have figured out the source of the issue and fixed it in #30244. @bleskes WDYT?

danopia (Author) commented May 7, 2018

Nice, good to see there's a known thing that correlates :)

I've consulted the logging cluster that my clusters log to, and I definitely see this from the time period of the above graphs:

org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s

Other unaffected days don't have that log.

Sounds like the fix in 6.3.0 won't fix old data, so I'm not looking forward to reprocessing all the translogs I've accumulated 😛

[edit] Just noticed a referenced issue mentioning that an open/close cycle fixes the indexes, so I'll see what that does for me.
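For reference, the open/close cycle mentioned in the edit would look roughly like this (the index name is hypothetical):

POST /events-example/_close
POST /events-example/_open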

dnhatn (Member) commented May 7, 2018

Sounds like the fix in 6.3.0 won't fix old data, so I'm not looking forward to reprocessing all the translogs I've accumulated

You can fix the affected indices by rebuilding their replicas. This can be done by changing number_of_replicas to 0, then restoring it to the original value.
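A sketch of that replica rebuild, assuming an index that normally runs with one replica (the index name is hypothetical):

PUT /events-example/_settings
{ "index": { "number_of_replicas": 0 } }

PUT /events-example/_settings
{ "index": { "number_of_replicas": 1 } }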

dnhatn (Member) commented May 8, 2018

failed to process cluster event (put-mapping) within 30s

@danopia This is indeed fixed by #30244. I hope you don't mind if I close this, but feel free to reopen. We really appreciate your cooperation here.

dnhatn closed this as completed May 8, 2018
malpani (Contributor) commented Jan 25, 2019

@dnhatn @bleskes @danopia - I understand this is now fixed in 6.3+, but I had a quick question on the impact of diverging local checkpoints.

When I hit this scenario on 6.2, translogs grew beyond index.translog.retention.size and were not truncated even past the retention period set by index.translog.retention.age.

tl;dr: Are the translog retention settings ignored if the local checkpoints diverge? Ideally they should not be, right? But maybe the primary needs to keep around all sequence IDs since the last global_checkpoint, and that checkpoint will never come into sync because of this bug?

bleskes (Contributor) commented Jan 25, 2019

@malpani the translog is used for two things - the first is to reindex all operations into Lucene in the case of a restart. For this you need all operations above the global checkpoint (which is the minimum of the local checkpoints). The second is to bring replicas that were offline back up to speed using operation-based recoveries (these come from the translog in the 6.x series). The first usage is required, and we always keep the required portion of the translog for it (this is what you're seeing). The second usage is optional, as we can always copy files. The retention settings control how much of the translog should be kept around for the second usage alone. I hope this helps. If you have more questions please feel free to ask on our discuss forums; we try to keep GitHub for feature requests and bugs.
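For context, the retention settings discussed here are dynamic per-index settings; adjusting them looks roughly like this (the index name and values are hypothetical; the 6.x defaults are 512mb and 12h):

PUT /events-example/_settings
{
  "index.translog.retention.size": "512mb",
  "index.translog.retention.age": "12h"
}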
