
Index translogs can get stuck after node failure, don't get smaller when flushed anymore #29488


Closed
danopia opened this issue Apr 12, 2018 · 16 comments
Labels: >bug, :Distributed Indexing/Engine, feedback_needed

danopia commented Apr 12, 2018

Elasticsearch version: 6.1.2, from https://artifacts.elastic.co/packages/6.x/yum

Plugins installed: discovery-ec2, repository-s3

JVM version:

java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)

OS version: Amazon Linux 2. Linux ip-10-2-45-153.ec2.internal 4.9.76-38.79.amzn2.x86_64 #1 SMP Mon Jan 15 23:35:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior: When a cluster isn't green, translogs have a tendency to stop getting trimmed. Flushes do work, and the translog's uncommitted_size goes down, but the translog as a whole sits there, growing into hundreds of gigabytes. I've tried the _flush API, the synced-flush API, and every other API I can find. Expected behavior: the translog size should match the uncommitted_size after a flush, unless there is an ongoing recovery on that shard.
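For reference, the checks and flush attempts described above correspond to standard 6.x REST calls along these lines (the index name is hypothetical):

GET /events-example/_stats/translog      # compare translog.size_in_bytes with uncommitted_size_in_bytes
POST /events-example/_flush              # regular flush; commits Lucene and should allow old translog generations to be trimmed
POST /events-example/_flush/synced       # synced flush; only succeeds when the shard has no pending operations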

Relevant stats from an index I haven't written to in at least 12 hours:

      "store": {
        "size": "110.1gb",
        "size_in_bytes": 118257807651
      },
      "translog": {
        "operations": 79068513,
        "size": "214.5gb",
        "size_in_bytes": 230343762719,
        "uncommitted_operations": 0,
        "uncommitted_size": "387b",
        "uncommitted_size_in_bytes": 387
      },
      "recovery": {
        "current_as_source": 0,
        "current_as_target": 0,
        "throttle_time": "7.7s",
        "throttle_time_in_millis": 7719
      }

Full stats from that index: https://gist.github.com/danopia/7390f5bbb898bfe4c2b3709a503539ec

Steps to reproduce:

Pretty hard for this issue (I haven't deliberately tried to cause this problem), but here are some general steps.

  1. Set up an EC2 cluster and index a lot of data into indexes with these settings (an index-creation sketch follows this list). I indexed a few dozen terabytes, 50-200GB per shard.
      "index": {
        "refresh_interval": "60s",
        "number_of_shards": "18",
        "translog": {
          "flush_threshold_size": "5gb"
        },
        "number_of_replicas": "1"
      }
  2. Wait for an Elasticsearch node to die. I had two clusters exhibit this behavior: one initially caused by ES running out of file descriptors (since addressed), the other by an EC2-level instance failure.

  3. Create a new fully-allocated index and continue writing into it.

  4. Observe that while the existing data recovery is ongoing, the new indexes can get translogs that grow unbounded.

  5. Once an index has a translog in this state, no combination of API calls can fix it.
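As a sketch of step 1, the settings above would be applied at index creation roughly like this (the index name is hypothetical; the values are the ones quoted above):

PUT /events-example
{
  "settings": {
    "index": {
      "refresh_interval": "60s",
      "number_of_shards": 18,
      "number_of_replicas": 1,
      "translog": {
        "flush_threshold_size": "5gb"
      }
    }
  }
}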

The worst part about the stuck translogs is that even after I've moved on from indexing into those indexes, they're still in the cluster and take AGES to recover. I'm talking multiple hours per shard recovery, because the entire translog must be read in order to recover from the existing store, and then read AGAIN to recover a replica. So I'm not aware of any options other than deleting the impacted indexes or living with them.


Some impacted indexes have "Too many open files" logs; others have no mention at all in the ES error logs. So it's not a prerequisite, but I'm including some such log lines for completeness.

[2018-04-11T11:15:30,421][WARN ][o.e.i.s.IndexShard       ] [i-0352bafb7594c1b75] [events-2018.04.10b][3] failed to flush index
[2018-04-11T18:26:16,055][WARN ][o.e.i.c.IndicesClusterStateService] [i-0c1f657db93d5c6e3] [[events-2018.04.10b][6]] marking and sending shard failed due to [shard failure, reason [lucene commit failed]]
[2018-04-11T18:26:16,056][WARN ][o.e.c.a.s.ShardStateAction] [i-07d34a9aad3760a6c] [events-2018.04.10b][6] received shard failed for shard id [[events-2018.04.10b][6]], allocation id [OWI1rj9WRUiRsHg2cRy96Q], primary term [0], message [shard failure, reason [lucene commit failed]], failure [FileSystemException[/mnt/data/nodes/0/indices/cpYZBHd3TSSMy4qxZXKEHA/6/index/_3vx.cfs: Too many open files]]
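For what it's worth, file-descriptor pressure like this can be watched via the nodes process stats; a minimal sketch (filter_path only trims the response):

GET /_nodes/stats/process?filter_path=nodes.*.process.open_file_descriptors,nodes.*.process.max_file_descriptors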

With this issue I'm looking for reasons why I can't fix the translogs once they grow unbounded. I'm not entirely sure of the root cause of the unbounded growth, and it's less important as long as I can fix it afterwards... but the root cause really does seem to be related to recoveries and the number of file descriptors that recoveries require.

Here's a graph of a totally healthy cluster getting hit by an EC2 failure, and then spiraling into craziness. Wednesday around 12:00 I performed a full-cluster restart to 10x the file descriptor limit.

[screenshot: 2018-04-11 at 5:31:01 PM]

dnhatn (Member) commented Apr 12, 2018

@danopia Would you mind sharing the shard-level stats with us? This can be retrieved via GET /_stats?level=shards. Thank you!

dnhatn added the :Distributed Indexing/Engine label Apr 12, 2018
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

danopia (Author) commented Apr 12, 2018

Sure. I have a hundred indexes, so hopefully stats from just one impacted index will be useful.

Index: yux-2018.04.11d

It's been an hour since I wrote into this index, and repeated synced-flush attempts say things like "pending operations" and "commit has changed" for many shards. Most recent synced-flush output: https://gist.github.com/danopia/79cf1394b9806cf374c6ea16c90f9623

Shard-level stats for that index: https://gist.github.com/danopia/af7b5df82b3d939e6e99df87cabb8b01

Thanks for the fast response :)

bleskes (Contributor) commented Apr 12, 2018

Thanks @danopia

I want to double check a few things:

  1. When you have an outage, I presume you continue indexing. Correct? How fast do you index? i.e., roughly how many bytes per second?
  2. Why did you set flush_threshold_size to 5gb?
  3. I noticed that the index has 14685147 documents but the maximum sequence number is 15253725.
    3.1 Are you using auto-generated IDs or supplying your own?
    3.2 Once indexed, do you update or delete documents? If so, using which API (delete/update by query?)
    3.3 Do you use versioned operations?

The worst part about the stuck translogs is that even after I've moved on from indexing into those indexes, they're still in the cluster and take AGES to recover. I'm talking multiple hours per shard recovery, because the entire translog must be read in order to recover from the existing store, and then read AGAIN to recover a replica.

Can you clarify a bit more about what you mean? Did you stop indexing while replicas were still recovering or were you trying to move the indices to another node?

danopia (Author) commented Apr 12, 2018

  1. The indexers don't automatically stop, no. I have replication over two racks with restricted allocation, so one node's failure effectively means some shards no longer have a replica, but everything keeps working. Otherwise the indexers would see indexing errors. The cluster I've provided stats from takes 1-2 million documents per minute, most of that going into one index at a time. One day of data is ~2TB in ES, including replicas.
    1. I already knew that heavy indexing into a recovering shard is bad. My strategy is to create a new index with healthy allocations and switch indexing over to that. Because of the issue I'm reporting here, though, I don't have a way to clean up the nasty indexes later.
  2. I set flush_threshold_size on my previous ES 2.2 clusters and copied over the index templates 😒 Since I've had these issues on the new ES version, I've been stepping down to 1gb, but these indexes were created using the 5gb figure. The goal was to let the translog go longer without being flushed, because I'm on spinning disks (in RAID 0).
  3. Good catch: I am supplying my own IDs (a bulk-request sketch follows this list). What you're seeing is unrelated data duplication (reprocessing, retries, etc.) being deduplicated by Elasticsearch at indexing time. This is actually a nice feature for us, because even though the data in ES is perfectly deduplicated, the _version attribute lets us check whether a certain event was duplicated originally.
    1. The IDs are generated upstream. They are prefixed with the document's ISO-8601 timestamp, followed by a hash. Most timestamps are from the past 5 minutes, but the IDs are otherwise not very related to each other.
    2. Yes, but only by including the same document (or a slightly different document) in another bulk-index call. I never call the delete/update APIs.
    3. Versioned operations don't sound familiar. My only regular load is _bulk calls, plus Kibana usage and _search.
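A minimal sketch of the kind of bulk request described in point 3 (the index name, type, ID, and document body are all hypothetical; the ID is a timestamp plus hash as described):

POST /_bulk
{ "index": { "_index": "events-example", "_type": "doc", "_id": "2018-04-12T10:05:03Z-9f2c1a" } }
{ "@timestamp": "2018-04-12T10:05:03Z", "message": "example event" }

Re-sending the same ID overwrites the existing document and bumps its _version, which matches the deduplication behaviour described above.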

Sure. When I see translogs not being cleared, I manually stop all indexing until I see fully-allocated hot indexes. Usually I just make new empty indexes, wait for them to allocate, then start indexing with the new indexes. If the new index translogs aren't stuck, this is all good.

(One time, I made a new index, then performed a full cluster restart to fix a systemd config, and none of the translogs on that new index worked once I started putting data in)

Anyway, a pain point here is that any future recoveries that involve a broken shard can take hours to complete. At worst, I had a full index of translogs with 67 million entries each, and each recovery took upwards of 10 hours. ES tried to rebalance and used up all the file descriptors, crashing multiple nodes. Then the primary shards had to read the entire translog in order to recover from the existing store, so my cluster was red for that entire time. Replicas also took just as long to do a peer recovery. Here's a screenshot of those recoveries:

[screenshot: 2018-03-24 at 1:31:19 AM (cropped)]

Basically, once I have these broken indexes, any sort of node restart (config changes) or node failure starts the whole recovery ordeal over again.

Going by #28205 it sounds like Elasticsearch doesn't have the capability of seeking into the translog, so I assume it was reading all of those entries just looking for the end.

danopia (Author) commented Apr 12, 2018

Maybe this is related? Doesn't read like the same trigger, but I'm seeing the same behavior with # of open files, around the same time. #29097

bleskes (Contributor) commented Apr 13, 2018

Thanks @danopia . Let's first try to reduce the number of moving parts.

The goal was to let the translog go longer without being flushed, because I'm on spinning disks (in RAID 0).

I'm not sure whether this achieves what you want - every operation written to the translog is fsynced at the end of each bulk request. The flush parameter controls how often a Lucene commit point is made, not how many documents are held in memory before the translog is written to disk.

Since recovery first copies over Lucene files, it uses the commit point. If the commit is very old (i.e., 5gb behind), you need to re-index a lot of data and recovery will be slow. Another side effect is that while a recovery is ongoing we can't trim the translog. So if recovery takes a long time and you index really quickly, the translog can get very large.

I suggest you go back to using defaults and see what effect it has.

It's been an hour since I wrote into this index, and repeated synced-flush attempts say things like "pending operations" and "commit has changed" for many shards.

It sounds like you were doing this while the indexers were running. If so, that's expected. Synced flush only works when no indexing is going on.

Going by #28205 it sounds like Elasticsearch doesn't have the capability of seeking into the translog, so I assume it was reading all of those entries just looking for the end.

We can random-access into the translog with a granularity of 64MB. That's good enough for recovery, but not good enough for other things (hence the issue you linked to).

Maybe this is related? Doesn't read like the same trigger, but I'm seeing the same behavior with # of open files, around the same time. #29097

Your symptoms are similar because your translog is huge. I think the cause is different.

Can you please reduce the translog flush size to the default (set it to null; it's a dynamic setting) and see how things go? I don't expect the problem to be solved, but I think it will be easier to figure out.
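For reference, resetting the dynamic setting to its default looks roughly like this (the index name is hypothetical; setting a dynamic index setting to null reverts it to the default):

PUT /events-example/_settings
{
  "index.translog.flush_threshold_size": null
}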

danopia (Author) commented Apr 26, 2018

Sorry for the radio silence. I removed the translog size override from all my clusters, thanks for the heads-up. The metrics are smoother now.

After my last comment I upgraded the aforementioned cluster to a 6.2.4-SNAPSHOT (using this tree) based on a belief that #29097 was at play. The upgrade didn't fix the translogs (as you expected), but it does fix the out-of-files crashes; the open-files metric has been beautiful ever since. Running out of file descriptors is what was actually crashing nodes and destabilizing clusters, so this means I can take my time with the translog investigation.

(The fun part was losing visibility while 6.2.4 was fixing the number-of-files. The node's stats were completely unresponsive for the hour or so that ES needed to clean up all the files, for each cleanup that happened.)


Anyway, after the upgrade was completed: four translogs (2 shards out of 9, plus replicas) got stuck on the same day without a known cause. They just got stuck while in sync. ~12 hours later, at 17:00 PT, the indexes rolled over and the new indexes cleanly took the load.

[screenshot: 2018-04-25 at 6:16:38 PM]

"POST /dice.prod-2018.04.17/_flush/synced"
{
  "_shards": {
    "total": 18,
    "successful": 18,
    "failed": 0
  },
  "dice.prod-2018.04.17": {
    "total": 18,
    "successful": 18,
    "failed": 0
  }
}

A week later, I purposely caused other translogs to inflate by starting a recovery on a hot index. Once the index was rolled over, the recoveries completed and the translogs were properly trimmed in a reasonable timeframe. But the stuck translogs are still there. In the same cluster!

[screenshot: 2018-04-25 at 7:14:07 PM]

Hopefully running a SNAPSHOT doesn't invalidate all this information. But if it does, I have another cluster running the real 6.2.4 release with multiple instances of the stuck-translog symptom. I'm waiting to see if the cluster going green helps that one.

dnhatn (Member) commented May 2, 2018

@danopia It would be helpful if you could provide the following information:

  • Did you use dynamic mapping updates?
  • Did you use the bulk index API? If so, were there any failures in the bulk responses? (See the response sketch below.)

Thank you!
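On the second point, a sketch of what a bulk failure looks like in the response: the top-level errors flag is true and the affected items carry an error object (all values here are elided or hypothetical):

{
  "took": 30,
  "errors": true,
  "items": [
    {
      "index": {
        "_index": "events-example",
        "_id": "...",
        "error": { "type": "...", "reason": "..." }
      }
    }
  ]
}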

danopia (Author) commented May 7, 2018

  • Yes, I have a common template across every index.
  • Yes, everything is bulked, currently set to a 20MB max size.
  • Looking at a few recent cases, there's definitely a couple hundred documents that were rejected for some reason or another right around the index rollover times. The # of rejects seems related to how many shards get stuck. I let indexes get created an hour early and misaligned data seeds them a little, but it must not be enough to have the mapping filled out.

[screenshot: 2018-05-07 at 10:18:58 AM]

Do you have a possible explanation in mind for how bulk errors could turn into translog poison pills of some kind? It sounds like a lead...

dnhatn (Member) commented May 7, 2018

Looking at a few recent cases, there's definitely a couple hundred documents that were rejected for some reason or another right around the index rollover times. The # of rejects seems related to how many shards get stuck.

@danopia Thanks again. I think we have figured out the source of the issue and fixed it in #30244. @bleskes WDYT?

danopia (Author) commented May 7, 2018

Nice, good to see there's a known thing that correlates :)

I've consulted the logging cluster that my clusters log to, and I definitely see this from the time period of the above graphs:

org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s

Other unaffected days don't have that log.

Sounds like the fix in 6.3.0 won't fix old data, so I'm not looking forward to reprocessing all the translogs I've accumulated 😛

[edit] Just noticed a referenced issue mentioning that an open/close cycle fixes the indexes, so I'll see what that does for me.
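For reference, the open/close cycle mentioned in the edit would look roughly like this (the index name is hypothetical):

POST /events-example/_close
POST /events-example/_open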

dnhatn (Member) commented May 7, 2018

Sounds like the fix in 6.3.0 won't fix old data, so I'm not looking forward to reprocessing all the translogs I've accumulated

You can fix the affected indices by rebuilding their replicas. This can be done by changing number_of_replicas to 0, then restoring it to the original value.
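A sketch of that replica rebuild, assuming an index that normally runs with one replica (the index name is hypothetical):

PUT /events-example/_settings
{ "index": { "number_of_replicas": 0 } }

PUT /events-example/_settings
{ "index": { "number_of_replicas": 1 } }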

dnhatn (Member) commented May 8, 2018

failed to process cluster event (put-mapping) within 30s

@danopia This is indeed fixed by #30244. I hope you don't mind if I close this, but feel free to reopen. We really appreciate your cooperation here.

dnhatn closed this as completed May 8, 2018
malpani (Contributor) commented Jan 25, 2019

@dnhatn @bleskes @danopia - I understand this is now fixed in 6.3+, but I had a quick question on the impact of diverging local checkpoints.

When I hit this scenario on 6.2, translogs grew beyond index.translog.retention.size and were not truncated even past the retention period set by index.translog.retention.age.

tl;dr: Are the translog retention settings ignored if the local checkpoints diverge? Ideally they should not be, right? But maybe the primary needs to keep around all sequence IDs since the last global_checkpoint, and that checkpoint will never come into sync because of this bug?

bleskes (Contributor) commented Jan 25, 2019

@malpani the translog is used for two things - the first is to reindex all operations into Lucene in the case of a restart. For this you need all operations above the global checkpoint (which is the minimum of the local checkpoints). The second is to bring replicas that were offline back up to speed using operation-based recoveries (these come from the translog in the 6.x series). The first usage is required, and we always keep the required portion of the translog for it (this is what you're seeing). The second usage is optional, as we can always copy files. The retention settings control how much of the translog should be kept around for the second usage alone. I hope this helps. If you have more questions please feel free to ask on our discuss forums; we try to keep GitHub for feature requests and bugs.
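For context, the retention settings discussed here are dynamic per-index settings; adjusting them looks roughly like this (the index name and values are hypothetical; the 6.x defaults are 512mb and 12h):

PUT /events-example/_settings
{
  "index.translog.retention.size": "512mb",
  "index.translog.retention.age": "12h"
}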
