
Deadlock on aggregation in 2.x #22952


Closed
maxcom opened this issue Feb 3, 2017 · 10 comments


maxcom commented Feb 3, 2017

Elasticsearch version: 2.3.4 and 2.4.3

Plugins installed: None

JVM version: OpenJDK 1.8.0.101-3.b13

OS version: CentOS 6.8

Description of the problem including expected versus actual behavior:
Search threads sometimes become completely locked up. All search threads show the same stack trace, and no progress is made until we restart Elasticsearch.

We have seen this problem on both 2.3.4 and 2.4.3.

Steps to reproduce: We do not know how to reproduce it yet.

Provide logs (if relevant):

Here is a stack trace from 2.3.4:

2017-02-03 10:35:44.216064500 "elasticsearch[main][search][T#12]" #116 daemon prio=5 os_prio=0 tid=0x00007f498c0a7800 nid=0x266f in Object.wait() [0x00007f48d18da000]
2017-02-03 10:35:44.216089500    java.lang.Thread.State: RUNNABLE
2017-02-03 10:35:44.216090500 	at org.elasticsearch.search.aggregations.support.AggregationContext.bytesField(AggregationContext.java:178)
2017-02-03 10:35:44.216098500 	at org.elasticsearch.search.aggregations.support.AggregationContext.originalValuesSource(AggregationContext.java:151)
2017-02-03 10:35:44.216106500 	at org.elasticsearch.search.aggregations.support.AggregationContext.valuesSource(AggregationContext.java:85)
2017-02-03 10:35:44.216108500 	at org.elasticsearch.search.aggregations.support.ValuesSourceAggregatorFactory.createInternal(ValuesSourceAggregatorFactory.java:60)
2017-02-03 10:35:44.216116500 	at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:102)
2017-02-03 10:35:44.216118500 	at org.elasticsearch.search.aggregations.AggregatorFactories.createSubAggregators(AggregatorFactories.java:76)
2017-02-03 10:35:44.216127500 	at org.elasticsearch.search.aggregations.AggregatorBase.<init>(AggregatorBase.java:69)
2017-02-03 10:35:44.216128500 	at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.<init>(BucketsAggregator.java:48)
2017-02-03 10:35:44.216137500 	at org.elasticsearch.search.aggregations.bucket.SingleBucketAggregator.<init>(SingleBucketAggregator.java:38)
2017-02-03 10:35:44.216139500 	at org.elasticsearch.search.aggregations.bucket.filter.FilterAggregator.<init>(FilterAggregator.java:54)
2017-02-03 10:35:44.216146500 	at org.elasticsearch.search.aggregations.bucket.filter.FilterAggregator$Factory.createInternal(FilterAggregator.java:108)
2017-02-03 10:35:44.216148500 	at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:102)
2017-02-03 10:35:44.216156500 	at org.elasticsearch.search.aggregations.AggregatorFactories.createSubAggregators(AggregatorFactories.java:76)
2017-02-03 10:35:44.216158500 	at org.elasticsearch.search.aggregations.AggregatorBase.<init>(AggregatorBase.java:69)
2017-02-03 10:35:44.216166500 	at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.<init>(BucketsAggregator.java:48)
2017-02-03 10:35:44.216168500 	at org.elasticsearch.search.aggregations.bucket.SingleBucketAggregator.<init>(SingleBucketAggregator.java:38)
2017-02-03 10:35:44.216175500 	at org.elasticsearch.search.aggregations.bucket.nested.NestedAggregator.<init>(NestedAggregator.java:61)
2017-02-03 10:35:44.216177500 	at org.elasticsearch.search.aggregations.bucket.nested.NestedAggregator$Factory.createInternal(NestedAggregator.java:168)
2017-02-03 10:35:44.216185500 	at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:102)
2017-02-03 10:35:44.216186500 	at org.elasticsearch.search.aggregations.AggregatorFactories.createTopLevelAggregators(AggregatorFactories.java:87)
2017-02-03 10:35:44.216194500 	at org.elasticsearch.search.aggregations.AggregationPhase.preProcess(AggregationPhase.java:85)
2017-02-03 10:35:44.216202500 	at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:111)
2017-02-03 10:35:44.216209500 	at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:366)
2017-02-03 10:35:44.216217500 	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:378)
2017-02-03 10:35:44.216225500 	at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
2017-02-03 10:35:44.216234500 	at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
2017-02-03 10:35:44.216236500 	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
2017-02-03 10:35:44.216245500 	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
2017-02-03 10:35:44.216252500 	at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
2017-02-03 10:35:44.216254500 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
2017-02-03 10:35:44.216261500 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2017-02-03 10:35:44.216277500 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2017-02-03 10:35:44.216279500 	at java.lang.Thread.run(Thread.java:745)
2017-02-03 10:35:44.216280500 
@clintongormley added the :Analytics/Aggregations and discuss labels and removed the :Distributed Coordination/Discovery-Plugins label Feb 3, 2017
clintongormley (Contributor) commented Feb 3, 2017

@maxcom could you provide us with the aggregations that you are running please?


maxcom commented Feb 3, 2017

I can't find the exact query that causes the problem. I'll add more logging to our application and wait for the next occurrence.


jpountz commented Feb 3, 2017

The state of the thread is runnable, so I suspect this is not really a deadlock, but rather a memory pressure issue: so much time is spent doing garbage collection that the application does not seem to make any progress. Do you have some monitoring data of garbage collection activity / memory usage of the JVM?


maxcom commented Feb 3, 2017

I do not think it is a memory issue. The CPU is not busy at all, and we see no GC activity in our logs. No progress is made (for ~30 minutes) until we restart Elasticsearch.


maxcom commented Feb 3, 2017

I dug into the Elasticsearch sources and I think this can be some kind of class initialization deadlock. A similar problem is described here: https://ternarysearch.blogspot.ru/2013/07/static-initialization-deadlock.html
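For reference, the hazard described in that post can be sketched as follows (hypothetical class names, not the actual Elasticsearch code): a superclass whose static field holds an instance of its own subclass creates a circular initialization dependency. A single thread can resolve the cycle recursively, but two threads that trigger the two class initializations concurrently can each hold one class's initialization lock while waiting for the other.

```java
// Hypothetical sketch of a static initialization cycle between a
// superclass and its subclass. Single-threaded, the JVM permits the
// same thread to re-enter an in-progress initialization, so this
// completes; with two threads first touching Parent and Child
// concurrently, each can hold one init lock and wait on the other.
class Parent {
    // Initializing Parent requires initializing Child (to build EMPTY)...
    static final Parent EMPTY = new Child();
}

class Child extends Parent {
    // ...and initializing Child requires its superclass Parent
    // to be initialized first.
}

public class InitCycleDemo {
    public static void main(String[] args) {
        // Same-thread recursive initialization is allowed, so this
        // completes and prints "Child"; the deadlock needs two
        // threads performing the first access from opposite ends.
        System.out.println(Parent.EMPTY.getClass().getSimpleName());
    }
}
```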


maxcom commented Feb 3, 2017

One more stack trace (from Elasticsearch 2.4.3):

2017-02-02 16:09:32.129209500 "elasticsearch[main][search][T#6]" #71 daemon prio=5 os_prio=0 tid=0x00007f5a24083000 nid=0x2e51 in Object.wait() [0x00007f59ef6d2000]
2017-02-02 16:09:32.129220500    java.lang.Thread.State: RUNNABLE
2017-02-02 16:09:32.130068500   at org.elasticsearch.search.aggregations.support.ValuesSource$Bytes.<clinit>(ValuesSource.java:66)
2017-02-02 16:09:32.130070500   at org.elasticsearch.search.aggregations.support.AggregationContext.bytesField(AggregationContext.java:176)
2017-02-02 16:09:32.130071500   at org.elasticsearch.search.aggregations.support.AggregationContext.originalValuesSource(AggregationContext.java:151)
2017-02-02 16:09:32.130073500   at org.elasticsearch.search.aggregations.support.AggregationContext.valuesSource(AggregationContext.java:85)
2017-02-02 16:09:32.130088500   at org.elasticsearch.search.aggregations.support.ValuesSourceAggregatorFactory.createInternal(ValuesSourceAggregatorFactory.java:60)
2017-02-02 16:09:32.130090500   at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:102)
2017-02-02 16:09:32.130091500   at org.elasticsearch.search.aggregations.AggregatorFactories.createSubAggregators(AggregatorFactories.java:76)
2017-02-02 16:09:32.130096500   at org.elasticsearch.search.aggregations.AggregatorBase.<init>(AggregatorBase.java:69)
2017-02-02 16:09:32.130097500   at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.<init>(BucketsAggregator.java:48)
2017-02-02 16:09:32.130098500   at org.elasticsearch.search.aggregations.bucket.SingleBucketAggregator.<init>(SingleBucketAggregator.java:38)
2017-02-02 16:09:32.130104500   at org.elasticsearch.search.aggregations.bucket.filter.FilterAggregator.<init>(FilterAggregator.java:54)
2017-02-02 16:09:32.130106500   at org.elasticsearch.search.aggregations.bucket.filter.FilterAggregator$Factory.createInternal(FilterAggregator.java:108)
2017-02-02 16:09:32.130107500   at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:102)
2017-02-02 16:09:32.130111500   at org.elasticsearch.search.aggregations.AggregatorFactories.createTopLevelAggregators(AggregatorFactories.java:87)
2017-02-02 16:09:32.130113500   at org.elasticsearch.search.aggregations.AggregationPhase.preProcess(AggregationPhase.java:85)
2017-02-02 16:09:32.130114500   at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:111)
2017-02-02 16:09:32.130115500   at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:372)
2017-02-02 16:09:32.130128500   at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:480)
2017-02-02 16:09:32.130130500   at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:392)
2017-02-02 16:09:32.130131500   at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:389)
2017-02-02 16:09:32.130137500   at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
2017-02-02 16:09:32.130139500   at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
2017-02-02 16:09:32.130140500   at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:378)
2017-02-02 16:09:32.130144500   at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
2017-02-02 16:09:32.130145500   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2017-02-02 16:09:32.130147500   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2017-02-02 16:09:32.130148500   at java.lang.Thread.run(Thread.java:745)
2017-02-02 16:09:32.130152500 


jpountz commented Feb 3, 2017

So this might be a class initialization deadlock actually. Can you provide us with the entire jstack output?

EDIT: I had not seen the two messages above, where you already suggested that this could be a class init deadlock.


jpountz commented Feb 3, 2017

I think moving the EMPTY constant to the WithOrdinals class rather than its Bytes parent class would fix the problem, as the Bytes class would no longer depend on initializing its WithOrdinals subclass.
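A minimal sketch of that fix (class and field names simplified from the actual ValuesSource hierarchy): keeping the constant on the subclass makes the initialization dependency one-way, superclass before subclass only, which cannot form a cycle.

```java
// Hypothetical sketch of the proposed fix: the EMPTY constant lives
// on the subclass, so initializing Bytes never forces WithOrdinals
// to initialize. The only remaining dependency is the normal
// subclass-initializes-superclass-first order, which cannot cycle.
class Bytes {
    // No static reference to any subclass here.
}

class WithOrdinals extends Bytes {
    // Any thread initializing WithOrdinals acquires the init locks
    // in the same order (WithOrdinals, then Bytes), so no deadlock.
    static final WithOrdinals EMPTY = new WithOrdinals();
}

public class InitFixDemo {
    public static void main(String[] args) {
        // Touching Bytes does not trigger WithOrdinals initialization.
        Bytes b = new Bytes();
        System.out.println(WithOrdinals.EMPTY.getClass().getSimpleName());
    }
}
```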


maxcom commented Feb 3, 2017

Sure, here are the full thread dumps from 2.4.3: https://gist.github.com/maxcom/69d54d58284a7b5eea42db363bac5f6a

jpountz added a commit to jpountz/elasticsearch that referenced this issue Feb 6, 2017

jpountz commented Feb 6, 2017

Closing, as this bug does not exist in 5.x and will be fixed in the upcoming 2.4 release (for which we have no ETA at the moment). Thanks @maxcom for the detailed bug report and for being so responsive in helping us understand what was happening.

jpountz closed this as completed Feb 6, 2017