
Deadlock on aggregation in 2.x #22952


Closed
maxcom opened this issue Feb 3, 2017 · 10 comments


maxcom commented Feb 3, 2017

Elasticsearch version: 2.3.4 and 2.4.3

Plugins installed: None

JVM version: OpenJDK 1.8.0.101-3.b13

OS version: CentOS 6.8

Description of the problem including expected versus actual behavior:
Search threads sometimes become completely locked up. All search threads show the same stack trace, and no progress is made until we restart Elasticsearch.

We have seen this problem on both 2.3.4 and 2.4.3.

Steps to reproduce: We do not know how to reproduce it yet.

Provide logs (if relevant):

Here is a stack trace from 2.3.4:

2017-02-03 10:35:44.216064500 "elasticsearch[main][search][T#12]" #116 daemon prio=5 os_prio=0 tid=0x00007f498c0a7800 nid=0x266f in Object.wait() [0x00007f48d18da000]
2017-02-03 10:35:44.216089500    java.lang.Thread.State: RUNNABLE
2017-02-03 10:35:44.216090500 	at org.elasticsearch.search.aggregations.support.AggregationContext.bytesField(AggregationContext.java:178)
2017-02-03 10:35:44.216098500 	at org.elasticsearch.search.aggregations.support.AggregationContext.originalValuesSource(AggregationContext.java:151)
2017-02-03 10:35:44.216106500 	at org.elasticsearch.search.aggregations.support.AggregationContext.valuesSource(AggregationContext.java:85)
2017-02-03 10:35:44.216108500 	at org.elasticsearch.search.aggregations.support.ValuesSourceAggregatorFactory.createInternal(ValuesSourceAggregatorFactory.java:60)
2017-02-03 10:35:44.216116500 	at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:102)
2017-02-03 10:35:44.216118500 	at org.elasticsearch.search.aggregations.AggregatorFactories.createSubAggregators(AggregatorFactories.java:76)
2017-02-03 10:35:44.216127500 	at org.elasticsearch.search.aggregations.AggregatorBase.<init>(AggregatorBase.java:69)
2017-02-03 10:35:44.216128500 	at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.<init>(BucketsAggregator.java:48)
2017-02-03 10:35:44.216137500 	at org.elasticsearch.search.aggregations.bucket.SingleBucketAggregator.<init>(SingleBucketAggregator.java:38)
2017-02-03 10:35:44.216139500 	at org.elasticsearch.search.aggregations.bucket.filter.FilterAggregator.<init>(FilterAggregator.java:54)
2017-02-03 10:35:44.216146500 	at org.elasticsearch.search.aggregations.bucket.filter.FilterAggregator$Factory.createInternal(FilterAggregator.java:108)
2017-02-03 10:35:44.216148500 	at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:102)
2017-02-03 10:35:44.216156500 	at org.elasticsearch.search.aggregations.AggregatorFactories.createSubAggregators(AggregatorFactories.java:76)
2017-02-03 10:35:44.216158500 	at org.elasticsearch.search.aggregations.AggregatorBase.<init>(AggregatorBase.java:69)
2017-02-03 10:35:44.216166500 	at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.<init>(BucketsAggregator.java:48)
2017-02-03 10:35:44.216168500 	at org.elasticsearch.search.aggregations.bucket.SingleBucketAggregator.<init>(SingleBucketAggregator.java:38)
2017-02-03 10:35:44.216175500 	at org.elasticsearch.search.aggregations.bucket.nested.NestedAggregator.<init>(NestedAggregator.java:61)
2017-02-03 10:35:44.216177500 	at org.elasticsearch.search.aggregations.bucket.nested.NestedAggregator$Factory.createInternal(NestedAggregator.java:168)
2017-02-03 10:35:44.216185500 	at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:102)
2017-02-03 10:35:44.216186500 	at org.elasticsearch.search.aggregations.AggregatorFactories.createTopLevelAggregators(AggregatorFactories.java:87)
2017-02-03 10:35:44.216194500 	at org.elasticsearch.search.aggregations.AggregationPhase.preProcess(AggregationPhase.java:85)
2017-02-03 10:35:44.216202500 	at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:111)
2017-02-03 10:35:44.216209500 	at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:366)
2017-02-03 10:35:44.216217500 	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:378)
2017-02-03 10:35:44.216225500 	at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
2017-02-03 10:35:44.216234500 	at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
2017-02-03 10:35:44.216236500 	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
2017-02-03 10:35:44.216245500 	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
2017-02-03 10:35:44.216252500 	at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
2017-02-03 10:35:44.216254500 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
2017-02-03 10:35:44.216261500 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2017-02-03 10:35:44.216277500 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2017-02-03 10:35:44.216279500 	at java.lang.Thread.run(Thread.java:745)
2017-02-03 10:35:44.216280500 
@clintongormley added the :Analytics/Aggregations and discuss labels and removed the :Distributed Coordination/Discovery-Plugins label Feb 3, 2017
clintongormley (Contributor) commented Feb 3, 2017

@maxcom could you provide us with the aggregations that you are running please?


maxcom commented Feb 3, 2017

I can't find the exact query that causes the problem. I'll add more logging to our application and wait for the next occurrence.


jpountz commented Feb 3, 2017

The state of the thread is runnable, so I suspect this is not really a deadlock, but rather a memory pressure issue: so much time is spent doing garbage collection that the application does not seem to make any progress. Do you have some monitoring data of garbage collection activity / memory usage of the JVM?


maxcom commented Feb 3, 2017

I do not think it is a memory issue. The CPU is not busy at all, and we see no GC activity in our logs. No progress is made (for ~30 minutes) until we restart Elasticsearch.


maxcom commented Feb 3, 2017

I dug into the Elasticsearch sources and I think this can be some kind of class initialization deadlock. A similar problem is described here: https://ternarysearch.blogspot.ru/2013/07/static-initialization-deadlock.html
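For reference, the hazard described in that post can be sketched as follows (hypothetical class names, not the actual Elasticsearch code): a superclass whose static field holds an instance of its own subclass creates a circular initialization dependency. A single thread can resolve the cycle recursively, but two threads that trigger the two class initializations concurrently can each hold one class's initialization lock while waiting for the other.

```java
// Hypothetical sketch of a static initialization cycle between a
// superclass and its subclass. Single-threaded, the JVM permits the
// same thread to re-enter an in-progress initialization, so this
// completes; with two threads first touching Parent and Child
// concurrently, each can hold one init lock and wait on the other.
class Parent {
    // Initializing Parent requires initializing Child (to build EMPTY)...
    static final Parent EMPTY = new Child();
}

class Child extends Parent {
    // ...and initializing Child requires its superclass Parent
    // to be initialized first.
}

public class InitCycleDemo {
    public static void main(String[] args) {
        // Same-thread recursive initialization is allowed, so this
        // completes and prints "Child"; the deadlock needs two
        // threads performing the first access from opposite ends.
        System.out.println(Parent.EMPTY.getClass().getSimpleName());
    }
}
```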


maxcom commented Feb 3, 2017

One more stack trace (from Elasticsearch 2.4.3):

2017-02-02 16:09:32.129209500 "elasticsearch[main][search][T#6]" #71 daemon prio=5 os_prio=0 tid=0x00007f5a24083000 nid=0x2e51 in Object.wait() [0x00007f59ef6d2000]
2017-02-02 16:09:32.129220500    java.lang.Thread.State: RUNNABLE
2017-02-02 16:09:32.130068500   at org.elasticsearch.search.aggregations.support.ValuesSource$Bytes.<clinit>(ValuesSource.java:66)
2017-02-02 16:09:32.130070500   at org.elasticsearch.search.aggregations.support.AggregationContext.bytesField(AggregationContext.java:176)
2017-02-02 16:09:32.130071500   at org.elasticsearch.search.aggregations.support.AggregationContext.originalValuesSource(AggregationContext.java:151)
2017-02-02 16:09:32.130073500   at org.elasticsearch.search.aggregations.support.AggregationContext.valuesSource(AggregationContext.java:85)
2017-02-02 16:09:32.130088500   at org.elasticsearch.search.aggregations.support.ValuesSourceAggregatorFactory.createInternal(ValuesSourceAggregatorFactory.java:60)
2017-02-02 16:09:32.130090500   at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:102)
2017-02-02 16:09:32.130091500   at org.elasticsearch.search.aggregations.AggregatorFactories.createSubAggregators(AggregatorFactories.java:76)
2017-02-02 16:09:32.130096500   at org.elasticsearch.search.aggregations.AggregatorBase.<init>(AggregatorBase.java:69)
2017-02-02 16:09:32.130097500   at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.<init>(BucketsAggregator.java:48)
2017-02-02 16:09:32.130098500   at org.elasticsearch.search.aggregations.bucket.SingleBucketAggregator.<init>(SingleBucketAggregator.java:38)
2017-02-02 16:09:32.130104500   at org.elasticsearch.search.aggregations.bucket.filter.FilterAggregator.<init>(FilterAggregator.java:54)
2017-02-02 16:09:32.130106500   at org.elasticsearch.search.aggregations.bucket.filter.FilterAggregator$Factory.createInternal(FilterAggregator.java:108)
2017-02-02 16:09:32.130107500   at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:102)
2017-02-02 16:09:32.130111500   at org.elasticsearch.search.aggregations.AggregatorFactories.createTopLevelAggregators(AggregatorFactories.java:87)
2017-02-02 16:09:32.130113500   at org.elasticsearch.search.aggregations.AggregationPhase.preProcess(AggregationPhase.java:85)
2017-02-02 16:09:32.130114500   at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:111)
2017-02-02 16:09:32.130115500   at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:372)
2017-02-02 16:09:32.130128500   at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:480)
2017-02-02 16:09:32.130130500   at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:392)
2017-02-02 16:09:32.130131500   at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:389)
2017-02-02 16:09:32.130137500   at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
2017-02-02 16:09:32.130139500   at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
2017-02-02 16:09:32.130140500   at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:378)
2017-02-02 16:09:32.130144500   at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
2017-02-02 16:09:32.130145500   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2017-02-02 16:09:32.130147500   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2017-02-02 16:09:32.130148500   at java.lang.Thread.run(Thread.java:745)
2017-02-02 16:09:32.130152500 


jpountz commented Feb 3, 2017

So this might be a class initialization deadlock actually. Can you provide us with the entire jstack output?

EDIT: I had not seen the two messages above, where you already suggested that this could be a class init deadlock.


jpountz commented Feb 3, 2017

I think moving the EMPTY constant to the WithOrdinals class rather than its Bytes parent class would fix the problem, as the Bytes class would no longer depend on initializing its WithOrdinals subclass.
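A minimal sketch of that fix (class and field names simplified from the actual ValuesSource hierarchy): keeping the constant on the subclass makes the initialization dependency one-way, superclass before subclass only, which cannot form a cycle.

```java
// Hypothetical sketch of the proposed fix: the EMPTY constant lives
// on the subclass, so initializing Bytes never forces WithOrdinals
// to initialize. The only remaining dependency is the normal
// subclass-initializes-superclass-first order, which cannot cycle.
class Bytes {
    // No static reference to any subclass here.
}

class WithOrdinals extends Bytes {
    // Any thread initializing WithOrdinals acquires the init locks
    // in the same order (WithOrdinals, then Bytes), so no deadlock.
    static final WithOrdinals EMPTY = new WithOrdinals();
}

public class InitFixDemo {
    public static void main(String[] args) {
        // Touching Bytes does not trigger WithOrdinals initialization.
        Bytes b = new Bytes();
        System.out.println(WithOrdinals.EMPTY.getClass().getSimpleName());
    }
}
```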


maxcom commented Feb 3, 2017

Sure, here are the full thread dumps from 2.4.3: https://gist.github.com/maxcom/69d54d58284a7b5eea42db363bac5f6a

jpountz added a commit to jpountz/elasticsearch that referenced this issue Feb 6, 2017

jpountz commented Feb 6, 2017

Closing, as this bug does not exist in 5.x and will be fixed in the upcoming 2.4 release (for which we have no ETA at the moment). Thanks @maxcom for the detailed bug report and for being so responsive in helping us understand what was happening.

jpountz closed this as completed Feb 6, 2017