
Elasticsearch client node crashing with java.lang.StackOverflowError when running percentile calculations for long data sets #23003


Closed
xo4n opened this issue Feb 6, 2017 · 10 comments

Comments


xo4n commented Feb 6, 2017

Elasticsearch version:
{
"number" : "5.2.0",
"build_hash" : "24e05b9",
"build_date" : "2017-01-24T19:52:35.800Z",
"build_snapshot" : false,
"lucene_version" : "6.4.0"
}

Plugins installed: [reindexing, x-pack]

JVM version:
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)

OS version:
Debian 8.5

Description of the problem including expected versus actual behavior:
The Elasticsearch client node used by Kibana to send requests to the cluster crashes with:

[2017-02-06T18:07:25,659][INFO ][o.e.m.j.JvmGcMonitorService] [dashboard] [gc][7] overhead, spent [346ms] collecting in the last [1s]
[2017-02-06T18:18:24,961][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [dashboard] fatal error in thread [elasticsearch[dashboard][search][T#28]], exiting
java.lang.StackOverflowError: null
        at com.tdunning.math.stats.AVLTreeDigest.add(AVLTreeDigest.java:77) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.compress(AVLTreeDigest.java:165) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.add(AVLTreeDigest.java:135) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.compress(AVLTreeDigest.java:165) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.add(AVLTreeDigest.java:135) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.compress(AVLTreeDigest.java:165) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.add(AVLTreeDigest.java:135) ~[t-digest-3.0.jar:?]

Steps to reproduce:

  1. Running the following query from Kibana causes the Elasticsearch client node to crash:
{"index":["rum_custom-2017.02.03","rum_custom-2017.02.04","rum_custom-2017.02.05","rum_custom-2017.02.06","rum_custom-2017.01.31","rum_custom-2017.02.01","rum_custom-2017.02.02","rum_custom-2017.01.30","rum_custom-2017.01.24","rum_custom-2017.01.25","rum_custom-2017.01.26","rum_custom-2017.01.27","rum_custom-2017.01.22","rum_custom-2017.01.23","rum_custom-2017.01.28","rum_custom-2017.01.29"],"ignore_unavailable":true,"preference":1486404140629}
{"query":{"bool":{"must":[{"query_string":{"query":"v_entryType: \"mark\"","analyze_wildcard":true}},{"range":{"@timestamp":{"gte":1485108407472,"lte":1486404407472,"format":"epoch_millis"}}}],"must_not":[]}},"size":0,"_source":{"excludes":[]},"aggs":{"2":{"date_histogram":{"field":"@timestamp","interval":"12h","time_zone":"Europe/Berlin","min_doc_count":1},"aggs":{"3":{"terms":{"field":"v_name","size":10,"order":{"_term":"desc"}},"aggs":{"1":{"percentiles":{"field":"v_startTime","percents":[50],"keyed":false}}}}}}}} 
  2. The same query for a shorter period of time runs just fine:
{"index":["rum_custom-2017.02.03","rum_custom-2017.02.04","rum_custom-2017.02.05","rum_custom-2017.02.06","rum_custom-2017.01.31","rum_custom-2017.02.01","rum_custom-2017.02.02","rum_custom-2017.01.30","rum_custom-2017.01.27","rum_custom-2017.01.28","rum_custom-2017.01.29"],"ignore_unavailable":true,"preference":1486407007120}
{"query":{"bool":{"must":[{"query_string":{"analyze_wildcard":true,"query":"v_entryType: \"mark\""}},{"range":{"@timestamp":{"gte":1485543020838,"lte":1486407020838,"format":"epoch_millis"}}}],"must_not":[]}},"size":0,"_source":{"excludes":[]},"aggs":{"2":{"date_histogram":{"field":"@timestamp","interval":"3h","time_zone":"Europe/Berlin","min_doc_count":1},"aggs":{"3":{"terms":{"field":"v_name","size":10,"order":{"_term":"desc"}},"aggs":{"1":{"percentiles":{"field":"v_startTime","percents":[50],"keyed":false}}}}}}}} 



martinscholz83 commented Feb 9, 2017

Hi @xo4n, when you say "the same query for a shorter period of time", do you mean that you set a smaller interval, which produces fewer results per aggregation, or that you queried fewer indices?

clintongormley (Contributor) commented:

@colings86 any ideas?

colings86 (Contributor) commented:

@xo4n It looks like it's getting stuck in a compression event and not making any progress.

To explain this, let me describe a little about the T-Digest algorithm that's used for the percentiles aggregation. As the aggregation collects values from the matching documents it creates a histogram of centroids that combine similar values (the method of creating these centroids is not important here, but if you are interested I would advise reading the t-digest paper). These centroids continue being created as more documents are collected, and at some point a compression event is triggered to reduce the number of centroids back to a manageable number. The compression strategy is to re-insert all of the centroids in a random order, which, because of the way values are combined, should result in fewer centroids being created.

In your case it seems that a compression event has been triggered and, while re-inserting the centroids, it recursively triggers another compression event, which in turn triggers another, and so on until it overflows the stack. This is something I have not seen before.
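
To make that failure mode concrete, here is a deliberately simplified toy in Java (not the actual AVLTreeDigest code, whose merging logic is much more involved) showing how add() and compress() can end up calling each other until the stack overflows when re-insertion never manages to reduce the centroid count:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy illustration only: in the real algorithm nearby values are merged into
// existing centroids, so compress() normally shrinks the set. Here nothing
// ever merges, which is the degenerate case where add() and compress()
// recurse until a StackOverflowError is thrown.
class ToyDigest {
    private final int maxCentroids;
    private List<Double> centroids = new ArrayList<>();

    ToyDigest(int maxCentroids) {
        this.maxCentroids = maxCentroids;
    }

    void add(double value) {
        centroids.add(value);            // real t-digest would try to merge first
        if (centroids.size() > maxCentroids) {
            compress();                  // add() -> compress()
        }
    }

    private void compress() {
        List<Double> old = centroids;
        centroids = new ArrayList<>();
        Collections.shuffle(old);
        for (double c : old) {
            add(c);                      // compress() -> add() -> compress() -> ...
        }
    }

    public static void main(String[] args) {
        ToyDigest d = new ToyDigest(100);
        for (int i = 0; i <= 100; i++) {
            d.add(i);                    // the 101st value triggers the runaway recursion
        }
    }
}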

Unfortunately, this is probably going to be a little tricky to reproduce, since getting the same failure will require having the same data inserted into the algorithm. Please bear with me; I will try to get to the bottom of this for you.

The good news is that because you have time based indices and your top level aggregation is a date_histogram, we should be able to filter this down to a single index and a single 12 hour period within that index. Could you please run the same query against the following indices, one index at a time (it's best if you query Elasticsearch through something like Console for this rather than modifying your visualisation in Kibana)?

"rum_custom-2017.01.22"
"rum_custom-2017.01.23"
"rum_custom-2017.01.24"
"rum_custom-2017.01.25"
"rum_custom-2017.01.26"

Doing this should hopefully tell you which of these indices is causing the issue (my assumption is it is not the indices from 2017.01.27 to 2017.02.06 since the query across these indices worked).

Once you know which index is causing the issue you will have reduced the problem to two 12 hour buckets. The next step will be to modify the range query in your search request to search from 00:00:00 to 12:00:00 on that day and run the request to see if the issue reproduces; if not, change the range to 12:00:00 on that day through 00:00:00 the following day.

At this point you should have narrowed the problem down to a 12 hour period in a single index. If you are happy to share the index with me privately (I can give you details if you let me know you want to go this route), I could then try to debug the problem. Is this something you would be willing to do? If so, could you let me know how big the index is once you have identified it using the steps above?


xo4n commented May 24, 2017

We figured out which data could be causing this.

We were mapping some fields to half_float that actually contain bigger values than a half_float can handle. We are not sure what value Elasticsearch indexes in such cases, but when running percentile calculations on a field with several of these, the issue is reproducible.

Below is a data example (a small sketch of the overflow behaviour follows the JSON view). We send key-value fields as strings and map the numeric ones to half_float. Fields like "v_startTime" hold very high values in some cases, and those are the ones causing the issue. When sorting on those fields, the problematic documents already come back with "sort": ["Infinity"] instead of their numeric value.

Table View (excerpt)

# v_duration	   	5,862.045
t v_entryType	   	measure
t v_name	   	gtm_time_to_first_tag_item
# v_startTime	   	75,810.745

JSON View

{
	"_index": "rum_custom-2017.05.23",
	"_type": "logs",
	"_id": "AVwy0ptHX7d39WTa_HhR",
	"_score": null,
	"_source": {
		"geoip": {
			"country_code2": "IN",
			"continent_code": "AS"
		},
		"ua_build": "83",
		"v_startTime": "75810.745",
		"ua_os_major": "4",
		"ua_os": "Android 4.4.4",
		"locale": "IN",
		"ua_patch": "3029",
		"ua_minor": "0",
		"@timestamp": "2017-05-23T01:01:21.000Z",
		"ua_name": "Chrome Mobile",
		"ua_os_minor": "4",
		"ua_major": "58",
		"ua_device": "HM NOTE 1S",
		"v_entryType": "measure",
		"user_agent": "Mozilla/5.0 (Linux; Android 4.4.4; HM NOTE 1S Build/KTU84P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.83 Mobile Safari/537.36",
		"v_name": "gtm_time_to_first_tag_item",
		"v_duration": "5862.045",
		"ua_os_name": "Android"
	},
	"fields": {
		"@timestamp": [
			1495501281000
		]
	},
	"sort": [
		"Infinity"
	]
}

We could attach some more data if needed.
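
For illustration, here is a minimal Java sketch of the overflow behaviour just described. This is not Lucene's actual half_float encoding, and the exact rounding at the boundary is ignored; it only shows that the largest finite half-precision value is 65504, so a value like 75810.745 cannot be stored and effectively becomes Infinity, matching the "sort": ["Infinity"] above.

// Rough sketch of the half_float range limit; not the real encoding used by
// Elasticsearch/Lucene, and precision loss for in-range values is ignored.
class HalfFloatOverflowDemo {
    static final float HALF_FLOAT_MAX = 65504f;  // largest finite half-precision value

    // Hypothetical helper: what an out-of-range value effectively becomes.
    static float roundTripLikeHalfFloat(float value) {
        if (value > HALF_FLOAT_MAX) {
            return Float.POSITIVE_INFINITY;
        }
        if (value < -HALF_FLOAT_MAX) {
            return Float.NEGATIVE_INFINITY;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(roundTripLikeHalfFloat(5862.045f));   // v_duration: fits (with reduced precision)
        System.out.println(roundTripLikeHalfFloat(75810.745f));  // v_startTime: Infinity, as seen in "sort"
    }
}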


colings86 commented May 24, 2017

@xo4n thanks for the update. This is interesting. My current working theory is that the problem is caused by inserting the same value into the t-digest algorithm a huge number of times (in your case I think it's inserting either HalfFloat.MAX_VALUE or HalfFloat.POSITIVE_INFINITY for each of the docs that have a value outside the half_float range). I'm going to try to reproduce this in a test case. I'll report back what I find.
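
As a starting point, such a reproduction attempt might look something like the sketch below (assuming the t-digest 3.0 AVLTreeDigest constructor and add(double) method that appear in the stack trace; this may or may not be enough on its own to trigger the failure):

import com.tdunning.math.stats.AVLTreeDigest;

// Hypothetical reproduction sketch: feed the digest the same out-of-range
// value (here positive infinity) a large number of times and see whether a
// compression event recurses until the stack overflows.
public class TDigestInfinityRepro {
    public static void main(String[] args) {
        // 100 is the compression the percentiles aggregation uses by default.
        AVLTreeDigest digest = new AVLTreeDigest(100);
        for (int i = 0; i < 1_000_000; i++) {
            digest.add(Double.POSITIVE_INFINITY);
        }
        System.out.println(digest.quantile(0.5));
    }
}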


colings86 commented May 24, 2017

@xo4n also, do you have the full stack trace for this? Could you paste it in a gist (not here, as it will be very long) and link it here please? Could you also let me know how many documents match the search request that fails (you can find this by running the query without the percentiles agg and looking at the hits.total field)? And do you know how many documents contain values for "v_startTime" which are outside of the half_float range?


xo4n commented May 26, 2017

colings86 (Contributor) commented:

@xo4n unfortunately I haven't yet had any luck in reproducing this. I'm wondering if you would be willing/able to privately share your index with me to assist me in reproducing this? If you are not able to share the full index maybe you could use the reindex API to create an index where the documents only contain the field v_startTime and share that index with me?

colings86 (Contributor) commented:

@xo4n thanks for providing the index to me. I managed to reproduce the issue and it seems to be down to the out-of-range values, which are represented as Infinity in the doc_values. We think that the right fix here is to stop out-of-range values at index time by rejecting the request, so the infinite values don't end up in the index. I opened #25534 for this.
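
Purely to illustrate the idea of rejecting at index time (this is not the actual Elasticsearch change, which is tracked in #25534), the check amounts to something like:

// Hypothetical sketch of rejecting out-of-range half_float values at index
// time instead of silently storing them as Infinity; the real fix lives in
// the Elasticsearch mapping code referenced above.
class HalfFloatRangeCheck {
    static final float HALF_FLOAT_MAX = 65504f;

    static void validateHalfFloat(String field, double value) {
        if (Math.abs(value) > HALF_FLOAT_MAX) {
            throw new IllegalArgumentException(
                "[" + field + "] value [" + value + "] is out of range for a half_float field");
        }
    }
}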

markharwood (Contributor) commented:

The root cause of this (out-of-range numerics held in the index) looks to have been addressed by #25826.

Closing, but feel free to reopen if the issue remains.
