
Elasticsearch client node crashing with java.lang.StackOverflowError when running percentile calculations for long data sets #23003


Closed
xo4n opened this issue Feb 6, 2017 · 10 comments

Comments


xo4n commented Feb 6, 2017

Elasticsearch version:
{
"number" : "5.2.0",
"build_hash" : "24e05b9",
"build_date" : "2017-01-24T19:52:35.800Z",
"build_snapshot" : false,
"lucene_version" : "6.4.0"
}

Plugins installed: [reindexing, x-pack]

JVM version:
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)

OS version:
Debian 8.5

Description of the problem including expected versus actual behavior:
The Elasticsearch client node used by Kibana to send requests to the cluster crashes with:

[2017-02-06T18:07:25,659][INFO ][o.e.m.j.JvmGcMonitorService] [dashboard] [gc][7] overhead, spent [346ms] collecting in the last [1s]
[2017-02-06T18:18:24,961][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [dashboard] fatal error in thread [elasticsearch[dashboard][search][T#28]], exiting
java.lang.StackOverflowError: null
        at com.tdunning.math.stats.AVLTreeDigest.add(AVLTreeDigest.java:77) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.compress(AVLTreeDigest.java:165) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.add(AVLTreeDigest.java:135) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.compress(AVLTreeDigest.java:165) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.add(AVLTreeDigest.java:135) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.compress(AVLTreeDigest.java:165) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.add(AVLTreeDigest.java:135) ~[t-digest-3.0.jar:?]

Steps to reproduce:

  1. Running the following query from Kibana causes the Elasticsearch client node to crash:
{"index":["rum_custom-2017.02.03","rum_custom-2017.02.04","rum_custom-2017.02.05","rum_custom-2017.02.06","rum_custom-2017.01.31","rum_custom-2017.02.01","rum_custom-2017.02.02","rum_custom-2017.01.30","rum_custom-2017.01.24","rum_custom-2017.01.25","rum_custom-2017.01.26","rum_custom-2017.01.27","rum_custom-2017.01.22","rum_custom-2017.01.23","rum_custom-2017.01.28","rum_custom-2017.01.29"],"ignore_unavailable":true,"preference":1486404140629}
{"query":{"bool":{"must":[{"query_string":{"query":"v_entryType: \"mark\"","analyze_wildcard":true}},{"range":{"@timestamp":{"gte":1485108407472,"lte":1486404407472,"format":"epoch_millis"}}}],"must_not":[]}},"size":0,"_source":{"excludes":[]},"aggs":{"2":{"date_histogram":{"field":"@timestamp","interval":"12h","time_zone":"Europe/Berlin","min_doc_count":1},"aggs":{"3":{"terms":{"field":"v_name","size":10,"order":{"_term":"desc"}},"aggs":{"1":{"percentiles":{"field":"v_startTime","percents":[50],"keyed":false}}}}}}}} 
  2. The same query for a shorter period of time runs just fine:
{"index":["rum_custom-2017.02.03","rum_custom-2017.02.04","rum_custom-2017.02.05","rum_custom-2017.02.06","rum_custom-2017.01.31","rum_custom-2017.02.01","rum_custom-2017.02.02","rum_custom-2017.01.30","rum_custom-2017.01.27","rum_custom-2017.01.28","rum_custom-2017.01.29"],"ignore_unavailable":true,"preference":1486407007120}
{"query":{"bool":{"must":[{"query_string":{"analyze_wildcard":true,"query":"v_entryType: \"mark\""}},{"range":{"@timestamp":{"gte":1485543020838,"lte":1486407020838,"format":"epoch_millis"}}}],"must_not":[]}},"size":0,"_source":{"excludes":[]},"aggs":{"2":{"date_histogram":{"field":"@timestamp","interval":"3h","time_zone":"Europe/Berlin","min_doc_count":1},"aggs":{"3":{"terms":{"field":"v_name","size":10,"order":{"_term":"desc"}},"aggs":{"1":{"percentiles":{"field":"v_startTime","percents":[50],"keyed":false}}}}}}}} 



martinscholz83 commented Feb 9, 2017

Hi @xo4n, when you say "the same query for a shorter period of time", do you mean that you set a smaller interval, which produces fewer results per aggregation, or that you queried fewer indices?

clintongormley (Contributor) commented:

@colings86 any ideas?

colings86 (Contributor) commented:

@xo4n It looks like it's getting stuck in a compression event and not making any progress.

To explain this, let me describe a little about the T-Digest algorithm that's used for the percentiles aggregation. As the aggregation collects values from the matching documents it creates a histogram of centroids that combine similar values (the method of creating these centroids is not important here, but if you are interested I would advise reading the t-digest paper). These centroids continue being created as more documents are collected, and at some point a compression event is triggered to reduce the number of centroids back to a manageable number. The compression strategy is to re-insert all of the centroids in a random order, which, because of the way values are combined, should result in fewer centroids being created.

In your case it seems that a compression event has been triggered and, while re-inserting the centroids, it recursively triggers another compression event, which in turn triggers another, and so on until it overflows the stack. This is something I have not seen before.
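
To make that failure mode concrete, here is a deliberately simplified toy in Java (not the actual AVLTreeDigest code, whose merging logic is much more involved) showing how add() and compress() can end up calling each other until the stack overflows when re-insertion never manages to reduce the centroid count:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy illustration only: in the real algorithm nearby values are merged into
// existing centroids, so compress() normally shrinks the set. Here nothing
// ever merges, which is the degenerate case where add() and compress()
// recurse until a StackOverflowError is thrown.
class ToyDigest {
    private final int maxCentroids;
    private List<Double> centroids = new ArrayList<>();

    ToyDigest(int maxCentroids) {
        this.maxCentroids = maxCentroids;
    }

    void add(double value) {
        centroids.add(value);            // real t-digest would try to merge first
        if (centroids.size() > maxCentroids) {
            compress();                  // add() -> compress()
        }
    }

    private void compress() {
        List<Double> old = centroids;
        centroids = new ArrayList<>();
        Collections.shuffle(old);
        for (double c : old) {
            add(c);                      // compress() -> add() -> compress() -> ...
        }
    }

    public static void main(String[] args) {
        ToyDigest d = new ToyDigest(100);
        for (int i = 0; i <= 100; i++) {
            d.add(i);                    // the 101st value triggers the runaway recursion
        }
    }
}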

Unfortunately, this is probably going to be a little tricky to reproduce, since getting the same failure will require having the same data inserted into the algorithm. Please bear with me; I will try to get to the bottom of this for you.

The good news is that because you have time based indices and your top level aggregation is a date_histogram, we should be able to filter this down to a single index and a single 12 hour period within that index. Could you please run the same query against the following indices, one index at a time (it's best if you query Elasticsearch through something like Console for this rather than modifying your visualisation in Kibana)?

"rum_custom-2017.01.22"
"rum_custom-2017.01.23"
"rum_custom-2017.01.24"
"rum_custom-2017.01.25"
"rum_custom-2017.01.26"

Doing this should hopefully tell you which of these indices is causing the issue (my assumption is it is not the indices from 2017.01.27 to 2017.02.06 since the query across these indices worked).

Once you know which index is causing the issue you will have reduced the problem to two 12 hour buckets. The next step will be to modify the range query in your search request to search from 00:00:00 to 12:00:00 on that day and run the request to see if the issue reproduces; if not, change the range to 12:00:00 on that day through 00:00:00 the following day.

At this point you should have narrowed the problem down to a 12 hour period in a single index. If you are happy to share the index with me privately (I can give you details if you let me know you want to go this route), I could then try to debug the problem. Is this something you would be willing to do? If so, could you let me know how big the index is once you have identified it using the steps above?


xo4n commented May 24, 2017

We figured out which data could be causing this.

We were mapping some fields to half_float that actually contain bigger values than a half_float can handle. We are not sure what value Elasticsearch indexes in such cases, but when running percentile calculations on a field with several of these, the issue is reproducible.

Below is a data example (a small sketch of the overflow behaviour follows the JSON view). We send key-value fields as strings and map the numeric ones to half_float. Fields like "v_startTime" hold very high values in some cases, and those are the ones causing the issue. When sorting on those fields, the problematic documents already come back with "sort": ["Infinity"] instead of their numeric value.

Table View (excerpt)

# v_duration	   	5,862.045
t v_entryType	   	measure
t v_name	   	gtm_time_to_first_tag_item
# v_startTime	   	75,810.745

JSON View

{
	"_index": "rum_custom-2017.05.23",
	"_type": "logs",
	"_id": "AVwy0ptHX7d39WTa_HhR",
	"_score": null,
	"_source": {
		"geoip": {
			"country_code2": "IN",
			"continent_code": "AS"
		},
		"ua_build": "83",
		"v_startTime": "75810.745",
		"ua_os_major": "4",
		"ua_os": "Android 4.4.4",
		"locale": "IN",
		"ua_patch": "3029",
		"ua_minor": "0",
		"@timestamp": "2017-05-23T01:01:21.000Z",
		"ua_name": "Chrome Mobile",
		"ua_os_minor": "4",
		"ua_major": "58",
		"ua_device": "HM NOTE 1S",
		"v_entryType": "measure",
		"user_agent": "Mozilla/5.0 (Linux; Android 4.4.4; HM NOTE 1S Build/KTU84P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.83 Mobile Safari/537.36",
		"v_name": "gtm_time_to_first_tag_item",
		"v_duration": "5862.045",
		"ua_os_name": "Android"
	},
	"fields": {
		"@timestamp": [
			1495501281000
		]
	},
	"sort": [
		"Infinity"
	]
}

We could attach some more data if needed.
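
For illustration, here is a minimal Java sketch of the overflow behaviour just described. This is not Lucene's actual half_float encoding, and the exact rounding at the boundary is ignored; it only shows that the largest finite half-precision value is 65504, so a value like 75810.745 cannot be stored and effectively becomes Infinity, matching the "sort": ["Infinity"] above.

// Rough sketch of the half_float range limit; not the real encoding used by
// Elasticsearch/Lucene, and precision loss for in-range values is ignored.
class HalfFloatOverflowDemo {
    static final float HALF_FLOAT_MAX = 65504f;  // largest finite half-precision value

    // Hypothetical helper: what an out-of-range value effectively becomes.
    static float roundTripLikeHalfFloat(float value) {
        if (value > HALF_FLOAT_MAX) {
            return Float.POSITIVE_INFINITY;
        }
        if (value < -HALF_FLOAT_MAX) {
            return Float.NEGATIVE_INFINITY;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(roundTripLikeHalfFloat(5862.045f));   // v_duration: fits (with reduced precision)
        System.out.println(roundTripLikeHalfFloat(75810.745f));  // v_startTime: Infinity, as seen in "sort"
    }
}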


colings86 commented May 24, 2017

@xo4n thanks for the update. This is interesting. My current working theory is that the problem is caused by inserting the same value into the t-digest algorithm a huge number of times (in your case I think it's inserting either HalfFloat.MAX_VALUE or HalfFloat.POSITIVE_INFINITY for each of the docs that have a value outside the half_float range). I'm going to try to reproduce this in a test case. I'll report back what I find.
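
As a starting point, such a reproduction attempt might look something like the sketch below (assuming the t-digest 3.0 AVLTreeDigest constructor and add(double) method that appear in the stack trace; this may or may not be enough on its own to trigger the failure):

import com.tdunning.math.stats.AVLTreeDigest;

// Hypothetical reproduction sketch: feed the digest the same out-of-range
// value (here positive infinity) a large number of times and see whether a
// compression event recurses until the stack overflows.
public class TDigestInfinityRepro {
    public static void main(String[] args) {
        // 100 is the compression the percentiles aggregation uses by default.
        AVLTreeDigest digest = new AVLTreeDigest(100);
        for (int i = 0; i < 1_000_000; i++) {
            digest.add(Double.POSITIVE_INFINITY);
        }
        System.out.println(digest.quantile(0.5));
    }
}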


colings86 commented May 24, 2017

@xo4n also, do you have the full stack trace for this? Could you paste it in a gist (not here, as it will be very long) and link it here please? Could you also let me know how many documents match the search request that fails (you can find this by running the query without the percentiles agg and looking at the hits.total field)? And do you know how many documents contain values for "v_startTime" which are outside of the half_float range?


xo4n commented May 26, 2017

colings86 (Contributor) commented:

@xo4n unfortunately I haven't yet had any luck in reproducing this. I'm wondering if you would be willing/able to privately share your index with me to assist me in reproducing this? If you are not able to share the full index maybe you could use the reindex API to create an index where the documents only contain the field v_startTime and share that index with me?

colings86 (Contributor) commented:

@xo4n thanks for providing the index to me. I managed to reproduce the issue and it seems to be down to the out-of-range values, which are represented as Infinity in the doc_values. We think that the right fix here is to stop out-of-range values at index time by rejecting the request, so the infinite values don't end up in the index. I opened #25534 for this.
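
Purely to illustrate the idea of rejecting at index time (this is not the actual Elasticsearch change, which is tracked in #25534), the check amounts to something like:

// Hypothetical sketch of rejecting out-of-range half_float values at index
// time instead of silently storing them as Infinity; the real fix lives in
// the Elasticsearch mapping code referenced above.
class HalfFloatRangeCheck {
    static final float HALF_FLOAT_MAX = 65504f;

    static void validateHalfFloat(String field, double value) {
        if (Math.abs(value) > HALF_FLOAT_MAX) {
            throw new IllegalArgumentException(
                "[" + field + "] value [" + value + "] is out of range for a half_float field");
        }
    }
}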

markharwood (Contributor) commented:

The root cause of this (out-of-range numerics held in the index) looks to have been addressed by #25826.

Closing, but feel free to reopen if the issue remains.
