-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Percentile ranks metric bug? #10216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@drej82 thanks for raising this issue. It does look like a bug. I have reproduced this on the master branch on an index with only one shard using the below script: PUT percenttest
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
}
}
POST percenttest/doc/1
{
"i": 20
}
POST percenttest/doc/2
{
"i": 5
}
POST percenttest/doc/3
{
"i": 10
}
POST percenttest/doc/4
{
"i": 25
}
POST percenttest/doc/5
{
"i": 14
}
POST percenttest/doc/6
{
"i": 27
}
POST percenttest/doc/7
{
"i": 13
}
GET percenttest/_search?search_type=count
{
"aggs": {
"percent_rank": {
"percentile_ranks": {
"field": "i",
"values": [
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
]
}
}
}
} The response from the last request is: {
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 7,
"max_score": 0,
"hits": []
},
"aggregations": {
"percent_rank": {
"values": {
"1.0": 0,
"1.0_as_string": "0.0",
"2.0": 0,
"2.0_as_string": "0.0",
"3.0": 1.4285714285714286,
"3.0_as_string": "1.4285714285714286",
"4.0": 4.285714285714286,
"4.0_as_string": "4.285714285714286",
"5.0": 7.142857142857142,
"5.0_as_string": "7.142857142857142",
"6.0": 10,
"6.0_as_string": "10.0",
"7.0": 12.85714285714286,
"7.0_as_string": "12.85714285714286",
"8.0": 16.071428571428573,
"8.0_as_string": "16.071428571428573",
"9.0": 19.642857142857142,
"9.0_as_string": "19.642857142857142",
"10.0": 23.214285714285715,
"10.0_as_string": "23.214285714285715",
"11.0": 26.785714285714285,
"11.0_as_string": "26.785714285714285",
"12.0": 32.142857142857146,
"12.0_as_string": "32.142857142857146",
"13.0": 39.285714285714285,
"13.0_as_string": "39.285714285714285",
"14.0": 44.89795918367347,
"14.0_as_string": "44.89795918367347",
"15.0": 48.97959183673469,
"15.0_as_string": "48.97959183673469",
"16.0": 53.06122448979592,
"16.0_as_string": "53.06122448979592",
"17.0": 57.14285714285714,
"17.0_as_string": "57.14285714285714",
"18.0": 59.74025974025974,
"18.0_as_string": "59.74025974025974",
"19.0": 62.33766233766234,
"19.0_as_string": "62.33766233766234",
"20.0": 64.93506493506493,
"20.0_as_string": "64.93506493506493",
"21.0": 67.53246753246754,
"21.0_as_string": "67.53246753246754",
"22.0": 70.12987012987013,
"22.0_as_string": "70.12987012987013",
"23.0": 50,
"23.0_as_string": "50.0",
"24.0": 57.14285714285714,
"24.0_as_string": "57.14285714285714",
"25.0": 64.28571428571429,
"25.0_as_string": "64.28571428571429",
"26.0": 71.42857142857143,
"26.0_as_string": "71.42857142857143",
"27.0": 78.57142857142857,
"27.0_as_string": "78.57142857142857",
"28.0": 100,
"28.0_as_string": "100.0",
"29.0": 100,
"29.0_as_string": "100.0",
"30.0": 100,
"30.0_as_string": "100.0"
}
}
}
} @jpountz any idea what might be going on with the decrease in value from 22 to 23? |
The problem seems to happen at the value 22.5: GET percenttest/_search?search_type=count
{
"aggs": {
"percent_rank": {
"percentile_ranks": {
"field": "i",
"values": [
22.4999999,22.50
]
}
}
}
} response: {
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 7,
"max_score": 0,
"hits": []
},
"aggregations": {
"percent_rank": {
"values": {
"22.4999999": 71.42857116883116,
"22.4999999_as_string": "71.42857116883116",
"22.5": 46.42857142857143,
"22.5_as_string": "46.42857142857143"
}
}
}
} |
I actually found a bug in t-digest, it skips the last interval while doing the linear interpolation. I will create pull request with a fix. For tracking purposes: tdunning/t-digest#44 |
Thanks @drej82 for investigating this bug, opening a bug upstream and proposing a fix! |
The fix was committed, please let me know when it propagates to official snapshot. |
Thanks! We'll watch for new t-digest releases and integrate it once a new version is released with this fix. |
A new release is out, I'll upgrade the dependency soon. http://search.maven.org/#artifactdetails%7Ccom.tdunning%7Ct-digest%7C3.1%7Cjar |
I tried to upgrade (see https://github.com/jpountz/elasticsearch/tree/upgrade/t-digest) but we have test failures due to the fact that the new cdf computation returns 100 even for values that are below the max value. For instance if I replay @colings86 's example above, I get the following response:
It returns 100 for 26 although the max value is 27. But maybe it's still better than what we have currently and we should make the test more lenient and work at the same time on improving the cdf computation? |
@jpountz I would agree, since t-digest is an approximate algorithm it may calculate the 100th percentile as a lower value that the actual data depending on how it buckets the data. +1 on making the test a bit more lenient as well as improving the cdf computation |
OK, I'll do that then. Thanks for the feedback. |
The change I've made couldn't cause such behavior. I'm wondering if Ted made some other changes. The way it works, it should return values < 100 % for all neighbors of max, even for neighbors > max. Instead of relaxing the test, I'd suggest we find out the root cause. Sent from my Windows Phone -----Original Message----- OK, I'll do that then. Thanks for the feedback. |
Any update on this? |
Any update on this? |
Tested this locally, has since been fixed. Was probably fixed when we upgraded to TDigest 3.2: #28305 |
I'm trying to use percentile rank metrics and it works quite odd. ElasticSearch docs say - "Percentile rank show the percentage of observed values which are below certain value." which is exactly what I need, though it gives queer results even for small set of docs (5-10 docs)
Set of numbers I use: 20, 5, 10, 25, 14, 27, 13
I don't get how it's possible that value 23 shows lower percentage than value 22. Do I misunderstand idea of percentile ranks or it's a bug?
I am using Elastic search 1.4.4
The text was updated successfully, but these errors were encountered: