Percentile ranks metric bug? #10216


Closed
anobra opened this issue Mar 23, 2015 · 14 comments

Comments

@anobra

anobra commented Mar 23, 2015

I'm trying to use the percentile ranks metric and it behaves oddly. The Elasticsearch docs say: "Percentile rank show the percentage of observed values which are below certain value." That is exactly what I need, but I get strange results even for a small set of documents (5-10 docs):

{
  "responses": [
    {
      "took": 2,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 7,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "1": {
          "values": {
            "22.0": 70.12987012987013,
            "23.0": 50
          }
        }
      }
    }
  ]
}

Set of numbers I use: 20, 5, 10, 25, 14, 27, 13

I don't get how value 23 can show a lower percentage than value 22. Am I misunderstanding the idea of percentile ranks, or is this a bug?

I am using Elasticsearch 1.4.4.
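
For reference, the exact (non-approximate) percentile rank for this data set can be computed directly, which shows what a monotone result should look like. This is an illustrative Python sketch, not Elasticsearch's implementation (Elasticsearch computes percentile ranks with the approximate t-digest algorithm):

```python
def percentile_rank(data, value):
    """Exact percentile rank: percentage of observed values below `value`."""
    below = sum(1 for x in data if x < value)
    return 100.0 * below / len(data)

data = [20, 5, 10, 25, 14, 27, 13]

# Five of the seven values (5, 10, 13, 14, 20) lie below both 22 and 23,
# so the exact rank is identical for both queries and can never decrease
# as the queried value grows:
print(percentile_rank(data, 22))  # 71.42857142857143
print(percentile_rank(data, 23))  # 71.42857142857143
```

Whatever interpolation an approximate implementation applies, the rank must remain non-decreasing in the queried value; the drop from 70.1 to 50 in the response above cannot be a valid percentile rank curve.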

@colings86
Contributor

@drej82 thanks for raising this issue. It does look like a bug.

I have reproduced this on the master branch, on an index with only one shard, using the script below:

PUT percenttest
{
  "settings": {
    "number_of_shards": 1, 
    "number_of_replicas": 0
  }
}
POST percenttest/doc/1
{
  "i": 20
}
POST percenttest/doc/2
{
  "i": 5
}
POST percenttest/doc/3
{
  "i": 10
}
POST percenttest/doc/4
{
  "i": 25
}
POST percenttest/doc/5
{
  "i": 14
}
POST percenttest/doc/6
{
  "i": 27
}
POST percenttest/doc/7
{
  "i": 13
}
GET percenttest/_search?search_type=count
{
  "aggs": {
    "percent_rank": {
      "percentile_ranks": {
        "field": "i",
        "values": [
          1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
        ]
      }
    }
  }
}

The response from the last request is:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 7,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "percent_rank": {
         "values": {
            "1.0": 0,
            "1.0_as_string": "0.0",
            "2.0": 0,
            "2.0_as_string": "0.0",
            "3.0": 1.4285714285714286,
            "3.0_as_string": "1.4285714285714286",
            "4.0": 4.285714285714286,
            "4.0_as_string": "4.285714285714286",
            "5.0": 7.142857142857142,
            "5.0_as_string": "7.142857142857142",
            "6.0": 10,
            "6.0_as_string": "10.0",
            "7.0": 12.85714285714286,
            "7.0_as_string": "12.85714285714286",
            "8.0": 16.071428571428573,
            "8.0_as_string": "16.071428571428573",
            "9.0": 19.642857142857142,
            "9.0_as_string": "19.642857142857142",
            "10.0": 23.214285714285715,
            "10.0_as_string": "23.214285714285715",
            "11.0": 26.785714285714285,
            "11.0_as_string": "26.785714285714285",
            "12.0": 32.142857142857146,
            "12.0_as_string": "32.142857142857146",
            "13.0": 39.285714285714285,
            "13.0_as_string": "39.285714285714285",
            "14.0": 44.89795918367347,
            "14.0_as_string": "44.89795918367347",
            "15.0": 48.97959183673469,
            "15.0_as_string": "48.97959183673469",
            "16.0": 53.06122448979592,
            "16.0_as_string": "53.06122448979592",
            "17.0": 57.14285714285714,
            "17.0_as_string": "57.14285714285714",
            "18.0": 59.74025974025974,
            "18.0_as_string": "59.74025974025974",
            "19.0": 62.33766233766234,
            "19.0_as_string": "62.33766233766234",
            "20.0": 64.93506493506493,
            "20.0_as_string": "64.93506493506493",
            "21.0": 67.53246753246754,
            "21.0_as_string": "67.53246753246754",
            "22.0": 70.12987012987013,
            "22.0_as_string": "70.12987012987013",
            "23.0": 50,
            "23.0_as_string": "50.0",
            "24.0": 57.14285714285714,
            "24.0_as_string": "57.14285714285714",
            "25.0": 64.28571428571429,
            "25.0_as_string": "64.28571428571429",
            "26.0": 71.42857142857143,
            "26.0_as_string": "71.42857142857143",
            "27.0": 78.57142857142857,
            "27.0_as_string": "78.57142857142857",
            "28.0": 100,
            "28.0_as_string": "100.0",
            "29.0": 100,
            "29.0_as_string": "100.0",
            "30.0": 100,
            "30.0_as_string": "100.0"
         }
      }
   }
}

@jpountz any idea what might be going on with the decrease in value from 22 to 23?

@colings86
Contributor

The problem seems to happen at the value 22.5:

GET percenttest/_search?search_type=count
{
  "aggs": {
    "percent_rank": {
      "percentile_ranks": {
        "field": "i",
        "values": [
          22.4999999,22.50
        ]
      }
    }
  }
}

response:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 7,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "percent_rank": {
         "values": {
            "22.4999999": 71.42857116883116,
            "22.4999999_as_string": "71.42857116883116",
            "22.5": 46.42857142857143,
            "22.5_as_string": "46.42857142857143"
         }
      }
   }
}
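
The drop at exactly 22.5 points at an interpolation bug: a correctly interpolated cdf over t-digest centroids is continuous and non-decreasing. The sketch below is a heavily simplified stand-in for t-digest's cdf (assuming singleton centroids and plain linear interpolation between neighboring centroid means — not the real t-digest code) and illustrates the monotone behavior one would expect across 22.5:

```python
def cdf(x, centroids):
    """Simplified cdf over sorted (mean, count) centroids, interpolating
    linearly between neighboring means. Not the real t-digest algorithm."""
    total = sum(count for _, count in centroids)
    if x < centroids[0][0]:
        return 0.0
    if x >= centroids[-1][0]:
        return 1.0
    cumulative = 0.0
    for (m1, c1), (m2, c2) in zip(centroids, centroids[1:]):
        if m1 <= x < m2:
            # Half of c1 is attributed to the left of m1, then we
            # interpolate across the mass shared by the two centroids.
            left = cumulative + c1 / 2.0
            fraction = (x - m1) / (m2 - m1)
            return (left + fraction * (c1 + c2) / 2.0) / total
        cumulative += c1
    return 1.0

# Singleton centroids for the data set 5, 10, 13, 14, 20, 25, 27:
cents = [(v, 1) for v in (5, 10, 13, 14, 20, 25, 27)]

# A correct interpolation is continuous through 22.5 and non-decreasing:
grid = [i / 10.0 for i in range(0, 301)]
ranks = [100 * cdf(x, cents) for x in grid]
assert all(a <= b for a, b in zip(ranks, ranks[1:]))
```

The bug reported below — skipping the last interval during interpolation — is exactly the kind of off-by-one that breaks this monotonicity while leaving most queried values looking plausible.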

@anobra
Author

anobra commented Mar 28, 2015

I actually found a bug in t-digest: it skips the last interval when doing the linear interpolation. I will open a pull request with a fix.

For tracking purposes: tdunning/t-digest#44

@jpountz
Contributor

jpountz commented Mar 29, 2015

Thanks @drej82 for investigating this bug, opening a bug upstream and proposing a fix!

@anobra
Author

anobra commented Mar 31, 2015

The fix has been committed. Please let me know when it propagates to an official snapshot.

@jpountz
Contributor

jpountz commented Apr 3, 2015

Thanks! We'll watch for new t-digest releases and integrate it once a new version is released with this fix.

@jpountz
Contributor

jpountz commented Apr 6, 2015

A new release is out, I'll upgrade the dependency soon. http://search.maven.org/#artifactdetails%7Ccom.tdunning%7Ct-digest%7C3.1%7Cjar

@jpountz
Contributor

jpountz commented Apr 7, 2015

I tried to upgrade (see https://github.com/jpountz/elasticsearch/tree/upgrade/t-digest) but we have test failures because the new cdf computation returns 100 even for values below the max value. For instance, if I replay @colings86's example above, I get the following response:

      "percent_rank": {
         "values": {
            "1.0": 0,
            "2.0": 0,
            "3.0": 1.4285714285714286,
            "4.0": 4.285714285714286,
            "5.0": 7.142857142857142,
            "6.0": 10,
            "7.0": 12.85714285714286,
            "8.0": 16.071428571428573,
            "9.0": 19.642857142857142,
            "10.0": 23.214285714285715,
            "11.0": 26.785714285714285,
            "12.0": 32.142857142857146,
            "13.0": 39.285714285714285,
            "14.0": 44.89795918367347,
            "15.0": 48.97959183673469,
            "16.0": 53.06122448979592,
            "17.0": 57.14285714285714,
            "18.0": 59.74025974025974,
            "19.0": 62.33766233766234,
            "20.0": 64.93506493506493,
            "21.0": 67.53246753246754,
            "22.0": 70.12987012987013,
            "23.0": 73.46938775510205,
            "24.0": 77.55102040816327,
            "25.0": 81.63265306122449,
            "26.0": 100,
            "27.0": 100,
            "28.0": 100,
            "29.0": 100,
            "30.0": 100
         }
      }
   }

It returns 100 for 26 even though the max value is 27. But maybe it's still better than what we have currently, and we should make the test more lenient while also working on improving the cdf computation?
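
If the test is relaxed, one reasonable lenient check (a hypothetical sketch, not the actual Elasticsearch test suite) is to assert only the properties an approximate percentile rank must always satisfy — every rank lies in [0, 100] and ranks are non-decreasing in the queried value — rather than exact figures:

```python
# Percentile ranks reported for values 1..30 in the response above
# (rounded to two decimals):
ranks = [0, 0, 1.43, 4.29, 7.14, 10, 12.86, 16.07, 19.64, 23.21,
         26.79, 32.14, 39.29, 44.90, 48.98, 53.06, 57.14, 59.74,
         62.34, 64.94, 67.53, 70.13, 73.47, 77.55, 81.63, 100,
         100, 100, 100, 100]

# Invariants a lenient test can still enforce, even when the exact
# interpolated values drift between t-digest versions:
assert all(0 <= r <= 100 for r in ranks)
assert all(a <= b for a, b in zip(ranks, ranks[1:]))
```

Note that the original buggy output (70.13 at 22 followed by 50 at 23) fails the monotonicity check, while the new output above passes it despite the early saturation at 100.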

@colings86
Contributor

@jpountz I would agree: since t-digest is an approximation algorithm, it may calculate the 100th percentile at a lower value than the actual maximum, depending on how it buckets the data.

+1 on making the test a bit more lenient as well as improving the cdf computation

@jpountz
Contributor

jpountz commented Apr 7, 2015

OK, I'll do that then. Thanks for the feedback.

@anobra
Author

anobra commented Apr 7, 2015

The change I made couldn't cause this behavior; I wonder if Ted made some other changes. The way it works, it should return values < 100% for all neighbors of the max, even for neighbors greater than the max. Instead of relaxing the test, I'd suggest we find the root cause.


@rcrezende

Any update on this?


@polyfractal
Contributor

I tested this locally; it has since been fixed. It was probably fixed when we upgraded to t-digest 3.2: #28305
