sum_of_squares calculation and docs don't align #50416

Closed
jmceniery opened this issue Dec 20, 2019 · 6 comments
Labels
:Analytics/Aggregations, >bug, >docs, >enhancement, Team:Analytics, Team:Docs

Comments

@jmceniery
Member

jmceniery commented Dec 20, 2019

Either the link in the docs to a Wikipedia article is incorrect and should be removed, or the function that calculates the sum of squares does not match its intended usage and needs to be updated.

Describe the feature:
Expecting a sum_of_squares output in the Extended Stats Aggregation to align with the formula provided by Wikipedia and many other statistical sources.

The Extended Stats Aggregation currently computes sum_of_squares with the statement sumOfSquares += value * value; whereas the docs reference a Wikipedia article that gives a different formula.

Elasticsearch version: Tested on 6.5.4, 7.5.0

Description of the problem including expected versus actual behavior:
The current calculation of sum_of_squares does not match the statistical technique of the same name.

Sum of squares is a statistical technique used in regression analysis to determine the dispersion of data points. In a regression analysis, the goal is to determine how well a data series can be fitted to a function that might help to explain how the data series was generated. Sum of squares is used as a mathematical way to find the function that best fits (varies least from) the data.

Many sum of squares calculators do not align to the way the sum of
Steps to reproduce:

List of Numbers: 74.01,74.77,73.94,73.61,73.40
Expected outcome:

SS = (74.01 - 73.95)^2 + (74.77 - 73.95)^2 + (73.94 - 73.95)^2 + (73.61 - 73.95)^2 + (73.40 - 73.95)^2
SS = (0.06)^2 + (0.82)^2 + (-0.01)^2 + (-0.34)^2 + (-0.55)^2
SS = 1.0942

Actual Outcome:
Elasticsearch appears to use the following formula to calculate the sum_of_squares:

SS = (74.01)^2 + (74.77)^2 + (73.94)^2 + (73.61)^2 + (73.40)^2
SS = 5477.4801 + 5590.5529 + 5467.1236 + 5418.4321 + 5387.56
SS = 27341.1487
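
The difference between the two calculations above can be sketched in a few lines of Python. This is an illustrative sketch only, not Elasticsearch code, and the function names sum_of_squared_values and total_sum_of_squares are hypothetical. It uses the unrounded mean (73.946), so the second result comes out near 1.0941 rather than the 1.0942 obtained above with the mean rounded to 73.95.

```python
# Grades from the reproduction steps in this issue.
grades = [74.01, 74.77, 73.94, 73.61, 73.40]


def sum_of_squared_values(values):
    """Sum of x^2: what extended_stats currently reports as sum_of_squares."""
    return sum(x * x for x in values)


def total_sum_of_squares(values):
    """Sum of (x - mean)^2: the statistical definition expected in this issue."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


if __name__ == "__main__":
    print("sum of squared values:", round(sum_of_squared_values(grades), 4))
    print("total sum of squares: ", round(total_sum_of_squares(grades), 4))
```

Note that total_sum_of_squares as written needs the mean before it can sum the deviations, so it makes two passes over the data.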

Recreate:
Created an index:

PUT /sum_of_squares_test_2
{
    "settings" : {
        "index" : {
            "number_of_shards" : 1, 
            "number_of_replicas" : 1
        }
    }
}

Add some docs:

POST /sum_of_squares_test_2/_doc/
{
  "grade": 74.01
}

POST /sum_of_squares_test_2/_doc
{
  "grade": 74.77
}

POST /sum_of_squares_test_2/_doc
{
  "grade": 73.94
}

POST /sum_of_squares_test_2/_doc
{
  "grade": 73.61
}

POST /sum_of_squares_test_2/_doc
{
  "grade": 73.40
}

Search the index:

GET sum_of_squares_test_2/_search
{
  "size":0,
  "aggs":{
    "grade_stats":{
      "extended_stats":{
        "field":"grade"
      }
    }
  }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "grade_stats" : {
      "count" : 5,
...
      "sum_of_squares" : 27341.149189099146,
...
      }
    }
  }
}

Search the index using SQL:

POST /_xpack/sql?format=txt
{
    "query": "SELECT SUM_OF_SQUARES(grade) AS sumsq FROM sum_of_squares_test_2"
}

Response:

      sumsq       
------------------
27341.149189099146

Can the statistical method also be added if the current method is working as intended? The link in the docs will need to be removed if the current method is correct. I am happy to put in the PR once I have the clarification.

@elasticmachine
Collaborator

Pinging @elastic/es-docs (>docs)

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@nik9000
Member

nik9000 commented Dec 23, 2019

> Can the statistical method also be added if the current method is working as intended? The link in the docs will need to be removed if the current method is correct. I am happy to put in the PR once I have the clarification.

Dropping the link to Wikipedia seems appropriate. I'm actually not sure you can add the method the link describes now without two passes, which is something we don't support right now.

@polyfractal, do I speak the truth here?

@polyfractal
Contributor

Hmm, I think there are multiple things going wrong here :)

  1. ++ to removing that link given the current implementation. The "sum of squares" that extended_stats computes is more of an internal value computed for single-pass variance/standard deviation calculation. It is simply the sum of squared values, not necessarily the "total sum of squares" statistical value.

    The extended_stats agg calculates variance and standard deviation by collecting the sum of values and sum of squared values separately in a single pass, and then computing var/std at the end. I thought we were using Welford's algorithm, but it appears we are just using the "naive" formula.

  2. We should probably look into changing to a better algo, since that can lead to precision problems in some scenarios (which I don't think our Kahan summation will catch, because it arises from the difference of sumSq and sum being very small)

  3. Looks like we don't describe sum_of_squares in the extended_stats docs either, which we should probably fix as well.

  4. While the current sum_of_squares value is simply the sum of squared values, I think we could return the true TSS. If my summation algebra isn't entirely off base, we're calculating the TSS when calculating the variance here: (sumOfSqrs - ((sum * sum) / count)) (with variance being that value divided by count)

  5. @elastic/es-sql is SUM_OF_SQUARES a standard SQL function that we need to support, or just something that is exposed because extended_stats has it? It would probably be more appropriate to call it SUM_OF_SQUARED_VALUES or something, given the current definition. Probably OK to leave if we decide to make a change re: 4) though.
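
Points 1, 2, and 4 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the Elasticsearch implementation: naive_stats and welford_stats are hypothetical names, the first mirrors the (sumOfSqrs - ((sum * sum) / count)) expression quoted in point 4, and the second is a textbook form of Welford's algorithm. Both make a single pass and return the total sum of squares together with the population variance.

```python
def naive_stats(values):
    """Single pass: accumulate sum and sum of squared values, derive TSS at the end."""
    count = 0
    total = 0.0
    sum_of_sqrs = 0.0
    for v in values:
        count += 1
        total += v
        sum_of_sqrs += v * v
    # Same algebra as (sumOfSqrs - ((sum * sum) / count)) in point 4.
    tss = sum_of_sqrs - (total * total) / count
    return tss, tss / count


def welford_stats(values):
    """Single pass: Welford's running update of the mean and squared deviations."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for v in values:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
    return m2, m2 / count


if __name__ == "__main__":
    grades = [74.01, 74.77, 73.94, 73.61, 73.40]
    print("naive   (tss, variance):", naive_stats(grades))
    print("welford (tss, variance):", welford_stats(grades))
```

On the grades from this issue both functions should agree closely. The naive form subtracts two large, nearly equal numbers, so it can lose precision when the values are large relative to their spread, which is the concern raised in point 2; the Welford form avoids that subtraction.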

@costin
Member

costin commented Jan 5, 2020

@polyfractal SUM_OF_SQUARES is not standard SQL, it is specific to Elasticsearch and is exposed thanks to extended_stats.
Personally I'm in favor of 4 (if possible, keeping the current value as SUM_OF_SQUARED_VALUES wouldn't hurt).

jmceniery added a commit that referenced this issue Feb 11, 2020
I have removed the link to Wikipedia as the linked page does not reflect what is actually being calculated. As per the following issue:

#50416
jmceniery added a commit that referenced this issue Feb 16, 2020
Removed the link to Wikipedia as the function is not calculating the sum of squares in this way. More can be found here at this issue:

#50416
jrodewig pushed a commit that referenced this issue Apr 20, 2020
…52398)

Removed the link to Wikipedia as the function is not calculating the sum of squares in this way. More can be found here at this issue:

#50416
rjernst added the Team:Analytics and Team:Docs labels May 4, 2020
@jrodewig
Contributor

Closed by #52398
