min_score doesn't seem to prune irrelevant results #14455

Closed
lazymachinist opened this issue Nov 2, 2015 · 6 comments
Labels
>bug :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@lazymachinist

Scenario: run a significant_terms aggregation on a random subset (about 10%) of the matching documents. The goal is to speed up the aggregation by computing it over only a portion of the results.
Issue: assigning the score in script_score and applying min_score doesn't seem to influence the final result. The following query matches 44k documents in the database, and all of them seem to get aggregated.

query:

call = {
    'aggregations':
    {
        'foo_aggregation':
        {
            'significant_terms':
            {
                'field': 'bar.text',
                'size': 100
            }
        }
    },
    'query':
    {
        'function_score':
        {
            'boost_mode': 'replace',
            'min_score': 0.5,
            'query': 
            {
                'function_score':
                {
                    'query': {
                      'filtered': {
                        'filter': {
                          'bool': {
                            'must': [
                              {'query': { 'match': {'foo_field': {'query': 'foo',
                                                                  'type': 'phrase'}}}},
                              {'query': {'match': {'foo_field': {'query': 'bar',
                                                                 'type': 'phrase'}}}},
                              {'query': {'match': {'foo_field': {'query': 'baz',
                                                                 'type': 'phrase'}}}}]}}}},
                    'random_score': {}
                }
            },
            'script_score':
            {
                'lang': 'expression',
                'script': '(10.0 * _score > 1)?0.0:1.0'
            }
        }
    }
}

.search(body = call, search_type = 'count')

Result:

{u'_shards': {u'failed': 0, u'successful': 6, u'total': 6},
 u'aggregations': {u'foo_aggregation': {u'buckets': [{u'bg_count': 437361,
                                          u'doc_count': 16528,
                                          u'key': u'a',
                                          u'score': 0.16617895961177376},
                                         {u'bg_count': 214256,
                                          u'doc_count': 8869,
                                          u'key': u'b',
                                          u'score': 0.1165279936014176},
                                         {u'bg_count': 20459,
                                          u'doc_count': 1692,
                                          u'key': u'c',
                                          u'score': 0.08204490448968814},
                                         {u'bg_count': 215203,
                                          u'doc_count': 7889,
                                          u'key': u'd',
                                          u'score': 0.07167727779372135},
                                         {u'bg_count': 502079,
                                          u'doc_count': 15660,
                                          u'key': u'e',
                                          u'score': 0.06899975718865733},
                                         {u'bg_count': 24842,
                                          u'doc_count': 1681,
                                          u'key': u'f',
                                          u'score': 0.059883089238118345},
                                         {u'bg_count': 163057,
                                          u'doc_count': 5767,
                                          u'key': u'g',
                                          u'score': 0.0460286671961764},
                                         {u'bg_count': 17804,
                                          u'doc_count': 1208,
                                          u'key': u'h',
                                          u'score': 0.04322160149374382},
                                         {u'bg_count': 56574,
                                          u'doc_count': 2570,
                                          u'key': u'i',
                                          u'score': 0.04263653815448131},
                                         {u'bg_count': 161090,
                                          u'doc_count': 5617,
                                          u'key': u'j',
                                          u'score': 0.04243133744106531}],
                            u'doc_count': 44870}},
 u'hits': {u'hits': [], u'max_score': 0.0, u'total': 44870},
 u'timed_out': False,
 u'took': 1127}

Expected:
Aggregation based on ~4400 results, not 44000.

Is this intended behaviour, and if so, is there a way to exclude results from the final result set used in the aggregation?
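
For reference, the arithmetic behind the expected ~4400: the query relies on random_score giving each document a roughly uniform value in [0, 1), the expression turns scores of at most 0.1 into 1.0 and everything else into 0.0, and min_score 0.5 should then keep only the 1.0s. A minimal Python simulation of that intent follows. It models only the scoring logic with Python's random module, not Elasticsearch itself, and the uniform distribution and both function names are assumptions made for illustration:

```python
import random


def script_score(score):
    # Mirrors the expression '(10.0 * _score > 1)?0.0:1.0' from the query.
    return 0.0 if 10.0 * score > 1 else 1.0


def surviving_docs(total_docs, min_score=0.5, seed=0):
    # Stand-in for random_score: one uniform value in [0, 1) per document
    # (assumed distribution, not taken from Elasticsearch).
    rng = random.Random(seed)
    return sum(
        1 for _ in range(total_docs)
        if script_score(rng.random()) >= min_score
    )


kept = surviving_docs(44870)
print(kept, kept / 44870)
```

If min_score were applied, the aggregation's doc_count should land near a tenth of 44870 rather than at 44870.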

@jpountz jpountz self-assigned this Nov 2, 2015
@jpountz
Contributor

jpountz commented Nov 3, 2015

I tried it locally and it worked for me with elasticsearch 2.0. Which version are you running?

@lazymachinist
Author

Hi Adrien,

This is the version:
"version" : {
"number" : "2.0.0",
"build_hash" : "de54438d6af8f9340d50c5c786151783ce7d6be5",
"build_timestamp" : "2015-10-22T08:09:48Z",
"build_snapshot" : false,
"lucene_version" : "5.2.1"
},
Can you post an example query that worked (i.e. returned results based on a portion of the original matches)? Maybe I am missing some important detail.

@clintongormley clintongormley added feedback_needed :Search/Search Search-related issues that do not fall into other categories labels Nov 8, 2015
@clintongormley
Contributor

I can replicate this. If you set size=0, then the min_score condition isn't applied. This is on ES 2.2.0:

POST t/t/_bulk
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}

With size=1, it works as expected. With size=0, all 14 results are aggregated.

GET _search?size=1
{
  "aggregations": {
    "foo_aggregation": {
      "significant_terms": {
        "field": "bar.text",
        "size": 100
      }
    }
  },
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "min_score": 0.5,
      "query": {
        "function_score": {
          "query": {
            "filtered": {
              "filter": {
                "bool": {
                  "must": [
                    {
                      "match": {
                        "foo_field": {
                          "query": "foo",
                          "type": "phrase"
                        }
                      }
                    },
                    {
                      "match": {
                        "foo_field": {
                          "query": "bar",
                          "type": "phrase"
                        }
                      }
                    },
                    {
                      "match": {
                        "foo_field": {
                          "query": "baz",
                          "type": "phrase"
                        }
                      }
                    }
                  ]
                }
              }
            }
          },
          "random_score": {}
        }
      },
      "script_score": {
        "lang": "expression",
        "script": "(10.0 * _score > 1)?0.0:1.0"
      }
    }
  }
}

@talevy
Contributor

talevy commented Mar 23, 2018

Pinging @elastic/es-search-aggs

Not sure I understand the concern here.

I've updated this example to work on Elasticsearch 6.x:

DELETE t

PUT t
{
  "mappings": {
    "t": {
      "properties": {
        "bar": {
          "properties": {
            "text": {
              "type": "text",
              "fielddata": true
            }
          }
        }
      }
    }
  }
}


POST t/t/_bulk
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}



GET t/t/_search?size=0
{
  "aggregations": {
    "foo_aggregation": {
      "significant_terms": {
        "field": "bar.text",
        "size": 100
      }
    }
  },
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "min_score": 0.5,
      "query": {
        "function_score": {
          "query": {
            "constant_score": {
              "filter": {
                "bool": {
                  "must": [
                    {
                      "match_phrase": {
                        "foo_field": "foo"
                      }
                    },
                    {
                      "match_phrase": {
                        "foo_field": "bar"
                      }
                    },
                    {
                      "match_phrase": {
                        "foo_field": "baz"
                      }
                    }
                  ]
                }
              }
            }
          },
          "random_score": {}
        }
      },
      "script_score": {
        "script": "(10.0 * _score > 1)?0.0:1.0"
      }
    }
  }
}

One of the (randomly varying) responses came back as:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "foo_aggregation": {
      "doc_count": 4,
      "bg_count": 14,
      "buckets": [
        {
          "key": "brown",
          "doc_count": 4,
          "score": 0.27272727272727276,
          "bg_count": 11
        },
        {
          "key": "fox",
          "doc_count": 4,
          "score": 0.27272727272727276,
          "bg_count": 11
        },
        {
          "key": "quick",
          "doc_count": 4,
          "score": 0.27272727272727276,
          "bg_count": 11
        }
      ]
    }
  }
}

Do you mean that you would not expect buckets for terms with a score < 0.5, whereas we in fact see three with scores of 0.27?
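
For what it's worth, the 0.27 on each bucket looks like the significance score computed by significant_terms, not the query _score that min_score filters on. Assuming the default JLH heuristic, (fg% - bg%) * (fg% / bg%), the counts in the response above reproduce it. A small Python check (the function name and the handling of the fg <= bg case are illustrative assumptions, not copied from Elasticsearch):

```python
def jlh_score(doc_count, subset_size, bg_count, superset_size):
    # Foreground: share of the sampled docs containing the term.
    fg = doc_count / subset_size
    # Background: share of the whole index containing the term.
    bg = bg_count / superset_size
    if fg <= bg:
        # Assumed here: terms that are not over-represented score zero.
        return 0.0
    # Absolute change multiplied by relative change.
    return (fg - bg) * (fg / bg)


# Counts from the "brown" bucket: doc_count 4 of 4, bg_count 11 of 14.
score = jlh_score(4, 4, 11, 14)
print(score)
```

So a bucket score below 0.5 does not by itself say anything about whether min_score was applied; the hits total (4 rather than 14) is the relevant number.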

@polyfractal
Contributor

Regarding the original goal (sub-sampling the result set for improved performance): random_score won't help much, imo. It still has to visit each document to generate the score before excluding it, and visiting the doc is half the performance battle. I imagine it would save some time on the heavier aggs (like sig terms), but it would be nicer to skip the doc completely.

There's a proof-of-concept sampling query that I was working on (#25561) which may help, if we can figure out how to make it work. :)

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@javanna
Member

javanna commented Oct 17, 2022

Elasticsearch now supports a random sampler aggregation, which would nicely address the initial use case of this issue: executing significant_terms on a subset of documents.
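
A rough sketch of what such a request body might look like, written as a Python dict in the style of the original report. The random_sampler aggregation and its probability parameter are real, but the helper name, the "sampling" wrapper name, and the assumption that significant_terms is accepted as a sub-aggregation in your version are not confirmed in this thread, so check the documentation for the release you run:

```python
import json


def sampled_significant_terms(field, probability=0.1, size=100):
    # Hypothetical helper: wraps significant_terms in a random_sampler so
    # that only about `probability` of the matching docs are aggregated.
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {
            "sampling": {
                "random_sampler": {"probability": probability},
                "aggs": {
                    "foo_aggregation": {
                        "significant_terms": {"field": field, "size": size}
                    }
                },
            }
        },
    }


body = sampled_significant_terms("bar.text")
print(json.dumps(body, indent=2))
```

The filter from the original report would go under a top-level "query" key in the same body; no function_score or min_score is needed, since the sampler does the subsetting.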

@javanna javanna closed this as not planned Oct 17, 2022