min_score doesn't seem to prune irrelevant results #14455

lazymachinist · 2015-11-02T21:18:57Z

Scenario: run a significant_terms aggregation on a random subset (10%) of the results. The goal is to speed up the aggregation by processing only a portion of the matching documents.
Issue: computing a score in script_score and applying min_score doesn't seem to influence the final result. The following query matches about 44k documents in the index, and all of them seem to get aggregated.

query:

call = {
    'aggregations':
    {
        'foo_aggregation':
        {
            'significant_terms':
            {
                'field': 'bar.text',
                'size': 100
            }
        }
    },
    'query':
    {
        'function_score':
        {
            'boost_mode': 'replace',
            'min_score': 0.5,
            'query': 
            {
                'function_score':
                {
                    'query': {
                      'filtered': {
                        'filter': {
                          'bool': {
                            'must': [
                              {'query': { 'match': {'foo_field': {'query': 'foo',
                                                                  'type': 'phrase'}}}},
                              {'query': {'match': {'foo_field': {'query': 'bar',
                                                                 'type': 'phrase'}}}},
                              {'query': {'match': {'foo_field': {'query': 'baz',
                                                                 'type': 'phrase'}}}}]}}}},
                    'random_score': {}
                }
            },
            'script_score':
            {
                'lang': 'expression',
                'script': '(10.0 * _score > 1)?0.0:1.0'
            }
        }
    }
}

# `es` is assumed to be an elasticsearch-py client instance; the original post omitted the receiver
es.search(body=call, search_type='count')

Result:

{u'_shards': {u'failed': 0, u'successful': 6, u'total': 6},
 u'aggregations': {u'foo_aggregation': {u'buckets': [{u'bg_count': 437361,
                                          u'doc_count': 16528,
                                          u'key': u'a',
                                          u'score': 0.16617895961177376},
                                         {u'bg_count': 214256,
                                          u'doc_count': 8869,
                                          u'key': u'b',
                                          u'score': 0.1165279936014176},
                                         {u'bg_count': 20459,
                                          u'doc_count': 1692,
                                          u'key': u'c',
                                          u'score': 0.08204490448968814},
                                         {u'bg_count': 215203,
                                          u'doc_count': 7889,
                                          u'key': u'd',
                                          u'score': 0.07167727779372135},
                                         {u'bg_count': 502079,
                                          u'doc_count': 15660,
                                          u'key': u'e',
                                          u'score': 0.06899975718865733},
                                         {u'bg_count': 24842,
                                          u'doc_count': 1681,
                                          u'key': u'f',
                                          u'score': 0.059883089238118345},
                                         {u'bg_count': 163057,
                                          u'doc_count': 5767,
                                          u'key': u'g',
                                          u'score': 0.0460286671961764},
                                         {u'bg_count': 17804,
                                          u'doc_count': 1208,
                                          u'key': u'h',
                                          u'score': 0.04322160149374382},
                                         {u'bg_count': 56574,
                                          u'doc_count': 2570,
                                          u'key': u'i',
                                          u'score': 0.04263653815448131},
                                         {u'bg_count': 161090,
                                          u'doc_count': 5617,
                                          u'key': u'j',
                                          u'score': 0.04243133744106531}],
                            u'doc_count': 44870}},
 u'hits': {u'hits': [], u'max_score': 0.0, u'total': 44870},
 u'timed_out': False,
 u'took': 1127}

Expected:
Aggregation based on ~4400 results, not 44000.

Is this an intended behaviour, and if so, is there a way to exclude results from the final resultset used in aggregation?
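
For reference, the ~10% expectation can be checked offline. The following is a minimal sketch, not Elasticsearch code: it assumes that random_score produces values uniformly distributed in [0, 1), that the inner filtered query contributes a constant score of 1, and that min_score keeps documents whose score is greater than or equal to the threshold. The function names are made up for illustration.

```python
import random


def keeps_document(random_score: float, min_score: float = 0.5) -> bool:
    """Mirror the expression script '(10.0 * _score > 1)?0.0:1.0' plus min_score."""
    new_score = 0.0 if 10.0 * random_score > 1 else 1.0
    # Assumption: min_score drops documents scoring below the threshold.
    return new_score >= min_score


def simulate(total_docs: int, seed: int = 42) -> int:
    """Count how many of total_docs would survive, using uniform random scores."""
    rng = random.Random(seed)
    return sum(keeps_document(rng.random()) for _ in range(total_docs))


TOTAL = 44870  # hits.total reported in the response above
kept = simulate(TOTAL)
print(kept, round(kept / TOTAL, 3))
```

Under those assumptions only documents with a random score of at most 0.1 survive, so the aggregation should see roughly a tenth of the matches.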

jpountz · 2015-11-03T09:47:47Z

I tried it locally and it worked for me with elasticsearch 2.0. Which version are you running?

lazymachinist · 2015-11-03T17:20:55Z

Hi Adrien,

This is the version:
"version" : {
  "number" : "2.0.0",
  "build_hash" : "de54438d6af8f9340d50c5c786151783ce7d6be5",
  "build_timestamp" : "2015-10-22T08:09:48Z",
  "build_snapshot" : false,
  "lucene_version" : "5.2.1"
},
Can you post an example query that worked (i.e. returned results based on a portion of the original matches)? Maybe I am missing some important detail.

clintongormley · 2016-02-14T15:13:38Z

I can replicate this. If you set size=0, then the min_score condition isn't applied. This is on ES 2.2.0:

POST t/t/_bulk
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}

With size=1, it works as expected. With size=0, all 14 results are aggregated.

GET _search?size=1
{
  "aggregations": {
    "foo_aggregation": {
      "significant_terms": {
        "field": "bar.text",
        "size": 100
      }
    }
  },
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "min_score": 0.5,
      "query": {
        "function_score": {
          "query": {
            "filtered": {
              "filter": {
                "bool": {
                  "must": [
                    {
                      "match": {
                        "foo_field": {
                          "query": "foo",
                          "type": "phrase"
                        }
                      }
                    },
                    {
                      "match": {
                        "foo_field": {
                          "query": "bar",
                          "type": "phrase"
                        }
                      }
                    },
                    {
                      "match": {
                        "foo_field": {
                          "query": "baz",
                          "type": "phrase"
                        }
                      }
                    }
                  ]
                }
              }
            }
          },
          "random_score": {}
        }
      },
      "script_score": {
        "lang": "expression",
        "script": "(10.0 * _score > 1)?0.0:1.0"
      }
    }
  }
}
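
To make the size=0 versus size=1 comparison repeatable, a small helper along these lines may be useful. It is only a sketch: `totals_by_size` is a hypothetical name, and `search` is any callable that accepts `body` and `size` keyword arguments and returns a response dict shaped like the ones in this thread (for example a thin wrapper around an elasticsearch-py client).

```python
from typing import Any, Callable, Dict, Iterable


def totals_by_size(
    search: Callable[..., Dict[str, Any]],
    body: Dict[str, Any],
    sizes: Iterable[int] = (0, 1),
) -> Dict[int, Any]:
    """Run the same request body once per `size` and collect hits.total."""
    totals = {}
    for size in sizes:
        response = search(body=body, size=size)
        # On 2.x/6.x hits.total is a plain integer; newer versions return an object.
        totals[size] = response["hits"]["total"]
    return totals


# Hypothetical usage, assuming `es` is an elasticsearch-py client:
# totals_by_size(lambda **kw: es.search(index="t", **kw), body)
```

Because random_score changes between runs, the size=1 total will vary, but if min_score is applied it should stay well below the full match count.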

talevy · 2018-03-23T21:49:41Z

Pinging @elastic/es-search-aggs

I'm not sure I understand the concern here. I've updated the example to work on Elasticsearch 6.x:

DELETE t

PUT t
{
  "mappings": {
    "t": {
      "properties": {
        "bar": {
          "properties": {
            "text": {
              "type": "text",
              "fielddata": true
            }
          }
        }
      }
    }
  }
}


POST t/t/_bulk
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}
{"index":{}}
{"foo_field":"foo bar baz","bar":{"text":"quick brown fox"}}



GET t/t/_search?size=0
{
  "aggregations": {
    "foo_aggregation": {
      "significant_terms": {
        "field": "bar.text",
        "size": 100
      }
    }
  },
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "min_score": 0.5,
      "query": {
        "function_score": {
          "query": {
            "constant_score": {
              "filter": {
                "bool": {
                  "must": [
                    {
                      "match_phrase": {
                        "foo_field": "foo"
                      }
                    },
                    {
                      "match_phrase": {
                        "foo_field": "bar"
                      }
                    },
                    {
                      "match_phrase": {
                        "foo_field": "baz"
                      }
                    }
                  ]
                }
              }
            }
          },
          "random_score": {}
        }
      },
      "script_score": {
        "script": "(10.0 * _score > 1)?0.0:1.0"
      }
    }
  }
}

One (random) response came back as:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "foo_aggregation": {
      "doc_count": 4,
      "bg_count": 14,
      "buckets": [
        {
          "key": "brown",
          "doc_count": 4,
          "score": 0.27272727272727276,
          "bg_count": 11
        },
        {
          "key": "fox",
          "doc_count": 4,
          "score": 0.27272727272727276,
          "bg_count": 11
        },
        {
          "key": "quick",
          "doc_count": 4,
          "score": 0.27272727272727276,
          "bg_count": 11
        }
      ]
    }
  }
}

Is the concern that you would not expect buckets for terms with a score < 0.5, whereas in fact we see three with scores of 0.27?
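
A note on the 0.27 figure: the bucket "score" is the significance score of the term, not the document _score that min_score filters on. As a rough check, assuming the default JLH heuristic of significant_terms, the value can be reproduced from the counts in the response above. The function below is an illustrative re-implementation, not the Elasticsearch source.

```python
def jlh_score(subset_freq: int, subset_size: int,
              superset_freq: int, superset_size: int) -> float:
    """JLH significance: absolute change times relative change in term frequency."""
    fg = subset_freq / subset_size      # share of sampled docs containing the term
    bg = superset_freq / superset_size  # share of background docs containing the term
    if fg <= bg:
        return 0.0  # terms that are not more frequent in the subset score zero
    return (fg - bg) * (fg / bg)


# Counts from the response: bucket doc_count=4, aggregation doc_count=4,
# bucket bg_count=11, aggregation bg_count=14.
score = jlh_score(4, 4, 11, 14)
print(score)
```

This works out to 3/11, about 0.2727, matching the buckets. The doc_count of 4 out of 14 documents indicates that min_score did restrict the aggregated set in this run.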

polyfractal · 2018-03-27T19:03:02Z

Regarding the original goal -- sub-sampling the result set for better performance -- random_score won't help much imo. It still has to visit each document to generate the score before excluding it, and visiting the doc is half the performance battle. I imagine it would save some time on the heavier aggs (like sig terms), but it would be nice to skip the doc completely.

There's a proof-of-concept sampling query that I was working on (#25561) which may help, if we can figure out how to make it work. :)

javanna · 2022-10-17T09:38:19Z

Elasticsearch now supports a random sampler aggregation, which would nicely address the initial use case of this issue: executing significant_terms on a subset of documents.
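
A request using that aggregation might look roughly like the sketch below, which only builds the request body. The helper name and the "sampling" aggregation name are hypothetical, the field and bucket names are reused from the examples in this thread, and it has not been run against a cluster; check the random_sampler documentation for the allowed `probability` values and for which sub-aggregations are supported.

```python
import json
from typing import Any, Dict


def sampled_significant_terms(field: str, probability: float = 0.1,
                              size: int = 100) -> Dict[str, Any]:
    """Build a search body that wraps significant_terms in a random_sampler."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {
            "sampling": {  # hypothetical name for the sampler bucket
                "random_sampler": {"probability": probability},
                "aggs": {
                    "foo_aggregation": {
                        "significant_terms": {"field": field, "size": size}
                    }
                },
            }
        },
    }


body = sampled_significant_terms("bar.text")
print(json.dumps(body, indent=2))
```

Unlike the random_score approach, the sampling here is expressed in the aggregation itself rather than through document scores and min_score.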

jpountz self-assigned this Nov 2, 2015

clintongormley added the feedback_needed and :Search/Search labels Nov 8, 2015

clintongormley added >bug and removed feedback_needed labels Feb 14, 2016

rjernst added the Team:Search label May 4, 2020

javanna closed this as not planned Oct 17, 2022
