random_sort on query with has_child eating insane amounts of memory in field data #20141


Closed
AndreCimander opened this issue Aug 24, 2016 · 1 comment

@AndreCimander

Elasticsearch version: 2.3.4

Plugins installed: [license, marvel, kibana, kopf, elastic-hq]

JVM version: Java(TM) SE Runtime Environment (build 1.8.0_101-b13)

OS version: Ubuntu 14.04 with kernel 4.2

Description of the problem including expected versus actual behavior:

Hey everyone,

first of all, thanks for this fine piece of software! 👍

I noticed extremely high memory usage and cache thrashing in our Elasticsearch cluster; after some digging I pinned it down to a single random-sort query with a has_child filter. We currently have about 70 million parents with 1.3 billion children.

Applying the query without the random_score function uses just a few GB of field data memory; with random_score, field data usage skyrockets to 60 GB per query, which is a little... unsettling.

The query:
GET /instagram-user/instagram-user/_search
{
  "query": {
    "function_score": {
      "filter": {
        "bool": {
          "must_not": [
            { "exists": { "field": "calculated" } }
          ],
          "filter": [
            { "term": { "private": false } },
            {
              "has_child": {
                "query": {
                  "bool": {
                    "filter": [
                      { "term": { "instagram_user_id": "1397123079" } }
                    ]
                  }
                },
                "score_mode": "none",
                "type": "instagram-like"
              }
            }
          ]
        }
      },
      "functions": [
        { "random_score": { "seed": 12345 } }
      ]
    }
  }
}

Did I miss some config parameter for random_score that results in all child documents being included and the filter bitsets being bloated?

Happy to supply additional logs.
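
For anyone trying to reproduce the measurement: the per-field numbers behind figures like the 60 GB above can be pulled from the stock fielddata stats endpoints (nothing here is specific to our cluster):

# Fielddata memory per node, broken down by field
GET /_cat/fielddata?v

# Total fielddata memory per node via the node stats API
GET /_nodes/stats/indices/fielddata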

@clintongormley
Contributor

No, you're not missing anything. Unfortunately, this is the way it works at the moment. The problem is that random scoring uses the _uid field, which currently doesn't have doc values (see #11887). That means the UID has to be loaded into fielddata (on the heap), which, considering the number of docs you have, is going to be costly.

Sorry I can't give you a better answer at the moment.
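
One possible workaround, sketched here rather than confirmed in this thread: index a precomputed random value on each parent and score on that with field_value_factor instead of random_score. random_key is a made-up field name, and this assumes every parent document gets a value for it at index time; numeric fields get doc values by default in 2.x, so scoring on it avoids loading _uid fielddata onto the heap.

# Add a numeric field to hold a per-document random value (doc values by default)
PUT /instagram-user/_mapping/instagram-user
{
  "properties": {
    "random_key": { "type": "float" }
  }
}

# Same filter as the original query, but the score comes from the indexed random value
GET /instagram-user/instagram-user/_search
{
  "query": {
    "function_score": {
      "filter": {
        "bool": {
          "must_not": [
            { "exists": { "field": "calculated" } }
          ],
          "filter": [
            { "term": { "private": false } },
            {
              "has_child": {
                "query": {
                  "bool": {
                    "filter": [
                      { "term": { "instagram_user_id": "1397123079" } }
                    ]
                  }
                },
                "score_mode": "none",
                "type": "instagram-like"
              }
            }
          ]
        }
      },
      "functions": [
        { "field_value_factor": { "field": "random_key" } }
      ]
    }
  }
}

The trade-off is that the ordering stays fixed until random_key is re-rolled, whereas random_score with a different seed gives a different ordering per request.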
