Skip to content

Fix bug in circuit-breaker check for geoshape grid aggregations #57962

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jul 16, 2020

Conversation

talevy
Copy link
Contributor

@talevy talevy commented Jun 10, 2020

There was a bug in the geoshape circuit-breaker check where the
hash values array was being allocated before its new size was
accounted for by the circuit breaker.

Fixes #57847.

There was a bug in the geoshape circuit-breaker check where the
hash values array was being allocated before its new size was
accounted for by the circuit breaker.

Fixes elastic#57847.
@talevy talevy added >bug :Analytics/Geo Indexing, search aggregations of geo points and shapes v8.0.0 Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v7.8.1 labels Jun 10, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Geo)

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really a big fan of guessing that the array will expand to using oversize. This is right now but there isn't anything that makes sure this stays right.

@talevy
Copy link
Contributor Author

talevy commented Jun 11, 2020

to reproduce OOM, download this USA Elasticsearch document:

usa.txt

curl -X PUT "user:pass@localhost:9200/test?pretty" -H 'Content-Type: application/json' -d'
{
    "mappings" : {
        "properties" : {
            "geometry" : { "type" : "geo_shape" }
        }
    }
}
'

curl -X PUT "user:pass@localhost:9200/test/_doc/usa?pretty" -H 'Content-Type: application/json' -d'@/path/to/usa.txt'

curl -X GET "user:pass@localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "gridSplit": {
      "geotile_grid": {
        "field": "geometry",
        "precision": 12
      },
      "aggs": {
        "gridCentroid": {
          "geo_centroid": {
            "field": "geometry"
          }
        }
      }
    }
  }
}
'

@talevy talevy added the v7.9.0 label Jun 11, 2020
@iverase
Copy link
Contributor

iverase commented Jun 11, 2020

I agree with Nick's point, I wonder if we should add a method to SortingNumericDocValues that gives the amount is going to add to a resize so we can be accurate and be tight to the implementation, wdyt?

@talevy
Copy link
Contributor Author

talevy commented Jun 11, 2020

is the concern that that there is nothing tying the usage of oversize as an estimate with the actual implementation of growing the array? And that this should be coupled better with the resizing so that the two depend on one another?

Looking at SortingNumericDocValues again now, I'm surprised there is no circuit-breaking usage there since it leverages the same resizing techniques and can, similarly, run into memory pressure. I suspect this has never been a problem because of the bucket size limit?

@nik9000
Copy link
Member

nik9000 commented Jun 11, 2020

Looking at SortingNumericDocValues again now, I'm surprised there is no circuit-breaking usage there since it leverages the same resizing techniques and can, similarly, run into memory pressure. I suspect this has never been a problem because of the bucket size limit?

I bet we just don't hit it because we don't frequently end up with very many values so the array never grows that large. I wonder if we should just pass a LongConsumer to the ctor for the breaker so we always factor it into the breaker

@iverase
Copy link
Contributor

iverase commented Jun 12, 2020

@nik9000 Note that the case for grid aggregations on geo shapes is a bit special in the sense that one value can generate potentially millions of buckets. That is not the case for other bucket aggregations where typically one doc value generates maybe a handful of buckets. (Probably the other aggregations that can generate lots of buckets are related with range field types #50109)

Adding a LongConsumer to the constructor will touch lots of other implementations so I would like to avoid it for this PR. Maybe we can find a middle ground here and add a TODO to design a more resilient implementation?

@nik9000
Copy link
Member

nik9000 commented Jun 12, 2020 via email

@iverase
Copy link
Contributor

iverase commented Jun 12, 2020

Would having a mutable LongConsumer that we
register be ok? It could default to noop and this agg could link it to the
breaker.

+1

@Mpdreamz Mpdreamz added v7.8.2 and removed v7.8.1 labels Jul 13, 2020
@iverase
Copy link
Contributor

iverase commented Jul 14, 2020

@tal @nik9000 I think we should move this PR forward to see if we can achieve 7.9?

@talevy
Copy link
Contributor Author

talevy commented Jul 14, 2020

👍 yes. I had planned to push this forward this week

@talevy talevy requested review from iverase and nik9000 July 15, 2020 02:36
Copy link
Contributor

@iverase iverase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@talevy
Copy link
Contributor Author

talevy commented Jul 15, 2020

I re-ran this using gradle run -Drun.license_type=trial

and this branch resulted in a circuit-breaker response

{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[request] Data too large, data for [<agg [gridSplit]>] would be [430729928/410.7mb], which is larger than the limit of [322122547/307.1mb]","bytes_wanted":430729928,"bytes_limit":322122547,"durability":"TRANSIENT"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"test","node":"Alaz6WFOQJ-4qP8Rrb1Mhw","reason":{"type":"circuit_breaking_exception","reason":"[request] Data too large, data for [<agg [gridSplit]>] would be [430729928/410.7mb], which is larger than the limit of [322122547/307.1mb]","bytes_wanted":430729928,"bytes_limit":322122547,"durability":"TRANSIENT"}}]},"status":429}

while master OOMd

@talevy
Copy link
Contributor Author

talevy commented Jul 15, 2020

it may also not be clear, but the aggregator keeps track of these bytes added, and cleans them up upon closing of the aggregation here

this.breakerService.getBreaker(CircuitBreaker.REQUEST).addWithoutBreaking(-this.requestBytesUsed);

@talevy
Copy link
Contributor Author

talevy commented Jul 15, 2020

thank you @nik9000. your suggestions were real great. updated!

@talevy talevy requested a review from nik9000 July 15, 2020 23:07
Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@talevy talevy merged commit 8762cd6 into elastic:master Jul 16, 2020
@talevy talevy deleted the fixGeoTileOOM branch July 16, 2020 23:14
talevy added a commit to talevy/elasticsearch that referenced this pull request Jul 17, 2020
…tic#57962)

There was a bug in the geoshape circuit-breaker check where the
hash values array was being allocated before its new size was
accounted for by the circuit breaker.

Fixes elastic#57847.
talevy added a commit that referenced this pull request Jul 17, 2020
PR #57962 increased the precision from 6 to 7 without realizing the
effects on CI — This resulted in out of memory exceptions in CI.

This commit reduces the precision down to 5.
talevy added a commit that referenced this pull request Jul 17, 2020
…) (#59741)

There was a bug in the geoshape circuit-breaker check where the
hash values array was being allocated before its new size was
accounted for by the circuit breaker.

Fixes #57847.
@talevy talevy added the v7.10.0 label Jul 17, 2020
talevy added a commit that referenced this pull request Jul 17, 2020
…) (#59741)

There was a bug in the geoshape circuit-breaker check where the
hash values array was being allocated before its new size was
accounted for by the circuit breaker.

Fixes #57847.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v7.9.0 v7.10.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ES crashes with OOM error when running fine-grained geotile_grid aggs on geo_shape
6 participants