Skip to content

Geometry simplifier (streaming) #94859

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
May 11, 2023

Conversation

craigtaverner
Copy link
Contributor

@craigtaverner craigtaverner commented Mar 29, 2023

There are many opportunities to enable geometry simplification in Elasticsearch, both as an explicit feature available to users, and as an internal optimization technique for reducing memory consumption for complex geometries. For the latter case, it can even be considered a bug fix. This PR provides support for constraining Line and LinearRing sizes to a fixed number of points, and thereby a fixed amount of memory usage.

Consider, for example, the geo_line aggregation. This is similar to the top-10 aggregation, but allows the top-10k (ten thousand) points to be aggregated. This is not only a lot of memory, but can still cause unwanted line truncation for very large geometries. Line simplification is a solution to this. It is likely that a much smaller limit than 10k would suffice, while at the same time not truncating the geometry at all, so we fix a bug (truncation) while improving memory usage (pull limit from 10k down to perhaps just 1k).

This PR provides two APIs:

  • Streaming:
  • Full-geometry:
    • Simplifying full geometries using the simplifier.simplify(geometry) method can work with most geometry types, even GeometryCollection, but:
      • Some geometries do not get simplified because it makes no sense to: Point, Circle, Rectangle
      • The maxPoints parameter is used as is to apply to the main component (shell for polygons, largest geometry for multi-polygons and geometry collections), and all other sub-components (holes in polygons, etc.) are simplified to a scaled down version of the maxPoints, scaled by the relative size of the sub-component to the main component.
    • The simplification itself is done on each Line and LinearRing component using the same streaming algorithm above. Since we use the Visvalingam–Whyatt algorithm, this works is applicable to both streaming and full-geometry simplification with the same essential result, but better control over memory than normal full-geometry simplifiers.

The basic algorithm for simplification on a stream of points requires maintaining two data structures:

  • an array of all currently simplified points (implicitly ordered in stream order)
  • a priority queue of all but the two end points with an estimated error on each that expresses the cost of removing that point from the line

Then running the algorithm on each incoming point:

  • If the incoming point is the third point or later:
    • Evaluate the simplification error of the previous point between its previous and the new one being added
    • Add that previous point and its error to the priority queue (note how this leads to the priority queue having two fewer points than the sorted array)
  • If the total number of points in the sorted array has reached the maximum point count:
    • Determine which point in the priority queue has the lowest error (least impact to the geometry if removed)
    • Remove that point from the array and the priority queue
    • Add the new point to the end of the array
    • Recalculate the errors of the points on either side of the removed point
  • Otherwise just add the new point to the end of the sorted array

Note, however, that no attempt is made to prevent invalid geometries from forming. For example, when simplifying a polygon, the removal of points can cause lines to cross, creating invalid polygons. Any usage of this algorithm needs to consider that possibility. If the simplification is only required for visualization, and the visualization code is tolerant of line crossings, this is not an issue. But if you need valid simplified geometries, do additional validation on the final results before using them.

The concept of simplification error is an estimation of the amount of damage to the Line or LinearRing that occurs when a point is removed. We have evaluated several candidates:

  • cartesian triangle area (where the lat/lon coordinates are assumed to be cartesian, ie. equirectangular projection)
  • triangle area (where we calculate area using Herons formula, with the triangle sides calculated using Haversin distance, to remove projection distortions)
  • triangle height (triangle area divided by base length)
  • cartesian height and back-paths (a very rough approximation of a Frechet distance)
  • spherical height and back-paths (the same, but calculated in x/y/x coordinates)

It turns out that the above list is also ordered from fastest to slowest, and the second one, triangle area, gives the least visually damaging results, so is by far the best choice at this stage for general usage.

In addition, this has been developed as a separate library containing only the geometry simplification algorithm and support for simplifying:

  • LineString
    • full geometry
    • on a stream of points
  • Polygon
    • full geometry (polygon with holes)
    • on a stream of points (only outer shell)
  • MultiPolygon
    • only full geometry (multiple shells and holes), no streaming
  • GeometryCollection
    • only full geometries, no streaming
  • Non simplifiable geometries:
    • Point, Circle, Rectangle

Tasks still to be done before changing this from draft:

  • Support Line simplification algorithms
    • Triangle area
    • Hausdorff distance
    • Simple Frechet distance (no core sets)
  • Move to spherical coordinates (at least required in error calculation)
    • Triangle area
    • Hausdorff distance
    • Simple Frechet distance (no core sets)
  • Extensive unit tests
    • Test simplification algorithm in GeometrySimplifierTests
    • Test error calculations in SimplificationErrorCalculatorTests
  • Write micro-benchmarks for algorithm performance
  • Move to Lucene PriorityQueue (and benchmark against java.util.PriorityQueue)
    • Using Lucene PriorityQueue.updateTop() as possible performance optimization
  • Remove dependency on Lucene (copy SloppyMath into lib/geo)
  • Remove GeometrySimplfierMonitor to separate project and enhance TestMonitor for assertions
  • Split full-geometry simplifier and streaming simplifier into separate APIs for clarity of usage
  • Write issues for proposed features:

@craigtaverner craigtaverner added >feature :Analytics/Geo Indexing, search aggregations of geo points and shapes Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.8.0 labels Mar 29, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @craigtaverner, I've created a changelog YAML for you.

@craigtaverner craigtaverner force-pushed the geometry_simplifier branch 2 times, most recently from be6ab40 to cfc774f Compare April 25, 2023 10:49
@gmarouli gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023
@craigtaverner craigtaverner requested a review from iverase May 8, 2023 08:13
The simplification error calculation can be done in several ways,
so we generalize to include triangle area calculation.

Also add more tests for both triangle area and prototype Frechet error
Used in micro-benchmarks
And fix memory leak on simplification, and improve circle testing
so we can see the impact of spherical calculations more easily.
Herons formula leads to NaN results when the triangle is flat, due
to small negative values inside the square root. This fix detects
those negative values and returns a zero area.
And copy in SloppyMath from lucene for haversinMeters function
This allows code calling the GeometrySimplifier to be notified
of events related to the simplification algorithm.
For example, the project geometry-simplifier-debug uses these events to draw images and create videos of the simplification process.
And the tests for the simplifier use this to collect data to make further assertions on the results.
We re-use PointError objects instead of sending them to GC.
This does require external code to re-use the simplifier itself.
@craigtaverner craigtaverner force-pushed the geometry_simplifier branch from 390629d to 3286ab7 Compare May 8, 2023 09:53
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class GeometrySimplificationBenchmark {
@Param({ "cartesiantrianglearea", "triangleArea", "frechetError" })
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing the triangleheight calculator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 65cf706

double rightYrotated = rightY * cos - rightX * sin;
assert Math.abs(rightYrotated) < 1e-10;
assert Math.abs(rightXrotated - len) < len / 1e10;
// Return distance to x-axis TODO: also include back-paths for Frechet distance calculation
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO still needs to be done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a0b5e05

@craigtaverner
Copy link
Contributor Author

After implementing back-paths in the Frechet distance calculation, I re-ran benchmarks:

Benchmark                                      (calculatorName)  (maxPoints)  Mode  Cnt  Score   Error  Units
GeometrySimplificationBenchmark.simplify  cartesiantrianglearea         1000  avgt    5  2,756 ± 0,063  ms/op
GeometrySimplificationBenchmark.simplify           triangleArea         1000  avgt    5  5,438 ± 0,092  ms/op
GeometrySimplificationBenchmark.simplify         triangleheight         1000  avgt    5  5,722 ± 0,134  ms/op
GeometrySimplificationBenchmark.simplify     simplefrechetError         1000  avgt    5  3,181 ± 0,066  ms/op

So it looks pretty good, however, this is still Cartesian. If I port to Spherical, it will get slower. So at this point trianglearea is still the primary one to consider, despite being a bit slow.

import java.util.List;
import java.util.Locale;

public interface SimplificationErrorCalculator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to propose to make those implementations a singleton as they are just functional interfaces. for example for the cartesian implementation we can define something like:

    /**
     * Calculate the triangle area using cartesian coordinates as described at
     * <a href="https://en.wikipedia.org/wiki/Area_of_a_triangle">Area of a triangle</a>
     */
    SimplificationErrorCalculator CARTESIAN_TRIANGLE_AREA = (left, middle, right) -> {
        double xb = middle.x() - left.x();
        double yb = middle.y() - left.y();
        double xc = right.x() - left.x();
        double yc = right.y() - left.y();
        return 0.5 * Math.abs(xb * yc - xc * yb);
    };

I think that can simplify the class and remove the need of a Registry which makes very little sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The registry is used by the benchmark, which needs to be parameterized by String[], and then lookup the actual calculator by name. An earlier version used a switch statement in the byName method, but I thought the registry looked a bit more elegant.

While I could move this all into the benchmark code instead, I did plan to have the tests make use of this also. Right now the tests require extending the GeometrySimplifierTests class with calculator specific sub-classes, that is not very elegant, and I would have preferred to use test parameterization, like we do in the benchmarks. I'm being blocked by ESTestCase which seems to dislike parameterized tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a separate note, the reason I did not use lambda syntax was that to support the actual paper you originally provided, we need to maintain a little more state in the calculators, and they would either need constructors and fields, or additional methods for this. Right now I do not have any calculators that need that, and renamed my FrechetError to SimpleFrechetError to emphasize that I'm not doing the full calculation, only a simplified version. But I did imagine that we might take that step (I even expected that before finishing this work, but no longer think so).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did investigate this a little further, and only one of the calculators contained a single function so could easily be converted to lambda syntax, so I did not take that step. However, I did make constants and removed the Registry class, which hopefully covers most of what you wanted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This did require hard-coding the names into the external visualization tool, but that is not a huge issue.

Copy link
Contributor

@iverase iverase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the approach looks good. One thing that is confusing is that at the moment the API looks like a common geometry simplifier but it is important to note that the difference with other approaches is that we are doing the simplification on a streaming fashion, do let's try to make it more obvious for the users?

And pass simplifier start/end events even for non simplified
geometries, for improved symmetry.
So we can complete support for all geometry types.
@craigtaverner
Copy link
Contributor Author

So I'm thinking more about the fact that this API currently does both streaming and non-streaming, and that most tests, benchmarks and the visualisation library use the non-streaming API, which internally just delegates to the streaming API.

However, I can imagine one improvement that should satisfy your interest in making the two use cases clearer and more separated, and that would be to split them into two interfaces, and the streaming one only has two implementations, LineString and Polygon, and the non-streaming one could continue to use instances of the streaming objects internally, but not expose their APIs. This would be a little work, but quite achievable, would more clearly separate the two, and also get rid of the current code throwing errors like:

throw new IllegalArgumentException("MultiPolygon geometry simplifier cannot work in streaming mode");

Getting rid of those lines, and therefor also removing any runtime checks for usage of the two APIs would certainly be an improvement.

@iverase
Copy link
Contributor

iverase commented May 9, 2023

I'm thinking more about the fact that this API currently does both streaming and non-streaming,

I don't think that is right. The code is using an streaming approach for line simplification but it accepts as an input a geometry.

@craigtaverner
Copy link
Contributor Author

The code uses the https://en.wikipedia.org/wiki/Visvalingam%E2%80%93Whyatt_algorithm algorithm for geometry simplification, this works for both streaming and non-streaming approaches. The two reasons why I currently call down from the non-streaming API to the streaming one are:

  • We only support limiting geometries by total point count
  • The first use case we expect to implement into Elasticsearch is geo_line streaming

Structuring the non-streaming API to not use the streaming algorithm internally can certainly be done, but there is very little need at this point until some other condition drives this need. In particular, one condition that would do this is to limit the simplification by maximum error (deviation from original), or by visualization conditions (resolution), which would lead to the need for alternative internal implementations.

I propose only doing that if we get such requirements.

For now, the streaming approach limited by total point count actually works pretty well for non-streaming API. Consider the data I've tested the most with, the MultiPolygon of the US (313 polygons for all islands, Alaska, and the main continent). This object takes a certain amount of memory when loaded. When simplifying we do not want to make any copies of that memory, and so the use of point count limited streaming actually ensures we only use a fixed amount of additional memory (linear function of the specified point count) to generate the new complete MultiPolygon with all polygons and holes simplified.

The original GeometrySimplifier provided both a streaming and
non-streaming interface. The streaming interface only works for
limited geometries, necessitating runtime checks and throwing
exceptions to ensure correct usage. This was potentially confusing
and error-prone. So now we've split the two classes:

* StreamingGeometrySimplifier - only supports streaming mode, with methods consume(x, y) and produce() and works only with four geometry types: Point, Line, LinearRing, Polygon (with no holes)
* GeometrySimplifier - only supports non-streaming mode with the method simplify(geometry) and works with almost all geometry types
@craigtaverner
Copy link
Contributor Author

OK. I've split the StreamingGeometrySimplifier out of the GeometrySimplifier. This should increase clarity quite a bit. Now the two classes have two separate API's and no runtime checks need to be done, and no exceptions thrown for incorrect usage. If you create a streaming simplifier, you can only use it in a streaming way, and the simplify method does not even exist. Likewise the non-streaming simplifier only has the simplify method, and you cannot call the consume and produce methods. Obviously the streaming simplifier only works for for three specific geometry types: Line, LinearRing and Polygon (with no holes), while the non-streaming simplifier can work with all geometry types (even GeometryCollection).

This change will break the external library that I wrote for generating images and videos, but I think that is fixable. When fixing all the tests for this change, the biggest issue was related to the callback monitor that was used for image generation, but I think the consequences might be as simple as some of the image labels no longer being possible to render, which is a minor loss.

* a good enough approximation of the original line. This restriction is currently true of all the
* calculations implemented so far.
*/
class SimpleFrechetErrorCalculator implements SimplificationErrorCalculator {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is based on the FrechetError oracle from the Abam et al. paper, it is not the complete calculation, since it only considers the currently simplified geometry, not the impact on the original geometry. The prefix Simple is probably not sufficient to cover this difference, so a better name would be HeightAndBackpathDistanceCalculator, which describes literally what it does.

To simplify the code a little, we removed the calculator registry and also renamed the frechet error to describe what it actually does, instead of implying it was an actual Frechet distance, which it is only an approximation of.
@craigtaverner
Copy link
Contributor Author

I got a basic version of a spherical version of the simplified frechet calculation, and it is much slower:

Benchmark                                                   (calculatorName)  (maxPoints)  Mode  Cnt   Score   Error  Units
GeometrySimplificationBenchmark.simplify               cartesiantrianglearea         1000  avgt    5   2,745 ± 0,096  ms/op
GeometrySimplificationBenchmark.simplify                        triangleArea         1000  avgt    5   5,398 ± 0,035  ms/op
GeometrySimplificationBenchmark.simplify                      triangleheight         1000  avgt    5   5,565 ± 0,010  ms/op
GeometrySimplificationBenchmark.simplify           heightandbackpathdistance         1000  avgt    5   3,074 ± 0,006  ms/op
GeometrySimplificationBenchmark.simplify  sphericalheightandbackpathdistance         1000  avgt    5  15,128 ± 0,185  ms/op

So the current winner balancing performance with accuracy, is still trianglearea (spherical coords).

Note, however, that no attempt is made to prevent invalid geometries from forming. For example, when simplifying a polygon,
the removal of points can cause lines to cross, creating invalid polygons. Any usage of this algorithm needs to consider that
possibility. If the simplification is only required for visualization, and the visualization code is tolerant of line crossings,
this is not an issue. But if you need valid simplified geometries, do additional validation on the final results before using them.
Copy link
Contributor

@iverase iverase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thank you for all the javadocs, now it makes much more sense to me

@craigtaverner craigtaverner merged commit 38baa36 into elastic:main May 11, 2023
@craigtaverner craigtaverner changed the title Geometry simplifier Geometry simplifier (streaming) May 12, 2023
@craigtaverner craigtaverner deleted the geometry_simplifier branch October 20, 2023 10:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.9.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants