Geometry simplifier (streaming) #94859

craigtaverner · 2023-03-29T10:27:39Z

There are many opportunities to enable geometry simplification in Elasticsearch, both as an explicit feature available to users, and as an internal optimization technique for reducing memory consumption for complex geometries. For the latter case, it can even be considered a bug fix. This PR provides support for constraining Line and LinearRing sizes to a fixed number of points, and thereby a fixed amount of memory usage.

Consider, for example, the geo_line aggregation. This is similar to the top-10 aggregation, but allows the top-10k (ten thousand) points to be aggregated. This is not only a lot of memory, but can still cause unwanted line truncation for very large geometries. Line simplification is a solution to this. It is likely that a much smaller limit than 10k would suffice, while at the same time not truncating the geometry at all, so we fix a bug (truncation) while improving memory usage (pull limit from 10k down to perhaps just 1k).

This PR provides two APIs:

Streaming:
- By using the simplifier.consume(x, y) method on a stream of points, the total memory used is limited to a linear function of k, the total number of points to retain. This algorithm is at its heart based on the Visvalingam–Whyatt algorithm, with concepts from https://bost.ocks.org/mike/simplify/ and in particular the detailed streaming discussions in the paper at https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.106.7132&rep=rep1&type=pdf
Full-geometry:
- Simplifying full geometries using the simplifier.simplify(geometry) method can work with most geometry types, even GeometryCollection, but:
  - Some geometries do not get simplified because it makes no sense to: Point, Circle, Rectangle
  - The maxPoints parameter is used as is to apply to the main component (shell for polygons, largest geometry for multi-polygons and geometry collections), and all other sub-components (holes in polygons, etc.) are simplified to a scaled down version of the maxPoints, scaled by the relative size of the sub-component to the main component.
- The simplification itself is done on each Line and LinearRing component using the same streaming algorithm above. Since we use the Visvalingam–Whyatt algorithm, this works is applicable to both streaming and full-geometry simplification with the same essential result, but better control over memory than normal full-geometry simplifiers.

The basic algorithm for simplification on a stream of points requires maintaining two data structures:

an array of all currently simplified points (implicitly ordered in stream order)
a priority queue of all but the two end points with an estimated error on each that expresses the cost of removing that point from the line

Then running the algorithm on each incoming point:

If the incoming point is the third point or later:
- Evaluate the simplification error of the previous point between its previous and the new one being added
- Add that previous point and its error to the priority queue (note how this leads to the priority queue having two fewer points than the sorted array)
If the total number of points in the sorted array has reached the maximum point count:
- Determine which point in the priority queue has the lowest error (least impact to the geometry if removed)
- Remove that point from the array and the priority queue
- Add the new point to the end of the array
- Recalculate the errors of the points on either side of the removed point
Otherwise just add the new point to the end of the sorted array

Note, however, that no attempt is made to prevent invalid geometries from forming. For example, when simplifying a polygon, the removal of points can cause lines to cross, creating invalid polygons. Any usage of this algorithm needs to consider that possibility. If the simplification is only required for visualization, and the visualization code is tolerant of line crossings, this is not an issue. But if you need valid simplified geometries, do additional validation on the final results before using them.

The concept of simplification error is an estimation of the amount of damage to the Line or LinearRing that occurs when a point is removed. We have evaluated several candidates:

cartesian triangle area (where the lat/lon coordinates are assumed to be cartesian, ie. equirectangular projection)
triangle area (where we calculate area using Herons formula, with the triangle sides calculated using Haversin distance, to remove projection distortions)
triangle height (triangle area divided by base length)
cartesian height and back-paths (a very rough approximation of a Frechet distance)
spherical height and back-paths (the same, but calculated in x/y/x coordinates)

It turns out that the above list is also ordered from fastest to slowest, and the second one, triangle area, gives the least visually damaging results, so is by far the best choice at this stage for general usage.

In addition, this has been developed as a separate library containing only the geometry simplification algorithm and support for simplifying:

Tasks still to be done before changing this from draft:

elasticsearchmachine · 2023-03-29T10:28:04Z

Hi @craigtaverner, I've created a changelog YAML for you.

The simplification error calculation can be done in several ways, so we generalize to include triangle area calculation. Also add more tests for both triangle area and prototype Frechet error

Used in micro-benchmarks

And fix memory leak on simplification, and improve circle testing so we can see the impact of spherical calculations more easily.

Herons formula leads to NaN results when the triangle is flat, due to small negative values inside the square root. This fix detects those negative values and returns a zero area.

And copy in SloppyMath from lucene for haversinMeters function

This allows code calling the GeometrySimplifier to be notified of events related to the simplification algorithm. For example, the project geometry-simplifier-debug uses these events to draw images and create videos of the simplification process. And the tests for the simplifier use this to collect data to make further assertions on the results.

We re-use PointError objects instead of sending them to GC. This does require external code to re-use the simplifier itself.

elasticsearchmachine · 2023-05-08T11:14:32Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

craigtaverner · 2023-05-08T13:43:23Z

...marks/src/main/java/org/elasticsearch/benchmark/spatial/GeometrySimplificationBenchmark.java

+@OutputTimeUnit(TimeUnit.MILLISECONDS)
+@State(Scope.Thread)
+public class GeometrySimplificationBenchmark {
+    @Param({ "cartesiantrianglearea", "triangleArea", "frechetError" })


Missing the triangleheight calculator

Fixed in 65cf706

craigtaverner · 2023-05-08T13:44:08Z

libs/geo/src/main/java/org/elasticsearch/geometry/simplify/SimplificationErrorCalculator.java

+                double rightYrotated = rightY * cos - rightX * sin;
+                assert Math.abs(rightYrotated) < 1e-10;
+                assert Math.abs(rightXrotated - len) < len / 1e10;
+                // Return distance to x-axis TODO: also include back-paths for Frechet distance calculation


TODO still needs to be done

Fixed in a0b5e05

craigtaverner · 2023-05-08T15:14:30Z

After implementing back-paths in the Frechet distance calculation, I re-ran benchmarks:

Benchmark                                      (calculatorName)  (maxPoints)  Mode  Cnt  Score   Error  Units
GeometrySimplificationBenchmark.simplify  cartesiantrianglearea         1000  avgt    5  2,756 ± 0,063  ms/op
GeometrySimplificationBenchmark.simplify           triangleArea         1000  avgt    5  5,438 ± 0,092  ms/op
GeometrySimplificationBenchmark.simplify         triangleheight         1000  avgt    5  5,722 ± 0,134  ms/op
GeometrySimplificationBenchmark.simplify     simplefrechetError         1000  avgt    5  3,181 ± 0,066  ms/op

So it looks pretty good, however, this is still Cartesian. If I port to Spherical, it will get slower. So at this point trianglearea is still the primary one to consider, despite being a bit slow.

iverase · 2023-05-09T08:57:16Z

libs/geo/src/main/java/org/elasticsearch/geometry/simplify/SimplificationErrorCalculator.java

+import java.util.List;
+import java.util.Locale;
+
+public interface SimplificationErrorCalculator {


I would like to propose to make those implementations a singleton as they are just functional interfaces. for example for the cartesian implementation we can define something like:

/** * Calculate the triangle area using cartesian coordinates as described at * <a href="https://en.wikipedia.org/wiki/Area_of_a_triangle">Area of a triangle</a> */ SimplificationErrorCalculator CARTESIAN_TRIANGLE_AREA = (left, middle, right) -> { double xb = middle.x() - left.x(); double yb = middle.y() - left.y(); double xc = right.x() - left.x(); double yc = right.y() - left.y(); return 0.5 * Math.abs(xb * yc - xc * yb); };

I think that can simplify the class and remove the need of a Registry which makes very little sense.

The registry is used by the benchmark, which needs to be parameterized by String[], and then lookup the actual calculator by name. An earlier version used a switch statement in the byName method, but I thought the registry looked a bit more elegant.

While I could move this all into the benchmark code instead, I did plan to have the tests make use of this also. Right now the tests require extending the GeometrySimplifierTests class with calculator specific sub-classes, that is not very elegant, and I would have preferred to use test parameterization, like we do in the benchmarks. I'm being blocked by ESTestCase which seems to dislike parameterized tests.

On a separate note, the reason I did not use lambda syntax was that to support the actual paper you originally provided, we need to maintain a little more state in the calculators, and they would either need constructors and fields, or additional methods for this. Right now I do not have any calculators that need that, and renamed my FrechetError to SimpleFrechetError to emphasize that I'm not doing the full calculation, only a simplified version. But I did imagine that we might take that step (I even expected that before finishing this work, but no longer think so).

I did investigate this a little further, and only one of the calculators contained a single function so could easily be converted to lambda syntax, so I did not take that step. However, I did make constants and removed the Registry class, which hopefully covers most of what you wanted.

This did require hard-coding the names into the external visualization tool, but that is not a huge issue.

libs/geo/src/main/java/org/elasticsearch/geometry/simplify/GeometrySimplifier.java

iverase

I think the approach looks good. One thing that is confusing is that at the moment the API looks like a common geometry simplifier but it is important to note that the difference with other approaches is that we are doing the simplification on a streaming fashion, do let's try to make it more obvious for the users?

And pass simplifier start/end events even for non simplified geometries, for improved symmetry.

So we can complete support for all geometry types.

craigtaverner · 2023-05-09T14:46:09Z

So I'm thinking more about the fact that this API currently does both streaming and non-streaming, and that most tests, benchmarks and the visualisation library use the non-streaming API, which internally just delegates to the streaming API.

However, I can imagine one improvement that should satisfy your interest in making the two use cases clearer and more separated, and that would be to split them into two interfaces, and the streaming one only has two implementations, LineString and Polygon, and the non-streaming one could continue to use instances of the streaming objects internally, but not expose their APIs. This would be a little work, but quite achievable, would more clearly separate the two, and also get rid of the current code throwing errors like:

throw new IllegalArgumentException("MultiPolygon geometry simplifier cannot work in streaming mode");

Getting rid of those lines, and therefor also removing any runtime checks for usage of the two APIs would certainly be an improvement.

iverase · 2023-05-09T15:06:24Z

I'm thinking more about the fact that this API currently does both streaming and non-streaming,

I don't think that is right. The code is using an streaming approach for line simplification but it accepts as an input a geometry.

craigtaverner · 2023-05-09T15:34:11Z

The code uses the https://en.wikipedia.org/wiki/Visvalingam%E2%80%93Whyatt_algorithm algorithm for geometry simplification, this works for both streaming and non-streaming approaches. The two reasons why I currently call down from the non-streaming API to the streaming one are:

We only support limiting geometries by total point count
The first use case we expect to implement into Elasticsearch is geo_line streaming

Structuring the non-streaming API to not use the streaming algorithm internally can certainly be done, but there is very little need at this point until some other condition drives this need. In particular, one condition that would do this is to limit the simplification by maximum error (deviation from original), or by visualization conditions (resolution), which would lead to the need for alternative internal implementations.

I propose only doing that if we get such requirements.

For now, the streaming approach limited by total point count actually works pretty well for non-streaming API. Consider the data I've tested the most with, the MultiPolygon of the US (313 polygons for all islands, Alaska, and the main continent). This object takes a certain amount of memory when loaded. When simplifying we do not want to make any copies of that memory, and so the use of point count limited streaming actually ensures we only use a fixed amount of additional memory (linear function of the specified point count) to generate the new complete MultiPolygon with all polygons and holes simplified.

The original GeometrySimplifier provided both a streaming and non-streaming interface. The streaming interface only works for limited geometries, necessitating runtime checks and throwing exceptions to ensure correct usage. This was potentially confusing and error-prone. So now we've split the two classes: * StreamingGeometrySimplifier - only supports streaming mode, with methods consume(x, y) and produce() and works only with four geometry types: Point, Line, LinearRing, Polygon (with no holes) * GeometrySimplifier - only supports non-streaming mode with the method simplify(geometry) and works with almost all geometry types

craigtaverner · 2023-05-09T18:23:26Z

OK. I've split the StreamingGeometrySimplifier out of the GeometrySimplifier. This should increase clarity quite a bit. Now the two classes have two separate API's and no runtime checks need to be done, and no exceptions thrown for incorrect usage. If you create a streaming simplifier, you can only use it in a streaming way, and the simplify method does not even exist. Likewise the non-streaming simplifier only has the simplify method, and you cannot call the consume and produce methods. Obviously the streaming simplifier only works for for three specific geometry types: Line, LinearRing and Polygon (with no holes), while the non-streaming simplifier can work with all geometry types (even GeometryCollection).

This change will break the external library that I wrote for generating images and videos, but I think that is fixable. When fixing all the tests for this change, the biggest issue was related to the callback monitor that was used for image generation, but I think the consequences might be as simple as some of the image labels no longer being possible to render, which is a minor loss.

craigtaverner · 2023-05-10T08:22:06Z

libs/geo/src/main/java/org/elasticsearch/geometry/simplify/SimplificationErrorCalculator.java

+     * a good enough approximation of the original line. This restriction is currently true of all the
+     * calculations implemented so far.
+     */
+    class SimpleFrechetErrorCalculator implements SimplificationErrorCalculator {


While this is based on the FrechetError oracle from the Abam et al. paper, it is not the complete calculation, since it only considers the currently simplified geometry, not the impact on the original geometry. The prefix Simple is probably not sufficient to cover this difference, so a better name would be HeightAndBackpathDistanceCalculator, which describes literally what it does.

To simplify the code a little, we removed the calculator registry and also renamed the frechet error to describe what it actually does, instead of implying it was an actual Frechet distance, which it is only an approximation of.

craigtaverner · 2023-05-10T18:51:29Z

I got a basic version of a spherical version of the simplified frechet calculation, and it is much slower:

Benchmark                                                   (calculatorName)  (maxPoints)  Mode  Cnt   Score   Error  Units
GeometrySimplificationBenchmark.simplify               cartesiantrianglearea         1000  avgt    5   2,745 ± 0,096  ms/op
GeometrySimplificationBenchmark.simplify                        triangleArea         1000  avgt    5   5,398 ± 0,035  ms/op
GeometrySimplificationBenchmark.simplify                      triangleheight         1000  avgt    5   5,565 ± 0,010  ms/op
GeometrySimplificationBenchmark.simplify           heightandbackpathdistance         1000  avgt    5   3,074 ± 0,006  ms/op
GeometrySimplificationBenchmark.simplify  sphericalheightandbackpathdistance         1000  avgt    5  15,128 ± 0,185  ms/op

So the current winner balancing performance with accuracy, is still trianglearea (spherical coords).

Note, however, that no attempt is made to prevent invalid geometries from forming. For example, when simplifying a polygon, the removal of points can cause lines to cross, creating invalid polygons. Any usage of this algorithm needs to consider that possibility. If the simplification is only required for visualization, and the visualization code is tolerant of line crossings, this is not an issue. But if you need valid simplified geometries, do additional validation on the final results before using them.

iverase

LGTM

Thank you for all the javadocs, now it makes much more sense to me

craigtaverner added >feature :Analytics/Geo Indexing, search aggregations of geo points and shapes Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.8.0 labels Mar 29, 2023

craigtaverner mentioned this pull request Mar 31, 2023

Asset tracking: line-simplification for geo_line in TSDB #94954

Merged

20 tasks

craigtaverner force-pushed the geometry_simplifier branch from 6d8142d to f77ba23 Compare April 11, 2023 15:10

craigtaverner force-pushed the geometry_simplifier branch 2 times, most recently from be6ab40 to cfc774f Compare April 25, 2023 10:49

gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023

craigtaverner requested a review from iverase May 8, 2023 08:13

craigtaverner added 17 commits May 8, 2023 11:45

WIP line simplifier for use in geo_line

74af3fb

Generalize error calc including triangle area

9063a0d

The simplification error calculation can be done in several ways, so we generalize to include triangle area calculation. Also add more tests for both triangle area and prototype Frechet error

Support polygon and multipolygon simplification

1af51da

Refinements and more negative tests

ab71c3a

Remove system.out.println

ad147df

Update docs/changelog/94859.yaml

164481f

Export new simplification code from geo module

2b96996

Util for selecting calculator

80e65fb

Used in micro-benchmarks

Initial microbenchmarks for geo simplification

74d1b91

Add spherical area calculation for simplification

0c1d3cb

And fix memory leak on simplification, and improve circle testing so we can see the impact of spherical calculations more easily.

Fixed numerical error in flat triangle areas

e274cef

Herons formula leads to NaN results when the triangle is flat, due to small negative values inside the square root. This fix detects those negative values and returns a zero area.

Add lucene license notice

6fe869e

Fixed bug with co-located points in frechet error

4c241a0

Remove dependency on lucene

1791330

And copy in SloppyMath from lucene for haversinMeters function

Refinements in support of visualization monitor

6499b8b

Don't create more objects than actually needed

3286ab7

We re-use PointError objects instead of sending them to GC. This does require external code to re-use the simplifier itself.

craigtaverner force-pushed the geometry_simplifier branch from 390629d to 3286ab7 Compare May 8, 2023 09:53

craigtaverner added 2 commits May 8, 2023 15:26

Benchmark triangleheight

65cf706

Support back-paths in simple-frechet error

a0b5e05

craigtaverner commented May 8, 2023

View reviewed changes

iverase reviewed May 9, 2023

View reviewed changes

libs/geo/src/main/java/org/elasticsearch/geometry/simplify/GeometrySimplifier.java Show resolved Hide resolved

iverase reviewed May 9, 2023

View reviewed changes

libs/geo/src/main/java/org/elasticsearch/geometry/simplify/GeometrySimplifier.java Show resolved Hide resolved

iverase reviewed May 9, 2023

View reviewed changes

craigtaverner added 3 commits May 9, 2023 11:46

Added GeometryCollection support

362176a

And pass simplifier start/end events even for non simplified geometries, for improved symmetry.

Generalize Point simplifier to identity simplifier

0cf3ed1

So we can complete support for all geometry types.

Use identity simplifier for Rectangle, Circle and MultiPoint

8f62e44

craigtaverner commented May 10, 2023

View reviewed changes

craigtaverner added 4 commits May 10, 2023 10:55

Renamed Frechet error and removed registry

cf6a254

To simplify the code a little, we removed the calculator registry and also renamed the frechet error to describe what it actually does, instead of implying it was an actual Frechet distance, which it is only an approximation of.

Improved comments

bd5721e

WIP Added spherical height and backpath calculator

1b8689b

Test for spherical backpath calculator

3307dc6

craigtaverner added 2 commits May 10, 2023 20:53

spotless

26a1da5

iverase approved these changes May 11, 2023

View reviewed changes

craigtaverner merged commit 38baa36 into elastic:main May 11, 2023

craigtaverner mentioned this pull request May 12, 2023

simplify geo_shape geometry for query response #96033

Open

craigtaverner changed the title ~~Geometry simplifier~~ Geometry simplifier (streaming) May 12, 2023

craigtaverner deleted the geometry_simplifier branch October 20, 2023 10:45

Geometry simplifier (streaming) #94859

Geometry simplifier (streaming) #94859

Uh oh!

Conversation

craigtaverner commented Mar 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elasticsearchmachine commented Mar 29, 2023

Uh oh!

elasticsearchmachine commented May 8, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

craigtaverner commented May 8, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

iverase left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

craigtaverner commented May 9, 2023

Uh oh!

iverase commented May 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

craigtaverner commented May 9, 2023

Uh oh!

craigtaverner commented May 9, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

craigtaverner commented May 10, 2023

Uh oh!

iverase left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

craigtaverner commented Mar 29, 2023 •

edited

Loading

iverase left a comment •

edited

Loading

iverase commented May 9, 2023 •

edited

Loading