-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Geometry simplifier (streaming) #94859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Geometry simplifier (streaming) #94859
Conversation
Hi @craigtaverner, I've created a changelog YAML for you. |
6d8142d
to
f77ba23
Compare
be6ab40
to
cfc774f
Compare
The simplification error calculation can be done in several ways, so we generalize to include triangle area calculation. Also add more tests for both triangle area and prototype Frechet error
Used in micro-benchmarks
And fix memory leak on simplification, and improve circle testing so we can see the impact of spherical calculations more easily.
Herons formula leads to NaN results when the triangle is flat, due to small negative values inside the square root. This fix detects those negative values and returns a zero area.
And copy in SloppyMath from lucene for haversinMeters function
This allows code calling the GeometrySimplifier to be notified of events related to the simplification algorithm. For example, the project geometry-simplifier-debug uses these events to draw images and create videos of the simplification process. And the tests for the simplifier use this to collect data to make further assertions on the results.
We re-use PointError objects instead of sending them to GC. This does require external code to re-use the simplifier itself.
390629d
to
3286ab7
Compare
Pinging @elastic/es-analytics-geo (Team:Analytics) |
@OutputTimeUnit(TimeUnit.MILLISECONDS) | ||
@State(Scope.Thread) | ||
public class GeometrySimplificationBenchmark { | ||
@Param({ "cartesiantrianglearea", "triangleArea", "frechetError" }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing the triangleheight calculator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 65cf706
double rightYrotated = rightY * cos - rightX * sin; | ||
assert Math.abs(rightYrotated) < 1e-10; | ||
assert Math.abs(rightXrotated - len) < len / 1e10; | ||
// Return distance to x-axis TODO: also include back-paths for Frechet distance calculation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO still needs to be done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in a0b5e05
After implementing back-paths in the Frechet distance calculation, I re-ran benchmarks:
So it looks pretty good, however, this is still Cartesian. If I port to Spherical, it will get slower. So at this point |
import java.util.List; | ||
import java.util.Locale; | ||
|
||
public interface SimplificationErrorCalculator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to propose to make those implementations a singleton as they are just functional interfaces. for example for the cartesian implementation we can define something like:
/**
* Calculate the triangle area using cartesian coordinates as described at
* <a href="https://en.wikipedia.org/wiki/Area_of_a_triangle">Area of a triangle</a>
*/
SimplificationErrorCalculator CARTESIAN_TRIANGLE_AREA = (left, middle, right) -> {
double xb = middle.x() - left.x();
double yb = middle.y() - left.y();
double xc = right.x() - left.x();
double yc = right.y() - left.y();
return 0.5 * Math.abs(xb * yc - xc * yb);
};
I think that can simplify the class and remove the need of a Registry which makes very little sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The registry is used by the benchmark, which needs to be parameterized by String[], and then lookup the actual calculator by name. An earlier version used a switch statement in the byName method, but I thought the registry looked a bit more elegant.
While I could move this all into the benchmark code instead, I did plan to have the tests make use of this also. Right now the tests require extending the GeometrySimplifierTests class with calculator specific sub-classes, that is not very elegant, and I would have preferred to use test parameterization, like we do in the benchmarks. I'm being blocked by ESTestCase which seems to dislike parameterized tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a separate note, the reason I did not use lambda syntax was that to support the actual paper you originally provided, we need to maintain a little more state in the calculators, and they would either need constructors and fields, or additional methods for this. Right now I do not have any calculators that need that, and renamed my FrechetError to SimpleFrechetError to emphasize that I'm not doing the full calculation, only a simplified version. But I did imagine that we might take that step (I even expected that before finishing this work, but no longer think so).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did investigate this a little further, and only one of the calculators contained a single function so could easily be converted to lambda syntax, so I did not take that step. However, I did make constants and removed the Registry class, which hopefully covers most of what you wanted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This did require hard-coding the names
into the external visualization tool, but that is not a huge issue.
libs/geo/src/main/java/org/elasticsearch/geometry/simplify/GeometrySimplifier.java
Show resolved
Hide resolved
libs/geo/src/main/java/org/elasticsearch/geometry/simplify/GeometrySimplifier.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the approach looks good. One thing that is confusing is that at the moment the API looks like a common geometry simplifier but it is important to note that the difference with other approaches is that we are doing the simplification on a streaming fashion, do let's try to make it more obvious for the users?
And pass simplifier start/end events even for non simplified geometries, for improved symmetry.
So we can complete support for all geometry types.
So I'm thinking more about the fact that this API currently does both streaming and non-streaming, and that most tests, benchmarks and the visualisation library use the non-streaming API, which internally just delegates to the streaming API. However, I can imagine one improvement that should satisfy your interest in making the two use cases clearer and more separated, and that would be to split them into two interfaces, and the streaming one only has two implementations, LineString and Polygon, and the non-streaming one could continue to use instances of the streaming objects internally, but not expose their APIs. This would be a little work, but quite achievable, would more clearly separate the two, and also get rid of the current code throwing errors like:
Getting rid of those lines, and therefor also removing any runtime checks for usage of the two APIs would certainly be an improvement. |
I don't think that is right. The code is using an streaming approach for line simplification but it accepts as an input a geometry. |
The code uses the https://en.wikipedia.org/wiki/Visvalingam%E2%80%93Whyatt_algorithm algorithm for geometry simplification, this works for both streaming and non-streaming approaches. The two reasons why I currently call down from the non-streaming API to the streaming one are:
Structuring the non-streaming API to not use the streaming algorithm internally can certainly be done, but there is very little need at this point until some other condition drives this need. In particular, one condition that would do this is to limit the simplification by maximum error (deviation from original), or by visualization conditions (resolution), which would lead to the need for alternative internal implementations. I propose only doing that if we get such requirements. For now, the streaming approach limited by total point count actually works pretty well for non-streaming API. Consider the data I've tested the most with, the MultiPolygon of the US (313 polygons for all islands, Alaska, and the main continent). This object takes a certain amount of memory when loaded. When simplifying we do not want to make any copies of that memory, and so the use of point count limited streaming actually ensures we only use a fixed amount of additional memory (linear function of the specified point count) to generate the new complete MultiPolygon with all polygons and holes simplified. |
The original GeometrySimplifier provided both a streaming and non-streaming interface. The streaming interface only works for limited geometries, necessitating runtime checks and throwing exceptions to ensure correct usage. This was potentially confusing and error-prone. So now we've split the two classes: * StreamingGeometrySimplifier - only supports streaming mode, with methods consume(x, y) and produce() and works only with four geometry types: Point, Line, LinearRing, Polygon (with no holes) * GeometrySimplifier - only supports non-streaming mode with the method simplify(geometry) and works with almost all geometry types
OK. I've split the StreamingGeometrySimplifier out of the GeometrySimplifier. This should increase clarity quite a bit. Now the two classes have two separate API's and no runtime checks need to be done, and no exceptions thrown for incorrect usage. If you create a streaming simplifier, you can only use it in a streaming way, and the This change will break the external library that I wrote for generating images and videos, but I think that is fixable. When fixing all the tests for this change, the biggest issue was related to the callback monitor that was used for image generation, but I think the consequences might be as simple as some of the image labels no longer being possible to render, which is a minor loss. |
* a good enough approximation of the original line. This restriction is currently true of all the | ||
* calculations implemented so far. | ||
*/ | ||
class SimpleFrechetErrorCalculator implements SimplificationErrorCalculator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this is based on the FrechetError oracle from the Abam et al. paper, it is not the complete calculation, since it only considers the currently simplified geometry, not the impact on the original geometry. The prefix Simple
is probably not sufficient to cover this difference, so a better name would be HeightAndBackpathDistanceCalculator
, which describes literally what it does.
To simplify the code a little, we removed the calculator registry and also renamed the frechet error to describe what it actually does, instead of implying it was an actual Frechet distance, which it is only an approximation of.
I got a basic version of a spherical version of the simplified frechet calculation, and it is much slower:
So the current winner balancing performance with accuracy, is still |
Note, however, that no attempt is made to prevent invalid geometries from forming. For example, when simplifying a polygon, the removal of points can cause lines to cross, creating invalid polygons. Any usage of this algorithm needs to consider that possibility. If the simplification is only required for visualization, and the visualization code is tolerant of line crossings, this is not an issue. But if you need valid simplified geometries, do additional validation on the final results before using them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thank you for all the javadocs, now it makes much more sense to me
There are many opportunities to enable geometry simplification in Elasticsearch, both as an explicit feature available to users, and as an internal optimization technique for reducing memory consumption for complex geometries. For the latter case, it can even be considered a bug fix. This PR provides support for constraining Line and LinearRing sizes to a fixed number of points, and thereby a fixed amount of memory usage.
Consider, for example, the geo_line aggregation. This is similar to the top-10 aggregation, but allows the top-10k (ten thousand) points to be aggregated. This is not only a lot of memory, but can still cause unwanted line truncation for very large geometries. Line simplification is a solution to this. It is likely that a much smaller limit than 10k would suffice, while at the same time not truncating the geometry at all, so we fix a bug (truncation) while improving memory usage (pull limit from 10k down to perhaps just 1k).
This PR provides two APIs:
simplifier.consume(x, y)
method on a stream of points, the total memory used is limited to a linear function ofk
, the total number of points to retain. This algorithm is at its heart based on the Visvalingam–Whyatt algorithm, with concepts from https://bost.ocks.org/mike/simplify/ and in particular the detailed streaming discussions in the paper at https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.106.7132&rep=rep1&type=pdfsimplifier.simplify(geometry)
method can work with most geometry types, evenGeometryCollection
, but:maxPoints
parameter is used as is to apply to the main component (shell for polygons, largest geometry for multi-polygons and geometry collections), and all other sub-components (holes in polygons, etc.) are simplified to a scaled down version of themaxPoints
, scaled by the relative size of the sub-component to the main component.The basic algorithm for simplification on a stream of points requires maintaining two data structures:
Then running the algorithm on each incoming point:
Note, however, that no attempt is made to prevent invalid geometries from forming. For example, when simplifying a polygon, the removal of points can cause lines to cross, creating invalid polygons. Any usage of this algorithm needs to consider that possibility. If the simplification is only required for visualization, and the visualization code is tolerant of line crossings, this is not an issue. But if you need valid simplified geometries, do additional validation on the final results before using them.
The concept of simplification error is an estimation of the amount of damage to the Line or LinearRing that occurs when a point is removed. We have evaluated several candidates:
It turns out that the above list is also ordered from fastest to slowest, and the second one, triangle area, gives the least visually damaging results, so is by far the best choice at this stage for general usage.
In addition, this has been developed as a separate library containing only the geometry simplification algorithm and support for simplifying:
Tasks still to be done before changing this from draft:
Hausdorff distanceHausdorff distanceGeometrySimplifierTests
SimplificationErrorCalculatorTests
Move to Lucene PriorityQueue (and benchmark against java.util.PriorityQueue)Using LucenePriorityQueue.updateTop()
as possible performance optimization