New Geometry Validation Method
Date:Introduction
While I was still playing with the osmdata
R package to decide how to make a situation map of
Karabel (for background information, see my last post about
mapping a random run), I met some old friends that I have not
seen for a long time: geometry validity errors. This note
explore the new method implemented in R to validate (simple
feature) polygons.
R, rgeos and geos
My aim was to download areas covered by forest using the following lines:
Then I wanted to plot the polygons with tmap
, but I got an
error message Error: Shape contains invalid polygons. Please fix
it
. Three years ago, I used the library rgeos
to find the
problem. However, the loading of the package rgeos
adverts
that it will be deprecated by printing this warning message
.onLoad
:
[…] Please note that rgeos will be retired by the end of 2023,
plan transition to sf functions using GEOS at your earliest
convenience […]
The retirement message was added Sep. 7 2021 (SVN revision
673). At the first popping of the message, I was not so pleased
because some of my previous scripts rely on rgeos
, and they
will brake sooner than I would like. However, I completely
understand the change, and instead of complaining, it is
a good occasion to thank those behind the package for their
tremendous and reliable work. Thank you Roger Bivand. I also do
appreciate to have more than two years of time to change my
scripts, this is inspiring community stewardship and the R
community would benefit to take backward compatibility
so seriously.
It is clear that the package sf
will eventually replace the
older sp
and rgeos
packages. Therefore, I dug into the
documentation of the sf
package, to look what is the new
function name and what are the parameters. Using the function
st_is_valid(reason=TRUE)
pointed to the error
However, by querying the word ?valid
, I spotted the function
st_make_valid()
as it is documented together with
the function st_is_valid()
. It impressed me,
because when I used it for the “malformed” OSM geometries, it solved most
of my problems. However, some polygons could not be fixed with
this routine. The st_make_valid()
documentation
tells us:
st_make_valid uses the lwgeom_makevalid method also used by the PostGIS command ST_makevalid if the GEOS version linked to is smaller than 3.8.0, and otherwise the version shipped in GEOS
This reflects a change that happened at the beginning of 2019 when the GEOS library adopted a PostGIS command ST_makevalid. By looking at GEOS, I noticed that a recent release of GEOS (Version 3.10.0 from Oct. 21, 2021) stated that this version adds a new parameter called “method” which enable to choose between two methods:
- method=linework: Original method, combines all rings into a set of noded lines and then extracts valid polygons from that linework.
- method=structure: first makes all rings valid then merges shells and subtracts holes from shells to generate valid result. Assumes that holes and shells are correctly categorized.
The function has also an option “KeepCollapsed” to “drop any component that has collapsed into a lower dimensionality. For example, a ring collapsing to a line, or a line collapsing to a point.”. Paul Ramsey, one of the authors of the new “structure” method with Martin Davis presents it in this post (July 19, 2021) on Crunchy Data. Originally, this algorithm was discussed in the “RFC: Fixing invalid geometry #652 before implementation in the JTS Topology Suite. When reading Paul Ramsey’s blog post, I particularly like the conclusion, stating that it is worth to implement a “trigger” function to check the validity and avoid algorithm failures later. It is similar to my conclusion when this function was not available.1
This triggered2 piqued my curiosity to check how it works in R.
Set-up
While we could use
rgeos
, I was curious to test the new package
geos
, even if I prefer the particularly good documentation of rgeos
and its syntax.3 Moreover rgeos
provides an incredibly series of examples, much better that what will follow. So please have a look at example(rgeos::gMakeValid
), it is worth to read the doc. The package geos
(and its companion libgeos
) are new and still at the experimental stage, so it may brakes in the futur…
Let start with creating a polygon, which has a geometry that is not valid. I am reusing here the example Polygon - Inverted shell, line touch, exterior, plotted bellow
The numbering shows the order of the nodes of the polygon. It clearly
indicates an invalid geometry. We would expect 9
edges (to define a closed shape, such as a polygon, the last
point must close onto the first point). However, we have here 10
points. The point 6 overlaps the point 2. This breaks the simple
feature validity rule. If we check the
validity with geos_is_valid()
, we have a FALSE
answer and we can look at the reason with
geos_is_valid_detail()
This geometry does not define a valid polygon per definition.
The function geos_is_valid_detail()
indicates the broken
validity rule (i.e., self-intersection) and gives the coordinates
<POINT (70 -90)>. However, with the
same set of points, it is possible to define a valid linestring:
It looks almost identical, but it better illustrates the problem in the topology of the object.
Even if these two figures look similar, obviously the (invalid) polygon
and the linestring behave differently. In both plots, the
parameter colour
is set. In the first case, the inside of the
polygon is colourised, in the second, it is the line and you
cannot colourise the “inside” (that does not exist).
A bit of make-up
Let us now play with this invalid polygon. First, to test and
compare the new geos
package, I use the same “linework” method
with the package geos
and the package sf
.
Method Linework
Here the (verbose) code, first for geos
and then for sf
:
And we plot the results side by side
As expected, both functions return the same result, but what did
happen? There is still a line to be seen, but this time, the
colour changed. Indeed, if we look at the geometry of the
validated polygon with the function call geos_unnest()
, we see
that the object is a geometry collection containing a polygon and
a line: <POLYGON [10 -90…90 -10]>, <LINESTRING (30 -90, 30 -60, 70 -60, 70 -90)>. After separating
the geometrycollection, it is possible to plot these two
elements side by side:
The outer part is assembled in a rectangular well-defined polygon
and the nodes are reorganised around it in clockwise
direction.4 The “inner part” is a new geometry: a
linestring. In total we have 7+4
points: 11! But that is fair,
it follows the definition we saw at the beginning for the
method=linework: combines all rings and then extracts valid
polygons from that linework.
What happens if we use the second method?
Method Structure
The package geos
follows closely the classes
and methods from the C library. It first checks the validity of
the declared parameters with geos_make_valid_params()
, to be used in the function call geos_make_valid()
.
But how does the results look like?
Waow… not bad! 9 points! If I was the person who did draw
the invalid polygon, that was probably what I had I mind. And it
is neat, we do not have a complex geometry collection but only
one simple polygon, defined as:
<POLYGON [10 -90…90 -10]>
This sparks my curiosity and the need to check, what would have happened if I had used this function instead of correcting manually three years ago.
Looking back
First, we must read the polygon that I corrected:5
We check the validity reason with geos_is_valid_detail(geb79)
This is the same situation as three years ago. The wall of the Middle Bronze Age building 79 is invalid. Let us test the two methods to make it valid and compare the results:
The top left is the invalid polygon, right is the polygon after application of the “linework” method and at the bottom the “structure” method. After the correction (right and bottom) it looks slightly different from the original (left).
With a closer look, it is easier to spot the differences. Originally, there were no point at the intersection, as it was a “self-intersection” (left). The whole geometry was one polygon. The two other methods present two polygons, the intersection being now a common point. Therefore, it is possible to have distinct colours. The two methods give the same result and define two points at the intersection to separate two valid geometries.
I was less impressed by the results, because for me, I would have preferred if it would have been possible to swap the points 1/15 and 14 in the original polygon (top left). But I prefer these results against aggressive changes. It gives me more confidence to use this function and it does what it was designed for: “It is desirable to have a way to convert an invalid geometry into a valid one that preserves the character of the input (to some extent)”.
However, this is not a reason to give up. While I was reading about OpenStreetMap data to understand how to manipulate them to make a map of Karabel (what brought me back to this problem), I came across a paper discussing how the problem exist in OpenStreetMap and how it was dealt there.
Dealing with polygons at OpenStreetMap
I am starting to read the Overpass API User’s Manual to see
what is going on under the hood when I am using the package
osmdata
. What is the model behind OSM data? Why do I have
“invalid Polygon” and how the OSM community deals with it? I was
surprised to read that OSM does not have a polygon or
multipolygon data type. Only nodes and ways (i.e., points
and linestrings) have geometric definitions. Polygons are defined
in relation to these two entities. Polygons are a sequence of
reference to nodes and ways. OSMlab, an organization for
OpenStreetMap related projects had a project called “fixing
polygons in osm” to corrects the topology of polygons created
out of “invalid relations” (and wrong metadata). The rationale
was that the tool Osm2pgsql is part of many rendering
toolchains and hence decides what appears on most OSM-based maps.
It creates a PostgreSQL/PostGIS database. To deal with invalid
polygons, “By default Osm2pgsql
will not discard invalid
polygons, it will try to fix them (using a buffer(0) GEOS
operation), which mostly works.” This was the first method
adopted.6 However, to solve the problem,
eventually the polygons were manually edited and corrected
(second method). In 2017 an impressive community effort was made
to fix more than 100,000 bad (multi)polygons. The
one I found (and corrected) were more recent. To
me, this shows that you can make a geometry valid with an
algorithm, but eventually, you will end up changing things
manually if you want high quality data.
Takeaway
Validating geometry at the time of creation should be always enforced with a kind of linter (“a tool used to flag stylistic errors and suspicious constructs”, Wikipedia s.v. ‘linter’). This will avoid to meet these old friends geometry validity errors repeatedly. They are mind-numbing, such as this blog post!
Footnotes
-
The first
geos::MakeValid
implementation I found was realeased 2019. ↩ -
I do not want firing triggers: “The application of military metaphors is unnecessary because more positive alternatives are available” (Jing-Bao Nie, Adam Gilbertson, Malcolm de Roubaix, Ciara Staunton, Anton van Niekerk, Joseph D. Tucker & Stuart Rennie (2016) Healing Without Waging War: Beyond Military Metaphors in Medicine and HIV Cure Research, The American Journal of Bioethics, 16:10, 3-11, DOI: 10.1080/15265161.2016.1214305. ↩
-
I was pleasantly surprised to see that
rgeos
is up-to-date. It exposes the parametersoriginal
(for the method) andkeepCollapsed
since April 2021 (SVN revision 655). Moreover, the implementation inrgeos
is well documented. It also points to the apparition of this novel function: “not available before GEOS 3.8.0; from 3.10.0 two correction strategies” ↩ -
I guess that direction is clockwise because we have negative Y coordinates. See this comment on #winding-order with further ref. ↩
-
To be honest, it is my only post that is not reproducible (it is plain markdown) and I did not dig into my git history, but extract the corrected feature and made it invalid ↩
-
I did not understand how it works because even if the documentation states that Osm2pgsql “never loads invalid geometries into your database”, they are plenty of examples as you can check with the tool OpenStreetMap inspector. ↩