PARQUET-2471: Add geometry logical type #240

Open
wants to merge 34 commits into
base: master
Changes from 27 commits
5c9e110
WIP: Add geometry logical type
wgtmac May 10, 2024
5ef28cd
address various comments
wgtmac May 25, 2024
ecd8cc2
add file level geo stats
wgtmac May 27, 2024
d81dacb
address feedback:
wgtmac May 31, 2024
80f4051
change naming and remove controversial items
wgtmac Jun 13, 2024
0db6d9f
address feedback
wgtmac Jun 16, 2024
e817af4
fix typo
wgtmac Jun 16, 2024
f78f7bd
use WKB type code
wgtmac Jun 19, 2024
1aaaca8
Update covering and geometry type protocol based on comments (#2)
zhangfengcdt Aug 7, 2024
ee5b2df
Add the new suggestion according to the meeting with Snowflake (#3)
jiayuasu Aug 15, 2024
19cc081
change metadata to string type and rewording WKB description
wgtmac Aug 20, 2024
16c5868
add example for crs
wgtmac Aug 21, 2024
56a65de
reword crs
wgtmac Aug 21, 2024
f28b282
clarify WKB
wgtmac Aug 22, 2024
5127702
clarify coverings
wgtmac Aug 24, 2024
298ab64
Update the suggestion for bbox stats (#4)
jiayuasu Sep 11, 2024
41c6394
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
d86abe4
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
c7a4f4c
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
f20f685
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
dbf9d54
address feedback about edges and wkb
wgtmac Sep 20, 2024
b4296aa
add geoparquet column metadata back
wgtmac Sep 27, 2024
9bcea6e
Update the spec according to the new feedback (#5)
jiayuasu Oct 4, 2024
99f0403
Update src/main/thrift/parquet.thrift
wgtmac Oct 12, 2024
dbb78cf
Update src/main/thrift/parquet.thrift
wgtmac Oct 12, 2024
25df0ff
add description to LogicalTypes.md
wgtmac Oct 13, 2024
d349727
add explanation for Z & M values
wgtmac Oct 13, 2024
9ea6559
move geo stats to ColumnMetaData
wgtmac Oct 16, 2024
011de45
Update src/main/thrift/parquet.thrift
wgtmac Oct 17, 2024
6425a3c
fix typo
wgtmac Oct 17, 2024
7d8ffa5
Merge branch 'master' into geo
wgtmac Nov 7, 2024
1502458
remove edges and simplify crs
wgtmac Nov 22, 2024
9f53c9e
Add geography type
wgtmac Dec 13, 2024
a4f79ca
remove wrong content
wgtmac Dec 13, 2024
168 changes: 168 additions & 0 deletions LogicalTypes.md
@@ -767,6 +767,174 @@ optional group my_map (MAP_KEY_VALUE) {
}
```

## Geospatial Types

### GEOMETRY

`GEOMETRY` is used for geometry features from [OGC – Simple feature access][simple-feature-access].
See [Geospatial Notes](#geospatial-notes).

The type has three type parameters:
- `encoding`: A required enum value specifying the annotated physical type and
encoding of the `GEOMETRY` type. See [Geometry Encoding](#geometry-encoding).
- `edges`: A required enum value specifying how edges of elements of the
`GEOMETRY` type are interpreted, i.e. whether the interpolation between points
along an edge represents a straight Cartesian line or the shortest line on
the sphere. See [Edges](#edges).
- `crs`: An optional string value for the CRS (coordinate reference system), which
is a mapping of how coordinates refer to precise locations on earth.
See [Coordinate Reference System](#coordinate-reference-system).

The sort order used for `GEOMETRY` is undefined. When writing data, no min/max
statistics should be saved for this type and if such non-compliant statistics
are found during reading, they must be ignored. Instead, [GeometryStatistics](#geometry-statistics)
is introduced for `GEOMETRY` type.

#### Geometry Encoding

Physical type and encoding for the `GEOMETRY` type. Supported values:
- `WKB`: `GEOMETRY` type with `WKB` encoding can only be used to annotate the
`BYTE_ARRAY` primitive type. See [WKB](#well-known-binary-wkb).

##### Well-known binary (WKB)

Well-known binary (WKB) representations of geometries, see [Geospatial Notes](#geospatial-notes).

To be clear, we follow the same definitions as GeoParquet for [WKB][geoparquet-wkb]
and [coordinate axis order][coordinate-axis-order]:
- Geometries SHOULD be encoded as ISO WKB supporting XY, XYZ, XYM, XYZM. Supported
standard geometry types: Point, LineString, Polygon, MultiPoint, MultiLineString,
MultiPolygon, and GeometryCollection.
- Coordinate axis order is always (x, y) where x is easting or longitude, and
y is northing or latitude. This ordering explicitly overrides the axis order
as specified in the CRS following the [GeoPackage specification][geopackage-spec].

This is the preferred encoding for maximum portability.
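
As an illustration of the byte layout, the following Python sketch hand-encodes a single point as ISO WKB using only the standard library. The helper name `encode_point_wkb` is hypothetical and not part of any Parquet or GeoParquet API; the sketch assumes little-endian byte order and covers only the Point type.

```python
import struct

# ISO WKB type code for Point; the thousands digit selects the dimensions
# (0 = XY, 1 = XYZ, 2 = XYM, 3 = XYZM).
WKB_POINT = 1


def encode_point_wkb(x, y, z=None, m=None):
    """Encode one point as little-endian ISO WKB (hypothetical helper)."""
    coords = [x, y]  # axis order is always (x, y): easting/longitude first
    type_code = WKB_POINT
    if z is not None:
        coords.append(z)
        type_code += 1000
    if m is not None:
        coords.append(m)
        type_code += 2000
    # Byte 0x01 marks little-endian, then a uint32 type code, then the doubles.
    return struct.pack("<BI%dd" % len(coords), 1, type_code, *coords)


wkb = encode_point_wkb(1.0, 2.0)
print(wkb.hex())
```

The resulting bytes are what a writer would store in the `BYTE_ARRAY` column; real implementations would normally rely on an existing geometry library rather than packing bytes by hand.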

[geoparquet-wkb]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L92
[coordinate-axis-order]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L155
[geopackage-spec]: https://www.geopackage.org/spec130/#gpb_spec

#### Edges

Interpretation for edges of elements of `GEOMETRY` type. In other words, it
specifies how a point between two vertices should be interpolated in its XY
dimensions. Supported values and corresponding interpolation approaches are:
- `PLANAR`: a Cartesian line connecting the two vertices.
- `SPHERICAL`: a shortest spherical arc between the longitude and latitude
represented by the two vertices.

This value applies to all non-point geometry objects and is independent of the
[Coordinate Reference System](#coordinate-reference-system).

(I work with the Salesforce Data Cloud team and am evaluating geospatial support in Iceberg.)
I am new to the geospatial world and wonder what it means for edges to be independent of the underlying CRS. Can the edges be planar while the CRS is based on elliptic geometry?

> Can the edges be planar while the CRS is based on elliptic geometry?

In principle, no. First, talking about "planar edges" or "spherical edges" makes no sense and was a confusion of terms in the initial draft of this specification (the group reached an agreement to fix that in recent talks, I hope it will be done before release). An edge can be a straight line, a curve, a geodesic, etc., but cannot be a plane or a sphere (because of wrong number of dimensions).

What the initial draft intended to say with "planar edges" (sic) is "edges computed as if they were in a planar (two-dimensional Cartesian) coordinate system" (the thing that is planar is the coordinate system, not the edges). This is not really correct for a geographic CRS, so you are right to say that they are not really independent. However, while it would be more exact to say that lines on a geographic CRS are geodesics, loxodromes, etc., it often happens that software ignores that physical reality and just performs linear interpolation of latitude and longitude values. The line on the ellipsoid surface obtained that way has no interesting properties; it is just easy to compute. We do not recommend doing that, but the use of the word "planar" in this context was an acknowledgement that it happens in practice and an attempt to describe that.

Thank you for the response. I do not understand what this parameter is used for in parquet. If it is the engine's property to treat the edges, how is this value helping? The engine capable of interpreting edges as geodesics should do so if the CRS reference indicates that the underlying geometry column belongs to an ellipsoid datum. Is this edge property forcing the engine to treat the values in a planar coordinate system?

In other words, is there something intrinsic to the data stored in the parquet file itself where edge parameter makes a difference?

Member

@redblackcoder

The Geo and Iceberg communities are discussing the best way to describe this field. It is very likely that we will want to rename the edges property to something else, because this is not what we wanted to describe initially. We will post updates in a few days.

@mentin mentin Nov 7, 2024

> The engine capable of interpreting edges as geodesics should do so if the CRS reference indicates that the underlying geometry column belongs to an ellipsoid datum.

Consider the most common case, SRID 4326. It is a geographic coordinate system (GEOGCS) rather than a projected one.
https://www.esri.com/arcgis-blog/products/arcgis-pro/mapping/gcs_vs_pcs/

So the linestring from A to B should follow the geodesic line. But most systems treat 4326 as a planar map. E.g. with the Geometry type in PostGIS or MS SQL Server, they treat it as a projected coordinate system, and the linestrings follow straight lines on a flat surface. If you use the latest MySQL or the Geography type in PostGIS or MS SQL Server, the linestrings in 4326 follow geodesic lines on the sphere. So it is ambiguous what exactly a linestring or polygon in 4326 describes. Is `point(30 21)` inside `polygon((10 10, 50 10, 50 20, 10 20, 10 10))`?

With geometry, in PostGIS, returns false:

select st_intersects(
  st_geomfromtext('polygon((10 10, 50 10, 50 20, 10 20, 10 10))', 4326), 
  st_geomfromtext('point(30 21)', 4326));

Same thing with geography (4326 is presumed), returns true:

 select st_intersects(
  st_geographyfromtext('srid=4326;polygon((10 10, 50 10, 50 20, 10 20, 10 10))'),
  st_geographyfromtext('srid=4326;point(30 21)'));

Unfortunately, there is no accepted way to describe the difference between geometry and geography in the WKB format. You can encounter SRID=4326 with both interpretations. The edge attribute allows describing the difference between geometry and geography, and tells the user how to interpret the data in a way consistent with the system that produced it.


Because most systems currently assume planar edges and do not support spherical
edges, `PLANAR` should be used as the default value.
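
To make the difference between the two interpolation approaches concrete, the sketch below evaluates one edge, from (10, 20) to (50, 20), at longitude 30 under both interpretations. The function names are hypothetical, and the spherical variant assumes a perfect sphere (not an ellipsoid) and vertices that do not share a meridian.

```python
import math


def planar_lat(lon, start, end):
    """Latitude on the straight Cartesian line between two (lon, lat) vertices."""
    (lon0, lat0), (lon1, lat1) = start, end
    t = (lon - lon0) / (lon1 - lon0)
    return lat0 + t * (lat1 - lat0)


def spherical_lat(lon, start, end):
    """Latitude where the great circle through the two vertices crosses `lon`.

    Assumes a perfect sphere and vertices that are not on the same meridian.
    """
    lon0, lat0 = map(math.radians, start)
    lon1, lat1 = map(math.radians, end)
    lam = math.radians(lon)
    num = (math.tan(lat0) * math.sin(lam - lon1)
           - math.tan(lat1) * math.sin(lam - lon0))
    return math.degrees(math.atan(num / math.sin(lon0 - lon1)))


top_edge = ((10.0, 20.0), (50.0, 20.0))
print(planar_lat(30.0, *top_edge))     # stays on the 20th parallel
print(spherical_lat(30.0, *top_edge))  # bulges north of the 21st parallel
```

Under `PLANAR` a point at (30, 21) lies above this edge, while under `SPHERICAL` the edge passes north of it at roughly latitude 21.17, so the two interpretations can disagree on whether the point is inside a polygon bounded by that edge.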

#### Coordinate Reference System

CRS (coordinate reference system) is a mapping of how coordinates refer to
precise locations on earth. A CRS is specified by a key-value entry in the
`key_value_metadata` field of `FileMetaData` whose key is a short name of
the CRS and value is the CRS representation. An additional entry in the
`key_value_metadata` field with the suffix ".type" is required to describe
the encoding of this CRS representation.

For example, if a geometry column (e.g., "geom1") uses the CRS "OGC:CRS84", the
writer may write two entries to `key_value_metadata` field of `FileMetaData` as
below, and set the `crs` field of the `GEOMETRY` type to "geom1_crs":
```
"geom1_crs": a UTF-8 encoded PROJJSON representation of OGC:CRS84
"geom1_crs.type": "PROJJSON"
```

The PROJJSON representation of OGC:CRS84 can be seen at [OGC:CRS84][ogc-crs84].
Multiple geometry columns can refer to the same CRS metadata field
(e.g., "geom1_crs") if they share the same CRS.
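
The lookup described above can be sketched in Python as follows. This is only an illustration: `key_value_metadata` is modelled as a plain dictionary, `resolve_crs` is a hypothetical helper, and the PROJJSON value is a heavily abbreviated placeholder rather than the full OGC:CRS84 definition.

```python
import json

# Stand-in for FileMetaData.key_value_metadata, modelled as a plain dict.
# The PROJJSON value is abbreviated; a real writer stores the full document.
key_value_metadata = {
    "geom1_crs": json.dumps({"type": "GeographicCRS", "name": "WGS 84 (CRS84)"}),
    "geom1_crs.type": "PROJJSON",
}


def resolve_crs(metadata, crs_key):
    """Return (encoding, representation) for a column's crs key, or None if unset."""
    if crs_key is None:
        return None  # crs is optional on the GEOMETRY type
    type_key = crs_key + ".type"
    if crs_key not in metadata or type_key not in metadata:
        raise KeyError("CRS entry %r or its '.type' entry is missing" % crs_key)
    return metadata[type_key], metadata[crs_key]


encoding, representation = resolve_crs(key_value_metadata, "geom1_crs")
print(encoding, json.loads(representation)["name"])
```

Because the column stores only the key, several geometry columns can pass the same key to such a lookup and share one CRS entry.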

[ogc-crs84]: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#ogccrs84-details

#### Geometry Statistics

`GeometryStatistics` is an optional field of `Statistics` for `GEOMETRY` type.
It contains [Bounding Box](#bounding-box) and [Geometry Types](#geometry-types).
Note that geometry statistics in the page index are not supported yet.

##### Bounding Box

A geometry has at least two coordinate dimensions: X and Y for 2D coordinates
of each point.

A geometry can optionally have Z and / or M values associated with each point
in the geometry. The Z value introduces the third dimension coordinate. The Z
values usually are used to indicate the height, or elevation. M values are an
opportunity for a geometry to express a fourth dimension as a coordinate value.
These values can be used as a linear reference value (e.g., highway milepost
value), a timestamp, or some other value as defined by the CRS.

The bounding box is defined by the Thrift struct below as a min/max value pair
of coordinates for each axis. The Z and M values are omitted for 2D geometries.

```thrift
struct BoundingBox {
  /** Min value when edges = PLANAR, westmost value if edges = SPHERICAL */
  1: required double xmin;
  /** Max value when edges = PLANAR, eastmost value if edges = SPHERICAL */
  2: required double xmax;
  /** Min value when edges = PLANAR, southmost value if edges = SPHERICAL */
  3: required double ymin;
  /** Max value when edges = PLANAR, northmost value if edges = SPHERICAL */
  4: required double ymax;
  5: optional double zmin;
  6: optional double zmax;
  7: optional double mmin;
  8: optional double mmax;
}
```

The meaning of each value depends on the `Edges` attribute of the `GEOMETRY` type:
- If Edges is `PLANAR`, the values are literally the actual min/max value from each axis.
- If Edges is `SPHERICAL`, the values for X and Y are `[westmost, eastmost, southmost, northmost]`,
with necessary min/max values for Z and M if needed.
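
For the `PLANAR` case, the bounding box is a plain per-axis min/max, which the following Python sketch illustrates. The helper `planar_bounding_box` and its dictionary output are hypothetical; the sketch does not cover the `SPHERICAL` case, where westmost/eastmost values must be determined differently.

```python
def planar_bounding_box(coords, has_z=False, has_m=False):
    """Compute BoundingBox fields for edges = PLANAR (hypothetical helper).

    `coords` is an iterable of tuples (x, y[, z][, m]) gathered from all
    geometries whose statistics are being collected.
    """
    columns = list(zip(*coords))  # one tuple of values per axis
    names = ["x", "y"] + (["z"] if has_z else []) + (["m"] if has_m else [])
    bbox = {}
    for name, values in zip(names, columns):
        bbox[name + "min"] = min(values)
        bbox[name + "max"] = max(values)
    return bbox


ring = [(10, 10), (50, 10), (50, 20), (10, 20), (10, 10)]
print(planar_bounding_box(ring))
```

Z and M entries appear in the result only when the corresponding flag is set, mirroring the optional `zmin`/`zmax` and `mmin`/`mmax` fields.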

##### Geometry Types

A list of geometry types from all geometries in the `GEOMETRY` column, or an
empty list if they are not known.

This is borrowed from [geometry_types of GeoParquet][geometry-types]
except that values in the list are [WKB (ISO-variant) integer codes][wkb-integer-code].
Table below shows the most common geometry types and their codes:

| Type | XY | XYZ | XYM | XYZM |
| :----------------- | :--- | :--- | :--- | :--: |
| Point | 0001 | 1001 | 2001 | 3001 |
| LineString | 0002 | 1002 | 2002 | 3002 |
| Polygon | 0003 | 1003 | 2003 | 3003 |
| MultiPoint | 0004 | 1004 | 2004 | 3004 |
| MultiLineString | 0005 | 1005 | 2005 | 3005 |
| MultiPolygon | 0006 | 1006 | 2006 | 3006 |
| GeometryCollection | 0007 | 1007 | 2007 | 3007 |

In addition, the following rules are applied:
- A list of multiple values indicates that multiple geometry types are present (e.g. `[0003, 0006]`).
- An empty array explicitly signals that the geometry types are not known.
- The geometry types in the list must be unique (e.g. `[0001, 0001]` is not valid).
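
The codes in the table follow a simple pattern: the thousands digit selects the dimensions and the remainder selects the geometry type. A minimal Python sketch of decoding and validating such a list, with the hypothetical helper name `describe_geometry_types`, might look like this:

```python
WKB_TYPES = {
    1: "Point", 2: "LineString", 3: "Polygon", 4: "MultiPoint",
    5: "MultiLineString", 6: "MultiPolygon", 7: "GeometryCollection",
}
WKB_DIMENSIONS = {0: "XY", 1: "XYZ", 2: "XYM", 3: "XYZM"}


def describe_geometry_types(codes):
    """Map WKB integer codes to (type, dimensions) pairs; reject duplicates."""
    if len(set(codes)) != len(codes):
        raise ValueError("geometry type codes must be unique")
    return [(WKB_TYPES[code % 1000], WKB_DIMENSIONS[code // 1000]) for code in codes]


print(describe_geometry_types([3, 3006]))
```

An empty input list yields an empty result, which a reader should treat as "types not known" rather than "no geometries".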

[geometry-types]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L159
[wkb-integer-code]: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary

#### Geospatial Notes

The Geometry class hierarchy and its WKT and WKB serializations (ISO supporting
XY, XYZ, XYM, XYZM) are defined by [OpenGIS Implementation Specification for
Geographic information – Simple feature access – Part 1: Common architecture](
https://portal.ogc.org/files/?artifact_id=25355), from [OGC (Open Geospatial
Consortium)](https://www.ogc.org/standard/sfa/).

The version of the OGC standard first used here is 1.2.1, but future versions
may also be used if the WKB representation remains wire-compatible.

## UNKNOWN (always null)

Sometimes, when discovering the schema of existing data, values are always null
64 changes: 64 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -237,6 +237,33 @@ struct SizeStatistics {
3: optional list<i64> definition_level_histogram;
}

/**
* Bounding box of geometries in the representation of min/max value pair of
* coordinates from each axis.
*/
struct BoundingBox {
/** Min value when edges = PLANAR, westmost value if edges = SPHERICAL */
1: required double xmin;
/** Max value when edges = PLANAR, eastmost value if edges = SPHERICAL */
2: required double xmax;
/** Min value when edges = PLANAR, southmost value if edges = SPHERICAL */
3: required double ymin;
/** Max value when edges = PLANAR, northmost value if edges = SPHERICAL */
4: required double ymax;
5: optional double zmin;
6: optional double zmax;
7: optional double mmin;
8: optional double mmax;
}

/** Statistics specific to GEOMETRY logical type */
struct GeometryStatistics {
/** A bounding box of geometries */
1: optional BoundingBox bbox;
/** Geometry type codes of all geometries, or an empty list if not known */
2: optional list<i32> geometry_types;
}

/**
* Statistics per row group and per page
* All fields are optional.
@@ -286,6 +313,9 @@ struct Statistics {
7: optional bool is_max_value_exact;
/** If true, min_value is the actual minimum value for a column */
8: optional bool is_min_value_exact;

/** statistics specific to geometry logical type */
9: optional GeometryStatistics geometry_stats;
Contributor

@rdblue rdblue Oct 14, 2024

I don't think this is the right place to include GeometryStatistics. There are a couple reasons:

  1. This is unnecessary nesting to get to the geo stats, making them harder to find
  2. Nesting within Statistics includes geo stats in places where simple Statistics make sense, but geo stats do not. For example, this would be included in page headers in addition to ColumnMetaData (and the page index already removed these)
  3. This doesn't match the approach used for SizeStatistics, which was included directly in ColumnMetaData

I think that this should match the addition of SizeStatistics and should be included as a field in ColumnMetaData.

Member Author

I agree. Actually it was a little bit awkward when I did the PoC impl to nest geo stats into the common stats. I have moved it to ColumnMetaData now.

}

/** Empty structs to use as logical type annotations */
@@ -380,6 +410,38 @@ struct JsonType {
struct BsonType {
}

/** Physical type and encoding for the geometry type */
enum GeometryEncoding {
/**
* Allowed for physical type: BYTE_ARRAY.
*
* Well-known binary (WKB) representations of geometries.
*/
WKB = 0;
}

/** Interpretation for edges of elements of a GEOMETRY type */
enum Edges {
PLANAR = 0;
SPHERICAL = 1;
}

/**
* GEOMETRY logical type annotation (added in 2.11.0)
*
* GeometryEncoding and Edges are required. CRS is optional.
*
* Once CRS is set, it MUST be a key to an entry in the `key_value_metadata`
* field of `FileMetaData`.
Contributor

Why is it required that the CRS is embedded in file metadata? Isn't it clear if the CRS is a well-known one like OGC:CRS84? It seems to me that this resolution should be out of scope. Parquet can encourage that the CRS is documented in file metadata, but other systems could store the definition in a different location. For example, Iceberg could store this in a table property instead of in each data file.

I would prefer to define this string property as a "Coordinate reference system identifier" and not specify how to exchange the PROJJSON or other format definition. I would also add a note that people are encouraged to store it in a location along with the file or table metadata.

Member

@paleolimbot paleolimbot Oct 15, 2024

A string property of "Coordinate reference system identifier" (with a convention, either within this spec or outside it, of where in the file to look for the full definition) would allow for enough detail for GeoSpatial libraries to leverage Parquet.

The need for embedding a full CRS description somewhere that is programmatically accessible by a Parquet implementation is to ensure a producer's intent can be faithfully transported by the consumer. In the C++ implementation we can attach this as extension type metadata that can pass through a pipeline to a consumer that does not have access to the original context (e.g., constructing a GeoPandas GeoDataFrame from a Parquet file that was read and filtered using a non-spatial tool like pyarrow). If that needs to be an external convention (e.g., one that we define in GeoParquet) to get consensus here that is OK (even though I think it would result in less misinterpreted data to have that convention be in the Parquet specification itself).

Alternatively, would removing any conventions or requirements around the string crs be acceptable? (i.e., the producer puts what it needs to put there to ensure that the coordinates in this column are not misinterpreted by the consumer, which may be an identifier or a full CRS definition according to the requirements of the producer?).

Member

@rdblue Although the GeoParquet community would really appreciate the possibility of embedding a full CRS description somewhere in Parquet, we understand that compromises need to be made sometimes.

Like @paleolimbot said, would it be acceptable if we removed all conventions and requirements around the string crs and only allowed this single value in the column metadata?

This means the writer can put whatever they want, but they will need to communicate this to the reader via another channel.

Member Author

@wgtmac wgtmac Oct 16, 2024

> The need for embedding a full CRS description somewhere that is programmatically accessible by a Parquet implementation is to ensure a producer's intent can be faithfully transported by the consumer.

To achieve this, is it possible to reserve some crs values or at least some prefixes? For example, Iceberg may store iceberg.xxx to crs where xxx is an arbitrary crs identifier defined in its table metadata. Similarly, GeoParquet may set geoparquet.xxx to crs and the key must exist in the Parquet file metadata and its associated value is the full CRS.

This still causes fragmentation but it looks better than a strong enforcement. WDYT? @rdblue @jiayuasu @paleolimbot @szehon-ho

Member

maybe iceberg.geo.XXX in table properties and parquet.geo.xxx in parquet file metadata? Not sure if this is allowed though.

Member

> Why is it required that the CRS is embedded in file metadata?

Note that it is not actually required to embed the CRS in the file metadata. This is an optional field, and so as a producer of Parquet files with geospatial data, you are not required to fill it. For example Iceberg, assuming it would already be tracking the CRS elsewhere in Iceberg-specific metadata or a manifest file, could just leave this field blank in the Parquet files themselves.

Of course that makes those files less interoperable (but my understanding is that Parquet files contained in an Iceberg table are generally not meant to be read by a non-Iceberg-aware tool?).
But putting something like iceberg.xxx as the crs value would also not be great for interoperability.

*
* See LogicalTypes.md for detail.
*/
struct GeometryType {
1: required GeometryEncoding encoding;
2: required Edges edges;
3: optional string crs;
}

/**
* LogicalType annotations to replace ConvertedType.
*
@@ -410,6 +472,7 @@ union LogicalType {
13: BsonType BSON // use ConvertedType BSON
14: UUIDType UUID // no compatible ConvertedType
15: Float16Type FLOAT16 // no compatible ConvertedType
16: GeometryType GEOMETRY // no compatible ConvertedType
}

/**
@@ -980,6 +1043,7 @@ union ColumnOrder {
* ENUM - unsigned byte-wise comparison
* LIST - undefined
* MAP - undefined
* GEOMETRY - undefined
*
* In the absence of logical types, the sort order is determined by the physical type:
* BOOLEAN - false, true