-
Notifications
You must be signed in to change notification settings - Fork 435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2471: Add geometry logical type #240
base: master
Are you sure you want to change the base?
Changes from 27 commits
5c9e110
5ef28cd
ecd8cc2
d81dacb
80f4051
0db6d9f
e817af4
f78f7bd
1aaaca8
ee5b2df
19cc081
16c5868
56a65de
f28b282
5127702
298ab64
41c6394
d86abe4
c7a4f4c
f20f685
dbf9d54
b4296aa
9bcea6e
99f0403
dbb78cf
25df0ff
d349727
9ea6559
011de45
6425a3c
7d8ffa5
1502458
9f53c9e
a4f79ca
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -237,6 +237,33 @@ struct SizeStatistics { | |
3: optional list<i64> definition_level_histogram; | ||
} | ||
|
||
/** | ||
* Bounding box of geometries in the representation of min/max value pair of | ||
* coordinates from each axis. | ||
*/ | ||
struct BoundingBox { | ||
wgtmac marked this conversation as resolved.
Show resolved
Hide resolved
|
||
/** Min value when edges = PLANAR, westmost value if edges = SPHERICAL */ | ||
1: required double xmin; | ||
wgtmac marked this conversation as resolved.
Show resolved
Hide resolved
wgtmac marked this conversation as resolved.
Show resolved
Hide resolved
|
||
/** Max value when edges = PLANAR, eastmost value if edges = SPHERICAL */ | ||
2: required double xmax; | ||
/** Min value when edges = PLANAR, southmost value if edges = SPHERICAL */ | ||
3: required double ymin; | ||
/** Max value when edges = PLANAR, northmost value if edges = SPHERICAL */ | ||
4: required double ymax; | ||
5: optional double zmin; | ||
wgtmac marked this conversation as resolved.
Show resolved
Hide resolved
|
||
6: optional double zmax; | ||
7: optional double mmin; | ||
wgtmac marked this conversation as resolved.
Show resolved
Hide resolved
|
||
8: optional double mmax; | ||
} | ||
|
||
/** Statistics specific to GEOMETRY logical type */ | ||
struct GeometryStatistics { | ||
wgtmac marked this conversation as resolved.
Show resolved
Hide resolved
|
||
/** A bounding box of geometries */ | ||
1: optional BoundingBox bbox; | ||
/** Geometry type codes of all geometries, or an empty list if not known */ | ||
2: optional list<i32> geometry_types; | ||
} | ||
|
||
/** | ||
* Statistics per row group and per page | ||
* All fields are optional. | ||
|
@@ -286,6 +313,9 @@ struct Statistics { | |
7: optional bool is_max_value_exact; | ||
/** If true, min_value is the actual minimum value for a column */ | ||
8: optional bool is_min_value_exact; | ||
|
||
/** statistics specific to geometry logical type */ | ||
9: optional GeometryStatistics geometry_stats; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think this is the right place to include
I think that this should match the addition of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree. Actually it was a little bit awkward when I did the PoC impl to nest geo stats into the common stats. I have moved it to |
||
} | ||
|
||
/** Empty structs to use as logical type annotations */ | ||
|
@@ -380,6 +410,38 @@ struct JsonType { | |
struct BsonType { | ||
} | ||
|
||
/** Physical type and encoding for the geometry type */ | ||
enum GeometryEncoding { | ||
/** | ||
* Allowed for physical type: BYTE_ARRAY. | ||
* | ||
* Well-known binary (WKB) representations of geometries. | ||
*/ | ||
WKB = 0; | ||
wgtmac marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
|
||
/** Interpretation for edges of elements of a GEOMETRY type */ | ||
enum Edges { | ||
PLANAR = 0; | ||
SPHERICAL = 1; | ||
} | ||
|
||
/** | ||
* GEOMETRY logical type annotation (added in 2.11.0) | ||
* | ||
* GeometryEncoding and Edges are required. CRS is optional. | ||
* | ||
* Once CRS is set, it MUST be a key to an entry in the `key_value_metadata` | ||
* field of `FileMetaData`. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is it required that the CRS is embedded in file metadata? Isn't it clear if the CRS is a well-known one like I would prefer to define this string property as a "Coordinate reference system identifier" and not specify how to exchange the PROJJSON or other format definition. I would also add a note that people are encouraged to store it in a location along with the file or table metadata. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A string property of "Coordinate reference system identifier" (with a convention, either within this spec or outside it, of where in the file to look for the full definition) would allow for enough detail for GeoSpatial libraries to leverage Parquet. The need for embedding a full CRS description somewhere that is programatically accessible by a Parquet implementation is to ensure a producer's intent can be faithfully transported by the consumer. In the C++ implementation we can attach this as extension type metadata that can pass through a pipeline to a consumer that does not have access to the original context (e.g., constructing a GeoPandas GeoDataFrame from a Parquet file that was read and filtered using a non-spatial tool like pyarrow). If that needs to be an external convention (e.g., one that we define in GeoParquet) to get consensus here that is OK (even though I think it would result in less misinterpreted data to have that convention be in the Parquet specification itself). Alternatively, would removing any conventions or requirements around the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @rdblue Although the GeoParquet community would really appreciate the possibility of embedding a full CRS description somewhere in Parquet, we understand that compromises need to be made sometimes. Like @paleolimbot said, will it be acceptable that if we remove any conventions or requirements around the This means, the writer can put whatever they want but they will need to communicate this to the reader via other channel. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
To achieve this, is it possible to reserve some crs values or at least some prefixes? For example, Iceberg may store This still causes fragmentation but it looks better than a strong enforcement. WDYT? @rdblue @jiayuasu @paleolimbot @szehon-ho There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Note that it is not actually required to embed CRS in the file metadata. This is an optional field, and so as a producer of Parquet files with geospatial data, you are not required to fill it. For example Iceberg, assuming it would already be tracking the CRS elsewhere in Iceberg-specific metadata or manifest file, could just leave this field blank in the parquet files itself. Of course that makes those files less interoperable (but my understanding is that parquet files contained in an Iceberg table generally are not meant to be read by another non-Iceberg aware tool?). |
||
* | ||
* See LogicalTypes.md for detail. | ||
*/ | ||
struct GeometryType { | ||
1: required GeometryEncoding encoding; | ||
wgtmac marked this conversation as resolved.
Show resolved
Hide resolved
|
||
2: required Edges edges; | ||
wgtmac marked this conversation as resolved.
Show resolved
Hide resolved
|
||
3: optional string crs; | ||
wgtmac marked this conversation as resolved.
Show resolved
Hide resolved
pitrou marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
|
||
/** | ||
* LogicalType annotations to replace ConvertedType. | ||
* | ||
|
@@ -410,6 +472,7 @@ union LogicalType { | |
13: BsonType BSON // use ConvertedType BSON | ||
14: UUIDType UUID // no compatible ConvertedType | ||
15: Float16Type FLOAT16 // no compatible ConvertedType | ||
16: GeometryType GEOMETRY // no compatible ConvertedType | ||
} | ||
|
||
/** | ||
|
@@ -980,6 +1043,7 @@ union ColumnOrder { | |
* ENUM - unsigned byte-wise comparison | ||
* LIST - undefined | ||
* MAP - undefined | ||
* GEOMETRY - undefined | ||
* | ||
* In the absence of logical types, the sort order is determined by the physical type: | ||
* BOOLEAN - false, true | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I work with Salesforce Data Cloud team, and evaluating GeoSpatial support in iceberg)
I am new to geospatial world, and wondering what does it mean for edges to be independent of underlying CRS? Can the edges be planar while the CRS is based on elliptic geometry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle, no. First, talking about "planar edges" or "spherical edges" makes no sense and was a confusion of terms in the initial draft of this specification (the group reached an agreement to fix that in recent talks, I hope it will be done before release). An edge can be a straight line, a curve, a geodesic, etc., but cannot be a plane or a sphere (because of wrong number of dimensions).
What the initial draft intended to say with "planar edges" (sic) is "edges computed as if they were in a planar (two-dimensional Cartesian) coordinate system" (the thing that is planar is the coordinate system, not the edges). This is not really correct for geographic CRS, so you are right to said that they are not really independent. However, while it would be more exact to said that lines on a geographic CRS are geodesics, loxodrome, etc., it happens often that software ignore that physical reality and just perform linear interpolations of latitude and longitude values. The line on the ellipsoid surface obtained that way has no interesting properties, it is just easy to compute. We do not recommend doing that, but the use of "planar" word in this context was an acknowledgement that it happens in practice and an attempt to describe that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the response. I do not understand what this parameter is used for in parquet. If it is the engine's property to treat the edges, how is this value helping? The engine capable of interpreting edges as geodesics should do so if the CRS reference indicates that the underlying geometry column belongs to an ellipsoid datum. Is this edge property forcing the engine to treat the values in a planar coordinate system?
In other words, is there something intrinsic to the data stored in the parquet file itself where edge parameter makes a difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@redblackcoder
The Geo and Iceberg community are discussing the best way to describe this field. It is very likely that we will want to rename
edges
property to something else because this is not what we want to describe initially. We will post updates in a few days.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider the most common case, SRID 4326. It is Geographic coordinate system (GEOGCS) rather than Projected one.
https://www.esri.com/arcgis-blog/products/arcgis-pro/mapping/gcs_vs_pcs/
So the linestring from A to B should follow the geodesic line. But most systems treat 4326 as planar map. E.g. with Geometry type in PostGIS or MS SQL Server, they treat it as projected coordinate system, and the linestrings follow straight lines on flat surface. If you use latest MySQL or Geography type in PostGIS or MS SQL Server, the linestrings in 4326 follow geodesic lines on sphere. So there is ambiguity what exactly a linestring or polygon in 4326 describes. Is
'point(30 21)
insidepolygon((10 10, 50 10, 50 20, 10 20, 10 10))
?With geometry, in PostGIS, returns false:
Same thing with geography (4326 is presumed), returns true:
Unfortunately, there is no accepted way to describe the difference between geometry and geography in WKB format. You can encounter SRID=4326 with both interpretations. The
edge
attribute allows describing the difference between geometry and geography, and tells user how to interpret the data in a way consistent with the system that produced it.