New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add export to GeoArrow #3219
Conversation
How does this benchmark against shapely/shapely#1953 ? The export half of that is currently fine (the import half is where I'm running into a lifecycle constraint). |
I think that PR should be an implementation detail. It seems that we should be able to switch from I suppose one question is whether the GeoPandas export creates extension arrays or not. |
Got it! I missed that detail 😳 . I'll more actively poke away at that PR since I think it does speed things up a bit. I think extension arrays would be nice (and would increase the chance that the CRS is not dropped), although that would require |
hmm, you're right that does complicate this a bit. I'm guessing that if the extension type isn't registered the field is dropped through the c data interface? I'm not sure if geopandas should have a hard dependency on geoarrow-pyarrow (I assume that in 1.0 it'll still have an optional dep on pyarrow). Is there some way we can do this that'll work the same whether the user has geoarrow-pyarrow installed or not? |
The registration system causes a similar problem in R, where It seems strange, but I wonder if geoarrow-pyarrow should not depend on pyarrow (such that it could be hard dependency with a minimal module that registers an import hook after |
Not at all. CRS is stored at the GeometryArray level and each column can have its own. |
Hmm, that's interesting. As long as it's ok for the registration to happen multiple times, then geopandas could strictly depend on geoarrow-pyarrow and always register the extension types? I figure that geopandas would also need to construct classes exported from geoarrow-pyarrow? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for starting this!
- Agreed with focusing on the API here, regardless of upcoming optimizations or new features on the shapely side. I think we should start here with whatever is available in released shapely, and can later update it for new shapely developments.
- We might want to add a similar method on GeoSeries as well?
to_arrow()
orto_geoarrow()
?- Obviously will need some tests ;)
For RecordBatches, that shouldn't be the case (or if so, that sounds as a bug). But pyarrow.Array objects (not part of an object with a Schema), that will indeed not preserve that information .. I kind of like the idea of returning non-extension type data but just with the field annotations. However, if we return pyarrow objects and add a similar method for GeoSeries / GeometryArray, that's an issue .. This might also be a reason to not return pyarrow objects, but a generic wrapper object with the dunder methods. Then it is up to the user to decide which Arrow implementation or which extension types to use when importing that data. Another problem with the extension types, even if we would add |
If you support GeoArrow as a storage type for a You could try it and see if there are problems, but really think that we not want randomly dropped extension type metadata depending on the user's runtime and/or which order (or if) they imported |
Unless you just want support for preserving the geoarrow data in conversion to pandas, but don't have the whole geo stack installed.. Although we are making geopandas leaner to install, and the core just depends on pandas and shapely, which is quite small (the default install still comes with GDAL, PROJ, and its bindings, though, you have do some effort to only get the minimal install) But yes, I see how ideally, when dealing with and passing around pyarrow objects, we only have a single canonical And I have to re-read our thoughts in geoarrow/geoarrow-python#38. Ideally geopandas could also depend on something just for the extension types (while |
This is good discussion! If we do want to go down a path of exposing capsule-based objects, I wonder if it would be easiest to describe to users a two-step process of "exporting" to Arrow and then "importing" into the destination library. Maybe that informs the API here?
So this would be something like this? (pseudocode)
I like the minimalism of it, where it doesn't need a new dependency (besides pyarrow) but presumably would work with geoarrow-c, geoarrow-pyarrow, and geoarrow-rust. So |
Any temporary gripes about the number of dependencies required to
Could the |
This does sound nice (I'm guessing that pyarrow will always prefer |
One could use |
Yes, that's potentially an option, but it's hard to handle that schema negotiation without extra information about the set of geometry types in the source data, and that isn't known without a scan through the data. |
It could be WKB by default then? That would be more stable since it would be guaranteed to be the same type each time and has generally wider support. |
Yes but in the future when GeoArrow-native support is more widespread, we may want to default to exporting GeoArrow-native geometries, and changing that from WKB would be a breaking change. That's why I'm more inclined to not yet define PyCapsule methods on the GeoSeries directly. For now, let the user decide which output format to use. |
Proposal to move forward with this on the short term with an initial version that returns Arrow data with the metadata (and so not registered extension types):
For what to actually return, we have two options:
I would propose to go with the second option for now. This is a bit more cumbersome to use right now for end users, but gives us more freedom for the future (i.e. we can still decide to return pyarrow objects later (eg if pandas makes that a required dependency) without breaking user code, but not the other way around) |
I agree that geoarrow-pyarrow is not a good fit right now (and won't be until I or somebody else has time to focus efforts there and give it the requisite number of releases to iron out bugs to a level suitable for geopandas), and that exposing a "dummy" array object with the right dunder methods is the best solution. Even if this is a little more cumbersome, the workaround is compact (as you noted). I also am an obvious fan of extension types: I love the possibility that Geospatial data can be a first-class citizen in Arrow-based libraries (that support extension types) without dropping CRS metadata. I doubt there will be significant geoarrow-pyarrow usage until it is ready to be a candidate for geopandas dependency, at which point we can engineer a workaround ( |
I concur that geoarrow-pyarrow is not stable enough to be used by geopandas, especially as geopandas moves to 1.0.
How so? It seems all these cases need the actual data as well.
I think showing how the integration just works with writing to GDAL would be an impressive example. E.g. both this: gdf = gpd.GeoDataFrame(...)
pyogrio.write_arrow(gdf.to_arrow()) and also through the OGR Python bindings support for Arrow, is quite cool.
I'm +1 on this. This also gives us more time to see if other libraries like Polars adopt the PyCapsule API directly, where
off topic but I thought that was already decided? |
Sorry, I meant only the metadata and not as a registered extension type. Of course everyone wants the actual data as well ;) |
Yes, but then there were second thoughts .. ;) (pandas-dev/pandas#54466, pandas-dev/pandas#57073) |
FWIW I also have a draft PR on top of this to add GeoParquet support (#3275). It does need a bit ore, like the option to have separated instead of interleaved coordinates. |
Oh I see. Yeah, I got used to maintaining extension metadata on the field separately from the arrays because arrow-rs doesn't have the concept of extension arrays. |
I think this should be mostly ready. There are a few small remaining TODOs (making fields non-nullable, removing empty crs metadata), but for those I first need to fix the test data. Then there is the remaining API question:
|
If ``False``, the index(es) will not be written to the file. | ||
If ``None``, the index(ex) will be included as columns in the file | ||
output except `RangeIndex` which is stored as metadata only. | ||
geometry_encoding : {'WKB', 'geoarrow' }, default 'WKB' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the idea behind WKB being the default? Wouldn't it make more sense to default to GeoArrow? Going forward, I suppose that GeoArrow is the encoding of interest and since this is a new method, we don't need to think about backwards compatibility as we do with Parquet IO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WKB has the advantage that any GeoSeries
can be converted to it (and that two distinct GeoSeries
that are converted to it, perhaps containing different geometry types, can be combined with an Arrow C++ array concatenate operation). I love the GeoArrow encoding but I can see how users might have the most success opting in when it makes sense for them to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For completeness, we've also been discussing a mixed type (backed by an arrow union) that can hold multiple underlying arrays. That would always work here, but shapely doesn't support export to it yet.
geopandas/io/geoarrow.py
Outdated
|
||
df_attr = pd.DataFrame(df.copy(deep=False)) | ||
|
||
# replace geometry columsn with dummy values -> will get converted to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# replace geometry columsn with dummy values -> will get converted to | |
# replace geometry columns with dummy values -> will get converted to |
geopandas/io/geoarrow.py
Outdated
|
||
|
||
def geopandas_to_arrow( | ||
df, index=None, geometry_encoding="WKB", include_z=None, interleaved=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
include_z
is not exposed in to_arrow
and is omitted from the docstring.
geopandas/io/geoarrow.py
Outdated
@@ -106,9 +106,6 @@ def geopandas_to_arrow( | |||
extension_metadata["ARROW:extension:metadata"] = json.dumps( | |||
{"crs": crs} | |||
) | |||
else: | |||
# TODO we shouldn't do this, just to get testing passed for now | |||
extension_metadata["ARROW:extension:metadata"] = "{}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just discovered this and am still investigating, but we might actually need this. (If an extension type is registered with metadata but an instance is attempted to be created without metadata, Arrow C++ might throw an exception). See geoarrow/geoarrow-c#94
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's annoying ... I just removed all empty "ARROW:extension:metadata" from the test files
(but a good catch, and a reminder for me I should actually add a test using geoarrow to consume this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, reverted this back for compatibility
I keep trying to find time to review this. I'll try to get to this in the next couple of days |
@kylebarron I suppose we should also add a similar method to GeoSeries (just returning the geometry array and not as a tabular struct array)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
yeah absolutely |
Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
geopandas/io/tests/test_geoarrow.py
Outdated
geom_type = left["geometry"].type | ||
# in case of Points (directly the inner fixed_size_list or struct type) | ||
# -> there are NaNs for empties -> we need to compare them separately | ||
# and then fill, because pyarrow.Table.equals considers NaNs as not equal | ||
if pa.types.is_fixed_size_list(geom_type): | ||
left_values = left["geometry"].chunk(0).values | ||
right_values = right["geometry"].chunk(0).values | ||
assert left_values.is_nan().equals(right_values.is_nan()) | ||
left_geoms = pa.FixedSizeListArray.from_arrays( | ||
pc.replace_with_mask(left_values, left_values.is_nan(), 0.0), | ||
type=left["geometry"].type, | ||
) | ||
right_geoms = pa.FixedSizeListArray.from_arrays( | ||
pc.replace_with_mask(right_values, right_values.is_nan(), 0.0), | ||
type=right["geometry"].type, | ||
) | ||
left = left.set_column(1, left.schema.field("geometry"), left_geoms) | ||
right = right.set_column(1, right.schema.field("geometry"), right_geoms) | ||
|
||
elif pa.types.is_struct(geom_type): | ||
left_arr = left["geometry"].chunk(0) | ||
right_arr = right["geometry"].chunk(0) | ||
|
||
for i in range(left_arr.type.num_fields): | ||
assert left_arr.field(i).is_nan().equals(right_arr.field(i).is_nan()) | ||
|
||
left_geoms = pa.StructArray.from_arrays( | ||
[ | ||
pc.replace_with_mask(left_arr.field(i), left_arr.field(i).is_nan(), 0.0) | ||
for i in range(left_arr.type.num_fields) | ||
], | ||
fields=list(left["geometry"].type), | ||
) | ||
right_geoms = pa.StructArray.from_arrays( | ||
[ | ||
pc.replace_with_mask( | ||
right_arr.field(i), right_arr.field(i).is_nan(), 0.0 | ||
) | ||
for i in range(right_arr.type.num_fields) | ||
], | ||
fields=list(right["geometry"].type), | ||
) | ||
|
||
left = left.set_column(1, left.schema.field("geometry"), left_geoms) | ||
right = right.set_column(1, right.schema.field("geometry"), right_geoms) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a horrible piece of code that I wrote, but now at least we are testing the point case that they have NaNs. And taking this upstream to pyarrow so we can simplify this code in the future
I am going to merge this now, to make it easier to finish the dependent PRs (import, parquet writing), but let's further tweak the interface in follow-ups if needed! Thanks Kyle for getting this started and all for the feedback. |
Partial implementation of #3156. Adds a
to_arrow
method to export to GeoArrow.Adds a
geometry_encoding
parameter for either WKB or GeoArrow, defaulting to GeoArrow. Adds PROJJSON-encoded CRS, which is the current standard pending geoarrow/geoarrow#40.Notes:
to_geoparquet
path, so GeoParquet metadata still gets set on the table metadata, even though that isn't required for GeoArrow.