
ENH: Add export to GeoArrow #3219

Merged (39 commits) on May 24, 2024

Conversation

kylebarron
Contributor

Partial implementation of #3156. Adds a to_arrow method to export to GeoArrow.

Adds a geometry_encoding parameter for either WKB or GeoArrow, defaulting to GeoArrow. Adds PROJJSON-encoded CRS, which is the current standard pending geoarrow/geoarrow#40.

Notes:

  • Right now this isn't totally separate from the to_geoparquet path, so GeoParquet metadata still gets set on the table metadata, even though that isn't required for GeoArrow.
  • If there are multiple geometry columns, are they guaranteed to have the same CRS? It seems like there's a CRS on the table but not a GeoSeries.

@paleolimbot

How does this benchmark against shapely/shapely#1953? The export half of that is currently fine (the import half is where I'm running into a lifecycle constraint).

@kylebarron
Contributor Author

I think that PR should be an implementation detail. It seems that we should be able to switch from to_ragged_array to your implementation when it's stable/merged without changing the user-facing API here. Or maybe I am missing something?

I suppose one question is whether the GeoPandas export creates extension arrays or not.

@paleolimbot

Got it! I missed that detail 😳 . I'll more actively poke away at that PR since I think it does speed things up a bit.

I think extension arrays would be nice (and would increase the chance that the CRS is not dropped), although that would require geoarrow.pyarrow for the extension type registration.

@kylebarron
Contributor Author

Hmm, you're right, that does complicate this a bit. I'm guessing that if the extension type isn't registered, the field is dropped through the C data interface?

I'm not sure if geopandas should have a hard dependency on geoarrow-pyarrow (I assume that in 1.0 it'll still have an optional dep on pyarrow). Is there some way we can do this that'll work the same whether the user has geoarrow-pyarrow installed or not?

@paleolimbot

The registration system causes a similar problem in R, where as_nanoarrow_array() or arrow::as_arrow_array() will do the wrong thing if the geoarrow package hasn't been loaded. For to_arrow() it seems pretty straightforward (pyarrow is loaded anyway and geoarrow.pyarrow is tiny compared to that)...for __arrow_c_xxx__ it would take some workshopping.

It seems strange, but I wonder if geoarrow-pyarrow should not depend on pyarrow (such that it could be a hard dependency with a minimal module that registers an import hook after pyarrow is loaded to also load its non-minimal module). (Just spitballing here).
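The post-import-hook idea spitballed above can be sketched with stdlib machinery alone. Everything here is hypothetical (the class names and the callback, which stands in for the extension-type registration); it only illustrates the mechanism of piggybacking on another package's import:

```python
import importlib.abc
import importlib.util
import sys


class _CallbackLoader(importlib.abc.Loader):
    """Wrap the real loader; run a callback after the module body executes."""

    def __init__(self, loader, callback):
        self._loader = loader
        self._callback = callback

    def create_module(self, spec):
        return self._loader.create_module(spec)

    def exec_module(self, module):
        self._loader.exec_module(module)
        self._callback(module)


class PostImportHook(importlib.abc.MetaPathFinder):
    """Meta-path finder: let `target` import normally, then fire `callback`."""

    def __init__(self, target, callback):
        self._target = target
        self._callback = callback
        self._resolving = False  # guard against re-entrant find_spec

    def find_spec(self, fullname, path=None, target=None):
        if fullname != self._target or self._resolving:
            return None
        self._resolving = True
        try:
            spec = importlib.util.find_spec(fullname)
        finally:
            self._resolving = False
        if spec is None or spec.loader is None:
            return None
        spec.loader = _CallbackLoader(spec.loader, self._callback)
        return spec


# e.g. sys.meta_path.insert(0, PostImportHook("pyarrow", register_types))
```

A minimal shim package could install this hook for "pyarrow" at import time, with the callback importing the full (pyarrow-dependent) registration module.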

@martinfleis
Member

If there are multiple geometry columns, are they guaranteed to have the same CRS? It seems like there's a CRS on the table but not a GeoSeries.

Not at all. CRS is stored at the GeometryArray level and each column can have its own. GeoDataFrame.crs retrieves CRS of the active geometry column only.

@martinfleis added this to the 1.0 milestone Mar 14, 2024
@kylebarron
Contributor Author

It seems strange, but I wonder if geoarrow-pyarrow should not depend on pyarrow (such that it could be a hard dependency with a minimal module that registers an import hook after pyarrow is loaded to also load its non-minimal module). (Just spitballing here).

Hmm, that's interesting. As long as it's ok for the registration to happen multiple times, then geopandas could strictly depend on geoarrow-pyarrow and always register the extension types? I figure that geopandas would also need to construct classes exported from geoarrow-pyarrow?

@jorisvandenbossche
Member

Thanks for starting this!

  • Agreed with focusing on the API here, regardless of upcoming optimizations or new features on the shapely side. I think we should start here with whatever is available in released shapely, and can later update it for new shapely developments.
  • We might want to add a similar method on GeoSeries as well?
  • to_arrow() or to_geoarrow()?
  • Obviously will need some tests ;)

Review threads (outdated, resolved): geopandas/io/geoarrow.py, geopandas/io/arrow.py (×3)
@jorisvandenbossche
Member

I'm guessing that if the extension type isn't registered the field is dropped through the c data interface?

For RecordBatches, that shouldn't be the case (or if so, that sounds like a bug). But for pyarrow.Array objects (not part of an object with a Schema), that information will indeed not be preserved ..


I kind of like the idea of returning non-extension type data but just with the field annotations. However, if we return pyarrow objects and add a similar method for GeoSeries / GeometryArray, that's an issue ..

This might also be a reason to not return pyarrow objects, but a generic wrapper object with the dunder methods. Then it is up to the user to decide which Arrow implementation or which extension types to use when importing that data.

Another problem with the extension types, even if we would add geoarrow-pyarrow as a dependency to geopandas, is that eventually we might want to control that extension type in geopandas itself .. For example, we will eventually want to add a pandas ExtensionArray backed by (Py)Arrow memory (instead of GEOS objects), and at that point, we might want to have a pyarrow extension type that maps to that pandas extension instead of geoarrow-pandas.

@paleolimbot

paleolimbot commented Mar 19, 2024

Another problem with the extension types, even if we would add geoarrow-pyarrow as a dependency to geopandas, is that eventually we might want to control that extension type in geopandas itself

If you support GeoArrow as a storage type for a GeoSeries, there's no need for geoarrow-pandas to exist!

You could try it and see if there are problems, but I really think that we do not want randomly dropped extension type metadata depending on the user's runtime and/or the order in which (or whether) they imported geoarrow.pyarrow. I'm pretty happy to update (or review any update to) geoarrow-pyarrow that makes it easier to depend on; however, I don't think it is productive to duplicate geoarrow-pyarrow.

@jorisvandenbossche
Member

If you support GeoArrow as a storage type for a GeoSeries, there's no need for geoarrow-pandas to exist!

Unless you just want support for preserving the geoarrow data in conversion to pandas, but don't have the whole geo stack installed.. Although we are making geopandas leaner to install, and the core just depends on pandas and shapely, which is quite small (the default install still comes with GDAL, PROJ, and its bindings, though; you have to do some effort to only get the minimal install)

But yes, I see how ideally, when dealing with and passing around pyarrow objects, we only have a single canonical geoarrow pyarrow extension type implementation that gets reused.
We could start with adding geoarrow-pyarrow as an optional dependency here

And I have to re-read our thoughts in geoarrow/geoarrow-python#38. Ideally geopandas could also depend on something just for the extension types (while geoarrow-pyarrow already has other (compute) functionality as well)

@kylebarron
Contributor Author

This is good discussion!

If we do want to go down a path of exposing capsule-based objects, I wonder if it would be easiest to describe to users a two-step process of "exporting" to Arrow and then "importing" into the destination library. Maybe that informs the API here?

I kind of like the idea of returning non-extension type data but just with the field annotations

So this would be something like this? (pseudocode)

from dataclasses import dataclass

import pyarrow


@dataclass
class GeoPandasArrowArray:
    array: pyarrow.Array
    field: pyarrow.Field

    def __arrow_c_array__(self, requested_schema=None):
        schema_capsule = self.field.__arrow_c_schema__()
        array_capsule = self.array.__arrow_c_array__()[1]
        return schema_capsule, array_capsule

I like the minimalism of it, where it doesn't need a new dependency (besides pyarrow) but presumably would work with geoarrow-c, geoarrow-pyarrow, and geoarrow-rust.

So to_arrow on a GeoSeries would return this GeoPandasArrowArray? Then would to_arrow on the GeoDataFrame return a pyarrow table or a similar wrapper class? Calling __arrow_c_stream__ on the Table object would maintain the data type, but once you get the geometry column as a ChunkedArray you'd have lost the extension types, right?

@paleolimbot

paleolimbot commented Mar 20, 2024

Unless you just want support for preserving the geoarrow data in conversion to pandas, but don't have the whole geo stack installed

Any temporary gripes about the number of dependencies required for some_arrow_array.to_pandas() are, I think, massively outweighed by having it return a GeoSeries and having that be the canonical and only route. The role of geoarrow-pandas could still be to provide the extension array implementation, which could avoid a pyarrow dependency via nanoarrow or geoarrow-c or geoarrow-rust.

So to_arrow on a GeoSeries would return this GeoPandasArrowArray?

Could the GeoSeries implement __arrow_c_array__ and/or __arrow_c_stream__ for C consumers and __arrow_array__ for pyarrow consumers? That would ensure that pyarrow users get an extension array with the extension type registered but C consumers (who might care less about registration) just get a schema with metadata.
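As a toy illustration of that split (all names and payloads here are placeholders: a real __arrow_array__ would return a pyarrow extension array, and a real __arrow_c_array__ would return a pair of PyCapsules with the extension name/metadata on the schema):

```python
class FakeGeoSeries:
    """Stand-in for a GeoSeries exposing both export protocols."""

    def __arrow_array__(self, type=None):
        # pyarrow consumers: would return a pyarrow (extension) array
        return ("extension-array", type)

    def __arrow_c_array__(self, requested_schema=None):
        # C consumers: would return (schema_capsule, array_capsule)
        return ("schema-capsule", "array-capsule")


def consume(obj):
    # Assuming pyarrow indeed prefers __arrow_array__ when present (a
    # question raised below); a generic C-level consumer would instead
    # use the capsule protocol directly.
    if hasattr(obj, "__arrow_array__"):
        return obj.__arrow_array__()
    return obj.__arrow_c_array__()
```

This is only the dispatch shape; whether pyarrow always prefers __arrow_array__ is the open question in the next comment.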

@kylebarron
Contributor Author

Could the GeoSeries implement __arrow_c_array__ and/or __arrow_c_stream__ for C consumers and __arrow_array__ for pyarrow consumers? That would ensure that pyarrow users get an extension array with the extension type registered but C consumers (who might care less about registration) just get a schema with metadata.

This does sound nice (I'm guessing that pyarrow will always prefer __arrow_array__?) though we were considering starting with GeoSeries.to_arrow to allow the user to choose between WKB and native geometry encoding.

@paleolimbot

to allow the user to choose between WKB and native geometry encoding.

One could use requested_schema (for the C protocol) or type (for the __arrow_array__ protocol)? Maybe defaulting to (inferred) GeoArrow native encoding but allowing a type request of wkb()? That would also let a caller opt out of extension types via pa.array(some_geoseries, pa.binary()).

@kylebarron
Contributor Author

One could use requested_schema (for the C protocol) or type (for the __arrow_array__ protocol)? Maybe defaulting to (inferred) GeoArrow native encoding but allowing a type request of wkb()? That would also let a caller opt out of extension types via pa.array(some_geoseries, pa.binary()).

Yes, that's potentially an option, but it's hard to handle that schema negotiation without extra information about the set of geometry types in the source data, and that isn't known without a scan through the data.

@paleolimbot

Yes, that's potentially an option, but it's hard to handle that schema negotiation without extra information about the set of geometry types in the source data, and that isn't known without a scan through the data.

It could be WKB by default then? That would be more stable since it would be guaranteed to be the same type each time and has generally wider support.

@kylebarron
Contributor Author

It could be WKB by default then? That would be more stable since it would be guaranteed to be the same type each time and has generally wider support.

Yes but in the future when GeoArrow-native support is more widespread, we may want to default to exporting GeoArrow-native geometries, and changing that from WKB would be a breaking change. That's why I'm more inclined to not yet define PyCapsule methods on the GeoSeries directly. For now, let the user decide which output format to use.

@jorisvandenbossche
Member

Proposal to move forward with this in the short term with an initial version that returns Arrow data with the metadata (and so not registered extension types):

  • I would like to get something in quickly (ideally in the next two weeks for geopandas 1.0), and so in the short term I don't think we can rely on geoarrow-pyarrow. My feeling is that it would need a bit more time to become a dependency of geopandas. We can still later see how we can deduplicate code that might exist in both geopandas and geoarrow-pyarrow.
  • In addition, I think for quite some use cases, at least the ones we are now thinking about ourselves (Kyle for lonboard / geoarrow-rust, geopandas itself for writing to geoparquet), you might actually only need the metadata. So for those use cases, this initial version will work fine. And I think the initial priority is to enable those use cases.

For what to actually return, we have two options:

  • Return pyarrow objects, but with just the field metadata for extension types and not registered types.
    • Advantage is that this is a more usable object to work with for users and will be supported by more libraries, but disadvantage is that it ties us to pyarrow
    • The main problem is that right now it's a bit difficult to actually convert columns with metadata to the custom (registered) extension type. So people that do use geoarrow-pyarrow can't easily get what they want, AFAIK.
    • If we want to return the extension types in the future, we could add an option for it now and make that required to specify (though this is a bit cumbersome ..), to avoid a breaking change in the future. Because switching to extension types would make the metadata go away, which other consumers of the data might rely upon
  • Don't return pyarrow objects but some wrapper that only exposes the dunder methods of the capsule interface
    • Disadvantage is that users always have to do an additional call to turn the object into something usable, but the advantage is that this doesn't tie us to pyarrow (e.g. we would be free in the future to replace this piece of code with nanoarrow, or with the capsule that shapely might give us directly, or ..), i.e. we are free to change that under the hood without breaking user code that assumes the return value is a pyarrow object.

    • The second point is also a disadvantage right now, because pyarrow tables are more widespread supported compared to Arrow PyCapsule interface (for example, polars will be able to convert the pyarrow table, but not yet the wrapper object)

    • This gives a clearer option for someone who wants the extension types (which we can then document), because they can do:

      import pyarrow as pa
      import geoarrow.pyarrow  # this registers the extension types
      
      table = pa.table(gdf.to_arrow(encoding="geoarrow"))

      and the step through the C Data Interface back to pyarrow will ensure it sees the metadata and uses the registered extension types.

I would propose to go with the second option for now. This is a bit more cumbersome to use right now for end users, but gives us more freedom for the future (i.e. we can still decide to return pyarrow objects later (eg if pandas makes that a required dependency) without breaking user code, but not the other way around)
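A minimal sketch of that second option (the class name and constructor argument are hypothetical; the wrapper only forwards the PyCapsule dunder, so the backing implementation can change later without affecting callers):

```python
class ArrowTable:
    """Generic return-value wrapper exposing only the PyCapsule interface."""

    def __init__(self, table):
        # Today this could hold a pyarrow.Table; under the hood it could
        # later be swapped for nanoarrow or something else entirely.
        self._table = table

    def __arrow_c_stream__(self, requested_schema=None):
        # Delegate to whatever object we hold, as long as it speaks the
        # Arrow C stream protocol.
        return self._table.__arrow_c_stream__(requested_schema)
```

Any consumer that accepts the capsule protocol (pyarrow via pa.table(...), and others as they adopt it) can then import from this wrapper.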

@paleolimbot

I agree that geoarrow-pyarrow is not a good fit right now (and won't be until I or somebody else has time to focus efforts there and give it the requisite number of releases to iron out bugs to a level suitable for geopandas), and that exposing a "dummy" array object with the right dunder methods is the best solution. Even if this is a little more cumbersome, the workaround is compact (as you noted).

I also am an obvious fan of extension types: I love the possibility that geospatial data can be a first-class citizen in Arrow-based libraries (that support extension types) without dropping CRS metadata. I doubt there will be significant geoarrow-pyarrow usage until it is ready to be a candidate geopandas dependency, at which point we can engineer a workaround (nanoarrow.c_schema(type).metadata should work everywhere).

@kylebarron
Contributor Author

I concur that geoarrow-pyarrow is not stable enough to be used by geopandas, especially as geopandas moves to 1.0.

might actually only need the metadata

How so? It seems all these cases need the actual data as well.

some use cases

I think showing how the integration just works when writing to GDAL would be an impressive example. E.g. both this:

gdf = gpd.GeoDataFrame(...)
pyogrio.write_arrow(gdf.to_arrow())

and also going through the OGR Python bindings' support for Arrow, would be quite cool.

  • Don't return pyarrow objects but some wrapper that only exposes the dunder methods of the capsule interface

I'm +1 on this. This also gives us more time to see if other libraries like Polars adopt the PyCapsule API directly, where pl.from_arrow(gdf.to_arrow()) would work out of the box. But I think the benefits of allowing us freedom to move in the future without making backwards-incompatible changes are quite valuable. And for now it's only one extra line of code for users.

return pyarrow objects later (eg if pandas makes that a required dependency)

Off topic, but I thought that was already decided?

@jorisvandenbossche
Member

might actually only need the metadata

How so? It seems all these cases need the actual data as well.

Sorry, I meant only the metadata and not as a registered extension type. Of course everyone wants the actual data as well ;)
But so essentially what you did here, i.e. create data with field metadata, not using a pyarrow extension type.
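Concretely, "only the metadata" means plain field-level annotations like these (the helper name is hypothetical; the two keys follow the Arrow extension-type convention, with the CRS as PROJJSON per the PR description):

```python
import json


def geoarrow_field_metadata(extension_name, crs=None):
    """Field-level GeoArrow annotations, with no registered extension type.

    `extension_name` is e.g. "geoarrow.wkb" or "geoarrow.point"; `crs` is
    a PROJJSON dict, as mentioned in the PR description.
    """
    metadata = {"ARROW:extension:name": extension_name}
    if crs is not None:
        metadata["ARROW:extension:metadata"] = json.dumps({"crs": crs})
    return metadata
```

Consumers that understand GeoArrow can read these keys off the field; everyone else just sees binary or list columns.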

@jorisvandenbossche
Member

return pyarrow objects later (eg if pandas makes that a required dependency)

off topic but I thought that was already decided?

Yes, but then there were second thoughts .. ;) (pandas-dev/pandas#54466, pandas-dev/pandas#57073)
(I just wrote a PDEP last week for pandas 3.0 to keep a fallback for the "string" dtype in case pyarrow is not installed)

@jorisvandenbossche
Member

FWIW I also have a draft PR on top of this to add GeoParquet support (#3275). It does need a bit more, like the option to have separated instead of interleaved coordinates.
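For reference, the two coordinate layouts differ only in how the child buffers are arranged; a plain-Python sketch of the same points in both forms:

```python
# The same three points in GeoArrow's two coordinate layouts.
points = [(0.0, 0.0), (1.0, 2.0), (3.0, 4.0)]

# Interleaved: one fixed-size-list child buffer [x0, y0, x1, y1, ...]
interleaved = [c for xy in points for c in xy]

# Separated: a struct with one child buffer per dimension.
separated = {"x": [p[0] for p in points], "y": [p[1] for p in points]}
```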

@kylebarron
Contributor Author

Sorry, I meant only the metadata and not as a registered extension type. Of course everyone wants the actual data as well ;) But so essentially what you did here, i.e. create data with field metadata, not using a pyarrow extension type.

Oh I see. Yeah, I got used to maintaining extension metadata on the field separately from the arrays because arrow-rs doesn't have the concept of extension arrays.

@jorisvandenbossche
Member

I think this should be mostly ready.

There are a few small remaining TODOs (making fields non-nullable, removing empty crs metadata), but for those I first need to fix the test data.

Then there is the remaining API question:

If ``False``, the index(es) will not be written to the file.
If ``None``, the index(es) will be included as columns in the file
output except `RangeIndex` which is stored as metadata only.
geometry_encoding : {'WKB', 'geoarrow'}, default 'WKB'
Member

What is the idea behind WKB being the default? Wouldn't it make more sense to default to GeoArrow? Going forward, I suppose that GeoArrow is the encoding of interest and since this is a new method, we don't need to think about backwards compatibility as we do with Parquet IO.


WKB has the advantage that any GeoSeries can be converted to it (and that two distinct GeoSeries that are converted to it, perhaps containing different geometry types, can be combined with an Arrow C++ array concatenate operation). I love the GeoArrow encoding but I can see how users might have the most success opting in when it makes sense for them to do so.
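To make "any GeoSeries can be converted" concrete: a WKB column is just one self-describing byte string per geometry, regardless of geometry type. A hand-rolled sketch for a little-endian 2D point (hypothetical helper, stdlib only):

```python
import struct


def point_wkb(x, y):
    """Little-endian WKB for a 2D Point: byte order flag (1), geometry
    type (1 = Point), then the x and y doubles."""
    return struct.pack("<BIdd", 1, 1, x, y)
```

Because every geometry type shares this one binary representation, two WKB arrays can always be concatenated, which is exactly the stability argument made above.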

Contributor Author

For completeness, we've also been discussing a mixed type (backed by an arrow union) that can hold multiple underlying arrays. That would always work here, but shapely doesn't support export to it yet.


df_attr = pd.DataFrame(df.copy(deep=False))

# replace geometry columsn with dummy values -> will get converted to
Member

Suggested change
- # replace geometry columsn with dummy values -> will get converted to
+ # replace geometry columns with dummy values -> will get converted to



def geopandas_to_arrow(
    df, index=None, geometry_encoding="WKB", include_z=None, interleaved=True
Member

include_z is not exposed in to_arrow and is omitted from the docstring.

@@ -106,9 +106,6 @@ def geopandas_to_arrow(
             extension_metadata["ARROW:extension:metadata"] = json.dumps(
                 {"crs": crs}
             )
-        else:
-            # TODO we shouldn't do this, just to get testing passed for now
-            extension_metadata["ARROW:extension:metadata"] = "{}"


I just discovered this and am still investigating, but we might actually need this. (If an extension type is registered with metadata, but one attempts to create an instance without metadata, Arrow C++ might throw an exception.) See geoarrow/geoarrow-c#94

Member

Ah, that's annoying ... I just removed all empty "ARROW:extension:metadata" from the test files
(but a good catch, and a reminder for me I should actually add a test using geoarrow to consume this)

Member

OK, reverted this back for compatibility

@kylebarron
Contributor Author

I keep trying to find time to review this. I'll try to get to this in the next couple of days.

@jorisvandenbossche
Member

@kylebarron I suppose we should also add a similar method to GeoSeries (just returning the geometry array and not as a tabular struct array)?

@kylebarron
Contributor Author

LGTM!

Review threads (outdated, resolved): geopandas/geodataframe.py, geopandas/io/geoarrow.py
@kylebarron
Contributor Author

@kylebarron I suppose we should also add a similar method to GeoSeries (just returning the geometry array and not as a tabular struct array)?

yeah absolutely

Comment on lines 40 to 84
geom_type = left["geometry"].type
# in case of Points (directly the inner fixed_size_list or struct type)
# -> there are NaNs for empties -> we need to compare them separately
# and then fill, because pyarrow.Table.equals considers NaNs as not equal
if pa.types.is_fixed_size_list(geom_type):
    left_values = left["geometry"].chunk(0).values
    right_values = right["geometry"].chunk(0).values
    assert left_values.is_nan().equals(right_values.is_nan())
    left_geoms = pa.FixedSizeListArray.from_arrays(
        pc.replace_with_mask(left_values, left_values.is_nan(), 0.0),
        type=left["geometry"].type,
    )
    right_geoms = pa.FixedSizeListArray.from_arrays(
        pc.replace_with_mask(right_values, right_values.is_nan(), 0.0),
        type=right["geometry"].type,
    )
    left = left.set_column(1, left.schema.field("geometry"), left_geoms)
    right = right.set_column(1, right.schema.field("geometry"), right_geoms)

elif pa.types.is_struct(geom_type):
    left_arr = left["geometry"].chunk(0)
    right_arr = right["geometry"].chunk(0)

    for i in range(left_arr.type.num_fields):
        assert left_arr.field(i).is_nan().equals(right_arr.field(i).is_nan())

    left_geoms = pa.StructArray.from_arrays(
        [
            pc.replace_with_mask(left_arr.field(i), left_arr.field(i).is_nan(), 0.0)
            for i in range(left_arr.type.num_fields)
        ],
        fields=list(left["geometry"].type),
    )
    right_geoms = pa.StructArray.from_arrays(
        [
            pc.replace_with_mask(
                right_arr.field(i), right_arr.field(i).is_nan(), 0.0
            )
            for i in range(right_arr.type.num_fields)
        ],
        fields=list(right["geometry"].type),
    )

    left = left.set_column(1, left.schema.field("geometry"), left_geoms)
    right = right.set_column(1, right.schema.field("geometry"), right_geoms)
Member

This is a horrible piece of code that I wrote, but now at least we are testing, for the point case, that they have NaNs. And I am taking this upstream to pyarrow so we can simplify this code in the future.

@jorisvandenbossche
Member

I am going to merge this now, to make it easier to finish the dependent PRs (import, parquet writing), but let's further tweak the interface in follow-ups if needed!

Thanks Kyle for getting this started and all for the feedback.

@jorisvandenbossche merged commit 19d9402 into geopandas:main May 24, 2024
20 checks passed