Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: adding a row to a GeoDataFrame downcasts GeoSeries to Series #3119

Open
martinfleis opened this issue Jan 3, 2024 · 7 comments
Open

Comments

@martinfleis
Copy link
Member

I guess that merging #3080 caused a regression illustrated below.

import geopandas
from geodatasets import get_path

nybb = geopandas.read_file(get_path("nybb"), columns=["BoroCode", "geometry"], engine="pyogrio")

# add a new row
nybb.loc[5] = [6, nybb.geometry.iloc[0]]

nybb.crs

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/Git/geopandas/geopandas/geodataframe.py in ?(self)
    436             return self.geometry.crs
    437         except AttributeError:
--> 438             raise AttributeError(
    439                 "The CRS attribute of a GeoDataFrame without an active "

~/miniforge3/envs/geopandas/lib/python3.12/site-packages/pandas/core/generic.py in ?(self, name)
   6202         ):
   6203             return self[name]
-> 6204         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'crs'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
Cell In[11], line 1
----> 1 nybb.crs

File ~/miniforge3/envs/geopandas/lib/python3.12/site-packages/pandas/core/generic.py:6204, in NDFrame.__getattr__(self, name)
   6197 if (
   6198     name not in self._internal_names_set
   6199     and name not in self._metadata
   6200     and name not in self._accessors
   6201     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6202 ):
   6203     return self[name]
-> 6204 return object.__getattribute__(self, name)

File ~/Git/geopandas/geopandas/geodataframe.py:438, in GeoDataFrame.crs(self)
    436     return self.geometry.crs
    437 except AttributeError:
--> 438     raise AttributeError(
    439         "The CRS attribute of a GeoDataFrame without an active "
    440         "geometry column is not defined. Use GeoDataFrame.set_geometry "
    441         "to set the active geometry column."
    442     )

AttributeError: The CRS attribute of a GeoDataFrame without an active geometry column is not defined. Use GeoDataFrame.set_geometry to set the active geometry column.

We lose the active geometry because the geometry column is now a Series

type(nybb.geometry)
pandas.core.series.Series

cc @m-richards

@jorisvandenbossche
Copy link
Member

It's the .loc setitem operation that looses the geometry dtype:

In [8]: nybb = geopandas.read_file(get_path("nybb"), columns=["BoroCode", "geometry"], engine="pyogrio")

In [9]: nybb.dtypes
Out[9]: 
BoroCode       int32
geometry    geometry
dtype: object

In [10]: nybb.loc[5] = [6, nybb.geometry.iloc[0]]

In [11]: nybb.dtypes
Out[11]: 
BoroCode     int64
geometry    object
dtype: object

Inside .loc, after pandas has appended the new row, the mgr for the new concatted data (with the geometry column being object dtype) is passed to _constructor_from_mgr. At that point, with the new implementation, we no longer call GeoDataFrame(..) because there are no geometry columns at that point (I do wonder if concat shouldn't be able to preserve the geometry dtype?)

@m-richards
Copy link
Member

m-richards commented Jan 4, 2024

This unfortunately isn't new as such, it's just that the GeoDataFrame(..) call hid this for the most common case - where we had a geometry column called "geometry" due to our automatic call to ensure_geometry in the constructor. Slightly modifying Martin's example:

import geopandas
from geodatasets import get_path

nybb = geopandas.read_file(get_path("nybb"), columns=["BoroCode", "geometry"], engine="pyogrio").rename_geometry("geom")
# add a new row
nybb.loc[5] = [6, nybb.geometry.iloc[0]]

nybb.crs

we have the same crash, with or without the _constructor_from_mgr changes.

Now it looks like we could do something like this:

    def _constructor_from_mgr(self, mgr, axes):
        # analogous logic to _geodataframe_constructor_with_fallback
        if not any(isinstance(block.dtype, GeometryDtype) for block in mgr.blocks):
            df= pd.DataFrame._from_mgr(mgr, axes)
            # stuff like concat to append an object row will upcast GeometryDtype blocks
            for c in self.columns[self.dtypes == "geometry"]:
                try:
                    df[c] = _ensure_geometry(df[c], crs=self[c].crs)
                except TypeError:
                    pass
            try:
                df = df.set_geometry(self._geometry_column_name, crs=self.crs)
            except TypeError:
                pass
            return df

        return GeoDataFrame._from_mgr(mgr, axes)

which is a bit ugly but seems to work (this is super rough, so there might be a nicer way).

Having a look at how the concat logic works, it seems like ideally we'd have a way to tell pandas that the types should be specialised to geometrydtype for both things being concatenated, rather than generalised to object dtype. Currently this seems to happen in core.dtypes.concat::concat_compat and _get_result_dtype, where find_common_dtype dispatches to ExtensionDtype._get_common_dtype. Unfortunately I don't think we can easily overload this for GeometryDtype as we'd only want shapely geometry object dtypes to be accepted, not arbitrary object dtype series.

(For example, trying to take (GeometryDtype, object) -> GeometryDtype breaks trying to access individiual rows, gdf.iloc[0] as the dtype promotion will try and put a row of all types into a GeometryArray and crash, rather than an object numpy array)

@m-richards
Copy link
Member

Aside, I was trying to think about how to do this properly, and thought that a Categorical is the best analogue in pandas:

import pandas as pd
df = pd.DataFrame({"a":[1,2,3], "b":pd.Categorical(["a", "b", "b"])})
b_ = [6, "b"]
df.loc[5] = b_
print(df.dtypes)
a     int64
b    object
dtype: object

but in this case the categorical dtype also isn't preserved.

@m-richards
Copy link
Member

m-richards commented Jan 5, 2024

Just adding as a note that @jorisvandenbossche picked up on the warnings PR, this case also regresses:

crs = "EPSG:4326"
df = GeoDataFrame(
        {
            "geometry": [Point(0, 0), Point(1, 1), Point(0, 0)],
            "value1": np.arange(3, dtype="int64"),
            "value2": np.array([1, 2, 1], dtype="int64"),
        },
        crs=crs,
    )

    # dummy test asserting we can access the crs
    def func(group):
        assert isinstance(group, GeoDataFrame)
        assert group.crs == crs

    groupby = df.groupby("value2")
    groupby_sliced = groupby[df.columns] # -> groupby_sliced._selected_obj._geometry_column_name is where _geometry_column_name is lost.
    groupby_sliced.apply(func)

still looking into how this happens.

@jorisvandenbossche
Copy link
Member

Aside, I was trying to think about how to do this properly, and thought that a Categorical is the best analogue in pandas:
...
but in this case the categorical dtype also isn't preserved.

Yeah, that's a good other example. I think in the end this is something that will have to be solved in pandas to fix it properly (although your PR looks interesting as a workaround, have to take a closer look!)

Just adding as a note that @jorisvandenbossche picked up on the warnings PR, this case also regresses:

Yes, I looked into that further and commented about it at #2966 (comment). Summary, it's also "uncovered" by the _constructor change, but something that already existed before for non-"geometry" geometry columns as well. Will do a PR to fix it on the pandas side (it's an easy fix), and have #3130 to test it on our side.

@m-richards
Copy link
Member

There's actually a much older regression at play here too.

import geopandas
from geodatasets import get_path
nybb = geopandas.read_file(get_path("nybb"))[["BoroCode", "geometry"]]
crs_orig = nybb.crs
# add a new row
nybb.loc[5] = [6, nybb.geometry.iloc[0]]
assert nybb.crs == crs_orig # now fails

The CRS is not preserved in this case since 8bac66c
(and now since 1b3a305 nybb.crs is an attributeerror). So just restoring the behaviour before this regression isn't ideal either, but perhaps the best option in the short term.

m-richards added a commit to m-richards/geopandas that referenced this issue Jan 26, 2024
@jorisvandenbossche
Copy link
Member

In the end this is all because pandas doesn't properly preserve the dtype in concat. And only our constructor that detects by default a "geometry" column fixed that again. Will try to think about how to fix it in pandas as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants