REGR: adding a row to a GeoDataFrame downcasts GeoSeries to Series #3119

martinfleis · 2024-01-03T09:25:58Z

I guess that merging #3080 caused a regression illustrated below.

import geopandas
from geodatasets import get_path

nybb = geopandas.read_file(get_path("nybb"), columns=["BoroCode", "geometry"], engine="pyogrio")

# add a new row
nybb.loc[5] = [6, nybb.geometry.iloc[0]]

nybb.crs

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/Git/geopandas/geopandas/geodataframe.py in ?(self)
    436             return self.geometry.crs
    437         except AttributeError:
--> 438             raise AttributeError(
    439                 "The CRS attribute of a GeoDataFrame without an active "

~/miniforge3/envs/geopandas/lib/python3.12/site-packages/pandas/core/generic.py in ?(self, name)
   6202         ):
   6203             return self[name]
-> 6204         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'crs'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
Cell In[11], line 1
----> 1 nybb.crs

File ~/miniforge3/envs/geopandas/lib/python3.12/site-packages/pandas/core/generic.py:6204, in NDFrame.__getattr__(self, name)
   6197 if (
   6198     name not in self._internal_names_set
   6199     and name not in self._metadata
   6200     and name not in self._accessors
   6201     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6202 ):
   6203     return self[name]
-> 6204 return object.__getattribute__(self, name)

File ~/Git/geopandas/geopandas/geodataframe.py:438, in GeoDataFrame.crs(self)
    436     return self.geometry.crs
    437 except AttributeError:
--> 438     raise AttributeError(
    439         "The CRS attribute of a GeoDataFrame without an active "
    440         "geometry column is not defined. Use GeoDataFrame.set_geometry "
    441         "to set the active geometry column."
    442     )

AttributeError: The CRS attribute of a GeoDataFrame without an active geometry column is not defined. Use GeoDataFrame.set_geometry to set the active geometry column.

We lose the active geometry because the geometry column is now a Series

type(nybb.geometry)
pandas.core.series.Series

cc @m-richards

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2024-01-03T15:35:10Z

It's the .loc setitem operation that looses the geometry dtype:

In [8]: nybb = geopandas.read_file(get_path("nybb"), columns=["BoroCode", "geometry"], engine="pyogrio")

In [9]: nybb.dtypes
Out[9]: 
BoroCode       int32
geometry    geometry
dtype: object

In [10]: nybb.loc[5] = [6, nybb.geometry.iloc[0]]

In [11]: nybb.dtypes
Out[11]: 
BoroCode     int64
geometry    object
dtype: object

Inside .loc, after pandas has appended the new row, the mgr for the new concatted data (with the geometry column being object dtype) is passed to _constructor_from_mgr. At that point, with the new implementation, we no longer call GeoDataFrame(..) because there are no geometry columns at that point (I do wonder if concat shouldn't be able to preserve the geometry dtype?)

m-richards · 2024-01-04T05:53:13Z

This unfortunately isn't new as such, it's just that the GeoDataFrame(..) call hid this for the most common case - where we had a geometry column called "geometry" due to our automatic call to ensure_geometry in the constructor. Slightly modifying Martin's example:

import geopandas
from geodatasets import get_path

nybb = geopandas.read_file(get_path("nybb"), columns=["BoroCode", "geometry"], engine="pyogrio").rename_geometry("geom")
# add a new row
nybb.loc[5] = [6, nybb.geometry.iloc[0]]

nybb.crs

we have the same crash, with or without the _constructor_from_mgr changes.

Now it looks like we could do something like this:

    def _constructor_from_mgr(self, mgr, axes):
        # analogous logic to _geodataframe_constructor_with_fallback
        if not any(isinstance(block.dtype, GeometryDtype) for block in mgr.blocks):
            df= pd.DataFrame._from_mgr(mgr, axes)
            # stuff like concat to append an object row will upcast GeometryDtype blocks
            for c in self.columns[self.dtypes == "geometry"]:
                try:
                    df[c] = _ensure_geometry(df[c], crs=self[c].crs)
                except TypeError:
                    pass
            try:
                df = df.set_geometry(self._geometry_column_name, crs=self.crs)
            except TypeError:
                pass
            return df

        return GeoDataFrame._from_mgr(mgr, axes)

which is a bit ugly but seems to work (this is super rough, so there might be a nicer way).

Having a look at how the concat logic works, it seems like ideally we'd have a way to tell pandas that the types should be specialised to geometrydtype for both things being concatenated, rather than generalised to object dtype. Currently this seems to happen in core.dtypes.concat::concat_compat and _get_result_dtype, where find_common_dtype dispatches to ExtensionDtype._get_common_dtype. Unfortunately I don't think we can easily overload this for GeometryDtype as we'd only want shapely geometry object dtypes to be accepted, not arbitrary object dtype series.

(For example, trying to take (GeometryDtype, object) -> GeometryDtype breaks trying to access individiual rows, gdf.iloc[0] as the dtype promotion will try and put a row of all types into a GeometryArray and crash, rather than an object numpy array)

m-richards · 2024-01-04T23:48:20Z

Aside, I was trying to think about how to do this properly, and thought that a Categorical is the best analogue in pandas:

import pandas as pd
df = pd.DataFrame({"a":[1,2,3], "b":pd.Categorical(["a", "b", "b"])})
b_ = [6, "b"]
df.loc[5] = b_
print(df.dtypes)
a     int64
b    object
dtype: object

but in this case the categorical dtype also isn't preserved.

m-richards · 2024-01-05T10:18:55Z

Just adding as a note that @jorisvandenbossche picked up on the warnings PR, this case also regresses:

crs = "EPSG:4326"
df = GeoDataFrame(
        {
            "geometry": [Point(0, 0), Point(1, 1), Point(0, 0)],
            "value1": np.arange(3, dtype="int64"),
            "value2": np.array([1, 2, 1], dtype="int64"),
        },
        crs=crs,
    )

    # dummy test asserting we can access the crs
    def func(group):
        assert isinstance(group, GeoDataFrame)
        assert group.crs == crs

    groupby = df.groupby("value2")
    groupby_sliced = groupby[df.columns] # -> groupby_sliced._selected_obj._geometry_column_name is where _geometry_column_name is lost.
    groupby_sliced.apply(func)

still looking into how this happens.

jorisvandenbossche · 2024-01-05T21:37:05Z

Aside, I was trying to think about how to do this properly, and thought that a Categorical is the best analogue in pandas:
...
but in this case the categorical dtype also isn't preserved.

Yeah, that's a good other example. I think in the end this is something that will have to be solved in pandas to fix it properly (although your PR looks interesting as a workaround, have to take a closer look!)

Just adding as a note that @jorisvandenbossche picked up on the warnings PR, this case also regresses:

Yes, I looked into that further and commented about it at #2966 (comment). Summary, it's also "uncovered" by the _constructor change, but something that already existed before for non-"geometry" geometry columns as well. Will do a PR to fix it on the pandas side (it's an easy fix), and have #3130 to test it on our side.

m-richards · 2024-01-26T01:05:18Z

There's actually a much older regression at play here too.

import geopandas
from geodatasets import get_path
nybb = geopandas.read_file(get_path("nybb"))[["BoroCode", "geometry"]]
crs_orig = nybb.crs
# add a new row
nybb.loc[5] = [6, nybb.geometry.iloc[0]]
assert nybb.crs == crs_orig # now fails

The CRS is not preserved in this case since 8bac66c
(and now since 1b3a305 nybb.crs is an attributeerror). So just restoring the behaviour before this regression isn't ideal either, but perhaps the best option in the short term.

jorisvandenbossche · 2024-01-26T15:49:52Z

In the end this is all because pandas doesn't properly preserve the dtype in concat. And only our constructor that detects by default a "geometry" column fixed that again. Will try to think about how to fix it in pandas as well.

m-richards mentioned this issue Jan 4, 2024

BUG: adding row to GeoDataFrame with .loc unsets the active geometry column #3109

Open

m-richards added bug regression labels Jan 5, 2024

m-richards mentioned this issue Jan 17, 2024

REGR: using loc to add rows doesn't preserve dtypes #3146

Closed

m-richards added a commit to m-richards/geopandas that referenced this issue Jan 26, 2024

add regression test for geopandas#3119

5c5e8e2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

REGR: adding a row to a GeoDataFrame downcasts GeoSeries to Series #3119

REGR: adding a row to a GeoDataFrame downcasts GeoSeries to Series #3119

martinfleis commented Jan 3, 2024

jorisvandenbossche commented Jan 3, 2024

m-richards commented Jan 4, 2024 •

edited

m-richards commented Jan 4, 2024

m-richards commented Jan 5, 2024 •

edited

jorisvandenbossche commented Jan 5, 2024

m-richards commented Jan 26, 2024

jorisvandenbossche commented Jan 26, 2024

REGR: adding a row to a GeoDataFrame downcasts GeoSeries to Series #3119

REGR: adding a row to a GeoDataFrame downcasts GeoSeries to Series #3119

Comments

martinfleis commented Jan 3, 2024

jorisvandenbossche commented Jan 3, 2024

m-richards commented Jan 4, 2024 • edited

m-richards commented Jan 4, 2024

m-richards commented Jan 5, 2024 • edited

jorisvandenbossche commented Jan 5, 2024

m-richards commented Jan 26, 2024

jorisvandenbossche commented Jan 26, 2024

m-richards commented Jan 4, 2024 •

edited

m-richards commented Jan 5, 2024 •

edited