expand_selectors returns resulting column names, rather than selected column names #16242

Closed
2 tasks done
machow opened this issue May 15, 2024 · 3 comments · Fixed by #16250
Labels: bug (Something isn't working), python (Related to Python Polars)

Comments

machow commented May 15, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Currently, the expand_selector docstring describes its intent as:

Expand a selector to column names with respect to a specific frame or schema target.

It seems like this could mean one of two things:

  1. expand source selection: list the source column names being selected
  2. expand selection result: list the names that result from expressions (like .suffix()) applied to the selection

I assumed expand_selector() did the first, but it seems to do the second:

import polars as pl
import polars.selectors as cs

df = pl.DataFrame({"x": [], "y": []})

# returns ("x_z",)
cs.expand_selector(df, (cs.by_name("x") + pl.col("y")).name.suffix("_z"))

# name directly off selector, also returns ("x_z",)
cs.expand_selector(df, cs.by_name("x").name.suffix("_z"))

Is this a bug or by design? I could see the value of both activities, but being able to expand source selection is very useful for tools that let people choose columns with selectors, so it would be helpful to have source selection somewhere!

Log output

No response

Issue description

Where is listing source selection useful?

For Great Tables, we need to do column selection without executing any computation (i.e. only choose columns). See how R's dplyr does it:

For example, in R's dplyr library, there's a function (confusingly named select) that only chooses columns. The way selection and computation are combined is through another function, called across:

library(dplyr)

# tidyselect equivalent of a selector ----
# contains("m")

# can only choose columns, can not execute computation on them ----
select(mtcars, contains("m"))

# can select, execute computation, create new columns ----
transmute(
    mtcars,
    # across combines selection, computation, and result naming
    across(
        contains("m"),                           # selection
        ~ . + 1,                                 # computation
        .names = "some_prefix_{.col}"            # result naming
    )
)
                    some_prefix_mpg some_prefix_am
Mazda RX4                      22.0              2
Mazda RX4 Wag                  22.0              2

Under the hood, functions like select and across use tidyselect::eval_select(), which returns source column selection names.
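
For readers less familiar with R, the split between selection, computation, and result naming could be sketched in plain Python roughly as follows. The `eval_select` and `across` functions here are hypothetical stand-ins written for this illustration, not the tidyselect or dplyr APIs, and the table is a trimmed, hand-typed subset of mtcars.

```python
def eval_select(table, predicate):
    """Hypothetical stand-in: return the source column names chosen by `predicate`."""
    return [name for name in table if predicate(name)]


def across(table, predicate, fn, names="{col}"):
    """Hypothetical stand-in: combine selection, computation, and result naming."""
    selected = eval_select(table, predicate)  # selection
    return {
        names.format(col=name): [fn(value) for value in table[name]]  # naming + computation
        for name in selected
    }


def contains_m(name):
    """Rough analogue of tidyselect's contains("m")."""
    return "m" in name


# Trimmed subset of mtcars (first two rows, three columns).
mtcars = {"mpg": [21.0, 21.0], "cyl": [6, 6], "am": [1, 1]}

selected = eval_select(mtcars, contains_m)
result = across(mtcars, contains_m, lambda v: v + 1, names="some_prefix_{col}")

print(selected)
print(result)
```

The point of the sketch is that `eval_select` only ever sees column names, so it can answer "which source columns were chosen?" without running `fn` at all.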

Expected behavior

The source column names being selected by selectors.

Installed versions

--------Version info---------
Polars:               0.20.26
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.9.5 (default, Jul 13 2022, 16:30:47) 
[Clang 13.0.0 (clang-1300.0.29.30)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
nest_asyncio:         1.5.8
numpy:                1.26.0
openpyxl:             3.1.2
pandas:               2.2.1
pyarrow:              14.0.0
pydantic:             2.5.2
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.22
torch:                <not installed>
xlsx2csv:             0.8.2
xlsxwriter:           <not installed>
machow added the bug, needs triage, and python labels May 15, 2024
Author

machow commented May 15, 2024

This issue seems related to

Which brings up good language for talking about selectors, column expressions, and expressions. For example, this comment by @stinodego. I'm not sure if cs.contains("x").name.suffix("_z") has entered expression territory as laid out in his comment?

(IMO the R library dplyr's across doc does a good job of framing these pieces, though in a very different interface, with separate arguments for selection, expression, and naming; here it is ported to siuba, and to ibis)

alexander-beedie self-assigned this May 15, 2024
alexander-beedie removed the needs triage label May 15, 2024
Collaborator

alexander-beedie commented May 15, 2024

Looks like a bug to me, but possibly not the one you were thinking of; in both of the above cases I'd expect the function to raise an error, as neither input is a bare/compound selector (which is really what this function is for).

I'll see about fixing that, and then we can think how best to address your requirements - big fan of Great Tables, so let's make sure we can handle this cleanly/consistently 😅

Author

machow commented May 15, 2024

Looks like a bug to me, but possibly not the one you were thinking of; in both of the above cases I'd expect the function to raise an error, as neither input is a bare/compound selector (which is really what this function is for).

This isn't what I was thinking of, but it's also exactly what I'd want, so it's the dream scenario :p.

Thanks for the quick response! The polars integration with Great Tables has really been a game changer!
