Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: allow additional predicates in join_by #6970

Open
cboettig opened this issue Nov 23, 2023 · 2 comments
Open

Feature request: allow additional predicates in join_by #6970

cboettig opened this issue Nov 23, 2023 · 2 comments

Comments

@cboettig
Copy link

cboettig commented Nov 23, 2023

join_by refuses to handle any predicates it doesn't recognize:

Expressions must use one of: `==`, `>=`, `>`, `<=`, `<`, `closest()`, `between()`, `overlaps()`, or
  `within()`.

This is understandable in a narrow sense but interferes with the ability of dplyr to work with external databases like duckdb that are far more expressive. This is also inconsistent with most of the other dplyr functions, that are quite happy passing along such functions. Consider manipulating the following lazy_tbl objects using the explicitly spatial functions that duckdb recognizes:

library(dplyr)
devtools::install_github("cboettig/duckdbfs@spatial-read")
library(duckdbfs)

countries <- open_dataset("/vsicurl/https://github.com/cboettig/duckdbfs/raw/spatial-read/inst/extdata/world.gpkg",
                          format = "sf", tblname = "countries")

cities <- open_dataset("/vsicurl/https://github.com/cboettig/duckdbfs/raw/spatial-read/inst/extdata/metro.fgb",
                          format = "sf", tblname = "cities")

melbourne <- st_point(c(144.9633, -37.814)) |> st_as_text()

# most dplyr functions have no difficulty using these arguments

countries |> filter(st_contains(geom, ST_GeomFromText({melbourne})))

note that dplyr::filter (and most other such functions) have no problem passing these st_* functions on to duckdb, allowing us to do spatial operations. Now a very natural operation is to do spatial joins, e.g. in SQL / DBI we can do:

con <- cached_connection() # duckdbfs connection that holds the above tables

x <- DBI::dbGetQuery(con,"
SELECT countries.iso_a3, cities.geom, countries.geom AS geometry
FROM countries
LEFT JOIN cities
ON st_contains(countries.geom, cities.geom)
") |> as_tibble()

All I'd like is for dplyr to construct this syntax with it's usual wonderful magic, e.g. I want to write:

countries |> left_join(cities, join_by(st_contains(geom, geom))) 

Wouldn't that be nice? I think it could do so if only join_by() was not so fastidious in asserting that it doesn't understand anything but a hard-coded set of functions. Unfortunately this obviously errors with the message:

Error in `join_by()`:
! Expressions must use one of: `==`, `>=`, `>`, `<=`, `<`, `closest()`, `between()`, `overlaps()`, or
  `within()`.
ℹ Expression 1 is `st_contains(geom, geom)`.

Would it be possible to just relax this assertion, e.g. when the dataset in question is a remote src?

@twest820
Copy link

twest820 commented Dec 5, 2023

+1, I think, as I've a use case which seems of interest here. I've grouped data and need to compare the individuals within different instances of groups of the same type. This results in a many to many join where groups can be matched exactly but within() (or overlaps()) is needed to produce cross joins at the individual level.

x = tibble(groupType = ..., individual = ...) %>% # many group types, individuals in each group are sequentially numbered from 1
  group(groupType) %>% mutate(minIndividual = 1, maxIndividual = n())
y = tibble(groupType = ..., individual = ...) %>% # same group types as data1, different individuals
  group(groupType) %>% mutate(minIndividual = 1, maxIndividual = n())

z = full_join(x, y, by = join_by(groupType, within(x$minIndividual, x$maxIndividual, y$minIndividual, y$maxIndividual)), suffix = c("X", "Y")) %>%
  filter((individualX == individualY) | ((individualX == maxIndividualX) & (individualY > maxIndividualX)) | ((individualY == maxIndividualY) & (individualX > maxIndividualY)))

x and y have a few tens of thousands of rows so cross_join() is infeasible—it'd be about a billion rows—and the cross join scoped by within() still produces an order of magnitude more rows than are retained after filtering. (A regular full join doesn't work because, depending on group matching, a given row in y may need be repeated several times in z.)

So it'd be nice to do more of the filtering in the join and less of it post hoc. I'm not seeing another way to request dplyr 1.1.3 repeat rows of y.

@twest820
Copy link

Hit the same repeatedly join all rows in y where they match in x use case for a different problem where x is about a million rows, y's around 10 million, and the output's 25 million or so. While the output tibble wouldn't be more than 10 GB or so dplyr can't form it as the overlapped full_join() doesn't fit into 128 GB of DDR (this is running on dual channel DDR4 so using more or larger DIMMs isn't possible).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants