Feature request: allow additional predicates in join_by #6970

cboettig · 2023-11-23T00:19:30Z

join_by refuses to handle any predicates it doesn't recognize:

Expressions must use one of: `==`, `>=`, `>`, `<=`, `<`, `closest()`, `between()`, `overlaps()`, or
  `within()`.

This is understandable in a narrow sense but interferes with the ability of dplyr to work with external databases like duckdb that are far more expressive. This is also inconsistent with most of the other dplyr functions, that are quite happy passing along such functions. Consider manipulating the following lazy_tbl objects using the explicitly spatial functions that duckdb recognizes:

library(dplyr)
devtools::install_github("cboettig/duckdbfs@spatial-read")
library(duckdbfs)

countries <- open_dataset("/vsicurl/https://github.com/cboettig/duckdbfs/raw/spatial-read/inst/extdata/world.gpkg",
                          format = "sf", tblname = "countries")

cities <- open_dataset("/vsicurl/https://github.com/cboettig/duckdbfs/raw/spatial-read/inst/extdata/metro.fgb",
                          format = "sf", tblname = "cities")

melbourne <- st_point(c(144.9633, -37.814)) |> st_as_text()

# most dplyr functions have no difficulty using these arguments

countries |> filter(st_contains(geom, ST_GeomFromText({melbourne})))

note that dplyr::filter (and most other such functions) have no problem passing these st_* functions on to duckdb, allowing us to do spatial operations. Now a very natural operation is to do spatial joins, e.g. in SQL / DBI we can do:

con <- cached_connection() # duckdbfs connection that holds the above tables

x <- DBI::dbGetQuery(con,"
SELECT countries.iso_a3, cities.geom, countries.geom AS geometry
FROM countries
LEFT JOIN cities
ON st_contains(countries.geom, cities.geom)
") |> as_tibble()

All I'd like is for dplyr to construct this syntax with it's usual wonderful magic, e.g. I want to write:

countries |> left_join(cities, join_by(st_contains(geom, geom)))

Wouldn't that be nice? I think it could do so if only join_by() was not so fastidious in asserting that it doesn't understand anything but a hard-coded set of functions. Unfortunately this obviously errors with the message:

Error in `join_by()`:
! Expressions must use one of: `==`, `>=`, `>`, `<=`, `<`, `closest()`, `between()`, `overlaps()`, or
  `within()`.
ℹ Expression 1 is `st_contains(geom, geom)`.

Would it be possible to just relax this assertion, e.g. when the dataset in question is a remote src?

The text was updated successfully, but these errors were encountered:

twest820 · 2023-12-05T20:05:42Z

+1, I think, as I've a use case which seems of interest here. I've grouped data and need to compare the individuals within different instances of groups of the same type. This results in a many to many join where groups can be matched exactly but within() (or overlaps()) is needed to produce cross joins at the individual level.

x = tibble(groupType = ..., individual = ...) %>% # many group types, individuals in each group are sequentially numbered from 1
  group(groupType) %>% mutate(minIndividual = 1, maxIndividual = n())
y = tibble(groupType = ..., individual = ...) %>% # same group types as data1, different individuals
  group(groupType) %>% mutate(minIndividual = 1, maxIndividual = n())

z = full_join(x, y, by = join_by(groupType, within(x$minIndividual, x$maxIndividual, y$minIndividual, y$maxIndividual)), suffix = c("X", "Y")) %>%
  filter((individualX == individualY) | ((individualX == maxIndividualX) & (individualY > maxIndividualX)) | ((individualY == maxIndividualY) & (individualX > maxIndividualY)))

x and y have a few tens of thousands of rows so cross_join() is infeasible—it'd be about a billion rows—and the cross join scoped by within() still produces an order of magnitude more rows than are retained after filtering. (A regular full join doesn't work because, depending on group matching, a given row in y may need be repeated several times in z.)

So it'd be nice to do more of the filtering in the join and less of it post hoc. I'm not seeing another way to request dplyr 1.1.3 repeat rows of y.

twest820 · 2024-01-16T21:27:01Z

Hit the same repeatedly join all rows in y where they match in x use case for a different problem where x is about a million rows, y's around 10 million, and the output's 25 million or so. While the output tibble wouldn't be more than 10 GB or so dplyr can't form it as the overlapped full_join() doesn't fit into 128 GB of DDR (this is running on dual channel DDR4 so using more or larger DIMMs isn't possible).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request: allow additional predicates in join_by #6970

Feature request: allow additional predicates in join_by #6970

cboettig commented Nov 23, 2023 •

edited

twest820 commented Dec 5, 2023

twest820 commented Jan 16, 2024

Feature request: allow additional predicates in join_by #6970

Feature request: allow additional predicates in join_by #6970

Comments

cboettig commented Nov 23, 2023 • edited

twest820 commented Dec 5, 2023

twest820 commented Jan 16, 2024

cboettig commented Nov 23, 2023 •

edited