Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regexp_like scalar function #9102

Closed
Omega359 opened this issue Feb 1, 2024 · 9 comments · Fixed by #9137
Closed

Add regexp_like scalar function #9102

Omega359 opened this issue Feb 1, 2024 · 9 comments · Fixed by #9137
Labels
enhancement New feature or request

Comments

@Omega359
Copy link
Contributor

Omega359 commented Feb 1, 2024

Is your feature request related to a problem or challenge?

Currently there is regexp_match and regexp_replace however there isn't a corresponding regexp_like function that could be used in when(..) dataframe method or sql case statements.

Describe the solution you'd like

An implementation of regexp_like that matches the syntax and style of the postgresql implementation as close as possible. Note that the Spark version of the regexp_like function is very similar but does not include any flags in the function signature.

Not all of the flags that postgresql supports may be included in the initial implementation - likely just 'i' may be implemented.

It is noted that the implementation for the existing regexp_match currently resides in datafusion however with apache/arrow-rs#5235 this functionality was moved into arrow_rs (but has yet to be removed from datafusion). The implementation for regexp_like may take a similar path - implement in datafusion first then move to arrow_rs if the community thinks that would be a good idea.

Describe alternatives you've considered

You can use regexp_match and test the return list for empty to imitate this function however that is not the most performant way to implement this as that methodology cannot return immediately after the first match.

Additional context

No response

@Omega359 Omega359 added the enhancement New feature or request label Feb 1, 2024
@Omega359
Copy link
Contributor Author

Omega359 commented Feb 1, 2024

Looks like arrow_rs actually does implement a function that can be used in this issue - regexp_is_match_utf8:
https://github.com/apache/arrow-rs/blob/master/arrow-string/src/regexp.rs#L38

@alamb
Copy link
Contributor

alamb commented Feb 1, 2024

DataFusion also has Expr::Like but that implements the % SQL type matching

Also potentially interesting is that we have a version of this function in InfluxDB:

https://github.com/influxdata/influxdb/blob/ff567cd33f6f9c42a4e24e9c93d1799d1a8f4461/query_functions/src/regex.rs#L22-L121

(though that one is designed to match the behavior of the Go regexp library)

@alamb
Copy link
Contributor

alamb commented Feb 1, 2024

It is noted that the implementation for the existing regexp_match currently resides in datafusion however with apache/arrow-rs#5235

I actually ran into the same thing this morning as I tried to move the regexp code into a different crate -- #9101. In general the code in regexp_expressions.rs is quite messy and I think could be significantly simplified now that that arrow-rs has more functionality

@Omega359 is there any chance you might be willing to try and remove the copy? Otherwise I will try and find time to do it over the next few days

@Omega359
Copy link
Contributor Author

Omega359 commented Feb 1, 2024

@Omega359 is there any chance you might be willing to try and remove the copy? Otherwise I will try and find time to do it over the next few days

@alamb yes, I should be able to work on that tonight or tomorrow - hopefully it won't make the merge too difficult for you.

@alamb
Copy link
Contributor

alamb commented Feb 1, 2024

@alamb yes, I should be able to work on that tonight or tomorrow - hopefully it won't make the merge too difficult for you.

Don't worry about it -- I think it will make things easier -- the current logic to figure out when the arguments are all constants seems overly convoluted so the fewer things going on in that module the better

@Omega359
Copy link
Contributor Author

Omega359 commented Feb 1, 2024

DataFusion also has Expr::Like but that implements the % SQL type matching

Also potentially interesting is that we have a version of this function in InfluxDB:

https://github.com/influxdata/influxdb/blob/ff567cd33f6f9c42a4e24e9c93d1799d1a8f4461/query_functions/src/regex.rs#L22-L121

(though that one is designed to match the behavior of the Go regexp library)

That is definitely interesting. I spent some time today looking into the differences between the postgresql, Java and Rust implementations of regex. There is as expected a very large amount of overlap but some advanced features are only found in one implementation or another. Postgresql has an expanded version of one of the posix definitions, Java I think is more based on the Perl regex, and the Rust crate specifically calls out that it isn't posix based.

Essentially, I was thinking of just sticking with what is in use already and documenting the syntax via references to the rust crate's documentation and noting that anyone expecting to have 100% compatibility with either postgresql, Java's, Perl's, Go's, Posix, etc is bound to be disappointed. Once we have proper separation between default, postgres and spark syntaxes then others can have a go at specific versions.

Primarily I suspect this might possible impact the comet contribution being made - I haven't checked to see if they did any work in this area.

@Omega359
Copy link
Contributor Author

Omega359 commented Feb 1, 2024

BTW, from the looks of things the rust regex crate now allows escapes on a lot more things - see rust-lang/regex#501 (comment) . The code you referenced may no longer be required.

@alamb
Copy link
Contributor

alamb commented Feb 1, 2024

Essentially, I was thinking of just sticking with what is in use already and documenting the syntax via references to the rust crate's documentation and noting that anyone expecting to have 100% compatibility with either postgresql, Java's, Perl's, Go's, Posix, etc is bound to be disappointed. Once we have proper separation between default, postgres and spark syntaxes then others can have a go at specific versions.

I think this is a solid plan FWIW

@Omega359
Copy link
Contributor Author

Omega359 commented Feb 2, 2024

Additional note - datafusion does have support for postgres sql regex operators such as text ~ text. That support however isn't easily exposed to dataframe use cases to my knowledge. It does however have the advantage that it does have optimizer support for simplifying expressions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants