Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fuzzy Matching Functions - Jaro Winkler Similarity and Levenshtein Distance #2839

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

abhishoya-gs
Copy link
Contributor

@abhishoya-gs abhishoya-gs commented May 8, 2024

What type of PR is this?

Improvement

What does this PR do / why is it needed ?

Support for Jaro Winkler Similarity and Levenshtein Distance in Snowflake and DuckDb Runtime

Which issue(s) this PR fixes:

Fixes #

Other notes for reviewers:

Depends on finos/legend-pure#805

Does this PR introduce a user-facing change?

Yes, new functions

@abhishoya-gs abhishoya-gs requested a review from a team as a code owner May 8, 2024 17:07
@abhishoya-gs abhishoya-gs changed the title Fuzzy Matching Functions - Jaro Winkler and Levenshtein Distance Fuzzy Matching Functions - Jaro Winkler Similarity and Levenshtein Distance May 8, 2024
Copy link

github-actions bot commented May 8, 2024

Test Results

0 files   -      764  0 suites   - 764   0s ⏱️ - 1h 5m 38s
0 tests  - 12 617  0 ✔️  - 12 384  0 💤  - 233  0 ±0 
0 runs   - 15 728  0 ✔️  - 15 477  0 💤  - 251  0 ±0 

Results for commit 56dbf87. ± Comparison against base commit 5911e08.

♻️ This comment has been updated with latest results.

fc2(hash_String_1__HashType_1__String_1_, {ctx,text,hashType | $library->j_invoke('hash', [$text, $hashType], javaString())})

fc2(hash_String_1__HashType_1__String_1_, {ctx,text,hashType | $library->j_invoke('hash', [$text, $hashType], javaString())}),
fc2(jaroWinklerSimilarity_String_1__String_1__Float_1_, {ctx,str1,str2 | $library->j_invoke('jaroWinklerSimilarity',[$str1, $str2], javaFloat())}),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't return type javaDouble? As the Library method is this -

public static double jaroWinklerSimilarity(String str1, String str2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -42,6 +42,7 @@ function <<access.private>> meta::relational::functions::sqlQueryToString::duckD
{
let allStates = allGenerationStates();
[
dynaFnToSql('jaroWinklerSimilarity', $allStates, ^ToSql(format='JARO_WINKLER_SIMILARITY(%s, %s)'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing edit distance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, added

@@ -158,9 +158,11 @@ function <<access.private>> meta::relational::functions::sqlQueryToString::snowf
dynaFnToSql('hour', $allStates, ^ToSql(format='date_part(\'hour\', %s)')),
dynaFnToSql('indexOf', $allStates, ^ToSql(format='CHARINDEX(%s)', transform={p:String[2] | $p->at(1) + ', ' + $p->at(0)})),
dynaFnToSql('isAlphaNumeric', $allStates, ^ToSql(format=regexpPattern('%s'), transform={p:String[1]|$p->transformAlphaNumericParamsDefault()})),
dynaFnToSql('jaroWinklerSimilarity', $allStates, ^ToSql(format='ROUND(JAROWINKLER_SIMILARITY(%s, %s)/100, 2)')),
Copy link
Contributor

@gs-ssh16 gs-ssh16 May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why round(x, 2)? Anyways it will be two digits only as does number b/n 0-100/100. Also, should we keep these in lowercase like most other functions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added for precision to output. But we can work without it as well, removed. Yes, lowercase would be more semantic, done.

{
if (string1 == ValueNull.INSTANCE || string2 == ValueNull.INSTANCE)
{
throw new IllegalArgumentException("Null values not supported in legend_h2_extension_edit_distance");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should return null rather than failing. Same goes for other method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants