Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Text 158 #112

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Text 158 #112

wants to merge 12 commits into from

Conversation

milaKisialiova
Copy link

Fix Jaccard similarity with empty strings:
Due to bug, Jaccard similarity for two empty strings returned 0.0, and the distance 1.0, now it returns similarity 1.0, and the distance 0.0

@aherbert
Copy link
Contributor

This also includes the fix for TEXT-86 #106. Can you remove that so the only files changed are for the JaccardSimilarity and its test.

@coveralls
Copy link

coveralls commented Mar 11, 2019

Coverage Status

Coverage increased (+0.0008%) to 97.918% when pulling 61becad on CAPS50:TEXT-158 into 700a066 on apache:master.

@@ -37,7 +37,7 @@ public static void setUp() {
@Test
public void testGettingJaccardDistance() {
// Expected Jaccard distance = 1.0 - (intersect / union)
assertEquals(1.0, classBeingTested.apply("", ""));
assertEquals(0.0, classBeingTested.apply("", ""));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only works because the unit test passes in the same string on both sides, i.e. "" vs "".

Do not use == to measure equality. This measures the same object reference. Try this in your test:

// Change 
assertEquals(0.0, classBeingTested.apply("", ""));
// to 
assertEquals(0.0, classBeingTested.apply("", new StringBuilder()));

You should fix the code to use StringUtils.equals(CharSequence, CharSequence) to test for equality.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, but

  1. Operator == is used to measure two int values (length).
  2. with new StringBuilder() also works well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you are right. It was my mistake carried over from fixing another similarity class to use StringUtils.equals.

However you should consider using StringUtils.equals before splitting all the sequences into a set to provide a fast exit in the case of equal input. This is done in JaroWinklerSimilarity.

@CAPS50
Copy link
Contributor

CAPS50 commented Mar 22, 2019

Hi @aherbert

it looks like Travis is failing in jdk 12 for a reason that is not related to this change:

install-jdk.sh 2019-01-18 II
Couldn't determine a download url for 12-GPL on linux-x64

can you please have a look and advise how to move forward?

@aherbert
Copy link
Contributor

I do not know why openjdk12 was in the Travis build matrix. Travis did not like it.

This has just been fixed on master by @garydgregory to use oraclejdk11 and openjdk8. Any new work pushed to this this PR will use the new matrix from master.

The current travis results pass JDK 8 and 11 so this is not an issue.

@@ -37,7 +37,7 @@ public static void setUp() {
@Test
public void testGettingJaccardDistance() {
// Expected Jaccard distance = 1.0 - (intersect / union)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this comment is out of sync with the code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, can you describe why this comment looks wrong? For me it's good, shows the formula, as well as in JaccardSimilarityTest.java
// Expected Jaccard similarity = (intersect / union)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the comment reads:

// Expected Jaccard distance = 1.0 - (intersect / union)

and the next line is in contradiction with this comment, expecting 0.0, not 1.0.

assertEquals(0.0, classBeingTested.apply("", ""));

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @garydgregory , I'm still not convinced.
Intersection of two empty sets is empty set, union is also empty set. They are equal. So Jaccard similarity is 1. And therefore distance is 0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants