
TEXT-126: Adding Sorensen-Dice similarity algorithm #103

Open · wants to merge 20 commits into base: master
Conversation

ameyjadiye
Contributor

This pull request adds the Sorensen-Dice Similarity Algorithm to Apache Commons Text.

The Sørensen–Dice coefficient is a statistic used for comparing the similarity of two samples. It was developed independently by the botanists Thorvald Sørensen and Lee Raymond Dice, who published in 1948 and 1945 respectively.

@coveralls

coveralls commented Feb 26, 2019

Coverage Status

Coverage increased (+0.009%) to 97.926% when pulling 8f0a97c on ameyjadiye:master into a6f2b44 on apache:master.

@kinow
Member

kinow commented Feb 26, 2019

Hi @ameyjadiye ! A new metric! 🎉 Added a few comments, but I only spent a couple of minutes looking at the code. I will take a better look later and read about this similarity (I had never heard of it).

Thanks !!!

@aherbert
Contributor

Hi @ameyjadiye. Thanks for the contribution. I've had a look at this similarity as it is essentially a binary scoring metric:

Sørensen–Dice coefficient

Thus you must compute the matches (true positives) between the two sets X and Y (the intersection) and then include a penalty for items in either set that are not matched (false positives and false negatives).

(2 * TP) / (2 * TP + FP + FN)

Or simply:

2 * |X intersect Y| / (|X| + |Y|)

with |N| the cardinality, or count.

The score is related to the Jaccard score and can be computed directly from it (S = 2J / (1 + J)). However the JaccardSimilarity class in the same package uses single characters from each CharSequence and not character pairs (bigrams). So this algorithm using pairs is new.

The way you have defined the score is to take each unique character pair (bigram) as the set from each string. Thus:

sim.apply("aa", "aa")      = [aa] vs [aa] = 1.0
sim.apply("aaa", "aa")     = [aa] vs [aa] = 1.0
sim.apply("aaaa", "aa")    = [aa] vs [aa] = 1.0
sim.apply("aaaa", "aaa")   = [aa] vs [aa] = 1.0

The original article you referenced includes duplicate observations of each unique character pair (bigram). Thus:

sim.apply("aa", "aa")      = [aa]         vs [aa]     = (2 * 1) / (1 + 1) = 1.0
sim.apply("aaa", "aa")     = [aa][aa]     vs [aa]     = (2 * 1) / (2 + 1) = 0.667
sim.apply("aaaa", "aa")    = [aa][aa][aa] vs [aa]     = (2 * 1) / (3 + 1) = 0.5
sim.apply("aaaa", "aaa")   = [aa][aa][aa] vs [aa][aa] = (2 * 2) / (3 + 2) = 0.8

This requires a different algorithm without using Set. The author provides one which could be used for testing although it could be improved for efficiency.
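One way to score duplicates without a Set is to count the left sequence's bigrams in a map used as a bag. This is only a rough sketch of the duplicate-counting interpretation described above; the class and method names are hypothetical, not part of this PR:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: Sorensen-Dice over bigrams *including duplicates*,
// i.e. the bigram collections are treated as multisets (bags), not sets.
public final class DiceWithDuplicates {

    public static double similarity(CharSequence left, CharSequence right) {
        final Map<Integer, Integer> counts = new HashMap<>();
        final int n1 = Math.max(left.length() - 1, 0);
        final int n2 = Math.max(right.length() - 1, 0);
        // Count each bigram of the left sequence (packed into an int).
        for (int i = 0; i < n1; i++) {
            final int bigram = (left.charAt(i) << 16) | left.charAt(i + 1);
            counts.merge(bigram, 1, Integer::sum);
        }
        // Each bigram of the right sequence consumes one match if available.
        int matches = 0;
        for (int i = 0; i < n2; i++) {
            final int bigram = (right.charAt(i) << 16) | right.charAt(i + 1);
            final Integer c = counts.get(bigram);
            if (c != null && c > 0) {
                counts.put(bigram, c - 1);
                matches++;
            }
        }
        // (2 * TP) / (|X| + |Y|); the 0/0 ("" vs "") case is left to the caller.
        return (2.0 * matches) / (n1 + n2);
    }
}
```

With this counting, the worked examples above hold: "aaa" vs "aa" scores (2 * 1) / (2 + 1) = 0.667 and "aaaa" vs "aaa" scores (2 * 2) / (3 + 2) = 0.8.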

Q. Do you know if there is a preference in the field for parsing unique bigrams or including duplicates? Either way the Javadoc should make the intention clear.

I also note that splitting the CharSequence into bigrams can be done efficiently using a 32-bit integer to store each pair of 16-bit chars:

/**
 * Creates the bigrams packed as two 16-bit chars into a 32-bit int. This does not
 * support extended character sets.
 *
 * @param cs the char sequence
 * @return the bigrams
 */
private int[] createBigrams(CharSequence cs) {
    if (cs.length() < 2) {
        return new int[0];
    }
    final int[] bigrams = new int[cs.length() - 1];
    int ch2 = cs.charAt(0);
    for (int i = 0; i < bigrams.length; i++) {
        final int ch1 = ch2;
        ch2 = cs.charAt(i + 1);
        bigrams[i] = (ch1 << 16) | ch2;
    }
    return bigrams;
}

You can then do what you want with these including writing an algorithm that sorts the bigrams for efficient comparison, or use them in your current algorithm with Set<Integer>.

Also be careful of overflow in the result computation. I acknowledge that with the Set method you would have to input two CharSequences with close to all 2^16 * 2^16 bigrams to get two set sizes above half Integer.MAX_VALUE. With the duplicates method you just have to input two CharSequence that have a combined length above Integer.MAX_VALUE and then adding the sizes will overflow the int. For example if you were to compare two books!
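For illustration, the packed bigrams above can be combined with a sort-and-merge count, and long arithmetic sidesteps the int overflow just mentioned. This is a hedged sketch, not the PR code: the names are made up, and returning 1.0 for two empty inputs is just one choice for the 0/0 case debated later in this thread.

```java
import java.util.Arrays;

// Hypothetical sketch: count matching bigrams by sorting the packed int
// arrays and walking them in tandem, avoiding Set and boxing overhead.
public final class SortedBigramIntersection {

    static int[] createBigrams(CharSequence cs) {
        if (cs.length() < 2) {
            return new int[0];
        }
        final int[] bigrams = new int[cs.length() - 1];
        for (int i = 0; i < bigrams.length; i++) {
            // Pack two 16-bit chars into one 32-bit int.
            bigrams[i] = (cs.charAt(i) << 16) | cs.charAt(i + 1);
        }
        return bigrams;
    }

    // Number of common bigrams, counting duplicates (multiset intersection).
    static int countMatches(int[] a, int[] b) {
        Arrays.sort(a);
        Arrays.sort(b);
        int i = 0, j = 0, matches = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                matches++;
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return matches;
    }

    public static double similarity(CharSequence cs1, CharSequence cs2) {
        final int[] a = createBigrams(cs1);
        final int[] b = createBigrams(cs2);
        // long arithmetic so (|X| + |Y|) cannot overflow an int.
        final long total = (long) a.length + b.length;
        return total == 0 ? 1.0 : (2.0 * countMatches(a, b)) / total;
    }
}
```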

@kinow kinow left a comment

Finished the first round of reviews @ameyjadiye ! Great work, and thanks for the links. Today-I-learned about the Sorensen-Dice similarity; I hadn't heard of it. However, in NLP the F1-score is quite common, and it looks like it is a similar - if not the same? - measure as the Sorensen-Dice coefficient. Will read more later. Cheers

@ameyjadiye
Contributor Author

Hi @kinow , it looks like all suggestions are accommodated, with readable code and reasonable performance.
The code also mimics the algorithm given on Wikipedia, and the test cases pass as well. Do you want me to improve anything more here?

@aherbert
Contributor

aherbert commented Mar 6, 2019

Hi @ameyjadiye,

The code is now a good working implementation. However it should be noted that the Sorensen-Dice similarity is a binary scoring metric. It can be applied to anything where you have two sets to be matched. It is closely related to the Jaccard score.

This PR brings in a new algorithm to compute matches using pairs of characters (bigrams). This conflicts with the existing similarity measures in the package implementing SimilarityScore<R>:

JaccardSimilarity - uses Set<Character> for matching single characters to produce Double
JaroWinklerSimilarity - uses single character matching to produce an edit distance which is then converted to a similarity Double
LongestCommonSubsequence - uses single character matching for sub-sequences to produce Integer

The rest of the package is for an extension of SimilarityScore<R> which is EditDistance<R>:

CosineDistance - uses a RegexTokenizer on whitespace to match words to produce Double
HammingDistance - uses single character matching to produce Integer
JaccardDistance - just inverts the JaccardSimilarity to produce Double
JaroWinklerDistance - complementary of Jaro-Winkler similarity to produce Double
LevenshteinDetailedDistance - single character changes to produce LevenshteinResults
LevenshteinDistance - single character changes to produce Integer
LongestCommonSubsequenceDistance - single character based substring to produce Integer

So all the other measures use single characters or words (CosineDistance).

If you want a Sorensen-Dice distance score using single characters then just use the JaccardDistance and map it: S = 2J / (1 + J)

This class is effectively a JaccardDistance with bigrams but with a mapped output score.

One suggestion would be a change to be something like BigramJaccardSimilarity and then compute the Jaccard using bigrams. Then put new classes in for SorensenDiceSimilarity and BiSorensenDiceSimilarity.

However the current JaccardDistance uses Set<Character>, which has space and efficiency advantages over Set<String>. Once you are using strings then there is no reason to have only bigrams.

My preference would be to add a new class TokenizerJaccardSimilarity (or something nicer) that accepts a Tokenizer<CharSequence> in the constructor. This then tokenises the input into two sets and computes the union of the two sets.

Your bigram case can be fulfilled by writing a new Tokenizer<CharSequence> that returns the bigrams. This new class would be flexible for computing the Jaccard for words, bigrams, trigrams, n-grams, or whatever else you can tokenise a CharSequence into.

Then you add a TokenSorensenDiceSimilarity that just maps the Jaccard score. You can see this for example in the JaccardDistance class.

@kinow Opinions on a more flexible approach?

@aherbert
Contributor

aherbert commented Mar 7, 2019

I've had more of a think and have mocked an implementation for a generic IntersectionSimilarity.

See TEXT-155 and the code in the linked PR.

This class can be used to compute the Sorensen-Dice similarity for bigrams or any other type of structure that you would like to analyse from the CharSequence, e.g. characters, bigrams, trigrams, n-grams, words, etc.

Note: Sorensen-Dice similarity is the same as the F1-score

@kinow
Member

kinow commented Mar 7, 2019

Been busy with a few other bugs these past two days, but hopefully will be able to review this and TEXT-155 this weekend 🎉 thanks a lot @aherbert ! @ameyjadiye sorry the delay, just a bit longer and we should be done with the review for your contribution (:

@aherbert
Contributor

aherbert commented Mar 7, 2019

I have just noticed that the JaccardSimilarity computes 0 for the similarity of zero-length input sequences. This implementation computes 1, so one of them should be changed.

Commit notes from the timeline:
- https://github.com/aherbert/commons-text into feature-TEXT-155 (conflicts: src/test/java/org/apache/commons/text/similarity/IntersectionSimilarityTest.java)
- This adds Bag functionality to allow counting duplicate elements.
@kinow
Member

kinow commented Mar 9, 2019

@ameyjadiye see last comment from @aherbert about empty strings and 0 vs. 1.

@aherbert while we are discussing #109 , do you think that is a blocker for this pull request? So far I think at least the API proposed here would be kept right?

If so, this could be merged once the last comment is resolved, and then we can discuss how to organize the classes and where the sorensen-dice coefficient is calculated.

I think the only thing missing is deciding on the names of the classes - whether it should use Bigram in the name or be just SorensenDiceSimilarity.

I like the idea of having a descriptive name such as BigramSorensenDiceSimilarity (or Bigram in another place/order). However, I think we should also consider what users would expect, i.e. in other libraries, is the Sorensen-Dice similarity always used with bigrams? If other implementations in Python/JS/Java use bigrams, then we could leave it as SorensenDiceSimilarity and either add another method/constructor/etc. to customize the similarity, or have another class...

What do you think? (@ameyjadiye if you have any suggestion, feel free to chime in too 👍 [or any other person reading this 🙂 ])

@ameyjadiye
Contributor Author

@kinow @aherbert
I have written the code keeping Wikipedia as the standard, and researched some libraries from other languages for reference; all of them use bigrams for calculating similarity. I personally think that if we go further and use trigrams, 4-grams ... n-grams, the resulting percentage would be incorrect. We can't use character-by-character matching, i.e. unigrams, as that would also make the result bad, since ab != ba.

With n-grams, are we tampering with the existing proven algorithm? If it gives better results than the existing algorithm I'm ok with that, but does someone really need it in real-world examples?

Wikipedia says

When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y using bigrams as follows:[9]

s = 2nt / (nx + ny)
where nt is the number of character bigrams found in both strings, nx is the number of bigrams in string x and ny is the number of bigrams in string y. For example, to calculate the similarity between:

night
nacht
We would find the set of bigrams in each word:

{ni,ig,gh,ht}
{na,ac,ch,ht}
Each set has four elements, and the intersection of these two sets has only one element: ht.

Inserting these numbers into the formula, we calculate, s = (2 · 1) / (4 + 4) = 0.25.
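That worked example can be sketched directly with unique-bigram sets (hypothetical names, not the PR's implementation):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the Wikipedia worked example: unique-bigram sets and the
// Dice formula s = 2*nt / (nx + ny).
public final class WikipediaDiceExample {

    static Set<String> bigrams(String s) {
        final Set<String> set = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++) {
            set.add(s.substring(i, i + 2));
        }
        return set;
    }

    public static double similarity(String x, String y) {
        final Set<String> bx = bigrams(x);  // "night" -> {ni, ig, gh, ht}
        final Set<String> by = bigrams(y);  // "nacht" -> {na, ac, ch, ht}
        final int nx = bx.size();
        final int ny = by.size();
        bx.retainAll(by);                   // intersection: {ht}
        final int nt = bx.size();
        return (2.0 * nt) / (nx + ny);      // (2 * 1) / (4 + 4) = 0.25
    }
}
```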

Not sure, but are we over-engineering similarities with #109 ? Let me know if there is a practical use of n-grams in the real world; I would like to study it more.

@ameyjadiye
Contributor Author

@aherbert @kinow , over and above the existing algorithm, I liked the idea of changing the code as in #109 so the result will be "aa" vs "aaaaaaaa" == 2 / 8 = 0.25.

You guys can change the existing class later, or do you want me to change the code here to pass this case?

@aherbert
Contributor

aherbert commented Mar 9, 2019

I'll try and summarise where we are:

  1. It is not clear what the result of 0 / 0 should be

This is the "" vs "" case.

Possibles:

0 / 0 = n / n = 1
0 / 0 = n / 0 = NaN
0 / 0 = 0 / n = 0

The library I found (python distance library) dealt with this by returning NaN. Others have been found that return 1 (because they are the same, even if they are the same of nothing).

The current JaccardSimilarity has decided to return 0.

The new SorensenDiceSimilarity returns 1.

So this discrepancy in our own library should be resolved.

  2. Characters or bigrams

How do you split up a CharSequence to build a Set? There are so many ways. Or do you build a List?

There are library examples for splitting a string into a set of characters (the python distance library) and for using bigrams. I note that @kinow's example, the python textdistance library, supports using a Set or a List, and supports variable length n-grams:

>>> textdistance.Sorensen(qval=1, as_set=1)("aaaba","aab")
1.0
>>> textdistance.Sorensen(qval=1, as_set=0)("aaaba","aab")
0.75
>>> textdistance.Sorensen(qval=2, as_set=1)("aaaba","aab")
0.8
>>> textdistance.Sorensen(qval=2, as_set=0)("aaaba","aab")
0.6666666666666666

This problem is solved by #109 which allows the user to define how to split the sequence with a function.

The current JaccardSimilarity has decided to split into single characters.

The new SorensenDiceSimilarity splits into bigrams.

Both use sets. So this discrepancy should be resolved.

I think the solution is to support an n-gram size argument for the measure with default of 1. Then support a useSet flag with default of true. This is all possible using #109 to do the intersect computation. The key is to not break compatibility for anyone already using the JaccardSimilarity.

  3. How to score a CharSequence using bigrams with only 1 character
"a" vs "a"  = 0   or   1?
"a" vs "b"  = 0   or   NaN?

This is similar to point 1 where you are actually scoring 0/0 again since you have no bigrams.

  4. Handling null

Currently this just throws an exception. However null is similar to null.

So if the library decides not to support null because it does not exist, where does it stand on a zero-length CharSequence? This also does not exist, but does have an established meaning in the context of strings.

Currently the stance in the Jaccard is that it is a programming error to pass around null and expect comparisons, but it is perfectly reasonable to pass around empty stuff.

Discussion

To consolidate points 1, 3 and 4, perhaps we can change the library to state that:

  • Similarity is undefined for null and throw an exception
  • Similarity between two empty sequences is either: (a) exception; (b) perfect; or (c) nothing
  • Similarity between equal sequences that do not satisfy the criteria for the comparison algorithm is either: (a) perfect; or (b) nothing

Then support n-gram size and both the Set or List approach to defining the two collections of n-grams to compare.
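A rough sketch of the splitting part of that idea, with an n-gram size and a useSet flag (all names hypothetical, not proposed API):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;

// Hypothetical sketch: split a CharSequence into n-grams, optionally
// deduplicating (useSet), so one helper serves the current
// JaccardSimilarity behaviour (n = 1, set), this PR (n = 2, set) and
// the duplicate-counting variants (useSet = false).
public final class NGramSplitter {

    public static Collection<String> split(CharSequence cs, int n, boolean useSet) {
        final Collection<String> out = useSet ? new HashSet<>() : new ArrayList<>();
        for (int i = 0; i + n <= cs.length(); i++) {
            out.add(cs.subSequence(i, i + n).toString());
        }
        return out;
    }
}
```

With the textdistance example above, split("aaaba", 2, true) gives {aa, ab, ba} and split("aab", 2, true) gives {aa, ab}, so the set-based Dice score is (2 * 2) / (3 + 2) = 0.8, matching the qval=2, as_set=1 output.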

This involves some decisions.

My vote is:

  • Add n-gram and useSet parameters to the JaccardSimilarity
  • Add n-gram and useSet parameters to the SorensenDiceSimilarity
  • Make similarity 1 if the sequence does not satisfy the criteria for the algorithm but is identical. This is less destructive than throwing an exception or returning NaN

As long as the Javadoc explains the chosen functionality then I would be happy with any of the options. But my preference would be for equal inputs being perfectly similar:

"" vs "" == 1 (using a character comparison algorithm)
"a" vs "a" == 1 (using a bigram algorithm)

This is in-line with other similarity measures in the library which state equal sequences are perfectly similar.

Implementing a more generic approach can be done using methods in #109. So perhaps hold this PR until that is resolved.

Alex

@ameyjadiye
Contributor Author

I'm okay with either :) , and I had just overlooked #109 - looks nice @aherbert . Let me know when you guys merge it to trunk and I will make the changes here as well. Thanks.

Removed computation of metrics from the OverlapResult.
@kinow
Member

kinow commented Mar 10, 2019

Good summary @aherbert , thanks a lot for that.

My brain is still digesting it, and I won't risk commenting here on Sunday ~evening. Will sleep on it, and probably work on a few other things next week (might be a bit busier at $work too).

Let's see if we can find one or two possible alternatives, and if we can't find a solution that looks really good, we can take it to the dev mailing list to see what others think too.

@ameyjadiye thanks for your patience so far. The work on #109 is really nice indeed! I was going to merge it now but there are two pending checkstyle issues. Once fixed, I believe they will be merged. Then if you could update this branch with that code, that would make three of us who have looked over the API (so some good code review). Any feedback welcome, as we have time to fix anything before 1.7.

Cheers
Bruno

@ameyjadiye
Contributor Author

@kinow @aherbert I changed my code and it looks good to me now. Please review.

@kinow
Member

kinow commented Mar 10, 2019

Will have time throughout this week to review more calmly (Sunday evening here now). But I suspect it would be easier to wait for the other pull request to be merged first, then simply rebase your code and update it?

@ameyjadiye
Contributor Author

@kinow @aherbert , this looks good to me now, can you please take a look?

/**
* Converter class for creating Bigrams for SorensenDice similarity.
*/
private static class SorensenDiceConverter implements Function<CharSequence, Collection<Integer>> {
Contributor

This can be a singleton. No need to instantiate it for each use.

private static final SorensenDiceConverter INSTANCE = new SorensenDiceConverter();

But this functionality could also be delivered by using a method reference. It is also not related to SorensenDice and should be in a package-level utils class, something like:

CharSequenceConverterUtils ...

List<Character> toCharacterList(CharSequence cs);
List<Integer> toBigramList(CharSequence cs);
List<String> toNGramList(CharSequence cs, int n);
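A minimal sketch of what such a utils class could look like (the names follow the suggestion above but remain hypothetical; the bigram packing reuses the two-chars-in-an-int trick from earlier in this review):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical package-level utility; names are illustrative, not existing API.
final class CharSequenceConverterUtils {

    static List<Character> toCharacterList(CharSequence cs) {
        final List<Character> list = new ArrayList<>(cs.length());
        for (int i = 0; i < cs.length(); i++) {
            list.add(cs.charAt(i));
        }
        return list;
    }

    static List<Integer> toBigramList(CharSequence cs) {
        final List<Integer> list = new ArrayList<>();
        for (int i = 0; i < cs.length() - 1; i++) {
            // Pack two 16-bit chars into one 32-bit int.
            list.add((cs.charAt(i) << 16) | cs.charAt(i + 1));
        }
        return list;
    }
}
```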


public static double round(double value, int precision) {
Contributor

No need to be public. Why are you rounding anyway? A test like this can be very precise.

/**
* For shifting bigrams to fit in single integer.
*/
private static final int SHIFT_NUMBER = 16;
Contributor

Not part of this class but the internal converter class.

@aherbert
Contributor

Overall this class now works nicely. But I think that my initial reservations have not been addressed.

There is no reason that this should use bigrams, or allow duplicate bigrams.

If this goes through then you end up with a library that can compute the Jaccard with unique single characters (by using a Set) and the Sorensen with bigrams including duplicates (by using a List converted to a Bag). Since the two scores are related by the equations:

J = S / (2 - S)
S = 2J / (1 + J)

It really does not make sense to compute one with a different method from the other.
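The two mappings are exact inverses, which is easy to check numerically (hypothetical class name):

```java
// Quick numeric check of the stated relationship between the two scores:
// S = 2J / (1 + J) and J = S / (2 - S) round-trip to the same value.
public final class JaccardDiceRelation {

    public static double jaccardToDice(double j) {
        return 2 * j / (1 + j);
    }

    public static double diceToJaccard(double s) {
        return s / (2 - s);
    }
}
```

For example, J = 0.25 maps to S = 0.4 and back.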

I would actually state we have:

  • J1U and S2D

Where the number is the number of characters (n-gram), U is for unique (non-duplicates), and D is for duplicates.

We actually have possibilities for the following in the library:

  • J1U
  • J1D
  • J2U
  • J2D
  • S1U
  • S1D
  • S2U
  • S2D

Or generically: [JS]N[UD]

Putting in an S2D metric to complement a J1U metric makes a confusing and inconsistent library.

BTW, I believe bigrams are better and duplicates are better, so this is the best algorithm. It just totally contradicts the current Jaccard implementation.

A cleaner design is to have both metrics support single or bigrams (or even n-grams) and also allow or prevent duplicates via constructor arguments.

The current Jaccard can be modified to do so by allowing the default constructor to default to single, non-duplicate mode and overloaded constructors added to it.

This idea may be best achieved by pushing shared functionality into a NGramIntersectionResult class that accepts the size of the n-gram and the duplicates option and computes the IntersectionResult. This can internally use an IntersectionSimilarity class to do the computation.

class NGramIntersectionResult {

    NGramIntersectionResult(int ngram, boolean duplicates) {
         // ...
    }

    IntersectionResult compute(CharSequence cs1, CharSequence cs2) {
         // Use IntersectionSimilarity to compute the IntersectionResult:
         // Pass to the IntersectionSimilarity the function to split the CharSequence into 
         // the appropriate collection (Set or List) using n-grams
         //
         // All the handling of the generic type (Character, Integer, String) used for the n-gram
         // should be done in here and hidden from the API.
    }
}

This can be made more generic if necessary. But this then leads to the library that @kinow identified (Simmetrics) which can support tokenisation of input, case handling, splitting to n-grams, handling duplicates, extended unicode character handling, etc.

This level of generalisation is out of scope for the current similarity package. But n-grams and duplicates should be in scope because this is the difference between this algorithm for S2D and the current library J1U algorithm.

@ameyjadiye Would you like to add support for n-grams and unique/duplicate matching to this or attempt to add a package scope class that can be shared with the JaccardSimilarity?

@ameyjadiye
Contributor Author

@aherbert sounds like a good idea; let me create another PR for that and come back to this.

@ameyjadiye
Contributor Author

@aherbert apologies, I was busy with a lot of other stuff and will have to go through this again. This PR is a real mess.


@mvarela mvarela left a comment


There are some minor errors in the javadoc

* similarity.apply("", "") = 1.0
* similarity.apply("foo", "foo") = 1.0
* similarity.apply("foo", "foo ") = 0.8
* similarity.apply("foo", "foo ") = 0.66


There are some erroneous examples in the docstring. similarity.apply("foo", "foo ") appears twice, and the second result (0.66) is incorrect.
Likewise, similarity.apply("foo", " foo ") should be 0.66666...

Member

ping @ameyjadiye

Good spotting @mvarela

* similarity.apply("", "") = 1.0
* similarity.apply("foo", "foo") = 1.0
* similarity.apply("foo", "foo ") = 0.8
* similarity.apply("foo", "foo ") = 0.66
Member

ping @ameyjadiye

Good spotting @mvarela

@kinow
Member

kinow commented Jun 2, 2020

@ameyjadiye whenever you have time to update this PR, I will have another look and try to review/merge it 👍 thanks!

@ameyjadiye
Contributor Author

Hey, thanks for the reminder @kinow , I will finish this ASAP. This PR became a real mess though; squashing will be hard. I will try to incorporate @aherbert 's suggestions here and push the changes.

@kinow
Member

kinow commented Jun 2, 2020

Brilliant @ameyjadiye , but if you are busy having fun with commons-graph, we can leave this one for later. Whenever you have time 👍

@garydgregory
Member

Hi All. Where are we here?

@ameyjadiye
Contributor Author

@garydgregory - a long-pending issue; its development got lost in the discussions. I'll go through it once again and try to finish. Thanks for the reminder. :)
