
TEXT-126: Adding Sorensen-Dice similarity algorithm #103

Open · wants to merge 20 commits into base: master
Conversation

ameyjadiye
Contributor

This pull request adds the Sorensen-Dice Similarity Algorithm to Apache Commons Text.

The Sørensen–Dice coefficient is a statistic used for comparing the similarity of two samples. It was developed independently by the botanists Thorvald Sørensen and Lee Raymond Dice, who published in 1948 and 1945 respectively.

@coveralls

coveralls commented Feb 26, 2019

Coverage Status

Coverage increased (+0.009%) to 97.926% when pulling 8f0a97c on ameyjadiye:master into a6f2b44 on apache:master.

@kinow
Member

kinow commented Feb 26, 2019

Hi @ameyjadiye ! A new metric! 🎉 Added a few comments, but I only spent a couple of minutes looking at the code. I will take a better look later and read about this similarity (I had never heard of it).

Thanks !!!

@aherbert
Contributor

Hi @ameyjadiye. Thanks for the contribution. I've had a look at this similarity as it is essentially a binary scoring metric:

Sørensen–Dice coefficient

Thus you must compute the matches (true positives) between the two sets X and Y (the intersection) and then include a penalty for items in either set that are not matched (false positives and false negatives).

(2 * TP) / (2 * TP + FP + FN)

Or simply:

2 * |X intersect Y| / (|X| + |Y|)

with |N| the cardinality, or count.

The score is related to the Jaccard score and can be computed directly from it (S = 2J / (1 + J)). However the JaccardSimilarity class in the same package uses single characters from each CharSequence and not character pairs (bigrams). So this algorithm using pairs is new.

The way you have defined the score is to take each unique character pair (bigram) as the set from each string. Thus:

sim.apply("aa", "aa")      = [aa] vs [aa] = 1.0
sim.apply("aaa", "aa")     = [aa] vs [aa] = 1.0
sim.apply("aaaa", "aa")    = [aa] vs [aa] = 1.0
sim.apply("aaaa", "aaa")   = [aa] vs [aa] = 1.0

The original article you referenced includes duplicate observations of each unique character pair (bigram). Thus:

sim.apply("aa", "aa")      = [aa]         vs [aa]     = (2 * 1) / (1 + 1) = 1.0
sim.apply("aaa", "aa")     = [aa][aa]     vs [aa]     = (2 * 1) / (2 + 1) = 0.667
sim.apply("aaaa", "aa")    = [aa][aa][aa] vs [aa]     = (2 * 1) / (3 + 1) = 0.5
sim.apply("aaaa", "aaa")   = [aa][aa][aa] vs [aa][aa] = (2 * 2) / (3 + 2) = 0.8

This requires a different algorithm without using Set. The author provides one which could be used for testing although it could be improved for efficiency.
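One way to score duplicates without a Set is to count the left sequence's bigrams in a map used as a bag. This is only a rough sketch of the duplicate-counting interpretation described above; the class and method names are hypothetical, not part of this PR:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: Sorensen-Dice over bigrams *including duplicates*,
// i.e. the bigram collections are treated as multisets (bags), not sets.
public final class DiceWithDuplicates {

    public static double similarity(CharSequence left, CharSequence right) {
        final Map<Integer, Integer> counts = new HashMap<>();
        final int n1 = Math.max(left.length() - 1, 0);
        final int n2 = Math.max(right.length() - 1, 0);
        // Count each bigram of the left sequence (packed into an int).
        for (int i = 0; i < n1; i++) {
            final int bigram = (left.charAt(i) << 16) | left.charAt(i + 1);
            counts.merge(bigram, 1, Integer::sum);
        }
        // Each bigram of the right sequence consumes one match if available.
        int matches = 0;
        for (int i = 0; i < n2; i++) {
            final int bigram = (right.charAt(i) << 16) | right.charAt(i + 1);
            final Integer c = counts.get(bigram);
            if (c != null && c > 0) {
                counts.put(bigram, c - 1);
                matches++;
            }
        }
        // (2 * TP) / (|X| + |Y|); the 0/0 ("" vs "") case is left to the caller.
        return (2.0 * matches) / (n1 + n2);
    }
}
```

With this counting, the worked examples above hold: "aaa" vs "aa" scores (2 * 1) / (2 + 1) = 0.667 and "aaaa" vs "aaa" scores (2 * 2) / (3 + 2) = 0.8.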

Q. Do you know if there is a preference in the field for parsing unique bigrams or including duplicates? Either way the Javadoc should make the intention clear.

I also note that splitting the CharSequence into bigrams can be done efficiently using a 32-bit integer to store each pair of 16-bit chars:

/**
 * Creates the bigrams packed as two 16-bit chars into a 32-bit int. This does not
 * support extended character sets.
 *
 * @param cs the char sequence
 * @return the bigrams
 */
private int[] createBigrams(CharSequence cs) {
    if (cs.length() < 2) {
        return new int[0];
    }
    final int[] bigrams = new int[cs.length() - 1];
    int ch2 = cs.charAt(0);
    for (int i = 0; i < bigrams.length; i++) {
        final int ch1 = ch2;
        ch2 = cs.charAt(i + 1);
        bigrams[i] = (ch1 << 16) | ch2;
    }
    return bigrams;
}

You can then do what you want with these including writing an algorithm that sorts the bigrams for efficient comparison, or use them in your current algorithm with Set<Integer>.

Also be careful of overflow in the result computation. I acknowledge that with the Set method you would have to input two CharSequences with close to all 2^16 * 2^16 bigrams to get two set sizes above half Integer.MAX_VALUE. With the duplicates method you just have to input two CharSequence that have a combined length above Integer.MAX_VALUE and then adding the sizes will overflow the int. For example if you were to compare two books!
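For illustration, the packed bigrams above can be combined with a sort-and-merge count, and long arithmetic sidesteps the int overflow just mentioned. This is a hedged sketch, not the PR code: the names are made up, and returning 1.0 for two empty inputs is just one choice for the 0/0 case debated later in this thread.

```java
import java.util.Arrays;

// Hypothetical sketch: count matching bigrams by sorting the packed int
// arrays and walking them in tandem, avoiding Set and boxing overhead.
public final class SortedBigramIntersection {

    static int[] createBigrams(CharSequence cs) {
        if (cs.length() < 2) {
            return new int[0];
        }
        final int[] bigrams = new int[cs.length() - 1];
        for (int i = 0; i < bigrams.length; i++) {
            // Pack two 16-bit chars into one 32-bit int.
            bigrams[i] = (cs.charAt(i) << 16) | cs.charAt(i + 1);
        }
        return bigrams;
    }

    // Number of common bigrams, counting duplicates (multiset intersection).
    static int countMatches(int[] a, int[] b) {
        Arrays.sort(a);
        Arrays.sort(b);
        int i = 0, j = 0, matches = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                matches++;
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return matches;
    }

    public static double similarity(CharSequence cs1, CharSequence cs2) {
        final int[] a = createBigrams(cs1);
        final int[] b = createBigrams(cs2);
        // long arithmetic so (|X| + |Y|) cannot overflow an int.
        final long total = (long) a.length + b.length;
        return total == 0 ? 1.0 : (2.0 * countMatches(a, b)) / total;
    }
}
```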

@kinow kinow left a comment

Finished the first round of reviews @ameyjadiye ! Great work, and thanks for the links. Today-I-learned about the Sorensen-Dice similarity; I hadn't heard of it. However, in NLP the F1-score is quite common, and it looks like it is a similar - if not the same? - measure as the Sorensen-Dice coefficient. Will read more later. Cheers

@ameyjadiye
Contributor Author

Hi @kinow , it looks like all suggestions are accommodated, with readable code and reasonable performance.
The code also mimics the algorithm given on Wikipedia, and the test cases pass as well. Do you want me to improve anything more here?

@aherbert
Contributor

aherbert commented Mar 6, 2019

Hi @ameyjadiye,

The code is now a good working implementation. However it should be noted that the Sorensen-Dice similarity is a binary scoring metric. It can be applied to anything where you have two sets to be matched. It is closely related to the Jaccard score.

This PR brings in a new algorithm to compute matches using pairs of characters (bigrams). This conflicts with the existing similarity measures in the package implementing SimilarityScore<R>:

JaccardSimilarity - uses Set<Character> for matching single characters to produce Double
JaroWinklerSimilarity - uses single character matching to produce an edit distance which is then converted to a similarity Double
LongestCommonSubsequence - uses single character matching for sub-sequences to produce Integer

The rest of the package is for an extension of SimilarityScore<R> which is EditDistance<R>:

CosineDistance - uses a RegexTokenizer on whitespace to match words to produce Double
HammingDistance - uses single character matching to produce Integer
JaccardDistance - just inverts the JaccardSimilarity to produce Double
JaroWinklerDistance - complementary of Jaro-Winkler similarity to produce Double
LevenshteinDetailedDistance - single character changes to produce LevenshteinResults
LevenshteinDistance - single character changes to produce Integer
LongestCommonSubsequenceDistance - single character based substring to produce Integer

So all the other measures use single characters or words (CosineDistance).

If you want a Sorensen-Dice distance score using single characters then just use the JaccardDistance and map it: S = 2J / (1 + J)

This class is effectively a JaccardDistance with bigrams but with a mapped output score.

One suggestion would be a change to be something like BigramJaccardSimilarity and then compute the Jaccard using bigrams. Then put new classes in for SorensenDiceSimilarity and BiSorensenDiceSimilarity.

However the current JaccardDistance uses Set<Character>, which has space and efficiency advantages over Set<String>. Once you are using strings then there is no reason to have only bigrams.

My preference would be to add a new class TokenizerJaccardSimilarity (or something nicer) that accepts a Tokenizer<CharSequence> in the constructor. This then tokenises the input into two sets and computes the union of the two sets.

Your bigram case can be fulfilled by writing a new Tokenizer<CharSequence> that returns the bigrams. This new class would be flexible for computing the Jaccard for words, bigrams, trigrams, n-grams, or whatever else you can tokenise a CharSequence into.

Then you add a TokenSorensenDiceSimilarity that just maps the Jaccard score. You can see this for example in the JaccardDistance class.

@kinow Opinions on a more flexible approach?

@aherbert
Contributor

aherbert commented Mar 7, 2019

I've had more of a think and have mocked an implementation for a generic IntersectionSimilarity.

See TEXT-155 and the code in the linked PR.

This class can be used to compute the Sorensen-Dice similarity for bigrams or any other type of structure that you would like to analyse from the CharSequence, e.g. characters, bigrams, trigrams, n-grams, words, etc.

Note: Sorensen-Dice similarity is the same as the F1-score

@kinow
Member

kinow commented Mar 7, 2019

Been busy with a few other bugs these past two days, but hopefully will be able to review this and TEXT-155 this weekend 🎉 thanks a lot @aherbert ! @ameyjadiye sorry the delay, just a bit longer and we should be done with the review for your contribution (:

@aherbert
Contributor

aherbert commented Mar 7, 2019

I have just noticed that the JaccardSimilarity computes 0 for the similarity of zero-length input sequences. This implementation computes 1, so one of them should be changed.

Commit notes from the timeline:
- https://github.com/aherbert/commons-text into feature-TEXT-155 (conflicts: src/test/java/org/apache/commons/text/similarity/IntersectionSimilarityTest.java)
- This adds Bag functionality to allow counting duplicate elements.
@kinow
Member

kinow commented Mar 9, 2019

@ameyjadiye see last comment from @aherbert about empty strings and 0 vs. 1.

@aherbert while we are discussing #109 , do you think that is a blocker for this pull request? So far I think at least the API proposed here would be kept right?

If so, this could be merged once the last comment is resolved, and then we can discuss how to organize the classes and where the sorensen-dice coefficient is calculated.

I think the only thing missing is deciding on the names of the classes - whether it should use Bigram in the name or be just SorensenDiceSimilarity.

I like the idea of having a descriptive name such as BigramSorensenDiceSimilarity (or Bigram in another place/order). However, I think we should also consider what users would expect, i.e. in other libraries, is the Sorensen-Dice similarity always used with bigrams? If other implementations in Python/JS/Java use bigrams, then we could leave it as SorensenDiceSimilarity and either add another method/constructor/etc. to customize the similarity, or have another class...

What do you think? (@ameyjadiye if you have any suggestion, feel free to chime in too 👍 [or any other person reading this 🙂 ])

@ameyjadiye
Contributor Author

@kinow @aherbert
I have written the code keeping Wikipedia as the standard, and researched some libraries from other languages for reference; all of them use bigrams for calculating similarity. I personally think that if we go further and use trigrams, 4-grams ... n-grams, the resulting percentage would be incorrect. We can't use character-by-character matching, i.e. unigrams, as that would also make the result bad, since ab != ba.

With n-grams, are we tampering with the existing proven algorithm? If it gives better results than the existing algorithm I'm ok with that, but does someone really need it in real-world examples?

Wikipedia says

When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y using bigrams as follows:[9]

s = 2nt / (nx + ny)
where nt is the number of character bigrams found in both strings, nx is the number of bigrams in string x and ny is the number of bigrams in string y. For example, to calculate the similarity between:

night
nacht
We would find the set of bigrams in each word:

{ni,ig,gh,ht}
{na,ac,ch,ht}
Each set has four elements, and the intersection of these two sets has only one element: ht.

Inserting these numbers into the formula, we calculate, s = (2 · 1) / (4 + 4) = 0.25.
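That worked example can be sketched directly with unique-bigram sets (hypothetical names, not the PR's implementation):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the Wikipedia worked example: unique-bigram sets and the
// Dice formula s = 2*nt / (nx + ny).
public final class WikipediaDiceExample {

    static Set<String> bigrams(String s) {
        final Set<String> set = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++) {
            set.add(s.substring(i, i + 2));
        }
        return set;
    }

    public static double similarity(String x, String y) {
        final Set<String> bx = bigrams(x);  // "night" -> {ni, ig, gh, ht}
        final Set<String> by = bigrams(y);  // "nacht" -> {na, ac, ch, ht}
        final int nx = bx.size();
        final int ny = by.size();
        bx.retainAll(by);                   // intersection: {ht}
        final int nt = bx.size();
        return (2.0 * nt) / (nx + ny);      // (2 * 1) / (4 + 4) = 0.25
    }
}
```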

Not sure, but are we over-engineering similarities with #109 ? Let me know if there is a practical use of n-grams in the real world; I would like to study it more.

@ameyjadiye
Contributor Author

@aherbert @kinow , over and above the existing algorithm, I liked the idea of changing the code as in #109 so the result will be "aa" vs "aaaaaaaa" == 2 / 8 = 0.25.

You guys can change the existing class later, or do you want me to change the code here to pass this case?

@aherbert
Contributor

aherbert commented Mar 9, 2019

I'll try and summarise where we are:

  1. It is not clear what the result of 0 / 0 should be

This is the "" vs "" case.

Possibles:

0 / 0 = n / n = 1
0 / 0 = n / 0 = NaN
0 / 0 = 0 / n = 0

The library I found (python distance library) dealt with this by returning NaN. Others have been found that return 1 (because they are the same, even if they are the same of nothing).

The current JaccardSimilarity has decided to return 0.

The new SorensenDiceSimilarity returns 1.

So this discrepancy in our own library should be resolved.

  2. Characters or bigrams

How do you split up a CharSequence to build a Set? There are so many ways. Or do you build a List?

There are library examples for splitting a string into a set of characters (the python distance library) and for using bigrams. I note that @kinow's example, the python textdistance library, supports using a Set or a List, and supports variable length n-grams:

>>> textdistance.Sorensen(qval=1, as_set=1)("aaaba","aab")
1.0
>>> textdistance.Sorensen(qval=1, as_set=0)("aaaba","aab")
0.75
>>> textdistance.Sorensen(qval=2, as_set=1)("aaaba","aab")
0.8
>>> textdistance.Sorensen(qval=2, as_set=0)("aaaba","aab")
0.6666666666666666

This problem is solved by #109 which allows the user to define how to split the sequence with a function.

The current JaccardSimilarity has decided to split into single characters.

The new SorensenDiceSimilarity splits into bigrams.

Both use sets. So this discrepancy should be resolved.

I think the solution is to support an n-gram size argument for the measure with default of 1. Then support a useSet flag with default of true. This is all possible using #109 to do the intersect computation. The key is to not break compatibility for anyone already using the JaccardSimilarity.

  3. How to score a CharSequence using bigrams with only 1 character
"a" vs "a"  = 0   or   1?
"a" vs "b"  = 0   or   NaN?

This is similar to point 1 where you are actually scoring 0/0 again since you have no bigrams.

  4. Handling null

Currently this just throws an exception. However null is similar to null.

So if the library decides not to support null because it does not exist, where does it stand on a zero-length CharSequence? This also does not exist, but does have an established meaning in the context of strings.

Currently the stance in the Jaccard is that it is a programming error to pass around null and expect comparisons, but it is perfectly reasonable to pass around empty stuff.

Discussion

To consolidate points 1, 3 and 4, perhaps we can change the library to state that:

  • Similarity is undefined for null and throw an exception
  • Similarity between two empty sequences is either: (a) exception; (b) perfect; or (c) nothing
  • Similarity between equal sequences that do not satisfy the criteria for the comparison algorithm is either: (a) perfect; or (b) nothing

Then support n-gram size and both the Set or List approach to defining the two collections of n-grams to compare.
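A rough sketch of the splitting part of that idea, with an n-gram size and a useSet flag (all names hypothetical, not proposed API):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;

// Hypothetical sketch: split a CharSequence into n-grams, optionally
// deduplicating (useSet), so one helper serves the current
// JaccardSimilarity behaviour (n = 1, set), this PR (n = 2, set) and
// the duplicate-counting variants (useSet = false).
public final class NGramSplitter {

    public static Collection<String> split(CharSequence cs, int n, boolean useSet) {
        final Collection<String> out = useSet ? new HashSet<>() : new ArrayList<>();
        for (int i = 0; i + n <= cs.length(); i++) {
            out.add(cs.subSequence(i, i + n).toString());
        }
        return out;
    }
}
```

With the textdistance example above, split("aaaba", 2, true) gives {aa, ab, ba} and split("aab", 2, true) gives {aa, ab}, so the set-based Dice score is (2 * 2) / (3 + 2) = 0.8, matching the qval=2, as_set=1 output.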

This involves some decisions.

My vote is:

  • Add n-gram and useSet parameters to the JaccardSimilarity
  • Add n-gram and useSet parameters to the SorensenDiceSimilarity
  • Make similarity 1 if the sequence does not satisfy the criteria for the algorithm but is identical. This is less destructive than throwing an exception or returning NaN

As long as the Javadoc explains the chosen functionality then I would be happy with any of the options. But my preference would be for equal inputs being perfectly similar:

"" vs "" == 1 (using a character comparison algorithm)
"a" vs "a" == 1 (using a bigram algorithm)

This is in-line with other similarity measures in the library which state equal sequences are perfectly similar.

Implementing a more generic approach can be done using methods in #109. So perhaps hold this PR until that is resolved.

Alex

@ameyjadiye
Contributor Author

I'm okay with either :) , and I had just overlooked #109 - looks nice @aherbert . Let me know when you guys merge it to trunk and I will make the changes here as well. Thanks.

Removed computation of metrics from the OverlapResult.
@kinow
Member

kinow commented Mar 10, 2019

Good summary @aherbert , thanks a lot for that.

My brain is still digesting it, and I won't risk commenting here on Sunday ~evening. Will sleep on it, and probably work on a few other things next week (might be a bit busier at $work too).

Let's see if we can find one or two possible alternatives, and if we can't find a solution that looks really good, we can take it to the dev mailing list to see what others think too.

@ameyjadiye thanks for your patience so far. The work on #109 is really nice indeed! I was going to merge it now but there are two pending checkstyle issues. Once fixed, I believe they will be merged. Then if you could update this branch with that code, that would make three of us who have looked over the API (so some good code review). Any feedback welcome, as we have time to fix anything before 1.7.

Cheers
Bruno

@ameyjadiye
Contributor Author

@kinow @aherbert I changed my code and it looks good to me now. Please review.

@kinow
Member

kinow commented Mar 10, 2019

Will have time throughout this week to review more calmly (Sunday evening here now). But I suspect it would be easier to wait for the other pull request to be merged first, then simply rebase your code and update it?

@ameyjadiye
Contributor Author

@kinow @aherbert , this looks good to me now, can you please take a look?

/**
* Converter class for creating Bigrams for SorensenDice similarity.
*/
private static class SorensenDiceConverter implements Function<CharSequence, Collection<Integer>> {
Contributor

This can be a singleton. No need to instantiate it for each use.

private static final SorensenDiceConverter INSTANCE = new SorensenDiceConverter();

But this functionality could also be delivered by using a method reference. It is also not related to SorensenDice and should be in a package-level utils class, something like:

CharSequenceConverterUtils ...

List<Character> toCharacterList(CharSequence cs);
List<Integer> toBigramList(CharSequence cs);
List<String> toNGramList(CharSequence cs, int n);
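A minimal sketch of what such a utils class could look like (the names follow the suggestion above but remain hypothetical; the bigram packing reuses the two-chars-in-an-int trick from earlier in this review):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical package-level utility; names are illustrative, not existing API.
final class CharSequenceConverterUtils {

    static List<Character> toCharacterList(CharSequence cs) {
        final List<Character> list = new ArrayList<>(cs.length());
        for (int i = 0; i < cs.length(); i++) {
            list.add(cs.charAt(i));
        }
        return list;
    }

    static List<Integer> toBigramList(CharSequence cs) {
        final List<Integer> list = new ArrayList<>();
        for (int i = 0; i < cs.length() - 1; i++) {
            // Pack two 16-bit chars into one 32-bit int.
            list.add((cs.charAt(i) << 16) | cs.charAt(i + 1));
        }
        return list;
    }
}
```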


public static double round(double value, int precision) {
Contributor

No need to be public. Why are you rounding anyway? A test like this can be very precise.

/**
* For shifting bigrams to fit in single integer.
*/
private static final int SHIFT_NUMBER = 16;
Contributor

Not part of this class but the internal converter class.

@aherbert
Contributor

Overall this class now works nicely. But I think that my initial reservations have not been addressed.

There is no reason that this should use bigrams, or allow duplicate bigrams.

If this goes through then you end up with a library that can compute the Jaccard with unique single characters (by using a Set) and the Sorensen with bigrams including duplicates (by using a List converted to a Bag). Since the two scores are related by the equations:

J = S / (2 - S)
S = 2J / (1 + J)

It really does not make sense to compute one with a different method from the other.
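The two mappings are exact inverses, which is easy to check numerically (hypothetical class name):

```java
// Quick numeric check of the stated relationship between the two scores:
// S = 2J / (1 + J) and J = S / (2 - S) round-trip to the same value.
public final class JaccardDiceRelation {

    public static double jaccardToDice(double j) {
        return 2 * j / (1 + j);
    }

    public static double diceToJaccard(double s) {
        return s / (2 - s);
    }
}
```

For example, J = 0.25 maps to S = 0.4 and back.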

I would actually state we have:

  • J1U and S2D

Where the number is the number of characters (n-gram), U is for unique (non-duplicates), and D is for duplicates.

We actually have possibilities for the following in the library:

  • J1U
  • J1D
  • J2U
  • J2D
  • S1U
  • S1D
  • S2U
  • S2D

Or generically: [JS]N[UD]

Putting in an S2D metric to complement a J1U metric makes a confusing and inconsistent library.

BTW, I believe bigrams are better and duplicates are better, so this is the best algorithm. It just totally contradicts the current Jaccard implementation.

A cleaner design is to have both metrics support single or bigrams (or even n-grams) and also allow or prevent duplicates via constructor arguments.

The current Jaccard can be modified to do so by allowing the default constructor to default to single, non-duplicate mode and overloaded constructors added to it.

This idea may be best achieved by pushing shared functionality into a NGramIntersectionResult class that accepts the size of the n-gram and the duplicates option and computes the IntersectionResult. This can internally use an IntersectionSimilarity class to do the computation.

class NGramIntersectionResult {

    NGramIntersectionResult(int ngram, boolean duplicates) {
         // ...
    }

    IntersectionResult compute(CharSequence cs1, CharSequence cs2) {
         // Use IntersectionSimilarity to compute the IntersectionResult:
         // Pass to the IntersectionSimilarity the function to split the CharSequence into 
         // the appropriate collection (Set or List) using n-grams
         //
         // All the handling of the generic type (Character, Integer, String) used for the n-gram
         // should be done in here and hidden from the API.
    }
}

This can be made more generic if necessary. But this then leads to the library that @kinow identified (Simmetrics) which can support tokenisation of input, case handling, splitting to n-grams, handling duplicates, extended unicode character handling, etc.

This level of generalisation is out of scope for the current similarity package. But n-grams and duplicates should be in scope because this is the difference between this algorithm for S2D and the current library J1U algorithm.

@ameyjadiye Would you like to add support for n-grams and unique/duplicate matching to this or attempt to add a package scope class that can be shared with the JaccardSimilarity?

@ameyjadiye
Contributor Author

@aherbert sounds like a good idea; let me create another PR for that and come back to this.

@ameyjadiye
Contributor Author

@aherbert apologies, I was busy with a lot of other stuff and will have to go through this again. This PR is a real mess.


@mvarela mvarela left a comment


There are some minor errors in the javadoc

* similarity.apply("", "") = 1.0
* similarity.apply("foo", "foo") = 1.0
* similarity.apply("foo", "foo ") = 0.8
* similarity.apply("foo", "foo ") = 0.66


There are some erroneous examples in the docstring. similarity.apply("foo", "foo ") appears twice, and the second result (0.66) is incorrect.
Likewise, similarity.apply("foo", " foo ") should be 0.66666...

Member

ping @ameyjadiye

Good spotting @mvarela

* similarity.apply("", "") = 1.0
* similarity.apply("foo", "foo") = 1.0
* similarity.apply("foo", "foo ") = 0.8
* similarity.apply("foo", "foo ") = 0.66
Member

ping @ameyjadiye

Good spotting @mvarela

@kinow
Member

kinow commented Jun 2, 2020

@ameyjadiye whenever you have time to update this PR, I will have another look and try to review/merge it 👍 thanks!

@ameyjadiye
Contributor Author

Hey, thanks for the reminder @kinow , I will finish this ASAP. This PR became a real mess though; squashing will be hard. I will try to incorporate @aherbert 's suggestions here and push the changes.

@kinow
Member

kinow commented Jun 2, 2020

Brilliant @ameyjadiye , but if you are busy having fun with commons-graph, we can leave this one for later. Whenever you have time 👍

@garydgregory
Member

Hi All. Where are we here?

@ameyjadiye
Contributor Author

@garydgregory - a long-pending issue; its development got lost in the discussions. I'll go through it once again and try to finish. Thanks for the reminder. :)
