-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TEXT-126: Adding Sorensen-Dice similarity algoritham #103
base: master
Are you sure you want to change the base?
Conversation
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/test/java/org/apache/commons/text/similarity/SorensenDicesSimilarityTest.java
Outdated
Show resolved
Hide resolved
Hi @ameyjadiye ! A new metric! 🎉 Added a few comments, but only spent a couple of minutes looking at the code. I will take a better look later, and read about this similarity (never heard about it). Thanks !!! |
Hi @ameyjadiye. Thanks for the contribution. I've had a look at this similarity as it is essentially a binary scoring metric: Thus you must compute the matches (true-positives) between the two sets
Or simply:
with The score is related to the The way you have defined the score is to take each unique character pair (bigram) as the set from each string. Thus:
The original article you referenced includes duplicate observations of each unique character pair (bigram). Thus:
This requires a different algorithm without using Q. Do you know if there is a preference in the field for parsing unique bigrams or including duplicates? Either way it should be clear in the Javadoc what the intension is. I also note that splitting the
You can then do what you want with these including writing an algorithm that sorts the bigrams for efficient comparison, or use them in your current algorithm with Also be careful of overflow in the result computation. I acknowledge that with the |
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/test/java/org/apache/commons/text/similarity/SorensenDicesSimilarityTest.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finished first round of reviews @ameyjadiye ! Great work, and thanks for the links. Today-I-Learned about the Sorensen-Dice similarity. Hadn't heard about it. However, in NLP F1-Score is quite common, and looks like it is a - if not the same? - Sorensen-Dice coefficient. Will read more later. Cheers
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
Hi @kinow , looks like all suggestions are accommodated with readable code and considerable performance. |
Hi @ameyjadiye, The code is now a good working implementation. However it should be noted that the Sorensen-Dice similarity is a binary scoring metric. It can be applied to anything where you have two sets to be matched. It is closely related to the Jaccard score. This PR brings in a new algorithm to compute matches using pairs of characters (bigrams). This is a conflict with existing similarity measures in the package implementing JaccardSimilarity - uses The rest of the package is for an extension of CosineDistance - uses a So all the other measures use single characters or words (CosineDistance). If you want a Sorensen-Dice distance score using single characters then just use the This class is effectively a JaccardDistance with One suggestion would be a change to be something like However the current JaccardDistance uses Set which has space and efficiency advantages over Set. Once you are using strings then there is no reason to have only bigrams. My preference would be to add a new class Your Then you add a @kinow Opinions on a more flexible approach? |
I've had more of a think and have mocked an implementation for a generic See TEXT-155 and the code in the linked PR. This class can be used to compute the Sorensen-Dice similarity for bigrams or any other type of structure that you would like to analyse from the Note: Sorensen-Dice similarity is the same as the F1-score |
Been busy with a few other bugs these past two days, but hopefully will be able to review this and TEXT-155 this weekend 🎉 thanks a lot @aherbert ! @ameyjadiye sorry the delay, just a bit longer and we should be done with the review for your contribution (: |
I have just noticed that the |
src/main/java/org/apache/commons/text/similarity/SorensenDiceSimilarity.java
Show resolved
Hide resolved
https://github.com/aherbert/commons-text into feature-TEXT-155 # Conflicts: # src/test/java/org/apache/commons/text/similarity/IntersectionSimilarityTest.java
This adds Bag functionality to allow counting duplicate elements.
@ameyjadiye see last comment from @aherbert about empty strings and @aherbert while we are discussing #109 , do you think that is a blocker for this pull request? So far I think at least the API proposed here would be kept right? If so, this could be merged once the last comment is resolved, and then we can discuss how to organize the classes and where the sorensen-dice coefficient is calculated. I think the only thing missing is deciding on the name of the classes? Whether it should use I like the idea of having a descriptive name such as What do you think? (@ameyjadiye if you have any suggestion, feel free to chime in too 👍 [or any other person reading this 🙂 ]) |
@kinow @aherbert with nGram are we tampering the existing proved algoritham ? if its giving better results than existing algo I'm ok with that, also does someone really need it in real world examples ? Wikipedia says
not sure but are we over engineering similarities with #109 ? let me know if there is practicle use of nGram in real world ? would like to study it more. |
I'll try and summarise where we are:
This is the Possibles:
The library I found (python distance library) dealt with this by returning NaN. Others have been found that return 1 (because they are the same, even if they are the same of nothing). The current The new So this discrepancy in our own library should be resolved.
How do you split up a There are library examples for splitting a string into a set of characters (python distance library) and for using bigrams. I note that @kinow example, the python
This problem is solved by #109 which allows the user to define how to split the sequence with a function. The current The new Both use sets. So this discrepancy should be resolved. I think the solution is to support an n-gram size argument for the measure with default of 1. Then support a
This is similar to point 1 where you are actually scoring 0/0 again since you have no bigrams.
Currently this just throws an exception. However So if the library decides to not support Currently the stance in the Jaccard is it is a programming error to pass around Discussion To consolidate 0, 1 and 4 perhaps we can change the library to state that:
Then support n-gram size and both the This involves some decisions. My vote is:
As long as the Javadoc explains the chosen functionality then the I would be happy with any of the options. But my preference would be for equality being similar:
This is in-line with other similarity measures in the library which state equal sequences are perfectly similar. Implementing a more generic approach can be done using methods in #109. So perhaps hold this PR until that is resolved. Alex |
Removed computation of metrics from the OverlapResult.
Good summary @aherbert , thanks a lot for that. My brain is still digesting it, and I won't risk commenting here on Sunday ~evening. Will sleep on it, and probably work on a few other things next week (might be a bit busier at $work too). Let's see if we can find one or two possible alternatives, and if we can't find a solution that looks really good, we can take it to the dev mailing list to see what others think too. @ameyjadiye thanks for your patience so far. The work on #109 is really nice indeed! I was going to merge it now but there are two pending checkstyle issues. Once fixed, I believe they will be merged. Then if you could update this branch with that code, that would make three of us who have looked over the API (so some good code review). Any feedback welcome, as we have time to fix anything before 1.7. Cheers |
Merge apache/commons-text to ameyjadiye/commons-text
Feature text 155
Will have time throughout this week to review with more calm (Sunday evening here now). But I suspect it would be easier to wait for the other pull request to be merged first, then simply rebase your code and update it? |
src/main/java/org/apache/commons/text/similarity/SorensenDiceSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDiceSimilarity.java
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDiceSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDiceSimilarity.java
Show resolved
Hide resolved
/** | ||
* Converter class for creating Bigrams for SorensenDice similarity. | ||
*/ | ||
private static class SorensenDiceConverter implements Function<CharSequence, Collection<Integer>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a singleton. No need to instantiate it for each use.
private static final SorensenDiceConverter INSTANCE = new SorensenDiceConverter()
But this functionality could also be delivered by using a method reference. It is also not related to SorensenDice and should be in a package level utils class something like:
CharSequenceConverterUtils ...
List<Character> toCharacterList(CharSequence cs);
List<Integer> toBigramList(CharSequence cs);
List<String> toNGramList(CharSequence cs, int n);
}); | ||
} | ||
|
||
public static double round(double value, int precision) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to be public. Why are you rounding anyway? A test like this can be very precise.
/** | ||
* For shifting bigrams to fit in single integer. | ||
*/ | ||
private static final int SHIFT_NUMBER = 16; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not part of this class but the internal converter class.
Overall this class now works nicely. But I think that my initial reservations have not been met. There is no reason that this should use bigrams, or allow duplicate bigrams. If this goes through then you end up with a library that can compute the Jaccard with unique single characters (by using a
It really does not make sense to compute one with a different method from the other. I would actually state we have:
Where the number is the number of characters (n-gram), U is for unique (non-duplicates), and D is for duplicates. We actually have possibilities for the following in the library:
Or generically: [JS]N[UD] Putting in an S2D metric to complement a J1U metric makes a confusing and inconsistent library. BTW. I believe bigrams is better and duplicates is better. So this is the best algorithm. It just totally contradicts the current Jaccard implementation. A cleaner design is to have both metrics support single or bigrams (or even n-grams) and also allow or prevent duplicates via constructor arguments. The current Jaccard can be modified to do so by allowing the default constructor to default to single, non-duplicate mode and overloaded constructors added to it. This idea may be best achieved by pushing shared functionality into a
This can be made more generic if necessary. But this then leads to the library that @kinow identified (Simetrics) which can support tokenisation of input, case handling, splitting to n-grams, handling duplicates, extended unicode character handling, etc. This level of generalisation is out of scope for the current similarity package. But n-grams and duplicates should be in scope because this is the difference between this algorithm for S2D and the current library J1U algorithm. @ameyjadiye Would you like to add support for n-grams and unique/duplicate matching to this or attempt to add a package scope class that can be shared with the JaccardSimilarity? |
@aherbert sounds a good idea, let me create another PR for that, will comeback on this. |
@aherbert apologies I was busy with much other stuff, will have to go through this. this PR is real mess. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some minor errors in the javadoc
* similarity.apply("", "") = 1.0 | ||
* similarity.apply("foo", "foo") = 1.0 | ||
* similarity.apply("foo", "foo ") = 0.8 | ||
* similarity.apply("foo", "foo ") = 0.66 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some erroneous examples in the docstring. similarity.apply("foo", "foo ")
is repeated twice, and the second result (0.66) is incorrect.
Likewise, similarity.apply("foo", " foo ")
should be 0.66666...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @ameyjadiye
Good spotting @mvarela
* similarity.apply("", "") = 1.0 | ||
* similarity.apply("foo", "foo") = 1.0 | ||
* similarity.apply("foo", "foo ") = 0.8 | ||
* similarity.apply("foo", "foo ") = 0.66 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @ameyjadiye
Good spotting @mvarela
@ameyjadiye whenever you have time to update this PR, I will have another look and try to review/merge it 👍 thanks! |
Brilliant @ameyjadiye , but if you are busy having fun with commons-graph, we can leave this one for later. Whenever you have time 👍 |
Hi All. Where are we here? |
@garydgregory - Long pending issue, its development lost in discussions. I'll go through it once again and try to finish. thanks for reminding for one. :) |
This pull request adds the Sorensen-Dice Similarity Algorithm to Apache Commons Text.
The Sorensen–Dice coefficient is a statistic used for comparing the similarity of two samples. It was independently developed by the botanists Thorvald Sorensen and Lee Raymond Dice who published in 1948 and 1945 respectively.