Handling of duplicate entries via uniq #70
I have considered porting all list/query functions to return a Bibliography so that we can use them like regular Ruby Enumerables. Currently, most functions (like queries) return arrays; the main difference between an array and the Bibliography class is that the Bibliography keeps track of the entries hash for quick access to individual Entries and in order to do things like resolving cross-references. What this mostly boils down to is how to handle object references. In a Bibliography object we have a list of Entry instances which, in turn, contain Value objects; these help us mask the different value types (typically, strings or names). When we have methods like #uniq, the question is whether the returned entries should be copies or references to the originals. Have you given this some thought as well? What would you expect?
I haven't thought too much about whether or not copies make more sense. Overall I like a design for the Bibliography object that lets me use many of the standard Ruby Array functions.
I think we could add a #uniq! method along those lines. If you like we can also add a non-destructive #uniq.
This would be great. I slightly prefer following the standard semantics of bang methods (returning nil when nothing was changed).
Ah yes, that's a long-standing issue. : ) I need to check, but there are bang methods in the standard library that do not do that, and I tend not to do it either; I've been told off for it on ruby-lang, though… Personally I find conditions that use bang methods extremely dangerous.
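For context, this is why conditions built on bang methods are risky: standard-library bang methods like Array#uniq! return nil when nothing was changed, so a guard that looks harmless silently skips its branch. A minimal illustration:

```ruby
a = [1, 2, 3]
# Nothing to remove, so uniq! returns nil rather than the array.
a.uniq!  # => nil

# This condition is therefore never entered for already-unique arrays,
# even though the array itself is perfectly valid:
if a.uniq!
  puts "duplicates were removed"
end

b = [1, 1, 2]
# Here duplicates exist, so uniq! returns the modified array.
b.uniq!  # => [1, 2]
```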
I'll try to implement this in the next couple of days; would be great if you could review before we release anything though. |
I'm of course happy to review. |
also adds experimental #uniq method see #70
I've added a quick implementation of #uniq! and a first experimental #uniq method. The signature is quite complex (select duplicates by any number of fields, or by a block) and I haven't tested many of the more complex cases. The idea is to give you a lot of flexibility, because dropping duplicates is a pretty sensitive task, especially when working with large datasets where you could lose important data by accident. So I'll explain how I want it to work (as I said, I need more tests to make sure it actually does this). By default, uniq! will get rid of duplicates with the same year-title combination. You can pass in any number of field names and the method will compute a digest based on those fields. Sometimes, however, you will need even more control: e.g., let's say you want to discard duplicates by year, title, and all authors' last names and initials. For such cases you can also pass in a block, and that block will be passed the computed digest for each entry along with the full entry. The block should then return the final digest, which means you can either add to the digest or disregard it entirely and compute your own digest for each entry. So the example I posted above would be (untested pseudo-code):
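As a self-contained sketch of the digest mechanism described above (the uniq_by helper and the plain-hash entries are assumptions standing in for the real Bibliography and BibTeX::Entry API):

```ruby
require 'digest/md5'

# Sketch: compute a digest per entry from the given fields; an optional
# block can extend or replace that digest; entries sharing a digest are
# treated as duplicates and only the first one is kept.
def uniq_by(entries, *fields)
  seen = {}
  entries.each do |entry|
    digest = fields.map { |f| entry[f].to_s }.join('|')
    digest = yield(digest, entry) if block_given?
    key = Digest::MD5.hexdigest(digest)
    seen[key] ||= entry
  end
  seen.values
end

entries = [
  { year: 2012, title: 'On Duplicates', author: 'Smith, J.' },
  { year: 2012, title: 'On Duplicates', author: 'Smith, John' },
  { year: 2013, title: 'Something Else', author: 'Doe, A.' }
]

# Default-style call: same year and title means duplicate.
uniq_by(entries, :year, :title).size  # => 2

# Block form: extend the digest with the author field as well,
# so the two "Smith" variants are no longer considered duplicates.
uniq_by(entries, :year, :title) { |d, e| d + e[:author] }.size  # => 3
```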
About the experimental #uniq: this should currently return a new Bibliography with all items duplicated. If we wanted to make shallow copies instead, we'd have to rewrite a number of methods, because currently each element can only be associated with a single bibliography (this makes things like resolving strings or cross-references easier). |
I have problems installing the latest version from GitHub. |
Ah, this is because we currently do not have the generated files in the repository. As suggested elsewhere, this is common practice; I'll add the generated parsers in a moment. |
Can you try it one more time? |
Works now, will do some testing and report back. |
It would be nice if we could use the Ruby Array methods uniq (and uniq!) with bibliographies. Just like the corresponding Array method, uniq should use a default (:title and :year), and should accept other attributes for comparison via a block.

I have implemented hash, == and eql? for the Entry class (see example at https://github.com/mfenner/orcid-feed/blob/master/lib/work.rb), and this makes it very easy to remove duplicates. I'm struggling with implementing uniq in the Bibliography class and am therefore using an Array of BibTeX::Entry as an intermediate step.