Handling of duplicate entries via uniq #70

mfenner · 2013-07-29T18:20:15Z

It would be nice if we could use the ruby Array methods uniq (and uniq!) with bibliographies. Just like the corresponding Array method uniq should use a default (:title and :year), and should accept other attributes for comparison via a block.

I have implemented hash, == and eql? for the Entry class (see example at https://github.com/mfenner/orcid-feed/blob/master/lib/work.rb), and this makes it very easy to remove dupicates. I'm struggling with implementing uniq in the Bibliography class and am therefore using an Array of BibTex::Entry has intermediate step.

The text was updated successfully, but these errors were encountered:

inukshuk · 2013-07-29T18:57:25Z

I have considered porting all list/query functions to return a Bibliography so that we can use them like regular Ruby Enumerables. Currently, most functions (like queries) return arrays; the main difference between an array and the Bibliography class is that the Bibliography keeps track of the entries hash for quick access to individual Entries and in order to do stuff like resolving cross references etc.

What this mostly boils down to is how to handle object references. In a Bibliography object we have a list of Entry instances which, in turn, contain Value objects — these help us mask the different value types (typically, strings or names). When we have methods like #uniq (or #query) return a new Bibliography we need to make up our mind how to deal with all these instances. Up to now I have always assumed that we ought to make copies of everything — otherwise modifying an object in one Bibliography would change the same object in another Bibliography. On the other hand, perhaps this is exactly what one would expect, seeing as regular Ruby Arrays for example would not make deep copies. For instance, something I use a lot is the #names function wich returns all Name values. I can then modify values in this name objects and the changes will be visible in my Bibliography because the entries there still hold the references.

Have you given this some thought as well? Would you expect for #uniq to return a new Bibliography with carbon copies or with references to the same Entry objects?

mfenner · 2013-07-30T15:03:34Z

I haven't thought too much about whether or not copies make more sense. For #uniq I would probably expect a new Bibliography with carbon copies. I currently don't use the #names function, but I can see your point about changing the object directly.

Overall I like a design for the Bibliography object that lets me use many of the standard Ruby Array functions.

inukshuk · 2013-07-30T15:19:07Z

I think we could add a #uniq! method right away; ideally it would use the same interface as select_duplicates_by like you explained earlier.

If you like we can also add #uniq for now, but the #uniq! method will definitely be more efficient.

mfenner · 2013-07-30T15:35:30Z

This would be great. I slightly prefer #uniq over #uniq!, because with the latter I have to check that the method didn't return nil because there were no duplicates.

inukshuk · 2013-07-30T15:46:34Z

Ah yes that's a long standing issue. : ) I need to check but there are bang methods in the standard lib that do not do that and I tend to not do it either, but I've been told off for it on ruby-lang so… Personally I find conditions that use bang methods extremely dangerous.

inukshuk · 2013-07-31T16:21:30Z

I'll try to implement this in the next couple of days; would be great if you could review before we release anything though.

mfenner · 2013-07-31T17:02:49Z

I'm of course happy to review.

also adds experimental #uniq method see #70

inukshuk · 2013-08-05T19:58:06Z

I've added a quick implementation of #uniq! and a first experimental #uniq method. The signature is quite complex (based on select duplicates by) and I haven't tested many of the more complex cases. The idea is to give you a lot of flexibility, because dropping duplicates is a pretty sensitive task, especially when working with large datasets where you could lose important data by accident.

So I'll explain how I want it to work (as I said, need more tests to make sure it actually does this). By default uniq! will get rid of duplicates with the same year-title combination. You can pass in any number of field names and the method will compute a digest based on those fields. Sometimes however you will need even more control: e.g., let's say you want to discard duplicated by year, title, all authors last name and initials — for such cases you can also pass in a block and that block will be passed the computed digest for each entry and the full entry. The block should then return the final digest — that means you can either add to the digest or you can disregard it entirely and compute your own digest for each entry.

So the example I posted above would be (untested pseudo-code):

bib.uniq!(:year, :title) do |digest, entry|
  digest << entry.author.map { |a| [a.family, a.initials].join }.join
end

inukshuk · 2013-08-05T20:01:18Z

About the experimental #uniq — this should currently return a new Bibliography with all items duplicated. If we want to make shallow copies instead we'd have to rewrite a number of methods because currently each element can only be associated with a single bibliography (this makes stuff like resolving strings or cross references easier).

mfenner · 2013-08-07T10:20:30Z

I have problems installing the lastest version from Github:

gem 'bibtex-ruby', :git => 'git://github.com/inukshuk/bibtex-ruby.git'

bibtex-ruby at /home/vagrant/.bundler/ruby/1.9.1/bibtex-ruby-c1addae0af57 did not have a valid gemspec.
This prevents bundler from installing bins or native extensions, but that may not affect its functionality.
The validation message from Rubygems was:
  ["Gemfile.lock", "lib/bibtex/name_parser.rb", "lib/bibtex/parser.rb"] are not files

inukshuk · 2013-08-07T10:23:31Z

Ah, this is because we currently do not have the generated files in the repository. As suggested elsewhere this is common practice however; I'll add the generated parsers in a moment.

inukshuk · 2013-08-07T10:28:17Z

Can you try it one more time?

mfenner · 2013-08-07T10:56:23Z

Works now, will do some testing and report back.

inukshuk added a commit that referenced this issue Aug 5, 2013

add a #uniq! implementation

5a3c356

also adds experimental #uniq method see #70

inukshuk mentioned this issue Dec 29, 2013

Improve Sorting of Bibliographies #50

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handling of duplicate entries via uniq #70

Handling of duplicate entries via uniq #70

mfenner commented Jul 29, 2013

inukshuk commented Jul 29, 2013

mfenner commented Jul 30, 2013

inukshuk commented Jul 30, 2013

mfenner commented Jul 30, 2013

inukshuk commented Jul 30, 2013

inukshuk commented Jul 31, 2013

mfenner commented Jul 31, 2013

inukshuk commented Aug 5, 2013

inukshuk commented Aug 5, 2013

mfenner commented Aug 7, 2013

inukshuk commented Aug 7, 2013

inukshuk commented Aug 7, 2013

mfenner commented Aug 7, 2013

Handling of duplicate entries via uniq #70

Handling of duplicate entries via uniq #70

Comments

mfenner commented Jul 29, 2013

inukshuk commented Jul 29, 2013

mfenner commented Jul 30, 2013

inukshuk commented Jul 30, 2013

mfenner commented Jul 30, 2013

inukshuk commented Jul 30, 2013

inukshuk commented Jul 31, 2013

mfenner commented Jul 31, 2013

inukshuk commented Aug 5, 2013

inukshuk commented Aug 5, 2013

mfenner commented Aug 7, 2013

inukshuk commented Aug 7, 2013

inukshuk commented Aug 7, 2013

mfenner commented Aug 7, 2013