Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve author name resolution #9003

Merged

Conversation

scottbarnes
Copy link
Collaborator

@scottbarnes scottbarnes commented Apr 2, 2024

Partially closes #7349
Related: internetarchive/infogami#217

Feature.

Technical

This PR does three things:

  • Make name search case insensitive (though really this is an Infogami change);
  • Drop a variety of honorifics completely (so they are not imported, nor used for comparison to existing authors); and
  • Change the author name resolution strategy.

This PR would change the author name resolution order as follows:

  1. match on name, with priority going to name + birth date + death date;
  2. match on alternate_names, with priority going to alternate_names + birth date + death date; and
  3. match on surname with a matching birth date + death date required.

It does this in part by relying on an Infobase change to change the op
associated with ~ to use ILIKE rather than LIKE. See internetarchive/infogami#217

It also updates mock_site() to more accurately approximate the LIKE
(and nowILIKE) query done by PostgreSQL.

I should note this does not drop honorifics at the end of an author name, which was part of the problem in #7349. My rationale was this case would get picked up by alternate_names (once an alternate name has been added..), and only operating at the start would of a name would be less likely to make an unintended removal. However, it would be easy to remove honorifics at the end if that is also desired.

This PR also does not address punctuation removal, which was called for in #7349. That's partially because I simply wasn't sure if I should, as this could result in overhead or required changes to the database structure, depending on the manner in which we do this.

However, one strategy that could use indexing and require no changes to the DB would be to use PostgreSQLs _single character wildcard withLIKE/ILIKE. That would allow a search such as William H_ Brewer to match an existing William H. Brewer, though it would not work the other way around.

I don't think it would add that much complexity to enable this using a new operator (e.g. ~_ or something, instead of ~), but I wanted to check before pursing this at all.

Testing

NOTE: testing, or at least the case-insensitive aspect of it, relies on the one line change found in internetarchive/infogami#217. Without that merged, tests (against PostgreSQL) relating to case-insensitivity will fail. The unit tests, however, will pass, as they rely on an updated mock_infobase, as frustrating as that may be.

The tests should lay out a fairly comprehensive listing of the cases covered. However, those do rely on mock_site(), which, try as I might, may not fully replicate ILIKE.

The last commit adds a /api/author endpoint solely for testing purposes. It is NOT intended for merging.

The idea there is just to make it easier to test the author-specific part of build_query without having to actually import things -- one can simply see what the result would be.

Sample use:

❯ curl -X POST http://localhost:8080/api/author \
    -H "Content-Type: application/json" \
    -b ~/cookies.txt \
    -d '{
    "authors": [{"name": "Mr. ePictEtus"}]
}'
{'type': {'key': '/type/edition'}, 'authors': [<Author: '/authors/OL5A'>]}

Stakeholders

@tfmorris
@seabelis
@cdrini

@scottbarnes scottbarnes changed the title 7349/fix/better author resolution Improve author name resolution Apr 2, 2024
@cdrini cdrini self-assigned this Apr 8, 2024
Copy link
Collaborator

@cdrini cdrini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! This is going to make such a big difference in avoiding duplicate authors being created!

We should add a grafana metric every time an author is created, cause I want to see a drop in a graph so I can tell people about it. We talked and can't find a simple hook for this infogami (we didn't look that hard), so I'll create a temporary script that walks back through ImportBot's edits. That should suffice for this.

openlibrary/catalog/add_book/load_book.py Outdated Show resolved Hide resolved
openlibrary/catalog/add_book/load_book.py Outdated Show resolved Hide resolved
openlibrary/catalog/add_book/load_book.py Outdated Show resolved Hide resolved
openlibrary/catalog/add_book/load_book.py Outdated Show resolved Hide resolved
openlibrary/catalog/add_book/tests/test_load_book.py Outdated Show resolved Hide resolved
openlibrary/catalog/add_book/tests/test_load_book.py Outdated Show resolved Hide resolved
openlibrary/catalog/add_book/tests/test_load_book.py Outdated Show resolved Hide resolved
openlibrary/mocks/mock_infobase.py Outdated Show resolved Hide resolved
openlibrary/plugins/importapi/code.py Outdated Show resolved Hide resolved
openlibrary/catalog/add_book/load_book.py Outdated Show resolved Hide resolved
@cdrini cdrini added the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label Apr 30, 2024
@scottbarnes scottbarnes force-pushed the 7349/fix/better-author-resolution branch 2 times, most recently from d1ceaf3 to 3a56c1f Compare May 8, 2024 04:17
Related: internetarchive#7349

Honorifics (e.g. the French `M.`) can throw off the importer, and
librarians will remove them manually. Instead, this commit removes them
at the point of import, at least for some titles in English, French,
Spanish, and German.

There is also a way to create exceptions, such as "Dr. Seuss".
This commit changes the author name resolution order to be:
  1. match on name + birth date + death date;
  2. match on alternate_names + birth date + death date;
  3. match on surname + birth date + death date.

It does this by relying on an Infobase change to change the `op`
associated with `~` to use `ILIKE` rather than `LIKE`.
internetarchive/infogami#217

It also updates `mock_site()` to more accurately approximate the `LIKE`
and (and now`ILIKE`) query done by PostgreSQL.
@scottbarnes scottbarnes force-pushed the 7349/fix/better-author-resolution branch from 3a56c1f to 9c87553 Compare May 15, 2024 04:23
@github-actions github-actions bot removed the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label May 15, 2024
@scottbarnes scottbarnes force-pushed the 7349/fix/better-author-resolution branch from 2135012 to 467b795 Compare May 15, 2024 04:26
@scottbarnes scottbarnes added the Needs: Review This issue/PR needs to be reviewed in order to be closed or merged (see comments). [managed] label May 15, 2024
@scottbarnes scottbarnes force-pushed the 7349/fix/better-author-resolution branch from 467b795 to 30c6dc5 Compare May 15, 2024 04:28
Copy link
Collaborator

@cdrini cdrini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Once the infogami version bump is included we can merge 👍 I didn't test this code; the unit tests seem pretty thorough.

@cdrini cdrini removed the Needs: Review This issue/PR needs to be reviewed in order to be closed or merged (see comments). [managed] label May 22, 2024
@cdrini cdrini merged commit cd46a0a into internetarchive:master May 22, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better Author Resolution: Don't create duplicate author records for M. <surname> with exact date match
2 participants