Improve author name resolution #9003

scottbarnes · 2024-04-02T04:26:26Z

Partially closes #7349
Related: internetarchive/infogami#217

Feature.

Technical

This PR does three things:

Make name search case insensitive (though really this is an Infogami change);
Drop a variety of honorifics completely (so they are not imported, nor used for comparison to existing authors); and
Change the author name resolution strategy.

This PR would change the author name resolution order as follows:

match on name, with priority going to name + birth date + death date;
match on alternate_names, with priority going to alternate_names + birth date + death date; and
match on surname with a matching birth date + death date required.

It does this in part by relying on an Infobase change to change the op
associated with ~ to use ILIKE rather than LIKE. See internetarchive/infogami#217

It also updates mock_site() to more accurately approximate the LIKE
(and nowILIKE) query done by PostgreSQL.

I should note this does not drop honorifics at the end of an author name, which was part of the problem in #7349. My rationale was this case would get picked up by alternate_names (once an alternate name has been added..), and only operating at the start would of a name would be less likely to make an unintended removal. However, it would be easy to remove honorifics at the end if that is also desired.

This PR also does not address punctuation removal, which was called for in #7349. That's partially because I simply wasn't sure if I should, as this could result in overhead or required changes to the database structure, depending on the manner in which we do this.

However, one strategy that could use indexing and require no changes to the DB would be to use PostgreSQLs _single character wildcard withLIKE/ILIKE. That would allow a search such as William H_ Brewer to match an existing William H. Brewer, though it would not work the other way around.

I don't think it would add that much complexity to enable this using a new operator (e.g. ~_ or something, instead of ~), but I wanted to check before pursing this at all.

Testing

NOTE: testing, or at least the case-insensitive aspect of it, relies on the one line change found in internetarchive/infogami#217. Without that merged, tests (against PostgreSQL) relating to case-insensitivity will fail. The unit tests, however, will pass, as they rely on an updated mock_infobase, as frustrating as that may be.

The tests should lay out a fairly comprehensive listing of the cases covered. However, those do rely on mock_site(), which, try as I might, may not fully replicate ILIKE.

The last commit adds a /api/author endpoint solely for testing purposes. It is NOT intended for merging.

The idea there is just to make it easier to test the author-specific part of build_query without having to actually import things -- one can simply see what the result would be.

Sample use:

❯ curl -X POST http://localhost:8080/api/author \
    -H "Content-Type: application/json" \
    -b ~/cookies.txt \
    -d '{
    "authors": [{"name": "Mr. ePictEtus"}]
}'
{'type': {'key': '/type/edition'}, 'authors': [<Author: '/authors/OL5A'>]}

Stakeholders

@tfmorris
@seabelis
@cdrini

cdrini

Looks great! This is going to make such a big difference in avoiding duplicate authors being created!

We should add a grafana metric every time an author is created, cause I want to see a drop in a graph so I can tell people about it. We talked and can't find a simple hook for this infogami (we didn't look that hard), so I'll create a temporary script that walks back through ImportBot's edits. That should suffice for this.

openlibrary/catalog/add_book/load_book.py

openlibrary/catalog/add_book/tests/test_load_book.py

openlibrary/mocks/mock_infobase.py

openlibrary/plugins/importapi/code.py

openlibrary/catalog/add_book/load_book.py

Related: internetarchive#7349 Honorifics (e.g. the French `M.`) can throw off the importer, and librarians will remove them manually. Instead, this commit removes them at the point of import, at least for some titles in English, French, Spanish, and German. There is also a way to create exceptions, such as "Dr. Seuss".

This commit changes the author name resolution order to be: 1. match on name + birth date + death date; 2. match on alternate_names + birth date + death date; 3. match on surname + birth date + death date. It does this by relying on an Infobase change to change the `op` associated with `~` to use `ILIKE` rather than `LIKE`. internetarchive/infogami#217 It also updates `mock_site()` to more accurately approximate the `LIKE` and (and now`ILIKE`) query done by PostgreSQL.

…eath date

cdrini

Looks good to me! Once the infogami version bump is included we can merge 👍 I didn't test this code; the unit tests seem pretty thorough.

scottbarnes changed the title ~~7349/fix/better author resolution~~ Improve author name resolution Apr 2, 2024

cdrini self-assigned this Apr 8, 2024

cdrini reviewed Apr 27, 2024

View reviewed changes

cdrini added the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label Apr 30, 2024

scottbarnes force-pushed the 7349/fix/better-author-resolution branch 2 times, most recently from d1ceaf3 to 3a56c1f Compare May 8, 2024 04:17

scottbarnes added 3 commits May 14, 2024 21:23

fixup! Feature: match authors on alternate_names/surname with birth/d…

f343c08

…eath date

scottbarnes force-pushed the 7349/fix/better-author-resolution branch from 3a56c1f to 9c87553 Compare May 15, 2024 04:23

github-actions bot removed the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label May 15, 2024

fixup! Feature: match authors on alternate_names/surname with birth/d…

30c6dc5

…eath date

scottbarnes force-pushed the 7349/fix/better-author-resolution branch from 2135012 to 467b795 Compare May 15, 2024 04:26

scottbarnes added the Needs: Review This issue/PR needs to be reviewed in order to be closed or merged (see comments). [managed] label May 15, 2024

scottbarnes force-pushed the 7349/fix/better-author-resolution branch from 467b795 to 30c6dc5 Compare May 15, 2024 04:28

cdrini approved these changes May 22, 2024

View reviewed changes

cdrini removed the Needs: Review This issue/PR needs to be reviewed in order to be closed or merged (see comments). [managed] label May 22, 2024

Update infogami submodule commit

ba5e9f7

cdrini merged commit cd46a0a into internetarchive:master May 22, 2024
4 checks passed

hornc mentioned this pull request May 22, 2024

add a failing author name test #9321

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve author name resolution #9003

Improve author name resolution #9003

scottbarnes commented Apr 2, 2024 •

edited

cdrini left a comment

cdrini left a comment •

edited

Improve author name resolution #9003

Improve author name resolution #9003

Conversation

scottbarnes commented Apr 2, 2024 • edited

Technical

Testing

Stakeholders

cdrini left a comment

Choose a reason for hiding this comment

cdrini left a comment • edited

Choose a reason for hiding this comment

scottbarnes commented Apr 2, 2024 •

edited

cdrini left a comment •

edited