
Open Library Publication Date Mismatches #7

Open
wants to merge 12 commits into
base: main

@mgaoann mgaoann commented Apr 2, 2024

First set of mismatches - Open Library works (P648) for publication date (P577)

@LofiTea LofiTea left a comment

I noticed a small complication, though it can probably be ignored. The "first publication-date" in the OL JSON file does not always match the publication date shown on the actual website. Ex: "Q100202821","OL8704115W","1960-01-01T00:00:00Z" has February 1981 in the JSON, but the publish date on https://openlibrary.org/works/OL8704115W/Cuba_para_principiantes?mode=all is 1970 on the website. This might cause some confusion.

Anyway, there are a couple of other things to consider as well. It might need some tweaking, but this is definitely heading in the right direction, and there's a lot of good work here!

# In[31]:


mismatch_dataframe.to_csv('openlibrary_publication_date_mismatches.csv', index=False)

Probably just a minor thing, but a few items might not be mismatches. Ex: Lines 3, 23, 37, 324, etc. The years in both sets agree, but the Open Library value only has the year while the Wikidata value has the month and year. Consider comparing only the years when OL only has the year value in the JSON file.

Collaborator

I'm seeing this on line 23 at least, and it's a good point. If Open Library only has the year, then we shouldn't be triggering a mismatch against a value that has the same year plus a month :)
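
One way to implement the year-only comparison suggested above is to compare only the date components that the Open Library value actually specifies. The following is a minimal sketch, not the code from this PR: the function names `parse_partial_date` and `dates_conflict` are hypothetical, the Wikidata value is assumed to be formatted like the `1960-01-01T00:00:00Z` strings in the CSV, and the list of Open Library date formats is an assumption that would need extending for real data.

```python
from datetime import datetime

# Assumed Open Library date formats, most specific first.
# The integer is how many of (year, month, day) the format specifies.
OL_FORMATS = (
    ("%B %d, %Y", 3),  # October 10, 2000
    ("%b %d, %Y", 3),  # Oct 10, 2000
    ("%Y-%m-%d", 3),   # 2000-10-10
    ("%B %Y", 2),      # February 1981
    ("%b %Y", 2),      # Feb 1981
    ("%Y-%m", 2),      # 1981-02
    ("%Y", 1),         # 1981
)


def parse_partial_date(text):
    """Return (year, month, day) with None for unspecified parts, or None if unparseable."""
    text = text.strip()
    for fmt, specified in OL_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        fields = (parsed.year, parsed.month, parsed.day)
        return tuple(fields[i] if i < specified else None for i in range(3))
    return None


def dates_conflict(wikidata_iso, ol_text):
    """True if the dates disagree on a component Open Library specifies.

    Returns None when the Open Library string cannot be parsed.
    """
    wd = datetime.strptime(wikidata_iso.lstrip("+"), "%Y-%m-%dT%H:%M:%SZ")
    ol = parse_partial_date(ol_text)
    if ol is None:
        return None
    wd_fields = (wd.year, wd.month, wd.day)
    return any(o is not None and o != w for o, w in zip(ol, wd_fields))
```

With this approach a year-only Open Library value such as `"1981"` no longer conflicts with a Wikidata value in February 1981. The sketch ignores the Wikidata precision field, so it cannot tell when Wikidata itself is only year-precise; that check would need the precision from the statement data.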

@lectrician1
Contributor

Just verified something. Maggie is comparing the first publication date of an Open Library work to "publication date" on a Wikidata work. "publication date", when used on a Wikidata work, means the date the work was first published according to https://www.wikidata.org/wiki/Wikidata:WikiProject_Books, so we are good with this upload.

@andrewtavis-wmde
Collaborator

Some quick things on this:

  • There are a lot of imports here that are not used
    • I'd install the Ruff extension for VS Code, if that's your IDE, and then take a look at the Python file and notebook to see them all
    • Similarly, names like JSONDecodeError, RateLimitException and time are used but never imported
  • Is there a way at all to get the statement GUIDs?
    • I'm worried that we won't be able to upload the Wikidata values without them as stated in the user guide
  • Removing those entries where we have the year-month vs. same year mismatch as @LofiTea discussed makes sense to me
  • Another thing to note on this is that it seems like we're getting publication date mismatches from different publications?
    • Take Q923416 as an example - Master and Commander (first choice from the bottom)
    • Publication date is 1969
    • External value is October 10, 2000
    • Maybe it's from a different edition, but I can't actually find this exact date on the Open Library page
    • How are we deriving the date from the external source that we're comparing against Wikidata? Oldest one on Open Library?
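
On the statement GUID question above, one possible route is the Wikidata `wbgetclaims` API action, which returns each statement with an `id` field holding its GUID. The sketch below is an assumption about how this could be wired in, not something taken from this PR: `fetch_claims` and `extract_statement_guids` are hypothetical helper names, the User-Agent string is a placeholder, and the response shape is assumed to follow the usual `claims -> property -> list of statements` layout.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def fetch_claims(qid, property_id="P577"):
    """Fetch the statements for one property of one item via wbgetclaims."""
    params = urllib.parse.urlencode(
        {
            "action": "wbgetclaims",
            "entity": qid,
            "property": property_id,
            "format": "json",
        }
    )
    request = urllib.request.Request(
        f"{WIKIDATA_API}?{params}",
        # Placeholder User-Agent; replace with a descriptive one for real use.
        headers={"User-Agent": "publication-date-mismatch-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_statement_guids(claims_response, property_id="P577"):
    """Return (guid, time, precision) for every statement that carries a value."""
    results = []
    for statement in claims_response.get("claims", {}).get(property_id, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "somevalue" / "novalue" statements
        value = snak["datavalue"]["value"]
        results.append((statement["id"], value["time"], value["precision"]))
    return results
```

A call such as `extract_statement_guids(fetch_claims("Q923416"))` would then give one tuple per publication date statement, and the precision value could also feed the year-versus-month comparison.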

@mgaoann
Author

mgaoann commented Apr 7, 2024

I'm deriving the publication dates used for the mismatches from Open Library's API. I'm not sure why the external date value is not on the Open Library page, but it is what the API returns when I make a request. In this case, it looks like there are also editions of this work published before October 10, 2000. This edition page has the same publication date as the item on Wikidata and seems to be the first edition. If I make a request to the API using this edition's ID instead of the general work's ID, it returns the 1969 publication date.

This seems to be the case for many books. For Q5142283, the Open Library page shows 1983 as the publication date of the first edition, but the Open Library API returns 1985. For this work, though, the Open Library page does list a few later editions published in 1985.

It seems that the attribute 'first_publish_date' may not be reliable in all cases.

@andrewtavis-wmde
Collaborator

Thanks for the further explanation on this, @mgaoann! Are we able to merge some of these into a common ID for a specific work? If that could work, then I think there should be a lot of potential here :)

@mgaoann
Author

mgaoann commented Apr 11, 2024

@andrewtavis-wmde Could you clarify a bit about what you mean by merging into a common ID for a specific work?

@andrewtavis-wmde
Collaborator

Hey @mgaoann 👋 Sorry for the late reply. Let's talk about this in the meeting later, but generally what I mean is: can we find a common ID for all editions of an individual book and then use that to derive the earliest publication date? For example, can we link the October 10, 2000 Master and Commander to some sort of entity that we can then follow to the original publication date?

@mgaoann
Author

mgaoann commented Apr 18, 2024

Hey @andrewtavis-wmde, sorry I wasn't able to be at the meeting. When I originally wrote the code, I made API requests only for the works, but I did some digging, and I believe there's a way to get all of the editions associated with a work. Here's the result. It seems that October 10, 2000 is the publication date of this audio cassette edition.

Either way, since I can see the publication dates of all the editions associated with the work, I may be able to determine the original publication date by comparing them and taking the earliest. Is this what you were referring to?
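
The edition comparison described above could look roughly like the following. This is a sketch under assumptions rather than the PR's code: it assumes the editions listing is served at `/works/<work ID>/editions.json` with the records under an `entries` key, that each edition may carry a free-form `publish_date` string, and that pulling the first four-digit number out of that string is an acceptable way to get a year. The names `fetch_editions` and `earliest_publish_year` are hypothetical.

```python
import json
import re
import urllib.request

# Matches the first standalone four-digit number in a free-form date string.
YEAR_PATTERN = re.compile(r"\b(\d{4})\b")


def fetch_editions(work_id, limit=1000):
    """Fetch edition records for a work such as 'OL8704115W'.

    Assumes a single page is enough; works with more editions than
    `limit` would need additional requests.
    """
    url = f"https://openlibrary.org/works/{work_id}/editions.json?limit={limit}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response).get("entries", [])


def earliest_publish_year(editions):
    """Return the smallest year found in the editions' publish_date values, or None."""
    years = []
    for edition in editions:
        match = YEAR_PATTERN.search(edition.get("publish_date") or "")
        if match:
            years.append(int(match.group(1)))
    return min(years) if years else None
```

Comparing at year level sidesteps the mixed date formats across editions, at the cost of losing month and day information. Editions without a `publish_date` are skipped, so a work whose first edition is undated would still yield a later year.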

@andrewtavis-wmde
Collaborator

No stress, @mgaoann! Hope all's well :)

And the process you suggested makes total sense and is what I was thinking about :) Let's find the original publication date and compare that value 😊

Looking forward to the results, and let me know if there's anything I can do to support!
