Added logic for page dump and commented out test line #9127

merwhite11 · 2024-04-19T23:14:50Z

This is a refactor that allows all dump file types that are NOT

        "/type/edition",
        "/type/author",
        "/type/work",
        "/type/redirect",
        "/type/list"

to be sorted into a misc category. This category catches all /type/page dump files, in addition to all other types that are not in above list.

This misc files should help provide a comprehensive inventory of pages in the dump that is used to generate the sitemap.

Technical

I only tested these changes with a subset of full data (commented in line 38 of /scripts/oldump.sh) .
When line 38 was commented in, I also had to change the -z in line 133 of /scripts/oldump.sh to -n to avoid an error in /data/dump.py.

Testing

Screenshot

Ran docker compose run --rm home make test

Stakeholders

@jimchamp @RayBB

RayBB · 2024-04-20T20:13:57Z

@merwhite11 Very excited for this and pleasantly surprised how simple the solution is :) Hope Jim can review it soon!

You may want to update the doc string here:

openlibrary/openlibrary/data/dump.py

Line 212 in dda2d57

"""Split dump into authors, editions and works."""

mekarpeles · 2024-04-29T19:43:53Z

@cdrini, blocking, please see #8401 (comment)

jimchamp · 2024-05-10T17:26:35Z

@cdrini, this is blocking @RayBB. Could you review this at the earliest?

mekarpeles · 2024-05-20T19:43:47Z

@cdrini is there any way we can do a test run that doesn't upload to any items?

We want to avoid preferences, anything store related (which already shouldn't be there), PII (personally identifiable info), etc

cdrini · 2024-05-21T15:14:21Z

Taking a look at this now; checking latest dumps to see what types of records we would get. This is currently running. (Written with some help from ChatGPT!)

python3 <<'EOF' | gzip > ol_dump_other.txt.gz
import gzip
import requests
import sys

url = 'https://openlibrary.org/data/ol_dump_latest.txt.gz'

exclude_types = {
    "/type/edition",
    "/type/author",
    "/type/work",
    "/type/redirect",
    "/type/list"
}

with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with gzip.GzipFile(fileobj=r.raw) as f:
        for line in f:
            line = line.decode('utf-8')
            if line.split('\t', 1)[0] not in exclude_types:
                print(line, end='')

EOF

tfmorris · 2024-05-21T17:48:42Z

As I mentioned two months ago #8401 (comment) all this does is split the complete dump, which is already filtered. If there's a question about the contents, the complete dump creation is where it should be reviewed/fixed.

cdrini · 2024-05-21T20:19:47Z

Yes, this is mostly about getting transparency into what's there. Here's the breakdown by type:

$ zcat ol_dump_other.txt.gz | cut -f1 | sort | uniq -c
      1 /type/about
      7 /type/backreference
      3 /type/collection
2583653 /type/delete
      5 /type/doc
     11 /type/home
    966 /type/i18n
     34 /type/i18n_page
    467 /type/language
    324 /type/library
     14 /type/local_id
    126 /type/macro
      5 /type/object
    439 /type/page
     12 /type/permission
      1 /type/place
     47 /type/rawtext
      1 /type/scan_location
      2 /type/scan_record
      3 /type/series
  91400 /type/subject
     14 /type/tag
    300 /type/template
     48 /type/type
      1 /type/uri
      2 /type/user
     19 /type/usergroup
    107 /type/volume

This seems fine. Lots of ancient legacy stuff here :P This other file is a bit problematic for long-term, since we might decide to split these out into separate dumps going forward, which would be a breaking change to this catch-all other dump file. I'm wondering if we might already want to separate out /type/delete, since that's a substantial enough slice at this point. Thoughts? Let me know if you want the file; gzipped it's 78M, but if you're just curious what these things are you can hit up the query.json endpoint eg http://openlibrary.org/query.json?type=/type/user&*=

tfmorris · 2024-05-21T21:11:01Z

78 MB for "other" doesn't seem excessive when compared to the other files sizes. It's certainly a lot better than the 13.6 GB currently required to get any of the data that isn't broken out separately. One might even argue that editions, works, authors, and "other" would be an adequate breakdown. Reading log, redirects, lists, ratings, and everything else would still be less than 200 MB and less than half the size of the authors file.

File	Date	Size
ol_dump_2024-04-30.txt.gz	02-May-2024 02:35	13.6G
ol_dump_editions_2024-04-30.txt.gz	02-May-2024 02:38	9.5G
ol_dump_works_2024-04-30.txt.gz	02-May-2024 02:39	3.0G
ol_dump_authors_2024-04-30.txt.gz	02-May-2024 02:37	579.4M
ol_dump_reading-log_2024-04-30.txt.gz	02-May-2024 02:36	73.5M
ol_dump_redirects_2024-04-30.txt.gz	02-May-2024 02:36	49.5M
ol_dump_lists_2024-04-30.txt.gz	02-May-2024 02:37	26.5M
ol_dump_ratings_2024-04-30.txt.gz	02-May-2024 02:36	4.6M

RayBB · 2024-05-22T09:52:28Z

I have no strong opinion on splitting. Just will be happy once it is easier to get access to these smaller sections of the dump.

Seems ok to put stuff in this other dump even if it could be moved out to another dump years down the line. Maybe it is slightly better for consumers to have just one to get with all this stuff rather than many small ones?

cdrini · 2024-05-22T10:22:27Z

Oh agreed; I mean we could down the line decide to split some of these out; e.g. tags is currently tiny, but if we at some point make a tag for every subject in our db, we'll likely split it out into its own dump, and then this file will have a breaking change. But I guess that's ok, we'll just have to make a public notification. @RayBB would you mind updating the docs to also link to the redirects dump? That'll make the "other" nature of this one clearer.

cdrini

Lgtm! Didn't testing running this since you gave it a test already, just tested what the slice would contain using the above snippet.

openlibrary/data/dump.py

cdrini · 2024-05-22T10:27:11Z

Errr actually I think we should split out the deletes now ; since we already have redirects in a separate, and the "other" dump will be like 95% deletes otherwise, which I think makes it less useful.

RayBB · 2024-05-22T10:35:50Z

@cdrini I don't know how to make a
https://openlibrary.org/data/ol_dump_redirects_latest.txt.gz
like we have for
https://openlibrary.org/data/ol_dump_ratings_latest.txt.gz

However, I updated the docs with placeholders :)

Also, I've gotten confused about this before but how can I find old ratings dumps? I don't see them in this url https://archive.org/details/ol_exports?tab=collection&query=ratings I tried searching IA for the file name but no luck.

cdrini · 2024-05-22T11:08:04Z

Ah they're inside the dump, eg https://archive.org/download/ol_dump_2024-03-31 . Is that what you're looking for?

cdrini · 2024-05-22T11:17:58Z

@RayBB I updated the endpoint to include the short link for redirects and deletes and other 👍

cdrini · 2024-05-22T11:23:00Z

Ok this looks good to me! I've sent it to testing to test the new endpoints. @merwhite11 would you mind giving it another test to make sure my last changes didn't break anything? :P Then should be good to merge!

cdrini · 2024-05-22T11:25:21Z

Confirmed new endpoints work on testing 👍

RayBB · 2024-05-22T11:34:22Z

Docs are updated with the forthcoming dump links
https://openlibrary.org/developers/dumps

for more information, see https://pre-commit.ci

cdrini · 2024-05-29T19:59:18Z

@merwhite11 tested and it correctly generated all the files 👍 Lgtm!

merwhite11 force-pushed the 8401/Fix/Make-Change-to-oldump branch from dda2d57 to 91fb9a7 Compare April 22, 2024 18:51

mekarpeles self-assigned this Apr 22, 2024

mekarpeles added Priority: 2 Important, as time permits. [managed] Needs: Staff / Internal Reviewed a PR but don't have merge powers? Use this. Needs: Staff Decision Issues that are blocked on a staff member's decision labels Apr 22, 2024

mekarpeles assigned cdrini and unassigned mekarpeles Apr 29, 2024

mekarpeles removed Needs: Staff Decision Issues that are blocked on a staff member's decision Needs: Staff / Internal Reviewed a PR but don't have merge powers? Use this. labels Apr 29, 2024

mekarpeles added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] and removed Priority: 2 Important, as time permits. [managed] labels May 20, 2024

cdrini approved these changes May 22, 2024

View reviewed changes

openlibrary/data/dump.py Outdated Show resolved Hide resolved

openlibrary/data/dump.py Outdated Show resolved Hide resolved

openlibrary/data/dump.py Outdated Show resolved Hide resolved

cdrini added the On testing.openlibrary.org This PR has been deployed to testing.openlibrary.org for testing label May 22, 2024

cdrini added the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label May 23, 2024

merwhite11 force-pushed the 8401/Fix/Make-Change-to-oldump branch from 421d7e3 to e28449e Compare May 29, 2024 19:20

github-actions bot removed the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label May 29, 2024

merwhite11 and others added 6 commits May 29, 2024 12:21

Added logic for page dump and commented out test line

3b27c8c

Updated doc string in split_dump for clarity

a9afcbc

[pre-commit.ci] auto fixes from pre-commit.com hooks

7cc3149

for more information, see https://pre-commit.ci

Rename misc dump to other dump

dbe7c29

Add separate dump for deletes

58bbc39

Add new/missing dump catagories to /data endpoint

4198add

merwhite11 force-pushed the 8401/Fix/Make-Change-to-oldump branch from e28449e to 4198add Compare May 29, 2024 19:21

cdrini merged commit 1178652 into internetarchive:master May 29, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added logic for page dump and commented out test line #9127

Added logic for page dump and commented out test line #9127

merwhite11 commented Apr 19, 2024

RayBB commented Apr 20, 2024 •

edited

mekarpeles commented Apr 29, 2024 •

edited

jimchamp commented May 10, 2024

mekarpeles commented May 20, 2024

cdrini commented May 21, 2024

tfmorris commented May 21, 2024

cdrini commented May 21, 2024

tfmorris commented May 21, 2024

RayBB commented May 22, 2024

cdrini commented May 22, 2024 •

edited

cdrini left a comment

cdrini commented May 22, 2024 •

edited

RayBB commented May 22, 2024

cdrini commented May 22, 2024

cdrini commented May 22, 2024

cdrini commented May 22, 2024 •

edited

cdrini commented May 22, 2024

RayBB commented May 22, 2024

cdrini commented May 29, 2024

Added logic for page dump and commented out test line #9127

Added logic for page dump and commented out test line #9127

Conversation

merwhite11 commented Apr 19, 2024

Technical

Testing

Screenshot

Stakeholders

RayBB commented Apr 20, 2024 • edited

mekarpeles commented Apr 29, 2024 • edited

jimchamp commented May 10, 2024

mekarpeles commented May 20, 2024

cdrini commented May 21, 2024

tfmorris commented May 21, 2024

cdrini commented May 21, 2024

tfmorris commented May 21, 2024

RayBB commented May 22, 2024

cdrini commented May 22, 2024 • edited

cdrini left a comment

Choose a reason for hiding this comment

cdrini commented May 22, 2024 • edited

RayBB commented May 22, 2024

cdrini commented May 22, 2024

cdrini commented May 22, 2024

cdrini commented May 22, 2024 • edited

cdrini commented May 22, 2024

RayBB commented May 22, 2024

cdrini commented May 29, 2024

RayBB commented Apr 20, 2024 •

edited

mekarpeles commented Apr 29, 2024 •

edited

cdrini commented May 22, 2024 •

edited

cdrini commented May 22, 2024 •

edited

cdrini commented May 22, 2024 •

edited