Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added logic for page dump and commented out test line #9127

Merged

Conversation

merwhite11
Copy link
Contributor

Closes #8401

This is a refactor that allows all dump file types that are NOT

        "/type/edition",
        "/type/author",
        "/type/work",
        "/type/redirect",
        "/type/list"

to be sorted into a misc category. This category catches all /type/page dump files, in addition to all other types that are not in above list.

This misc files should help provide a comprehensive inventory of pages in the dump that is used to generate the sitemap.

Technical

I only tested these changes with a subset of full data (commented in line 38 of /scripts/oldump.sh) .
When line 38 was commented in, I also had to change the -z in line 133 of /scripts/oldump.sh to -n to avoid an error in /data/dump.py.

Testing

Screenshot

Ran docker compose run --rm home make test
Screenshot 2024-04-19 at 4 13 46 PM

Stakeholders

@jimchamp @RayBB

@RayBB
Copy link
Collaborator

RayBB commented Apr 20, 2024

@merwhite11 Very excited for this and pleasantly surprised how simple the solution is :) Hope Jim can review it soon!

You may want to update the doc string here:

"""Split dump into authors, editions and works."""

@merwhite11 merwhite11 force-pushed the 8401/Fix/Make-Change-to-oldump branch from dda2d57 to 91fb9a7 Compare April 22, 2024 18:51
@mekarpeles mekarpeles self-assigned this Apr 22, 2024
@mekarpeles mekarpeles added Priority: 2 Important, as time permits. [managed] Needs: Staff / Internal Reviewed a PR but don't have merge powers? Use this. Needs: Staff Decision Issues that are blocked on a staff member's decision labels Apr 22, 2024
@mekarpeles mekarpeles assigned cdrini and unassigned mekarpeles Apr 29, 2024
@mekarpeles
Copy link
Member

mekarpeles commented Apr 29, 2024

@cdrini, blocking, please see #8401 (comment)

@mekarpeles mekarpeles removed Needs: Staff Decision Issues that are blocked on a staff member's decision Needs: Staff / Internal Reviewed a PR but don't have merge powers? Use this. labels Apr 29, 2024
@jimchamp
Copy link
Collaborator

@cdrini, this is blocking @RayBB. Could you review this at the earliest?

@mekarpeles mekarpeles added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] and removed Priority: 2 Important, as time permits. [managed] labels May 20, 2024
@mekarpeles
Copy link
Member

@cdrini is there any way we can do a test run that doesn't upload to any items?

We want to avoid preferences, anything store related (which already shouldn't be there), PII (personally identifiable info), etc

@cdrini
Copy link
Collaborator

cdrini commented May 21, 2024

Taking a look at this now; checking latest dumps to see what types of records we would get. This is currently running. (Written with some help from ChatGPT!)

python3 <<'EOF' | gzip > ol_dump_other.txt.gz
import gzip
import requests
import sys

url = 'https://openlibrary.org/data/ol_dump_latest.txt.gz'

exclude_types = {
    "/type/edition",
    "/type/author",
    "/type/work",
    "/type/redirect",
    "/type/list"
}

with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with gzip.GzipFile(fileobj=r.raw) as f:
        for line in f:
            line = line.decode('utf-8')
            if line.split('\t', 1)[0] not in exclude_types:
                print(line, end='')

EOF

@tfmorris
Copy link
Contributor

As I mentioned two months ago #8401 (comment) all this does is split the complete dump, which is already filtered. If there's a question about the contents, the complete dump creation is where it should be reviewed/fixed.

@cdrini
Copy link
Collaborator

cdrini commented May 21, 2024

Yes, this is mostly about getting transparency into what's there. Here's the breakdown by type:

$ zcat ol_dump_other.txt.gz | cut -f1 | sort | uniq -c
      1 /type/about
      7 /type/backreference
      3 /type/collection
2583653 /type/delete
      5 /type/doc
     11 /type/home
    966 /type/i18n
     34 /type/i18n_page
    467 /type/language
    324 /type/library
     14 /type/local_id
    126 /type/macro
      5 /type/object
    439 /type/page
     12 /type/permission
      1 /type/place
     47 /type/rawtext
      1 /type/scan_location
      2 /type/scan_record
      3 /type/series
  91400 /type/subject
     14 /type/tag
    300 /type/template
     48 /type/type
      1 /type/uri
      2 /type/user
     19 /type/usergroup
    107 /type/volume

This seems fine. Lots of ancient legacy stuff here :P This other file is a bit problematic for long-term, since we might decide to split these out into separate dumps going forward, which would be a breaking change to this catch-all other dump file. I'm wondering if we might already want to separate out /type/delete, since that's a substantial enough slice at this point. Thoughts? Let me know if you want the file; gzipped it's 78M, but if you're just curious what these things are you can hit up the query.json endpoint eg http://openlibrary.org/query.json?type=/type/user&*=

@tfmorris
Copy link
Contributor

78 MB for "other" doesn't seem excessive when compared to the other files sizes. It's certainly a lot better than the 13.6 GB currently required to get any of the data that isn't broken out separately. One might even argue that editions, works, authors, and "other" would be an adequate breakdown. Reading log, redirects, lists, ratings, and everything else would still be less than 200 MB and less than half the size of the authors file.

File Date Size
ol_dump_2024-04-30.txt.gz 02-May-2024 02:35 13.6G
ol_dump_editions_2024-04-30.txt.gz 02-May-2024 02:38 9.5G
ol_dump_works_2024-04-30.txt.gz 02-May-2024 02:39 3.0G
ol_dump_authors_2024-04-30.txt.gz 02-May-2024 02:37 579.4M
ol_dump_reading-log_2024-04-30.txt.gz 02-May-2024 02:36 73.5M
ol_dump_redirects_2024-04-30.txt.gz 02-May-2024 02:36 49.5M
ol_dump_lists_2024-04-30.txt.gz 02-May-2024 02:37 26.5M
ol_dump_ratings_2024-04-30.txt.gz 02-May-2024 02:36 4.6M

@RayBB
Copy link
Collaborator

RayBB commented May 22, 2024

I have no strong opinion on splitting. Just will be happy once it is easier to get access to these smaller sections of the dump.

Seems ok to put stuff in this other dump even if it could be moved out to another dump years down the line. Maybe it is slightly better for consumers to have just one to get with all this stuff rather than many small ones?

@cdrini
Copy link
Collaborator

cdrini commented May 22, 2024

Oh agreed; I mean we could down the line decide to split some of these out; e.g. tags is currently tiny, but if we at some point make a tag for every subject in our db, we'll likely split it out into its own dump, and then this file will have a breaking change. But I guess that's ok, we'll just have to make a public notification. @RayBB would you mind updating the docs to also link to the redirects dump? That'll make the "other" nature of this one clearer.

Copy link
Collaborator

@cdrini cdrini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm! Didn't testing running this since you gave it a test already, just tested what the slice would contain using the above snippet.

openlibrary/data/dump.py Outdated Show resolved Hide resolved
openlibrary/data/dump.py Outdated Show resolved Hide resolved
openlibrary/data/dump.py Outdated Show resolved Hide resolved
@cdrini
Copy link
Collaborator

cdrini commented May 22, 2024

Errr actually I think we should split out the deletes now ; since we already have redirects in a separate, and the "other" dump will be like 95% deletes otherwise, which I think makes it less useful.

@RayBB
Copy link
Collaborator

RayBB commented May 22, 2024

@cdrini I don't know how to make a
https://openlibrary.org/data/ol_dump_redirects_latest.txt.gz
like we have for
https://openlibrary.org/data/ol_dump_ratings_latest.txt.gz

However, I updated the docs with placeholders :)

Also, I've gotten confused about this before but how can I find old ratings dumps? I don't see them in this url https://archive.org/details/ol_exports?tab=collection&query=ratings I tried searching IA for the file name but no luck.

@cdrini
Copy link
Collaborator

cdrini commented May 22, 2024

Ah they're inside the dump, eg https://archive.org/download/ol_dump_2024-03-31 . Is that what you're looking for?

@cdrini
Copy link
Collaborator

cdrini commented May 22, 2024

@RayBB I updated the endpoint to include the short link for redirects and deletes and other 👍

@cdrini
Copy link
Collaborator

cdrini commented May 22, 2024

Ok this looks good to me! I've sent it to testing to test the new endpoints. @merwhite11 would you mind giving it another test to make sure my last changes didn't break anything? :P Then should be good to merge!

@cdrini cdrini added the On testing.openlibrary.org This PR has been deployed to testing.openlibrary.org for testing label May 22, 2024
@cdrini
Copy link
Collaborator

cdrini commented May 22, 2024

Confirmed new endpoints work on testing 👍

@RayBB
Copy link
Collaborator

RayBB commented May 22, 2024

Docs are updated with the forthcoming dump links
https://openlibrary.org/developers/dumps

@cdrini cdrini added the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label May 23, 2024
@merwhite11 merwhite11 force-pushed the 8401/Fix/Make-Change-to-oldump branch from 421d7e3 to e28449e Compare May 29, 2024 19:20
@github-actions github-actions bot removed the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label May 29, 2024
@merwhite11 merwhite11 force-pushed the 8401/Fix/Make-Change-to-oldump branch from e28449e to 4198add Compare May 29, 2024 19:21
@cdrini
Copy link
Collaborator

cdrini commented May 29, 2024

@merwhite11 tested and it correctly generated all the files 👍 Lgtm!
image

@cdrini cdrini merged commit 1178652 into internetarchive:master May 29, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
On testing.openlibrary.org This PR has been deployed to testing.openlibrary.org for testing Priority: 1 Do this week, receiving emails, time sensitive, . [managed]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make data dumps for /type/page
6 participants