
Make data dumps for /type/page #8401

Closed
RayBB opened this issue Oct 8, 2023 · 20 comments · Fixed by #9127

@RayBB
Collaborator

RayBB commented Oct 8, 2023

Describe the problem that you'd like solved

I would like to get a data dump of just the /type/page entities.
I can see that there are some in the all_types_dump but it's too big for me to download.

I was just running:
curl -s -L https://openlibrary.org/data/ol_dump_latest.txt.gz | gunzip -c | grep '/type/page'
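One caveat with the plain grep: it matches "/type/page" anywhere in the line, including inside the JSON column. A minimal sketch that filters on the first tab-separated column only (dump rows are type, key, revision, last_modified, JSON; the script name here is hypothetical):

    # filter_pages.py (hypothetical name): keep only rows whose type column is /type/page.
    # Usage: curl -s -L https://openlibrary.org/data/ol_dump_latest.txt.gz | gunzip -c | python filter_pages.py > pages.txt
    import sys

    for line in sys.stdin:
        if line.split("\t", 1)[0] == "/type/page":
            sys.stdout.write(line)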

Why?

  1. I've been working on cleaning up our docs, and I'd like to be able to more easily search the docs on openlibrary.org using tools like grep.
  2. As I understand it, the sitemap.xml is generated from these dumps, and I'm wondering whether we should eventually include our pages in the sitemaps for easier searching.
  3. I'm wondering if we can put them into Solr and have a nice search for our docs, but before doing that I'd like to see what docs we actually have.

Proposal & Constraints

I poked around briefly at the data dump code and I think it could be as simple as adding it here:

types = (
    "/type/edition",
    "/type/author",
    "/type/work",
    "/type/redirect",
    "/type/list",
)

We'd also want to add one of the special "latest" links (analogous to https://openlibrary.org/data/ol_dump_authors_latest.txt.gz for authors) that redirects to the pages dump.
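A minimal sketch of that change, just extending the tuple that split_dump already uses (the trailing comment is mine; everything else matches the current code):

    types = (
        "/type/edition",
        "/type/author",
        "/type/work",
        "/type/redirect",
        "/type/list",
        "/type/page",  # new: write /type/page documents to their own dump file
    )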

Additional context

https://github.com/internetarchive/openlibrary/wiki/Sitemap-Generation
https://github.com/internetarchive/openlibrary/wiki/Generating-Data-Dumps
https://openlibrary.org/developers/dumps

Stakeholders

@RayBB RayBB added Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] Module: Data dumps Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Affects: Documentation Issues related to developer or ops or data documentation. [managed] Needs: Lead labels Oct 8, 2023
@RayBB RayBB changed the title Can we start making data dumps for page? Make data dumps for /type/page Oct 8, 2023
@tfmorris
Contributor

tfmorris commented Oct 9, 2023

A better approach might be to create an "everything else" dump which covers everything except the bulky types. There's no real reason for redirects, lists, etc to all have their own separate files when they are so small. This would also automatically cover any newly added types until they got bulky enough to warrant splitting out.

@mekarpeles mekarpeles added Priority: 3 Issues that we can consider at our leisure. [managed] Lead: @jimchamp Issues overseen by Jim (Front-end Lead, BookNotes) [managed] and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Needs: Lead labels Oct 16, 2023
xonx4l added a commit to xonx4l/openlibrary that referenced this issue Dec 11, 2023
@Billa05
Contributor

Billa05 commented Mar 18, 2024

A better approach might be to create an "everything else" dump which covers everything except the bulky types. There's no real reason for redirects, lists, etc to all have their own separate files when they are so small. This would also automatically cover any newly added types until they got bulky enough to warrant splitting out.

Is there a way to determine the size of the large dumps, or should we just list the bulky types explicitly in a set? If this is clarified, I can attempt to add a function that handles this logic, or incorporate it into the split function.

@tfmorris
Contributor

I don't think anything super fancy or dynamic is needed. I would look into changing the logic of split_dump() to write files for editions, works, authors, and then everything else (perhaps called "misc" or something similar). Things like user pages and admin pages are already filtered when the initial dump is written, so you don't need to worry about them. Lists and redirects total about 75MB currently, so the new "everything else" dump should be much less than 100MB and will automatically include new types as they're introduced.

(As an aside, the lists and redirects dumps are not currently mentioned on the wiki page. You can only find them by going to the dump directory.)
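A rough sketch of split_dump with that grouping (xopen, log, and the format string follow names used in the existing dump.py; this illustrates the idea rather than the actual implementation):

    # Sketch only: xopen and log are existing helpers in openlibrary's dump.py.
    import sys

    def split_dump(dump_file=None, format="ol_dump_%s.txt.gz"):
        # Dedicated files only for the bulky types; everything else goes to one misc file.
        files = {
            "/type/edition": xopen(format % "editions", "wt"),
            "/type/work": xopen(format % "works", "wt"),
            "/type/author": xopen(format % "authors", "wt"),
        }
        misc = xopen(format % "misc", "wt")

        stdin = xopen(dump_file, "rt") if dump_file else sys.stdin
        for i, line in enumerate(stdin):
            if i % 1_000_000 == 0:
                log(f"split_dump {i:,}")
            type, rest = line.split("\t", 1)
            # Unknown or newly added types fall through to misc automatically.
            files.get(type, misc).write(line)

        for f in (*files.values(), misc):
            f.close()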

@Realmbird
Contributor

Realmbird commented Apr 9, 2024

Can this be assigned to me if Meredith doesn't want to work on it?

@RayBB
Collaborator Author

RayBB commented Apr 9, 2024

@merwhite11 said she'd like to work on this, so I'll assign her!

@merwhite11
Contributor

I’d love to work on this. Please assign it to me! :)

@jimchamp
Collaborator

jimchamp commented Apr 9, 2024

@merwhite11, limit the scope of this to /type/page data. An "everything else" data dump will need to be audited before being published, as this will include patron preferences and perhaps other personal information.

@tfmorris
Contributor

@jimchamp as I mentioned above:

Things like user pages and admin pages are already filtered when the initial dump is written, so you don't need to worry about them.

All the secondary dumps are subsets of the full dump generated by split_dump. If you believe there's an exposure here, it already exists in the full dump.

# skip user pages
if key.startswith("/people/") and not re.match(
    r"^/people/[^/]+/lists/OL\d+L$", key
):
    continue
# skip admin pages
if key.startswith("/admin/"):
    continue
# skip obsolete pages. Obsolete pages include volumes, scan_records and users
# marked as spam.
if key.startswith(("/b/", "/scan", "/old/")) or not key.startswith("/"):
    continue
if filter and not filter(d):
    continue

@merwhite11
Contributor

merwhite11 commented Apr 11, 2024

@jimchamp @tfmorris
Paraphrasing to make sure I'm understanding correctly:

We can create a separate catch-all in the split_dump function that grabs all misc records that don't fall into the pre-existing types. Due to the filters on lines 49 and 55, we don't need to worry about user data getting into the dump file. Ideally this will result in a misc file of < 100MB that @RayBB can use as an inventory of pages to eventually be included in sitemap.xml.

Data dump -> sitemap.xml (with pages) -> sitemap in solr

e.g.:

    types = (
        "/type/edition",
        "/type/author",
        "/type/work",
        "/type/redirect",
        "/type/list",
        #add a catch-all type
        "/type/misc"
    )

    # Then add an else block to write to the misc file
    stdin = xopen(dump_file, "rt") if dump_file else sys.stdin
    for i, line in enumerate(stdin):
        if i % 1_000_000 == 0:
            log(f"split_dump {i:,}")
        type, rest = line.split("\t", 1)
        if type in files:
            files[type].write(line)
        # else: files["misc"].write(line)

@merwhite11
Contributor

@RayBB
In terms of generating the special links (like https://openlibrary.org/data/ol_dump_authors_latest.txt.gz) that redirect to the pages dump:

My understanding is that these URLs are created in the make_index function in dump.py. I would need to add some logic there to account for the misc file.

@merwhite11
Contributor

merwhite11 commented Apr 17, 2024

Linking to the slack conversation with my progress and questions here:
https://internetarchive.slack.com/archives/C0ETZV72L/p1712873368854129

@jimchamp
Collaborator

jimchamp commented Apr 18, 2024

@merwhite11, you may want to try something like this:

    types = (
        "/type/edition",
        "/type/author",
        "/type/work",
        "/type/redirect",
        "/type/list",  # Remove /type/misc
    )

    # Create file for all other types:
    files['misc'] = xopen(format % 'misc', 'wt')

    # In the else block, write to the misc file
    stdin = xopen(dump_file, "rt") if dump_file else sys.stdin
    for i, line in enumerate(stdin):
        if i % 1_000_000 == 0:
            log(f"split_dump {i:,}")
        type, rest = line.split("\t", 1)
        if type in files:
            files[type].write(line)
        else:
            files['misc'].write(line)

This should write all other types to a single file. Disregard what I said about limiting this to only /type/page; I didn't see @tfmorris's comments about this earlier. It wouldn't be trivial to do that anyway, as I'm noticing the type for some pages is type: {"key": "/type/page"} (for example, this collections page).

You may want to create a page locally to test for this. Here's how to make a local /collections page:

  1. While logged in, navigate to localhost:8080/collections
  2. Click the "Create it?" link
  3. Fill out the "Title" and "Document Body" fields, then submit the form
  4. Check the type at localhost:8080/collections.json (a small snippet for this step follows below)
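If it's easier to script step 4 than to eyeball the JSON, a quick sketch (assumes a local dev instance on port 8080 and the requests package):

    import requests

    # Fetch the newly created page's JSON and print its type key.
    doc = requests.get("http://localhost:8080/collections.json").json()
    print(doc["type"]["key"])  # expected to be "/type/page" for a page created this way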

@merwhite11
Contributor

merwhite11 commented Apr 18, 2024

@jimchamp
Thank you for these suggestions! I'm trying this approach and am still a bit unclear. I'm also not able to find the test /type/page page that I created in the dump file.

When I zcat the file being written to files['misc'], it is mostly /type/language, with a few /type/type, /type/object, /type/usergroup, and /type/page entries.

Are we assuming that all pages already have /type/page associated with them? Or is it possible that a page could be labelled as a /type/edition or /type/work, for example?

Basically, I'm confused as to why there are so few /type/page entries.
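For reference, one way to tally which types land in the misc file (a sketch assuming the standard tab-separated dump format; the file name is a placeholder):

    import gzip
    from collections import Counter

    counts = Counter()
    with gzip.open("ol_dump_misc.txt.gz", "rt") as dump:
        for line in dump:
            counts[line.split("\t", 1)[0]] += 1  # first tab-separated column is the type

    for type_key, n in counts.most_common():
        print(n, type_key)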

@jimchamp
Collaborator

jimchamp commented Apr 19, 2024

Local instances don't have much data pre-loaded. When I checked the other day, there were only three /type/page pages there.

After implementing the above changes, I'm seeing the /collections pages that I created in the misc dump. Without seeing your code, I'm not sure why you can't find the page that you created. Maybe using grep on the file would help you find it? My misc dump has over 500 entries.

@merwhite11
Contributor

merwhite11 commented Apr 19, 2024

@jimchamp my bad! I am finding my test page when I grep. A few more questions:

Can we assume that if we were running this in the prod env, there would be a lot more pages, i.e. a /type/page entry for every page on the site?

In terms of generating the path to the split dump (https://openlibrary.org/data/ol_dump_authors_latest.txt.gz), is this something I can test for? It doesn't seem to enter the make_index function in dump.py in test mode.

Would creating the path for /type/page look something like this?

    # add /type/page here
    if type in ("/type/edition", "/type/work", "/type/page"):
        title = data.get("title", "untitled")
        path = key + "/" + urlsafe(title)
    elif type in ("/type/author", "/type/list"):
        title = data.get("name", "unnamed")
        path = key + "/" + urlsafe(title)
    else:
        title = data.get("title", key)
        path = key

thanks again for the help!

@jimchamp
Collaborator

There will be more /type/page pages in production, but there will not be a /type/page for each page. For example, work pages will have /type/work, edition pages will have /type/edition, author pages will have /type/author, etc. (see this, this, and this, respectively).

Some of our pages are /type/i18n_page (like the root /collections page), while others are actually /type/page (like this collection). I'd expect your changes to capture these types and all of the other ones that we don't already have dumps for.

I don't really understand the code snippet that you provided, and I don't understand what "creating the path for /type/page" means in the given context. Could you push the code that you have now to your repo? There's no need to create a PR now; I'd just like to test your code to better understand what is happening.

@merwhite11
Contributor

Ok, that makes sense. /type/page applies to pages that don't already fall into another type. In this 'misc pages dump', we want to get all the /type/page entries AND all other misc types.

The code snippet is part of the make_index function in dump.py: here

Here's my last push to my fork. I haven't made many changes. Thanks for taking a look!
master...merwhite11:openlibrary:8401/Fix/Make-Change-to-oldump

@jimchamp
Collaborator

Thanks. I can't find evidence of make_index being used today, so you can revert those changes.

Make sure to remove unrelated changes before opening a PR.

@RayBB RayBB added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] Needs: Review This issue/PR needs to be reviewed in order to be closed or merged (see comments). [managed] and removed Priority: 3 Issues that we can consider at our leisure. [managed] labels May 10, 2024
@RayBB
Collaborator Author

RayBB commented May 10, 2024

@jimchamp bumping the priority of this because it would be very helpful for me, and the PR has been open for a few weeks.

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label May 11, 2024
@jimchamp jimchamp removed the Needs: Response Issues which require feedback from lead label May 11, 2024
@mekarpeles mekarpeles added Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] and removed Lead: @jimchamp Issues overseen by Jim (Front-end Lead, BookNotes) [managed] labels May 13, 2024
@mekarpeles
Member

Drini is currently in Albania and it may take some time before we can follow up, @RayBB.
