
Make data dumps for /type/page #8401

Closed
RayBB opened this issue Oct 8, 2023 · 20 comments · Fixed by #9127

@RayBB
Collaborator

RayBB commented Oct 8, 2023

Describe the problem that you'd like solved

I would like to get a data dump of just the /type/page entities.
I can see that there are some in the all_types_dump but it's too big for me to download.

I was just running:
curl -s -L https://openlibrary.org/data/ol_dump_latest.txt.gz | gunzip -c | grep '/type/page'
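One caveat with the plain grep: it matches "/type/page" anywhere in the line, including inside the JSON column. A minimal sketch that filters on the first tab-separated column only (dump rows are type, key, revision, last_modified, JSON; the script name here is hypothetical):

    # filter_pages.py (hypothetical name): keep only rows whose type column is /type/page.
    # Usage: curl -s -L https://openlibrary.org/data/ol_dump_latest.txt.gz | gunzip -c | python filter_pages.py > pages.txt
    import sys

    for line in sys.stdin:
        if line.split("\t", 1)[0] == "/type/page":
            sys.stdout.write(line)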

Why?

  1. I've been working on cleaning up our docs, and I'd like to be able to more easily search the docs on openlibrary.org using tools like grep.
  2. As I understand it, the sitemap.xml is generated from these dumps, and I'm wondering whether we should eventually include our pages in the sitemaps for easier searching.
  3. I'm wondering if we can put them into Solr and have a nice search for our docs, but before doing that I'd like to see what docs we actually have.

Proposal & Constraints

I poked around briefly at the data dump code and I think it could be as simple as adding it here:

types = (
    "/type/edition",
    "/type/author",
    "/type/work",
    "/type/redirect",
    "/type/list",
)

We'd also want to add one of the special "latest" links (analogous to https://openlibrary.org/data/ol_dump_authors_latest.txt.gz for authors) that redirects to the pages dump.
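A minimal sketch of that change, just extending the tuple that split_dump already uses (the trailing comment is mine; everything else matches the current code):

    types = (
        "/type/edition",
        "/type/author",
        "/type/work",
        "/type/redirect",
        "/type/list",
        "/type/page",  # new: write /type/page documents to their own dump file
    )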

Additional context

https://github.com/internetarchive/openlibrary/wiki/Sitemap-Generation
https://github.com/internetarchive/openlibrary/wiki/Generating-Data-Dumps
https://openlibrary.org/developers/dumps

Stakeholders

@RayBB RayBB added Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] Module: Data dumps Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Affects: Documentation Issues related to developer or ops or data documentation. [managed] Needs: Lead labels Oct 8, 2023
@RayBB RayBB changed the title Can we start making data dumps for page? Make data dumps for /type/page Oct 8, 2023
@tfmorris
Contributor

tfmorris commented Oct 9, 2023

A better approach might be to create an "everything else" dump which covers everything except the bulky types. There's no real reason for redirects, lists, etc to all have their own separate files when they are so small. This would also automatically cover any newly added types until they got bulky enough to warrant splitting out.

@mekarpeles mekarpeles added Priority: 3 Issues that we can consider at our leisure. [managed] Lead: @jimchamp Issues overseen by Jim (Front-end Lead, BookNotes) [managed] and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Needs: Lead labels Oct 16, 2023
xonx4l added a commit to xonx4l/openlibrary that referenced this issue Dec 11, 2023
@Billa05
Contributor

Billa05 commented Mar 18, 2024

A better approach might be to create an "everything else" dump which covers everything except the bulky types. There's no real reason for redirects, lists, etc to all have their own separate files when they are so small. This would also automatically cover any newly added types until they got bulky enough to warrant splitting out.

Is there a way to determine the size of the large dumps, or should we just list the bulky types explicitly in a set? If this is clarified, I can attempt to add a function that handles this logic, or incorporate it into the split function.

@tfmorris
Contributor

I don't think anything super fancy or dynamic is needed. I would look into changing the logic of split_dump() to write files for editions, works, authors, and then everything else (perhaps called "misc" or something similar). Things like user pages and admin pages are already filtered when the initial dump is written, so you don't need to worry about them. Lists and redirects total about 75MB currently, so the new "everything else" dump should be much less than 100MB and will automatically include new types as they're introduced.

(As an aside, the lists and redirects dumps are not currently mentioned on the wiki page. You can only find them by going to the dump directory.)
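A rough sketch of split_dump with that grouping (xopen, log, and the format string follow names used in the existing dump.py; this illustrates the idea rather than the actual implementation):

    # Sketch only: xopen and log are existing helpers in openlibrary's dump.py.
    import sys

    def split_dump(dump_file=None, format="ol_dump_%s.txt.gz"):
        # Dedicated files only for the bulky types; everything else goes to one misc file.
        files = {
            "/type/edition": xopen(format % "editions", "wt"),
            "/type/work": xopen(format % "works", "wt"),
            "/type/author": xopen(format % "authors", "wt"),
        }
        misc = xopen(format % "misc", "wt")

        stdin = xopen(dump_file, "rt") if dump_file else sys.stdin
        for i, line in enumerate(stdin):
            if i % 1_000_000 == 0:
                log(f"split_dump {i:,}")
            type, rest = line.split("\t", 1)
            # Unknown or newly added types fall through to misc automatically.
            files.get(type, misc).write(line)

        for f in (*files.values(), misc):
            f.close()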

@Realmbird
Contributor

Realmbird commented Apr 9, 2024

Can this be assigned to me if Meredith doesn't want to work on it?

@RayBB
Collaborator Author

RayBB commented Apr 9, 2024

@merwhite11 said she'd like to work on this, so I'll assign her!

@merwhite11
Contributor

I’d love to work on this. Please assign it to me! :)

@jimchamp
Collaborator

jimchamp commented Apr 9, 2024

@merwhite11, limit the scope of this to /type/page data. An "everything else" data dump will need to be audited before being published, as this will include patron preferences and perhaps other personal information.

@tfmorris
Contributor

@jimchamp as I mentioned above:

Things like user pages and admin pages are already filtered when the initial dump is written, so you don't need to worry about them.

All the secondary dumps are subsets of the full dump generated by split_dump. If you believe there's an exposure here, it already exists in the full dump.

# skip user pages
if key.startswith("/people/") and not re.match(
    r"^/people/[^/]+/lists/OL\d+L$", key
):
    continue
# skip admin pages
if key.startswith("/admin/"):
    continue
# skip obsolete pages. Obsolete pages include volumes, scan_records and users
# marked as spam.
if key.startswith(("/b/", "/scan", "/old/")) or not key.startswith("/"):
    continue
if filter and not filter(d):
    continue

@merwhite11
Contributor

merwhite11 commented Apr 11, 2024

@jimchamp @tfmorris
Paraphrasing to make sure I'm understanding correctly:

We can create a separate catch-all in the split_dump function that grabs all misc records that don't fall into the pre-existing types. Due to the filters on lines 49 and 55, we don't need to worry about user data getting into the dump file. Ideally this will result in a misc file of < 100MB that @RayBB can use as an inventory of pages to eventually be included in sitemap.xml.

Data dump -> sitemap.xml (with pages) -> sitemap in solr

e.g.:

    types = (
        "/type/edition",
        "/type/author",
        "/type/work",
        "/type/redirect",
        "/type/list",
        #add a catch-all type
        "/type/misc"
    )

    # Then add an else block to write to the misc file
    stdin = xopen(dump_file, "rt") if dump_file else sys.stdin
    for i, line in enumerate(stdin):
        if i % 1_000_000 == 0:
            log(f"split_dump {i:,}")
        type, rest = line.split("\t", 1)
        if type in files:
            files[type].write(line)
        # else: files["misc"].write(line)

@merwhite11
Contributor

@RayBB
In terms of generating the special links (like https://openlibrary.org/data/ol_dump_authors_latest.txt.gz) that redirect to the pages dump:

My understanding is that these URLs are created in the make_index function in dump.py. I would need to add some logic there to account for the misc file.

@merwhite11
Contributor

merwhite11 commented Apr 17, 2024

Linking to the slack conversation with my progress and questions here:
https://internetarchive.slack.com/archives/C0ETZV72L/p1712873368854129

@jimchamp
Collaborator

jimchamp commented Apr 18, 2024

@merwhite11, you may want to try something like this:

    types = (
        "/type/edition",
        "/type/author",
        "/type/work",
        "/type/redirect",
        "/type/list",  # Remove /type/misc
    )

    # Create file for all other types:
    files['misc'] = xopen(format % 'misc', 'wt')

    # In the else block, write to the misc file
    stdin = xopen(dump_file, "rt") if dump_file else sys.stdin
    for i, line in enumerate(stdin):
        if i % 1_000_000 == 0:
            log(f"split_dump {i:,}")
        type, rest = line.split("\t", 1)
        if type in files:
            files[type].write(line)
        else:
            files['misc'].write(line)

This should write all other types to a single file. Disregard what I said about limiting this to only /type/page; I didn't see @tfmorris's comments about this earlier. It wouldn't be trivial to do that anyway, as I'm noticing the type for some pages is type: {"key": "/type/page"} (for example, this collections page).

You may want to create a page locally to test for this. Here's how to make a local /collections page:

  1. While logged in, navigate to localhost:8080/collections
  2. Click the "Create it?" link
  3. Fill out the "Title" and "Document Body" fields, then submit the form
  4. Check the type at localhost:8080/collections.json (a small snippet for this step follows below)
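If it's easier to script step 4 than to eyeball the JSON, a quick sketch (assumes a local dev instance on port 8080 and the requests package):

    import requests

    # Fetch the newly created page's JSON and print its type key.
    doc = requests.get("http://localhost:8080/collections.json").json()
    print(doc["type"]["key"])  # expected to be "/type/page" for a page created this way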

@merwhite11
Contributor

merwhite11 commented Apr 18, 2024

@jimchamp
Thank you for these suggestions! I'm trying this approach and am still a bit unclear. I'm also not able to find the test /type/page page that I created in the dump file.

When I zcat the file being written to files['misc'], it is mostly /type/language, with a few /type/type, /type/object, /type/usergroup, and /type/page entries.

Are we assuming that all pages already have /type/page associated with them? Or is it possible that a page could be labelled as a /type/edition or /type/work, for example?

Basically, I'm confused as to why there are so few /type/page entries.
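For reference, one way to tally which types land in the misc file (a sketch assuming the standard tab-separated dump format; the file name is a placeholder):

    import gzip
    from collections import Counter

    counts = Counter()
    with gzip.open("ol_dump_misc.txt.gz", "rt") as dump:
        for line in dump:
            counts[line.split("\t", 1)[0]] += 1  # first tab-separated column is the type

    for type_key, n in counts.most_common():
        print(n, type_key)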

@jimchamp
Collaborator

jimchamp commented Apr 19, 2024

Local instances don't have much data pre-loaded. When I checked the other day, there were only three /type/page pages there.

After implementing the above changes, I'm seeing the /collections pages that I created in the misc dump. Without seeing your code, I'm not sure why you can't find the page that you created. Maybe using grep on the file would help you find it? My misc dump has over 500 entries.

@merwhite11
Contributor

merwhite11 commented Apr 19, 2024

@jimchamp my bad! I am finding my test page when I grep. A few more questions:

Can we assume that if we were running this in the prod env, there would be a lot more pages, i.e. a /type/page entry for every page on the site?

In terms of generating the path to the split dump (https://openlibrary.org/data/ol_dump_authors_latest.txt.gz), is this something I can test for? It doesn't seem to enter the make_index function in dump.py in test mode.

Would creating the path for /type/page look something like this?

    # add /type/page here
    if type in ("/type/edition", "/type/work", "/type/page"):
        title = data.get("title", "untitled")
        path = key + "/" + urlsafe(title)
    elif type in ("/type/author", "/type/list"):
        title = data.get("name", "unnamed")
        path = key + "/" + urlsafe(title)
    else:
        title = data.get("title", key)
        path = key

thanks again for the help!

@jimchamp
Collaborator

There will be more /type/page pages in production, but there will not be a /type/page for each page. For example, work pages will have /type/work, edition pages will have /type/edition, author pages will have /type/author, etc. (see this, this, and this, respectively).

Some of our pages are /type/i18n_page (like the root /collections page), while others are actually /type/page (like this collection). I'd expect your changes to capture these types and all of the other ones that we don't already have dumps for.

I don't really understand the code snippet that you provided, and I don't understand what "creating the path for /type/page" means in the given context. Could you push the code that you have now to your repo? There's no need to create a PR now; I'd just like to test your code to better understand what is happening.

@merwhite11
Contributor

Ok, that makes sense. /type/page applies to pages that don't already fall into another type. In this 'misc pages dump', we want to get all the /type/page entries AND all other misc types.

The code snippet is part of the make_index function in dump.py: here

Here's my last push to my fork. I haven't made many changes. Thanks for taking a look!
master...merwhite11:openlibrary:8401/Fix/Make-Change-to-oldump

@jimchamp
Collaborator

Thanks. I can't find evidence of make_index being used today, so you can revert those changes.

Make sure to remove unrelated changes before opening a PR.

@RayBB RayBB added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] Needs: Review This issue/PR needs to be reviewed in order to be closed or merged (see comments). [managed] and removed Priority: 3 Issues that we can consider at our leisure. [managed] labels May 10, 2024
@RayBB
Collaborator Author

RayBB commented May 10, 2024

@jimchamp bumping the priority of this because it would be very helpful for me, and the PR has been open for a few weeks.

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label May 11, 2024
@jimchamp jimchamp removed the Needs: Response Issues which require feedback from lead label May 11, 2024
@mekarpeles mekarpeles added Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] and removed Lead: @jimchamp Issues overseen by Jim (Front-end Lead, BookNotes) [managed] labels May 13, 2024
@mekarpeles
Member

Drini is currently in Albania and it may take some time before we can follow up, @RayBB.
