Make data dumps for /type/page #8401
Comments
A better approach might be to create an "everything else" dump which covers everything except the bulky types. There's no real reason for redirects, lists, etc. to all have their own separate files when they are so small. This would also automatically cover any newly added types until they got bulky enough to warrant splitting out.
Is there a way to determine the size of the large dumps, or should we just enumerate them in a set? If this is clarified, I can attempt to add a function that handles this logic, or incorporate it into the split function.
I don't think anything super fancy or dynamic is needed. I would look into changing the logic of …

(As an aside, the lists and redirects dumps are not currently mentioned on the wiki page. You can only find them by going to the dump directory.)
Can this be assigned to me if Meredith doesn't want to?
@merwhite11 said she'd like to work on this, so I'll assign her!
I'd love to work on this. Please assign it to me! :)
@merwhite11, limit the scope of this to …
@jimchamp, as I mentioned above: all the secondary dumps are subsets of the full dump generated by openlibrary/openlibrary/data/dump.py (lines 48 to 64 at commit 0857026).
@jimchamp @tfmorris We can create a separate type in the split_dump function that grabs all misc records that don't fall into pre-existing types. Due to the filters on lines 49 and 55, we don't need to worry about user data getting into the dump file. Ideally this will result in a misc file under 100 MB that @RayBB can use as an inventory of pages to eventually be included in sitemap.xml. Data dump -> sitemap.xml (with pages) -> sitemap in Solr, e.g.:
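The "grab everything else" routing described above could be sketched like this. This is a minimal illustration, assuming the dump format is tab-separated with the type in the first column; `KNOWN_TYPES` and `bucket_for` are hypothetical names, not actual Open Library code:

```python
# Hypothetical sketch of routing "everything else" into one misc dump.
# KNOWN_TYPES and bucket_for are illustrative names, not Open Library code.
KNOWN_TYPES = {"/type/edition", "/type/author", "/type/work",
               "/type/redirect", "/type/delete"}

def bucket_for(line: str) -> str:
    """Pick the output bucket for one tab-separated dump line."""
    type_ = line.split("\t", 1)[0]
    return type_ if type_ in KNOWN_TYPES else "misc"

# /type/page has no dedicated dump, so it lands in "misc":
print(bucket_for("/type/page\t/collections\t1\t2023-01-01\t{}"))  # misc
```

A set lookup like this needs no size heuristics: any newly added type is caught by the misc bucket until someone decides it deserves its own file.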
@RayBB My understanding is that these URLs are created in the make_index function in dump.py. I would need to add some logic there to account for the misc file.
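For context, the naming pattern could be sketched with a hypothetical helper; `dump_paths` and the filename scheme are assumptions inferred from the ol_dump_authors_latest.txt.gz example mentioned in this issue, not the actual make_index code:

```python
# Hypothetical helper showing the dated-file / stable-"latest" naming pattern
# inferred from ol_dump_authors_latest.txt.gz; not the actual make_index code.
def dump_paths(name: str, date: str) -> tuple:
    dated = f"ol_dump_{name}_{date}.txt.gz"
    latest = f"https://openlibrary.org/data/ol_dump_{name}_latest.txt.gz"
    return dated, latest

print(dump_paths("pages", "2023-10-31"))
# ('ol_dump_pages_2023-10-31.txt.gz',
#  'https://openlibrary.org/data/ol_dump_pages_latest.txt.gz')
```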
Linking to the Slack conversation with my progress and questions here:
@merwhite11, you may want to try something like this:
This should write all other types to a single file. Disregard what I said about limiting this to only …

You may want to create a page locally to test with. Here's how to make a local /collections page:
@jimchamp When I zcat the files being written to files['misc'], the contents are mostly … Are we assuming that all pages already have the …? Basically, I'm confused as to why there are so few /type/page records.
Local instances don't have much data pre-loaded. When I checked the other day, there were only three … After implementing the above changes, I'm seeing the …
@jimchamp my bad! I am getting my test file when I grep. A few more questions: Can we assume that if we were running this in the prod env, there would be a lot more pages, i.e. a /type/page record for every page on the site? In terms of generating the path to the split dump (https://openlibrary.org/data/ol_dump_authors_latest.txt.gz), is this something I can test for? It doesn't seem to be entering the make_index function in dump.py in test mode. Would creating the path for /type/page look something like this?

Thanks again for the help!
There will be more … Some of our pages are … I don't really understand the code snippet that you provided, and I don't understand what "creating the path for /type/page" means.
Ok, that makes sense. /type/page applies to pages that don't already fall into another type. In this 'misc pages dump', we want to get all the … The code snippet is part of the make_index function in dump.py. Here's my last push to my fork. I haven't made many changes. Thanks for taking a look!
Thanks. I can't find evidence of … Make sure to remove unrelated changes before opening a PR.
@jimchamp bumping the priority of this because it would be very helpful for me, and the PR has been open for a few weeks.
Drini is currently in Albania, and it may take some time before we can follow up, @RayBB.
Describe the problem that you'd like solved
I would like to get a data dump of just the /type/page entities. I can see that there are some in the all_types_dump, but it's too big for me to download. I was just running:
curl -s -L https://openlibrary.org/data/ol_dump_latest.txt.gz | gunzip -c | grep '/type/page'
Why?
Proposal & Constraints
I poked around briefly at the data dump code and I think it could be as simple as adding it here: openlibrary/openlibrary/data/dump.py, lines 215 to 221 at commit 0857026.
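As a rough sketch of what such an addition might look like: the snippet below splits a stream of tab-separated dump lines into per-type gzip files, with unrecognized types (including /type/page) collected in a misc file. `split_dump`, `KNOWN`, and the file names are hypothetical and only loosely modeled on the real dump.py:

```python
import gzip
import os

# Hypothetical type-to-filename mapping; the real dump.py differs.
KNOWN = {"/type/edition": "editions", "/type/author": "authors",
         "/type/work": "works"}

def split_dump(lines, out_dir="."):
    """Write each tab-separated dump line to a per-type gzip file;
    types without a dedicated file (e.g. /type/page) go into 'misc'."""
    files = {}
    try:
        for line in lines:
            type_ = line.split("\t", 1)[0]
            name = KNOWN.get(type_, "misc")
            if name not in files:
                path = os.path.join(out_dir, f"ol_dump_{name}.txt.gz")
                files[name] = gzip.open(path, "wt")
            files[name].write(line)
    finally:
        for f in files.values():
            f.close()
```

Output files are opened lazily, so a run with no /type/page rows produces no empty misc file.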
We'd also want to make one of those special links, like https://openlibrary.org/data/ol_dump_authors_latest.txt.gz, redirect to the pages dump.
Additional context
https://github.com/internetarchive/openlibrary/wiki/Sitemap-Generation
https://github.com/internetarchive/openlibrary/wiki/Generating-Data-Dumps
https://openlibrary.org/developers/dumps
Stakeholders