
Support pagination with a --next option #105

Open
simonw opened this issue Feb 8, 2023 · 4 comments
Labels: enhancement, research

simonw commented Feb 8, 2023

Would be neat if you could do pagination when running shot-scraper javascript - by running extra JavaScript that returns the URL of the next page to visit.

simonw added the enhancement and research labels Feb 8, 2023

simonw commented Feb 8, 2023

Here's a prototype I built to help me scrape through all of https://news.ycombinator.com/from?site=simonwillison.net following the more links:

diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 9bc48aa..eb3a80e 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -524,6 +524,21 @@ def accessibility(url, auth, output, javascript, timeout, log_console, skip, fai
     is_flag=True,
     help="Output JSON strings as raw text",
 )
+@click.option(
+    'next_',
+    "--next",
+    help="JavaScript to run to find next page",
+)
+@click.option(
+    "--next-delay",
+    type=int,
+    help="Milliseconds to wait before following --next",
+)
+@click.option(
+    "--next-limit",
+    type=int,
+    help="Maximum number of --after pages",
+)
 @browser_option
 @user_agent_option
 @reduced_motion_option
@@ -536,6 +551,9 @@ def javascript(
     auth,
     output,
     raw,
+    next_,
+    next_delay,
+    next_limit,
     browser,
     user_agent,
     reduced_motion,
@@ -571,6 +589,7 @@ def javascript(
     if not javascript:
         javascript = input.read()
     url = url_or_file_path(url, _check_and_absolutize)
+    next_count = 0
     with sync_playwright() as p:
         context, browser_obj = _browser_context(
             p,
@@ -582,9 +601,27 @@ def javascript(
         page = context.new_page()
         if log_console:
             page.on("console", console_log)
-        response = page.goto(url)
-        skip_or_fail(response, skip, fail)
-        result = _evaluate_js(page, javascript)
+        result = []
+        while url:
+            response = page.goto(url)
+            skip_or_fail(response, skip, fail)
+            evaluated = _evaluate_js(page, javascript)
+            if next_:
+                result.extend(evaluated)
+            else:
+                result = evaluated
+            next_count += 1
+            if next_:
+                if next_limit is not None and next_count >= next_limit:
+                    raise click.ClickException(
+                        f"Reached --next-limit of {next_limit} pages"
+                    )
+                url = _evaluate_js(page, next_)
+                print(url)
+                if next_delay:
+                    time.sleep(next_delay / 1000)
+            else:
+                url = None
         browser_obj.close()
     if raw:
         output.write(str(result))

I ran it like this and it worked!

shot-scraper javascript \
    'https://news.ycombinator.com/from?site=simonwillison.net' \
    -i /tmp/scrape.js \
    --next '() => {
        let el = document.querySelector(".morelink[rel=next]");
        if (el) {
            return el.href;
        }  
    }' -o /tmp/all.json --next-delay 1000
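
The contents of /tmp/scrape.js aren't shown; a minimal sketch of what that script might look like for this page, assuming Hacker News's usual row markup (tr.athing rows, a .titleline link, and a .score in the row that follows):

() => {
  // Return one object per story row; the pagination loop above extends the
  // combined result list with whatever this returns for each page.
  return Array.from(document.querySelectorAll("tr.athing")).map(row => {
    const titleLink = row.querySelector(".titleline a");
    const subtext = row.nextElementSibling; // row holding score and comment count
    const score = subtext ? subtext.querySelector(".score") : null;
    return {
      id: row.id,
      title: titleLink ? titleLink.innerText : null,
      url: titleLink ? titleLink.href : null,
      points: score ? parseInt(score.innerText, 10) : null,
    };
  });
}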

simonw commented Feb 8, 2023

Needs more thought about how things like concatenating results from multiple pages should work.

It would also be neat if this could return a {"method": "POST", "body": "..."} object as an alternative to returning a URL, then shot-scraper could hit subsequent pages using other HTTP methods. Maybe persist cookies too!
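
Not something the prototype understands yet, but a --next script returning such an object might look like the sketch below; the form.pagination selector and the fallback are assumptions for illustration:

() => {
  // Hypothetical: describe the next request instead of returning a bare URL.
  // The prototype above currently only accepts a plain URL string.
  const form = document.querySelector("form.pagination");
  if (form) {
    return {
      method: "POST",
      url: form.action,
      body: new URLSearchParams(new FormData(form)).toString(),
    };
  }
  // Fall back to a plain link if there is one.
  const el = document.querySelector("a[rel=next]");
  return el ? el.href : null;
}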

daaain commented Aug 11, 2023

I was trying to scrape some Google Maps lists of places, but didn't manage to: the first page that loads is a cookie notice, and accepting or rejecting it triggers a navigation event that results in Error: Execution context was destroyed, most likely because of a navigation. This sounds like it could solve that?

To your question: maybe it could just return JSON-LD and leave the concatenation to downstream tools?
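
A scrape script along those lines could just return the page's JSON-LD blocks and leave the merging to whatever consumes the output file; a rough sketch:

() => {
  // Collect every JSON-LD block on the page; skip any that fail to parse.
  return Array.from(
    document.querySelectorAll('script[type="application/ld+json"]')
  ).map(el => {
    try {
      return JSON.parse(el.textContent);
    } catch (e) {
      return null;
    }
  }).filter(Boolean);
}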

dynabler commented

Pagination is difficult to wrap your head around. I scrape 1000s of pages on a daily basis, and pagination is something no scraper can get right.

From the script above, --next is supposed to get the next link. But which “next” links are we talking about?

In a nutshell, websites consist of list pages and single pages. List pages “list” the pages a website has, and single pages are the “final” pages.

How devs think about scraping

For this type of scraping, think of any list page (IMDb genre pages, Amazon shoes pages): a “next” option is fine, because the list page is the final page.

flowchart LR;
start-url --> list-page-1
start-url --> list-page-2
start-url --> list-page-3

What scrapers actually want

But in reality, list pages have a very different purpose. A list entry is a “summary” of a page, designed to “entice” users to click; it doesn't have the actual data a scraper wants (see the cases below).

flowchart LR;
start-url --> list-page-1-->single-pages-11[single page 1]
list-page-1-->single-pages-12[single page 2]
list-page-1-->single-pages-13[single page 3]
start-url --> list-page-2-->single-pages-21[single page 1]
list-page-2-->single-pages-22[single page 2]
list-page-2-->single-pages-23[single page 3]

Summary

To sum it up: to allow shot-scraper to “follow” links, one has to think about two types of links to follow: pagination links (1, 2, 3, next, etc.) and list items (card, article, col, etc.). It also helps to actually call them that:

shot-scraper https://amazon.com/shoes --pagination a[label=next] --follow a.items
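
In terms of the current --next prototype, the closest approximation might be to let the scrape script collect the single-page links from each list page while --next only follows the pagination link; a rough sketch, reusing the placeholder selectors from the example above:

// Hypothetical scrape script: on each list page, collect the links to the
// single pages so a later pass can visit them individually.
() => {
  return Array.from(document.querySelectorAll("a.items")).map(el => el.href);
}

// Hypothetical --next script: only follow the pagination link.
() => {
  const el = document.querySelector("a[label=next]");
  return el ? el.href : null;
}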

A case for pagination + follow

Example: huggingface.co

On the list pages, you get: name, category, last update, number of downloads rounded to the nearest 1,000, and favorites rounded to the nearest 1,000.

Let's say you want the growth rate. On the list page the download count is listed as 227K, but when you click through to the actual page it says 226,828.

The difference between scraping the list page and the actual page is that, on the list page, it takes 1,000 downloads before you notice a change. In real life, it means you won't be able to catch “trending” AI models.

Another example: you want to know the sentiment around an AI model. On list pages, you have favorites. That doesn't say much about a model: a person can favorite it to get updates, to view it later, because they like the idea, because they're interested in how it works, etc.

On the actual page, you have a community tab, which reveals far more about sentiment: the ratio between open and closed issues, for example. 800 open issues and 1 closed tells a different story than 800 open / 1,000 closed, 0 open / 800 closed, or even 800 closed with a last update in 1980.

Example: rottentomatoes.com

Another example of a list page not having everything you need is rottentomatoes.com.

On the list page, you get: title, Tomatometer, audience score, and opening date.

On the actual page, you get: MPA rating (G, PG, PG-13), genre, duration, critics consensus, recommended/similar movies, where to watch, language, synopsis, and cast.

Even if you don't need anything complicated (genre, for example), shot-scraper still has to visit the actual page to get the info, since the list page lacks pretty much everything.

Pagination resources

Most commonly used pagination types

auto (detect which type of pagination is used)
Link (<a href="https://example.com">)
Scripted link (<a href="javascript:window.location='https://example.com'">)
Attribute link (<a data-link="https://example">)
Text link (<div>link: https://example.com</div>)
Link from any script (window.location=, window.open)
Click multiple times on a next/more button ([Next page], [Load More])
Click once on multiple buttons ([1], [2], [3])
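
Several of the link-style types above could be probed from a single --next script; the button-driven types ([Next page], [Load More]) don't fit the return-a-URL model and would need click support. A rough sketch with placeholder selectors:

() => {
  // Try the common link-style pagination types in order and return the first URL found.
  // "Load More" style buttons are not handled here: they need a click, not a URL.
  const relNext = document.querySelector("a[rel=next]");
  if (relNext) {
    return relNext.href;
  }
  const attrLink = document.querySelector("a[data-link]");
  if (attrLink) {
    return attrLink.getAttribute("data-link");
  }
  const scripted = document.querySelector('a[href^="javascript:"]');
  if (scripted) {
    const match = scripted.getAttribute("href").match(/https?:\/\/[^'"\s]+/);
    if (match) {
      return match[0];
    }
  }
  // Returning nothing ends the while loop in the prototype above.
}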
