
Support pagination with a --next option #105

Open
simonw opened this issue Feb 8, 2023 · 4 comments
Labels: enhancement, research

simonw commented Feb 8, 2023

Would be neat if you could do pagination when running shot-scraper javascript - by running extra JavaScript that returns the URL of the next page to visit.

simonw added the enhancement and research labels Feb 8, 2023

simonw commented Feb 8, 2023

Here's a prototype I built to help me scrape through all of https://news.ycombinator.com/from?site=simonwillison.net following the more links:

diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 9bc48aa..eb3a80e 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -524,6 +524,21 @@ def accessibility(url, auth, output, javascript, timeout, log_console, skip, fai
     is_flag=True,
     help="Output JSON strings as raw text",
 )
+@click.option(
+    'next_',
+    "--next",
+    help="JavaScript to run to find next page",
+)
+@click.option(
+    "--next-delay",
+    type=int,
+    help="Milliseconds to wait before following --next",
+)
+@click.option(
+    "--next-limit",
+    type=int,
+    help="Maximum number of --after pages",
+)
 @browser_option
 @user_agent_option
 @reduced_motion_option
@@ -536,6 +551,9 @@ def javascript(
     auth,
     output,
     raw,
+    next_,
+    next_delay,
+    next_limit,
     browser,
     user_agent,
     reduced_motion,
@@ -571,6 +589,7 @@ def javascript(
     if not javascript:
         javascript = input.read()
     url = url_or_file_path(url, _check_and_absolutize)
+    next_count = 0
     with sync_playwright() as p:
         context, browser_obj = _browser_context(
             p,
@@ -582,9 +601,27 @@ def javascript(
         page = context.new_page()
         if log_console:
             page.on("console", console_log)
-        response = page.goto(url)
-        skip_or_fail(response, skip, fail)
-        result = _evaluate_js(page, javascript)
+        result = []
+        while url:
+            response = page.goto(url)
+            skip_or_fail(response, skip, fail)
+            evaluated = _evaluate_js(page, javascript)
+            if next_:
+                result.extend(evaluated)
+            else:
+                result = evaluated
+            next_count += 1
+            if next_:
+                if next_limit is not None and next_count >= next_limit:
+                    raise click.ClickException(
+                        f"Reached --next-limit of {next_limit} pages"
+                    )
+                url = _evaluate_js(page, next_)
+                print(url)
+                if next_delay:
+                    time.sleep(next_delay / 1000)
+            else:
+                url = None
         browser_obj.close()
     if raw:
         output.write(str(result))

I ran it like this and it worked!

shot-scraper javascript \
    'https://news.ycombinator.com/from?site=simonwillison.net' \
    -i /tmp/scrape.js \
    --next '() => {
        let el = document.querySelector(".morelink[rel=next]");
        if (el) {
            return el.href;
        }  
    }' -o /tmp/all.json --next-delay 1000
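
The contents of /tmp/scrape.js aren't shown; a minimal sketch of what that script might look like for this page, assuming Hacker News's usual row markup (tr.athing rows, a .titleline link, and a .score in the row that follows):

() => {
  // Return one object per story row; the pagination loop above extends the
  // combined result list with whatever this returns for each page.
  return Array.from(document.querySelectorAll("tr.athing")).map(row => {
    const titleLink = row.querySelector(".titleline a");
    const subtext = row.nextElementSibling; // row holding score and comment count
    const score = subtext ? subtext.querySelector(".score") : null;
    return {
      id: row.id,
      title: titleLink ? titleLink.innerText : null,
      url: titleLink ? titleLink.href : null,
      points: score ? parseInt(score.innerText, 10) : null,
    };
  });
}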

simonw commented Feb 8, 2023

Needs more thought about how things like concatenating results from multiple pages should work.

It would also be neat if this could return a {"method": "POST", "body": "..."} object as an alternative to returning a URL, then shot-scraper could hit subsequent pages using other HTTP methods. Maybe persist cookies too!
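
Not something the prototype understands yet, but a --next script returning such an object might look like the sketch below; the form.pagination selector and the fallback are assumptions for illustration:

() => {
  // Hypothetical: describe the next request instead of returning a bare URL.
  // The prototype above currently only accepts a plain URL string.
  const form = document.querySelector("form.pagination");
  if (form) {
    return {
      method: "POST",
      url: form.action,
      body: new URLSearchParams(new FormData(form)).toString(),
    };
  }
  // Fall back to a plain link if there is one.
  const el = document.querySelector("a[rel=next]");
  return el ? el.href : null;
}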

daaain commented Aug 11, 2023

I was trying to scrape some Google Maps lists of places, but didn't manage to: the first page that loads is a cookie notice, and accepting or rejecting it triggers a navigation event that results in Error: Execution context was destroyed, most likely because of a navigation. This sounds like it could solve that?

To your question: maybe it could just return JSON-LD and leave the concatenation to downstream tools?
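
A scrape script along those lines could just return the page's JSON-LD blocks and leave the merging to whatever consumes the output file; a rough sketch:

() => {
  // Collect every JSON-LD block on the page; skip any that fail to parse.
  return Array.from(
    document.querySelectorAll('script[type="application/ld+json"]')
  ).map(el => {
    try {
      return JSON.parse(el.textContent);
    } catch (e) {
      return null;
    }
  }).filter(Boolean);
}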

dynabler commented

Pagination is difficult to wrap your head around. I scrape 1000s of pages on a daily basis, and pagination is something no scraper can get right.

From the script above, --next is supposed to get the next link. But which “next” links are we talking about?

In a nutshell, websites consist of list pages and single pages. List pages “list” the pages a website has, and single pages are the “final” pages.

How devs think about scraping

For this type of scraping, think of any list page (IMDb genre pages, Amazon shoes pages): a “next” option is fine, because the list page is the final page.

flowchart LR;
start-url --> list-page-1
start-url --> list-page-2
start-url --> list-page-3

What scrapers actually want

But in reality, list pages have a very different purpose. A list entry is a “summary” of a page, designed to “entice” users to click; it doesn't have the actual data a scraper wants (see the cases below).

flowchart LR;
start-url --> list-page-1-->single-pages-11[single page 1]
list-page-1-->single-pages-12[single page 2]
list-page-1-->single-pages-13[single page 3]
start-url --> list-page-2-->single-pages-21[single page 1]
list-page-2-->single-pages-22[single page 2]
list-page-2-->single-pages-23[single page 3]

Summary

To sum it up: to allow shot-scraper to “follow” links, one has to think about two types of links to follow: pagination links (1, 2, 3, next, etc.) and list items (card, article, col, etc.). It also helps to actually call them that:

shot-scraper https://amazon.com/shoes --pagination a[label=next] --follow a.items
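
In terms of the current --next prototype, the closest approximation might be to let the scrape script collect the single-page links from each list page while --next only follows the pagination link; a rough sketch, reusing the placeholder selectors from the example above:

// Hypothetical scrape script: on each list page, collect the links to the
// single pages so a later pass can visit them individually.
() => {
  return Array.from(document.querySelectorAll("a.items")).map(el => el.href);
}

// Hypothetical --next script: only follow the pagination link.
() => {
  const el = document.querySelector("a[label=next]");
  return el ? el.href : null;
}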

A case for pagination + follow

Example: huggingface.co

On the list pages, you get: name, category, last update, number of downloads rounded to the nearest 1,000, and favorites rounded to the nearest 1,000.

Let's say you want the growth rate. On the list page the download count is listed as 227K, but when you click through to the actual page it says 226,828.

The difference between scraping the list page and the actual page is that, on the list page, it takes 1,000 downloads before you notice a change. In real life, it means you won't be able to catch “trending” AI models.

Another example: you want to know the sentiment around an AI model. On list pages, you have favorites. That doesn't say much about a model: a person can favorite it to get updates, to view it later, because they like the idea, because they're interested in how it works, etc.

On the actual page, you have a community tab, which reveals far more about sentiment: the ratio between open and closed issues, for example. 800 open issues and 1 closed tells a different story than 800 open / 1,000 closed, 0 open / 800 closed, or even 800 closed with a last update in 1980.

Example: rottentomatoes.com

Another example of a list page not having everything you need is rottentomatoes.com.

On the list page, you get: title, Tomatometer, audience score, and opening date.

On the actual page, you get: MPA rating (G, PG, PG-13), genre, duration, critics consensus, recommended/similar movies, where to watch, language, synopsis, and cast.

Even if you don't need anything complicated (genre, for example), shot-scraper still has to visit the actual page to get the info, since the list page lacks pretty much everything.

Pagination resources

Most commonly used pagination types

auto (detect which type of pagination is used)
Link (<a href="https://example.com">)
Scripted link (<a href="javascript:window.location='https://example.com'">)
Attribute link (<a data-link="https://example">)
Text link (<div>link: https://example.com</div>)
Link from any script (window.location=, window.open)
Click multiple times on a next/more button ([Next page], [Load More])
Click once on multiple buttons ([1], [2], [3])
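
Several of the link-style types above could be probed from a single --next script; the button-driven types ([Next page], [Load More]) don't fit the return-a-URL model and would need click support. A rough sketch with placeholder selectors:

() => {
  // Try the common link-style pagination types in order and return the first URL found.
  // "Load More" style buttons are not handled here: they need a click, not a URL.
  const relNext = document.querySelector("a[rel=next]");
  if (relNext) {
    return relNext.href;
  }
  const attrLink = document.querySelector("a[data-link]");
  if (attrLink) {
    return attrLink.getAttribute("data-link");
  }
  const scripted = document.querySelector('a[href^="javascript:"]');
  if (scripted) {
    const match = scripted.getAttribute("href").match(/https?:\/\/[^'"\s]+/);
    if (match) {
      return match[0];
    }
  }
  // Returning nothing ends the while loop in the prototype above.
}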
