Ability to run shot-scraper javascript against several URLs at once #148

Open
simonw opened this issue Apr 2, 2024 · 3 comments
Labels: enhancement (New feature or request)

Comments

simonw commented Apr 2, 2024

I found myself wanting to use the Readability trick against multiple URLs, without having to pay the startup cost of launching a new Chromium instance for each one.

Idea: a way to run shot-scraper javascript against more than one URL, returning an array of results.
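
For reference, the existing single-URL form looks something like this (a sketch reusing the Readability snippet quoted later in this issue):

shot-scraper javascript https://simonwillison.net/2024/Mar/26/llm-cmd/ "
async () => {
  const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
  return (new readability.Readability(document)).parse();
}"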

simonw added the enhancement (New feature or request) label Apr 2, 2024

simonw commented Apr 2, 2024

Challenge: the current UI for that command is:

shot-scraper javascript $URL $JAVASCRIPT

How would passing multiple URLs work? It would be easier if the JavaScript came first, since then you could tack on multiple URLs as positional arguments, but that doesn't fit the current design.

Some options:

  • A new command, javascript-multi - similar to how shot-scraper multi works in taking multiple screenshots at once (rough invocation sketches for all three options follow this list)
  • Add a -m/--multi option to the javascript command and teach it to run the JavaScript against those URLs as well as the first one
    • Could have a special case here where shot-scraper javascript $JAVASCRIPT -m $URL1 -m $URL2 works - treating the first positional argument as the JavaScript when there is only one positional argument and at least one -m option
  • shot-scraper javascript $JAVASCRIPT --urls $FILENAME, which takes URLs from a file (or - for standard input) rather than expecting them to be passed as -m options
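
Roughly, the three invocations might look like this (hypothetical sketches - none of these options exist yet):

# Option 1: a separate command, driven by a file like shot-scraper multi
shot-scraper javascript-multi javascript-shots.yml
# Option 2: repeated -m options, with the single positional argument treated as the JavaScript
shot-scraper javascript "$JAVASCRIPT" -m "$URL1" -m "$URL2"
# Option 3: URLs read from a file, or - for standard input
shot-scraper javascript "$JAVASCRIPT" --urls urls.txt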

simonw commented Apr 2, 2024

I built a prototype of that second option:

diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 3f1245e..86fc7b4 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -653,6 +653,13 @@ def accessibility(
     is_flag=True,
     help="Output JSON strings as raw text",
 )
+@click.option(
+    "multis",
+    "-m",
+    "--multi",
+    help="Run same JavaScript against multiple pages",
+    multiple=True,
+)
 @browser_option
 @browser_args_option
 @user_agent_option
@@ -668,6 +675,7 @@ def javascript(
     auth,
     output,
     raw,
+    multis,
     browser,
     browser_args,
     user_agent,
@@ -704,9 +712,26 @@ def javascript(
 
     If a JavaScript error occurs an exit code of 1 will be returned.
     """
+    # Special case for --multi - if multis are provided but JavaScript
+    # positional option was not set, assume the first argument is JS
+    if multis and not javascript:
+        javascript = url
+        url = None
+
+    # If they didn't provide JavaScript, assume it's being piped in
     if not javascript:
         javascript = input.read()
-    url = url_or_file_path(url, _check_and_absolutize)
+
+    to_process = []
+    if url:
+        to_process.append(url_or_file_path(url, _check_and_absolutize))
+    to_process.extend(url_or_file_path(multi, _check_and_absolutize) for multi in multis)
+
+    results = []
+
+    if len(to_process) > 1 and not raw:
+        output.write("[\n")
+
     with sync_playwright() as p:
         context, browser_obj = _browser_context(
             p,
@@ -719,18 +744,28 @@ def javascript(
             auth_username=auth_username,
             auth_password=auth_password,
         )
-        page = context.new_page()
-        if log_console:
-            page.on("console", console_log)
-        response = page.goto(url)
-        skip_or_fail(response, skip, fail)
-        result = _evaluate_js(page, javascript)
+        for i, url in enumerate(to_process):
+            is_last = i == len(to_process) - 1
+            page = context.new_page()
+            if log_console:
+                page.on("console", console_log)
+            response = page.goto(url)
+            skip_or_fail(response, skip, fail)
+            result = _evaluate_js(page, javascript)
+            if raw:
+                output.write(str(result) + "\n")
+            else:
+                output.write(
+                    json.dumps(result, indent=4, default=str) + ("\n" if is_last else ",\n")
+                )
+
         browser_obj.close()
-    if raw:
-        output.write(str(result))
-        return
-    output.write(json.dumps(result, indent=4, default=str))
-    output.write("\n")
+
+    if len(to_process) > 1 and not raw:
+        output.write("]\n")
+
+    if len(results) == 1:
+        results = results[0]
 
 
 @cli.command()

Then used like this:

shot-scraper javascript "
async () => {
  const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
  return (new readability.Readability(document)).parse();
}" \
-m https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/ \
-m https://simonwillison.net/2024/Mar/26/llm-cmd/ \
-m https://simonwillison.net/2024/Mar/23/building-c-extensions-for-sqlite-with-chatgpt-code-interpreter/ \
-m https://simonwillison.net/2024/Mar/22/claude-and-chatgpt-case-study/ \
-m https://simonwillison.net/2024/Mar/16/weeknotes-the-aftermath-of-nicar/ | tee /tmp/all.json

It worked, but I'm not sure if the design is right - in particular it feels inconsistent with how shot-scraper multi works.
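
For comparison, shot-scraper multi takes a YAML file of shots rather than repeated command-line options - something along these lines (a sketch based on its documented usage):

echo '- url: https://www.example.org/
  output: example.png' > shots.yml
shot-scraper multi shots.yml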

dynabler commented

Here are some ideas I have come across in other scraping tools:

url: https://example.com
urls: [https://example.com/page/{},1,243] # range through pages 1 to 243
urls: [...range(https://example.com/page/{},1,243)] # with an explicit range and some function needed
urls: ['https://example.com/', 'https://google.com', 'https://bing.com']

import urls from "./example_page_links.txt"
urls: urls.split("\n"),

Side note: going through all the research in the issues, it might be an idea to allow shot-scraper to use a config file. That way, all arguments you can pass on the command line could be put neatly in a config file.
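
Purely as a sketch of that side note (nothing like this exists in shot-scraper today; the file name, keys and --config option are all invented for illustration):

cat > shot-scraper-config.yml <<'EOF'
browser: chromium
user-agent: example-agent
urls:
- https://example.com/
- https://example.com/page/2
EOF
shot-scraper javascript --config shot-scraper-config.yml "$JAVASCRIPT"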
