Add a high level Python API #145

Open
simonw opened this issue Mar 9, 2024 · 5 comments
Owner

simonw commented Mar 9, 2024

In this NICAR workshop: https://github.com/dwillis/shot-scraper-nicar24

This code: https://github.com/dwillis/shot-scraper-nicar24/blob/main/demo.py

import json
import subprocess


def shotscraper_card(team, season):
    ncaa_id = team['ncaa_id']
    name = team['team']

    # JavaScript to be executed by shot-scraper
    javascript_code = """
    Array.from(document.querySelectorAll('.s-person-card__content'), el => {
        const id = '';
        const name = el.querySelector('.s-person-details__personal-single-line').innerText;
        const year = el.querySelectorAll('.s-person-details__bio-stats-item')[1].childNodes[1].wholeText.trim();
        let ht = el.querySelectorAll('.s-person-details__bio-stats-item')[2].childNodes[1].wholeText;
        const height = ht ? ht.trim() : '';
        const position = el.querySelectorAll('.s-person-details__bio-stats-item')[0].childNodes[1].textContent.trim()
        const hometown = el.querySelectorAll('.s-person-card__content__person__location-item')[0].childNodes[2].textContent.trim();
        let hs_el = el.querySelectorAll('.s-person-card__content__person__location-item')[1].childNodes[1].textContent;
        const high_school = hs_el ? hs_el.trim() : '';
        const previous_school = '';
        let j = el.querySelector('.s-stamp__text');
        const jersey = j ? j.innerText : '';
        const url = el.querySelector('a')['href']
        return {id, name, year, hometown, high_school, previous_school, height, position, jersey, url};
    })
    """

    url = team['url'] + "/roster/" + season
    # Execute shot-scraper with the given JavaScript; a failed run raises CalledProcessError
    result = subprocess.check_output(['shot-scraper', 'javascript', url, javascript_code, "--user-agent", "Firefox"])
    parsed_data = json.loads(result)

    # Tag each player with the team and season it was scraped for
    for player in parsed_data:
        player['team_id'] = ncaa_id
        player['team'] = name
        player['season'] = season

    return parsed_data

It shouldn't be necessary to use subprocess to do something this straightforward with shot-scraper. I'd like to support something like this instead:

import shot_scraper

result = shot_scraper.javascript(url, javascript_code, user_agent="Firefox")
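
Until something like that exists, the same call shape can be approximated with a thin wrapper over the CLI. This is only a sketch of a stopgap, not the proposed implementation: build_command and javascript are hypothetical names, and the only CLI pieces assumed are the ones used in the demo above (the javascript subcommand and --user-agent).

```python
import json
import subprocess


def build_command(url, javascript_code, user_agent=None):
    """Assemble the argv for `shot-scraper javascript` (hypothetical helper)."""
    command = ["shot-scraper", "javascript", url, javascript_code]
    if user_agent is not None:
        command += ["--user-agent", user_agent]
    return command


def javascript(url, javascript_code, user_agent=None):
    """Run the CLI and return its JSON output decoded into Python objects."""
    output = subprocess.check_output(build_command(url, javascript_code, user_agent))
    return json.loads(output)
```

This still pays the cost of starting a new process and browser on every call, which is the overhead a real Python API would avoid.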
Owner Author

simonw commented Mar 9, 2024

Might be better to provide a class, so you can instantiate once (loading up the headless browser) and then use it for multiple things.

Or... do that, but still have a shot_scraper.javascript(...) shortcut for quick one-off tasks.
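
A rough sketch of how the class plus shortcut could fit together, assuming Playwright's sync API underneath. The names ShotScraper, launcher and the module-level javascript are all hypothetical, and the launcher parameter exists only so the class can be exercised without a real browser. Only the chromium, firefox and webkit engines are handled here.

```python
from contextlib import AbstractContextManager


class ShotScraper(AbstractContextManager):
    """Hypothetical reusable session: launch one browser, use it many times."""

    def __init__(self, browser="chromium", user_agent=None, launcher=None):
        self.browser_name = browser
        self.user_agent = user_agent
        # launcher is an injection point for tests; the default uses Playwright.
        self._launcher = launcher or self._launch_with_playwright
        self._playwright = None
        self._browser = None

    def _launch_with_playwright(self):
        # Imported lazily so this module loads without Playwright installed.
        from playwright.sync_api import sync_playwright

        self._playwright = sync_playwright().start()
        return getattr(self._playwright, self.browser_name).launch()

    def javascript(self, url, javascript_code):
        """Load url in a fresh context and return the evaluated result."""
        if self._browser is None:
            # The browser is started on first use and then reused.
            self._browser = self._launcher()
        kwargs = {"user_agent": self.user_agent} if self.user_agent else {}
        context = self._browser.new_context(**kwargs)
        try:
            page = context.new_page()
            page.goto(url)
            return page.evaluate(javascript_code)
        finally:
            context.close()

    def close(self):
        if self._browser is not None:
            self._browser.close()
            self._browser = None
        if self._playwright is not None:
            self._playwright.stop()
            self._playwright = None

    def __exit__(self, *exc_info):
        self.close()
        return None


def javascript(url, javascript_code, **options):
    """One-off shortcut: open a session, run the code, shut everything down."""
    with ShotScraper(**options) as scraper:
        return scraper.javascript(url, javascript_code)
```

With this shape, a script scraping many rosters would wrap its loop in a single `with ShotScraper(user_agent="Firefox") as scraper:` block, while quick tasks would call the shortcut directly.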

Owner Author

simonw commented Mar 9, 2024

Initial rough API design:

shot_scraper.javascript(url, javascript_code) -> a JSON decoded result

With keyword arguments for most of these:

Options:
  -i, --input FILENAME            Read input JavaScript from this file
  -a, --auth FILENAME             Path to JSON authentication context file
  -o, --output FILENAME           Save output JSON to this file
  -r, --raw                       Output JSON strings as raw text
  -b, --browser [chromium|firefox|webkit|chrome|chrome-beta]
                                  Which browser to use
  --browser-arg TEXT              Additional arguments to pass to the browser
  --user-agent TEXT               User-Agent header to use
  --reduced-motion                Emulate 'prefers-reduced-motion' media
                                  feature
  --log-console                   Write console.log() to stderr
  --fail                          Fail with an error code if a page returns an
                                  HTTP error
  --skip                          Skip pages that return HTTP errors
  --bypass-csp                    Bypass Content-Security-Policy
  --auth-password TEXT            Password for HTTP Basic authentication
  --auth-username TEXT            Username for HTTP Basic authentication
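
One possible naming convention for those keyword arguments is to take each long option, drop the leading dashes and swap hyphens for underscores (so --bypass-csp becomes bypass_csp). The sketch below illustrates that mapping by going in the other direction, from keyword arguments back to CLI flags. JAVASCRIPT_OPTIONS and options_to_args are hypothetical names, the option list is copied from the help output above, and repeating a flag for list values is an assumption rather than documented behavior.

```python
# Long options from the help output above. True means the option takes a
# value, False means it is a boolean switch.
JAVASCRIPT_OPTIONS = {
    "input": True,
    "auth": True,
    "output": True,
    "raw": False,
    "browser": True,
    "browser_arg": True,
    "user_agent": True,
    "reduced_motion": False,
    "log_console": False,
    "fail": False,
    "skip": False,
    "bypass_csp": False,
    "auth_password": True,
    "auth_username": True,
}


def options_to_args(**options):
    """Translate keyword arguments into CLI flags (hypothetical helper)."""
    args = []
    for name, value in options.items():
        if name not in JAVASCRIPT_OPTIONS:
            raise TypeError(f"unexpected keyword argument {name!r}")
        flag = "--" + name.replace("_", "-")
        if JAVASCRIPT_OPTIONS[name]:
            if value is None:
                continue
            # Assumption: a list or tuple repeats the flag once per item.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                args += [flag, str(item)]
        elif value:
            # Boolean switches are emitted only when truthy.
            args.append(flag)
    return args
```

Rejecting unknown names with TypeError mirrors what Python itself does for an unexpected keyword argument, so typos would surface early.
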

image_bytes = shot_scraper.shot(url)

With a TON of options, see https://shot-scraper.datasette.io/en/stable/screenshots.html#shot-scraper-shot-help


... etc

Owner Author

simonw commented Mar 9, 2024

This is going to end up being a pretty big refactor, because I'll want the CLI tool to use the new Python API under the hood.

Owner Author

simonw commented Mar 14, 2024

Prototyped this with Claude 3 Opus: https://gist.github.com/simonw/a43ee47f528c0d3dc894bb4ba38aa94a


davidbgk commented May 7, 2024

Another use-case where I'd love to be able to call shot-scraper directly from Python.
