Scrapex

Introduction

Scrapex is a scraping component that extracts content from URLs. It drives Chrome through Playwright, so it can handle Single Page Applications (SPAs) and other pages whose content depends on JavaScript execution. It was initially developed for internal use by our AI Agents, but it is suitable for general scraping tasks.

Features

Multiple output formats: Scrapex can return the extracted content as HTML, Markdown, or PDF.
Container image deployment: Scrapex is distributed as a container image and runs in container environments such as Docker or Kubernetes.
Customizable settings: Behavior can be adjusted through environment variables as well as through parameters in each extraction call.

Configuration

Scrapex supports the following output formats:

HTML: Direct extraction of the page's HTML content.
Markdown: Conversion of the HTML to Markdown using html-to-md.
PDF: Generation of a PDF document using Playwright's PDF functionality.

Environment Variables

Configure Scrapex using the following environment variables:

Variable	Description	Default
`PORT`	Port on which the Node.js server listens	`3000`
`DEFAULT_WAIT`	Default number of milliseconds to wait after page load	`0`
`DEFAULT_USER_AGENT`	Default user agent for requests	`"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"`
`LOG_LEVEL`	Logging level (`debug`, `info`, `warn`, `error`)	`debug`
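
The interplay between these defaults and the per-request parameters can be sketched as follows. This is a minimal illustration in Python, not Scrapex's actual (Node.js) implementation: it assumes that a `wait` or `userAgent` value in the request takes precedence over `DEFAULT_WAIT` and `DEFAULT_USER_AGENT`, which the word "default" suggests but this README does not state explicitly. The function name `effective_request_settings` is hypothetical.

```python
import os

# Documented default values from the table above.
DOCUMENTED_DEFAULT_WAIT = "0"
DOCUMENTED_DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def effective_request_settings(payload, env=None):
    """Resolve wait and user agent for one request.

    Assumed precedence (not confirmed by the README):
    request parameter > environment variable > documented default.
    """
    env = os.environ if env is None else env

    wait = payload.get("wait")
    if wait is None:
        wait = env.get("DEFAULT_WAIT", DOCUMENTED_DEFAULT_WAIT)

    user_agent = payload.get("userAgent")
    if user_agent is None:
        user_agent = env.get("DEFAULT_USER_AGENT", DOCUMENTED_DEFAULT_USER_AGENT)

    return {"wait": int(wait), "userAgent": user_agent}
```

Under that assumption, a request that omits `wait` while the container runs with `DEFAULT_WAIT=1500` would wait 1500 ms, and a request that sends `"wait": 2000` would wait 2000 ms.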

How to Run

The simplest way to run Scrapex is using Docker. Here's an example docker-compose.yaml:

version: "3"
services:
    app:
        container_name: scrapex
        image: ghcr.io/cloudx-labs/scrapex:latest # prefer pinning a specific release version, such as v0.1, over latest
        environment:
            - TZ=America/Argentina/Buenos_Aires
            - PORT=3000
            - LOG_LEVEL=debug
        ports:
            - "3003:3000"

Usage Example

To test Scrapex, you can send a request using curl as shown below:

curl --location 'http://localhost:3003/extract' \
--header 'Content-Type: application/json' \
--data '{
    "url": "https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon",
    "outputType": "pdf",
    "wait": 0,
    "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "settings": {
        "pdf": {
            "options": {
                "format": "A4"
            }
        }
    }
}'

Payload Parameters

The following table describes the parameters accepted in the request payload, as used in the curl example:

Parameter	Description	Example
url	URL of the page to scrape	https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon
outputType	Desired output format	html / md / pdf
wait	Milliseconds to wait before extraction	2000
userAgent	User agent to use for the request	Mozilla/5.0 (Windows NT 10.0; Win64; x64)...
settings	Additional settings for output formatting	{ "pdf": { "options": { "format": "A4" } } }
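
For callers that prefer code over curl, the same request can be sketched in Python using only the standard library. The helper names `build_payload` and `extract` are hypothetical, and two points are assumptions rather than documented behavior: that `userAgent` and `settings` may be omitted (so the server-side defaults apply), and that the response body can be treated as raw bytes regardless of output type.

```python
import json
import urllib.request

ALLOWED_OUTPUT_TYPES = ("html", "md", "pdf")


def build_payload(url, output_type="html", wait=0, user_agent=None, settings=None):
    """Build the JSON body for POST /extract from the documented parameters."""
    if output_type not in ALLOWED_OUTPUT_TYPES:
        raise ValueError(f"outputType must be one of {ALLOWED_OUTPUT_TYPES}")
    payload = {"url": url, "outputType": output_type, "wait": wait}
    # Optional fields are only sent when provided (assumption: the server
    # falls back to its configured defaults when they are absent).
    if user_agent is not None:
        payload["userAgent"] = user_agent
    if settings is not None:
        payload["settings"] = settings
    return payload


def extract(base_url, payload, timeout=60):
    """POST the payload to <base_url>/extract and return the raw response bytes."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With the docker-compose setup above, `extract("http://localhost:3003", build_payload("https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon", "pdf"))` would send the equivalent of the curl request, minus the explicit user agent and PDF settings.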

Settings per Extraction Type

PDF

All available options for settings -> pdf -> options are listed at: https://playwright.dev/docs/api/class-page#page-pdf

Markdown (MD)

All available options for settings -> md -> options are listed at: https://github.com/stonehank/html-to-md/blob/master/README-EN.md
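
As a rough illustration of the nesting described in this section, the sketch below builds a `settings` object for each output type in Python. The helper `settings_for` is hypothetical. The PDF option names (`format`, `landscape`, `printBackground`) come from Playwright's `page.pdf` options; the Markdown option `ignoreTags` is taken from the html-to-md README as an assumption and should be checked against the linked page, as should how Scrapex forwards these options to each library.

```python
def settings_for(output_type, options):
    """Nest library options as settings -> <type> -> options.

    Only "pdf" and "md" are accepted because those are the only
    extraction types for which this README documents settings.
    """
    if output_type not in ("pdf", "md"):
        raise ValueError("settings are only documented for 'pdf' and 'md'")
    return {output_type: {"options": dict(options)}}


# PDF: options passed through to Playwright's page.pdf.
pdf_settings = settings_for(
    "pdf",
    {"format": "A4", "landscape": True, "printBackground": True},
)

# Markdown: options passed through to html-to-md
# (ignoreTags is assumed from that library's README).
md_settings = settings_for("md", {"ignoreTags": ["script", "style"]})
```

Either object can be sent as the `settings` field of the request payload alongside the matching `outputType`.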
