GitHub - go-awesome/flyscrape: An expressive and elegant web scraper

flyscrape is an expressive and elegant web scraper, combining the speed of Go with the
flexibility of JavaScript. Focus on data extraction rather than request juggling.

Features

Domains and URL filtering
Depth control
Request caching
Rate limiting
Development mode
Single binary executable

Example script

export const config = {
    url: "https://news.ycombinator.com/",
}

export default function ({ doc, absoluteURL }) {
    const title = doc.find("title");
    const posts = doc.find(".athing");

    return {
        title: title.text(),
        posts: posts.map((post) => {
            const link = post.find(".titleline > a");

            return {
                title: link.text(),
                url: absoluteURL(link.attr("href")), // resolve relative links against the page URL
            };
        }),
    }
}

$ flyscrape run hackernews.js
[
  {
    "url": "https://news.ycombinator.com/",
    "data": {
      "title": "Hacker News",
      "posts": [
        {
          "title": "Show HN: flyscrape - An expressive and elegant web scraper",
          "url": "https://flyscrape.com"
        },
        ...
      ]
    }
  }
]
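
The output of `flyscrape run` is a JSON array with one entry per scraped page. As a rough sketch of how that shape might be consumed, the snippet below flattens it into a single list of posts. `flattenPosts` and the `source` field are names made up for this illustration and are not part of flyscrape; the input is assumed to be already parsed, for example with `JSON.parse`.

```javascript
// Flatten scraper output into one list of posts.
// Assumes the shape shown above: [{ url, data: { title, posts: [...] } }].
function flattenPosts(results) {
    return results.flatMap((page) => {
        // Pages without a posts array contribute nothing.
        const posts = (page.data && page.data.posts) || [];
        return posts.map((post) => ({
            source: page.url, // page the post was scraped from
            title: post.title,
            url: post.url,
        }));
    });
}

// Sample input copied from the output above, reduced to one post.
const results = [
    {
        url: "https://news.ycombinator.com/",
        data: {
            title: "Hacker News",
            posts: [
                {
                    title: "Show HN: flyscrape - An expressive and elegant web scraper",
                    url: "https://flyscrape.com",
                },
            ],
        },
    },
];

console.log(flattenPosts(results));
```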

Installation

To install flyscrape, follow these simple steps:

1. Install Go: Make sure you have Go installed on your system. If not, you can download it from https://golang.org/.
2. Install flyscrape: Open a terminal and run the following command:
```
go install github.com/philippta/flyscrape/cmd/flyscrape@latest
```

Usage

$ flyscrape
flyscrape is an elegant scraping tool for efficiently extracting data from websites.

Usage:

    flyscrape <command> [arguments]

Commands:

    new    creates a sample scraping script
    run    runs a scraping script
    dev    watches and re-runs a scraping script

Create a new sample scraping script

The new command creates a boilerplate sample script to help you get started.

flyscrape new example.js

Watch the script for changes during development

The dev command allows you to watch your scraping script for changes and quickly iterate during development. In development mode, flyscrape will not follow any links and request caching is enabled.

flyscrape dev example.js

Run the scraping script

The run command runs your script and prints the scraped data as JSON.

flyscrape run example.js

Configuration

The script below lists the available configuration options with their defaults, followed by the parameters passed to the scrape function:

export const config = {
    url: "https://example.com/", // Specify the URL to start scraping from.
    depth: 0,                    // Specify how deep links should be followed.  (default = 0, no follow)
    allowedDomains: [],          // Specify the allowed domains. ['*'] for all. (default = domain from url)
    blockedDomains: [],          // Specify the blocked domains.                (default = none)
    allowedURLs: [],             // Specify the allowed URLs as regex.          (default = all allowed)
    blockedURLs: [],             // Specify the blocked URLs as regex.          (default = none)
    rate: 100,                   // Specify the rate in requests per second.    (default = no rate limit)
    cache: "file",               // Enable file-based request caching.          (default = no cache)
};

export default function ({ doc, url, absoluteURL }) {
    // doc              - Contains the parsed HTML document
    // url              - Contains the scraped URL
    // absoluteURL(...) - Transforms relative URLs into absolute URLs
}
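
To illustrate how these options might combine, here is a sketch of a configuration for a small crawl. The site, patterns, and values are placeholders, and `isBlocked` is a hypothetical helper that only approximates URL filtering with JavaScript's `RegExp`; flyscrape does its own matching, so edge cases may differ. In a real script the object is declared with `export const config`, as in the example above.

```javascript
// Hypothetical crawl settings; the site and patterns are placeholders.
// In a real script this object is declared with `export const config`.
const config = {
    url: "https://example.com/blog/",
    depth: 2,                            // follow links up to two levels deep
    allowedDomains: ["example.com"],     // stay on the starting domain
    blockedURLs: ["/login", "\\.pdf$"],  // regex patterns, written as strings
    rate: 5,                             // at most 5 requests per second
    cache: "file",                       // reuse responses between runs
};

// Illustration only: report which sample URLs match a blocked pattern.
// This is an assumption about the matching, not flyscrape's own code.
function isBlocked(url, patterns) {
    return patterns.some((pattern) => new RegExp(pattern).test(url));
}

const samples = [
    "https://example.com/blog/post-1",
    "https://example.com/login",
    "https://example.com/files/report.pdf",
];

const blocked = samples.filter((u) => isBlocked(u, config.blockedURLs));
console.log(blocked);
```

Testing patterns against a handful of known URLs before a crawl is a cheap way to catch an overly broad or mistyped regex.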

Query API

// <div class="element" foo="bar">Hey</div>
const el = doc.find(".element")
el.text()                                 // "Hey"
el.html()                                 // `<div class="element" foo="bar">Hey</div>`
el.attr("foo")                            // "bar"
el.hasAttr("foo")                         // true
el.hasClass("element")                    // true

// <ul>
//   <li class="a">Item 1</li>
//   <li>Item 2</li>
//   <li>Item 3</li>
// </ul>
const list = doc.find("ul")
list.children()                           // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]

const items = list.find("li")
items.length()                            // 3
items.first()                             // <li class="a">Item 1</li>
items.last()                              // <li>Item 3</li>
items.get(1)                              // <li>Item 2</li>
items.get(1).prev()                       // <li class="a">Item 1</li>
items.get(1).next()                       // <li>Item 3</li>
items.get(1).parent()                     // <ul>...</ul>
items.get(1).siblings()                   // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
items.map(item => item.text())            // ["Item 1", "Item 2", "Item 3"]
items.filter(item => item.hasClass("a"))  // [<li class="a">Item 1</li>]
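
The methods above can be combined inside a scrape function. The following is a minimal sketch against the `<ul>` markup from the previous example; `scrapeList` is a hypothetical name, and in a flyscrape script the function would be the default export.

```javascript
// Sketch of a scrape function built only from the query methods listed above.
// `scrapeList` is an illustrative name; flyscrape expects a default export.
function scrapeList({ doc }) {
    const items = doc.find("ul").find("li");

    return {
        count: items.length(),
        first: items.first().text(),
        last: items.last().text(),
        // Keep only the items carrying class "a", then read their text.
        highlighted: items
            .filter((item) => item.hasClass("a"))
            .map((item) => item.text()),
        all: items.map((item) => item.text()),
    };
}
```

Assuming the methods behave as documented above, the result for that list would be `{ count: 3, first: "Item 1", last: "Item 3", highlighted: ["Item 1"], all: ["Item 1", "Item 2", "Item 3"] }`.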

Contributing

We welcome contributions from the community! If you encounter any issues or have suggestions for improvement, please submit an issue.
