eugeneotto edited this page Jul 20, 2013 · 11 revisions

Parsley is a simple language for extracting structured data from web pages. It consists of a powerful Selector Language wrapped in a JSON Structure that describes how the extracted data is laid out for the whole page.

Check out A Simple Tutorial to extract a CSV file of beers by brewery and rating in only a few lines of code.

Parsley has a Command-line Interface, Ruby Bindings, Python Bindings, and a C Interface, and can output to JSON, CSV, and XML.

The following parselet parses a Yelp business listing (no endorsement implied).

{
  "name": "h1",
  "phone": "#bizPhone",
  "address": "address",
  "reviews(.review)": [
    {
      "date": ".date",
      "user_name": ".user-name a",
      "comment": "with-newlines(.review_comment)"
    }
  ]
}
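
As a rough illustration of how the parselet above relates to its output, the Python sketch below derives the skeleton of the resulting JSON from the parselet itself. It does not call Parsley or fetch any page. The helper `output_shape` and the `<text>` placeholder are hypothetical names introduced here, and the sketch assumes that a key such as `reviews(.review)` appears in the output as plain `reviews`, with each selector replaced by the text it matched.

```python
import json
import re

# The parselet from above, embedded as a string so the sketch is self-contained.
PARSELET = """
{
  "name": "h1",
  "phone": "#bizPhone",
  "address": "address",
  "reviews(.review)": [
    {
      "date": ".date",
      "user_name": ".user-name a",
      "comment": "with-newlines(.review_comment)"
    }
  ]
}
"""


def output_shape(node):
    """Hypothetical helper: map a parselet to the skeleton of its output."""
    if isinstance(node, dict):
        # Assumption: a trailing "(selector)" on a key only scopes the match,
        # so "reviews(.review)" becomes "reviews" in the output.
        return {
            re.sub(r"\(.*\)$", "", key): output_shape(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        # A list in the parselet means "one entry per match".
        return [output_shape(item) for item in node]
    # A selector string stands for the text extracted from the page.
    return "<text>"


shape = output_shape(json.loads(PARSELET))
print(json.dumps(shape, indent=2))
```

The printed skeleton has the same nesting as the parselet: three top-level text fields plus a `reviews` list whose entries each carry `date`, `user_name`, and `comment`.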

You can get JSON out by typing:

sh$: parsley businesses.let http://www.yelp.com/biz/amnesia-san-francisco

To crawl the whole site and write a businesses.csv and a reviews.csv (with a foreign key to businesses), run:

sh$: csvget --parselet=businesses.let http://www.yelp.com/biz/amnesia-san-francisco

It’s that easy. Get started with Installation Instructions.

Sites are for example purposes. Please obey robots.txt.