NOTICE!

This project has been archived. The ever-changing structure of Amazon review pages and the newer mechanisms in place to prevent web scraping are formidable.

Scrape Amazon Review Pages

Amazon has a system in place to keep you from scraping their pages. This Python app scrapes each page from a headless Chrome browser instance using the Selenium WebDriver for Chrome.
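As a rough illustration of the headless approach described above, the sketch below loads one URL in headless Chrome and returns the rendered HTML. It is not taken from amzreviewscrape.py: the names HEADLESS_FLAGS and fetch_page_source are hypothetical, and it assumes the Selenium 4 style of passing a Service object, which may differ from the Selenium version this project pinned.

```python
# Flags applied to every browser instance (assumed, not the project's actual list).
HEADLESS_FLAGS = ["--headless"]


def fetch_page_source(url, driver_path):
    """Load a URL in headless Chrome and return the rendered HTML."""
    # Imported inside the function so this module can be imported
    # even on a machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for flag in HEADLESS_FLAGS:
        options.add_argument(flag)

    driver = webdriver.Chrome(service=Service(driver_path), options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always shut the browser down so no chromedriver process is left behind.
        driver.quit()
```

Wrapping the page load in try/finally means the chromedriver process is released even when the page load raises.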

This allows you to feed in a list of Amazon ASINs as a .csv file (no header) and scrape the number of reviews received as well as the number of stars.
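A headerless ASIN file like the one described above could be read along these lines. This is a minimal sketch using only the standard library; the function name read_asins is hypothetical, and it assumes the ASIN sits in the first column of each row.

```python
import csv


def read_asins(csv_path):
    """Return the ASINs from the first column of a headerless CSV file."""
    asins = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Skip blank lines and rows whose first cell is empty.
            if row and row[0].strip():
                asins.append(row[0].strip())
    return asins
```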

Each page of reviews will be scraped, so if you provide a large number of ASINs and/or ASINs with a large number of reviews, it could take some time.
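The page-by-page behaviour can be pictured as a loop that keeps requesting the next page until one comes back empty. The sketch below is an assumption about the general shape, not the project's code: scrape_all_pages, fetch_page, and the max_pages safety cap are all hypothetical names introduced for illustration.

```python
def scrape_all_pages(asin, fetch_page, max_pages=500):
    """Collect reviews page by page until a page comes back empty.

    fetch_page is any callable taking (asin, page_number) and returning
    a list of reviews for that page.
    """
    reviews = []
    for page_number in range(1, max_pages + 1):
        page_reviews = fetch_page(asin, page_number)
        if not page_reviews:
            break  # no more review pages for this ASIN
        reviews.extend(page_reviews)
    return reviews
```

Because the work grows with the total number of review pages across all ASINs, a cap such as max_pages is one way to bound the run time for products with very many reviews.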

Fields that will be retrieved are: 'asin', 'product_title', 'rating', 'review_title', 'variation', 'review_text', 'review-links'
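One way to write rows with these fields to a CSV file is csv.DictWriter from the standard library. The sketch below is illustrative only: write_reviews is a hypothetical name, the field list is copied from the line above, and filling missing fields with an empty string is an assumption about the desired behaviour.

```python
import csv

# Field names copied from the list above.
FIELDS = ["asin", "product_title", "rating", "review_title",
          "variation", "review_text", "review-links"]


def write_reviews(rows, out_path):
    """Write review dictionaries to a CSV file with a header row."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=FIELDS,
            restval="",             # missing fields become empty cells
            extrasaction="ignore",  # unknown keys are dropped, not errors
        )
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
```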

Fair Warning

Web scraping is not an exact science: if a web page's structure changes, or even if something as simple as a class is renamed or a data-hook type attribute is removed, this code will break. This repo could use some foolproofing and more thought, but for now it works, and we're definitely happy to have any contributions.

Debugging Tip

If you hit errors while running or testing, your chromedriver process is likely still running, so make sure to force quit or kill the process in your OS task/process manager.
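If you would rather clean up from a script than from the task manager, something like the following might work. It is a sketch under assumptions: the helper names are hypothetical, the Windows process is assumed to be called chromedriver.exe, and the process elsewhere is assumed to be called chromedriver. It shells out to taskkill on Windows and pkill on other systems.

```python
import platform
import subprocess


def kill_command(system=None):
    """Return the command that force-kills leftover chromedriver processes."""
    system = system or platform.system()
    if system == "Windows":
        # /F forces termination, /IM selects by image name, /T includes children.
        return ["taskkill", "/F", "/IM", "chromedriver.exe", "/T"]
    # -x matches the exact process name, so the scraper itself is not killed
    # just because "chromedriver" appears in its command-line arguments.
    return ["pkill", "-x", "chromedriver"]


def kill_chromedriver():
    """Run the kill command and return its exit code."""
    # check=False: a non-zero exit code usually just means nothing was running.
    return subprocess.run(kill_command(), check=False).returncode
```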

Setup with pipenv

Install all dependencies from the Pipfile:

pipenv install

Usage

Pass the path to your CSV of ASINs (no header) as a command-line argument:

# Windows
py amzreviewscrape.py --asins="C:\PATH\TO\ASINS\FILE.CSV" --driverpath="C:\PATH\TO\CHROMEDRIVER"

# macOS/Linux
python3 amzreviewscrape.py --asins="/path/to/asins/csv" --driverpath="/path/to/chromedriver"

To pass additional options to chromedriver, such as:

--disable-dev-shm-usage
--no-sandbox

use --options with a comma-separated list of option names, without the leading dashes:

python3 amzreviewscrape.py --asins="/path/to/asins/csv" --driverpath="/path/to/chromedriver" --options="disable-dev-shm-usage,no-sandbox"
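The command lines above suggest an argument parser with three flags. The sketch below shows how such flags could be parsed with argparse and how the comma-separated --options value could be turned back into Chrome flags. The flag names come from the usage examples; the function names, help texts, and defaults are assumptions, not the script's actual implementation.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line flags shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Scrape Amazon review pages")
    parser.add_argument("--asins", required=True,
                        help="path to a headerless CSV of ASINs")
    parser.add_argument("--driverpath", default=None,
                        help="path to the chromedriver executable")
    parser.add_argument("--options", default="",
                        help="comma-separated chromedriver options, no dashes")
    return parser.parse_args(argv)


def to_chrome_flags(options):
    """Turn 'disable-dev-shm-usage,no-sandbox' into a list of '--' flags."""
    flags = []
    for name in options.split(","):
        name = name.strip().lstrip("-")  # tolerate spaces and stray dashes
        if name:
            flags.append("--" + name)
    return flags
```

For example, to_chrome_flags("disable-dev-shm-usage,no-sandbox") yields the two flags listed earlier, each of which could then be handed to the browser options object.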

Dependencies:

Requires Python 3.6.3 or later.

This requires ChromeDriver, the Selenium WebDriver for Google Chrome.

You will need to install it separately and either provide its location to amzreviewscrape.py via the --driverpath argument, or install it to /usr/local/bin/chromedriver (macOS/Linux) or C:\chromedriver\chromedriver\ (Windows) so that it is found automatically.
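The lookup order described above (explicit --driverpath first, then a per-platform default) could be sketched as follows. The function and constant names are hypothetical, the Unix default assumes the path is meant to be absolute (with a leading slash), and the Windows default is copied as written.

```python
import platform

# Default locations taken from the text above; the leading slash on the
# Unix path is an assumption.
UNIX_DEFAULT = "/usr/local/bin/chromedriver"
WINDOWS_DEFAULT = "C:\\chromedriver\\chromedriver\\"


def resolve_driver_path(cli_value=None, system=None):
    """Prefer an explicit --driverpath value, else fall back to the OS default."""
    if cli_value:
        return cli_value
    system = system or platform.system()
    if system == "Windows":
        return WINDOWS_DEFAULT
    return UNIX_DEFAULT
```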


The CSV Output currently looks like:

(Screenshot: scrape-output.png)

aflansburg/amzreviewsscrape
