This repository has been archived by the owner on May 20, 2022. It is now read-only.

A multiprocessing webscraper for Coursera.org to build a dataset for all courses with details like ratings, difficulty, etc.

mihirs16/Coursera-Web-Scraper

Coursera-Web-Scraper

Prerequisites

  • Python (tested with 3.7 and above)
  • PhantomJS or any other headless browser for automation
  • Selenium WebDriver for Python
  • Other libraries:
    • pandas
    • multiprocessing (part of the Python standard library)
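
A quick way to confirm the prerequisites before running the scrapers is sketched below. It is not part of the repository: the function name `check_prerequisites` is hypothetical, and the binary name `phantomjs` is an assumption about how the headless browser is installed on your system.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(binary="phantomjs",
                        modules=("selenium", "pandas", "multiprocessing")):
    """Return a list of problems; an empty list means every check passed.

    The default binary name and module list are assumptions based on the
    prerequisites listed above, not values read from the scripts.
    """
    problems = []
    # The README states the scripts were tested with Python 3.7 and above.
    if sys.version_info < (3, 7):
        problems.append("Python 3.7 or newer is required")
    # shutil.which looks the executable up on PATH, as a shell would.
    if shutil.which(binary) is None:
        problems.append(f"'{binary}' was not found on PATH")
    # find_spec returns None when a top-level module cannot be imported.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python module '{name}' is not installed")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing:", problem)
```

If the script prints nothing, the headless browser and the libraries were all found.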

Instructions

  • Make sure you download PhantomJS (included in the dependency folder if you clone) and add it to your PATH.
  • Clone the repository.
git clone https://github.com/mihirs16/Coursera-Web-Scraper
  • Run coursera_scraper.py to build the list of courses.
python coursera_scraper.py
  • Now run coursera_deep_scraper.py to retrieve the details of all the courses in the list.
python coursera_deep_scraper.py
  • If you get a "connection closed" error, Coursera.org is blocking your requests, which means the script is exceeding what the website's policy allows. Please exercise caution and be respectful.
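
One common way to react to a "connection closed" error is to pause and retry with growing delays instead of hammering the site. The sketch below illustrates that idea only; it is not how the repository's scripts behave. The names `fetch_with_backoff` and `fetch` are hypothetical, and catching `ConnectionError` is an assumption, since the exception your driver actually raises may differ.

```python
import time


def fetch_with_backoff(fetch, url, retries=4, base_delay=2.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on failure.

    `fetch` is any callable that loads a page (hypothetical here).
    `retry_on` lists the exception types treated as "connection closed";
    the default is an assumption, so adjust it to what your driver raises.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except retry_on:
            if attempt == retries:
                raise  # give up after the last allowed attempt
            # Wait 2 s, 4 s, 8 s, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. Even with retries in place, a repeated block is a signal to slow down or stop rather than to retry harder.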

Disclaimer

  1. Data is all around us, but this doesn't mean we own it. Please respect the policies of Coursera and the website Coursera.org, and avoid spamming it with continuous zero-delay or parallel requests. You can check the website's scraping policy here.
  2. Since Coursera and other websites take measures to handle multiple parallel requests, you might face a "Connection closed" error during long periods of scraping. You can continue by restarting the script: provided the files it has written so far are left undisturbed, the script compares the courses already scraped against the list of courses and works out which are left, letting you continue from where you left off. (Once again, please respect the website and its policies.)
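
The resume behaviour described in point 2 amounts to subtracting the courses already scraped from the full list. A minimal sketch of that idea follows; the function name `remaining_courses` and the use of plain lists of course identifiers are assumptions for illustration, and the actual scripts may store and compare their data differently.

```python
def remaining_courses(all_courses, scraped_courses):
    """Return the courses not yet scraped, keeping the order of the full list."""
    done = set(scraped_courses)  # set membership tests are fast
    return [course for course in all_courses if course not in done]


# Hypothetical identifiers, for illustration only.
all_courses = ["course-a", "course-b", "course-c"]
scraped = ["course-a"]
print(remaining_courses(all_courses, scraped))  # ['course-b', 'course-c']
```

Keeping the original order means a restarted run picks up where the interrupted one stopped.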

Dataset

The Coursera Courses Dataset is uploaded here.