Coursera-Web-Scraper

Prerequisites

Python (tested on 3.7 and above)
PhantomJS or any other headless browser for automation
Selenium WebDriver for Python
Other libraries:
- pandas
- multiprocessing (part of the Python standard library)

Instructions

Make sure you download PhantomJS (included in the dependency folder if you clone the repository) and add it to your PATH.
Clone the repository.

git clone https://github.com/mihirs16/Coursera-Web-Scraper
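Before running the scripts, it can help to confirm that the headless browser is actually reachable from your PATH. The following is a minimal sketch using only the standard library; the helper name `find_on_path` is hypothetical and not part of this repository, and the executable name `phantomjs` is an assumption that may differ on your system (for example `phantomjs.exe` on Windows).

```python
import shutil


def find_on_path(executable="phantomjs"):
    """Return the full path of `executable` if it is on PATH, else None."""
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_on_path()
    if location is None:
        print("phantomjs was not found on PATH")
    else:
        print(f"phantomjs found at {location}")
```

If the check reports that the executable was not found, add the folder that contains it to your PATH and run the check again.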

First run coursera_scraper.py to retrieve a list of all courses from the Coursera course directory.

python coursera_scraper.py

Now run coursera_deep_scraper.py to retrieve the details of all the courses in the list.

python coursera_deep_scraper.py

If you get a connection closed error, Coursera.org is blocking your requests, which means the script is exceeding what the website allows. Please exercise caution and respect the website's limits.
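One common way to react to a closed connection is to wait and retry with an increasing delay instead of retrying immediately. The sketch below illustrates that idea and is not taken from the repository's scripts: `retry_with_backoff` and its parameters are hypothetical, and catching the built-in `ConnectionError` is an assumption, since the exception your browser driver raises may be a different class.

```python
import time


def retry_with_backoff(task, retries=4, base_delay=2.0, sleep=time.sleep):
    """Call task(); on ConnectionError, wait and try again.

    The wait doubles after every failed attempt:
    base_delay, 2 * base_delay, 4 * base_delay, ...
    The last failure is re-raised so the caller can stop scraping.
    """
    for attempt in range(retries):
        try:
            return task()
        except ConnectionError:  # assumption: adapt to your driver's exception
            if attempt == retries - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)
```

Here `task` would be a zero-argument function that loads one page. The `sleep` parameter exists only so the waiting can be replaced in tests. If the error keeps coming back even with long delays, stop rather than raise the retry count.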

Disclaimer

Data is all around us, but that doesn't mean we own it. Please respect the policies of Coursera and of the Coursera.org website, and avoid spamming it with continuous zero-delay or parallel requests. You can check the website's scraping policy here.
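A website's crawling rules are usually published in its robots.txt file, and Python's standard library can interpret them. The sketch below shows one way to check a URL against such rules. The helper `check_rules` is hypothetical, and the rules in the usage example are invented for illustration; they are not Coursera's actual policy, which you should read from the website itself.

```python
from urllib.robotparser import RobotFileParser


def check_rules(robots_txt, url, user_agent="*"):
    """Return (allowed, crawl_delay) for `url` under the given robots.txt text.

    crawl_delay is None when the rules do not specify one.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


# Invented example rules, NOT the real robots.txt of any website.
EXAMPLE_RULES = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

if __name__ == "__main__":
    print(check_rules(EXAMPLE_RULES, "https://example.com/courses"))
    print(check_rules(EXAMPLE_RULES, "https://example.com/private/page"))
```

To check a live website, `RobotFileParser` also offers `set_url()` and `read()` to download the file instead of passing the text in. When a crawl delay is given, sleeping at least that long between requests is a reasonable minimum.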
Since Coursera, like most websites, takes measures against multiple parallel requests, you may run into a "Connection closed" error during a long scraping session. You can continue by restarting the script. If its output is left undisturbed, the script compares the courses already scraped against the full list of courses and works out which ones are left, letting you continue from where you left off. (Once again, please respect the website and its policies.)
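The resume behaviour described above amounts to a set difference between the full course list and the courses already scraped. The following sketch shows that idea only; it is not the code in coursera_deep_scraper.py. The function name `pending_courses` and the `url` column are assumptions, since the actual column names of the two CSV files are not stated here, and the standard `csv` module is used to keep the example dependency-free.

```python
import csv
import io


def pending_courses(course_list_file, details_file, key="url"):
    """Return rows of the course list whose `key` is missing from the details.

    Both arguments are open text files (or file-like objects) in CSV format
    with a header row that contains the `key` column.
    """
    done = {row[key] for row in csv.DictReader(details_file)}
    return [row for row in csv.DictReader(course_list_file) if row[key] not in done]


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two CSV files (invented data).
    course_list = io.StringIO("name,url\nA,/learn/a\nB,/learn/b\nC,/learn/c\n")
    details = io.StringIO("url,rating\n/learn/a,4.5\n")
    for row in pending_courses(course_list, details):
        print(row["name"], row["url"])
```

With real files you would pass the results of `open(..., newline="")` instead of the in-memory objects, and scrape only the rows that are returned.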

Dataset

The Coursera Courses Dataset is uploaded here.

