scrapio

A Python library for scraping complete webpages, including HTML, JavaScript, CSS, and favicons. Just plug and play.

Webpage Content Downloader Python Library

A Python library that allows you to download the HTML, JavaScript, CSS, and favicons of a webpage. This library is useful for web scraping, archiving web pages, or analyzing web content locally.

Usage

To use this code, install the following libraries on your system:

pandas
BeautifulSoup
Selenium
webdriver-manager
requests
pillow

pip install pandas
pip install beautifulsoup4
pip install selenium
pip install webdriver-manager
pip install requests
pip install pillow
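Several of the packages above are imported under a different name than the one used with pip (for example, beautifulsoup4 is imported as bs4 and pillow as PIL). As a quick sanity check after installation, a minimal sketch like the following can report which packages are still missing. The function name missing_packages and the REQUIRED mapping are illustrative and not part of scrapio.

```python
from importlib.util import find_spec

# pip package name -> top-level module name used in "import ..."
REQUIRED = {
    "pandas": "pandas",
    "beautifulsoup4": "bs4",
    "selenium": "selenium",
    "webdriver-manager": "webdriver_manager",
    "requests": "requests",
    "pillow": "PIL",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```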

Set LogFile to the name you want for the log file (the extension must be .xlsx). Set mainPage_URL to the PhishTank search page that lists legitimate URLs if you want to scrape legitimate sites, or to the page that lists phishy URLs if you want to scrape phishy sites. Use only one of the two assignments shown here; pageNo is the page number of the PhishTank search results.

# mainPage_URL for the webpage containing list of legitimate URLs
mainPage_URL = f"https://phishtank.org/phish_search.php?page={pageNo}&valid=n&Search=Search"


# mainPage_URL for the webpage containing list of Phishy URLs
mainPage_URL = f"https://phishtank.org/phish_search.php?page={pageNo}&active=y&valid=y&Search=Search"
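Instead of editing the assignment by hand, the two URL variants can be generated from one helper. The sketch below reproduces exactly the two URLs shown above; the function name build_main_page_url and the phishy flag are hypothetical and do not appear in main.py.

```python
from urllib.parse import urlencode

PHISHTANK_SEARCH = "https://phishtank.org/phish_search.php"


def build_main_page_url(page_no: int, phishy: bool) -> str:
    """Build the PhishTank search URL for one page of results.

    phishy=True  -> page listing phishy URLs (active=y, valid=y)
    phishy=False -> page listing legitimate URLs (valid=n)
    """
    if phishy:
        params = {"page": page_no, "active": "y", "valid": "y", "Search": "Search"}
    else:
        params = {"page": page_no, "valid": "n", "Search": "Search"}
    # dicts keep insertion order, so the query string matches the README URLs
    return f"{PHISHTANK_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_main_page_url(0, phishy=False))
    print(build_main_page_url(0, phishy=True))
```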

Documentation

The code scrapes a list of URLs retrieved from the PhishTank database. For each URL in the list, it captures the following resources:

The HTML code of the landing page.
JavaScript content (both inline and external).
CSS content (both inline and external).
Images found on the landing page.
The website's favicon.
A screenshot of the landing page.
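To illustrate how the static resources in the list above can be located in a page's HTML, here is a minimal sketch. It is not the code from main.py: scrapio uses BeautifulSoup and Selenium, while this example swaps in Python's standard-library html.parser so that it runs without extra dependencies. The class name ResourceCollector and its attributes are hypothetical. Downloading the files and taking the screenshot are not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect script, stylesheet, image, and favicon references from HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.external_js, self.inline_js = [], []
        self.external_css, self.inline_css = [], []
        self.images, self.favicons = [], []
        self._current = None  # "script" or "style" while inside an inline block
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            if attrs.get("src"):
                self.external_js.append(urljoin(self.base_url, attrs["src"]))
            else:
                self._current, self._buffer = "script", []
        elif tag == "style":
            self._current, self._buffer = "style", []
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            href = urljoin(self.base_url, attrs["href"])
            if "stylesheet" in rel:
                self.external_css.append(href)
            elif "icon" in rel:  # matches rel="icon" and rel="shortcut icon"
                self.favicons.append(href)
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))

    def handle_data(self, data):
        # Text is only kept while inside an inline <script> or <style>
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            if text:
                target = self.inline_js if tag == "script" else self.inline_css
                target.append(text)
            self._current = None


if __name__ == "__main__":
    sample = '<link rel="icon" href="/favicon.ico"><script>var x = 1;</script>'
    collector = ResourceCollector("https://example.com/")
    collector.feed(sample)
    print(collector.favicons, collector.inline_js)
```

Relative references are resolved against the page URL with urljoin, so each collected entry can be passed directly to a download step such as requests.get.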

This process allows for the extraction and analysis of multiple types of data from each URL, which can be useful for various purposes such as security analysis, content archiving, and data extraction.

Contribution

Contributors:

Patel Shahil Manishbhai (Indian Institute of Technology, Dharwad, India)
Shivam Pradip Tirmare (Indian Institute of Technology, Dharwad, India)
Aditya Kulkarni (Indian Institute of Technology, Dharwad, India)
Vivek Balachandran (Singapore Institute of Technology, Singapore)
Tamal Das (Indian Institute of Technology, Dharwad, India)
