This repository contains Python code for web scraping. The code can extract HTML, CSS, JavaScript, favicons, and screenshots from web pages. It uses PhishTank as the source of the URLs to scrape. The code was developed as part of a research paper published at IEEE SOLI 2023.

ShahilPatel-IITDh/scrapio

scrapio

A Python library for scraping complete webpages, including HTML, JavaScript, CSS, and favicons. Just plug and play.

Webpage Content Downloader Python Library

A Python library that allows you to download the HTML, JavaScript, CSS, and favicons of a webpage. This library is useful for web scraping, archiving web pages, or analyzing web content locally.

Usage

To use this code, install the following libraries on your system:

  • pandas
  • BeautifulSoup
  • Selenium
  • webdriver-manager
  • requests
  • pillow

pip install pandas
pip install beautifulsoup4
pip install selenium
pip install webdriver-manager
pip install requests
pip install pillow
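
Some of the pip package names above differ from the module names used in import statements (for example, beautifulsoup4 is imported as bs4 and pillow as PIL). The following is a minimal sketch, not part of the repository, for checking that the dependencies are importable before running the scraper. The names REQUIRED and missing_packages are hypothetical helpers introduced here for illustration.

```python
import importlib.util

# pip distribution name -> module name used in `import` statements.
# (Hypothetical helper table; not taken from the repository.)
REQUIRED = {
    "pandas": "pandas",
    "beautifulsoup4": "bs4",
    "selenium": "selenium",
    "webdriver-manager": "webdriver_manager",
    "requests": "requests",
    "pillow": "PIL",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All dependencies are installed.")
```

Running the script prints a ready-to-copy pip command for anything that is still missing.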

Change the name of LogFile to whatever you need (the extension must be .xlsx). Set mainPage_URL to the PhishTank page that lists legitimate URLs if you want to scrape legitimate sites, or to the page that lists phishing URLs if you want to scrape phishing sites.

# Use only one of the two assignments below.
# pageNo is assumed to hold the number of the PhishTank results page to load.

# mainPage_URL for the page listing legitimate URLs
mainPage_URL = f"https://phishtank.org/phish_search.php?page={pageNo}&valid=n&Search=Search"

# mainPage_URL for the page listing phishing URLs
mainPage_URL = f"https://phishtank.org/phish_search.php?page={pageNo}&active=y&valid=y&Search=Search"
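
As an illustration of how the two URL patterns above could be generated for a range of result pages, here is a small sketch. The function names build_main_page_url and main_page_urls, the phishing flag, and the page range are assumptions made for this example; the repository itself only shows the two f-strings.

```python
BASE_URL = "https://phishtank.org/phish_search.php"


def build_main_page_url(page_no, phishing=True):
    """Build the search URL for one results page.

    phishing=True  -> active, verified phishing URLs (active=y&valid=y)
    phishing=False -> the listing used for legitimate URLs (valid=n)
    """
    query = "active=y&valid=y" if phishing else "valid=n"
    return f"{BASE_URL}?page={page_no}&{query}&Search=Search"


def main_page_urls(first_page, last_page, phishing=True):
    """Return the search URLs for pages first_page..last_page inclusive."""
    return [
        build_main_page_url(page_no, phishing)
        for page_no in range(first_page, last_page + 1)
    ]


if __name__ == "__main__":
    for url in main_page_urls(0, 2, phishing=False):
        print(url)
```

Keeping the query string in one place means switching between the legitimate and phishing listings is a single argument change rather than an edit to the URL itself.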

Documentation

The code scrapes a list of URLs retrieved from the PhishTank database. For each URL in the list, it captures the following resources:

  • The HTML code of the landing page.
  • JavaScript content (both inline and external).
  • CSS content (both inline and external).
  • Images found on the landing page.
  • The website's favicon.
  • A screenshot of the landing page.

This process allows for the extraction and analysis of multiple types of data from each URL, which can be useful for various purposes such as security analysis, content archiving, and data extraction.
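
The resource types listed above can be located in a page's HTML roughly as follows. This is a dependency-free sketch that uses the standard library's html.parser in place of BeautifulSoup, so it is not the repository's implementation. ResourceCollector and collect_resources are hypothetical names, and the fallback to /favicon.ico is an assumption based on common browser behaviour. Downloading the files and taking the screenshot are out of scope here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect inline and external JS/CSS, image URLs, and favicon URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.external_js, self.inline_js = [], []
        self.external_css, self.inline_css = [], []
        self.images, self.favicons = [], []
        self._capture = None  # "script" or "style" while inside an inline block
        self._buffer = []     # text chunks of the current inline block

    def _absolute(self, link):
        # Resolve relative links against the page URL.
        return urljoin(self.base_url, link)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            if attrs.get("src"):
                self.external_js.append(self._absolute(attrs["src"]))
            else:
                self._capture, self._buffer = "script", []
        elif tag == "style":
            self._capture, self._buffer = "style", []
        elif tag == "img" and attrs.get("src"):
            self.images.append(self._absolute(attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.external_css.append(self._absolute(attrs["href"]))
            elif "icon" in rel:
                self.favicons.append(self._absolute(attrs["href"]))

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            text = "".join(self._buffer).strip()
            if text:
                target = self.inline_js if tag == "script" else self.inline_css
                target.append(text)
            self._capture, self._buffer = None, []


def collect_resources(html, base_url):
    """Parse html and return a populated ResourceCollector."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    collector.close()
    if not collector.favicons:
        # Assumed fallback: the conventional /favicon.ico at the site root.
        collector.favicons.append(urljoin(base_url, "/favicon.ico"))
    return collector
```

A caller would pass the HTML of a landing page together with its URL, then download each collected link (for example with requests) and save the inline snippets alongside it.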

Contribution

Contributors:

  • Patel Shahil Manishbhai (Indian Institute of Technology, Dharwad, India)
  • Shivam Pradip Tirmare (Indian Institute of Technology, Dharwad, India)
  • Aditya Kulkarni (Indian Institute of Technology, Dharwad, India)
  • Vivek Balachandran (Singapore Institute of Technology, Singapore)
  • Tamal Das (Indian Institute of Technology, Dharwad, India)

License
