Python Web Crawler implementing Iterative Deepening Depth-First Search

Description

This is a web crawler built in Python that implements Iterative Deepening Depth-First Search (IDDFS) to scrape all of the child links of a specified base URL, up to a specified depth. While scraping, the program saves each page's HTML into a text file and then runs a Unigram Feature Extractor on those files. Once the dependencies are installed and the program is running, simply enter a URL and the depth you want to search, then press Submit. The program will start scraping sites; once a site is saved to the html_files folder, it is added to a list in the user interface, and clicking an entry displays that site's unigram features as a graph.

Note: Written in Python 3
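
The approach described above can be summarized in a short sketch. This is a minimal illustration of iterative deepening over a depth-limited crawl, saving each page's HTML and counting unigrams; it is not the project's actual code. The names iddfs_crawl, depth_limited_crawl, and extract_unigrams, and the crude regex-based link extraction and tag stripping, are assumptions made only for this example.

import os
import re
from collections import Counter
from urllib.parse import urljoin
from urllib.request import urlopen

LINK_RE = re.compile(r'href="(http[^"]+)"')

def fetch(url):
    # Download a page's HTML, returning an empty string on failure.
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return ""

def depth_limited_crawl(url, limit, visited, out_dir):
    # Visit url and its child links down to the given depth limit.
    if limit < 0 or url in visited:
        return
    visited.add(url)
    html = fetch(url)
    if not html:
        return
    # Save the page's HTML into a text file, as described above.
    name = re.sub(r"\W+", "_", url) + ".txt"
    with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
        f.write(html)
    for link in LINK_RE.findall(html):
        depth_limited_crawl(urljoin(url, link), limit - 1, visited, out_dir)

def iddfs_crawl(base_url, max_depth, out_dir="html_files"):
    # Iterative deepening: rerun the depth-limited crawl with growing limits.
    os.makedirs(out_dir, exist_ok=True)
    for limit in range(max_depth + 1):
        depth_limited_crawl(base_url, limit, set(), out_dir)

def extract_unigrams(path):
    # Count word (unigram) frequencies in a saved HTML file.
    with open(path, encoding="utf-8") as f:
        text = re.sub(r"<[^>]+>", " ", f.read())  # crude tag stripping
    return Counter(w.lower() for w in re.findall(r"[A-Za-z]+", text))

A call such as iddfs_crawl("http://example.com", 2) would fill html_files/ with one text file per visited page, and extract_unigrams on any of those files yields the word counts that the user interface plots.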

Usage

Installing dependencies

To install the Python modules listed in the requirements.txt file:

make install

Running tests

To run the test modules inside the src package:

make test

Starting the application

To execute the code in the main.py script:

make start

Creating an .exe file (Windows)

To create an .exe file in the build package:

make executable

Makefile

The Makefile contains shortcuts for the project's frequently used commands.

Note: Make sure you have Make installed.
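
If Make is not available, the underlying commands can be run directly. The exact recipes live in the repository's Makefile; the lines below are only assumptions about what each target typically runs (the test runner and the packaging tool in particular are guesses, and the project may use different ones).

pip install -r requirements.txt    # make install (assumed recipe)
python -m unittest discover src    # make test (assumed test runner)
python main.py                     # make start
pyinstaller main.py                # make executable (assumed packaging tool)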

Python3

If you have Python installed but run into issues, make sure your shell is pointing to the right version of Python by running python --version.
