Python Web Crawler implementing Iterative Deepening Depth-First Search

Description

This is a web crawler built in Python that implements Iterative Deepening Depth-First Search (IDDFS) to scrape all of the child links of a specified base URL, up to a specified depth. While scraping, the program saves each page's HTML into a text file and then runs a Unigram Feature Extractor on those files. Once the dependencies are installed and the program is running, simply enter a URL and the depth you want to search, then press Submit. The program will start scraping sites; once a site is saved to the html_files folder, it is added to a list in the user interface, and clicking an entry displays that site's unigram features as a graph.

Note: Written in Python 3
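
The approach described above can be summarized in a short sketch. This is a minimal illustration of iterative deepening over a depth-limited crawl, saving each page's HTML and counting unigrams; it is not the project's actual code. The names iddfs_crawl, depth_limited_crawl, and extract_unigrams, and the crude regex-based link extraction and tag stripping, are assumptions made only for this example.

import os
import re
from collections import Counter
from urllib.parse import urljoin
from urllib.request import urlopen

LINK_RE = re.compile(r'href="(http[^"]+)"')

def fetch(url):
    # Download a page's HTML, returning an empty string on failure.
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return ""

def depth_limited_crawl(url, limit, visited, out_dir):
    # Visit url and its child links down to the given depth limit.
    if limit < 0 or url in visited:
        return
    visited.add(url)
    html = fetch(url)
    if not html:
        return
    # Save the page's HTML into a text file, as described above.
    name = re.sub(r"\W+", "_", url) + ".txt"
    with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
        f.write(html)
    for link in LINK_RE.findall(html):
        depth_limited_crawl(urljoin(url, link), limit - 1, visited, out_dir)

def iddfs_crawl(base_url, max_depth, out_dir="html_files"):
    # Iterative deepening: rerun the depth-limited crawl with growing limits.
    os.makedirs(out_dir, exist_ok=True)
    for limit in range(max_depth + 1):
        depth_limited_crawl(base_url, limit, set(), out_dir)

def extract_unigrams(path):
    # Count word (unigram) frequencies in a saved HTML file.
    with open(path, encoding="utf-8") as f:
        text = re.sub(r"<[^>]+>", " ", f.read())  # crude tag stripping
    return Counter(w.lower() for w in re.findall(r"[A-Za-z]+", text))

A call such as iddfs_crawl("http://example.com", 2) would fill html_files/ with one text file per visited page, and extract_unigrams on any of those files yields the word counts that the user interface plots.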

Usage

Installing dependencies

To install the Python modules listed in the requirements.txt file:

make install

Running tests

To run the test modules inside the src package:

make test

Starting the application

To execute the code in the main.py script:

make start

Creating an .exe file (Windows)

To create an .exe file in the build package:

make executable

Makefile

The Makefile contains shortcuts for the project's frequently used commands.

Note: Make sure you have Make installed.
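
If Make is not available, the underlying commands can be run directly. The exact recipes live in the repository's Makefile; the lines below are only assumptions about what each target typically runs (the test runner and the packaging tool in particular are guesses, and the project may use different ones).

pip install -r requirements.txt    # make install (assumed recipe)
python -m unittest discover src    # make test (assumed test runner)
python main.py                     # make start
pyinstaller main.py                # make executable (assumed packaging tool)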

Python3

If you have Python installed but run into issues, make sure your shell is pointing to the right version of Python by running python --version.
