
A simple-to-use, multi-threaded web crawler written in C with libcURL and Lexbor.

Dependencies

MapPTTH uses:

  • libcURL (>= 7.62.0)
  • Lexbor (see Installation if you don't want to install it on your system)
  • libxml2
  • libPCRE
  • CMake (>= 3.1.0)

Optional

  • GraphViz (libgvc and libcgraph): generate graphs
  • libcheck: unit tests

Installation

Dependencies

On Ubuntu (with GraphViz support):

sudo apt install cmake libpcre3-dev libcurl4-openssl-dev libxml2-dev libgraphviz-dev

Cloning and building

If you don't have Lexbor installed and don't want to install it system-wide, you can clone it as a submodule along with MapPTTH and build without installing anything:

git clone --recurse-submodules https://github.com/A3onn/mapptth/
cd mapptth/
mkdir build/ && cd build/
cmake .. && make -j5

If you have all dependencies installed on your system:

git clone https://github.com/A3onn/mapptth/
cd mapptth/
mkdir build/ && cd build/
cmake .. && make -j5

Generate tests

If you want to generate the unit tests, enable the corresponding CMake option when configuring (libcheck is required).
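A minimal sketch of how this could look at configure time; the option name MAPPTTH_GENERATE_TESTS is an assumption, not confirmed by this README, so check the project's CMakeLists.txt for the actual name:

cmake -DMAPPTTH_GENERATE_TESTS=1 .. && make -j5   # option name is assumed, verify in CMakeLists.txt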

GraphViz support

If GraphViz is found on the system when running CMake, you will be able to generate graphs.

If you want to disable it, you can run cmake -DMAPPTTH_NO_GRAPHVIZ=1 .. instead of cmake ...

How to use

Parameters

The only required argument is a URL, which specifies where the crawler will start crawling.

Here is the list of available parameters grouped by category:

Connection

| Description | Option |
| --- | --- |
| URL where to start crawling; if several URLs are given, the last one is used. (REQUIRED) | <URL> |
| String to use as the user-agent. Sending the User-Agent header can be disabled by giving an empty string. (default='MAPPTTH/') | -U <user-agent> |
| Timeout in seconds for each connection. If a connection times out, an error is printed to standard error, but no information about the URL. (default=3) | -m <timeout> |
| Only resolve to IPv4 addresses. | -4 |
| Only resolve to IPv6 addresses. | -6 |
| Add headers to the HTTP request, in the form "<key>:<value>;"; the ':' and the value are optional, but each header has to end with a ';'. | -Q <header> |
| Allow insecure connections when using SSL/TLS. | -i |
| Add cookies to the HTTP request, in the form "<key>:<value>;"; multiple cookies can be given at once by separating them with ';'. Note that they won't be modified during the crawl. | -C <cookies> |
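For example (example.com, the header, and the cookie values here are only placeholders), the connection options can be combined like this:

mapptth https://example.com -U "MyAgent/1.0" -m 5 -Q "X-Example:1;" -C "lang:en;theme:dark;"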

Controlling where the crawler goes

| Description | Option |
| --- | --- |
| Allow the crawler to go into subdomains of the initial URL and of the allowed domains. (default=false) | -s |
| Allow the crawler to go to these domains. | -a <domain> |
| Disallow the crawler from going to these domains. | -d <domain> |
| Only fetch URLs whose path starts with one of these paths. Can be a regex (extended and case-sensitive). | -p <path or regex> |
| Don't fetch URLs whose path starts with one of these paths. Can be a regex (extended and case-sensitive). | -P <path or regex> |
| Maximum depth of paths. Paths deeper than this won't be fetched. | -D <depth> |
| Only fetch URLs with HTTP as scheme (don't forget to add '-r 80' if you start with an 'https://' URL). | -f |
| Only fetch URLs with HTTPS as scheme (don't forget to add '-r 443' if you start with an 'http://' URL). | -F |
| Only fetch files with these extensions. If a URL has no extension, this filter won't apply. | -x .<extension> |
| Don't fetch files with these extensions. If a URL has no extension, this filter won't apply. | -X .<extension> |
| Allow the crawler to go to these ports. | -r <port> |
| Keep the query part of URLs. Note that if the same URL is found with different queries, each variant will be fetched. | -q |
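For example, to stay on the start domain and its subdomains, also allow another domain, skip one path, and only fetch .html and .php files down to a depth of 4 (the domains and paths are placeholders):

mapptth https://example.com -s -a example.org -P /private -x .html -x .php -D 4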

Parsing

| Description | Option |
| --- | --- |
| Only parse the <head> part. | -H |
| Only parse the <body> part. | -B |
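For example, to only parse the <body> of each page (the URL is a placeholder):

mapptth https://example.com -B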

Output

| Description | Option |
| --- | --- |
| Don't print with colors. | -c |
| Print the title of the page, if there is one, when displaying a URL. | -T |
| File to write the output into (without colors). | -o <file name> |
| Print a summary of what was found as a directory structure. | -O |
| Print when encountering tel: and mailto: URLs. | -I |
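For example, to print page titles, show the directory-structure summary, and also write the output (without colors) to a file (the URL and file name are placeholders):

mapptth https://example.com -T -O -o output.txt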

Graph

MapPTTH must be compiled with GraphViz support.

| Description | Option |
| --- | --- |
| Create a graph. | -g |
| Change the layout of the graph. (default='sfdp') | -L <layout> |
| Change the output graph file format. (default='png') | -G <format> |
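For example, to generate a graph using the 'dot' layout and write it as an SVG file (assuming the local GraphViz installation provides this layout and format; the URL is a placeholder):

mapptth https://example.com -g -L dot -G svg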

Other

| Description | Option |
| --- | --- |
| Number of threads that will fetch URLs. (default=5) | -t <number of threads> |
| Parse the site's sitemap; this should speed up the crawler and may provide URLs that couldn't be found otherwise. | -S <URL of the sitemap> |
| Parse the site's robots.txt; paths found in 'Allow' and 'Disallow' directives are added to the list of found URLs. Other directives are ignored. | -R <URL of the robots.txt file> |
| URL of the proxy to use. | -z <URL of the proxy> |
| Print the help. | -h |
| Print the version. | -V |
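For example, to crawl with 20 threads, seed the crawl from the sitemap and robots.txt, and go through a local proxy (all URLs here are placeholders):

mapptth https://example.com -t 20 -S https://example.com/sitemap.xml -R https://example.com/robots.txt -z http://127.0.0.1:8080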

You can stop the crawler with CTRL-C at any moment; it will stop gracefully and finish as normal.

Examples

Simple crawl:

mapptth https://google.com

Start crawling at a certain URL:

mapptth https://google.com/some/url/file.html

More threads:

mapptth https://google.com -t 10

Allow crawling into subdomains (e.g. www.google.com, mail.google.com, www.mail.google.com):

mapptth https://google.com -s

Allow crawling certain domains and their subdomains (e.g. www.google.com, mail.gitlab.com, www.mail.github.com):

mapptth http://google.com -s -a gitlab.com -a github.com -r 443

Disallow some paths:

mapptth https://google.com -P /path -P /some-path

Disallow a path and only fetch .html and .php files:

mapptth https://google.com -P /some-path -x .html -x .php

Only crawl in the /path directory:

mapptth https://google.com -p /path

A more complete and complicated one:

mapptth https://google.com/mail -x .html -P /some-path -t 10 -m 5 -s -q -D 6 -T -o output.txt -H -S http://www.google.com/sitemap.xml

TODO

ASAP:

  • Handling the <base> tag

Without any priority:

  • Add a parameter to control the connection rate

  • Create logo (maybe)

  • Print when encountering mailto: or tel:

  • Add robots.txt parser

  • Add proxy support

  • Use regex in filters (disallowed paths, allowed paths, etc...)

  • Add examples in the README

  • More unit tests

  • Use only getopt to parse arguments

  • GraphViz support to generate graphs

  • Output to file

  • Add parameters to control: disallowed domains, only allowed paths and disallowed extensions
