Web Crawler

Async multi-threaded Web Crawler based on Vert.x framework.

Contains two main commands:

crawl - for crawling the Web
find - for multi-threaded searching word in directory recursively

Known limitations

Crawls only HTML files
Only UTF-8 encoding is supported (for searching too)
The option --resolveLinks is experimental and works only with anchor links (tag <a>) which point the html files

How to package it?

mvn clean package

vertx-crawler-1.0-SNAPSHOT-fat.jar file will be created at target directory

How to run Crawl command?

java -jar target/vertx-crawler-1.0-SNAPSHOT-fat.jar crawl [--conf=<config>] [--dir=<directory>] [--depth=<depth>] 
                                                          [--delay=<delay>] [--downloads=<downloads>] 
                                                          [--loaders=<loaders>] [--parsers=<parsers>] 
                                                          [--linksToFiles=<linksToFiles>] 
                                                          [--storeOriginals=<storeOriginals>] url

Options and Arguments:

--conf <config>                     Specifies configuration that should be
                                    provided to the verticle. <config>
                                    should reference either a text file
                                    containing a valid JSON object which
                                    represents the configuration OR be a
                                    JSON string. There is a sample config
                                    file in project root.
--dir <directory>                   Specifies directory for downloaded
                                    files. Defaults is 'output'.  
--depth <depth>                     Specifies how deeply crawler must dig.
                                    Defaults is 5.  
--delay <delay>                     Specifies how many milliseconds must be
                                    delayed between requests. 
                                    Defaults is 200.  
--downloads <downloads>             Specifies how many simultaneous
                                    downloads can be started for one loader.
                                    Defaults is 10.
--loaders <loaders>                 Specifies how many loaders instances
                                    will be deployed. Defaults is 1.  
--parsers <parsers>                 Specifies how many parsers instances
                                    will be deployed. Defaults is available
                                    processors.
--resolveLinks <resolveLinks>       (Experimental) specifies would be links 
                                    with relative urls resolved to absolute 
                                    ones (otherwise they will not be 
                                    clickable). Defaults is true.
--linksToFiles <linksToFiles>       (Experimental) specifies would be links
                                    in html documents changed to point the
                                    downloaded files. Have effect only 
                                    if --resolvedLinks is true.
                                    Defaults is false.
--storeOriginals <storeOriginals>   Specifies would be original html
                                    documents stored after updating links or
                                    not. Defaults is false.
<url>                               Web site url for crawling.

How to run Find command?

java -jar target/vertx-crawler-1.0-SNAPSHOT-fat.jar find [--dir=<directory>] [--ext=<extension>] 
                                                         [--finders=<finders>] [--sensitive] [--whole] word

Options and Arguments:

--dir <directory>          Specifies directory for searching. Defaults is
                           current directory.
--ext <extension>          Specifies file extension for searching. If is
                           present then word will be searching only in files
                           with same extension. Defaults is * means all
                           extensions.
--finders <finders>        Specifies how many finders instances will be
                           deployed. Defaults is 2.
--sensitive                Will the search case sensitive or not. Defaults
                           is false.
--whole                    Search for whole word only or not. Defaults is
                           false.
<word>                     The word for searching.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
src		src
README.md		README.md
config.json		config.json
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

src

src

README.md

README.md

config.json

config.json

pom.xml

pom.xml

Repository files navigation

Web Crawler

Known limitations

How to package it?

How to run Crawl command?

How to run Find command?

About

Releases

Packages

Languages

Megaprog/vertx-crawler

Folders and files

Latest commit

History

Repository files navigation

Web Crawler

Known limitations

How to package it?

How to run Crawl command?

How to run Find command?

About

Resources

Stars

Watchers

Forks

Languages