WiToKit

Welcome to WiToKit, a Python toolkit to download and generate preprocessed Wikipedia dumps for all languages.

WiToKit can be used to convert a Wikipedia archive into a single .txt file, with one (tokenized) sentence per line.

Note: WiToKit currently only supports xx-pages-articles.xml.xx.bz2 Wikipedia archives corresponding to articles, templates, media/file descriptions, and primary meta-pages.

Install

After a git clone, run:

python3 setup.py install

Use

Download

To download a .bz2-compressed Wikipedia XML dump, do:

witokit download \
  --lang lang_wp_code \
  --date wiki_date \
  --output /abs/path/to/output/dir/where/to/store/bz2/archives \
  --num-threads num_cpu_threads

For example, to download the latest English Wikipedia, do:

witokit download --lang en --date latest --output /abs/path/to/output/dir --num-threads 2

The --lang parameter expects the WP (language) code corresponding to the desired Wikipedia archive. The full list of Wikipedias with their corresponding WP codes is available on Wikipedia's List of Wikipedias page.

The --date parameter expects a string corresponding to one of the dates found under the Wikimedia dump site corresponding to a given Wikipedia dump (e.g. https://dumps.wikimedia.org/enwiki/ for the English Wikipedia).
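To check which values --date accepts for a given language, it can help to open the dump index for that Wikipedia. The sketch below builds the index URL from a WP code, assuming the pattern of the English example above carries over to other languages. dump_index_url is a hypothetical helper for illustration, not part of WiToKit.

```python
def dump_index_url(lang):
    """Return the Wikimedia dump index URL for a WP code.

    Assumption: the URL follows the same pattern as the English
    example, https://dumps.wikimedia.org/enwiki/.
    """
    return f"https://dumps.wikimedia.org/{lang}wiki/"


# The sub-directories listed at this URL are the candidate --date values.
print(dump_index_url("en"))
```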

Important: keep --num-threads <= 3 to avoid rejection from Wikimedia servers.
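When scripting downloads for several languages, the thread limit can be enforced in code. The following is a minimal sketch that only assembles the command line from the flags documented above; build_download_cmd and MAX_THREADS are hypothetical names, not part of WiToKit.

```python
MAX_THREADS = 3  # limit stated above to avoid rejection from Wikimedia servers


def build_download_cmd(lang, date, output_dir, num_threads=2):
    """Build the argument list for `witokit download`, capping the thread count."""
    threads = max(1, min(num_threads, MAX_THREADS))
    return [
        "witokit", "download",
        "--lang", lang,
        "--date", date,
        "--output", output_dir,
        "--num-threads", str(threads),
    ]


# A request for 8 threads is reduced to 3.
print(build_download_cmd("en", "latest", "/abs/path/to/output/dir", num_threads=8))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where witokit is installed.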

Extract

To extract the content of the downloaded .bz2 archives, do:

witokit extract \
  --input /abs/path/to/downloaded/wikipedia/bz2/archives \
  --num-threads num_cpu_threads
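Conceptually, extraction amounts to decompressing each .bz2 archive into the XML file it contains. The sketch below shows that step for a single file using only the Python standard library. It is an illustration of the idea, not WiToKit's actual implementation, and decompress_bz2 is a hypothetical helper.

```python
import bz2
import shutil


def decompress_bz2(archive_path, output_path):
    """Stream-decompress one .bz2 file to output_path without loading it in memory."""
    with bz2.open(archive_path, "rb") as src, open(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return output_path
```

Streaming with shutil.copyfileobj matters here because Wikipedia dumps are typically far larger than available memory.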

Process

To preprocess the content of the extracted XML archives and output a single tokenized .txt file, one sentence per line, do:

witokit process \
  --input /abs/path/to/wikipedia/extracted/xml/archives \
  --output /abs/path/to/single/output/txt/file \
  --lower \
  --num-threads num_cpu_threads

The --lower flag is optional: if set, the text is lowercased.

Preprocessing for all languages is performed with Polyglot.
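To make the output format concrete, here is a minimal sketch of what "one tokenized sentence per line" looks like. It uses naive regular expressions as a stand-in for Polyglot, so it is only an approximation for illustration; to_sentence_lines is a hypothetical function, not part of WiToKit.

```python
import re


def to_sentence_lines(text, lower=False):
    """Return one space-separated tokenized sentence per list item.

    Assumption: naive regex splitting stands in for Polyglot here.
    """
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lines = []
    for sentence in sentences:
        # Words and individual punctuation marks become separate tokens.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        if not tokens:
            continue
        line = " ".join(tokens)
        lines.append(line.lower() if lower else line)
    return lines


for line in to_sentence_lines("Hello, world! It works.", lower=True):
    print(line)
# prints:
# hello , world !
# it works .
```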

Sample

You can also use WiToKit to sample the content of a preprocessed .txt file, using:

witokit sample \
  --input /abs/path/to/witokit/preprocessed/txt/file \
  --percent percent_of_lines_to_keep \
  --balance

The --percent parameter is the percentage of total lines to keep. If --balance is set, sampling is balanced; otherwise, only the top n sentences are kept.
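The difference between the two sampling modes can be sketched as follows. The source does not specify how balancing is done, so this example assumes that "balanced" means taking evenly spaced lines across the whole file; sample_lines is a hypothetical function, not WiToKit's implementation.

```python
def sample_lines(lines, percent, balance=False):
    """Keep roughly `percent` percent of `lines`.

    Assumption: balanced sampling picks evenly spaced lines, while the
    default simply keeps the first n lines.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the interval (0, 100]")
    n = int(len(lines) * percent / 100)
    if n == 0:
        return []
    if not balance:
        return lines[:n]
    step = len(lines) / n  # distance between two sampled lines
    return [lines[int(i * step)] for i in range(n)]


sentences = [f"s{i}" for i in range(10)]
print(sample_lines(sentences, 20))                # ['s0', 's1']
print(sample_lines(sentences, 20, balance=True))  # ['s0', 's5']
```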
