wikimedia-history-import

Import the TSV Wikimedia history dump into MongoDB

Repository purpose

The purpose of this repo is to import the Italian TSV Wikimedia history dump into a MongoDB database. The dump is documented here.

All the data in the TSV is preserved, but split into three collections according to the event_type: revisions, pages, and users. The types are parsed correctly before insertion into MongoDB, so timestamps become dates, comma-separated lists become arrays of strings, and so on.
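For example, a timestamp field and a comma-separated field might be converted roughly like this minimal sketch (the field names, timestamp format, and the assumption of a header row are illustrative, not main.py's actual code):

```python
import csv
import json

def to_document(row):
    """Turn one TSV row (a dict) into a typed MongoDB document."""
    doc = dict(row)
    if doc.get("event_timestamp"):
        # mongoimport accepts relaxed Extended JSON, so a timestamp like
        # "2007-03-14 12:05:41" can be emitted as a BSON date.
        doc["event_timestamp"] = {
            "$date": doc["event_timestamp"].replace(" ", "T") + "Z"
        }
    if doc.get("event_user_groups"):
        # Comma-separated values become arrays of strings.
        doc["event_user_groups"] = doc["event_user_groups"].split(",")
    return doc

# Assumes the TSV has a header row naming its columns.
with open("history.tsv", newline="") as tsv, \
     open("revisions.json", "w") as out:
    for row in csv.DictReader(tsv, delimiter="\t"):
        if row.get("event_type") == "revision":
            # One JSON document per line, as mongoimport expects.
            out.write(json.dumps(to_document(row)) + "\n")
```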

How it was made

The repo consists of only two files:

  • main.py: a Python script that, given a TSV file, creates three JSON files (one per collection) ready to be imported.
  • lavora.sh: a bash script that, for each year of the Italian history dump, downloads the compressed file, decompresses it, converts it to JSON through the Python script, imports the JSON files through mongoimport, and deletes the files that are no longer needed. A rough outline of this loop is sketched below.
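The per-year loop might look something like the following sketch; the dump URL, snapshot name, file naming, and database/collection names are assumptions for illustration, not the actual contents of lavora.sh:

```bash
#!/bin/bash
# Illustrative outline only: URL pattern, snapshot, and names are assumed.
FROM=2001
TO=2023
BASE="https://dumps.wikimedia.org/other/mediawiki_history"  # assumed location
SNAP="2023-10"                                              # assumed snapshot

for YEAR in $(seq "$FROM" "$TO"); do
    FILE="${SNAP}.itwiki.${YEAR}.tsv"
    wget "${BASE}/${SNAP}/itwiki/${FILE}.bz2"   # download one year
    bzip2 -d "${FILE}.bz2"                      # decompress
    python3 main.py "${FILE}"                   # produce the three JSON files
    for COLL in revisions pages users; do
        mongoimport --db wiki --collection "${COLL}" --file "${COLL}.json"
    done
    rm -f "${FILE}" revisions.json pages.json users.json  # free disk space
done
```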

How to use it

Just execute ./lavora.sh, after making it executable with chmod +x lavora.sh.
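That is:

```bash
chmod +x lavora.sh   # make the script executable (needed once)
./lavora.sh          # download, convert, and import every year
```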

Notes

You can choose which range of years to download by modifying the FROM and TO variables in lavora.sh.
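For example, to import only the years 2010 through 2015 (assuming the variables are plain shell assignments):

```bash
FROM=2010   # first year to download
TO=2015     # last year to download
```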

The script can take hours to finish.