ElasticSearch BigData importer

Imports raw JSON into Elasticsearch using multiple threads.

The script supports five modes:

Validate data only
Import data to Elasticsearch without validation
- Import using a single thread
- Import using multiple threads
Import data to Elasticsearch after validation
- Import using a single thread
- Import using multiple threads

Prerequisites

Install the elasticsearch package with pip:

pip install elasticsearch

Use

Options

--data          : Path to the data file
--check         : Validate the data file
--bulk          : Elasticsearch endpoint (http://localhost:9200)
--index         : Index name
--type          : Index type
--import        : Import data to Elasticsearch
--thread        : Number of threads (default: 1)
--help          : Display the help message
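
As a rough illustration of how these options combine into the five modes, here is a minimal argparse sketch. It is not taken from import.py: the function names build_parser and describe_mode, the do_import attribute, and the mode labels are all assumptions made for this example, and the real script may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the documented options;
    # the real import.py may define them differently.
    parser = argparse.ArgumentParser(prog="import.py")
    parser.add_argument("--data", help="Path to the data file")
    parser.add_argument("--check", action="store_true", help="Validate the data file")
    parser.add_argument("--bulk", help="Elasticsearch endpoint")
    parser.add_argument("--index", help="Index name")
    parser.add_argument("--type", help="Index type")
    # "import" is a Python keyword, so the flag is stored under another name.
    parser.add_argument("--import", dest="do_import", action="store_true",
                        help="Import data to Elasticsearch")
    parser.add_argument("--thread", type=int, default=1, help="Number of threads")
    # argparse adds --help automatically.
    return parser


def describe_mode(args: argparse.Namespace) -> str:
    """Map the parsed flags to one of the five modes (labels are illustrative)."""
    if not args.do_import:
        return "validate only" if args.check else "nothing to do"
    validation = "after validation" if args.check else "without validation"
    threading = "multi-thread" if args.thread > 1 else "single-thread"
    return f"import {validation}, {threading}"
```

For example, parsing `--data test_data.json --check` with this sketch yields the "validate only" mode, while adding `--import` and `--thread 16` selects a multi-thread import.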

Validate data

It is a good idea to validate your data before (or during) the import.

python import.py --data test_data.json --check
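
To give an idea of what such a check could look like, here is a minimal sketch. It assumes the data file holds one JSON object per line and that blank lines are skipped; both are assumptions for this example, and find_invalid_lines is a hypothetical helper, not a function from import.py or utils.py.

```python
import json
from typing import Iterable, List, Tuple


def find_invalid_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, reason) for every line that is not a JSON object."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # assumption: blank lines are ignored
        try:
            value = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append((number, exc.msg))
            continue
        if not isinstance(value, dict):
            # Valid JSON, but not an object that could be indexed as a document.
            problems.append((number, "not a JSON object"))
    return problems
```

A file object can be passed directly, for example `find_invalid_lines(open("test_data.json", encoding="utf-8"))`, since iterating over a file yields its lines.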

Single Thread

Import without validation

python import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name

Import after validation

python import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name --check
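
For readers unfamiliar with the Elasticsearch Bulk API, the sketch below shows the newline-delimited body it expects: an action line followed by the document itself. This is only an illustration using the standard library; build_bulk_body is a hypothetical helper, and whether import.py builds requests this way or goes through the elasticsearch package's helpers is not stated here. Note that mapping types, and therefore the `_type` field, are deprecated in Elasticsearch 7.

```python
import json
from typing import Iterable


def build_bulk_body(docs: Iterable[dict], index: str, doc_type: str) -> str:
    """Build a Bulk API body: one action line, then one source line, per document."""
    lines = []
    for doc in docs:
        # Action line: index this document into the given index and type.
        # Assumption: the type is passed as "_type", which Elasticsearch 7
        # accepts but marks as deprecated.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

The resulting string would be sent with a POST request to the `_bulk` endpoint of the Elasticsearch address, using the `application/x-ndjson` content type.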

Multi Thread

Import without validation

python import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name --thread 16

Import after validation

python import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name --check --thread 16

Importing with multiple threads is much faster, though the speedup depends on your computer or server resources. The script uses linecache to hold the data in RAM, so you also need enough memory to hold it.
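
One way to combine linecache with a thread pool is to give each thread its own contiguous range of lines. The sketch below illustrates that idea only; split_ranges, process_range and run are hypothetical names, the per-line work is a placeholder, and the actual work distribution in import.py may differ.

```python
import linecache
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def split_ranges(total_lines: int, threads: int) -> List[Tuple[int, int]]:
    """Split lines 1..total_lines into up to `threads` inclusive (start, end) ranges."""
    size, extra = divmod(total_lines, threads)
    ranges, start = [], 1
    for i in range(threads):
        length = size + (1 if i < extra else 0)
        if length == 0:
            break  # fewer lines than threads
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def process_range(path: str, start: int, end: int) -> int:
    """Read lines start..end from the linecache cache and count the non-blank ones."""
    count = 0
    for number in range(start, end + 1):
        line = linecache.getline(path, number)
        if line.strip():
            count += 1  # placeholder: a real worker would send the line to Elasticsearch
    return count


def run(path: str, total_lines: int, threads: int) -> int:
    # Load the file into the cache once, before the worker threads start.
    linecache.getline(path, 1)
    ranges = split_ranges(total_lines, threads)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(lambda r: process_range(path, *r), ranges))
```

Because linecache keeps every line of the file in memory, each worker can fetch any line by number without reopening the file, which is also why the memory requirement grows with the file size.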

My test setup:

AMD Ryzen 3800X (8 cores / 16 threads)
64 GB RAM (3000 MHz / CL16)
Windows 10
10 GB JSON file with ~24 million objects
Elasticsearch v7

The whole process took about 30 minutes, and resource usage was efficient.
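
To put these numbers in perspective, the short calculation below derives rough throughput figures from them. It is plain arithmetic on the rounded values given above, reading the file size as 10 gigabytes (10^9 bytes each), so the results are approximate.

```python
documents = 24_000_000   # "~24 million objects"
file_size_gb = 10        # file size, read as 10 gigabytes
duration_s = 30 * 60     # "about 30 minutes"

docs_per_second = documents / duration_s           # roughly 13,300 documents per second
mb_per_second = file_size_gb * 1000 / duration_s   # roughly 5.6 MB per second
avg_doc_bytes = file_size_gb * 1e9 / documents     # roughly 417 bytes per document

print(f"{docs_per_second:,.0f} docs/s, {mb_per_second:.1f} MB/s, "
      f"{avg_doc_bytes:.0f} bytes per document")
```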

Support

Contributing

Fork it!
Create your feature branch: git checkout -b my-new-feature
Commit your changes: git commit -am 'Add some feature'
Push to the branch: git push origin my-new-feature
Submit a pull request :D

Issues

Every project has its problems. You can contribute to the development of this project by reporting them.

License

hatamiarash7/elasticsearch-dump
