Monocorpus: Tatar Language Monocorpus Development Tools

Overview

The Monocorpus project provides tools for building a Tatar language monocorpus. It extracts text from books, cleans it up, and saves the result to files.

Features

Extract text from EPUB and PDF files.
Post-process extracted text to remove unwanted characters (e.g. OCR artifacts). The following steps are performed:
- Replace stray ASCII characters inside a Tatar word with their Cyrillic look-alikes (e.g. c[0x0063]у --> с[0x0441]у)
- Replace stray non-ASCII characters inside a non-Tatar word with their ASCII look-alikes (e.g. а[0x0430]rm --> a[0x0061]rm)
- Unify punctuation marks by replacing look-alikes with a single variant (e.g. '»' | '«' | '“' | '”' | '„' --> '"')
- Remove unwanted characters (e.g. '•')
- Remove stray digits at the end of a word (e.g. башына2 --> башына)
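
The post-processing steps above can be illustrated with a minimal Python sketch. It is not the project's actual implementation. The look-alike table, the rule for deciding whether a mixed-script word is Tatar, the restriction of digit stripping to words ending in a Cyrillic letter, and the names `fix_mixed_script` and `clean_text` are all assumptions made for this example.

```python
import re

# Assumed look-alike pairs (Latin -> Cyrillic); the real table may be larger.
LATIN_TO_CYRILLIC = str.maketrans({
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
})
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0430": "a", "\u0441": "c", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0445": "x", "\u0443": "y",
})

# Quote look-alikes become '"'; the bullet is deleted (mapped to None).
PUNCTUATION = str.maketrans({
    "\u00bb": '"', "\u00ab": '"', "\u201c": '"', "\u201d": '"',
    "\u201e": '"', "\u2022": None,
})

WORD = re.compile(r"[^\W\d_]+")                # a run of letters
CYRILLIC_ONLY = re.compile(r"[\u0400-\u04FF]+")
LATIN_ONLY = re.compile(r"[A-Za-z]+")
# Digits that directly follow a Cyrillic letter and end the word.
TRAILING_DIGITS = re.compile(r"(?<=[\u0400-\u04FF])\d+\b")


def fix_mixed_script(match):
    """Repair a word that mixes Latin and Cyrillic look-alike letters."""
    word = match.group(0)
    if CYRILLIC_ONLY.fullmatch(word) or LATIN_ONLY.fullmatch(word):
        return word  # single-script words are left untouched
    as_cyrillic = word.translate(LATIN_TO_CYRILLIC)
    if CYRILLIC_ONLY.fullmatch(as_cyrillic):
        return as_cyrillic  # assumed to be a Tatar word
    as_latin = word.translate(CYRILLIC_TO_LATIN)
    if LATIN_ONLY.fullmatch(as_latin):
        return as_latin  # assumed to be a non-Tatar word
    return word  # ambiguous: leave as is


def clean_text(text):
    """Apply the cleaning steps in the order they are listed."""
    text = WORD.sub(fix_mixed_script, text)
    text = text.translate(PUNCTUATION)
    return TRAILING_DIGITS.sub("", text)


print(clean_text("башына2"))      # башына
print(clean_text("\u0430rm"))     # arm
```

Only mixed-script words are rewritten, so a purely Latin word such as "cop" is never converted to Cyrillic by accident. A dictionary-based check would be more reliable for deciding which script a word belongs to.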

Getting Started

To get started with the project, follow these steps:

Clone the Repository:

git clone https://github.com/neurotatarlar/monocorpus.git
cd monocorpus

Prepare Python Environment:

Make sure you have Python 3.x installed on your system.
Create and activate a virtual environment (optional but recommended):

python3 -m venv venv
source venv/bin/activate

Install the required dependencies:

pip install -r requirements.txt

Extract Texts from Books:

Place your book(s) into the workdir/000_entry_point folder. Currently we support EPUB and PDF formats.
Run the script to extract texts:

python src/main.py extract
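
Before running the extract step, it can help to check which files in the entry-point folder are in a supported format. The sketch below is not part of the project. The function name `split_by_support` is hypothetical, and treating file extensions case-insensitively is an assumption about how the extractor behaves.

```python
from pathlib import Path

# Formats the README lists as supported.
SUPPORTED_SUFFIXES = {".epub", ".pdf"}


def split_by_support(directory="workdir/000_entry_point"):
    """Return (supported, skipped) file names found directly in directory."""
    root = Path(directory)
    supported, skipped = [], []
    if not root.is_dir():
        return supported, skipped
    for path in sorted(root.iterdir()):
        if not path.is_file():
            continue
        # Assumption: the extension decides the format, ignoring case.
        if path.suffix.lower() in SUPPORTED_SUFFIXES:
            supported.append(path.name)
        else:
            skipped.append(path.name)
    return supported, skipped


if __name__ == "__main__":
    ok, other = split_by_support()
    print("Will be extracted:", ok)
    print("Unsupported format:", other)
```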

Process Extracted Texts Further:

Run the script to process extracted texts:

python src/main.py process

Explore the Output:

Processed text files will be saved in the workdir/900_artifacts directory.
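
To get a quick overview of the output, a short script such as the following can list each processed file with its size in characters. This is a sketch under assumptions: the output files are taken to be UTF-8 text, and `summarize_artifacts` is a hypothetical helper, not something the project provides.

```python
from pathlib import Path


def summarize_artifacts(directory="workdir/900_artifacts"):
    """Return (relative file name, character count) pairs, sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        return []
    summary = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Assumption: artifacts are UTF-8; undecodable bytes are replaced.
            text = path.read_text(encoding="utf-8", errors="replace")
            summary.append((path.relative_to(root).as_posix(), len(text)))
    return summary


if __name__ == "__main__":
    for name, chars in summarize_artifacts():
        print(f"{chars:>10}  {name}")
```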

Project Structure

src/: Contains the main script for text extraction and processing.
workdir/000_entry_point: Place your books here for text extraction.
workdir/900_artifacts: Processed text files will be saved here.
requirements.txt: List of required Python dependencies.

Contributing

Contributions are welcome! If you'd like to contribute to the project, make your changes and submit a pull request detailing the changes made.

License

This project is licensed under the MIT License - see the LICENSE file for details.
