Sensus

This repository contains three IPython notebooks that walk through the whole process of developing a sentiment analysis model, from data processing to a comparative analysis of different LSTM models. Visualizations accompany every stage. The model was built to analyze Ukrainian text.

📥 Downloading Data

Before running the notebooks, we first need to download all the data we will be using.

As always, the first step is to clone the repository:

>> git clone https://github.com/JackShen1/sensus.git

The training dataset currently includes 1,000 positive and 1,000 negative book reviews. The data was originally taken from a large dataset of Amazon reviews, which you can download here. The book reviews were then translated into Ukrainian with Google Translate and lightly edited by me. The raw reviews can be found in the data/ folder.

Since the NLTK library does not support the Ukrainian language, we take a different path. The most complete list of Ukrainian stop words we could find is available here, and it is used in this project.
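As a rough illustration of how such a list can be applied, here is a minimal sketch of stop-word filtering. The `load_stop_words` and `remove_stop_words` helpers, the one-word-per-line file layout, and the tiny `SAMPLE_STOP_WORDS` set are assumptions made for this example; the notebooks may do this differently.

```python
import re

# Tiny illustrative sample (an assumption); the project uses the full external list.
SAMPLE_STOP_WORDS = {"і", "та", "це", "що", "на", "в", "не"}


def load_stop_words(path):
    """Read stop words from a UTF-8 text file, assuming one word per line."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def remove_stop_words(text, stop_words):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    # Letters only, optionally joined by an apostrophe (as in "м'який").
    tokens = re.findall(r"[^\W\d_]+(?:['’ʼ][^\W\d_]+)*", text.lower())
    return [token for token in tokens if token not in stop_words]


print(remove_stop_words("Це дуже цікава книга, і я не шкодую", SAMPLE_STOP_WORDS))
```

With the full list saved locally, `load_stop_words("path/to/stopwords.txt")` (a placeholder path) would replace the sample set.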

A stemmer was also used for comparison at the processing stage (Part I). Ideally we would use PorterStemmer from nltk.stem, but it does not support Ukrainian. This is not a problem, because writing your own Porter stemmer implementation is not that difficult, so we wrote one in Python based on this PHP code.

The last thing we need to download is a Word2Vec model. For simplicity, we use a pretrained Word2Vec model of Ukrainian word vectors, each with 300 dimensions. We chose the lemmatized version of this model because it matches the sample we already processed in Part I. The model can be found on this website. After downloading, unpack the bz2 archive (~1 GB), for example using this application.
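If you prefer not to install a separate tool, the archive can also be unpacked with Python's standard library. This is a minimal sketch; the `decompress_bz2` helper and the file names in the commented call are placeholders, not the real archive name.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file so the ~1 GB model never sits fully in memory."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path


# Example call (placeholder file names, adjust to the archive you downloaded):
# decompress_bz2("word2vec_uk.bz2", "word2vec_uk.txt")
```

Assuming the unpacked file is stored in the standard word2vec format, it can usually be loaded with Gensim's `KeyedVectors.load_word2vec_format`; whether `binary=True` is needed depends on the file.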

📝 Requirements

In order to run the iPython notebook, you'll need Python (v3.6+) and the following libraries:

Keras (v2.4+)
Gensim (v3.8+)
Pandas (v1.2+)
NumPy (v1.19.5+)
NLTK (v3.5+)
python-decouple (v3.4+)
pymorphy2-dicts-uk (v2.4.1+)
pymorphy2 (v0.9+)
scikit-learn (v0.24.1)
SciPy (v0.19.1+)
Matplotlib (v2.1.1+)
Jupyter

The commands for installing these libraries will follow. First, let's create a virtual environment.

🐍 Creating a Virtual Environment

The easiest way to install Keras, Gensim, NumPy, Jupyter, matplotlib and our other libraries is to start with the Anaconda Python distribution.

Select your OS and follow the installation instructions for Anaconda Python. We recommend using Python 3.6+ (64-bit).
Install the Python development environment on your system:
```
>> pip install -U pip virtualenv
```
If you haven't done so already, clone this repository from GitHub:
```
>> git clone https://github.com/JackShen1/sensus.git
```
Use cd to navigate into the top directory of the repo on your machine.

Open Anaconda Prompt, install JupyterLab, and create the virtual environment with the following commands:

>> conda install -c conda-forge jupyterlab    # install JupyterLab
>> conda create -n sensus pip python=3.7  # choose the Python version
>> conda activate sensus                  # activate the virtual environment

Alternatively, you can install Jupyter with pip: pip install jupyterlab

Now we can install all the libraries we need:

>> pip install Keras gensim pandas numpy nltk python-decouple scikit-learn scipy matplotlib pymorphy2
>> pip install -U pymorphy2-dicts-uk # dictionary for the Ukrainian language
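
To confirm that the installation worked, a quick check like the one below can be run inside the activated environment. It is only a sketch: the `missing_modules` helper is hypothetical, and the import names in `REQUIRED_MODULES` are assumptions, since some of them differ from the pip package names (for example `sklearn` for scikit-learn and `decouple` for python-decouple).

```python
import importlib.util

# Import names (assumed) for the libraries listed in the Requirements section.
REQUIRED_MODULES = [
    "keras", "gensim", "pandas", "numpy", "nltk", "decouple",
    "sklearn", "scipy", "matplotlib", "pymorphy2", "pymorphy2_dicts_uk",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

This only checks that the modules can be located; it does not compare installed versions against the minimums listed above.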

Launch Jupyter by entering:
```
>> jupyter notebook
```

Once everything is installed, do the following the next time you want to start working:

Open Anaconda Prompt and enter the project folder with the cd command. Now enter the following commands:
```
>> conda activate sensus
>> jupyter notebook
```

📋 Overview

This project describes, in three parts, the whole process of data preparation and model training, together with a comparative analysis of classifiers and of various models. Each stage is accompanied by data visualization. The results are good for such small datasets with a not very accurate translation. In the future, I will expand the datasets and correct the translation. Otherwise, the project works well and can be easily adapted to English or Russian. See the notebooks for a detailed description.
