Suite of Python modules to recognise the language of a file

mc-cat-tty/Language-Classification

File Language Analyzer

File Language Analyzer is a suite of Python modules providing objects, constants, and functions to recognise the language of a file, analyze its contents, and process (read and create) .csv letter-frequency tables.


Keep in mind that the code quality of this project is poor; however, the logic behind the adopted method is interesting.

Features

  • Recognise the language of a file
  • Convert .csv frequency table to Python dictionary
  • Convert Python dictionary to .csv frequency table
  • Generate frequency table starting from a set of Twitter messages
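The two conversion features can be pictured with a minimal sketch. The table layout used here (one row per letter, one column per language, with a leading `letter` column) is an assumption, as are the function names `csv_to_table` and `table_to_csv`; the project's own modules may organise this differently. The numbers in `SAMPLE` are illustrative placeholders, not measured frequencies.

```python
import csv
import io


def csv_to_table(text):
    """Parse CSV text into {language: {letter: frequency}} (assumed layout)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    languages = header[1:]  # first column is assumed to hold the letter
    table = {lang: {} for lang in languages}
    for row in reader:
        if not row:
            continue  # skip blank lines
        letter = row[0]
        for lang, value in zip(languages, row[1:]):
            table[lang][letter] = float(value)
    return table


def table_to_csv(table):
    """Serialise {language: {letter: frequency}} back to CSV text."""
    languages = list(table)
    letters = sorted({letter for freqs in table.values() for letter in freqs})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["letter"] + languages)
    for letter in letters:
        writer.writerow([letter] + [table[lang].get(letter, 0.0) for lang in languages])
    return out.getvalue()


# Placeholder values for illustration only.
SAMPLE = "letter,english,italian\na,8.2,11.7\ne,12.7,11.8\n"
table = csv_to_table(SAMPLE)
```

Keeping one dictionary per language makes it straightforward to compare a text against each language column in turn.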

Math behind it

By analyzing the frequency of every single letter, it is possible to detect the language of a given text.
Once the character frequencies have been extracted, they can be used as a representation of the text.
To find out which language the text is written in, we determine which column of the frequency table holds the values nearest to those of the text.
This can be done with the Pythagorean theorem extended to 26 dimensions (the number of letters in the basic Latin alphabet), that is, the Euclidean distance.
By computing the distance between the given text and each language in the table, it is possible to identify the nearest language.
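The method described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the project's actual code: the names `letter_frequencies`, `distance`, and `nearest_language` are hypothetical, frequencies are expressed as percentages, and the reference table is built from two tiny sample sentences instead of a real .csv table.

```python
import math
import string
from collections import Counter

LETTERS = string.ascii_lowercase  # the 26 dimensions


def letter_frequencies(text):
    """Return the percentage of each of the 26 letters in the text."""
    counts = Counter(ch for ch in text.lower() if ch in LETTERS)
    total = sum(counts.values())
    if total == 0:
        return {letter: 0.0 for letter in LETTERS}
    return {letter: 100.0 * counts[letter] / total for letter in LETTERS}


def distance(freqs_a, freqs_b):
    """Euclidean distance between two frequency vectors (26-dimensional Pythagoras)."""
    return math.sqrt(
        sum((freqs_a.get(l, 0.0) - freqs_b.get(l, 0.0)) ** 2 for l in LETTERS)
    )


def nearest_language(text, table):
    """Return the language in the table whose frequencies are closest to the text."""
    sample = letter_frequencies(text)
    return min(table, key=lambda lang: distance(sample, table[lang]))


# Toy reference texts, used only to build a small table for this sketch.
REFERENCE = {
    "english": "the quick brown fox jumps over the lazy dog and then the other dog",
    "italian": "la volpe veloce salta sopra il cane pigro e poi sopra l'altro cane",
}
table = {lang: letter_frequencies(text) for lang, text in REFERENCE.items()}

print(nearest_language("the dog jumps over the fox", table))
```

With such short reference texts the result is only indicative; a table built from a large corpus gives far more stable frequencies.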

Technologies

  • Python 3.x
  • Python built-in libraries
  • Twitter API wrapped by tweepy library
  • wikipedia-api module
  • Flask

Requirements

Use one of the following commands (according to the configuration of your environment):

$ pip install -r requirements.txt

or

$ py -m pip install -r requirements.txt

Launch

If you are in a Bash-like environment with Python installed, you can run it directly by typing:

$ ./Main.py

Otherwise, depending on your Python interpreter installation and your OS:

$ python Main.py

or

$ py Main.py

After that, go to http://127.0.0.1:5000 or http://localhost:5000 and try out the web interface.

The default frequency table is letters_frequency_twitter.csv.

Usage

If you want to use the functions in tweetrain.py, you have to insert your personal Twitter tokens. Look at the first four uppercase variables and fill in the double quotes with the proper values.