GitHub - gaganmanku96/Gibberish-Detection: This repo contains my experiments related to gibberish detection.

Gibberish Detection

What is it?

Gibberish is text that looks like it could be real words but has no meaning at all. Example: hhduaihd

Requirements

numpy
seaborn

pip install -r requirements.txt

Experiment 1:

Define the accepted characters as [a-z ].
Create a method to tokenize at the character level.
Create a method to generate ngrams.
Create a 27x27 matrix (26 letters plus space) with 10 as the initial value.
- This matrix will hold the probability of one character being followed by another.
- Every count starts at 10 so that a character pair we have never seen does not end up with a probability of zero.
At this point the heatmap of probabilities is uniform, because the probability is the same for every pair.
Take a large corpus and read every line.
- Tokenize each line and calculate its ngrams.
- Increase the count in the probability matrix by 1 on each occurrence of a character pair.
Now we need some way to normalize these probabilities. For that, I've divided every row by its sum and taken the log of the result.
(Figure: heatmap of the probabilities after normalization.)
How do we decide whether a word is gibberish or not? We can try 2 approaches:
- Multiply the probabilities of the ngrams in Markov chain fashion.
- Add the probabilities of the ngrams.
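The training steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the names `tokenize`, `ngrams` and `train` are hypothetical, the two-line corpus is a stand-in for a large text file, and plain Python lists are used instead of numpy only to keep the sketch dependency-free.

```python
import math

ACCEPTED = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus space -> 27 symbols
INDEX = {ch: i for i, ch in enumerate(ACCEPTED)}


def tokenize(line):
    """Lowercase the line and keep only the accepted characters."""
    return [ch for ch in line.lower() if ch in INDEX]


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train(lines, initial=10):
    """Count character pairs, then turn each row into log probabilities."""
    size = len(ACCEPTED)
    # Every cell starts at `initial` so unseen pairs never get probability zero.
    counts = [[initial] * size for _ in range(size)]
    for line in lines:
        for a, b in ngrams(tokenize(line)):
            counts[INDEX[a]][INDEX[b]] += 1
    log_probs = []
    for row in counts:
        total = sum(row)
        # Divide each row by its sum, then take the log.
        log_probs.append([math.log(c / total) for c in row])
    return log_probs


# Tiny stand-in corpus (hypothetical); a real run would read a large text file.
corpus = ["the quick brown fox", "then there was the other thing"]
model = train(corpus)
```

Here `model[INDEX["t"]][INDEX["h"]]` ends up larger than `model[INDEX["t"]][INDEX["q"]]`, since "th" occurs in the corpus and "tq" does not.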

Adding the probabilities seems the better option: with multiplication, a single low-probability ngram drastically lowers the score of the whole word, whereas with addition it does not have that much impact.
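To make the difference concrete, here is a small sketch that applies both approaches to made-up raw probabilities. The numbers are hypothetical and chosen only for illustration; they do not come from the trained matrix.

```python
import math

# Hypothetical raw bigram probabilities for a four-character word.
typical = [0.20, 0.30, 0.10]
# The same word, but with the last pair replaced by a very rare one.
with_rare = [0.20, 0.30, 0.001]

# Approach 1: multiply, Markov chain fashion.
product_drop = math.prod(typical) / math.prod(with_rare)
# Approach 2: add.
sum_drop = sum(typical) / sum(with_rare)

print(f"product shrinks by a factor of {product_drop:.1f}")
print(f"sum shrinks by a factor of {sum_drop:.2f}")
```

One caveat worth keeping in mind: the matrix in Experiment 1 stores logarithms, and a sum of logarithms equals the logarithm of the product, so the two approaches only differ in this way when applied to raw probabilities as in the sketch.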

For prediction, we will generate ngrams and add the probabilities.
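A possible sketch of the prediction step is shown below. The training steps from Experiment 1 are condensed into a hypothetical `fit` helper so that the block runs on its own; `bigrams`, `fit` and `score` are illustrative names, and the one-sentence corpus is a toy assumption rather than the repository's training data.

```python
import math
import string

ALPHABET = string.ascii_lowercase + " "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def bigrams(text):
    """Keep accepted characters only and pair each one with its successor."""
    chars = [ch for ch in text.lower() if ch in INDEX]
    return list(zip(chars, chars[1:]))


def fit(lines, initial=10):
    """Smoothed bigram counts, row-normalized and log-transformed."""
    counts = [[initial] * len(ALPHABET) for _ in ALPHABET]
    for line in lines:
        for a, b in bigrams(line):
            counts[INDEX[a]][INDEX[b]] += 1
    return [[math.log(c / sum(row)) for c in row] for row in counts]


def score(text, log_probs):
    """Add up the stored values for every bigram in the text."""
    return sum(log_probs[INDEX[a]][INDEX[b]] for a, b in bigrams(text))


# Toy corpus (hypothetical), repeated so the counts outweigh the initial 10.
model = fit(["the weather is rather nice together"] * 50)
real = score("together", model)
noise = score("hhduaihd", model)  # the gibberish example from this README
print(real > noise)
```

Both test strings have the same length, so they contribute the same number of bigrams. Because the score is a plain sum, longer inputs accumulate more terms, which is something to account for when comparing strings of different lengths.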

Contribution

Please refer to CONTRIBUTING.md

Reference

https://github.com/rrenaud/Gibberish-Detector

