Some(NLP)

This project is for me to learn some basic NLP concepts and also I integrated it with Spark. There are two types of Pipelines here in the project, first is the NLP pipeline and the second is ML pipeline.

But in this project, I actually wanted to work on the more on the NLP pipeline, and how to customize it. As I was new to both Stanford NLP and Open NLP, most of my time was consumed in understanding these libraries. I have built two pipelines, one is where we can call these libraries interchangeably, and the second one is the CoreStanfordNLP standard pipeline. The advantage of using one library is I needn't convert the results into primitive data types after every step.

Things to note:

I am only using previously trained models, the data used is a Kaggle IMDB dataset. It's size is 34 MB. Feeding this data set to a random forest of size 10, I am getting accuracy 74 percent.

Features implemented	StanFordNLP	OpenNLP	Comments
Tokenizers	✓	✓
StopWords Removers	✗	✗	Using a custom build List
Stemmers	✓	✓	Snowball from OpenNLP and Porter from Stanford
SentenceDetection	✓	✓
Parts-of-speech tagging	✓	✓
Parsers	✓	✓
Lemmatizers	✓	✓	OpenNLP don't have a dictionary. So, using the elasticsearch dict
Named Entity Recognizer	✓	✓	For OpenNLP, only location identifier is implemented

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
src/main		src/main
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

src/main

src/main

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

pom.xml

pom.xml

Repository files navigation

Some(NLP)

Things to note:

About

Releases

Packages

Languages

License

DEK11/MoreNLP

Folders and files

Latest commit

History

Repository files navigation

Some(NLP)

Things to note:

About

Topics

Resources

License

Stars

Watchers

Forks

Languages