VietSeg

A Vietnamese Word Segmentation program using Neural Network

How to use:

Download the source code
Change to src folder
Put a text file into src folder for segmenting, here I use input.txt
Run python3 vietseg.py input.txt output.txt (Yeah, this program use Python3, and Python2 won't work on it, you can fix this, of course)
Now you got output.txt which is segmented

Performance:

Precision, Recall, and F1-measure in the same data as described in this paper:

RESULT:
===================
Run 0: P = 0.9156, R = 0.9294, F = 0.9225
Run 1: P = 0.9015, R = 0.9183, F = 0.9099
Run 2: P = 0.9189, R = 0.9327, F = 0.9258
Run 3: P = 0.9208, R = 0.9339, F = 0.9273
Run 4: P = 0.9166, R = 0.9295, F = 0.9230
===================
Avg.   P = 0.9147, R = 0.9288, F = 0.9217

And here is the best performance in the paper:

P = 94.00, R = 94.45, F = 94.23

The program use some random shuffers, so your result may not be the same as mine.

Train the model yourself:

Get the data (see links below) and put in the dat folder
Change working directory to src folder
Run python3 word2vec.py to get vectorized words for our segmenting model using Word2Vec library (Word2Vec itself is a neural network)
Run python3 learn.py to really train the segmenting model
Run python3 performace.py for examining the peformance of the model
Now you can use python3 vietseg.py <input file> <output file> as described above

Data for training model:

Vietnamese corpus:
- File: VNESEcorpus.txt
- Move the file to dat folder
Vietnamese IOB training data:
- File: trainingdata.tar.gz
- Untar and put 10 files: test1.iob2 -> test5.iob2, train1.iob2 -> train5.iob2 to dat folder, along with VNESEcorpus.txt

Future works:

Speed up the network
Use a professional deep learning package (Theano, Caffe, etc)
Train the model with bigger corpus and training data file, like these
Deal with uppercase characters
Build a web app

Acknowledgment:

This program use some code from wendykan and mnielsen. View the source code for detail.

Similar programs:

JVnSegmenter: Java
vnTokenizer: Java
Dongdu: C++
Roy_VnTokenizer: Python
VLSP: PHP?

Last words:

sophisticated algorithm ≤ simple learning algorithm + good training data

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
dat		dat
src		src
var		var
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dat

dat

src

src

var

var

README.md

README.md

Repository files navigation

VietSeg

How to use:

Performance:

Train the model yourself:

Data for training model:

Future works:

Acknowledgment:

Similar programs:

Last words:

About

Releases

Packages

Languages

manhtai/vietseg

Folders and files

Latest commit

History

Repository files navigation

VietSeg

How to use:

Performance:

Train the model yourself:

Data for training model:

Future works:

Acknowledgment:

Similar programs:

Last words:

About

Topics

Resources

Stars

Watchers

Forks

Languages