
tokenizer_rs

A word tokenizer written purely in Rust. It currently has two tokenizers:

  1. en - A space-based tokenizer where words are split on whitespace.
  2. th - A dictionary-based tokenizer using the "maximum matching" algorithm, with basic unknown-word handling that minimizes the number of unknown characters until some known word(s) are found (see the sketch after this list).
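
As a rough illustration of the matching core, here is a minimal sketch of greedy maximum matching. This is not the crate's actual implementation; in particular, the single-character fallback is a simplification of the real unknown-word handling, which minimizes the number of unknown characters until known word(s) are found.

use std::collections::HashSet;

// Minimal sketch of greedy "maximum matching": at each position, take the
// longest dictionary word starting there; if none exists, emit one character
// as an unknown token. The crate's real unknown-word handling is smarter.
fn maximum_matching(text: &str, dict: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        // try the longest candidate first, shrinking toward a single character
        let mut matched = 0;
        for end in (i + 1..=chars.len()).rev() {
            let candidate: String = chars[i..end].iter().collect();
            if dict.contains(candidate.as_str()) {
                tokens.push(candidate);
                matched = end - i;
                break;
            }
        }
        if matched == 0 {
            // no dictionary word starts here; emit one unknown character
            tokens.push(chars[i].to_string());
            i += 1;
        } else {
            i += matched;
        }
    }
    tokens
}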

It currently supports two feature gates:

  • multi-thread - Attempts to use multiple threads for tokenization.
  • single-thread - Uses a single thread.

As it currently stands, the Thai word tokenizer supports both features. It uses Rayon for multi-threaded tokenization: it first splits the text on whitespace, then tokenizes each chunk on a separate thread using Rayon's parallel iterator, as sketched below.
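
A minimal sketch of that split-then-parallelize shape, assuming a hypothetical tokenize_chunk function that stands in for the real per-chunk maximum-matching step:

use rayon::prelude::*;

// Split on whitespace first, then tokenize each chunk on Rayon's thread pool.
fn tokenize_parallel(text: &str) -> Vec<String> {
    text.split_whitespace()
        .collect::<Vec<_>>()
        .into_par_iter()          // each chunk is processed in parallel
        .flat_map(tokenize_chunk) // results are flattened back into one list
        .collect()
}

// Placeholder: a real implementation would run maximum matching here.
fn tokenize_chunk(chunk: &str) -> Vec<String> {
    vec![chunk.to_string()]
}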

The English tokenizer doesn't actually leverage multiple threads yet, but it works with both features.

By default, the multi-thread feature is used.

How to use

Put the following lines in your Cargo.toml dependencies section. For example:

[dependencies]
tokenizer = "^0.1"

This will attempt to use multiple threads for tokenization.

To force single-threaded operation, use the single-thread feature:

[dependencies]
tokenizer = { version = "^0.1", features = ["single-thread"] }

An example of Thai text tokenization:

use tokenizer::{Tokenizer, th};
let tokenizer = th::Tokenizer::new("path/to/dictionary.txt").expect("Dictionary file not found");
// Assuming the dictionary contains "ภาษาไทย" and "นิดเดียว" but not "ง่าย"
assert_eq!(tokenizer.tokenize("ภาษาไทยง่ายนิดเดียว"), vec!["ภาษาไทย", "ง่าย", "นิดเดียว"]);

Sample implementation using Lexitron dictionary

I have created sample code that calculates the F1-score over 10 Monte Carlo simulation tests. Each test uses a sample size of 200 and keeps 10% of that sample out of the tokenizer's dictionary, to measure tokenization quality when 10% of the words in the text are unknown.

That repository uses the Lexitron dictionary from NECTEC. Before using it, you should read their license agreement first.
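
For reference, here is a minimal sketch of how an F1-score over token boundaries can be computed. This is a hypothetical helper, not the code from that repository.

use std::collections::HashSet;

// Score predicted tokens against expected tokens by comparing the
// (start, end) character offsets each tokenization implies.
fn f1_score(predicted: &[String], expected: &[String]) -> f64 {
    fn spans(tokens: &[String]) -> HashSet<(usize, usize)> {
        let mut out = HashSet::new();
        let mut pos = 0;
        for t in tokens {
            let len = t.chars().count();
            out.insert((pos, pos + len));
            pos += len;
        }
        out
    }
    let p = spans(predicted);
    let e = spans(expected);
    let tp = p.intersection(&e).count() as f64;
    if tp == 0.0 {
        return 0.0;
    }
    let precision = tp / p.len() as f64; // correct boundaries among predicted
    let recall = tp / e.len() as f64;    // correct boundaries among expected
    2.0 * precision * recall / (precision + recall)
}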
