nori-clone

Standalone Nori (Korean Morphological Analyzer in Apache Lucene) written in C++.

Introduction

Elasticsearch provides nori, a high-quality, high-performance Korean morphological analyzer. However, nori's code is tightly coupled to the Lucene codebase, and nori is written in Java, the main language of the Lucene project. This makes it hard to use nori standalone from Python or Golang with the same performance. I therefore re-implemented nearly the same algorithms as Lucene's nori in C++ for portability and usability.

Usage

This project is written in C++, but it also provides Python and Golang bindings.

Pre-built dictionaries

The dictionary/ directory contains the pre-built dictionary files used for distribution and test cases. For now, there are two pre-built dictionaries, legacy and latest.

The legacy dictionary does not normalize inputs and is built with mecab-ko-dic-2.0.3-20170922, the same dictionary version used by the original nori.
The latest dictionary normalizes inputs to the NFKC form and is built with mecab-ko-dic-2.1.1-20180720.
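
To give a rough idea of what NFKC normalization does to input text, here is a minimal sketch using Python's standard unicodedata module. It does not use nori-clone's API and is not how the dictionary itself applies normalization; the helper name nfkc and the sample strings are assumptions chosen for illustration only.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Return the NFKC-normalized form of text (illustrative helper)."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters and digits fold to their ASCII counterparts.
fullwidth = "ＮＯＲＩ１２３"

# A Hangul syllable written as three conjoining jamo
# (U+1112, U+1161, U+11AB) composes into one precomposed syllable (U+D55C).
decomposed = "\u1112\u1161\u11ab"

print(nfkc(fullwidth))
print(len(decomposed), len(nfkc(decomposed)))
```

Under NFKC, visually equivalent inputs such as full-width and ASCII characters map to the same code points, so text that differs only in these compatibility forms ends up identical before analysis.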

Performance

For more details, check out tools/benchmark.

Differences with original nori

Check out tools/comparison.

For the contributors

Check out CONTRIBUTING.md

jeongukjae/nori-clone
