[Fork tracking] Chinese Segmenter enhancements #253

ManyTheFish · 2024-01-09T09:39:17Z

This PR is not meant to be merged.
It exists to make it easy to follow the enhancements made on https://github.com/lzw65/charabia

Kimeiga · 2024-01-11T07:19:32Z

Hi, I looked through zhiwu's code, and the use of a traditional-to-simplified normalizer is very smart. Is there a timeline for when this work will get merged? Thanks for your work!

Fix lindera UniDic download error; support traditional_to_simplified; update custom dict

ManyTheFish · 2024-01-16T10:18:49Z

Hello @Kimeiga,
I don't know if the traditional-to-simplified normalizer is relevant, because the kvariant table already maps between the two. Moreover, character_converter has some performance issues and is not maintained by a native Chinese speaker. For me, the real enhancement would be to handle pinyin differently and to be able to generate Chinese-specialized n-grams. 🤔
The issue is that this is not an easy problem to tackle. 😞

ManyTheFish changed the title ~~[Fork tracking]: Chinese Segmenter enhancements~~ [Fork tracking] Chinese Segmenter enhancements Jan 9, 2024

lzw65 force-pushed the main branch 3 times, most recently from 899655d to fa5a268 Compare January 16, 2024 02:00

Disable pinyin normalizer

ba80b8a
