

Can more fine-grained segmentation be implemented? #7

Open
suziwen opened this issue Mar 15, 2020 · 3 comments

Comments

@suziwen

suziwen commented Mar 15, 2020

For example, with the input 操作系统也越来越流行了, the current version treats 操作系统 as a single complete word. I'd like 操作系统 to be further split into the two words 操作 and 系统, so the final result contains the three words 操作系统, 操作, 系统. This is similar to the cutAll / cutForSearch methods in nodejieba:

> jieba.cut('操作系统也越来越流行了')
[ '操作系统', '也', '越来越', '流行', '了' ]
> jieba.cutAll('操作系统也越来越流行了')
[ '操作', '操作系统', '系统', '也', '越来', '越来越', '流行', '了' ]
> jieba.cutForSearch('操作系统也越来越流行了')
[ '操作', '操作系统', '系统', '也', '越来', '越来越', '流行', '了' ]
@linonetwo
Owner

This should be achievable by writing a Tokenizer middleware. Inside the tokenizer you can access the CRF information (see the other Tokenizers for reference), and then, when you find there is sufficient probability that a word can be further subdivided, return the extra entries 操作 and 系统 in the result list on top of what the other Tokenizers return.
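The middleware idea above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the project's actual Tokenizer API: it assumes we already have a dictionary of known words (with frequencies standing in for the CRF probabilities), and for each segmented word it also emits any known sub-words, similar to jieba's cutForSearch output.

```javascript
// Hedged sketch of the "finer-grained split" middleware idea.
// `dict` is a hypothetical Map of known words -> frequency; the real
// project would draw this information from its dictionary / CRF data.
function splitForSearch(word, dict) {
  const result = [word];
  // single- and two-character words cannot be usefully subdivided
  if (word.length <= 2) return result;
  // scan every substring shorter than the whole word,
  // keeping those that are themselves dictionary words
  for (let len = 2; len < word.length; len++) {
    for (let start = 0; start + len <= word.length; start++) {
      const sub = word.slice(start, start + len);
      if (dict.has(sub)) result.push(sub);
    }
  }
  return result;
}

const dict = new Map([['操作', 100], ['系统', 120], ['操作系统', 80]]);
console.log(splitForSearch('操作系统', dict));
// → [ '操作系统', '操作', '系统' ]
```

A real middleware would additionally use the probability of each sub-word to decide whether the split is worth emitting, as discussed below.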

@linonetwo
Owner

You can see that this f is the probability:

const f = Number(blocks[2]);
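To show how that f could gate the finer split: the sketch below parses a dictionary line and uses the parsed frequency as a confidence threshold. The field layout ("word POS frequency") and the MIN_F cutoff are assumptions for illustration; only the `blocks[2]` access mirrors the snippet above.

```javascript
// Hedged sketch: parse a dictionary line and read its frequency field.
// The space-separated "word POS frequency" layout is an assumption.
function parseDictLine(line) {
  const blocks = line.split(' ');
  return { word: blocks[0], f: Number(blocks[2]) };
}

const entry = parseDictLine('操作 n 2000');
// only treat a sub-word as splittable if its frequency clears a
// hypothetical cutoff, so rare accidental substrings are not emitted
const MIN_F = 100;
console.log(entry.f >= MIN_F); // → true
```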

@suziwen
Author

suziwen commented Mar 15, 2020

Got it.
