
'是因为' doesn't cut as expected #989

Open
brownbat opened this issue Apr 2, 2023 · 2 comments

brownbat commented Apr 2, 2023

What's the best way to get jieba to cut '是因为' into '是' and '因为'?

I was processing 影子的出现是因为有光 to tag the sentence for rare words and it scored much rarer than expected because of the 是因为 token.

cut_for_search on 是因为 gives ['因为', '是因为'] -- how often do the jieba cut functions duplicate parts of the input like that? Is that by design? It was a little surprising, but maybe it's part of how that function is tailored for search engines; I'm not sure.

Setting HMM=False still gives ['影子', '的', '出现', '是因为', '有', '光']

Unsure if this is a bug or by design. Is the right approach here to use a custom user dictionary limited to the top 20k words or so?

Apologies if this is pure user error, I am new to jieba and still trying to figure out all the features. Thanks for any recommendations.

brynne8 commented Apr 25, 2023

Since 是因为 is included in the jieba dictionary, both the default cut and the HMM cut will treat 是因为 as a single conjunction (c).

If you want the word split, you can remove it from the dictionary:

jieba.del_word('是因为')


manother commented Apr 25, 2023 via email
