
'是因为' doesn't cut as expected #989

Open
brownbat opened this issue Apr 2, 2023 · 2 comments

brownbat commented Apr 2, 2023

What's the best way to get jieba to cut '是因为' into '是' and '因为'?

I was processing 影子的出现是因为有光 to tag the sentence for rare words and it scored much rarer than expected because of the 是因为 token.

cut_for_search on 是因为 gives ['因为', '是因为'] -- how often do the jieba cut functions duplicate parts of the input like that? Is that by design? It was a little surprising, but maybe it's part of how that function is tailored for search engines; I'm not sure.

Setting HMM=False still gives ['影子', '的', '出现', '是因为', '有', '光']

Unsure if this is a bug or by design. Is the right approach here to use a custom user dictionary limited to the top 20k words or so?

Apologies if this is pure user error, I am new to jieba and still trying to figure out all the features. Thanks for any recommendations.

brynne8 commented Apr 25, 2023

Since 是因为 is included in the jieba dictionary, both the default cut and the HMM cut will treat 是因为 as a single conjunction (c).

If you want the word split, you can remove it from the dictionary:

jieba.del_word('是因为')


manother commented Apr 25, 2023 via email
