How to change the decoder #992

Open
XilaBro opened this issue Jun 13, 2023 · 3 comments

XilaBro commented Jun 13, 2023

I am trying to use Jieba together with Learning with Texts. I want jieba to insert a space between each "word" when run from the cmd. For example, 我想飞去北京 would be broken down into 我, 想, 飞, 去, 北京. I first tried python -m jieba -d ' ' input.txt >output.txt, but it would just keep printing "Prefix dict has been built successfully". I then tried python -m jieba -a file1 > file2 and got the error below:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\xilab\AppData\Local\Temp\jieba.cache
Loading model cost 1.173 seconds.
Prefix dict has been built successfully.
Traceback (most recent call last):
  File "C:\Users\xilab\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\xilab\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\xilab\lib\site-packages\jieba\__main__.py", line 52, in <module>
    ln = fp.readline()
  File "C:\Users\xilab\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 11: character maps to <undefined>

What do you guys think? Sorry for the poor formatting, this is my first post.
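
For readers following along: the last frames of the traceback go through cp1252.py, which suggests the file is being decoded with the Windows locale default rather than as UTF-8. The sketch below reproduces the same class of error in isolation. It assumes the input contains a character whose UTF-8 bytes include 0x9d; the right curly quote (U+201D) is used purely as an illustration, since the actual offending character in the file is not known. The helper name `cp1252_failure` is hypothetical.

```python
def cp1252_failure(text: str):
    """Encode text as UTF-8, then try to decode the bytes as cp1252.

    Returns the UnicodeDecodeError if decoding fails, otherwise None.
    """
    raw = text.encode("utf-8")
    try:
        raw.decode("cp1252")
    except UnicodeDecodeError as exc:
        return exc
    return None


# U+201D encodes to b'\xe2\x80\x9d'; byte 0x9d has no mapping in cp1252.
err = cp1252_failure("\u201d")
print(err)

# Plain ASCII decodes identically under both encodings, so no error here.
print(cp1252_failure("abc"))
```

If this is what is happening, the fix is less about the file and more about telling Python which encoding to use when reading it.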

manother commented Jun 13, 2023 via email

brynne8 commented Jun 14, 2023

From your description, it seems your input file is not encoded in UTF-8, which prevents Jieba from decoding and segmenting it properly. I would recommend:

Convert your input.txt file to UTF-8 encoding. As mentioned in Jieba's readme, "The input string can be an unicode/str object, or a str/bytes object which is encoded in UTF-8 or GBK. Note that using GBK encoding is not recommended because it may be unexpectly decoded as UTF-8." So UTF-8 is preferred.
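
An alternative that avoids relying on the default decoder is to open the file with an explicit encoding and call jieba from Python instead of through python -m jieba. A minimal sketch, assuming the file really is UTF-8 (with or without a BOM): `segment_file` and its parameters are hypothetical names introduced here, while `jieba.cut` is the library's segmentation function. The `utf-8-sig` codec is chosen so that a leading BOM is dropped rather than glued to the first word.

```python
def segment_file(src, dst, delimiter=" ", cut=None):
    """Read src as UTF-8, segment each line, and write the result to dst.

    cut is any callable that takes a string and returns an iterable of
    words. It defaults to jieba.cut, imported lazily so that the function
    can be exercised without jieba installed.
    """
    if cut is None:
        import jieba  # assumption: jieba is installed in this environment
        cut = jieba.cut

    # "utf-8-sig" reads plain UTF-8 too, and strips a BOM if one is present.
    with open(src, encoding="utf-8-sig") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            words = cut(line.rstrip("\r\n"))
            fout.write(delimiter.join(words) + "\n")
```

Calling `segment_file("input.txt", "output.txt")` would then write one segmented line per input line, with words separated by a space.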

Hope this helps you resolve the issue with using Jieba. Let me know if you have any other questions!


XilaBro commented Jun 14, 2023

Hey AlexanderMisel, turns out I've still got problems with it.

Initially, I tried putting the text in a Word .docx so I could choose the encoding, but I got the same problem. In the .docx I selected UTF-8, and the .txt reports UTF-8 BOM. Unfortunately, the error has not changed.

C:\Users\xilab>python -m jieba -d'' "C:\Users\xilab\Desktop\g.txt" > "C:\Users\xilab\Desktop\eb.txt"
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\xilab\AppData\Local\Temp\jieba.cache
Loading model cost 0.578 seconds.
Prefix dict has been built successfully.
Traceback (most recent call last):
  File "C:\Users\xilab\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\xilab\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\xilab\lib\site-packages\jieba\__main__.py", line 52, in <module>
    ln = fp.readline()
  File "C:\Users\xilab\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 14: character maps to <undefined>

Any thoughts? Thank you.
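
A note on the second traceback: it still ends in cp1252.py even though the file is reported as UTF-8 with BOM, which suggests the decoder is being picked from the Windows locale and not from the file. One thing that may be worth trying is Python's UTF-8 mode (python -X utf8 -m jieba ..., or setting the PYTHONUTF8=1 environment variable, available since Python 3.7); this has not been verified against this setup. To confirm what the file actually contains, a small check like the one below can help. It is only a sketch: `sniff_utf8` is a hypothetical helper, and it only distinguishes "UTF-8 with BOM", "valid UTF-8", and "something else".

```python
import codecs


def sniff_utf8(raw: bytes) -> str:
    """Classify raw bytes as 'utf-8-sig', 'utf-8', or 'not utf-8'."""
    # A UTF-8 BOM is the three bytes EF BB BF at the very start.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"


def sniff_file(path) -> str:
    """Read a file in binary mode so no decoding happens, then classify it."""
    with open(path, "rb") as fh:
        return sniff_utf8(fh.read())
```

If `sniff_file` on the input returns "utf-8-sig" or "utf-8", the file itself is fine and the problem lies in how it is opened.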
