Windows curly quotes trip up cchardet #26

Closed
craigds opened this issue Apr 10, 2017 · 2 comments
craigds commented Apr 10, 2017

Forgive me if this is the wrong place for this - I'm somewhat ignorant of the internal workings of cchardet.

\x92 seems to cause strings to be interpreted as this Central European encoding:

>>> cchardet.detect('Bob\x92s Burgers')
{'confidence': 0.8183978796005249, 'encoding': u'MacCentralEurope'}
>>> print 'Bob\x92s Burgers'.decode('MacCentralEurope')
Bobís Burgers

I don't know enough about Central European languages to comment on whether that's a good choice. I do know that curly quotes as produced by MS Word are quite common, so interpreting them badly seems like a fairly obvious bug.

A correct choice for that string would probably be windows-1252, where byte 0x92 is the right single quotation mark (U+2019), so the curly quote renders correctly.

To be fair, this same string trips up chardet too (a bit differently). So I guess this must not be a trivially-obvious situation:

>>> chardet.detect('Bob\x92s Burgers')
{'confidence': 0.846643894804694, 'encoding': 'ISO-8859-2'}
>>> print 'Bob\x92s Burgers'.decode('ISO-8859-2')
Bobs Burgers
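
For reference, here is a minimal Python 3 sketch (the transcripts above are Python 2) of how the same bytes decode under the three encodings mentioned in this comment. It only uses Python's built-in codecs, not cchardet or chardet. The codec names `cp1252`, `mac_latin2`, and `iso8859_2` are Python's standard names for windows-1252, MacCentralEurope, and ISO-8859-2. The variable names are mine, just for illustration.

```python
# Python 3: decode the same bytes under each encoding discussed above.
raw = b"Bob\x92s Burgers"

# Label used in this thread -> Python standard codec name
candidates = {
    "windows-1252": "cp1252",
    "MacCentralEurope": "mac_latin2",
    "ISO-8859-2": "iso8859_2",
}

decoded = {label: raw.decode(codec) for label, codec in candidates.items()}

for label, text in decoded.items():
    # repr() makes otherwise invisible control characters visible
    print(f"{label:18} {text!r}")
```

With ISO-8859-2, byte 0x92 decodes to the C1 control character U+0092, which most terminals do not display. That would explain why the chardet result prints as "Bobs Burgers" with no visible apostrophe.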

craigds commented Apr 10, 2017

The data we originally found this in came from a bigger dataset. Running it through cchardet 1.1.3 and 2.0 gives quite different results:

# cchardet 1.1.3 
$ ./cchardet-detect.py data.csv
{'confidence': 0.8283712863922119, 'encoding': u'WINDOWS-1252'}

# cchardet 2.0
$ ./cchardet-detect.py data.csv
{'confidence': 0.9261142015457153, 'encoding': u'MacCentralEurope'}

The data is under a proprietary license, but if it's needed to solve the issue I can ask for permission to supply it.

jayvdb commented Jul 29, 2019

This bug still occurs, and the chardet package gets this right.

Very likely this is a bug in https://github.com/PyYoshi/uchardet, the underlying library powering cChardet.
