better control over encoding used in preview #2895

Open
wants to merge 5 commits into
base: master

@erzoe erzoe commented Sep 8, 2023

ISSUE TYPE

  • Bug fix

RUNTIME ENVIRONMENT

  • Operating system and version: Arch Linux
  • Terminal emulator and version: alacritty 0.12.2 (9d9982df)
  • Python version: Python 3.11.5
  • Ranger version/commit (before bugfix): 136416c
  • Locale: en_GB.UTF-8

CHECKLIST

  • The CONTRIBUTING document has been read [REQUIRED]
  • All changes follow the code style [REQUIRED]
  • All new and existing tests pass [REQUIRED]
  • Changes require config files to be updated
    • Config files have been updated
  • Changes require documentation to be updated
    • Documentation has been updated
  • Changes require tests to be updated
    • Tests have been updated

DESCRIPTION

I have added two new settings:

  • preferred_encoding = utf-8
  • preferred_encoding_required_confidence = 0.5

preferred_encoding is used if chardet thinks it's plausible that the file to be previewed is encoded in preferred_encoding.
The required confidence can be configured with preferred_encoding_required_confidence.
Otherwise chardet's best guess is used.

This is based on a proposal by toonn, but I have set a lower required confidence because 0.9 still led to utf-8 files not being detected correctly (see erzoe@c430efc).

The previous behavior can be retained by setting preferred_encoding_required_confidence to a value greater than 1.
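
The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `choose_encoding`, the case-insensitive name comparison, and the use of `>=` for the threshold are assumptions. The candidate list has the shape returned by `chardet.detect_all()` and reuses the values reported for the utf-8 file in the commit message of this PR.

```python
# Hypothetical sketch of the selection rule; not the actual ranger code.

def choose_encoding(candidates, preferred_encoding="utf-8",
                    required_confidence=0.5):
    """Return preferred_encoding if it is plausible enough, else the best guess.

    candidates: list of dicts with 'encoding' and 'confidence' keys,
    in the shape returned by chardet.detect_all().
    """
    for candidate in candidates:
        name = candidate["encoding"]
        # Assumption: names are compared case-insensitively and the
        # threshold is inclusive (>=).
        if (name is not None
                and name.lower() == preferred_encoding.lower()
                and candidate["confidence"] >= required_confidence):
            return preferred_encoding
    if not candidates:
        return None
    # Otherwise fall back to the candidate with the highest confidence.
    return max(candidates, key=lambda c: c["confidence"])["encoding"]


# Candidates reported for a utf-8 file in this PR's commit message.
CANDIDATES = [
    {"confidence": 0.6966666666666667, "encoding": "MacRoman", "language": ""},
    {"confidence": 0.6566507177033493, "encoding": "ISO-8859-1", "language": ""},
    {"confidence": 0.584476406746814, "encoding": "ISO-8859-9", "language": "Turkish"},
    {"confidence": 0.505, "encoding": "utf-8", "language": ""},
    {"confidence": 0.2698659090271226, "encoding": "TIS-620", "language": "Thai"},
]

print(choose_encoding(CANDIDATES))                           # threshold 0.5
print(choose_encoding(CANDIDATES, required_confidence=0.9))  # threshold 0.9
print(choose_encoding(CANDIDATES, required_confidence=1.1))  # effectively disabled
```

With a threshold of 0.5 the utf-8 candidate (confidence 0.505) qualifies, whereas with 0.9 or any value greater than 1 the sketch falls back to the top-ranked guess, MacRoman in this case.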

(I have also added a third setting show_encoding which I have used for testing.)

MOTIVATION AND CONTEXT

Previously, there was no way to control which encoding ranger uses when displaying a text file in the preview.
Instead, chardet was used to guess the file encoding. In many cases it guessed wrong, so non-ASCII characters were not displayed correctly.

see #1948

TESTING

I have tested that utf-8 files are now correctly recognized as utf-8 and that non-utf-8 files (one euc-jp file and one iso-2022-jp file) are also correctly recognized.

to fix ranger#1948

I have added two new settings:
- preferred_encoding [string]
- preferred_encoding_required_confidence [float]

preferred_encoding is used if chardet thinks it's plausible that the file to be previewed is encoded in preferred_encoding.
The required confidence can be configured with preferred_encoding_required_confidence.
Otherwise chardet's best guess is used.

The previous behavior can be retained by setting preferred_encoding_required_confidence to a value greater than 1.
I have used the show_encoding setting to test the previous commit
and thought that it might be interesting for some users.
It is disabled by default because most users probably don't care.
I have a utf-8 file which chardet detects as:
[{'confidence': 0.6966666666666667, 'encoding': 'MacRoman', 'language': ''},
 {'confidence': 0.6566507177033493, 'encoding': 'ISO-8859-1', 'language': ''},
 {'confidence': 0.584476406746814, 'encoding': 'ISO-8859-9', 'language': 'Turkish'},
 {'confidence': 0.505, 'encoding': 'utf-8', 'language': ''},
 {'confidence': 0.2698659090271226, 'encoding': 'TIS-620', 'language': 'Thai'}]