better control over encoding used in preview #2895

erzoe · 2023-09-08T11:58:04Z

ISSUE TYPE

Bug fix

RUNTIME ENVIRONMENT

Operating system and version: Arch Linux
Terminal emulator and version: alacritty 0.12.2 (9d9982df)
Python version: Python 3.11.5
Ranger version/commit (before bugfix): 136416c
Locale: en_GB.UTF-8

CHECKLIST

The CONTRIBUTING document has been read [REQUIRED]
All changes follow the code style [REQUIRED]
All new and existing tests pass [REQUIRED]
Changes require config files to be updated
- Config files have been updated
Changes require documentation to be updated
- Documentation has been updated
Changes require tests to be updated
- Tests have been updated

DESCRIPTION

I have added two new settings:

preferred_encoding = utf-8
preferred_encoding_required_confidence = 0.5

preferred_encoding is used if chardet thinks it's plausible that the file to be previewed is encoded in preferred_encoding.
The required confidence can be configured with preferred_encoding_required_confidence.
Otherwise chardet's best guess is used.

This is based on a proposal by toonn, but I have set a lower required confidence because 0.9 still led to utf-8 files not being detected correctly (see erzoe@c430efc).

The previous behavior can be retained by setting preferred_encoding_required_confidence to a value greater than 1.
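
The selection rule described above can be sketched roughly as follows. This is an illustration only, not the code from this pull request: the function name `choose_encoding`, the `>=` comparison, and the fallback for an empty candidate list are assumptions. The candidate list has the same shape as the output of chardet's `detect_all`, and the sample values are the ones quoted in one of the commit messages of this pull request.

```python
# Hypothetical sketch of the selection rule; names and details are assumptions,
# not the actual implementation in ranger.

def choose_encoding(candidates, preferred_encoding="utf-8", required_confidence=0.5):
    """Pick an encoding from a list of chardet-style candidate dicts.

    Each candidate is expected to look like
    {'encoding': str or None, 'confidence': float, 'language': str}.
    """
    if not candidates:
        # Assumption: with nothing to go on, fall back to the preferred encoding.
        return preferred_encoding

    preferred = preferred_encoding.lower()
    for candidate in candidates:
        encoding = candidate.get("encoding")
        confidence = candidate.get("confidence", 0.0)
        # Use the preferred encoding if it is plausible enough.
        if encoding and encoding.lower() == preferred and confidence >= required_confidence:
            return encoding

    # Otherwise use the best guess, i.e. the candidate with the highest confidence.
    best = max(candidates, key=lambda c: c.get("confidence", 0.0))
    return best.get("encoding") or preferred_encoding


# Candidates reported for a utf-8 file in this pull request.
SAMPLE = [
    {'confidence': 0.6966666666666667, 'encoding': 'MacRoman', 'language': ''},
    {'confidence': 0.6566507177033493, 'encoding': 'ISO-8859-1', 'language': ''},
    {'confidence': 0.584476406746814, 'encoding': 'ISO-8859-9', 'language': 'Turkish'},
    {'confidence': 0.505, 'encoding': 'utf-8', 'language': ''},
    {'confidence': 0.2698659090271226, 'encoding': 'TIS-620', 'language': 'Thai'},
]

if __name__ == "__main__":
    print(choose_encoding(SAMPLE))                           # threshold 0.5
    print(choose_encoding(SAMPLE, required_confidence=0.9))  # threshold 0.9
    print(choose_encoding(SAMPLE, required_confidence=1.1))  # previous behavior
```

With a threshold of 0.5 the sample resolves to utf-8, because its confidence of 0.505 clears the threshold. With 0.9, or with any value greater than 1, the sketch falls back to the best guess, MacRoman.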

(I have also added a third setting show_encoding which I have used for testing.)

MOTIVATION AND CONTEXT

Previously, there was no way to control which encoding ranger uses when displaying a text file in the preview.
Instead, chardet was used to guess the file encoding, and in many cases it guessed wrong, so non-ASCII characters were not displayed correctly.

see #1948

TESTING

I have tested that utf-8 files are now correctly recognized as utf-8 and that non-utf-8 files (one euc-jp file and one iso-2022-jp file) are also correctly recognized.

to fix ranger#1948 I have added two new settings:
- preferred_encoding [string]
- preferred_encoding_required_confidence [float]

preferred_encoding is used if chardet thinks it's plausible that the file to be previewed is encoded in preferred_encoding.
The required confidence can be configured with preferred_encoding_required_confidence.
Otherwise chardet's best guess is used.

The previous behavior can be retained by setting preferred_encoding_required_confidence to a value greater than 1.

I have a utf-8 file which chardet detects as:

[{'confidence': 0.6966666666666667, 'encoding': 'MacRoman', 'language': ''},
 {'confidence': 0.6566507177033493, 'encoding': 'ISO-8859-1', 'language': ''},
 {'confidence': 0.584476406746814, 'encoding': 'ISO-8859-9', 'language': 'Turkish'},
 {'confidence': 0.505, 'encoding': 'utf-8', 'language': ''},
 {'confidence': 0.2698659090271226, 'encoding': 'TIS-620', 'language': 'Thai'}]

erzoe added 5 commits September 8, 2023 11:51

added setting show_encoding

cfb0054

I have used this to test the previous commit and thought that it might be interesting for some users. It is disabled by default because most users probably don't care.

added path to file in HACKING.md

7efc24d

make pylint and flake8 happy

a44cf30

erzoe mentioned this pull request Sep 10, 2023

Non-ASCII characters not shown properly on text preview #1948

Open
