Add check for text file encodings #685

scop · 2021-11-11T17:25:09Z

scop · 2021-11-11T17:27:48Z

Don't know if I'm too happy with the platform dependent default, perhaps defaulting to utf-8 would be better. No strong opinions between those.

scop · 2021-11-11T17:32:40Z

Also, incremental decoding instead of read() slurp would be nicer.

Can polish once we establish this will be accepted in the first place.

.pre-commit-hooks.yaml

scop · 2021-11-16T21:09:18Z

From #683 (comment)

any decent programming language already presents encoding issues as syntax errors so I'm not sure where it would be useful

Sadly, not all developers run test suites or test their code before pushing code to repos. Not all repos even have test suites for that matter. It's useful to be able to catch errors locally in pre-commit before they hit CI, or even staging/production.

Anyway, I think the opinion what qualifies as a "decent programming language" belongs here.

No matter what one thinks of them, for example C and JavaScript and Perl code can be encoded let's say in ISO-8859-1 without any encoding markers. I don't think there even is a way to convey that at least for C code, not sure for JS; for Perl there is use utf8; but I don't know if something similar exists for other encodings, and there might be reasons why one wants to have files in some particular, other encoding. And such files compile and run just fine.

Chances are that the ISO-8859-1 in them is not the intention though and they now don't behave correctly, perhaps one wanted UTF-8 instead. Or perhaps one wanted code that is not that brittle wrt source encoding in the first place and for that reason wants to restrict everything to ASCII to avoid bugs of this class altogether.

Or maybe it's not a runtime issue, one just wanted some comments/API docs in them in UTF-8 but a badly configured editor did it in ISO-8859-1 or vice versa, and now docs are not being generated right. This check would help.

From #683 (comment)

I understand but the vast majority of programming is writing code so I don't think it's that useful

What I'm trying to say is that just like pre-commit, this check can be useful for repos that don't have anything whatsoever to do with programming. Websites built with markdown, static HTML + CSS, other text content, including plain text, for whatever purpose.

asottile · 2021-11-16T21:23:40Z

any decent programming language already presents encoding issues as syntax errors so I'm not sure where it would be useful

Sadly, not all developers run test suites or test their code before pushing code to repos. Not all repos even have test suites for that matter. It's useful to be able to catch errors locally in pre-commit before they hit CI, or even staging/production.

yeah agreed, but I don't think it's our place to supply those checks when a language-specific tool can provide better diagnostic SyntaxErrors when it's a problem for them

my main concerns with this are:

it's not something I would ever see myself using, or recommending to others
it should default to UTF-8 and not platform specific
it's being proposed as a solution to bidi / homoglyphs / etc. of which I strongly believe it is not a proper solution to this
I think this particular "use X encoding for Y language" is better solved in language-specific tooling
if we accept this it'll be something more to maintain, I can already see the "please read my emacs comment headers and use that encoding" request which I really don't have an interest in supporting

scop · 2021-11-16T22:42:56Z

Some counterarguments:

No problem changing the default, UTF-8 probably is the better choice.
We can forget about the Proposal: new hook to detect CVE-2021-42574 (Trojan Source Attacks) #683 considerations, and people who know what they're doing can do what they want with this.
I don't think language specific tooling is really possible for a variety of cases this applies to.
There already is a bunch of checks here that hardcode an encoding or do the right thing with some specific ones only, and there's no support for those comments headers or editorconfigs or whatnot in them either.

tspearconquest · 2022-01-29T17:16:11Z

This would be useful in checking any kind of text files are encoded in the desired user-configurable encoding. Regardless of whether it treats or mitigates the CVE, this has value in tons of repositories.

Also, language-specific tooling means a lot of different things to a lot of people. I use vim and write a lot of terraform code and yaml for kubernetes deployments.

At least for Terraform, the cli will error out if, for example, you read a file with the file() function and that file "contains invalid UTF-8 sequences." However, that's not really what I'd call "language-specific tooling" because it depends on the user actually trying to run terraform apply and it operates on files that may exist locally in a CI system or the user's local filesystem where the terraform command is running, but external to the repo itself. So that doesn't confirm that the actual HCL config files for terraform are encoded in UTF8.

What I need is exactly this pre-commit hook to prevent files that have any encoding other than UTF-8 from being committed in the first place.

Respectfully, I request the addition of this hook.

scop mentioned this pull request Nov 11, 2021

Proposal: new hook to detect CVE-2021-42574 (Trojan Source Attacks) #683

Open

Add check for text file encodings

69d0dfb

scop force-pushed the check-encoding branch from c99d329 to 69d0dfb Compare November 11, 2021 17:26

yajo reviewed Nov 14, 2021

View reviewed changes

.pre-commit-hooks.yaml Show resolved Hide resolved

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add check for text file encodings #685

Add check for text file encodings #685

scop commented Nov 11, 2021

scop commented Nov 11, 2021

scop commented Nov 11, 2021

scop commented Nov 16, 2021

asottile commented Nov 16, 2021

scop commented Nov 16, 2021

tspearconquest commented Jan 29, 2022 •

edited

Add check for text file encodings #685

Are you sure you want to change the base?

Add check for text file encodings #685

Conversation

scop commented Nov 11, 2021

scop commented Nov 11, 2021

scop commented Nov 11, 2021

scop commented Nov 16, 2021

asottile commented Nov 16, 2021

scop commented Nov 16, 2021

tspearconquest commented Jan 29, 2022 • edited

tspearconquest commented Jan 29, 2022 •

edited