Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add check for text file encodings #685

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

scop
Copy link
Contributor

@scop scop commented Nov 11, 2021

Refs #683

@scop
Copy link
Contributor Author

scop commented Nov 11, 2021

Don't know if I'm too happy with the platform dependent default, perhaps defaulting to utf-8 would be better. No strong opinions between those.

@scop
Copy link
Contributor Author

scop commented Nov 11, 2021

Also, incremental decoding instead of read() slurp would be nicer.

Can polish once we establish this will be accepted in the first place.

@scop
Copy link
Contributor Author

scop commented Nov 16, 2021

From #683 (comment)

any decent programming language already presents encoding issues as syntax errors so I'm not sure where it would be useful

Sadly, not all developers run test suites or test their code before pushing code to repos. Not all repos even have test suites for that matter. It's useful to be able to catch errors locally in pre-commit before they hit CI, or even staging/production.

Anyway, I think the opinion what qualifies as a "decent programming language" belongs here.

No matter what one thinks of them, for example C and JavaScript and Perl code can be encoded let's say in ISO-8859-1 without any encoding markers. I don't think there even is a way to convey that at least for C code, not sure for JS; for Perl there is use utf8; but I don't know if something similar exists for other encodings, and there might be reasons why one wants to have files in some particular, other encoding. And such files compile and run just fine.

Chances are that the ISO-8859-1 in them is not the intention though and they now don't behave correctly, perhaps one wanted UTF-8 instead. Or perhaps one wanted code that is not that brittle wrt source encoding in the first place and for that reason wants to restrict everything to ASCII to avoid bugs of this class altogether.

Or maybe it's not a runtime issue, one just wanted some comments/API docs in them in UTF-8 but a badly configured editor did it in ISO-8859-1 or vice versa, and now docs are not being generated right. This check would help.

From #683 (comment)

I understand but the vast majority of programming is writing code so I don't think it's that useful

What I'm trying to say is that just like pre-commit, this check can be useful for repos that don't have anything whatsoever to do with programming. Websites built with markdown, static HTML + CSS, other text content, including plain text, for whatever purpose.

@asottile
Copy link
Member

any decent programming language already presents encoding issues as syntax errors so I'm not sure where it would be useful

Sadly, not all developers run test suites or test their code before pushing code to repos. Not all repos even have test suites for that matter. It's useful to be able to catch errors locally in pre-commit before they hit CI, or even staging/production.

yeah agreed, but I don't think it's our place to supply those checks when a language-specific tool can provide better diagnostic SyntaxErrors when it's a problem for them


my main concerns with this are:

  • it's not something I would ever see myself using, or recommending to others
  • it should default to UTF-8 and not platform specific
  • it's being proposed as a solution to bidi / homoglyphs / etc. of which I strongly believe it is not a proper solution to this
  • I think this particular "use X encoding for Y language" is better solved in language-specific tooling
  • if we accept this it'll be something more to maintain, I can already see the "please read my emacs comment headers and use that encoding" request which I really don't have an interest in supporting

@scop
Copy link
Contributor Author

scop commented Nov 16, 2021

Some counterarguments:

  • No problem changing the default, UTF-8 probably is the better choice.
  • We can forget about the Proposal: new hook to detect CVE-2021-42574 (Trojan Source Attacks) #683 considerations, and people who know what they're doing can do what they want with this.
  • I don't think language specific tooling is really possible for a variety of cases this applies to.
  • There already is a bunch of checks here that hardcode an encoding or do the right thing with some specific ones only, and there's no support for those comments headers or editorconfigs or whatnot in them either.

@tspearconquest
Copy link

tspearconquest commented Jan 29, 2022

This would be useful in checking any kind of text files are encoded in the desired user-configurable encoding. Regardless of whether it treats or mitigates the CVE, this has value in tons of repositories.

Also, language-specific tooling means a lot of different things to a lot of people. I use vim and write a lot of terraform code and yaml for kubernetes deployments.

At least for Terraform, the cli will error out if, for example, you read a file with the file() function and that file "contains invalid UTF-8 sequences." However, that's not really what I'd call "language-specific tooling" because it depends on the user actually trying to run terraform apply and it operates on files that may exist locally in a CI system or the user's local filesystem where the terraform command is running, but external to the repo itself. So that doesn't confirm that the actual HCL config files for terraform are encoded in UTF8.

What I need is exactly this pre-commit hook to prevent files that have any encoding other than UTF-8 from being committed in the first place.

Respectfully, I request the addition of this hook.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants