
Proposal: new hook to detect CVE-2021-42574 (Trojan Source Attacks) #683

Open
yajo opened this issue Nov 5, 2021 · 17 comments

Comments

@yajo

yajo commented Nov 5, 2021

As explained in https://trojansource.codes/ and https://www.python.org/dev/peps/pep-0672/, projects are exposed to Trojan Source attacks, where UTF-8 strings contain invisible characters that make your program behave in unexpected ways.

Since those characters are valid UTF-8 and invisible, even manual source code review can miss them. We need automated tools to detect them.

This can affect many languages, so it would be great if a pre-commit hook were provided here that can forbid committing such poisoned sources.

@asottile
Member

asottile commented Nov 5, 2021

it is very likely to receive trojan source attacks

I don't think that's true but it is possible

I don't think this is easy to implement, especially if you intend to handle homoglyphs (Α vs A), short of banning all non-ASCII characters, which isn't productive

if you can come up with a concrete proposal it may be considered but in the current state it isn't feasible

@sirosen

sirosen commented Nov 9, 2021

I was wondering about this myself. Maybe we could do this by looking at unicodedata.category on all characters (possibly too slow?).

I'd default to rejecting any code which contains characters in the C* (control and format) categories, but we could also add flags to restrict the check to specific categories like Cc and Cf. It's restrictive, but it's the only sensible check I can think of doing.
If this sounds valuable, I'd be happy to put together a PR.

If that's still too aggressive because it rejects legitimate control characters in strings, and we're interested in this specifically for python, I could look at writing something with ast or libcst as a separate hook.
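The category-based idea above can be sketched in a few lines. This is a rough illustration, not an actual hook; the function name, the default categories, and the whitespace allow-list are all assumptions:

```python
import unicodedata

# Characters in the Cc (control) category that legitimately appear in
# source files and should not be flagged.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}

def find_control_chars(text, categories=("Cf", "Cc")):
    """Return (index, char, category) for each disallowed character."""
    hits = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        if cat in categories and ch not in ALLOWED_CONTROLS:
            hits.append((i, ch, cat))
    return hits
```

For example, a RIGHT-TO-LEFT OVERRIDE (U+202E) is in category Cf and would be reported, while tabs and newlines pass.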

@yajo
Author

yajo commented Nov 10, 2021

Quoting from the paper, chapter V.F. Defenses:

The simplest defense is to ban the use of text directionality control characters both in language specifications and in compilers implementing these languages.

In most settings, this simple solution may well be sufficient. If an application wishes to print text that requires Bidi overrides, developers can generate those characters using escape sequences rather than embedding potentially dangerous characters into source code.

So I'd say we can implement that simplest-defense approach, at least as a start. It can serve as a shield for any UTF-8-compatible language.

For most projects this should be fine, as there is usually a workaround, as explained in the quote.
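That simplest defense is easy to state concretely. A minimal sketch, using the override, embedding, and isolate characters named in the paper (the function name is illustrative):

```python
# The Bidi override, embedding, and isolate characters from the
# Trojan Source paper.
BIDI_CONTROLS = {
    "\u202a",  # LRE  LEFT-TO-RIGHT EMBEDDING
    "\u202b",  # RLE  RIGHT-TO-LEFT EMBEDDING
    "\u202c",  # PDF  POP DIRECTIONAL FORMATTING
    "\u202d",  # LRO  LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RLO  RIGHT-TO-LEFT OVERRIDE
    "\u2066",  # LRI  LEFT-TO-RIGHT ISOLATE
    "\u2067",  # RLI  RIGHT-TO-LEFT ISOLATE
    "\u2068",  # FSI  FIRST STRONG ISOLATE
    "\u2069",  # PDI  POP DIRECTIONAL ISOLATE
}

def contains_bidi_controls(text):
    """True if any Bidi control character appears anywhere in the text."""
    return any(ch in BIDI_CONTROLS for ch in text)
```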

This could be the next iteration, but it depends on what counts as a string and a comment in each language, so it becomes more complex and language-specific:

This simple defense can be improved by adding a small amount of nuance. By banning all directionality-control characters, users with legitimate Bidi-override use cases in comments are penalized. Therefore, a better defense might be to ban the use of unterminated Bidi override characters within string literals and comments. By ensuring that each override is terminated – that is, for example, that every LRI has a matching PDI – it becomes impossible to distort legitimate source code outside of string literals and comments.
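The nuanced defense quoted above can be sketched with a small stack check. A real hook would scope this to string literals and comments, which requires a per-language parser; checking that every opener on a line is terminated on that same line is an approximation, and the names below are illustrative:

```python
ISOLATES = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
EMBEDDINGS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
PDI, PDF = "\u2069", "\u202c"

def line_is_balanced(line):
    """True if every Bidi opener on the line is properly terminated
    (isolates by PDI, embeddings/overrides by PDF)."""
    stack = []
    for ch in line:
        if ch in ISOLATES:
            stack.append(PDI)
        elif ch in EMBEDDINGS:
            stack.append(PDF)
        elif ch in (PDI, PDF):
            if not stack or stack.pop() != ch:
                return False  # terminator with no matching opener
    return not stack  # every opener was terminated
```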

@asottile
Member

as stated above that is insufficient

@sirosen

sirosen commented Nov 10, 2021

@asottile, you mean it's insufficient for protecting against homoglyph confusion? That's true, but those attacks predate the recent realization that Bidi control characters can be used to disguise source code.

Most of our codebases are unlikely to contain directional control characters, even in strings and comments.
I think a hook which forbids the format control characters would provide value, as long as it's not indicated as a panacea for all Unicode-based issues.

I would like it best if such a hook were in this repo, but if not I can write one separately.

@asottile
Member

just because it isn't "new" doesn't mean it isn't a problem:

if user.type == 'ΑDMIN':
    ...
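The trick in the example above can be made visible with unicodedata: the first letter of the string is a Greek homoglyph, so the comparison fails even though nothing looks wrong in review:

```python
import unicodedata

print(unicodedata.name("A"))       # LATIN CAPITAL LETTER A
print(unicodedata.name("\u0391"))  # GREEK CAPITAL LETTER ALPHA
print("\u0391DMIN" == "ADMIN")     # False
```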

@sirosen

sirosen commented Nov 10, 2021

Sure, I won't argue that it isn't a problem, and maybe I shouldn't have brought up its age.

But, as you said, there's nothing productive to do about that short of banning non-ASCII text.

I don't see these two problems as inherently related. Are you worried that even if the check is named forbid-bidi-controls, users will think it protects against both?

@asottile
Member

I'm stating:

  1. protecting only against bidi characters is insufficient to protect against "trojan source attacks"
  2. "forbid-bidi-controls" is too specific to be generically helpful

@scop
Contributor

scop commented Nov 10, 2021

I think a check that verifies that files decode without errors in strict mode, using a given encoding supported by Python, would be useful to have on its own, not only for this purpose. For the thing at hand, configuring it to accept ASCII only would likely work just fine for many projects. For example, call the hook check-encoding and require the encoding to be specified with --encoding foo.
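The proposed check reduces to a strict decode per file. A minimal sketch (the hook name and flag are the proposal above, not an existing hook):

```python
def decodes_cleanly(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if the bytes decode under `encoding` in strict mode."""
    try:
        data.decode(encoding, errors="strict")
    except (UnicodeDecodeError, LookupError):
        return False
    return True
```

With --encoding ascii this would also reject every non-ASCII character, including the Bidi controls, which is why it catches this issue as a side effect.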

@scop
Contributor

scop commented Nov 11, 2021

check-encoding done in #685

@asottile
Member

I don't really think such a thing solves this problem -- most would use utf-8 (which would probably be the default for such a hook), and I think requiring ascii would be more of a pain in the butt than the "security" benefits are worth.

I think the correct design for "solving" this issue is to come up with a set of glyph ranges that are banned -- but we need to decide what those are (and perhaps augment them in the future)

I think the best approach forward is to draft a set of characters to "ban" and build a checker around that. (the checker itself would require strict UTF-8 as well)

thoughts?
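One way to sketch that design: require strict UTF-8, then scan for codepoints in a configurable banned set. The default set below is an assumption covering only the Bidi control blocks; the point of the proposal is that the real set would be drafted and extended over time:

```python
# Bidi embeddings/overrides (U+202A..U+202E) and isolates (U+2066..U+2069).
# Placeholder default; the proposal is to curate and extend this set.
DEFAULT_BANNED = frozenset(range(0x202A, 0x202F)) | frozenset(range(0x2066, 0x206A))

def check_file_bytes(data: bytes, banned=DEFAULT_BANNED):
    """Return a list of problems; an empty list means the file passes."""
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]
    return [
        f"banned character U+{ord(ch):04X} at index {i}"
        for i, ch in enumerate(text)
        if ord(ch) in banned
    ]
```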

@scop
Contributor

scop commented Nov 16, 2021

Re check-encoding, no other thoughts besides repeating myself: I think such an encoding checker is useful on its own completely irrespective of this issue (I know I've missed one myself at times), and for some who can afford ASCII or want it for other reasons, it would work for catching this one as well. I'm not suggesting it as a complete fix for this issue.

@asottile
Member

any decent programming language already presents encoding issues as syntax errors so I'm not sure where it would be useful

@scop
Contributor

scop commented Nov 16, 2021

This test is not limited to programming languages.

@asottile
Member

I understand but the vast majority of programming is writing code so I don't think it's that useful

@sirosen

sirosen commented Nov 16, 2021

Checking file encodings does not seem to be a solution to this issue. I write plenty of code that's all ASCII, but I also write code that uses non-ASCII characters. How would such a check help me?

I think the best approach forward is to draft a set of characters to "ban" and build a checker around that. (the checker itself would require strict UTF-8 as well)

thoughts?

This makes sense to me -- I can imagine how to implement it.
But I'm not certain what kinds of characters you're thinking to ban? You said formatting control characters was too narrow, and you've said homoglyphs are an area of concern. Are there other classes of characters of interest?

Some thoughts:

  • Is there an existing list from a reputable source for homoglyphs of latin characters? I don't know of one, but finding and listing them all ourselves seems both tedious and error prone.
  • Turkish, Polish, and several other languages use modified latin characters. Are those close enough to be worrying? if user.role == "admın": ...?
  • The Cyrillic and Greek alphabets have a wide variety of characters which render identically to latin but are encoded differently. Cyrillic has, e.g. х,а,с,о. Greek has, e.g. Α,Β,ο
  • Stylistic ligature characters can be used for latin text. e.g. if stage == "final": ... (I actually ran into this one, so I wrote a ligature fixer a few months ago which may provide useful source.)
  • For programmers writing code with a specific language of non-latin text, would we allow them to turn off sets of characters? e.g. check-banned-characters --allow-cyrillic? Perhaps we need # banned-characters: off comment support?
  • In my experience, many codebases which have only latin text for their main source may have lots of interesting unicode in their tests. Does that impact this? Perhaps it provides more motivation for pragma comments?
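A rough heuristic for the homoglyph bullets above: flag tokens that mix scripts, by grouping letters according to the script prefix of their Unicode name. This is a sketch, not a real confusables check (Unicode publishes an official confusables.txt for that), and it deliberately does not flag "admın", since dotless i is still a Latin letter:

```python
import unicodedata

def scripts_in(token):
    """Set of script prefixes (e.g. LATIN, CYRILLIC, GREEK) of the
    letters in the token, taken from each character's Unicode name."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0])
    return scripts

def mixes_scripts(token):
    """True if the token's letters come from more than one script."""
    return len(scripts_in(token)) > 1
```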

@scop
Contributor

scop commented Nov 16, 2021

I write plenty of code that's all ASCII, but I also write code that uses non-ASCII characters. How would such a check help me?

As said above, "I'm not suggesting it as a complete fix for this issue.". Even if it wouldn't help you, it could help others. Anyway that's not the primary reason for the check-encoding check, more like a side effect. Let's continue discussing check-encoding in #685, I've already posted my replies there to some comments related to it that are here.
