
Proposal: new hook to detect CVE-2021-42574 (Trojan Source Attacks) #683

Open
yajo opened this issue Nov 5, 2021 · 17 comments

Comments

@yajo

yajo commented Nov 5, 2021

As explained in https://trojansource.codes/ and https://www.python.org/dev/peps/pep-0672/, projects are exposed to Trojan Source attacks, where UTF-8 strings contain invisible characters that make your program behave in unexpected ways.

Since those characters are valid UTF-8 and invisible, even manual source code review can miss them. We need automated tools to detect them.

This can affect many languages, so it would be great if a pre-commit hook were provided here that can forbid committing such poisoned sources.

@asottile
Member

asottile commented Nov 5, 2021

it is very likely to receive trojan source attacks

I don't think that's true but it is possible

I don't think this is easy to implement, especially if you intend to handle homoglyphs (Α vs A), short of banning all non-ASCII characters, which isn't productive

if you can come up with a concrete proposal it may be considered but in the current state it isn't feasible

@sirosen

sirosen commented Nov 9, 2021

I was wondering about this myself. Maybe we could do this by looking at unicodedata.category on all characters (possibly too slow?).

I'd default to rejecting any code which contains characters in the C* (control and format) categories, but we could also add flags to restrict the check to specific categories like Cc and Cf. It's restrictive, but it's the only sensible check I can think of doing.
If this sounds valuable, I'd be happy to put together a PR.

If that's still too aggressive because it rejects legitimate control characters in strings, and we're interested in this specifically for python, I could look at writing something with ast or libcst as a separate hook.
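The category-based idea above can be sketched in a few lines. This is a rough illustration, not an actual hook; the function name, the default categories, and the whitespace allow-list are all assumptions:

```python
import unicodedata

# Characters in the Cc (control) category that legitimately appear in
# source files and should not be flagged.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}

def find_control_chars(text, categories=("Cf", "Cc")):
    """Return (index, char, category) for each disallowed character."""
    hits = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        if cat in categories and ch not in ALLOWED_CONTROLS:
            hits.append((i, ch, cat))
    return hits
```

For example, a RIGHT-TO-LEFT OVERRIDE (U+202E) is in category Cf and would be reported, while tabs and newlines pass.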

@yajo
Author

yajo commented Nov 10, 2021

Quoting from the paper, chapter V.F. Defenses:

The simplest defense is to ban the use of text directionality control characters both in language specifications and in compilers implementing these languages.

In most settings, this simple solution may well be sufficient. If an application wishes to print text that requires Bidi overrides, developers can generate those characters using escape sequences rather than embedding potentially dangerous characters into source code.

So I'd say we can implement that simplest-defense approach, at least as a start. It can serve as a shield for any UTF-8-compatible language.

For most projects this should be fine, as there is usually a workaround, as explained in the quote.
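That simplest defense is easy to state concretely. A minimal sketch, using the override, embedding, and isolate characters named in the paper (the function name is illustrative):

```python
# The Bidi override, embedding, and isolate characters from the
# Trojan Source paper.
BIDI_CONTROLS = {
    "\u202a",  # LRE  LEFT-TO-RIGHT EMBEDDING
    "\u202b",  # RLE  RIGHT-TO-LEFT EMBEDDING
    "\u202c",  # PDF  POP DIRECTIONAL FORMATTING
    "\u202d",  # LRO  LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RLO  RIGHT-TO-LEFT OVERRIDE
    "\u2066",  # LRI  LEFT-TO-RIGHT ISOLATE
    "\u2067",  # RLI  RIGHT-TO-LEFT ISOLATE
    "\u2068",  # FSI  FIRST STRONG ISOLATE
    "\u2069",  # PDI  POP DIRECTIONAL ISOLATE
}

def contains_bidi_controls(text):
    """True if any Bidi control character appears anywhere in the text."""
    return any(ch in BIDI_CONTROLS for ch in text)
```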

This could be the next iteration, but it depends on what counts as a string and a comment in each language, so it becomes more complex and language-specific:

This simple defense can be improved by adding a small amount of nuance. By banning all directionality-control characters, users with legitimate Bidi-override use cases in comments are penalized. Therefore, a better defense might be to ban the use of unterminated Bidi override characters within string literals and comments. By ensuring that each override is terminated – that is, for example, that every LRI has a matching PDI – it becomes impossible to distort legitimate source code outside of string literals and comments.
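The nuanced defense quoted above can be sketched with a small stack check. A real hook would scope this to string literals and comments, which requires a per-language parser; checking that every opener on a line is terminated on that same line is an approximation, and the names below are illustrative:

```python
ISOLATES = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
EMBEDDINGS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
PDI, PDF = "\u2069", "\u202c"

def line_is_balanced(line):
    """True if every Bidi opener on the line is properly terminated
    (isolates by PDI, embeddings/overrides by PDF)."""
    stack = []
    for ch in line:
        if ch in ISOLATES:
            stack.append(PDI)
        elif ch in EMBEDDINGS:
            stack.append(PDF)
        elif ch in (PDI, PDF):
            if not stack or stack.pop() != ch:
                return False  # terminator with no matching opener
    return not stack  # every opener was terminated
```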

@asottile
Member

as stated above that is insufficient

@sirosen

sirosen commented Nov 10, 2021

@asottile, you mean it's insufficient for protecting against homoglyph confusion? That's true, but those attacks predate the recent realization that Bidi control characters can be used to disguise source code.

Most of our codebases are unlikely to contain directional control characters, even in strings and comments.
I think a hook which forbids the format control characters would provide value, as long as it's not indicated as a panacea for all Unicode-based issues.

I would like it best if such a hook were in this repo, but if not I can write one separately.

@asottile
Member

just because it isn't "new" doesn't mean it isn't a problem:

if user.type == 'ΑDMIN':
    ...
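The trick in the example above can be made visible with unicodedata: the first letter of the string is a Greek homoglyph, so the comparison fails even though nothing looks wrong in review:

```python
import unicodedata

print(unicodedata.name("A"))       # LATIN CAPITAL LETTER A
print(unicodedata.name("\u0391"))  # GREEK CAPITAL LETTER ALPHA
print("\u0391DMIN" == "ADMIN")     # False
```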

@sirosen

sirosen commented Nov 10, 2021

Sure, I won't argue that it isn't a problem, and maybe I shouldn't have brought up its age.

But, as you said, there's nothing productive to do about that short of banning non-ASCII text.

I don't see these two problems as inherently related. Are you worried that even if the check is named forbid-bidi-controls, users will think it protects against both?

@asottile
Member

I'm stating:

  1. protecting only against bidi characters is insufficient to protect against "trojan source attacks"
  2. "forbid-bidi-controls" is too specific to be generically helpful

@scop
Contributor

scop commented Nov 10, 2021

I think a check that verifies that files decode without errors in strict mode, using a given encoding supported by Python, would be useful to have on its own, not only for this purpose. For the thing at hand, configuring it to accept ASCII only would likely work just fine for many projects. For example, call the hook check-encoding and require the encoding to be specified with --encoding foo.
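The proposed check reduces to a strict decode per file. A minimal sketch (the hook name and flag are the proposal above, not an existing hook):

```python
def decodes_cleanly(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if the bytes decode under `encoding` in strict mode."""
    try:
        data.decode(encoding, errors="strict")
    except (UnicodeDecodeError, LookupError):
        return False
    return True
```

With --encoding ascii this would also reject every non-ASCII character, including the Bidi controls, which is why it catches this issue as a side effect.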

@scop
Contributor

scop commented Nov 11, 2021

check-encoding done in #685

@asottile
Member

I don't really think such a thing solves this problem -- most would use utf-8 (which would probably be the default for such a hook), and I think requiring ascii would be more of a pain in the butt than the "security" benefits are worth.

I think the correct design for "solving" this issue is to come up with a set of glyph ranges that are banned -- but we need to decide what those are (and perhaps augment them in the future)

I think the best approach forward is to draft a set of characters to "ban" and build a checker around that. (the checker itself would require strict UTF-8 as well)

thoughts?
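One way to sketch that design: require strict UTF-8, then scan for codepoints in a configurable banned set. The default set below is an assumption covering only the Bidi control blocks; the point of the proposal is that the real set would be drafted and extended over time:

```python
# Bidi embeddings/overrides (U+202A..U+202E) and isolates (U+2066..U+2069).
# Placeholder default; the proposal is to curate and extend this set.
DEFAULT_BANNED = frozenset(range(0x202A, 0x202F)) | frozenset(range(0x2066, 0x206A))

def check_file_bytes(data: bytes, banned=DEFAULT_BANNED):
    """Return a list of problems; an empty list means the file passes."""
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]
    return [
        f"banned character U+{ord(ch):04X} at index {i}"
        for i, ch in enumerate(text)
        if ord(ch) in banned
    ]
```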

@scop
Contributor

scop commented Nov 16, 2021

Re check-encoding, no other thoughts besides repeating myself: I think such an encoding checker is useful on its own completely irrespective of this issue (I know I've missed one myself at times), and for some who can afford ASCII or want it for other reasons, it would work for catching this one as well. I'm not suggesting it as a complete fix for this issue.

@asottile
Member

any decent programming language already presents encoding issues as syntax errors so I'm not sure where it would be useful

@scop
Contributor

scop commented Nov 16, 2021

This test is not limited to programming languages.

@asottile
Member

I understand but the vast majority of programming is writing code so I don't think it's that useful

@sirosen

sirosen commented Nov 16, 2021

Checking file encodings does not seem to be a solution to this issue. I write plenty of code that's all ASCII, but I also write code that uses non-ASCII characters. How would such a check help me?

I think the best approach forward is to draft a set of characters to "ban" and build a checker around that. (the checker itself would require strict UTF-8 as well)

thoughts?

This makes sense to me -- I can imagine how to implement it.
But I'm not certain what kinds of characters you're thinking to ban? You said formatting control characters was too narrow, and you've said homoglyphs are an area of concern. Are there other classes of characters of interest?

Some thoughts:

  • Is there an existing list from a reputable source for homoglyphs of latin characters? I don't know of one, but finding and listing them all ourselves seems both tedious and error prone.
  • Turkish, Polish, and several other languages use modified latin characters. Are those close enough to be worrying? if user.role == "admın": ...?
  • The Cyrillic and Greek alphabets have a wide variety of characters which render identically to latin but are encoded differently. Cyrillic has, e.g. х,а,с,о. Greek has, e.g. Α,Β,ο
  • Stylistic ligature characters can be used for latin text. e.g. if stage == "final": ... (I actually ran into this one, so I wrote a ligature fixer a few months ago which may provide useful source.)
  • For programmers writing code with a specific language of non-latin text, would we allow them to turn off sets of characters? e.g. check-banned-characters --allow-cyrillic? Perhaps we need # banned-characters: off comment support?
  • In my experience, many codebases which have only latin text for their main source may have lots of interesting unicode in their tests. Does that impact this? Perhaps it provides more motivation for pragma comments?
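A rough heuristic for the homoglyph bullets above: flag tokens that mix scripts, by grouping letters according to the script prefix of their Unicode name. This is a sketch, not a real confusables check (Unicode publishes an official confusables.txt for that), and it deliberately does not flag "admın", since dotless i is still a Latin letter:

```python
import unicodedata

def scripts_in(token):
    """Set of script prefixes (e.g. LATIN, CYRILLIC, GREEK) of the
    letters in the token, taken from each character's Unicode name."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0])
    return scripts

def mixes_scripts(token):
    """True if the token's letters come from more than one script."""
    return len(scripts_in(token)) > 1
```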

@scop
Contributor

scop commented Nov 16, 2021

I write plenty of code that's all ASCII, but I also write code that uses non-ASCII characters. How would such a check help me?

As said above, "I'm not suggesting it as a complete fix for this issue.". Even if it wouldn't help you, it could help others. Anyway that's not the primary reason for the check-encoding check, more like a side effect. Let's continue discussing check-encoding in #685, I've already posted my replies there to some comments related to it that are here.
