
Feature request: safe mode #19

Open
DemiMarie opened this issue Jul 14, 2017 · 13 comments

Comments

@DemiMarie

This is a feature request for a “safe mode”: potentially-malicious content (such as certain URL schemes) is disallowed to prevent XSS attacks.

The output from this should be safely insertable into a webpage, without any further escaping or sanitization.

@mity
Owner

mity commented Jul 14, 2017

Ack. This is needed.

The open question is what exactly it should do.

My current blurry idea includes these points:

  • Disable raw HTML blocks and spans (at least for now I don't want to write a full HTML parser to do more complex filtering). This is to prevent any <script> etc.
  • Disable all links except those with a whitelisted scheme (http:, https:, mailto:, maybe a few others).

Opinions? Other ideas?

@DemiMarie
Author

It would be nice to be able to whitelist some HTML tags, like Snudown (Reddit’s non-conformant parser) can. But that would be hard. On the other hand, MD4C already must be able to parse out a tag for inline HTML, and in safe mode we can simply disallow anything that isn’t strictly valid. (Checking for basic syntax correctness isn’t that hard. The hard part in parsing HTML is handling invalid markup. But we can simply reject invalid markup.)

@mity
Owner

mity commented Jul 14, 2017

Whitelisting inline (correctly formed) tags is fine.

But the raw HTML block, with its start and end conditions, is quite a different story.

So, to be sure: you propose to treat raw HTML blocks differently in the safe mode and recognize them only if they are a sequence of complete and valid (whitelisted) tags. I.e., treat them almost the same way as a paragraph composed of nothing but raw HTML inlines, the only difference being that the block is not enclosed in <p>. Right?

I'll think about it.

@DemiMarie
Author

DemiMarie commented Jul 14, 2017

@mity I was thinking of ignoring HTML blocks entirely, which MD4C can already do. Much simpler and less prone to security vulnerabilities, both memory safety violations and whitelist validation failures. And HTML blocks are not needed for most applications that want a safe mode.

@mity
Owner

mity commented Jul 14, 2017

Ugh. That means that this:

<div>

Paragraph 1.

Paragraph 2.

</div>

would be translated to this:

<p><div></p>

<p>Paragraph 1.</p>

<p>Paragraph 2.</p>

<p></div></p>

@DemiMarie
Author

DemiMarie commented Jul 15, 2017 via email

mity added a commit that referenced this issue Jul 18, 2017
@mity
Owner

mity commented Jul 18, 2017

I tried to implement something along the lines discussed above. It is very incomplete, but I can already see this approach has some big problems. So I consider it just a proof of concept which shows its severe limitations, and shows that a real solution needs more work than I thought. (As always ;-)

The open problems:

  • Raw HTML blocks: Currently not handled at all.

  • filter_url_scheme() callback: Is it the right way? Maybe the app might want to see the whole link destination and not just the scheme. (Consider e.g. relative links without any scheme. The app might still want to forbid them if they start e.g. with .. or /.) That way, the interface could be unified with the attribute callback (see below). On the flip side, the app then must parse the scheme on its own if that's all it wants to filter.

  • Tag attributes: Should there not also be a filter_raw_html_tag_attribute()? Consider e.g. all those onclick, onload and similar HTML attributes.

  • Markdown constructs versus semantically equivalent raw HTML: Consider e.g. [innocent text](dangerous_url) versus <a href="dangerous_url">innocent text</a>. Should this not be somehow unified in the filtering interface? Maybe, for the purposes of filtering, the Markdown construct should be translated to a raw HTML tag/attribute, so the app does not have a chance to screw it up by forbidding one and allowing the second, or vice versa?

  • HTML comments or CDATA sections: Do we want to support them as well in the safe mode? I tend to say no, to keep it relatively simple.

mity added a commit that referenced this issue Jul 18, 2017
@DemiMarie
Author

DemiMarie commented Jul 23, 2017 via email

@DemiMarie
Author

DemiMarie commented Jul 23, 2017 via email

@craigbarnes
Contributor

craigbarnes commented Jul 23, 2017

Whitelisting specific elements isn't sufficient to stop scripting attacks. Attributes also need to be filtered:

<div style="background: red; width: 100%; height: 100%" onmouseover="alert('pwned')">

@mity
Owner

mity commented Jul 23, 2017

Whitelisting specific elements isn't sufficient to stop scripting attacks. Attributes also need to be filtered:

<div style="background: red; width: 100%; height: 100%" onmouseover="alert('pwned')">

Yeah, I have realized that too (see the list of open problems above). I am in the process of thinking about how the filtering interface should look.

@mity
Owner

mity commented Jul 23, 2017

Right now, I tend to discard the attempt above and start again.

As said, I am mostly wondering how the interface should look. I am especially interested in feedback on the following questions:

  • Is it enough to call a tag callback and then the attribute callback n times (once for each attribute)? Or are there cases where the filter needs to know all of a tag's attributes (and their values) at the same time?

  • If a single attribute is refused, should just that single attribute be skipped, or should the whole tag be downgraded from its "html tag" status and be treated as normal text? (Well, allowing the attribute callback to decide either way could perhaps be quite simple to do. But is it a good idea?)

  • It would be great for compactness of the API if Markdown links and images were treated the same way (for the purposes of filtering), and through the same callbacks, as raw <a> and <img> tags (including the attributes href and title; and src and alt). However, CommonMark allows complicated things (e.g. escapes, entities) in those contexts. Image contents especially are a huge complication: they may contain any valid Markdown, including links and nested images, and all of that is mapped to the alt attribute. This already meant quite a big complication and API changes in the past (the current struct MD_ATTRIBUTE and the change in how images are rendered). So I am wondering whether this unification is a good or bad idea.

@craigbarnes
Contributor

craigbarnes commented Jul 23, 2017

Is it enough to call a tag callback and then (n-times) attribute callback (for each attribute)? Or are there cases where the filter needs to know all tags attributes (and their values) at the same time?

The only cases where you need to know the element name, attribute name and attribute value at the same time are a[href], area[href] and img[src].

The large majority of attributes suitable for whitelisting are global attributes (applicable/safe for all elements), where the value is implicitly safe (doesn't need to be known by the filter, doesn't depend on the presence of other attributes).

I've implemented a minimal HTML5 DOM sanitizer in Lua here. The safeness of the code depends on the correctness of the underlying HTML5 parser, but the set of whitelisting rules should be similar to what you might want to use here.

If a single attribute is refused, should just be that single attribute be skipped, or should the whole tag be downgraded from the "html tag" status and be treated as a normal text?

Just that attribute should be omitted. I can't think of any case where removing a non-whitelisted attribute from a whitelisted element would make it unsafe.
