Parsing [[[[[[… takes quadratic time #798

Open

andersk opened this issue Mar 6, 2019 · 25 comments
Labels
someday-maybe Approved low priority request.

Comments

@andersk
Contributor

andersk commented Mar 6, 2019

markdown.markdown('[' * 13000) takes about a minute (in 3.0.1 and master), with the time increasing by about a factor of four every time the length is doubled.
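
For reference, a rough way to observe the growth is a timing loop along these lines (a hypothetical snippet; absolute numbers are machine-dependent, only the ratio matters):

import time
import markdown

for n in (1000, 2000, 4000):
    start = time.perf_counter()
    markdown.markdown('[' * n)
    print(n, round(time.perf_counter() - start, 2))
# Each doubling of n roughly quadruples the time, i.e. O(n^2) behavior.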

@waylan
Member

waylan commented Mar 6, 2019

I expect this is because the link processor allows for nested brackets. On the bright side, it doesn't crash (many such things raised recursion errors back in version <2.0). The fact is, performance is not a high priority for us. If you are providing such an unrealistic input, then I don't really care that it is slow. Although, I'm curious how different this is for version 2.x, as we changed the bracket parsing in 3.0.

@facelessuser
Collaborator

Makes sense to me. I'm sure it's running through the entire algorithm for each [. And for each [, it runs to the end of the chain of [. So, if you double the size of the string, it goes through the entire string trying to resolve all the [. This allows you to nest [[]], which is a good thing.

Currently, I see no problem with this behavior as the current complaint is based on an unrealistic situation. A document consisting of a chain of 13,000 [ is not a good argument for changing anything. To me it seems like an academic exercise to show you can construct documents that take a long time to parse.

Before, we used a regular expression which failed miserably when tasked with nested []. The new algorithm allows it to handle as much nesting as needed. As far as performance is concerned, I would say it behaves as one would expect.

If we are just concerned someone could create a document that freezes up the parser, we could add a nested bracket limit just to prevent the parser from getting tied up in ridiculously large nested situations.

@andersk
Contributor Author

andersk commented Mar 6, 2019

Thanks for the quick response.

My perspective is that I’m concerned about denial of service attacks on applications parsing user input as Markdown. This isn’t an academic exercise—this very comment is user input that GitHub is parsing as Markdown, and you can imagine that they’d be really sad if I could easily cause their parser to spin for minutes or hours, whether maliciously or by accident. In an even less hypothetical sense, I’m filing this issue in the aftermath of a real denial of service attack that actually took down a large application via a complexity attack on Python-Markdown.

Now, obviously, you’re under no obligation to support my use case. You are free to decide that Python-Markdown is not designed for use on untrusted input. Part of the reason I’m filing this is that I’d like to know how these kinds of issues will be handled. Should I file more of them, or should I work on finding a Markdown library that better suits my requirements?

@facelessuser
Collaborator

facelessuser commented Mar 6, 2019

I’m filing this issue in the aftermath of a real denial of service attack that actually took down a large application via a complexity attack on Python-Markdown.

Context is important, and I think pointing this out is important when filing such an issue.

I suspected near the end of my reply that the threat of a hanging parser might have been the direction you were heading in, which is why I also suggested the possibility of limiting the recursion. In general, you aren't going to get a performance boost unless you cache the nested brackets that you've already evaluated to avoid reprocessing them again later as you move down the list of [ (not really practical in our current environment), or you simply limit the recursion depth.

By limiting the recursion depth, let's say by 10, instead of processing 13,000 nested [, then 12,999, etc., you only process 10 of them when testing each [.
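
As a rough sketch of that idea (a hypothetical helper, not Python-Markdown's actual code), a scanner that bails out past a fixed depth does bounded work per opening bracket, no matter how long the chain is:

def find_closing_bracket(data, start, max_depth=10):
    """Return the index of the ] that closes data[start], or None."""
    depth = 0
    for i in range(start, len(data)):
        if data[i] == '[':
            depth += 1
            if depth > max_depth:
                return None  # too deeply nested; treat the [ as literal text
        elif data[i] == ']':
            depth -= 1
            if depth == 0:
                return i
    return None  # unbalanced

# find_closing_bracket('[[a] b]', 0) == 6, while a chain like '[' * 13000 is
# abandoned after max_depth + 1 characters instead of being rescanned to the
# end of the string for every opening bracket.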

@waylan
Member

waylan commented Mar 6, 2019

This project's goals are documented as follows (emphasis added):

  • Maintain a Python 2 and Python 3 library (with an optional CLI wrapper) suited to use in web server environments (never raise an exception, never write to stdout, etc.) as an implementation of the markdown parser that follows the syntax rules and the behavior of the original (markdown.pl) implementation as reasonably as possible (see differences for a few exceptions).

  • Provide an Extension API which makes it possible to change and/or extend the behavior of the parser.

In general, the way that any Markdown parser informs the document author that they provided bad syntax is by outputting an ill-formed document. In other words, Markdown always returns output.

That said, we do no input sanitization. If input causes the parser to crash, we consider that a bug and will modify the parser to not crash. However, that is it. Performance is not a high priority (it is not even mentioned in the "goals"). If you are allowing your users to post large documents, then it is your responsibility to sanitize them properly (perhaps limit their size?).

By the way, for a good review of the issues involved in avoiding XSS attacks via Markdown I suggest reading Markdown and XSS by Michel Fortin (the developer of PHP Markdown). There is no Markdown parser which fully addresses that issue on its own. If you need to address this yourself, then I would think you should expect to address other attack vectors on your own as well.

In other words, as a security concern, this is out-of-scope for any Markdown parser to be concerned with. Regardless, there are some parsers out there which put a higher priority on performance. But they don't have the low barrier-to-entry for creating extensions. We simply have chosen a different set of compromises. Only you can decide which set of compromises meets your needs.

@waylan
Member

waylan commented Mar 6, 2019

By limiting the recursion depth, let's say by 10

I suppose it's noteworthy that in previous versions, we supported 6 levels of brackets. And that was because we used a regex which had six levels hardcoded into it. In over a decade, I don't recall ever getting a complaint about that limit. I suppose we could artificially limit the existing implementation to some arbitrary nesting level.

@facelessuser
Collaborator

It's an idea at least. I think the current algorithm is more robust than the regex was, but while it is nice to be able to support any amount of nesting, maybe endless nesting is a bit unrealistic 🤷‍♂️ .

@andersk
Contributor Author

andersk commented Mar 6, 2019

If you’re concerned about crashing with recursion errors, what are your thoughts on markdown.markdown('>' * 1000)?

(Personally I’m somewhat less concerned by that, since it’s much easier to detect and recover from an explicit crash.)

@facelessuser
Collaborator

There are areas where there may still be recursion, and possibly some areas that are hard for us to eliminate currently. And yes, with recursion, you can get crashes when the limit is exceeded. Generally, we try to avoid crashes where possible.

The point that was being made is that the new algorithm for handling [] wasn't introducing exceptions, something we try to avoid. It gave us the benefit of handling nested [] and () in links in a more sane and robust way without using recursion which can cause exceptions (which has been a problem in the past). But yes, if abused, it can cause cycles to get chewed up, something that limiting the nesting depth should solve.

It would still allow horizontal nesting like [[] [] []], which should perform much better than the biggest issue of massively deep nesting like [[[[[[[[[[[[[[[[[[[[[[[[]]]]]]]]]]]]]]]]]]]]]]]].

@andersk
Contributor Author

andersk commented Mar 6, 2019

There also seem to be problems in the horizontal direction: markdown.markdown('[]()' * 20000) takes quadratic time (and this case probably has a reasonably large performance impact on “real” documents, too).

@facelessuser
Collaborator

There also seem to be problems in the horizontal direction

When I mentioned horizontal, I was referring to horizontal nesting, which does not perform as badly because it doesn't re-evaluate the same things as often.

markdown.markdown('[]()' * 20000)

Unfortunately, I don't agree that there is a problem with this. This is simply evaluating 20,000 links. If you give the parser large enough data to parse, it's going to take a while. You can probably use almost any syntax with a big enough number to cause things to freeze up. At that point, maybe you should implement a timeout on your server, or limit the size of your input coming in.

@andersk
Contributor Author

andersk commented Mar 6, 2019

One would think that careful design would avoid needing 20,000 ⋅ 20,000 operations to parse 20,000 links; there’s no fundamental reason that each link needs to do something with each other link. I’ll grant that perhaps the current design of the parser makes this difficult to avoid, but I think it’s wrong to dismiss the problem as “simply too much data”. The question is whether there’s interest in trying to solve it.

@facelessuser
Collaborator

facelessuser commented Mar 6, 2019

If there's a legitimate problem then we'll of course fix it. We'll have to double-check what the algorithm is doing. If parsing time is exploding, it would most likely be because the algorithm doesn't cut off the parsing after the first link and instead tries to parse the whole chain of links, which might be possible. There might be a case that is causing that.

@mitya57
Collaborator

mitya57 commented Mar 6, 2019

If you’re concerned about crashing with recursion errors, what are your thoughts on markdown.markdown('>' * 1000)?

Can you please file a separate issue for this bug? (FWIW '>' * 331 is enough to reproduce it)

@andersk
Contributor Author

andersk commented Mar 6, 2019

@mitya57 Sure, filed as #799.

@waylan
Member

waylan commented Mar 6, 2019

The question is whether there’s interest in trying to solve it.

That's what I keep saying. This is not a priority for us. Personally, if you feed the parser such a ridiculous input you deserve to have it break. And if the problem is that untrusted users are feeding the parser, then it is your responsibility to sanitize their input.

Regarding the size limit; the parser holds the entire document in memory. So any plain text could break things just by being too big to fit in memory. So yeah, if you are accepting input from untrusted users, you need to have a size limit.

@facelessuser
Collaborator

facelessuser commented Mar 6, 2019

markdown.markdown('[]()' * 20000)

I briefly looked at this, as I found the statement odd that we were matching this 20,000 · 20,000. While I won't rule out that something is getting called that expands the time, I put in a few print statements just to see what was happening. My worry was that we were maybe parsing the entire string every time because the link didn't match due to the empty brackets; this does not seem to be the case.

Here we print every time we perform a match, and we log each character evaluated when looking at the content for text [text] and link (link). It is as I would expect: 10 runs for references that match nothing, and then 10 runs with links that match fine and only match the closing character for text and link:

>>> markdown.markdown('[]()' * 10)
---- Reference check
-- Match Text
]
---- Reference check
-- Match Text
]
---- Reference check
-- Match Text
]
---- Reference check
-- Match Text
]
---- Reference check
-- Match Text
]
---- Reference check
-- Match Text
]
---- Reference check
-- Match Text
]
---- Reference check
-- Match Text
]
---- Reference check
-- Match Text
]
---- Reference check
-- Match Text
]
----- Link check
-- Match Text
]
-- match link
)
----- Link check
-- Match Text
]
-- match link
)
----- Link check
-- Match Text
]
-- match link
)
----- Link check
-- Match Text
]
-- match link
)
----- Link check
-- Match Text
]
-- match link
)
----- Link check
-- Match Text
]
-- match link
)
----- Link check
-- Match Text
]
-- match link
)
----- Link check
-- Match Text
]
-- match link
)
----- Link check
-- Match Text
]
-- match link
)
----- Link check
-- Match Text
]
-- match link
)
'<p><a href=""></a><a href=""></a><a href=""></a><a href=""></a><a href=""></a><a href=""></a><a href=""></a><a href=""></a><a href=""></a><a href=""></a></p>'

Now, I know that Python-Markdown will then run the parser on the content of each element returned; I imagine that may be where time expands. The more elements you create, the more times you have to run the full inline processor. That is just how Python-Markdown currently does things.

Now, there is no content in these elements, so maybe it is wasting time running patterns on empty strings, or maybe it skips empty content; I don't know, as I didn't bother to dig deeper than this. It doesn't appear we are parsing the string via the link patterns in a way that would cause the ballooning of time, so that at least pacifies my initial concern.

If the time increase is only because we are recursively parsing element content, maybe we could do a simple check to test if there is any content to actually evaluate (if we aren't already doing that), but you'd be out of luck if you did this markdown.markdown('[text]()' * 20000) as we'd run the parser on every instance of text after generating the link.

Again, generating 20,000 links like this is an unrealistic case. I guess if something specifically was identified as problematic, I think we'd be open. Like "we can improve performance by X by changing code Y to Z". And pull requests are always welcome as well.

Right now, I think the case of markdown.markdown('[' * 13000) has a better chance of being addressed than markdown.markdown('[]()' * 20000), and even that I'm meh about, simply because of how absurd the pattern is, but limiting allowed nesting isn't unreasonable, I guess. Ultimately I won't move on this unless @waylan decides that this is something we want to address.

@waylan
Member

waylan commented Mar 6, 2019

I'm willing to entertain any PR as long as it maintains backward compatibility. I'll make an exception to the backward compatibility rule if someone wants to submit a PR limiting the nesting to some arbitrary number.

Regarding running patterns on empty content, I suspect the real problem there is the way we handle nesting. Specifically, we can't run a regex pattern across multiple parts of a string when those parts are stored in different elements in the ElementTree object. So we do some "magic" to combine them into a single string without losing the structure. That "magic" complicates the inline parsing by orders of magnitude and is the likely cause of many performance issues. Personally, I never liked the solution, but it is what we have. We discussed removing it for version 3.0 but ultimately decided not to, to maintain backward compatibility with existing extensions.

And that is the compromise that we have chosen to make. We are one of the most extensible implementations, but not a performant implementation. If you want both then you need to use a two-part implementation (a parser outputs an abstract syntax tree (AST) and a renderer renders that AST into HTML). Python-Markdown is not that and turning it into that would require starting over from scratch. If you want that, go use Mistune.

@andersk
Contributor Author

andersk commented Mar 6, 2019

@facelessuser I want to be clear: Python should be more than capable of doing 20,000 operations on empty or very short strings in a tiny fraction of a second. When I say it takes quadratic time, I mean that I measured markdown.markdown('[]()' * 20000) taking four times as long as markdown.markdown('[]()' * 10000) rather than twice as long. This can only be explained by doing 20,000 · 20,000 work.

The problem isn’t specific to empty text or empty URLs either, you can fill them with real content and observe the same slowdown.

You can observe some of this quadratic work here:

--- a/markdown/treeprocessors.py
+++ b/markdown/treeprocessors.py
@@ -310,6 +310,7 @@ class InlineProcessor(Treeprocessor):
         placeholder = self.__stashNode(node, pattern.type())
 
         if new_style:
+            print(len(data[:start]), len(placeholder), len(data[end:]))
             return "%s%s%s" % (data[:start],
                                placeholder, data[end:]), True, 0
         else:  # pragma: no cover

This gives 20,000 lines of output, summing to a total of about 9.13 · 20,000 · 20,000 bytes concatenated over the entire run.
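
This fits the classic quadratic string-building pattern: the data[:start] + placeholder + data[end:] rebuild shown in the diff copies O(n) characters for each of the n matches, whereas collecting pieces and joining once at the end is linear. A generic sketch of the two approaches (hypothetical helpers for illustration, not a drop-in change to InlineProcessor, which also has to keep its match offsets consistent):

def replace_rebuilding(data, needle, placeholder):
    # Mirrors the rebuild in the diff above: every match copies the entire
    # remaining string again, so n matches cost O(n * len(data)).
    pos = data.find(needle)
    while pos != -1:
        data = data[:pos] + placeholder + data[pos + len(needle):]
        pos = data.find(needle, pos + len(placeholder))
    return data

def replace_joining(data, needle, placeholder):
    # Linear alternative: accumulate the untouched spans and placeholders,
    # then concatenate once at the end.
    pieces, last = [], 0
    pos = data.find(needle)
    while pos != -1:
        pieces.append(data[last:pos])
        pieces.append(placeholder)
        last = pos + len(needle)
        pos = data.find(needle, last)
    pieces.append(data[last:])
    return ''.join(pieces)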

@waylan
Member

waylan commented Mar 6, 2019

@andersk the slowdown is most likely related to the issue in my last comment, not the patterns themselves. I know I've explained this before, but I can't find it so here goes....

Each inline processor returns an ElementTree object. And if a Markdown element spans multiple ElementTree elements, we can't find that with a single pattern match without some tricks. For example, consider this input with some nested elements: foo *[bar](#) baz*. The link pattern runs first, and now we have the following ElementTree objects (represented as pseudo JSON):

{ 
  'tag': 'p',
  'text': 'foo *',
  'children': [
    {
      'tag': 'a',
      'href': '#',
      'text': 'bar',
      'tail': ' baz*'
    }
  ]
}

And now we need to run the emphasis pattern. But how do we do that when the text is split across 3 separate strings?

The current solution takes the contents of the p tag and packs it up into a single string with placeholders. Then it runs the pattern. Finally, it unpacks the string back into the ElementTree objects. I suspect the extra time is spent with this packing and unpacking of placeholders, not running the patterns themselves. Unfortunately, any alternate solution we have considered has been backward incompatible in that any existing third-party extensions would no longer work. The compromise we have made is to overlook the performance hit so we don't lose our most important resource (a large library of third-party extensions).
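
For illustration only, a stripped-down sketch of the packing half of that idea (a hypothetical pack helper, not the actual Python-Markdown code, which stashes elements behind its own placeholder format):

import xml.etree.ElementTree as etree

def pack(parent, stash):
    # Serialize parent.text, each child, and each child's tail into one
    # string, replacing every child element with a placeholder token.
    parts = [parent.text or '']
    for child in parent:
        stash.append(child)
        parts.append('\x02child:%d\x03' % (len(stash) - 1))
        parts.append(child.tail or '')
    return ''.join(parts)

# The example from above: <p>foo *<a href="#">bar</a> baz*</p>
p = etree.Element('p')
p.text = 'foo *'
a = etree.SubElement(p, 'a', href='#')
a.text, a.tail = 'bar', ' baz*'

stash = []
packed = pack(p, stash)
# packed == 'foo *\x02child:0\x03 baz*' -- the emphasis pattern can now match
# across the stored <a> element; afterwards the string has to be unpacked back
# into ElementTree objects, which is where the expensive bookkeeping lives.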

You can argue till you're blue in the face that the parser should be faster, but unless you can provide a better (backwards compatible) solution to this problem, we are not going to move forward. And if you find this unacceptable, then I suggest looking at other libraries. That said, unfortunately, there are not many good options in Python-based Markdown parsers if performance is of high importance to you (see lepture/mistune#1).

@facelessuser
Collaborator

@andersk, yes, I understood your original statement. The opening post is definitely related to the pattern to some extent, but I wanted to make sure markdown.markdown('[]()' * 20000) was not pattern related, which, based on my assessment, it appears it isn't. By no means was I suggesting a slowdown is not present, only that it is not due to the way we parse the pattern, as in the case of markdown.markdown('[' * 13000).

Other than that, I think @waylan has highlighted what the real issue most likely is.

@andersk
Contributor Author

andersk commented Mar 6, 2019

I understand. I can accept “backwards compatibility with an API designed without realizing the complexity implications” as a good explanation, while vehemently rejecting “too much data”. Hopefully this issue will be considered in any future API redesigns, and if so, I will consider that a small victory.

Meanwhile, there are some complexity issues that probably could be fixed without a major redesign.

  • markdown.markdown('<' * 50000) spends quadratic time matching AUTOMAIL_RE against a single string.
  • markdown.markdown('<a' * 20000) spends quadratic time matching AUTOMAIL_RE and HTML_RE against a single string.
  • markdown.markdown('<ftp://' * 10000) spends quadratic time matching AUTOMAIL_RE, HTML_RE, and AUTOLINK_RE against a single string.
  • markdown.markdown('[](<' + '">' * 20000) spends quadratic time matching LinkInlineProcessor.RE_LINK against a single string.

These are even more potent because individual regex operations hold the GIL and can’t be interrupted, but they could be fixed with more careful regex design. When looking for delimited syntax, you need to make sure never to match multiple opening delimiters. For example,

-AUTOMAIL_RE = r'<([^> \!]*@[^> ]*)>'
+AUTOMAIL_RE = r'<([^<> \!]*@[^@<> ]*)>'
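
A quick, illustrative way to compare the two patterns (here finditer stands in for the parser repeatedly trying the pattern at each position; timings are machine-dependent): with the original pattern, every '<' starts a scan that runs to the end of the string before failing, while excluding '<' from the character classes makes each attempt stop at the next '<'.

import re
import time

OLD = re.compile(r'<([^> \!]*@[^> ]*)>')
NEW = re.compile(r'<([^<> \!]*@[^@<> ]*)>')

text = '<' * 20000
for name, pattern in (('old', OLD), ('new', NEW)):
    start = time.perf_counter()
    list(pattern.finditer(text))
    print(name, round(time.perf_counter() - start, 3))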

andersk added a commit to andersk/python-markdown that referenced this issue Mar 6, 2019
Part of the discussion in Python-Markdown#798.

Signed-off-by: Anders Kaseorg <andersk@mit.edu>
@waylan
Member

waylan commented Mar 6, 2019

Meanwhile, there are some complexity issues that probably could be fixed without a major redesign.

See also #161, specifically this comment. I suspect that when we closed that we only addressed the first issue listed there and never got to the rest.

andersk added a commit to andersk/python-markdown that referenced this issue Mar 6, 2019
Part of the discussion in Python-Markdown#798.

Signed-off-by: Anders Kaseorg <andersk@mit.edu>
@facelessuser
Collaborator

Yeah, better patterns are always good.

waylan pushed a commit that referenced this issue Mar 7, 2019
Part of the discussion in #798.

Signed-off-by: Anders Kaseorg <andersk@mit.edu>
@waylan waylan added the someday-maybe Approved low priority request. label Nov 6, 2019
@stsewd

stsewd commented Feb 17, 2024

In case it's useful to anyone, this regex also takes quadratic time

RE = re.compile(
r'^[ ]{0,3}\[([^\[\]]*)\]:[ ]*\n?[ ]*([^\s]+)[ ]*(?:\n[ ]*)?((["\'])(.*)\4[ ]*|\((.*)\)[ ]*)?$', re.MULTILINE
)

markdown.markdown('[]:' + ' ' * 50000)
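
A rough timing check of that pattern in isolation (illustrative numbers; the blow-up likely comes from the adjacent [ ]*\n?[ ]* before ([^\s]+), which makes the engine try every way of splitting a long run of trailing spaces between the two quantifiers before the required non-space token fails):

import re
import time

RE = re.compile(  # the pattern quoted above
    r'^[ ]{0,3}\[([^\[\]]*)\]:[ ]*\n?[ ]*([^\s]+)[ ]*(?:\n[ ]*)?((["\'])(.*)\4[ ]*|\((.*)\)[ ]*)?$',
    re.MULTILINE
)

for n in (5000, 10000, 20000):
    start = time.perf_counter()
    RE.search('[]:' + ' ' * n)
    print(n, round(time.perf_counter() - start, 2))
# Doubling the run of spaces roughly quadruples the search time.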
