Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System should timeout on badly formed regex's - freezes on adding ignore text filter with a badly formed regex #2268

Open
jgupta opened this issue Mar 20, 2024 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@jgupta
Copy link
Contributor

jgupta commented Mar 20, 2024

Describe the bug
Software freezes on adding specific ignore text on a specific website

Version
0.45.16

To Reproduce

Steps to reproduce the behavior:

  1. Add URL 'http://gtu.ac.in/Recruitment.aspx' using Playwright Chromium.
  2. Add following Ignore text in filters.
/.*\bvisit.*\b.*/i
/.*\bhits.*\b.*/i
/.*\bpageviews.*\b.*/i
/.*\btemperature.*\b.*/i
/.*\bhumidity.*\b.*/i
/.*\bआगंतुक.*\b.*/i
/.*\bकॉपीराइट.*\b.*/i
/.*\bकुल विज़िट.*\b.*/i
/^\s*(\d+\s*)+$/
/Downloads:\ \d+/
Originaltext
Diese Übersetzung bewerten
Mit deinem Feedback können wir Google Übersetzer weiter verbessern
Original text
Rate this translation
Your feedback will be used to help improve Google Translate
원본 텍스트
번역 평가
보내주신 의견은 Google 번역을 개선하는 데 사용됩니다
原文
この翻訳を評価してください
いただいたフィードバックは Google 翻訳の改善に役立てさせ

Can't use share button because everything freezes as soon as above filter is added.
Expected behavior
All of the software should not freeze. If there some issue with ignore text or with the URL/website, it should not impact complete software.

Additional context
Note that above filter was added as global filter to thousands of watches but this issue is only with this one URL. It took us a while to zero down to this one URL.

@jgupta jgupta added the triage label Mar 20, 2024
@dgtlmoon dgtlmoon changed the title Software freezes on adding ignore text filter sankrit text? freezes on adding ignore text filter Mar 20, 2024
@dgtlmoon dgtlmoon changed the title sankrit text? freezes on adding ignore text filter sanskrit text? freezes on adding ignore text filter Mar 20, 2024
@jgupta jgupta changed the title sanskrit text? freezes on adding ignore text filter Hindi text? freezes on adding ignore text filter Mar 20, 2024
@dgtlmoon
Copy link
Owner

the problem is this regex specifically /^\s*(\d+\s*)+$/

can you explain what this is for?

@dgtlmoon
Copy link
Owner

Yes, it's quadratic time.  If the string being searched has N characters, first it fails to find "x" in all N of 'em, then `.*` advances by one and it fails to find "x" in the trailing N-1 characters, then again in the trailing N-2, and so on.  N + N-1 + N-2 + ... + 1 is quadratic in N.

That's how this kind of regexp engine works.  And it's mild, as such things go:  you can also create poor regexps that take time _exponential_ in N that fail to match certain strings.

It's unlikely this will change without replacing Python's regexp implementation entirely.  For why, see Jeffrey Friedl's book "Mastering Regular Expressions" (published by O'Reilly).  That book also spells out techniques for crafting regexps that don't suck ;-)  It's not a small topic, alas.

https://bugs.python.org/issue35915

@dgtlmoon dgtlmoon changed the title Hindi text? freezes on adding ignore text filter System should timeout on badly formed regex's - freezes on adding ignore text filter with a badly formed regex Mar 20, 2024
@dgtlmoon dgtlmoon added the enhancement New feature or request label Mar 20, 2024
@dgtlmoon dgtlmoon reopened this Mar 20, 2024
@dgtlmoon
Copy link
Owner

dgtlmoon commented Mar 20, 2024

the problem is your regex, the other problem is that the system doesnt timeout, the regex works but it takes an exponentially long time to use. your regex is bad and the system doesnt catch it.

the fix is to place the call to

changed_detected, update_obj, contents = update_handler.run_changedetection(uuid,
in a thread and wrap that thread with a timeout, on timeout, it should throw an error that suggests to check all regexs etc

maybe something like


def search_with_timeout(pattern, text, timeout=3):
    result = [None]

    def search_thread():
        result[0] = re.search(pattern, text)

    # Create and start the thread
    search_thread = threading.Thread(target=search_thread)
    search_thread.start()

    # Wait for the thread to finish or timeout
    search_thread.join(timeout)

    # If thread is still alive, it means it has exceeded the timeout
    if search_thread.is_alive():
        print("Search operation timed out!")
        # Terminate the thread
        search_thread.terminate()  # This method doesn't exist, it's just for illustration

    return result[0]

# Example usage
pattern = r'your_pattern_here'
text = 'your_text_here'

result = search_with_timeout(pattern, text)
if result:
    print("Match found:", result.group())
else:
    print("No match found within the timeout.")

@dgtlmoon dgtlmoon removed the triage label Mar 20, 2024
@jgupta
Copy link
Contributor Author

jgupta commented Mar 21, 2024

the problem is this regex specifically /^\s*(\d+\s*)+$/

can you explain what this is for?

I am not good at regex. It was created by ChatGPT4. My intention was to ignore lines that has just number or space. Below is entire ChatGPT 4 response.

To match a line containing only numbers that may include spaces between them, such as "8 4 6 2 1 9 6 6", and also match lines that have just a number without spaces, you can use the following regular expression:

/^\s*(\d+\s*)+$/ 

This regular expression does the following:

  • ^ asserts the position at the start of the line.
  • \s* matches any whitespace characters (like spaces or tabs) zero or more times.
  • (\d+\s*) is a group that matches one or more digits followed by zero or more whitespace characters.
  • + after the group (\d+\s*) means that this group can appear one or more times, allowing for multiple numbers separated by spaces.
  • $ asserts the position at the end of the line.

So, this regex will match a line with a single number, as well as lines with multiple numbers separated by spaces.

@jgupta
Copy link
Contributor Author

jgupta commented Mar 21, 2024

Following works and should be better as per ChatGPT4.

The regular expression you've provided /^\s*(\d+\s*)+$/ is intended to match lines that consist solely of numbers with optional whitespace characters between them. However, it is susceptible to what's known as "catastrophic backtracking," which can occur when the regex engine has to evaluate a large number of possible ways to match a pattern. This happens because the pattern (\d+\s*)+ is highly ambiguous: \d+ can match as many digit characters as possible, and \s* can match as many whitespace characters as possible. The + at the end allows this entire group to repeat, creating many possible combinations for the regex engine to try and match.

To improve the performance of this regex, we can try to make the quantifiers less ambiguous and remove unnecessary repetition. A revised version might look like this:

/^\s*\d+(?:\s+\d+)*\s*$/

Here’s what’s changed:

  • Instead of (\d+\s*)+, it now uses \d+(?:\s+\d+)*, which will match a number, followed by zero or more groups of one or more whitespace characters followed by another number. This pattern is less prone to backtracking because the + inside the non-capturing group (?: ... ) requires at least one whitespace character to be present for a match to continue, eliminating the ambiguity of \s*.
  • The non-capturing group (?: ... ) is used with * to match any additional numbers separated by whitespace without capturing them, which is more efficient in many regex engines.

This optimized pattern should perform much better because it guides the regex engine more precisely, reducing the potential for excessive backtracking.

@dgtlmoon
Copy link
Owner

I guess this is the classic old problem of pasting code that you dont understand fully.

@jgupta
Copy link
Contributor Author

jgupta commented Mar 29, 2024

I guess this is the classic old problem of pasting code that you dont understand fully.

So true. Exactly why a good software should have all kinds of safety nets wherever it allows user's to inject code in a input box.

Your proposed solution to timeout and throw error looks good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants