Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Markup escaping not working as expected #878

Closed
ronf opened this issue Jan 5, 2021 · 10 comments
Closed

[BUG] Markup escaping not working as expected #878

ronf opened this issue Jan 5, 2021 · 10 comments
Labels
accepted Task was accepted bug Something isn't working

Comments

@ronf
Copy link

ronf commented Jan 5, 2021

Describe the bug
It's a bit of an edge case, but right now it's not possible to output a backslash immediately before a square bracket starting a markup token. While it does properly output a single backslash if you feed in r'\\' followed by a markup token, it still disables interpreting the token as markup.

To Reproduce
Working cases:

rich.print(r'[bold]text[/]') prints "text"
rich.print(r'\[not a tag]') prints "[not a tag]"

Not working case:
rich.print(r'\\[bold]some text[/]') gives an error about an unexpected close tag, because it doesn't treat "[bold]" as markup. The expected result here was "\text", a single backslash followed by the text in bold.

Adding more backslashes also produces a surprising result. It seems like the rightmost backslash is always consumed to disable markup on the square bracket, but all the previous backslashes are passed through as-is, rather than having every other backslash escape the following one. So, with three backslashes, I was expecting one to be output (consuming the first two of the input) and one to be left over to escape the square bracket and disable markup. WIth four backslashes, I was expecting two backslashes to be output (consuming all four of the input) and the square bracket after that to be treated as markup.

Platform
Rich 9.5.1 on macOS Mojave (10.14.6) with Apple's Terminal.app.

@ronf
Copy link
Author

ronf commented Jan 5, 2021

Here's a possible fix:

--- markup.py.orig	2020-12-30 06:35:31.000000000 -0800
+++ markup.py	2021-01-04 20:00:50.000000000 -0800
@@ -9,7 +9,7 @@
 
 RE_TAGS = re.compile(
     r"""
-(\\\[)|
+(\\[[\\])|
 \[([a-z#\/].*?)\]
 """,
     re.VERBOSE,
@@ -60,12 +60,12 @@
     """
     position = 0
     for match in RE_TAGS.finditer(markup):
-        (escape_open, tag_text) = match.groups()
+        (escaped, tag_text) = match.groups()
         start, end = match.span()
         if start > position:
             yield start, markup[position:start], None
-        if escape_open:
-            yield start, "[", None
+        if escaped:
+            yield start, escaped[1], None
         else:
             text, equals, parameters = tag_text.partition("=")
             if equals:

Basically, it looks for either \\ or \[ in the first part of RE_TAGS, and then uses that to consume pairs of backslashes before looking for backslash and the bracket. Later, when outputting things, it outputs either a single backslash or the square bracket when that first part of RE_TAGS matches. Otherwise, it does what it did before.

@willmcgugan willmcgugan added accepted Task was accepted bug Something isn't working and removed Needs triage labels Jan 5, 2021
@ronf
Copy link
Author

ronf commented Jan 5, 2021

It occurred to me this morning that the escape() function should also be updated. Here's a diff for that:

@@ -48,7 +48,7 @@
     Returns:
         str: Markup with square brackets escaped.
     """
-    return markup.replace("[", r"\[")
+    return markup.replace("\\", r"\\").replace("[", r"\[")
 
 

Note that the order of the replace calls is important here.

I also wonder whether or not backslash escaping should be allowed for ":" to prevent it from doing emoji conversion (with either additional changes to escape() here, or a separate escape function for that).

Speaking of separating markup & emoji replacement, it doesn't look like there's a way to do ONLY emoji replacement on a string right now. There are some places where I'd like to be able to do that, but the replace function is in a private module right now. The opposite is supported by calling Text.from_markup() with emoji=False, but there's no way to ask for emoji replacement to be enabled while disabling other markup.

@willmcgugan
Copy link
Collaborator

Thanks for the report and suggested fix. I'll take a look at those soon.

re Emoji you can do this:

from rich.emoji import Emoji
print(Emoji.replace(":100:"))

@ronf
Copy link
Author

ronf commented Jan 5, 2021

Ah, I missed that. I'll give it a try - thanks!

@willmcgugan
Copy link
Collaborator

Potential fix in this branch.

Decided against escaping emoji, it complicates things too much. Emoji can be disabled anyway.

@ronf
Copy link
Author

ronf commented Jan 6, 2021

Unfortunately, this fix doesn't cover all the cases. It handles an escaped bracket and a single escaped backslash, but if you put more than two backslashes, it always matches on the rightmost two of them and leaves the square bracket as always being a markup token. With three backslashes, the first two should turn into a single backslash in the output, but the third backslash should escape the bracket and make it a literal. This continues in even/odd pairs for an arbitrary number of backslashes, and my fix took that into account by letting each pair (of either \\ or \[) be matched in a separate match in the finditer.

@willmcgugan
Copy link
Collaborator

Try the branch now...

Note that the escaping is only required for backslashes directly preceding the [. People printing Windows paths with backslashes would hate me otherwise.

@ronf
Copy link
Author

ronf commented Jan 7, 2021

Looks good - thanks!

Preserving backslashes as-is elsewhere in the string makes sense to me. I do something similar in some code I wrote which allows users to input raw unicode characters by entering something like \u203c or \U0001f348 inside their input string, converting those sequences to a single Unicode character when the entered code point is a valid printable character. Entering \\u203c is turned into \u203c (removing one backslash), but entering \\ not followed by a "u" or "U" and the right number of hex digits is left alone.

@willmcgugan
Copy link
Collaborator

Fixed in v9.6.2

@ronf
Copy link
Author

ronf commented Jan 8, 2021

Upgraded here - looks great. Thanks so much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Task was accepted bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants