Not treating bare URL enclosed in angle brackets as unconstrained markup #4468

Closed
someth2say opened this issue Jun 13, 2023 · 5 comments
Labels
bug compliance v2.0.21 Issues resolved in the 2.0.21 release

Comments

@someth2say

As discussed here, bare URLs that are enclosed in angle brackets are intended to use the brackets as unconstrained pairs.
But it seems not to be the case.
If the closing bracket is followed by a word separator the link is delimited correctly:

❯ echo "Hello <https://asciidoctor.org/>." | asciidoctor -b html5 -s -
<div class="paragraph">
<p>Hello <a href="https://asciidoctor.org/" class="bare">https://asciidoctor.org/</a>.</p>
</div>

But when the closing bracket is followed by a non-word-separator character, the delimitation process fails:

❯ echo "Hello <https://asciidoctor.org/>/news/" | asciidoctor -b html5 -s -
<div class="paragraph">
<p>Hello &lt;<a href="https://asciidoctor.org/&gt;/news/" class="bare">https://asciidoctor.org/&gt;/news/</a></p>
</div>

In the example, you can see how:

  1. The < character (transformed to the &lt; entity) is not discarded, and it is placed before the link.
  2. The > character (transformed to the &gt; entity) and the text after it, up to the next word separator, become part of the link href.
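
To make the symptom concrete, here is a minimal sketch of a matcher that only stops at whitespace. The pattern and the `naive_href` helper are my own simplification for illustration, not Asciidoctor's actual implementation, and they ignore things the real processor does (such as trimming trailing punctuation).

```python
import re

# Simplified stand-in for a whitespace-delimited autolink matcher.
# Assumption: this is NOT Asciidoctor's real pattern, only a model of the symptom.
NAIVE_URL = re.compile(r"https?://\S+")

def naive_href(text):
    """Return the first URL-like run, extending to the next whitespace."""
    match = NAIVE_URL.search(text)
    return match.group(0) if match else None

# The closing bracket and the trailing path are swallowed into the URL.
print(naive_href("Hello <https://asciidoctor.org/>/news/"))
# prints: https://asciidoctor.org/>/news/
```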

This is especially problematic in some Asian languages, such as Japanese, where there are no word separators (and sentence separators are not considered word separators).
As there are no word separators, using autolinks directly is not suitable:

❯ echo "URLはhttp://www.google.com。" | asciidoctor -b html5 -s -
<div class="paragraph">
<p>URLはhttp://www.google.com。</p>
</div>

The URL is not even detected as a link.
Even if it were detected, there would be no way to delimit where the link ends and where the text starts.

The most sensible approach is using angle brackets to delimit the link.
But then we encounter the issue above:

❯ echo "URLは<http://www.google.com>。" | asciidoctor -b html5 -s -
<div class="paragraph">
<p>URLは&lt;<a href="http://www.google.com&gt;。" class="bare">http://www.google.com&gt;。</a></p>
</div>

Note that the 。 character (the sentence terminator) is included in the href.

FTR:

❯ asciidoctor --version
Asciidoctor 2.0.20 [https://asciidoctor.org]
Runtime Environment (ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [x86_64-linux]) (lc:UTF-8 fs:UTF-8 in:UTF-8 ex:UTF-8)
@mojavelinux
Member

Here's another example:

URLは<http://www.google.com>。

@mojavelinux mojavelinux changed the title from "Fail to delimit bare URL enclosed in angle brackets when not followed by a word separator." to "Not treating bare URL enclosed in angle brackets as unconstrained markup" Jun 15, 2023
@mojavelinux mojavelinux self-assigned this Jun 15, 2023
@mojavelinux mojavelinux added this to the v2.0.x milestone Jun 15, 2023
@mojavelinux
Member

The issue is the trailing 。 character. It's causing the processor to not see the closing > around the URL.

mojavelinux added a commit to mojavelinux/asciidoctor that referenced this issue Feb 19, 2024
@mojavelinux
Member

This fix turned out to be pretty straightforward. If we find a URL that starts with <, the processor will end the URL at the next >, even if there are adjacent characters. In other words, it will treat this as unconstrained markup.
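
As a rough illustration of the behavior described above (end the URL at the next > even when other characters are adjacent), here is a minimal sketch. The `BRACKETED_URL` pattern and the `link_bracketed_urls` helper are hypothetical names made up for this example; they work on raw text and are not the pattern used in the actual commits.

```python
import re
from html import escape

# Hypothetical pattern: "<", a URL, then the nearest ">".
# The lazy quantifier stops at the first ">" regardless of what follows it.
BRACKETED_URL = re.compile(r"<((?:https?|ftp)://\S+?)>")

def link_bracketed_urls(text):
    """Replace each <URL> with a bare anchor and drop the brackets."""
    def to_anchor(match):
        url = escape(match.group(1), quote=True)
        return f'<a href="{url}" class="bare">{url}</a>'
    return BRACKETED_URL.sub(to_anchor, text)

print(link_bracketed_urls("Hello <https://asciidoctor.org/>/news/"))
# prints: Hello <a href="https://asciidoctor.org/" class="bare">https://asciidoctor.org/</a>/news/
print(link_bracketed_urls("URLは<http://www.google.com>。"))
# prints: URLは<a href="http://www.google.com" class="bare">http://www.google.com</a>。
```

In both inputs from the report, the brackets are consumed and the text after > stays outside the link.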

There's a chance that it over-matches the first occurrence if there's more than one in a line without any spaces, but there's really nothing the current parser can do about that case. You'll need to insert something like {zwsp} to tell the parser to stop looking for the URL. This is something we can address in the AsciiDoc Language.

@mojavelinux
Member

I found a way to support multiple angle-bracketed URLs in one line without workarounds. And it's better this way, as each URL will be matched more precisely.
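
For readers wondering how over-matching can happen with several bracketed URLs on one line and no spaces between them, here is a small sketch comparing two illustrative patterns. Both `GREEDY` and `BOUNDED` are assumptions made up for this example (as are the `a.example` and `b.example` hosts); neither is claimed to be what the commits use.

```python
import re

# Two bracketed URLs back to back, with no whitespace between them.
LINE = "<http://a.example><http://b.example>"

# Greedy: \S+ runs to the end of the line, then backs off to the LAST ">".
GREEDY = re.compile(r"<(https?://\S+)>")

# Bounded: the URL body may not contain angle brackets, so each match
# ends at its own closing ">".
BOUNDED = re.compile(r"<(https?://[^\s<>]+)>")

print(GREEDY.findall(LINE))
# prints: ['http://a.example><http://b.example']
print(BOUNDED.findall(LINE))
# prints: ['http://a.example', 'http://b.example']
```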

mojavelinux added a commit to mojavelinux/asciidoctor that referenced this issue Feb 20, 2024
@mojavelinux
Member

I think I finally found a matcher that solves this problem while also providing the best compatibility with AsciiDoc.py, with negligible impact on performance, if any at all. This is definitely an area where the syntax is only scantly defined, so we'll be revisiting it to shore it up in the AsciiDoc Language.
