Wrong normalization of message with HTML #125

mgeisler · 2023-11-12T13:52:18Z

This is from google/comprehensive-rust#1471: running mdbook-i18n-normalize on a translated message that contain HTML gives the wrong output.

I can reproduce this with

diff --git a/i18n-helpers/src/normalize.rs b/i18n-helpers/src/normalize.rs
index 7fed94d..4c0fd5c 100644
--- a/i18n-helpers/src/normalize.rs
+++ b/i18n-helpers/src/normalize.rs
@@ -451,6 +451,15 @@ mod tests {
         );
     }
 
+    #[test]
+    fn test_normalize_html() {
+        let catalog = create_catalog(&[("foo <span>bar</span>", "FOO <span>BAR</span>")]);
+        assert_normalized_messages_eq(
+            catalog,
+            &[exact("foo <span>bar</span>", "FOO <span>BAR</span>")],
+        );
+    }
+
     #[test]
     fn test_normalize_disappearing_html() {
         // Normalizing "<b>" results in no messages.

This tests normalizing a catalog looking like this:

msgid "foo <span>bar</span>"
msgstr "FOO <span>BAR</span>"

The test fails with

---- normalize::tests::test_normalize_html stdout ----
thread 'normalize::tests::test_normalize_html' panicked at i18n-helpers/src/normalize.rs:457:9:
assertion failed: `(left == right)`

Diff < left / right > :
 [
     (
         false,
<        "foo ",
<        "FOO ",
<    ),
<    (
<        false,
<        "bar",
<        "BAR",
>        "foo <span>bar</span>",
>        "FOO <span>BAR</span>",
     ),
 ]

which tells us that the normalized catalog contain two messages:

msgid "foo"
msgstr "FOO"

msgid "bar"
msgstr "BAR"

In other words, the original message was split into two messages.

In case only the translation contain HTML, the normalization will end up with a different number of messages: 1 for the msgid field, and 2 for the msgstr field. This is seen as an error, so a fallback kicks in: the normalized message is marked fuzzy and we accumulate the "left-over" messages into the final message.

This behavior can be seen in this unit test in normalize.rs:

    #[test]
    fn test_normalize_fuzzy_list_items_too_many() {
        let catalog = create_catalog(&[(
            "* foo\n\
             * bar",
            "* FOO\n\
             * BAR\n\
             * BAZ",
        )]);
        assert_normalized_messages_eq(catalog, &[fuzzy("foo", "FOO"), fuzzy("bar", "BAR\n\nBAZ")]);
    }

This is what happened in google/comprehensive-rust#1471 where we transform

#: src/SUMMARY.md:92
msgid "Double Frees in Modern C++"
msgstr "آزاد سازی مضاعف در<span dir=ltr>C++</span> مدرن"

into

#: src/SUMMARY.md:92
#, fuzzy
msgid "Double Frees in Modern C++"
msgstr ""
"آزاد سازی مضاعف در\n"
"\n"
"C++\n"
"\n"
"مدرن"

The text was updated successfully, but these errors were encountered:

mgeisler · 2023-11-12T14:03:47Z

@moaminsharifi, the underlying problem here is that I thought we could normalize a translation like

msgid "foo bar"
msgstr "FOO BAR"

by running mdbook-xgettext on both of "foo bar" and "FOO BAR" and then matching up the results pairwise. However, this breaks down when either field contain more/less parts on purpose.

We can throw in as much custom logic as we want in the normalization tool: I consider it a bit of free space where we clean up our past problems 🙂 However, it would be nice if we could find a clean way to keep the inline HTML intact in general.

@dyoo, you've run into similar problems with the parsing of HTML comments. Do you have an idea of how we can stop splitting out the inline HTML?

mgeisler · 2024-01-09T09:37:17Z

This is related to #97 as well: when we extract messages from a Markdown string, we get multiple messages if the Markdown contains HTML. So

foo <span>bar</span>"

becomes foo and bar. This causes the normalization problem above.

Since HTML is very rare in my experience, it would probably better to not split it out like that. So maybe we should just let Text and Html nodes become part of the same Group.

mgeisler · 2024-02-20T08:04:02Z

@kdarkhan, did you want to look into this too? I think it's basically a duplicate of #97 so fixing one will probably fix both 😄

michael-kerscher · 2024-05-07T09:21:20Z

With the pulldown-cmark upgrade the span tags are detected as InlineHtml. This allows them to be grouped into the same translation group while other block level Html tags are still handled as before.

The change to InlineHtml also affects these tags:

 - this can be detected and handled as before
<script>alert('there');</script> - this needs further discussion how tags like these should be handled.

There are two ways this could go:
a) Handle all InlineHtml tags (that are not mdbook directives) in the same group (this would then not split the <script> tags)
b) Allowlist or Denylist for tags that should be kept in the same group as text.

With the statement above that HTML is rare in these translations I would go for solution a).
Only if Html tags are already used in a meaningful way in other places, maybe b) with an allowlist for e.g. span (or other "stylistic" tags) would be better to keep everything compatible. From my perspective it would be a explicit decision to support (some) Html in the Markdown, and I'm not sure how useful this feature would be at this moment.

Any thoughts on this @mgeisler?

mgeisler · 2024-05-16T08:23:26Z

Hi @michael-kerscher,

Thanks for looking at this!

With the statement above that HTML is rare in these translations I would go for solution a). Only if Html tags are already used in a meaningful way in other places, maybe b) with an allowlist for e.g. span (or other "stylistic" tags) would be better to keep everything compatible. From my perspective it would be a explicit decision to support (some) Html in the Markdown, and I'm not sure how useful this feature would be at this moment.

I would also lean towards solution a). The less opinionated we have to be the better in my opinion 😄 The amount of HTML allowed in the Markdown is ultimately a decision for the people who use the tooling here, so I don't think we need to guard against JavaScript tags or similar.

 - this can be detected and handled as before

Great! Just for my own curiosity, does a HTML comment like this in its own paragraph show up as InlineHtml or as Html now?

A related question would be what happens to Markdown wrapped in a large div element? Like this:

<div class="warning">

Beware of the dog!

</div>

I haven't tested it, but I expect us to extract "Beware of the dog!" today. It would be a shame if we suddenly end up extracting the whole thing as a self-contained unit.

michael-kerscher · 2024-05-21T15:08:26Z

Great! Just for my own curiosity, does a HTML comment like this in its own paragraph show up as InlineHtml or as Html now?

A HTML comment in its own paragraph is still showing up as HtmlBlock so there is no change: see this link
https://spec.commonmark.org/dingus/?text=%3C!--%20mdbook-xgettext:skip%20--%3E%0AHi%20from%20%3Cspan%20%3ERust%3C/span%3E

But if it is used inline it is detected as an Inline Block.

To give a bigger example:

<!-- mdbook-xgettext:skip -->
This comment is a Html Block and this text would be skipped

This is printed and the comment is an Inline Block <!-- mdbook-xgettext:skip --> this would be skipped
This is not skipped

The pull request has further details and examples on this.

mgeisler added bug Something isn't working good first issue Good for newcomers labels Nov 12, 2023

This was referenced Nov 12, 2023

issue with mdbook-i18n-helpers 3.0 in persian string google/comprehensive-rust#1471

Closed

File normalized with mdbook-i18n-helpers 0.3.0 google/comprehensive-rust#1469

Closed

mgeisler mentioned this issue Jan 9, 2024

HTML tag abbr seems to confuse the parser #97

Closed

mgeisler mentioned this issue Apr 16, 2024

Upgrade pulldown-cmark to 0.10.3 #183

Merged

mgeisler mentioned this issue May 16, 2024

Write tool which can convert translated files back to PO #30

Open

michael-kerscher mentioned this issue May 17, 2024

Keep Inline HTML tags in the translated text group while ignoring block level HTML tags #195

Merged

kdarkhan closed this as completed in #195 May 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong normalization of message with HTML #125

Wrong normalization of message with HTML #125

mgeisler commented Nov 12, 2023

mgeisler commented Nov 12, 2023

mgeisler commented Jan 9, 2024 •

edited

mgeisler commented Feb 20, 2024

michael-kerscher commented May 7, 2024

mgeisler commented May 16, 2024

michael-kerscher commented May 21, 2024

Wrong normalization of message with HTML #125

Wrong normalization of message with HTML #125

Comments

mgeisler commented Nov 12, 2023

mgeisler commented Nov 12, 2023

mgeisler commented Jan 9, 2024 • edited

mgeisler commented Feb 20, 2024

michael-kerscher commented May 7, 2024

mgeisler commented May 16, 2024

michael-kerscher commented May 21, 2024

mgeisler commented Jan 9, 2024 •

edited