Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong normalization of message with HTML #125

Closed
mgeisler opened this issue Nov 12, 2023 · 6 comments · Fixed by #195
Closed

Wrong normalization of message with HTML #125

mgeisler opened this issue Nov 12, 2023 · 6 comments · Fixed by #195
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@mgeisler
Copy link
Collaborator

This is from google/comprehensive-rust#1471: running mdbook-i18n-normalize on a translated message that contain HTML gives the wrong output.

I can reproduce this with

diff --git a/i18n-helpers/src/normalize.rs b/i18n-helpers/src/normalize.rs
index 7fed94d..4c0fd5c 100644
--- a/i18n-helpers/src/normalize.rs
+++ b/i18n-helpers/src/normalize.rs
@@ -451,6 +451,15 @@ mod tests {
         );
     }
 
+    #[test]
+    fn test_normalize_html() {
+        let catalog = create_catalog(&[("foo <span>bar</span>", "FOO <span>BAR</span>")]);
+        assert_normalized_messages_eq(
+            catalog,
+            &[exact("foo <span>bar</span>", "FOO <span>BAR</span>")],
+        );
+    }
+
     #[test]
     fn test_normalize_disappearing_html() {
         // Normalizing "<b>" results in no messages.

This tests normalizing a catalog looking like this:

msgid "foo <span>bar</span>"
msgstr "FOO <span>BAR</span>"

The test fails with

---- normalize::tests::test_normalize_html stdout ----
thread 'normalize::tests::test_normalize_html' panicked at i18n-helpers/src/normalize.rs:457:9:
assertion failed: `(left == right)`

Diff < left / right > :
 [
     (
         false,
<        "foo ",
<        "FOO ",
<    ),
<    (
<        false,
<        "bar",
<        "BAR",
>        "foo <span>bar</span>",
>        "FOO <span>BAR</span>",
     ),
 ]

which tells us that the normalized catalog contain two messages:

msgid "foo"
msgstr "FOO"

msgid "bar"
msgstr "BAR"

In other words, the original message was split into two messages.

In case only the translation contain HTML, the normalization will end up with a different number of messages: 1 for the msgid field, and 2 for the msgstr field. This is seen as an error, so a fallback kicks in: the normalized message is marked fuzzy and we accumulate the "left-over" messages into the final message.

This behavior can be seen in this unit test in normalize.rs:

    #[test]
    fn test_normalize_fuzzy_list_items_too_many() {
        let catalog = create_catalog(&[(
            "* foo\n\
             * bar",
            "* FOO\n\
             * BAR\n\
             * BAZ",
        )]);
        assert_normalized_messages_eq(catalog, &[fuzzy("foo", "FOO"), fuzzy("bar", "BAR\n\nBAZ")]);
    }

This is what happened in google/comprehensive-rust#1471 where we transform

#: src/SUMMARY.md:92
msgid "Double Frees in Modern C++"
msgstr "آزاد سازی مضاعف در<span dir=ltr>C++</span> مدرن"

into

#: src/SUMMARY.md:92
#, fuzzy
msgid "Double Frees in Modern C++"
msgstr ""
"آزاد سازی مضاعف در\n"
"\n"
"C++\n"
"\n"
"مدرن"
@mgeisler mgeisler added bug Something isn't working good first issue Good for newcomers labels Nov 12, 2023
@mgeisler
Copy link
Collaborator Author

@moaminsharifi, the underlying problem here is that I thought we could normalize a translation like

msgid "foo bar"
msgstr "FOO BAR"

by running mdbook-xgettext on both of "foo bar" and "FOO BAR" and then matching up the results pairwise. However, this breaks down when either field contain more/less parts on purpose.

We can throw in as much custom logic as we want in the normalization tool: I consider it a bit of free space where we clean up our past problems 🙂 However, it would be nice if we could find a clean way to keep the inline HTML intact in general.

@dyoo, you've run into similar problems with the parsing of HTML comments. Do you have an idea of how we can stop splitting out the inline HTML?

@mgeisler
Copy link
Collaborator Author

mgeisler commented Jan 9, 2024

This is related to #97 as well: when we extract messages from a Markdown string, we get multiple messages if the Markdown contains HTML. So

foo <span>bar</span>"

becomes foo and bar. This causes the normalization problem above.

Since HTML is very rare in my experience, it would probably better to not split it out like that. So maybe we should just let Text and Html nodes become part of the same Group.

@mgeisler
Copy link
Collaborator Author

@kdarkhan, did you want to look into this too? I think it's basically a duplicate of #97 so fixing one will probably fix both 😄

@michael-kerscher
Copy link
Contributor

With the pulldown-cmark upgrade the span tags are detected as InlineHtml. This allows them to be grouped into the same translation group while other block level Html tags are still handled as before.

The change to InlineHtml also affects these tags:

  • <!-- mdbook-xgettext:skip --> - this can be detected and handled as before
  • <script>alert('there');</script> - this needs further discussion how tags like these should be handled.

There are two ways this could go:
a) Handle all InlineHtml tags (that are not mdbook directives) in the same group (this would then not split the <script> tags)
b) Allowlist or Denylist for tags that should be kept in the same group as text.

With the statement above that HTML is rare in these translations I would go for solution a).
Only if Html tags are already used in a meaningful way in other places, maybe b) with an allowlist for e.g. span (or other "stylistic" tags) would be better to keep everything compatible. From my perspective it would be a explicit decision to support (some) Html in the Markdown, and I'm not sure how useful this feature would be at this moment.

Any thoughts on this @mgeisler?

@mgeisler
Copy link
Collaborator Author

Hi @michael-kerscher,

Thanks for looking at this!

With the statement above that HTML is rare in these translations I would go for solution a). Only if Html tags are already used in a meaningful way in other places, maybe b) with an allowlist for e.g. span (or other "stylistic" tags) would be better to keep everything compatible. From my perspective it would be a explicit decision to support (some) Html in the Markdown, and I'm not sure how useful this feature would be at this moment.

I would also lean towards solution a). The less opinionated we have to be the better in my opinion 😄 The amount of HTML allowed in the Markdown is ultimately a decision for the people who use the tooling here, so I don't think we need to guard against JavaScript tags or similar.

  • <!-- mdbook-xgettext:skip --> - this can be detected and handled as before

Great! Just for my own curiosity, does a HTML comment like this in its own paragraph show up as InlineHtml or as Html now?

A related question would be what happens to Markdown wrapped in a large div element? Like this:

<div class="warning">

Beware of the dog!

</div>

I haven't tested it, but I expect us to extract "Beware of the dog!" today. It would be a shame if we suddenly end up extracting the whole thing as a self-contained unit.

@michael-kerscher
Copy link
Contributor

Great! Just for my own curiosity, does a HTML comment like this in its own paragraph show up as InlineHtml or as Html now?

A HTML comment in its own paragraph is still showing up as HtmlBlock so there is no change: see this link
https://spec.commonmark.org/dingus/?text=%3C!--%20mdbook-xgettext:skip%20--%3E%0AHi%20from%20%3Cspan%20%3ERust%3C/span%3E

But if it is used inline it is detected as an Inline Block.

To give a bigger example:

<!-- mdbook-xgettext:skip -->
This comment is a Html Block and this text would be skipped

This is printed and the comment is an Inline Block <!-- mdbook-xgettext:skip --> this would be skipped
This is not skipped

The pull request has further details and examples on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
None yet
2 participants