Markdown: Do not insert spaces between Chinese/Japanese & latin letters (#6385)

Powered by #8526 & prettier-plugin-md-nocjsp
prettier · Oct 4, 2021 · 394b1d7
1 parent 04391a0
commit 394b1d7

Showing 6 changed files with 151 additions and 49 deletions.
diff --git a/changelog_unreleased/markdown/11597.md b/changelog_unreleased/markdown/11597.md
@@ -0,0 +1,77 @@
+#### Stop inserting spaces between Chinese or Japanese (e.g. hanzi and kana) and western characters (#11597 by @tats-u)
+
+<!-- Optional description if it makes sense. -->
+
+Prettier currently inserts a space (U+0020) between Chinese or Japanese characters (e.g. hanzi/kanji and kana) and western characters (e.g. alphanumerics). This behavior is not based on the official Japanese or Chinese layout guidelines but on a [non-standard, local Chinese convention](https://github.com/ruanyf/document-style-guide/blob/master/docs/text.md).
+
+Official Japanese guideline (W3C):
+
+> 3.9.1 Differences in Positioning of Characters and Symbols
+>
+> The positioning of characters and symbols may vary depending on the following.
+>
+> d. Are characters and symbols appearing in sequence in solid setting, or will there be a fixed size space between them? For example, sequences of ideographic characters (cl-19) and hiragana (cl-15) are set solid, and for Western characters (cl-27) following hiragana (cl-15) there will be quarter em spacing.
+
+<https://www.w3.org/TR/jlreq/#differences_in_positioning_of_characters_and_symbols>
+
+> “one quarter em” means one quarter of the full-width size. (JIS Z 8125)  
+> “one quarter em space” means amount of space that is one quarter size of em space.
+
+<https://www.w3.org/TR/jlreq/#term.quarter-em>  
+<https://www.w3.org/TR/jlreq/#term.quarter-em-space>
+
+Official Japanese guideline (JIS X 4051:2004):
+
+> 4.7 和欧文混植処理
+>
+> a) 横書きでは，和文と欧文との間の空き量は，四分アキを原則とする。
+>
+> 4.7 Mixed Japanese and Western Text Composition
+>
+> a) In horizontal writing, the space between Japanese and western text should be one quarter em, as a rule.
+>
+> PR Author's Note: The original text is written only in Japanese; the translation is based on [DeepL](https://www.deepl.com/translator).
+
+<https://kikakurui.com/x4/X4051-2004-02.html> (Japanese)
+
+Official Chinese guideline (W3C):
+
+> 3.2.2 Mixed Text Composition in Horizontal Writing Mode
+>
+> In principle, there is tracking or spacing between an adjacent Han character and a Western character of up to one quarter of a Han character width, except at the line start or end.
+>
+> NOTE: Another approach is to use a Western word space (U+0020 SPACE), in which case the width depends on the font in use.
+
+<https://www.w3.org/TR/clreq/#mixed_text_composition_in_horizontal_writing_mode>
+
+As quoted above, only the Chinese guideline allows a space (U+0020) to be substituted for the quarter-em spacing, even though the two look similar. Even for Chinese, the W3C guideline does not adopt this substitution as its rule and mentions it only as one of the options.
+
+Some renderers (e.g. Pandoc converting to PDF with a LaTeX backend) can automatically insert a genuine quarter-em space. A space (U+0020) has a different width from a quarter em, so inserting it takes away the option of leaving the spacing to the renderer. Adding spacing should be left to renderers and should not be done by Prettier, which is just a formatter.
+
+Adding spaces can also interfere with searches for text containing both Chinese or Japanese and western characters. For example, you cannot find “第1章” (Chapter 1) in a Markdown document or its derivatives by searching for the string “第1章”; you have to search for “第 1 章” instead.
+
+To make matters worse, once a space has been inserted, it is difficult to remove. The following sentence cannot be said to be wrong.
+
+> 作る means make in Japanese.
+
+An overly simple rule that removes spaces between Chinese or Japanese characters and alphanumerics would also remove the one between “作る” and “means”, unless you modify the sentence, for example by quoting “作る”. It is very difficult to create a general rule that can safely remove such spaces from all documents and that deserves to be included in Prettier.
+
+In conclusion, Prettier, which is just a formatter, must stop imposing this non-standard rule.
+
+<!-- prettier-ignore -->
+```markdown
+<!-- Input -->
+漢字Alphabetsひらがな12345カタカナ67890한글
+
+漢字 Alphabets ひらがな 12345 カタカナ 67890 한글
+
+<!-- Prettier stable -->
+漢字 Alphabets ひらがな 12345 カタカナ 67890한글
+
+漢字 Alphabets ひらがな 12345 カタカナ 67890 한글
+
+<!-- Prettier main -->
+漢字Alphabetsひらがな12345カタカナ67890한글
+
+漢字 Alphabets ひらがな 12345 カタカナ 67890 한글
+```
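
The search and removal concerns raised in the changelog entry above can be sketched in a few lines of JavaScript. This is only an illustration: `naiveStrip` and the two character classes are hypothetical names that do not exist in Prettier, and the classes deliberately cover only Han, hiragana, katakana, and ASCII alphanumerics.

```javascript
// Hypothetical character classes for illustration only (not Prettier's patterns).
const CJ = "[\\p{Script=Han}\\p{Script=Hiragana}\\p{Script=Katakana}]";
const LATIN = "[A-Za-z0-9]";

// An overly simple "undo" rule: drop a single U+0020 that sits between a
// Chinese/Japanese character and an ASCII alphanumeric, in either order.
function naiveStrip(text) {
  return text
    .replace(new RegExp(`(${CJ}) (?=${LATIN})`, "gu"), "$1")
    .replace(new RegExp(`(${LATIN}) (?=${CJ})`, "gu"), "$1");
}

// Searching: the spaced form no longer contains the unspaced string.
const spacedHeading = "第 1 章";
const foundUnspaced = spacedHeading.includes("第1章"); // false

// Removing: the rule restores the heading as intended...
const restoredHeading = naiveStrip(spacedHeading); // "第1章"

// ...but it also deletes a space the author wrote on purpose.
const damagedSentence = naiveStrip("作る means make in Japanese.");
// "作るmeans make in Japanese."
```

The last call is the case the changelog describes: nothing in the text tells the rule that the space after “作る” was written deliberately, so a blanket removal cannot be applied safely.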
diff --git a/src/language-markdown/constants.evaluate.js b/src/language-markdown/constants.evaluate.js
@@ -22,10 +22,6 @@ const cjkPattern = `(?:${cjkRegex()
   Block: ["Variation_Selectors", "Variation_Selectors_Supplement"],
 }).toString()})?`;
 
-const kPattern = unicodeRegex({ Script: ["Hangul"] })
-  .union(unicodeRegex({ Script_Extensions: ["Hangul"] }))
-  .toString();
-
 // http://spec.commonmark.org/0.25/#ascii-punctuation-character
 const asciiPunctuationCharset =
   /* prettier-ignore */ regexpUtil.charset(
@@ -53,6 +49,5 @@ const punctuationPattern = punctuationCharset.toString();
 
 module.exports = {
   cjkPattern,
-  kPattern,
   punctuationPattern,
 };
diff --git a/src/language-markdown/utils.js b/src/language-markdown/utils.js
@@ -2,11 +2,7 @@
 
 const { getLast } = require("../common/util.js");
 const { locStart, locEnd } = require("./loc.js");
-const {
-  cjkPattern,
-  kPattern,
-  punctuationPattern,
-} = require("./constants.evaluate.js");
+const { cjkPattern, punctuationPattern } = require("./constants.evaluate.js");
 
 const INLINE_NODE_TYPES = [
   "liquidNode",
@@ -35,7 +31,6 @@ const INLINE_NODE_WRAPPER_TYPES = [
   "heading",
 ];
 
-const kRegex = new RegExp(kPattern);
 const punctuationRegex = new RegExp(punctuationPattern);
 
 /**
@@ -44,8 +39,7 @@ const punctuationRegex = new RegExp(punctuationPattern);
  */
 function splitText(text, options) {
   const KIND_NON_CJK = "non-cjk";
-  const KIND_CJ_LETTER = "cj-letter";
-  const KIND_K_LETTER = "k-letter";
+  const KIND_CJK_LETTER = "cjk-letter";
   const KIND_CJK_PUNCTUATION = "cjk-punctuation";
 
   /** @type {Array<{ type: "whitespace", value: " " | "\n" | "" } | { type: "word", value: string }>} */
@@ -111,7 +105,7 @@ function splitText(text, options) {
           : {
               type: "word",
               value: innerToken,
-              kind: kRegex.test(innerToken) ? KIND_K_LETTER : KIND_CJ_LETTER,
+              kind: KIND_CJK_LETTER,
               hasLeadingPunctuation: false,
               hasTrailingPunctuation: false,
             }
@@ -125,16 +119,8 @@ function splitText(text, options) {
     const lastNode = getLast(nodes);
     if (lastNode && lastNode.type === "word") {
       if (
-        (lastNode.kind === KIND_NON_CJK &&
-          node.kind === KIND_CJ_LETTER &&
-          !lastNode.hasTrailingPunctuation) ||
-        (lastNode.kind === KIND_CJ_LETTER &&
-          node.kind === KIND_NON_CJK &&
-          !node.hasLeadingPunctuation)
-      ) {
-        nodes.push({ type: "whitespace", value: " " });
-      } else if (
         !isBetween(KIND_NON_CJK, KIND_CJK_PUNCTUATION) &&
+        !isBetween(KIND_CJK_PUNCTUATION, KIND_NON_CJK) &&
         // disallow leading/trailing full-width whitespace
         ![lastNode.value, node.value].some((value) => /\u3000/.test(value))
       ) {
@@ -144,10 +130,7 @@ function splitText(text, options) {
     nodes.push(node);
 
     function isBetween(kind1, kind2) {
-      return (
-        (lastNode.kind === kind1 && node.kind === kind2) ||
-        (lastNode.kind === kind2 && node.kind === kind1)
-      );
+      return lastNode.kind === kind1 && node.kind === kind2;
     }
   }
 }
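
The condition that remains in `splitText` after this change can be read in isolation as follows. This is a simplified sketch, not the actual function: `allowsEmptyWhitespace` and `word` are hypothetical names, and the body of the branch (which this hunk does not show) is assumed to push a whitespace node rather than reproduced here. Because `isBetween` is now directional, the added second call is what keeps the non-CJK/CJK-punctuation check symmetric.

```javascript
const KIND_NON_CJK = "non-cjk";
const KIND_CJK_LETTER = "cjk-letter";
const KIND_CJK_PUNCTUATION = "cjk-punctuation";

// Hypothetical standalone version of the condition in the diff above.
// Returns true when the branch would be taken for two adjacent word nodes.
function allowsEmptyWhitespace(lastNode, node) {
  // Directional check, matching the new single-line isBetween.
  const isBetween = (kind1, kind2) =>
    lastNode.kind === kind1 && node.kind === kind2;

  return (
    !isBetween(KIND_NON_CJK, KIND_CJK_PUNCTUATION) &&
    !isBetween(KIND_CJK_PUNCTUATION, KIND_NON_CJK) &&
    // disallow leading/trailing full-width whitespace
    ![lastNode.value, node.value].some((value) => /\u3000/.test(value))
  );
}

// Hypothetical helper that builds a minimal word node for the examples.
const word = (kind, value) => ({ type: "word", kind, value });

// Letter next to western text: branch taken, and no " " is inserted anymore.
const letterPair = allowsEmptyWhitespace(
  word(KIND_NON_CJK, "English"),
  word(KIND_CJK_LETTER, "混")
); // true

// Western text next to CJK punctuation, in either order: branch not taken.
const beforePunctuation = allowsEmptyWhitespace(
  word(KIND_NON_CJK, "Paragraph"),
  word(KIND_CJK_PUNCTUATION, "！")
); // false
const afterPunctuation = allowsEmptyWhitespace(
  word(KIND_CJK_PUNCTUATION, "！"),
  word(KIND_NON_CJK, "English")
); // false
```

This matches the updated snapshots, where `Paragraph！` is never split across lines while breaks do occur between CJK letters and adjacent western words.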

diff --git a/tests/format/markdown/paragraph/__snapshots__/jsfmt.spec.js.snap b/tests/format/markdown/paragraph/__snapshots__/jsfmt.spec.js.snap
@@ -31,16 +31,15 @@ IVS 麻󠄁羽󠄀‼️
 這是一段很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長
 很長的段落
 
-這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段
-Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著
-中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個
-English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段
-Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著
-中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個
-English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段
-Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著
-中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個
-English 混合著中文的一段 Paragraph！
+這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段
+Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的
+一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中
+文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合
+著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English
+混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個
+English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是
+一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！
+這是一個English混合著中文的一段Paragraph！
 
 全　　形　空白全　　形　空白全　　形　空白全　　形　空白全　　形　空白
 全　　形　空白全　　形　空白全　　形　空白
@@ -90,7 +89,7 @@ IVS 麻󠄁羽󠄀‼️
 =====================================output=====================================
 這是一段很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長的段落
 
-這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！
+這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！
 
 全　　形　空白全　　形　空白全　　形　空白全　　形　空白全　　形　空白全　　形　空白全　　形　空白全　　形　空白
 
@@ -139,7 +138,7 @@ IVS 麻󠄁羽󠄀‼️
 =====================================output=====================================
 這是一段很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長的段落
 
-這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！這是一個 English 混合著中文的一段 Paragraph！
+這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！這是一個English混合著中文的一段Paragraph！
 
 全　　形　空白全　　形　空白全　　形　空白全　　形　空白全　　形　空白全　　形　空白全　　形　空白全　　形　空白