Unwanted space characters in Japanese language #1420

yantarou · 2015-07-16T04:24:18Z

Good Morning,

I'm inserting line breaks into the AsciiDoc source to make long sentences easier to read.

For English this works as expected: a line break in the source translates into a space in the output.

AsciiDoc:

This example sentence is split
over multiple lines in order
to make editing easier.

Output:

This example sentence is split over multiple lines in order to make editing easier.

Japanese, however, doesn't put spaces between words, so the line breaks should be ignored.

I wonder if there is any configuration option that can influence this behavior?

AsciiDoc:

この例文は、編集作業を
楽にするために複数の行
に分割されています。

Output:

この例文は、編集作業を 楽にするために複数の行 に分割されています。

Desired output:

この例文は、編集作業を楽にするために複数の行に分割されています。

Thanks,
Jan

mojavelinux · 2015-07-16T07:02:21Z

This is actually HTML adding the spaces in, sort of. You see, Asciidoctor passes the text as you see it (after removing trailing whitespace on the line) to HTML. All whitespace gets consolidated by HTML into a single space. That looks normal in English text (as the endline in the source is most likely at the boundary of a word or sentence). However, in Japanese text you end up with an unwanted space.
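
The effect described above can be modeled with a minimal Ruby sketch, assuming the renderer folds every run of spaces, tabs, and newlines into one space. The helper name `collapse_whitespace` is hypothetical; it only imitates the outcome and is not part of Asciidoctor or of any browser.

```ruby
# Hypothetical model of whitespace collapsing in rendered text:
# every run of spaces, tabs, and newlines becomes a single space.
def collapse_whitespace(text)
  text.gsub(/[ \t\r\n\f]+/, ' ').strip
end

english = "This example sentence is split\nover multiple lines in order\nto make editing easier."
japanese = "この例文は、編集作業を\n楽にするために複数の行\nに分割されています。"

# The English result reads naturally; the Japanese result gains a space
# at each former line break.
puts collapse_whitespace(english)
puts collapse_whitespace(japanese)
```

The Japanese line comes out with a space at each former line break, which is the output reported at the top of this issue.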

One solution to this problem today is to create a Treeprocessor or Postprocessor extension that finds all paragraph text and removes the unwanted space.

For the long term, this is an intriguing question as it affects all similar languages. Poetry-style writing (aka sentence or phrase per line) should be an option when writing in these languages but still get the desired output. I think perhaps the solution is to change the behavior when the lang attribute is one of the languages in which whitespace has no significance.

mojavelinux · 2015-07-16T07:02:45Z

...and we certainly want Asciidoctor to be friendly and comfortable for all languages.

chloerei · 2015-10-16T15:08:51Z

Chinese has the same problem. I wrote a Treeprocessor:

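require 'asciidoctor/extensions'

# Treeprocessor that removes the newlines inside every paragraph so that
# no space is rendered where the source line was wrapped.
# Note: newlines are removed from all paragraphs, regardless of language.
class TrailingTreeprocessor < Asciidoctor::Extensions::Treeprocessor
  def process document
    return unless document.blocks?
    process_blocks document
    nil
  end

  # Walk the block tree recursively; replace each paragraph with a new
  # paragraph built from the original content with its newlines stripped.
  def process_blocks node
    node.blocks.each_with_index do |block, index|
      if block.context == :paragraph
        # block.content is the paragraph text after substitutions are applied
        node.blocks[index] = create_paragraph block.document, block.content.gsub("\n", ''), block.attributes
      else
        process_blocks block
      end
    end
  end
end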

Asciidoctor::Extensions.register do
  treeprocessor TrailingTreeprocessor
end

Save in config.rb, then:

$ asciidoctor -r ./config.rb filename.adoc

mojavelinux · 2016-01-03T02:16:11Z

The switch that needs to be exposed here in core is which character to use for a prose endline. In Latin-based languages, it is a literal endline. For CJK, it would need to be empty (no character at all).

mojavelinux · 2016-01-03T02:17:09Z

And this would be something that could be controlled through the language or language family.

For now, you need to either take the approach that @chloerei suggested or avoid inserting endlines in your prose in the AsciiDoc source document.
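
The idea of a language-dependent prose endline can be sketched in plain Ruby. This is an illustration only, not Asciidoctor code: the constant `NO_SPACE_LANGS`, its contents, and the helpers `prose_endline` and `join_prose` are all hypothetical names and assumptions.

```ruby
# Hypothetical list of primary language codes whose prose is written
# without spaces between words. The real set would need to be decided.
NO_SPACE_LANGS = %w[ja zh].freeze

# Return the string used to join wrapped prose lines for a given lang value.
def prose_endline(lang)
  primary = lang.to_s.downcase.split(/[-_]/).first
  NO_SPACE_LANGS.include?(primary) ? '' : "\n"
end

# Join the source lines of a paragraph using the language's prose endline.
def join_prose(lines, lang)
  lines.map(&:rstrip).join(prose_endline(lang))
end

puts join_prose(['この例文は、編集作業を', '楽にするために複数の行', 'に分割されています。'], 'ja')
puts join_prose(['This example sentence is split', 'over multiple lines in order'], 'en')
```

With `ja`, the lines are concatenated directly; with `en`, the newline is kept and left for the renderer to turn into a space.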

lo48576 · 2016-09-06T03:43:04Z

Macros in CJK have the same problem (unwanted spaces).

これは link:http://example.com[リンク]です。

is currently (ver1.5.4) converted to

これは <a href="http://example.com">リンク</a>です。

The output contains an unwanted space before <a>.
(Fortunately, no space is needed after ], which works well for CJK.)

Proposal

I propose the two (mutually exclusive) sets of rules below.

Rule to remove spaces

The following are removed from the output:

1: a single whitespace character right before a macro and right after a CJK character
2: a single line break right after a CJK character or a hyphen

Example:

これは link:foo[リンク]です。
// rule 1: a whitespace, right before macro, right after CJK character.

改行を
含みます。
// rule 2: single line break, right after CJK character ("を").

four-years-
old
// rule 2: single line break, right after hyphen ("-").

would be converted to:

<p>これは<a href="foo">リンク</a>です。</p>
<p>改行を含みます。</p>
<p>four-years-old</p>

Rule to preserve space(s)

The following are converted to a single space in the output:

3: a single line break right after a non-CJK and non-hyphen character
4: whitespace at the beginning of a line (except for the first line of the paragraph)
5: one (or both) of the following:
- 5-1: two or more successive whitespace characters right before a macro, or
- 5-2: one or more successive whitespace characters right after a non-CJK character

Example:

English with
a linebreak.
// rule 3: single line break, right after non-CJK and non-hyphen character ("h").

空白のあとに  link:bar[リンク]があります。
// Contains two whitespace characters before `link`.
// rule 5-1: two successive whitespace characters, before macro (`link`).

A link after link:baz[a whitespace].
// rule 5-2: one whitespace character, right after non-CJK character ("r").

この空白は
 保持されます。
// Contains single whitespace before "保持".
// rule 4: a whitespace at the beginning of the second line.

a-
b-
 c
// Contains single whitespace before "c".
// "a-b": rule 2: single line break, right after hyphen ("-").
// "b-c": rule 4: a whitespace at the beginning of the third line.

would be converted to:

<p>English with a linebreak.</p>
<p>空白のあとに <a href="bar">リンク</a>があります。</p>
<p>A link after <a href="baz">a whitespace</a>.</p>
<p>この空白は 保持されます。</p>
<p>a-b- c</p>
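
The rules above can be sketched as a small Ruby function that works on the raw text of one paragraph. This is a sketch under stated assumptions, not an Asciidoctor feature: the `CJK` character class is an approximation, `MACRO_START` is a simplified, hypothetical pattern for the start of an inline macro, and `apply_spacing_rules` is an illustrative name.

```ruby
# Approximation of "CJK character": Han, Hiragana, Katakana, CJK punctuation.
CJK = /[\p{Han}\p{Hiragana}\p{Katakana}\u3001-\u303F]/
# Simplified, hypothetical pattern for the start of a macro such as link:foo[
MACRO_START = /[a-z]+:[^\s\[\]]*\[/

def apply_spacing_rules(paragraph)
  lines = paragraph.split("\n").map(&:rstrip)
  return '' if lines.empty?

  out = lines.first.dup
  lines.drop(1).each do |line|
    if line.match?(/\A[ \t]/)
      out << ' ' << line.lstrip   # rule 4: leading whitespace becomes one space
    elsif out.match?(/(?:#{CJK}|-)\z/)
      out << line                 # rule 2: drop the break after CJK or hyphen
    else
      out << ' ' << line          # rule 3: break after other text becomes a space
    end
  end

  # rule 1: exactly one whitespace between a CJK character and a macro is dropped
  out = out.gsub(/(#{CJK})[ \t](?=#{MACRO_START})/, '\1')
  # rule 5-1: two or more whitespace characters before a macro become one space
  # rule 5-2 needs no code: whitespace after non-CJK text is left as written
  out.gsub(/[ \t]{2,}(?=#{MACRO_START})/, ' ')
end

puts apply_spacing_rules("この空白は\n 保持されます。")
```

The function returns the joined source text; converting the macros to HTML is still left to the processor.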

thom4parisot · 2017-02-17T12:00:47Z

Digging a bit in the issues, I found we probably had a similar conversation in #1174.

crocket · 2017-08-04T02:30:04Z

@lo48576 Yo, man.
I found that

これはlink:http://example.com[リンク]です。

is converted to

<p>これは<a href="http://example.com">リンク</a>です。</p>

And,

crocketlink:http://example.com[リンク]です。

is converted to

<p>crocket<a href="http://example.com">リンク</a>です。</p>

If you'd like to make a visual distinction in the adoc source, you can use pass:[]:

pass:[これは]link:http://example.com[リンク]です。

crocket · 2017-08-05T11:27:47Z

What about letting backslash at the end of a line concatenate the next line to the current line?

Could

この例文は、編集作業を\
楽にするために複数の行\
に分割されています。

be converted to

この例文は、編集作業を楽にするために複数の行に分割されています。

without breaking backward compatibility?
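
A minimal Ruby sketch of this proposal, written as a hypothetical preprocessing step applied to the source text before conversion. It is not existing Asciidoctor behavior, and the helper name `join_continued_lines` is an assumption.

```ruby
# Remove every "backslash + newline" pair so that the next line is
# concatenated directly onto the current one.
def join_continued_lines(source)
  source.gsub(/\\\r?\n/, '')
end

source = "この例文は、編集作業を\\\n楽にするために複数の行\\\nに分割されています。"
puts join_continued_lines(source)
```

Lines that do not end with a backslash are left untouched, so ordinary wrapped prose keeps its newlines.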

zhangkaizhao · 2018-11-03T17:20:59Z

> What about letting backslash at the end of a line concatenate the next line to the current line?
>
> Could
>
> この例文は、編集作業を\
> 楽にするために複数の行\
> に分割されています。
>
> be converted to
>
> この例文は、編集作業を楽にするために複数の行に分割されています。
>
> without breaking backward compatibility?

AFAIK, this is the solution used in reStructuredText, which is the only markup language that supports this feature so far.

zhangkaizhao · 2018-11-09T06:18:19Z

Hi.
I just wrote a Treeprocessor that is a port of markdown-it-cjk-breaks for AsciiDoc.
Feel free to try it if you are interested: https://github.com/zhangkaizhao/asciidoctor_cjk_breaks

Related information in Markdown as far as I know:

https://talk.commonmark.org/t/soft-line-breaks-should-not-introduce-spaces/285

(which came from TryGhost/Ghost#3893 )

Plugin for markdown-it to automatically deal with segment breaks:

https://github.com/markdown-it/markdown-it-cjk-breaks

whose algorithm matches CSS Text Module Level 3

(which came from https://talk.commonmark.org/t/soft-line-breaks-should-not-introduce-spaces/285/9 )

It is said this plugin is similar to one in pandoc:

jgm/pandoc#534

mojavelinux · 2023-10-02T06:42:06Z

Related issue: #4468.

tats-u · 2024-05-20T09:45:20Z

The CSS spec has changed. (This is handled by CSS, not HTML.)
A newline surrounded only by Chinese/Japanese characters must simply be removed. Unconditionally inserting a space is now a bug.
Browsers other than Firefox have this bug:

https://wpt.fyi/results/css/css-text/line-breaking?label=experimental&label=master&aligned&q=segment-break-transformation-rules-
https://drafts.csswg.org/css-text-4/#line-break-transform
https://issues.chromium.org/issues/40069685
https://issues.chromium.org/issues/40774934
https://bugs.webkit.org/show_bug.cgi?id=260857

tonytonyjan · 2024-05-20T10:22:51Z

> The CSS spec has been changed. (It's handled by CSS, not HTML) A newline surrounded by only Chinese/Japanese characters must be just removed. Unconditional insertion of a space is a bug today: Browsers other than Firefox have this bug.
>
> https://wpt.fyi/results/css/css-text/line-breaking?label=experimental&label=master&aligned&q=segment-break-transformation-rules-
> https://drafts.csswg.org/css-text-4/#line-break-transform
> https://issues.chromium.org/issues/40069685
> https://issues.chromium.org/issues/40774934
> https://bugs.webkit.org/show_bug.cgi?id=260857

@mojavelinux I think we can close this issue, since it is a browser issue rather than an Asciidoctor issue?

tats-u · 2024-05-20T13:12:13Z

Does Asciidoctor just retain newlines without converting them to spaces itself?
It wouldn't be a bad idea to remove newlines surrounded only by Chinese/Japanese characters to mitigate this problem.
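
The mitigation suggested here could look roughly like the following Ruby sketch. The character class `CJ_CHAR` is an assumption that stands in for "Chinese/Japanese characters" and does not reproduce the exact classification in the CSS draft; `remove_cj_newlines` is a hypothetical helper name.

```ruby
# Assumed stand-in for "Chinese/Japanese characters".
CJ_CHAR = /[\p{Han}\p{Hiragana}\p{Katakana}\u3001-\u303F]/

# Remove a newline only when the characters on both sides match CJ_CHAR.
# All other newlines are kept for the renderer to handle.
def remove_cj_newlines(text)
  text.gsub(/(#{CJ_CHAR})\r?\n(?=#{CJ_CHAR})/, '\1')
end

puts remove_cj_newlines("この例文は、編集作業を\n楽にするために複数の行\nに分割されています。")
puts remove_cj_newlines("This example sentence is split\nover multiple lines")
```

Unlike removing every newline in a paragraph, this leaves English and mixed-script line breaks alone.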

mojavelinux · 2024-05-20T18:47:01Z

Yes, Asciidoctor leaves the space characters as they are written (spaces remain as spaces and newlines remain as newlines). The assumption is that the renderer will normalize them, such as the browser for HTML.

tats-u · 2024-05-20T23:48:53Z

I got it. Do you know where we should discuss a specification shared across the entire Asciidoctor family?
I remember that Asciidoctor-pdf doesn't use a browser to render the document.
If you have the authority to move this issue there (to another repo), could you do that instead of closing it?

mojavelinux · 2024-05-21T17:55:30Z

You're free to ask open-ended questions in the project chat at https://chat.asciidoctor.org.

tats-u · 2024-05-21T23:33:45Z

I see. Asciidoctor supports DocBook and EPub output, too. I don't know how they treat newlines in XML. I have two questions about them: are newlines left to the renderers? And how many of those renderers are built on a browser engine?

mojavelinux · 2024-05-21T23:35:42Z

Please continue this discussion in the chat. The issue tracker is intended to track design decisions. It's not for open-ended discussions.

mojavelinux · 2024-05-21T23:36:40Z

In terms of the HTML converter, it seems this issue has been resolved by CSS and thus no action is needed here.

tats-u · 2024-05-21T23:51:44Z

> In terms of the HTML converter,

DocBook is XML based. Are you saying EPub & DocBook both use CSS for styling?

tats-u · 2024-05-21T23:54:17Z

Also, Asciidoctor shouldn't put too much trust in the CSS implementations of today's web browsers.

mojavelinux · 2024-05-21T23:58:56Z

If this is a behavior you need, you're welcome to extend the converter and add the logic to that extended converter. This is not something we're going to add to Asciidoctor right now.

mojavelinux added this to the discussion milestone Jul 16, 2015

mojavelinux modified the milestones: v1.6.0, discussion Jan 3, 2016

mojavelinux added the enhancement label Jan 3, 2016

mojavelinux self-assigned this Jan 3, 2016

mojavelinux modified the milestones: v1.6.0, M2 Jan 9, 2019

reosablo mentioned this issue Mar 6, 2021

Unexpected output by multiple new window links without space character between #3962

Closed

mojavelinux closed this as completed May 21, 2024

mojavelinux removed this from the M2 milestone May 21, 2024

mojavelinux added the invalid (Invalid or outdated issue) label and removed the enhancement label May 21, 2024

asciidoctor locked as resolved and limited conversation to collaborators May 21, 2024
