Unwanted space characters in Japanese language #1420

Closed
yantarou opened this issue Jul 16, 2015 · 24 comments

@yantarou

Good Morning,

I'm inserting line breaks into the AsciiDoc source to make long sentences easier to read.

For English this works as expected: a line break in the source becomes a space in the output.

AsciiDoc:

This example sentence is split
over multiple lines in order
to make editing easier.

Output:

This example sentence is split over multiple lines in order to make editing easier.

Japanese does not put spaces between words, though, so the line breaks should be ignored.

Is there a configuration option that influences this behavior?

AsciiDoc:

この例文は、編集作業を
楽にするために複数の行
に分割されています。

Output:

この例文は、編集作業を 楽にするために複数の行 に分割されています。

Desired output:

この例文は、編集作業を楽にするために複数の行に分割されています。

Thanks,
Jan

@mojavelinux
Member

This is actually HTML adding the spaces in, sort of. You see, Asciidoctor passes the text as you see it (after removing trailing whitespace on the line) to HTML. All whitespace gets consolidated by HTML into a single space. That looks normal in English text (as the endline in the source is most likely at the boundary of a word or sentence). However, in Japanese text you end up with an unwanted space.
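The collapsing described above can be illustrated with a small Ruby sketch. It is only a simplified model, assuming that every run of spaces, tabs, and newlines becomes one space; real browsers apply more detailed rules. The helper name `collapse_whitespace` is made up for this illustration and is not part of Asciidoctor.

```ruby
# Simplified model of whitespace collapsing in rendered HTML text:
# each run of spaces, tabs, and newlines is replaced by a single space.
# (Hypothetical helper, not an Asciidoctor or browser API.)
def collapse_whitespace(text)
  text.gsub(/[ \t\r\n]+/, ' ').strip
end

english = "This example sentence is split\nover multiple lines in order\nto make editing easier."
japanese = "この例文は、編集作業を\n楽にするために複数の行\nに分割されています。"

puts collapse_whitespace(english)
puts collapse_whitespace(japanese)
# The second line printed contains the two unwanted spaces reported in this issue:
# この例文は、編集作業を 楽にするために複数の行 に分割されています。
```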

One solution to this problem today is to create a Treeprocessor or Postprocessor extension that finds all paragraph text and removes the unwanted space.

For the long term, this is an intriguing question as it affects all similar languages. Poetry-style writing (aka sentence or phrase per line) should be an option when writing in these languages but still get the desired output. I think perhaps the solution is to change the behavior when the lang attribute is one of the languages in which whitespace has no significance.

@mojavelinux mojavelinux added this to the discussion milestone Jul 16, 2015
@mojavelinux
Member

...and we certainly want Asciidoctor to be friendly and comfortable for all languages.

@chloerei
Member

Chinese has the same problem. I wrote a Treeprocessor:

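require 'asciidoctor/extensions'

# Rebuilds every paragraph with its line breaks removed, so the renderer has no
# newline left to turn into a space. Note that this strips the breaks from all
# paragraphs, including those in languages that need the space.
class TrailingTreeprocessor < Asciidoctor::Extensions::Treeprocessor
  def process document
    return unless document.blocks?
    process_blocks document
    nil
  end

  # Walks the block tree depth-first and replaces each paragraph in place.
  def process_blocks node
    node.blocks.each_with_index do |block, index|
      if block.context == :paragraph
        # Take the paragraph's content, drop its newlines, and build a new paragraph from it.
        node.blocks[index] = create_paragraph block.document, block.content.gsub("\n", ''), block.attributes
      else
        process_blocks block
      end
    end
  end
end

# Register the treeprocessor so that loading this file with -r activates it.
Asciidoctor::Extensions.register do
  treeprocessor TrailingTreeprocessor
end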

Save in config.rb, then:

$ asciidoctor -r ./config.rb filename.adoc

@mojavelinux
Member

The switch that core would need to expose is which character stands in for a prose endline. In Latin-based languages, it is a literal endline. For CJK, it would need to be an empty string.

@mojavelinux mojavelinux modified the milestones: v1.6.0, discussion Jan 3, 2016
@mojavelinux mojavelinux self-assigned this Jan 3, 2016
@mojavelinux
Member

And this would be something that could be controlled through the language or language family.

For now, you need to either take the approach that @chloerei suggested or avoid inserting endlines in your prose in the AsciiDoc source document.
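As a rough sketch of the idea of choosing the prose endline by language, the Ruby snippet below picks the joiner from a language code. Everything in it is an assumption made for illustration: the constant `NO_SPACE_LANGS`, the helpers `prose_endline` and `join_prose`, and the choice of `ja` and `zh` are hypothetical, and no such setting exists in Asciidoctor.

```ruby
# Hypothetical list of primary language codes whose prose does not
# separate words with spaces (an assumption for this sketch).
NO_SPACE_LANGS = %w[ja zh].freeze

# Returns the string that would replace a line ending inside a paragraph:
# nothing for the languages above, a literal endline otherwise.
def prose_endline(lang)
  primary = lang.to_s.downcase.split(/[-_]/).first
  NO_SPACE_LANGS.include?(primary) ? '' : "\n"
end

# Joins the source lines of one paragraph using the language's prose endline.
def join_prose(lines, lang)
  lines.map(&:rstrip).join(prose_endline(lang))
end

lines = ['この例文は、編集作業を', '楽にするために複数の行', 'に分割されています。']
puts join_prose(lines, 'ja')
```

With a language such as `en`, or with no language at all, the lines stay separated by an endline, which the renderer then collapses into a space as before.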

@lo48576

lo48576 commented Sep 6, 2016

Macros in CJK have the same problem (unwanted spaces).

これは link:http://example.com[リンク]です。

is currently (version 1.5.4) converted to

これは <a href="http://example.com">リンク</a>です。

which contains an unwanted space before <a>.
(Fortunately, no space is needed after ], which suits CJK.)

Proposal

I propose the two (mutually exclusive) sets of rules below.

Rule to remove spaces

These are removed in the output:

  • 1: a single whitespace character right before a macro and right after a CJK character
  • 2: a single line break right after a CJK character or a hyphen

Example:

これは link:foo[リンク]です。
// rule 1: a whitespace, right before macro, right after CJK character.

改行を
含みます。
// rule 2: single line break, right after CJK character ("を").

four-years-
old
// rule 2: single line break, right after hyphen ("-").

would be converted to:

<p>これは<a href="foo">リンク</a>です。</p>
<p>改行を含みます。</p>
<p>four-years-old</p>

Rule to preserve space(s)

These are converted to single space in output:

  • 3: a single line break right after a non-CJK, non-hyphen character
  • 4: whitespaces at the beginning of the line (except for the first line of the paragraph)
  • 5: one (or both) of the below:
    • 5-1: two or more successive whitespace characters right before macro, or
    • 5-2: one or more successive whitespace characters right after non-CJK character

Example:

English with
a linebreak.
// rule 3: single line break, right after non-CJK and non-hyphen character ("h").

空白のあとに  link:bar[リンク]があります。
// Contains two whitespace characters before `link`.
// rule 5-1: two successive whitespace characters, before macro (`link`).

A link after link:baz[a whitespace].
// rule 5-2: one whitespace character, right after non-CJK character ("r").

この空白は
 保持されます。
// Contains single whitespace before "保持".
// rule 4: a whitespace at the beginning of the second line.

a-
b-
 c
// Contains single whitespace before "c".
// "a-b": rule 2: single line break, right after hyphen ("-").
// "b-c": rule 4: a whitespace at the beginning of the third line.

would be converted to:

<p>English with a linebreak.</p>
<p>空白のあとに <a href="bar">リンク</a>があります。</p>
<p>A link after <a href="baz">a whitespace</a>.</p>
<p>この空白は 保持されます。</p>
<p>a-b- c</p>
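The proposed rules can be sketched in plain Ruby, working on the paragraph source before any conversion. This is only one possible reading of the proposal, not an implementation that exists in Asciidoctor. The names `CJK`, `glue_char?`, `join_paragraph`, and `squeeze_before_link` are hypothetical, the definition of "CJK character" (Han, Hiragana, Katakana, CJK punctuation, fullwidth forms) is an assumption, and only the `link:` macro is handled.

```ruby
# Assumed definition of "CJK character" for this sketch.
CJK = /[\p{Han}\p{Hiragana}\p{Katakana}\u3000-\u303F\uFF00-\uFFEF]/

# True when a line break right after this character is dropped (rule 2).
def glue_char?(char)
  char == '-' || CJK.match?(char.to_s)
end

# Rules 2 to 4: joins the lines of one paragraph.
def join_paragraph(source)
  first, *rest = source.lines.map(&:chomp)
  rest.reduce(first.to_s) do |out, line|
    if line.match?(/\A[ \t]/)
      "#{out} #{line.lstrip}" # rule 4: leading whitespace becomes one space
    elsif glue_char?(out[-1])
      out + line              # rule 2: break after CJK or hyphen is removed
    else
      "#{out} #{line}"        # rule 3: break after any other character is a space
    end
  end
end

# Rules 1 and 5: normalizes whitespace right before a link: macro.
def squeeze_before_link(text)
  text.gsub(/(\S)([ \t]+)(?=link:)/) do
    prev = Regexp.last_match(1)
    gap = Regexp.last_match(2)
    # Rule 1 removes a single space after CJK; rules 5-1 and 5-2 keep one space.
    CJK.match?(prev) && gap.length == 1 ? prev : "#{prev} "
  end
end

puts join_paragraph("a-\nb-\n c")
puts squeeze_before_link('これは link:foo[リンク]です。')
```

Checking for leading whitespace first means rule 4 takes priority over rule 2, which is what the "a-b- c" example in the proposal requires.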

@thom4parisot
Member

Digging a bit in the issues, I found we probably had a similar conversation in #1174.

@crocket

crocket commented Aug 4, 2017

@lo48576 Yo, man.
I found that

これはlink:http://example.com[リンク]です。

is converted to

<p>これは<a href="http://example.com">リンク</a>です。</p>

And,

crocketlink:http://example.com[リンク]です。

is converted to

<p>crocket<a href="http://example.com">リンク</a>です。</p>

If you would like to make a visual distinction in the adoc source, you can use pass:[]

pass:[これは]link:http://example.com[リンク]です。

@crocket

crocket commented Aug 5, 2017

What about letting backslash at the end of a line concatenate the next line to the current line?

Could

この例文は、編集作業を\
楽にするために複数の行\
に分割されています。

be converted to

この例文は、編集作業を楽にするために複数の行に分割されています。

without breaking backward compatibility?
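A minimal Ruby sketch of what such a continuation rule could look like as a preprocessing step on the source text. The helper name `join_continued_lines` is hypothetical, and the sketch does not answer the compatibility question: it ignores any existing meaning a trailing backslash may already have in AsciiDoc.

```ruby
# Removes a backslash that is immediately followed by a line ending,
# together with that line ending, so the two lines are concatenated.
# (Hypothetical helper; not an existing Asciidoctor feature.)
def join_continued_lines(source)
  source.gsub(/\\\r?\n/, '')
end

source = "この例文は、編集作業を\\\n楽にするために複数の行\\\nに分割されています。"
puts join_continued_lines(source)
```

Lines that do not end in a backslash keep their line ending, so existing documents without trailing backslashes would pass through unchanged.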

@zhangkaizhao

AFAIK, a trailing backslash that joins the next line is the solution used in reStructuredText, which is the only markup language that supports this feature so far.

@zhangkaizhao

Hi.
I just wrote a Treeprocessor that is a port of markdown-it-cjk-breaks for AsciiDoc.
Feel free to try it if you are interested: https://github.com/zhangkaizhao/asciidoctor_cjk_breaks


Related information in Markdown as far as I know:

https://talk.commonmark.org/t/soft-line-breaks-should-not-introduce-spaces/285

(which came from TryGhost/Ghost#3893 )

Plugin for markdown-it to automatically deal with segment breaks:

https://github.com/markdown-it/markdown-it-cjk-breaks

whose algorithm matches CSS Text Module Level 3

(which came from https://talk.commonmark.org/t/soft-line-breaks-should-not-introduce-spaces/285/9 )

It is said this plugin is similar to one in pandoc:

jgm/pandoc#534

@mojavelinux
Member

Related issue: #4468.

@tats-u

tats-u commented May 20, 2024

The CSS spec has changed, and this is handled by CSS, not HTML.
A newline surrounded only by Chinese/Japanese characters must simply be removed. Unconditionally inserting a space is now a bug,
and browsers other than Firefox have it:

https://wpt.fyi/results/css/css-text/line-breaking?label=experimental&label=master&aligned&q=segment-break-transformation-rules-
https://drafts.csswg.org/css-text-4/#line-break-transform
https://issues.chromium.org/issues/40069685
https://issues.chromium.org/issues/40774934
https://bugs.webkit.org/show_bug.cgi?id=260857
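The behavior described in this comment can be approximated in Ruby as follows. This is a simplification and not the CSS algorithm itself: it treats "Chinese/Japanese character" as Han, Hiragana, Katakana, CJK punctuation, or fullwidth forms, which is an assumption of this sketch. The names `CJ`, `SEGMENT_BREAK`, and `transform_segment_breaks` are hypothetical.

```ruby
# Assumed approximation of "Chinese/Japanese character".
CJ = /[\p{Han}\p{Hiragana}\p{Katakana}\u3000-\u303F\uFF00-\uFFEF]/

# A line break together with any spaces or tabs around it.
SEGMENT_BREAK = /[ \t]*\r?\n[ \t]*/

# Removes a line break when the characters on both sides are Chinese/Japanese;
# every other line break becomes a single space.
def transform_segment_breaks(text)
  text.gsub(SEGMENT_BREAK) do
    match = Regexp.last_match
    before = match.pre_match[-1]
    after = match.post_match[0]
    before && after && CJ.match?(before) && CJ.match?(after) ? '' : ' '
  end
end

puts transform_segment_breaks("この例文は、編集作業を\n楽にするために複数の行\nに分割されています。")
puts transform_segment_breaks("This example sentence is split\nover multiple lines.")
```

A break between a Chinese/Japanese character and a Latin letter still becomes a space in this sketch, since only one side qualifies.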

@tonytonyjan

@mojavelinux I think we can close the issue, since it is a browser issue rather than an Asciidoctor issue?

@tats-u

tats-u commented May 20, 2024

Does Asciidoctor simply retain newlines, without converting them to spaces itself?
Removing newlines that are surrounded only by Chinese/Japanese characters would be a reasonable way to mitigate this problem.

@mojavelinux
Member

Yes, Asciidoctor leaves the space characters as they are written (spaces remain as spaces and newlines remain as newlines). The assumption is that the renderer will normalize them, such as the browser for HTML.

@tats-u

tats-u commented May 20, 2024

I see. Do you know where we should discuss a specification shared across the entire Asciidoctor family?
As I recall, Asciidoctor-pdf doesn't use a browser to render the document.
If you have the authority to move this issue there (to another repo), could you do that instead of closing it?

@mojavelinux
Member

You're free to ask open-ended questions in the project chat at https://chat.asciidoctor.org.

@tats-u

tats-u commented May 21, 2024

I see. Asciidoctor supports DocBook and EPUB output, too. I don't know how they treat newlines in XML. I have two questions about them: are newlines left to the renderers? And how many of those renderers are built on a browser engine?

@mojavelinux
Member

Please continue this discussion in the chat. The issue tracker is intended to track design decisions. It's not for open-ended discussions.

@mojavelinux
Member

In terms of the HTML converter, it seems this issue has been resolved by CSS and thus no action is needed here.

@mojavelinux mojavelinux removed this from the M2 milestone May 21, 2024
@mojavelinux mojavelinux added invalid Invalid or outdated issue and removed enhancement labels May 21, 2024
@tats-u

tats-u commented May 21, 2024

In terms of the HTML converter,

DocBook is XML-based. Are you saying that EPUB and DocBook both use CSS for styling?

@tats-u

tats-u commented May 21, 2024

Also, Asciidoctor shouldn't rely too heavily on the CSS implementations in today's web browsers.

@mojavelinux
Member

If this is a behavior you need, you're welcome to extend the converter and add the logic to that extended converter. This is not something we're going to add to Asciidoctor right now.

@asciidoctor asciidoctor locked as resolved and limited conversation to collaborators May 21, 2024