Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possibly better verse block handling? #48

Open
ispringle opened this issue Aug 28, 2022 · 7 comments
Open

Possibly better verse block handling? #48

ispringle opened this issue Aug 28, 2022 · 7 comments

Comments

@ispringle
Copy link
Contributor

ispringle commented Aug 28, 2022

I was a bit surprised when my verse blocks were rendering as <pre>. I looked into the source and saw your annotations as to the why:

This could be /mostly/ resolved with CSS but there are going to be issues still. For example, superscript doesn't render correctly but instead results in ^superscript.

I looked into the rehype-minify that you mentioned and it's doing a fairly naive (arguably on purpose) replace and using a fairly basic definition of whitespace (https://github.com/rehypejs/rehype-minify/blob/1dc9280c341087a40dfaa332792c095f96d41686/packages/rehype-minify-whitespace/index.js#L286). In this case it's looking for literal spaces, tabs, newlines, and carriage returns. With regard to spaces, arguably the only whitespace that /really/ matters for our purposes, it's only looking for the literal space. Additionally, the definition of "whitespace" it searches for in the HAST is equally naive (again, likely purposefully so) and is only looking for the regular expression /[ \t\n\f\r]/g (https://github.com/syntax-tree/hast-util-whitespace/blob/3c765ef9b3fc561976649b97543498cfa7068760/index.js#L16) Thus, I think we can do exactly what org-publish does and wrap each verse block in it's own <p>, replace spaces with the non-breaking space (&nbsp;), and replace newlines with <br>. I examined the rehype-minify plugin and it also will not remove when it comes before or after an element, ie This space at the end of this string " <br> "is preserved because it comes before an element".

I believe this means we can perfectly replicate Org's own output for the verse block without fear of rehype-minify changing the output. Thoughts? If you agree with my understanding of these two rehype plugins, I would be willing to start working on a PR.

@rasendubi
Copy link
Owner

Thanks for digging in! Replacing \n with <br /> seems to be what org is doing indeed. That should be enough to preserve newlines. Preserving inter-word spacing is less of an issue; users that find it important can either disable minification or replace p.verse with pre instead (with a unified plugin)x. (org-html-export-as-html does not replace spaces with &nbsp; so I would avoid doing that.)

@ispringle
Copy link
Contributor Author

Perhaps the verse blocks you're looking at are not using whitespace? Given the following org file:

The following verse block has ample whitespace.
#+begin_verse
Here is a verse
          Where whitespace matters and
                                   Ought be beautiful
#+end_verse
This document was exported with the most basic settings and =org-html-export-as-html=.

I got this output (sans the inner style content as it's verbose to irrelevant):

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<!-- 2022-09-03 Sat 06:30 -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>&lrm;</title>
<meta name="author" content="Ian S. Pringle" />
<meta name="generator" content="Org Mode" />
<style>
...
</style>
</head>
<body>
<div id="content" class="content">
<p>
The following verse block has ample whitespace.
</p>
<p class="verse">
Here is a verse<br />
&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;Where whitespace matters and<br />
&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;Ought be beautiful<br />
</p>
<p>
This document was exported with the most basic settings and <code>org-html-export-as-html</code>.</p>
</div>
<div id="postamble" class="status">
<p class="author">Author: Ian S. Pringle</p>
<p class="date">Created: 2022-09-03 Sat 06:30</p>
</div>
</body>
</html>

Now normally my org-export function uses the HTML encoded version of the no-break space, but here it's using the hex value. Not sure if that matters one way or the other, might be htmlize or something playing with things...

@rasendubi
Copy link
Owner

Oh, I see! so, the actual behavior is that verse blocks are first stripped of the common whitespace prefix, then newlines are converted to <br /> and leading whitspace is converted to nbsp.

You can see that from this sample:

#+begin_verse
  hello
    this is a verse   triple whitespace
#+end_verse

that converts to:

<p class="verse">
hello<br />
&#xa0;&#xa0;this is a verse   triple whitespace<br />
</p>

@rasendubi
Copy link
Owner

From ox-html.el:

;;;; Verse Block

(defun org-html-verse-block (_verse-block contents info)
  "Transcode a VERSE-BLOCK element from Org to HTML.
CONTENTS is verse block contents.  INFO is a plist holding
contextual information."
  (format "<p class=\"verse\">\n%s</p>"
	  ;; Replace leading white spaces with non-breaking spaces.
	  (replace-regexp-in-string
	   "^[ \t]+" (lambda (m) (org-html--make-string (length m) "&#xa0;"))
	   ;; Replace each newline character with line break.  Also
	   ;; remove any trailing "br" close-tag so as to avoid
	   ;; duplicates.
	   (let* ((br (org-html-close-tag "br" nil info))
		  (re (format "\\(?:%s\\)?[ \t]*\n" (regexp-quote br))))
	     (replace-regexp-in-string re (concat br "\n") contents)))))

@ispringle
Copy link
Contributor Author

ispringle commented Sep 16, 2022

This'll do the trick

      case "verse-block":
        const interleave = (a, e) => a.flatMap((x) => [x, e]).slice(0, -1);
        const verses = org.children[0].value.split("\n");
        const newChildren = interleave(
          verses
            .map((v) => {
              if (v != "") {
                const value = v.replaceAll(" ", "\u00A0");
                return { type: "text", value };
              }
              return null;
            })
            .filter((v) => v != null),
          h("", "br", {}, [])
        );
        return h(org, "p.verse", {}, toHast(newChildren));

A little naive, as it replaced all spaces with nbsp's but that's because I couldn't come up with a better solution due to replaceAll only accepting globally scoped regexp. Could declare something like const leadingSpaces = /^[ \t]+/ globally but that's up to you, I didn't see a precedent for globals so I just went with this for now.

EDIT

Needs some more work, just realized there are instances of verse blocks that have n+1 children (ie verse blocks with superscript).

@ispringle
Copy link
Contributor Author

ispringle commented Sep 16, 2022

This one properly interacts with all children of a verse block:

      case "verse-block":
        const interleave = (a, e) => a.flatMap((x) => [x, e]).slice(0, -1);
        const verses = org.children.flatMap((n) =>
          n.type != "text"
            ? n
            : interleave(
                n.value
                  .split("\n")
                  .map((v) => {
                    if (v != "") {
                      const value = v.replaceAll(" ", "\u00A0");
                      return { type: "text", value };
                    }
                    return null;
                  })
                  .filter((v) => v != null),
                h("", "br", {}, [])
              )
        );
        return h(org, "p.verse", {}, toHast(verses));

@rasendubi
Copy link
Owner

rasendubi commented Jan 29, 2023

Just looked into this again and it seems to be even more complicated than that. @ispringle your last solution fails on nested children with newlines:

#+begin_verse
  some text *and
    bold* overflowing on the next line
#+end_verse

this should produce 2 nbsp's before "bold":

<p class="verse">
some text <b>and<br />
&#xa0;&#xa0;bold</b> overflowing on the next line<br />
</p>

In uniorg, that example currently parses as:

  - type: "verse-block"
    children:
      - type: "text"
        value: "  some text "
      - type: "bold"
        children:
          - type: "text"
            value: "and\\n    bold"
      - type: "text"
        value: " overflowing on the next line\\n"

So the processing probably needs to happen in two passes:

  1. Traverse all children deeply calculating the common whitespace prefix (2 in the example above). (This is complicated by the fact that first children does not have leading newline, and the last newline does not have whitespace after it)
  2. Traverse all children deeply, and:
    • strip common prefix (in the first child and after every newline)
    • replace whitespace prefix with nbsp
    • replace newlines with <br />

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants