PRE block child nodes not parsed & DOCTYPE not recognized #78

nonara · 2020-10-26T19:41:32Z

Hi! First, thanks for the great work on this library! This is truly the fastest and best out there.

I needed a lightning fast html-to-markdown library, and everything out there was extremely slow. With the help of node-html-parser, I was able to write something that blazes through it! (I'll be writing a readme soon, but it is a full release)

In the process, I discovered two issues, which I'll detail below.

PRE blocks don't parse children

In pre-formatted elements, contents are always being treated as text, so child-nodes do not get parsed.

This causes an issue for multi-line code blocks. Per spec, a multi-line code block is <code> wrapped in <pre>. (see: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/code#Notes)

The following treats <code ... > as a text node, where spec behaviour is to allow HTML child nodes within pre

<pre>
  <code class='language-javascript'>
    console.log('hello');
  </code>
</pre>

You can see how we workaround this, here: https://github.com/crosstype/node-html-markdown/blob/master/src/config.ts#L79-L89

DOCTYPE is parsed as a text node

<!DOCTYPE ...> nodes are being parsed as text nodes. I think the desired behaviour here would be to simply ignore the node.

To work around this, we're temporarily replacing them ahead of time. (see: https://github.com/crosstype/node-html-markdown/blob/master/src/main.ts#L40-L42)

Side-note

This and another of my libraries broke after a recent yarn install, due to the fact that the options we passed were no longer valid. We were specifying { pre: true, style: false }

Not a major issue, but if you're following semver, changing public API options is generally considered a breaking change, as it could cause projects which use it as a dependency to fail to compile.

Thanks again for the great work, and I hope you have a good day!

The text was updated successfully, but these errors were encountered:

taoqf · 2020-10-27T05:49:18Z

tag pre was defined as BlockTextElements, I don't know what will happen if I remove it from blocktextelements. maybe it will slow down the speed in most use cases, maybe not. anyway, I have added an option blockTextElements to resolve this and I hope this would not cause much trouble.
1.It will break something if I ignore tag doctype.someone may use this parse a whole page, and convert back to text after they changed something.so I'm sorry I can't add this yet.

nonara · 2020-10-27T19:09:42Z

I have added an option blockTextElements to resolve this and I hope this would not cause much trouble.

That will work. Thanks for such a fast response!

It will break something if I ignore tag doctype. ...

I understand if you don't want to ignore it. That makes sense. I should have probably clarified the problem better.

There are three types of nodes that this parser is creating during parsing.

ELEMENT_NODE
TEXT_NODE
COMMENT_NODE

It turns out that !DOCTYPE isn't any of these. Instead, it's a DOCUMENT_TYPE_NODE (see: https://developer.mozilla.org/en-US/docs/Web/API/Node/nodeType#Constants)

The difficulty here is that this parser is treating DOCTYPE as a TEXT_NODE, which makes it output the entire thing (ie. <!DOCTYPE html>) as text.

In this case, if you'd like to preserve it, it would likely be best to create a proper DOCUMENT_TYPE_NODE. Or, worst case an ELEMENT_NODE with the tagName !DOCTYPE.

Otherwise, anything which assumes a TEXT_NODE should always be visible text and never be an HTML tag, will get the side-effect of having HTML code pollute their output.

This should be doable without hurting performance. If you'd like help, let me know and I can look into it and make sure perf numbers aren't affected.

I don't know what will happen if I remove it from blocktextelements. maybe it will slow down the speed in most use cases,

In terms of performance, you'd likely see little to no difference. The reason I'd argue for changing it is because an HTML parser should comply with HTML spec. In the case of pre, child elements are allowed and not considered text. If it weren't for <pre><code></code></pre>, it may not matter as much, but multi-line code blocks are pretty common.

But I want to clarify that I'm not trying to nit-pick. Your blockTextElements solution will work fine for me!

I decided that I'd leave the comment here because due to the difference in this and HTML spec, there is a chance that duplicate issues may be opened. So if in the future this is something you want to look into, I'll be interested in following the conversation.

nonara · 2020-10-27T19:13:32Z

As a side note, if you'd ever like help in maintaining this, feel free to send a ping any time. I do a lot of work with parsers and compilers. Thanks again for the fast response and fix!

sebromero · 2021-02-01T21:19:38Z

Hi all! I'm using node-html-parser 2.1.0 with the blockTextElements to get the nodes from code tags inside pre. However, the classNames attribute of those code tags is empty. Is that the expected behaviour?

parser.parse(this.rawHTML, {
    blockTextElements: {
        script: true,
        noscript: true,
        style: true,
        pre: false
      }
});

taoqf · 2021-02-02T06:11:39Z

@sebromero Yes. If you want get code inside pre, remove pre from blocktextelments, and add add code. like this.

parser.parse(this.rawHTML, {
    blockTextElements: {
        script: true,
        noscript: true,
        style: true,
        code: false
      }
});

sebromero · 2021-02-02T12:43:38Z

@taoqf That worked perfectly, thank you! It wasn't fully clear to me from the documentation how to use the config object. I thought those were boolean flags on whether or not to parse those elements as blockTextElements. 😅

dlemfh · 2022-01-13T01:32:37Z

Hi, from my tests on using this library

parse(SOME_HTML_TEXT, { blockTextElements: { pre: true }}) causes

the pre element to be an ELEMENT_NODE
children within pre to be a single TEXT_NODE

parse(SOME_HTML_TEXT, { blockTextElements: { }}) causes

the pre element to be an ELEMENT_NODE
html tags within pre to be parsed as ELEMENT_NODEs as well

parse(SOME_HTML_TEXT, { blockTextElements: { pre: false }}) causes

the pre element to be an ELEMENT_NODE
and the pre element is parsed to have no children at all

Would I be correct to assume this is the intended behavior? If there's anything I'm missing I'd appreciate any heads up!

nonara · 2022-01-16T18:11:51Z

@dlemfh It looks like the logic is that the presence of the key indicates that it's block text, and the value (true or false) indicates whether its contents should be ignored or preserved.

In other words, it seems that it's behaving as expected. If you want to not treat pre as block text, you should omit it entirely from the object. Hope that helps!

dlemfh · 2022-01-17T02:44:42Z

@nonara Seems that way. Thanks!!

bogdanionitabp · 2024-03-13T09:41:16Z

But it seems you must still pass an empty object as the blockTextElements, otherwise the pre elements will not be parsed. If this was intended, then it's too confusing, otherwise it's too buggy.

taoqf added a commit that referenced this issue Oct 27, 2020

add option blockTextElements #78

3e32856

taoqf pinned this issue Oct 29, 2020

taoqf closed this as completed Feb 23, 2021

jjcaballero mentioned this issue Sep 8, 2021

feat: create datasource for artifactory registry renovatebot/renovate#11602

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PRE block child nodes not parsed & DOCTYPE not recognized #78

PRE block child nodes not parsed & DOCTYPE not recognized #78

nonara commented Oct 26, 2020 •

edited

taoqf commented Oct 27, 2020

nonara commented Oct 27, 2020 •

edited

nonara commented Oct 27, 2020

sebromero commented Feb 1, 2021

taoqf commented Feb 2, 2021

sebromero commented Feb 2, 2021

dlemfh commented Jan 13, 2022

nonara commented Jan 16, 2022 •

edited

dlemfh commented Jan 17, 2022

bogdanionitabp commented Mar 13, 2024

PRE block child nodes not parsed & DOCTYPE not recognized #78

PRE block child nodes not parsed & DOCTYPE not recognized #78

Comments

nonara commented Oct 26, 2020 • edited

PRE blocks don't parse children

DOCTYPE is parsed as a text node

Side-note

taoqf commented Oct 27, 2020

nonara commented Oct 27, 2020 • edited

nonara commented Oct 27, 2020

sebromero commented Feb 1, 2021

taoqf commented Feb 2, 2021

sebromero commented Feb 2, 2021

dlemfh commented Jan 13, 2022

nonara commented Jan 16, 2022 • edited

dlemfh commented Jan 17, 2022

bogdanionitabp commented Mar 13, 2024

nonara commented Oct 26, 2020 •

edited

nonara commented Oct 27, 2020 •

edited

nonara commented Jan 16, 2022 •

edited