Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PRE block child nodes not parsed & DOCTYPE not recognized #78

Closed
nonara opened this issue Oct 26, 2020 · 10 comments
Closed

PRE block child nodes not parsed & DOCTYPE not recognized #78

nonara opened this issue Oct 26, 2020 · 10 comments

Comments

@nonara
Copy link
Collaborator

nonara commented Oct 26, 2020

Hi! First, thanks for the great work on this library! This is truly the fastest and best out there.

I needed a lightning fast html-to-markdown library, and everything out there was extremely slow. With the help of node-html-parser, I was able to write something that blazes through it! (I'll be writing a readme soon, but it is a full release)

In the process, I discovered two issues, which I'll detail below.

PRE blocks don't parse children

In pre-formatted elements, contents are always being treated as text, so child-nodes do not get parsed.

This causes an issue for multi-line code blocks. Per spec, a multi-line code block is <code> wrapped in <pre>. (see: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/code#Notes)

The following treats <code ... > as a text node, where spec behaviour is to allow HTML child nodes within pre

<pre>
  <code class='language-javascript'>
    console.log('hello');
  </code>
</pre>

You can see how we workaround this, here: https://github.com/crosstype/node-html-markdown/blob/master/src/config.ts#L79-L89

DOCTYPE is parsed as a text node

<!DOCTYPE ...> nodes are being parsed as text nodes. I think the desired behaviour here would be to simply ignore the node.

To work around this, we're temporarily replacing them ahead of time. (see: https://github.com/crosstype/node-html-markdown/blob/master/src/main.ts#L40-L42)

Side-note

This and another of my libraries broke after a recent yarn install, due to the fact that the options we passed were no longer valid. We were specifying { pre: true, style: false }

Not a major issue, but if you're following semver, changing public API options is generally considered a breaking change, as it could cause projects which use it as a dependency to fail to compile.

Thanks again for the great work, and I hope you have a good day!

taoqf added a commit that referenced this issue Oct 27, 2020
@taoqf
Copy link
Owner

taoqf commented Oct 27, 2020

  1. tag pre was defined as BlockTextElements, I don't know what will happen if I remove it from blocktextelements. maybe it will slow down the speed in most use cases, maybe not. anyway, I have added an option blockTextElements to resolve this and I hope this would not cause much trouble.
    1.It will break something if I ignore tag doctype.someone may use this parse a whole page, and convert back to text after they changed something.so I'm sorry I can't add this yet.

@nonara
Copy link
Collaborator Author

nonara commented Oct 27, 2020

I have added an option blockTextElements to resolve this and I hope this would not cause much trouble.

That will work. Thanks for such a fast response!

It will break something if I ignore tag doctype. ...

I understand if you don't want to ignore it. That makes sense. I should have probably clarified the problem better.

There are three types of nodes that this parser is creating during parsing.

  1. ELEMENT_NODE
  2. TEXT_NODE
  3. COMMENT_NODE

It turns out that !DOCTYPE isn't any of these. Instead, it's a DOCUMENT_TYPE_NODE (see: https://developer.mozilla.org/en-US/docs/Web/API/Node/nodeType#Constants)

The difficulty here is that this parser is treating DOCTYPE as a TEXT_NODE, which makes it output the entire thing (ie. <!DOCTYPE html>) as text.

In this case, if you'd like to preserve it, it would likely be best to create a proper DOCUMENT_TYPE_NODE. Or, worst case an ELEMENT_NODE with the tagName !DOCTYPE.

Otherwise, anything which assumes a TEXT_NODE should always be visible text and never be an HTML tag, will get the side-effect of having HTML code pollute their output.

This should be doable without hurting performance. If you'd like help, let me know and I can look into it and make sure perf numbers aren't affected.

I don't know what will happen if I remove it from blocktextelements. maybe it will slow down the speed in most use cases,

In terms of performance, you'd likely see little to no difference. The reason I'd argue for changing it is because an HTML parser should comply with HTML spec. In the case of pre, child elements are allowed and not considered text. If it weren't for <pre><code></code></pre>, it may not matter as much, but multi-line code blocks are pretty common.

But I want to clarify that I'm not trying to nit-pick. Your blockTextElements solution will work fine for me!

I decided that I'd leave the comment here because due to the difference in this and HTML spec, there is a chance that duplicate issues may be opened. So if in the future this is something you want to look into, I'll be interested in following the conversation.

@nonara
Copy link
Collaborator Author

nonara commented Oct 27, 2020

As a side note, if you'd ever like help in maintaining this, feel free to send a ping any time. I do a lot of work with parsers and compilers. Thanks again for the fast response and fix!

@taoqf taoqf pinned this issue Oct 29, 2020
@sebromero
Copy link

Hi all! I'm using node-html-parser 2.1.0 with the blockTextElements to get the nodes from code tags inside pre. However, the classNames attribute of those code tags is empty. Is that the expected behaviour?

parser.parse(this.rawHTML, {
    blockTextElements: {
        script: true,
        noscript: true,
        style: true,
        pre: false
      }
});

@taoqf
Copy link
Owner

taoqf commented Feb 2, 2021

@sebromero Yes. If you want get code inside pre, remove pre from blocktextelments, and add add code. like this.

parser.parse(this.rawHTML, {
    blockTextElements: {
        script: true,
        noscript: true,
        style: true,
        code: false
      }
});

@sebromero
Copy link

@taoqf That worked perfectly, thank you! It wasn't fully clear to me from the documentation how to use the config object. I thought those were boolean flags on whether or not to parse those elements as blockTextElements. 😅

@dlemfh
Copy link

dlemfh commented Jan 13, 2022

Hi, from my tests on using this library

parse(SOME_HTML_TEXT, { blockTextElements: { pre: true }}) causes

  • the pre element to be an ELEMENT_NODE
  • children within pre to be a single TEXT_NODE

parse(SOME_HTML_TEXT, { blockTextElements: { }}) causes

  • the pre element to be an ELEMENT_NODE
  • html tags within pre to be parsed as ELEMENT_NODEs as well

parse(SOME_HTML_TEXT, { blockTextElements: { pre: false }}) causes

  • the pre element to be an ELEMENT_NODE
  • and the pre element is parsed to have no children at all

Would I be correct to assume this is the intended behavior? If there's anything I'm missing I'd appreciate any heads up!

@nonara
Copy link
Collaborator Author

nonara commented Jan 16, 2022

@dlemfh It looks like the logic is that the presence of the key indicates that it's block text, and the value (true or false) indicates whether its contents should be ignored or preserved.

In other words, it seems that it's behaving as expected. If you want to not treat pre as block text, you should omit it entirely from the object. Hope that helps!

@dlemfh
Copy link

dlemfh commented Jan 17, 2022

@nonara Seems that way. Thanks!!

@bogdanionitabp
Copy link

But it seems you must still pass an empty object as the blockTextElements, otherwise the pre elements will not be parsed. If this was intended, then it's too confusing, otherwise it's too buggy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants