Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leading/trailing single spaces not preserved in node text #115

Closed
nonara opened this issue Apr 23, 2021 · 9 comments
Closed

Leading/trailing single spaces not preserved in node text #115

nonara opened this issue Apr 23, 2021 · 9 comments

Comments

@nonara
Copy link
Collaborator

nonara commented Apr 23, 2021

Issue

Example:

<p>Hello <strong>world</strong>!</p>

This, unfortunately produces a TextNode whose text is Hello as opposed to Hello<space>
Examining the whole text of the parent p is Helloworld!

Unfortunately this is irreparably breaking node-html-markdown, as I have no way to determine whether there is a leading or trailing space.

see: crosstype/node-html-markdown#9

Proposal

In effort to be consistent with HTML spec, a single leading or trailing space should be captured, when present, in the text for any TextNode.

I believe this proposal would be consistent with the behaviour of other parsers. However, if you decide against that route, an alternate proposal would be to add properties to the node hasLeadingSpace / hasTrailingSpace.

Would be interested to hear your thoughts on these. Thanks!

@nonara nonara changed the title Leading/trailing single spaces not preserved in TextNode Leading/trailing single spaces not preserved in node text Apr 24, 2021
@serjant
Copy link

serjant commented Apr 27, 2021

I have the same issue, it just trims leading and trailing spaces. Seems like the code which causes the issue is in html.d.ts:

return blocks.map((block) => {
			// Normalize each line's whitespace
			return block.join('').trim().replace(/\s{2,}/g, ' ');
		})
			.join('\n').replace(/\s+$/, '');	// trimRight;

@nonara
Copy link
Collaborator Author

nonara commented Apr 27, 2021

@serjant Good find. Thanks!

Any maintainers around to weigh in on this? I'd propose a PR which removes the trim, effectively fixing the issue of removing legitimate leading/trailing whitespace. I'm happy to do this.

If there are any potential issues or side effects with this method, please let me know.

@serjant
Copy link

serjant commented May 13, 2021

@nonara Likewise the trim() method is dangerous in that piece of code, because the trim() method removes whitespace from both ends of a string. Whitespace in this context is all the whitespace characters (space, tab, no-break space, etc.) and all the line terminator characters (LF, CR, etc.). See the specification

@nonara
Copy link
Collaborator Author

nonara commented May 19, 2021

@taoqf Just checking up on this. It's been close to a month without a reply, so I need to make a determination on what to do.

At this point there are two proposals -

  1. Remove trim() to correct the issue
  2. Add hasLeadingWhiteSpace and hasTrailingWhiteapace properties to TextNode

I would suggest the former, as it complies with spec and makes the most sense.

I'm ready to do a PR for either with all necessary tests.

I don't mean to rush, but as this is a dependency, I will either need to write a new parser, fork this one, or (ideally) get a PR through. Whatever you decide is fine, but I would greatly appreciate a response to advise on which route I should take to get this fixed in my library.

Thanks!

@taoqf
Copy link
Owner

taoqf commented May 22, 2021

Sorry, I thought you would do a pr, now, I'll remove trim now.

taoqf added a commit that referenced this issue May 22, 2021
@taoqf
Copy link
Owner

taoqf commented May 22, 2021

@nonara
Copy link
Collaborator Author

nonara commented May 22, 2021

No worries. It's working well! Thanks!

@nonara nonara closed this as completed May 22, 2021
@nonara
Copy link
Collaborator Author

nonara commented May 22, 2021

Looks like I was wrong here. I didn't notice that I had been using HTMLElement#removeWhitespace, whose primary function is to remove empty whitespace nodes.

Unfortunately, it also calls trim() on the text for each TextNode.

Details and fix are in #123 - I took an approach which should also help restore some trim functionality that was removed in v3.2.2

@nonara nonara reopened this May 22, 2021
nonara added a commit to nonara/node-html-parser that referenced this issue May 22, 2021
@nonara nonara mentioned this issue May 22, 2021
taoqf added a commit that referenced this issue May 25, 2021
@taoqf
Copy link
Owner

taoqf commented May 25, 2021

@nonara nonara closed this as completed Sep 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants