Keep buffer for long text during transformation #357

Closed
wants to merge 1 commit
Conversation

@qnighy commented Aug 18, 2021

In RewritingStream, buffers are sometimes dropped before the raw HTML source can be retrieved. This leads to long text nodes being truncated (only a suffix is emitted).

This PR introduces a "buffer safekeeping" extension that keeps a buffer alive at a specified location. It allows RewritingStream to always retrieve the correct raw HTML source.

Fixes #292.
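
For reference, a minimal reproduction sketch of the truncation (this assumes parse5-html-rewriting's `RewritingStream`, whose event handlers receive the token plus the raw HTML slice it was parsed from; the named import and exact threshold are illustrative and may differ between versions):

```ts
import { Writable } from 'node:stream';
import { RewritingStream } from 'parse5-html-rewriting';

const rewriter = new RewritingStream();
let rawLength = 0;

// The handler receives the token and the raw HTML slice it was parsed from;
// it is this raw slice that comes back truncated for very long text nodes.
rewriter.on('text', (_token, rawHtml) => {
  rawLength += rawHtml.length;
  rewriter.emitRaw(rawHtml); // pass the text through unchanged
});

// Discard the rewritten output so backpressure does not stall the stream.
const devNull = new Writable({ write(_chunk, _enc, cb) { cb(); } });

const longText = 'a'.repeat(70_000); // longer than the 65536-char buffer waterline

rewriter.pipe(devNull).on('finish', () => {
  // Without the fix, rawLength ends up smaller than longText.length,
  // because only a suffix of the text node is emitted.
  console.log(rawLength, longText.length);
});

rewriter.end(`<p>${longText}</p>`);
```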

@petercmuc

Hi, I ran into the same problem... what needs to be done to have your fix merged?

@fb55 (Collaborator) commented Jan 18, 2022

This is currently blocked by #362. Once that PR is merged, here is the approach I would take instead:

This issue is caused by dropParsedChunk calls in states such as the DATA state. If we delay these calls until after character tokens are emitted, we fix this issue. Because the tokenizer has the indirection of the token queue, we could either (1) move the dropParsedChunk call to the beginning of getNextToken, or (2) remove the token queue by switching to a callback interface.

(1) will always waste a bit of memory, since parsed chunks are only dropped just before the next token is generated. We might also want to look into dropping parsed chunks whenever we add to the buffer in the preprocessor, to reduce the impact of this change.

(2) is a much bigger architectural change. It would reliably drop parsed chunks as soon as tokens are emitted, but might have other unintended consequences (e.g. reduced performance).

To fix this specific issue, I would lean towards implementing (1).
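
A rough sketch of what (1) could look like (simplified structure, not the actual parse5 tokenizer; only `getNextToken` and `Preprocessor.dropParsedChunk` correspond to existing names):

```ts
interface Token { type: string; }

interface Preprocessor {
  // Discards the part of the input buffer that has already been parsed.
  dropParsedChunk(): void;
}

class Tokenizer {
  private tokenQueue: Token[] = [];
  private active = true;

  constructor(private preprocessor: Preprocessor) {}

  getNextToken(): Token | undefined {
    // By the time the next token is requested, the raw HTML of the previously
    // returned tokens has already been consumed by the caller, so dropping the
    // parsed chunk here can no longer truncate a token that is still open.
    this.preprocessor.dropParsedChunk();

    while (this.tokenQueue.length === 0 && this.active) {
      this.runCurrentState(); // consume code points; may push tokens onto the queue
    }

    return this.tokenQueue.shift();
  }

  private runCurrentState(): void {
    // State machine elided; crucially, the DATA-like states no longer call
    // preprocessor.dropParsedChunk() themselves.
    this.active = false;
  }
}
```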

fb55 added a commit that referenced this pull request Mar 3, 2022
Fixes #357

Co-Authored-By: Masaki Hara <41755+qnighy@users.noreply.github.com>
@fb55 (Collaborator) commented Mar 4, 2022

Fixed in #432

@fb55 fb55 closed this Mar 4, 2022
@qnighy qnighy deleted the buffer-safekeeping branch March 26, 2022 09:06
Development

Successfully merging this pull request may close these issues.

Rewriter: text content longer than 65536 chars is truncated