Keep buffer for long text during transformation #357

Closed
wants to merge 1 commit
Conversation

@qnighy commented Aug 18, 2021

In RewritingStream, buffers are sometimes dropped before the raw HTML source can be retrieved. This leads to long text nodes being truncated (only a suffix is emitted).

This PR introduces a "buffer safekeeping" extension that keeps a buffer alive at a specified location. It allows RewritingStream to always retrieve the correct raw HTML source.

Fixes #292.
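
For reference, a minimal reproduction sketch of the truncation (this assumes parse5-html-rewriting's `RewritingStream`, whose event handlers receive the token plus the raw HTML slice it was parsed from; the named import and exact threshold are illustrative and may differ between versions):

```ts
import { Writable } from 'node:stream';
import { RewritingStream } from 'parse5-html-rewriting';

const rewriter = new RewritingStream();
let rawLength = 0;

// The handler receives the token and the raw HTML slice it was parsed from;
// it is this raw slice that comes back truncated for very long text nodes.
rewriter.on('text', (_token, rawHtml) => {
  rawLength += rawHtml.length;
  rewriter.emitRaw(rawHtml); // pass the text through unchanged
});

// Discard the rewritten output so backpressure does not stall the stream.
const devNull = new Writable({ write(_chunk, _enc, cb) { cb(); } });

const longText = 'a'.repeat(70_000); // longer than the 65536-char buffer waterline

rewriter.pipe(devNull).on('finish', () => {
  // Without the fix, rawLength ends up smaller than longText.length,
  // because only a suffix of the text node is emitted.
  console.log(rawLength, longText.length);
});

rewriter.end(`<p>${longText}</p>`);
```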

@petercmuc

Hi, I ran into the same problem... what needs to be done to have your fix merged?

@fb55 (Collaborator) commented Jan 18, 2022

This is currently blocked by #362. Once that PR is merged, here is the approach I would take instead:

This issue is caused by dropParsedChunk calls in states such as the DATA state. If we delay these calls until after character tokens are emitted, we fix this issue. Because the tokenizer has the indirection of the token queue, we could either (1) move the dropParsedChunk call to the beginning of getNextToken, or (2) remove the token queue by switching to a callback interface.

(1) will always waste a bit of memory, since parsed chunks are only dropped just before the next token is generated. We might also want to look into dropping parsed chunks whenever we add to the buffer in the preprocessor, to reduce the impact of this change.

(2) is a much bigger architectural change. It would reliably drop parsed chunks as soon as tokens are emitted, but might have other unintended consequences (e.g. reduced performance).

To fix this specific issue, I would lean towards implementing (1).
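
A rough sketch of what (1) could look like (simplified structure, not the actual parse5 tokenizer; only `getNextToken` and `Preprocessor.dropParsedChunk` correspond to existing names):

```ts
interface Token { type: string; }

interface Preprocessor {
  // Discards the part of the input buffer that has already been parsed.
  dropParsedChunk(): void;
}

class Tokenizer {
  private tokenQueue: Token[] = [];
  private active = true;

  constructor(private preprocessor: Preprocessor) {}

  getNextToken(): Token | undefined {
    // By the time the next token is requested, the raw HTML of the previously
    // returned tokens has already been consumed by the caller, so dropping the
    // parsed chunk here can no longer truncate a token that is still open.
    this.preprocessor.dropParsedChunk();

    while (this.tokenQueue.length === 0 && this.active) {
      this.runCurrentState(); // consume code points; may push tokens onto the queue
    }

    return this.tokenQueue.shift();
  }

  private runCurrentState(): void {
    // State machine elided; crucially, the DATA-like states no longer call
    // preprocessor.dropParsedChunk() themselves.
    this.active = false;
  }
}
```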

fb55 added a commit that referenced this pull request Mar 3, 2022
Fixes #357

Co-Authored-By: Masaki Hara <41755+qnighy@users.noreply.github.com>
@fb55 (Collaborator) commented Mar 4, 2022

Fixed in #432

@fb55 fb55 closed this Mar 4, 2022
@qnighy qnighy deleted the buffer-safekeeping branch March 26, 2022 09:06
Development

Successfully merging this pull request may close these issues.

Rewriter: text content longer than 65536 chars is truncated