refactor(tokenizer): Use explicit offsets for locations #402

fb55 · 2022-02-11T13:19:34Z

Supersedes #397

Previously, ctLoc was overwritten on every new character. #397 attempted to improve this by maintaining the locations in an updated object, and creating copies when needed.

With this PR, we instead use explicit offsets when creating tokens. There is no more need to update locations; instead, we exploit the fact that a *DATA state is entered only after emitting the previous token.

This change should lead to a great reduction in allocations when using location information.

Supersedes inikulin#397 Previously, `ctLoc` was overwritten on every new character. inikulin#397 attempted to improve this by maintaining the locations in an updated object, and creating copies when needed. With this PR, we instead use explicit offsets when creating tokens. There is no more need to update locations; instead, we exploit the fact that a *DATA state is entered only after emitting the previous token. This change should lead to a great reduction in allocations when using location information.

wooorm · 2022-02-11T14:32:52Z

packages/parse5/lib/tokenizer/index.ts

    private currentCharacterToken: CharacterToken | null = null;
    private currentToken: Token | null = null;
    private currentAttr: Attribute = { name: '', value: '' };

+    // NOTE: Doctypes tokens are created with several different offset. We keep track of the moment a doctype *might* start here.


Weird? Should examples of differences perhaps be documented here?

It just occurred to me that the ctLoc property can be used here; updated accordingly.

The issue with doctypes is that they don't start at fixed offsets. Eg. whitespace can push the token creation further back.

how is that different from elements, or anything else, that is also pushed back by whitespace?

Doctype tokens are a bit weird in the way they are created in the spec. All other tokens are created as soon as you can be certain that a specific type of token is needed. With Doctypes, this is pushed back until we get the beginning of something that needs to be saved. Eg. <!doctype html> would only have its token created on the h of html.

Do you have a link to the spec that mentions this behavior?
I don’t understand, from the tokenizing chapter, what makes DOCTYPE, [CDATA[, or -- different?

The Before DOCTYPE name state is probably the best example here. The token is only created if a non-whitespace character is found.

This is in contrast to the Markup declaration open state, which creates comment tokens after reading a specific number of characters. This allows us to set specific offsets. The same applies to the Tag open state and all of the * end tag open states. (These are all the states in which non-doctype tokens are created.)

packages/parse5/lib/tokenizer/index.ts

wooorm · 2022-02-11T14:34:32Z

packages/parse5/lib/tokenizer/index.ts

    private addLocationInfo;
    private onParseError;

    constructor(options: { sourceCodeLocationInfo?: boolean; onParseError?: ParserErrorHandler | null }) {
        this.addLocationInfo = !!options.sourceCodeLocationInfo;
        this.onParseError = options.onParseError ?? null;
        this.preprocessor = new Preprocessor(options);
+        this.ctLoc = this.getCurrentLocation(-1);


The doubly negated offset makes things a bit difficult here. An offset of +1 means "start one character ago". -1 means "start one character after our last seen one".

The preprocessor is initiated at position -1. We access the location here before we read anything, so have to give a negative offset of -1 to get the position of the first character in the input. The same is done again in _emitCurrentToken.

I have played with flipping the sign of getCurrentLocation's offset, but then other places are hard to read.

wooorm · 2022-02-11T14:35:16Z

packages/parse5/lib/tokenizer/preprocessor.ts

@@ -17,7 +17,8 @@ const DEFAULT_BUFFER_WATERLINE = 1 << 16;
 export class Preprocessor {
    public html = '';
    private pos = -1;
-    private lastGapPos = -1;
+    // NOTE: Initial `lastGapPos` is -2, to ensure `col` on initialisation is 0
+    private lastGapPos = -2;


-2 seems weird, undefined or so perhaps? Or would that make the types too complex?

That would complicate things quite a bit. The issue here is the check Number(this.lastGapPos !== this.pos) below. -2 is weird, but does make things work as expected.

Infinity (or -Infinity) perhaps?

-Infinity seems even weirder to me tbh 😄

-2 still has a resemblance to the not found return value of indexOf, with -Infinity I am taken aback by the sudden floating point number for something that is supposed to be an index.

Merge `ctLoc` and `currentAttrLocation` as `currentLocation`

fb55 mentioned this pull request Feb 11, 2022

refactor(tokenizer): Re-use object for current location #397

Closed

wooorm reviewed Feb 11, 2022

View reviewed changes

fb55 added 3 commits February 11, 2022 14:40

Remove doctypeStartLoc property

48a04d0

Add comment about offset not running across line boundaries

cb22db8

Remove unnecessary location property

f6c21b4

Merge `ctLoc` and `currentAttrLocation` as `currentLocation`

fb55 merged commit 5c1f156 into inikulin:master Feb 24, 2022

fb55 deleted the refactor/loc-offsets branch February 26, 2022 21:11

fb55 added a commit to parse5/parse5-fork that referenced this pull request Mar 2, 2022

refactor(tokenizer): Use explicit offsets for locations (inikulin#402)

c61905e

jmbpwtw pushed a commit to jmbpwtw/parse5 that referenced this pull request Feb 16, 2023

refactor(tokenizer): Use explicit offsets for locations (inikulin#402)

68f90b5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(tokenizer): Use explicit offsets for locations #402

refactor(tokenizer): Use explicit offsets for locations #402

fb55 commented Feb 11, 2022

wooorm Feb 11, 2022

fb55 Feb 11, 2022

wooorm Feb 11, 2022

fb55 Feb 11, 2022

wooorm Feb 13, 2022

fb55 Feb 13, 2022 •

edited

wooorm Feb 11, 2022

fb55 Feb 12, 2022

wooorm Feb 11, 2022

fb55 Feb 11, 2022

wooorm Feb 13, 2022

fb55 Feb 13, 2022

refactor(tokenizer): Use explicit offsets for locations #402

refactor(tokenizer): Use explicit offsets for locations #402

Conversation

fb55 commented Feb 11, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fb55 Feb 13, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fb55 Feb 13, 2022 •

edited