Select correct sizes and types for stack/tree ArrayBuffers #139

ZauberNerd · 2020-12-31T11:18:11Z

We have several typed arrays in our code whose size depends on the input.

visited
In our generated lexer is an array visited which always has the size 1024 but it should potentially be able to contain the entire input string.

par-gen/parsers/json/generated/lexer-initial.js

Lines 1359 to 1381 in fe610c3

    
           const visited = new Uint16Array(1024); 
        
           // the currently matched lexeme 
        
           const lexeme = { 
        
             state: -1, 
        
             start: -1, 
        
             end: -1, 
        
           }; 
        
           const next = (input, offset) => { 
        
             // 78 
        
             let state = 19968; 
        
             visited[0] = 19968; 
        
             // try to find match 
        
             let i = offset; 
        
             let j = 0; 
        
             let l = input.length; 
        
             while (state !== 25344 && i < l) { 
        
               state = table[state + input[i]]; 
        
               i++; 
        
               j++; 
        
               visited[j] = state;

I see several options here:

we could allocate the array inside the next function which would be much more expensive
we could allocate the array inside the parse function and pass it to the next function (would be less expensive, but still has overhead)
we could allocate a much larger size (max length is Math.pow(2, 32) - 1 according to MDN) because it allocates only virtual memory until we fill it

stack
At the moment we generate either an Uint8Array or Uint16Array dependending on the number of states that we have.
The stack can grow larger than the number of states we have, for example deeply nested arrays / objects would stay on the stack until their matching closing braces are found. Additionally, since we store two stacks in that array, we have double the size.

par-gen/parsers/json/generated/parser.js

Line 3272 in fe610c3

const stack = new Uint8Array(512);

I see several options here:

we could potentially calculate the max size based on the input (for a regular stack, I'd say input bytes * 2 would suffice), but then we have allocations inside our parse function
we could allocate it statically to the max addressable space Math.pow(2, 32) - 1
we should probably also switch to a Uint32Array, because a Uint16Array can only contain up to 65535 (which means we could have a maximum of 32k stack items)

tree
The tree array has the same problem as the stack array above: It can get very large for long input strings, even larger than the stack, since we treat it as append only.

par-gen/parsers/json/generated/parser.js

Line 3288 in fe610c3

const tree = new Uint16Array(8192);

I see several ideas for improvements here:

we could allocate it statically to the max addressable space Math.pow(2, 32) - 1
we should probably also switch to a Uint32Array, because a Uint16Array can only contain up to 65535 (which means we could have a maximum of 64k nodes)
we should probably also clone / slice the parsed tree when we're in the done state, so that the next parse doesn't overwrite a previous parse (for example: const result1 = parse('...'); const result 2 = parse('...'); would overwrite result1)
or, instead of cloning/slicing we could continue writing in the array by not resetting the tree pointer inside the parse() function, but I suppose that will get problematic for long running processes (such as a watcher task) since we never free the memory of unused trees

The text was updated successfully, but these errors were encountered:

KnisterPeter · 2020-12-31T12:26:48Z

visited
[...]

I see several options here:

we could allocate the array inside the next function which would be much more expensive

we could allocate the array inside the parse function and pass it to the next function (would be less expensive, but still has overhead)

we could allocate a much larger size (max length is Math.pow(2, 32) - 1 according to MDN) because it allocates only virtual memory until we fill it

I think it's very unlikely to have a visited array which need to hold the entire input stream of characters.
The visited array holds a state per visited character and we stop processing at the first known error state from the current offset. That means the full stream is required to fit into the visited array only when the first lexeme in the input stream (offset 0) is a full match and the only lexeme that we have in our input stream.

In theory this is a valid option, but in practice this is more than unlikely.
I see a third option:

We should parameterize the lexer code generator with options to define the visited array length. Like some kind of pragma statements in the grammar. In case of grammar were this case is more likely, the grammar author can define the required side of the array.

stack
At the moment we generate either an Uint8Array or Uint16Array dependending on the number of states that we have.
The stack can grow larger than the number of states we have, for example deeply nested arrays / objects would stay on the stack until their matching closing braces are found. Additionally, since we store two stacks in that array, we have double the size.

par-gen/parsers/json/generated/parser.js

Line 3272 in fe610c3

const stack = new Uint8Array(512);

I see several options here:

we could potentially calculate the max size based on the input (for a regular stack, I'd say input bytes * 2 would suffice), but then we have allocations inside our parse function

we could allocate it statically to the max addressable space Math.pow(2, 32) - 1

we should probably also switch to a Uint32Array, because a Uint16Array can only contain up to 65535 (which means we could have a maximum of 32k stack items)

I cannot imagine a stack with more than 64k items on it. That sound very huge.
Same reasoning as above, I would say, lets put this into the hand of the grammar author. There is no one-size-fits-all solution here.
A grammar with more recursive structures will probably result in more stack items. Others might not need this.

tree
The tree array has the same problem as the stack array above: It can get very large for long input strings, even larger than the stack, since we treat it as append only.

par-gen/parsers/json/generated/parser.js

Line 3288 in fe610c3

const tree = new Uint16Array(8192);

I see several ideas for improvements here:

we could allocate it statically to the max addressable space Math.pow(2, 32) - 1

we should probably also switch to a Uint32Array, because a Uint16Array can only contain up to 65535 (which means we could have a maximum of 64k nodes)

This could indeed be true. The tree should be able to grow really huge. A Uint32Array would be a good solution here I think.

we should probably also clone / slice the parsed tree when we're in the done state, so that the next parse doesn't overwrite a previous parse (for example: const result1 = parse('...'); const result 2 = parse('...'); would overwrite result1)

We should add a clone operation to the top level.
We don't need this behavior by default for cases where only some input should be parsed, then processed and afterwards the next input is processes.
But in case of parallel processing our parser isn't safe to use, because we rely on module state. A new parse should (as it does right now) reset the pointers and overwrite everything, but one need to be able to 'save' the AST.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Select correct sizes and types for stack/tree ArrayBuffers #139

Select correct sizes and types for stack/tree ArrayBuffers #139

ZauberNerd commented Dec 31, 2020

KnisterPeter commented Dec 31, 2020

Select correct sizes and types for stack/tree ArrayBuffers #139

Select correct sizes and types for stack/tree ArrayBuffers #139

Comments

ZauberNerd commented Dec 31, 2020

KnisterPeter commented Dec 31, 2020