Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Select correct sizes and types for stack/tree ArrayBuffers #139

Open
ZauberNerd opened this issue Dec 31, 2020 · 1 comment
Open

Select correct sizes and types for stack/tree ArrayBuffers #139

ZauberNerd opened this issue Dec 31, 2020 · 1 comment

Comments

@ZauberNerd
Copy link
Member

We have several typed arrays in our code whose size depends on the input.

visited
In our generated lexer is an array visited which always has the size 1024 but it should potentially be able to contain the entire input string.

const visited = new Uint16Array(1024);
// the currently matched lexeme
const lexeme = {
state: -1,
start: -1,
end: -1,
};
const next = (input, offset) => {
// 78
let state = 19968;
visited[0] = 19968;
// try to find match
let i = offset;
let j = 0;
let l = input.length;
while (state !== 25344 && i < l) {
state = table[state + input[i]];
i++;
j++;
visited[j] = state;

I see several options here:

  • we could allocate the array inside the next function which would be much more expensive
  • we could allocate the array inside the parse function and pass it to the next function (would be less expensive, but still has overhead)
  • we could allocate a much larger size (max length is Math.pow(2, 32) - 1 according to MDN) because it allocates only virtual memory until we fill it

stack
At the moment we generate either an Uint8Array or Uint16Array dependending on the number of states that we have.
The stack can grow larger than the number of states we have, for example deeply nested arrays / objects would stay on the stack until their matching closing braces are found. Additionally, since we store two stacks in that array, we have double the size.

const stack = new Uint8Array(512);

I see several options here:

  • we could potentially calculate the max size based on the input (for a regular stack, I'd say input bytes * 2 would suffice), but then we have allocations inside our parse function
  • we could allocate it statically to the max addressable space Math.pow(2, 32) - 1
  • we should probably also switch to a Uint32Array, because a Uint16Array can only contain up to 65535 (which means we could have a maximum of 32k stack items)

tree
The tree array has the same problem as the stack array above: It can get very large for long input strings, even larger than the stack, since we treat it as append only.

const tree = new Uint16Array(8192);

I see several ideas for improvements here:

  • we could allocate it statically to the max addressable space Math.pow(2, 32) - 1
  • we should probably also switch to a Uint32Array, because a Uint16Array can only contain up to 65535 (which means we could have a maximum of 64k nodes)
  • we should probably also clone / slice the parsed tree when we're in the done state, so that the next parse doesn't overwrite a previous parse (for example: const result1 = parse('...'); const result 2 = parse('...'); would overwrite result1)
  • or, instead of cloning/slicing we could continue writing in the array by not resetting the tree pointer inside the parse() function, but I suppose that will get problematic for long running processes (such as a watcher task) since we never free the memory of unused trees
@KnisterPeter
Copy link
Member

visited
[...]

I see several options here:

  • we could allocate the array inside the next function which would be much more expensive
  • we could allocate the array inside the parse function and pass it to the next function (would be less expensive, but still has overhead)
  • we could allocate a much larger size (max length is Math.pow(2, 32) - 1 according to MDN) because it allocates only virtual memory until we fill it

I think it's very unlikely to have a visited array which need to hold the entire input stream of characters.
The visited array holds a state per visited character and we stop processing at the first known error state from the current offset. That means the full stream is required to fit into the visited array only when the first lexeme in the input stream (offset 0) is a full match and the only lexeme that we have in our input stream.

In theory this is a valid option, but in practice this is more than unlikely.
I see a third option:

  • We should parameterize the lexer code generator with options to define the visited array length. Like some kind of pragma statements in the grammar. In case of grammar were this case is more likely, the grammar author can define the required side of the array.

stack
At the moment we generate either an Uint8Array or Uint16Array dependending on the number of states that we have.
The stack can grow larger than the number of states we have, for example deeply nested arrays / objects would stay on the stack until their matching closing braces are found. Additionally, since we store two stacks in that array, we have double the size.

const stack = new Uint8Array(512);

I see several options here:

  • we could potentially calculate the max size based on the input (for a regular stack, I'd say input bytes * 2 would suffice), but then we have allocations inside our parse function
  • we could allocate it statically to the max addressable space Math.pow(2, 32) - 1
  • we should probably also switch to a Uint32Array, because a Uint16Array can only contain up to 65535 (which means we could have a maximum of 32k stack items)

I cannot imagine a stack with more than 64k items on it. That sound very huge.
Same reasoning as above, I would say, lets put this into the hand of the grammar author. There is no one-size-fits-all solution here.
A grammar with more recursive structures will probably result in more stack items. Others might not need this.

tree
The tree array has the same problem as the stack array above: It can get very large for long input strings, even larger than the stack, since we treat it as append only.

const tree = new Uint16Array(8192);

I see several ideas for improvements here:

  • we could allocate it statically to the max addressable space Math.pow(2, 32) - 1
  • we should probably also switch to a Uint32Array, because a Uint16Array can only contain up to 65535 (which means we could have a maximum of 64k nodes)

This could indeed be true. The tree should be able to grow really huge. A Uint32Array would be a good solution here I think.

  • we should probably also clone / slice the parsed tree when we're in the done state, so that the next parse doesn't overwrite a previous parse (for example: const result1 = parse('...'); const result 2 = parse('...'); would overwrite result1)

We should add a clone operation to the top level.
We don't need this behavior by default for cases where only some input should be parsed, then processed and afterwards the next input is processes.
But in case of parallel processing our parser isn't safe to use, because we rely on module state. A new parse should (as it does right now) reset the pointers and overwrite everything, but one need to be able to 'save' the AST.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants