
Add Lexer#stream() #36

Open · wants to merge 4 commits into main
Conversation

@nathan (Collaborator) commented Mar 17, 2017

The reason we shouldn't just make Lexer a transform stream is so you can reuse a single lexer on multiple input streams.

However: with this implementation, horrible things will happen if you call stream() again before the first one is closed; specifically, the two streams will be interleaved based on which input stream is readable first. If you don't use states, you might not notice this unless you look at the offset/line/column information.

Should stream() automatically clone the lexer to avoid this behavior? Should we make Lexer a transform stream so you can just call clone() to get a new, independent stream?

stream() automatically calls reset() because switching to streaming halfway through your input doesn't sound like a good idea.

Closes #20.
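For reference, a rough usage sketch (the rule names here are placeholders, and I'm assuming stream() returns an object-mode transform stream of token objects):

const moo = require('moo')
const fs = require('fs')

const lexer = moo.compile({
  word: /[a-z]+/,
  ws:   {match: /\s+/, lineBreaks: true},
})

// stream() resets the lexer and returns a stream: write text in, read tokens out.
fs.createReadStream('input.txt', {encoding: 'utf8'})
  .pipe(lexer.stream())
  .on('data', tok => console.log(tok.type, tok.value))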

moo.js Outdated
@@ -273,6 +273,23 @@
this.reset()
}

if (typeof require !== 'undefined') {
tjvr (Collaborator)

Isn't typeof module === 'object' && module.exports the Cool Way to do this?

nathan (Collaborator, Author)

I suppose that depends on what you're checking. But in this case, probably.

@tjvr (Collaborator) commented Mar 17, 2017

switching to streaming halfway through your input doesn't sound like a good idea.

Agreed!


This doesn't do anything about token boundaries, does it? I recall you raised some concerns about a stream API from #18:

the API is too generic to allow a useful stream wrapper.

it's incompatible with a stream API that doesn't re-lex the stream when it gets new data.

If you leave the task of pushing data only at token boundaries up to the user, then it's easy to add a stream interface (that's what feed() is, essentially). Packages like stream-split are good for, e.g., pushing the data line-by-line.
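For example, something like this keeps every chunk on a token boundary by feeding whole lines (a sketch, assuming a lexer whose tokens never span a newline):

const readline = require('readline')
const fs = require('fs')

const rl = readline.createInterface({input: fs.createReadStream('input.txt')})
rl.on('line', line => {
  lexer.feed(line + '\n')   // each chunk ends exactly at a token boundary
  let tok
  while ((tok = lexer.next())) {
    console.log(tok.type, tok.value)
  }
})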

It's also possible, I guess, to re-lex the last token when we get new data, but you can still break this if you have two tokens which, when concatenated, form a different token; and this makes the API weird because you need to sometimes return a token twice.

@nathan (Collaborator, Author) commented Mar 17, 2017

This doesn't do anything about token boundaries, does it?

Correct. It is up to the user to push data at token boundaries.

@tjvr (Collaborator) commented Mar 17, 2017

As for

Should stream() automatically clone the lexer to avoid this behavior?

Probably, yes. We haven't needed to yet, but it probably would make sense for Lexers to keep track of lastIndex themselves; then clone() is cheap, since we won't have to re-compile the RegExp.
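A hypothetical sketch of that shape (invented names, not moo's actual internals): each Lexer restores its own index into the shared RegExp before matching, so clone() never has to re-compile anything.

function compile(rules) {
  // build one sticky RegExp with a capture group per rule
  // (rules are assumed to be plain regexes without capture groups of their own)
  const types = Object.keys(rules)
  const re = new RegExp(types.map(t => '(' + rules[t].source + ')').join('|'), 'y')
  return new Lexer({re, types})
}

class Lexer {
  constructor(compiled) {
    this.re = compiled.re         // shared between clones
    this.types = compiled.types
    this.reset()
  }
  reset(data) {
    this.buffer = data || ''
    this.index = 0                // position lives on the Lexer, not on the RegExp
    return this
  }
  next() {
    if (this.index === this.buffer.length) return undefined
    this.re.lastIndex = this.index            // restore this lexer's position
    const m = this.re.exec(this.buffer)
    if (!m) throw new Error('invalid syntax at offset ' + this.index)
    this.index = this.re.lastIndex
    const group = m.slice(1).findIndex(g => g !== undefined)
    return {type: this.types[group], value: m[0], offset: m.index}
  }
  clone() {
    // cheap: reuse the already-compiled RegExp and rule names
    return new Lexer({re: this.re, types: this.types})
  }
}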

@nathan (Collaborator, Author) commented Mar 18, 2017

Probably, yes.

Done.

@tjvr (Collaborator) commented Mar 18, 2017

Are streams compelling? Would you use them? :-)

@nathan (Collaborator, Author) commented Mar 18, 2017

It is quite useful to be able to do this:

fs.createReadStream(wherever)
  .pipe(split(/(\n)/))
  .pipe(lexer.clone())
  .pipe(new Parser())
  .on('data', console.log)

(modulo disgusting .on('error', …)s everywhere because the stream API is broken)

But it's somewhat more difficult—or perhaps just not conventional—to write Parser like this than it is to write a recursive descent parser.

EDIT: nearley might make this easier since it retains parse state between feed()s.

Here's an example that parses a sequence of s-expression-like things such as (foo (bar baz (qux))), and emits each one as it's parsed.

const {Transform} = require('stream')

class Parser extends Transform {
  constructor() {
    super({objectMode: true})
    this.stack = []     // enclosing, not-yet-closed lists
    this.result = null  // the list currently being built
  }
  process(tok) {
    switch (tok.type) {
      case 'lpar': {
        const inner = []
        if (this.result) {
          this.result.push(inner)
          this.stack.push(this.result)
        }
        this.result = inner
        break
      }
      case 'rpar':
        if (!this.stack.length) {
          if (!this.result) {
            throw new Error("I'm not gonna match your parentheses for you")
          }
          // a top-level expression is complete: emit it downstream
          this.push(this.result)
        }
        this.result = this.stack.pop()
        break
      case 'word':
        if (!this.result) {
          throw new Error("They go inside the parentheses")
        }
        this.result.push(tok.value)
        break
      case 'space': break
    }
  }
  flush() {
    if (this.result) {
      throw new Error("Aren't you forgetting something?")
    }
  }
  _transform(chunk, _, cb) {
    try {
      this.process(chunk)
      cb()
    } catch (e) { cb(e) }
  }
  _flush(cb) {
    try {
      this.flush()
      cb()
    } catch (e) { cb(e) }
  }
}
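A hypothetical way to wire the Parser up end-to-end with the stream() added in this PR (the rule names are chosen to match the switch cases above):

const moo = require('moo')

const lexer = moo.compile({
  lpar:  '(',
  rpar:  ')',
  word:  /[a-z]+/,
  space: / +/,
})

const tokens = lexer.stream()
tokens
  .pipe(new Parser())
  .on('data', sexp => console.log(sexp))  // [ 'foo', [ 'bar', 'baz', [ 'qux' ] ] ]

tokens.end('(foo (bar baz (qux)))')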

@nathan (Collaborator, Author) commented Mar 18, 2017

@tjvr Another option for the stream API is to buffer input until we get a regex match that doesn't extend to the end of the buffer—with an optional maximum buffer size like 1MB—and to have a method that signals no more input, or perhaps better a flag to feed() like TextDecoder#decode:

lexer.feed('some inp', {stream: true})
lexer.next() // --> word some
lexer.next() // --> ws
lexer.next() // --> undefined
lexer.feed('ut some more input', {stream: true})
lexer.next() // --> word input
// ...
lexer.next() // --> word more
lexer.next() // --> ws
lexer.next() // --> undefined
lexer.feed(' the last of the input')
// ...
lexer.next() // --> word input

Using streams that look like this:

const {Readable} = require('stream')

const rs = new Readable({read() {}})
rs.push('some inp')
rs.push('ut more input')
rs.push(' the last of the input')
rs.push(null)  // signal the end of the input

rs.pipe(lexer.clone())
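A hypothetical sketch of the buffering rule itself (invented names, not moo internals): a match is only released once it can no longer grow, i.e. it ends before the end of the buffer, or we know there is no more input.

// state is {buffer, index, rules}, where each rule's regex is anchored with ^
// so it can only match at the start of the remaining input
function nextToken(state, noMoreInput) {
  const rest = state.buffer.slice(state.index)
  if (!rest) return undefined
  for (const {type, regex} of state.rules) {
    const m = regex.exec(rest)
    if (!m) continue
    if (m[0].length === rest.length && !noMoreInput) {
      // the match runs to the end of the buffer, so more input could still
      // extend it: hold it back until the next feed() or the final chunk
      return undefined
    }
    state.index += m[0].length
    return {type, value: m[0]}
  }
  throw new Error('no token matches at offset ' + state.index)
}

// feed(chunk, {stream: true}) would just append (state.buffer += chunk) and let
// next() drain tokens until nextToken() returns undefined; feed(chunk) without
// the flag would do the same and then treat the input as finished.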

@tjvr (Collaborator) commented Mar 20, 2017

buffer input until we get a regex match that doesn't extend to the end of the buffer

I thought you said that still isn't correct? #18 (comment)

@nathan (Collaborator, Author) commented Mar 20, 2017

I thought you said that still isn't correct?

It's not correct, though it would give the correct result in the example I gave. Where it wouldn't give the correct result is if you had regexes like /[a-z]+/ and /\d+(?:[Ee]\d+)?/ and fed the input as 103e followed by 23, because the latter regex would match 103, which stops before the end of the buffer.

The rule here is that for every valid token, all non-empty prefixes of that token must also parse as a single token. As long as that's true, the buffering method works fine. You can always let the number regex match 103e and do the validation at parse time.
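Worked through concretely (the rule set is hypothetical):

const rules = {
  word:   /[a-z]+/,
  number: /\d+(?:[Ee]\d+)?/,
}
// chunk '103e' arrives: `number` matches '103', which stops before the end of
// the buffer, so the buffering method emits number 103 right away
// chunk '23' arrives: the remaining buffer is 'e23', which lexes as word 'e'
// and number '23' -- but the whole input '103e23' should have been a single
// number token.
// If `number` instead also matched the prefix '103e' (validating the exponent
// at parse time), every prefix of a number would itself be a token, and the
// buffering method would correctly hold '103e' back until more input arrived.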

@tjvr (Collaborator) commented Mar 21, 2017

for every valid token, all non-empty prefixes of that token must also parse as a single token

Okay. That seems like a good way to do this. :-)

@tjvr (Collaborator) commented Mar 29, 2017

The rule here is that for every valid token, all non-empty prefixes of that token must also parse as a single token. As long as that's true, the buffering method works fine.

Actually, this is less reasonable than I first thought :(

Doesn't that mean that given the language foo | foobar, we wouldn't correctly lex foob, ar?

tjvr mentioned this pull request Mar 29, 2017
@nathan (Collaborator, Author) commented Apr 5, 2017

Doesn't that mean that given the language foo | foobar, we wouldn't correctly lex foob, ar?

Correct: foo would be emitted as soon as the foob chunk arrived, since that match stops before the end of the buffer, and the leftover b (then bar) can never be lexed.
