Surprising lexer behaviour when input is shorter than the shortest token #349

Open
lierdakil opened this issue Nov 8, 2023 · 4 comments

@lierdakil

Apparently, lexer behaviour differs when there are callbacks and when there aren't any. Specifically, when there aren't any callbacks, and there's an error at the start of the input, the lexer returns None, instead of the expected Some(Err(())).

Reproducer:

use logos::Logos; // brings the derive macro and the `lexer` constructor into scope

#[derive(Logos, Debug, PartialEq, Eq)]
pub enum Foo {
    #[token("TEST")]
    Test,
}

#[test]
fn test() {
    assert_eq!(Foo::lexer("Bar").next(), Some(Err(()))); // this fails
}

This succeeds:

#[derive(Logos, Debug, PartialEq, Eq)]
pub enum Foo {
    #[token("TEST")]
    Test,
    #[token("FOO", |_| ())]
    Bar(()),
}

#[test]
fn test() {
    assert_eq!(Foo::lexer("Bar").next(), Some(Err(()))); // this succeeds
}

This only happens at the start of the input, e.g. this works as expected:

#[derive(Logos, Debug, PartialEq, Eq)]
pub enum Foo {
    #[token("TEST")]
    Test,
}

#[test]
fn test() {
    let mut lex = Foo::lexer("TEST Bar");
    assert_eq!(lex.next(), Some(Ok(Foo::Test)));
    assert_eq!(lex.next(), Some(Err(())));
}
@lierdakil
Author

Correction: this doesn't only happen at the start of the input. Apparently the space affected the results. This still fails:

#[derive(Logos, Debug, PartialEq, Eq)]
pub enum Foo {
    #[token("TEST")]
    Test,
}

#[test]
fn test() {
    let mut lex = Foo::lexer("TESTSET");
    assert_eq!(lex.next(), Some(Ok(Foo::Test)));
    assert_eq!(lex.next(), Some(Err(()))); // fails here
}

Adding a callback makes the test pass.

@jeertmans
Collaborator

Hello, that's strange!

But indeed, you have to be aware that the space is important and will definitely cause an error if not handled :-)

Could you post the error message here?

Also, does adding a callback to Foo::Test work? Here you are adding a variant and a callback, so I would try to understand which of those two modifications is actually making it "pass".

@lierdakil
Author

> the space is important

That's very surprising, too, actually. Consider that there are no skip annotations here. What's the difference between TESTSET and TEST Bar in the absence of skip annotations, when the only defined token is TEST? There shouldn't be any. Both SET and Bar (with the space at the start) should be treated as an error. Yet the former returns None on the second next(), and the latter returns Some(Err(())). I wouldn't expect that.

> Could you post the error message here?

Everything compiles fine, so it's unclear which error message you expect me to post. It's the runtime behaviour that is surprising.
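Since there is no compiler error to show, one way to make the runtime behaviour visible is to check whether the spans the lexer reports actually cover the whole input. The sketch below uses only the standard library; `uncovered` is a hypothetical helper, not part of logos, and it assumes the spans were collected in order from the lexer (for example by calling `span()` after each `next()`).

```rust
use std::ops::Range;

/// Hypothetical helper: returns the byte ranges of the input that no
/// reported span covers. Assumes `spans` is sorted by start offset.
fn uncovered(spans: &[Range<usize>], input_len: usize) -> Vec<Range<usize>> {
    let mut gaps = Vec::new();
    let mut pos = 0; // first byte not yet covered by a span
    for span in spans {
        if span.start > pos {
            gaps.push(pos..span.start);
        }
        pos = pos.max(span.end);
    }
    if pos < input_len {
        // The lexer stopped before the end of the input.
        gaps.push(pos..input_len);
    }
    gaps
}
```

If only the span 0..4 (TEST) is reported for the 7-byte input TESTSET, `uncovered(&[0..4], 7)` yields `[4..7]`, which is the part that was dropped without an error. Input skipped via skip annotations would show up as gaps as well, but none are used here.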

> Also, does adding a callback to Foo::Test work?

I did some more testing, and apparently callbacks are a red herring entirely. Apologies for the misleading report.

So this works as expected, returning an error on ZAP:

#[derive(Logos, Debug, PartialEq, Eq)]
pub enum Foo {
    #[token("FOOBAR")]
    Foo,
    #[token("Q")]
    Bar,
}

#[test]
fn test() {
    let mut lex = Foo::lexer("FOOBARZAP");
    assert_eq!(lex.next(), Some(Ok(Foo::Foo)));
    assert_eq!(lex.next(), Some(Err(())));
}

This, however, does not:

#[derive(Logos, Debug, PartialEq, Eq)]
pub enum Foo {
    #[token("FOOBAR")]
    Foo,
    #[token("QFOOBAR")]
    Bar,
}

#[test]
fn test() {
    let mut lex = Foo::lexer("FOOBARZAP");
    assert_eq!(lex.next(), Some(Ok(Foo::Foo)));
    assert_eq!(lex.next(), Some(Err(()))); // returns None
}

@lierdakil
Author

There also seems to be some variation with a single token, depending on the token length.

This works as expected:

#[derive(Logos, Debug, PartialEq, Eq)]
pub enum Foo {
    #[token("FOO")]
    Foo,
}

#[test]
fn test() {
    let mut lex = Foo::lexer("ZAP");
    assert_eq!(lex.next(), Some(Err(())));
}

This doesn't:

#[derive(Logos, Debug, PartialEq, Eq)]
pub enum Foo {
    #[token("FOOB")]
    Foo,
}

#[test]
fn test() {
    let mut lex = Foo::lexer("ZAP");
    assert_eq!(lex.next(), Some(Err(()))); // fails: returns None
}

So... to me, it seems the lexer returns None if the remaining input is shorter than the shortest token. That's really surprising, I would expect the lexer to consume all input.
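To check that this hypothesis is consistent with every case above, here is a toy model in plain Rust. It is not the logos implementation, and nothing here is taken from its source: `ToyLexer`, its fields, and the `early_eof` flag are made up for illustration. With `early_eof` set, the model treats leftover input shorter than the shortest token as end of input. Without it, the model reports an error instead, which is the behaviour I expected. Reporting one error per character is an arbitrary choice of the sketch.

```rust
/// Toy model of a lexer over literal tokens. NOT how logos works
/// internally; it only encodes the hypothesis from this thread.
struct ToyLexer<'a> {
    tokens: &'a [&'a str],
    rest: &'a str,
    /// Hypothetical switch: when true, leftover input shorter than the
    /// shortest token is treated as end of input (the observed behaviour).
    early_eof: bool,
}

impl<'a> ToyLexer<'a> {
    fn new(tokens: &'a [&'a str], input: &'a str, early_eof: bool) -> Self {
        ToyLexer { tokens, rest: input, early_eof }
    }
}

impl<'a> Iterator for ToyLexer<'a> {
    type Item = Result<&'a str, ()>;

    fn next(&mut self) -> Option<Self::Item> {
        let rest = self.rest;
        let tokens = self.tokens;
        if rest.is_empty() {
            return None;
        }
        let shortest = tokens.iter().map(|t| t.len()).min().unwrap_or(0);
        if self.early_eof && rest.len() < shortest {
            // Assumed shortcut: no token can fit, so stop silently.
            self.rest = "";
            return None;
        }
        // Pick the longest token that matches at the current position.
        let matched = tokens
            .iter()
            .copied()
            .filter(|t| rest.starts_with(*t))
            .max_by_key(|t| t.len());
        match matched {
            Some(tok) => {
                self.rest = &rest[tok.len()..];
                Some(Ok(tok))
            }
            None => {
                // Skip one character and report an error for it.
                let step = rest.chars().next().map_or(0, char::len_utf8);
                self.rest = &rest[step..];
                Some(Err(()))
            }
        }
    }
}
```

With `early_eof` enabled, `ToyLexer::new(&["FOOB"], "ZAP", true).next()` is None while `ToyLexer::new(&["FOO"], "ZAP", true).next()` is Some(Err(())), matching the two single-token cases. The same holds for TESTSET versus TEST Bar and for Q versus QFOOBAR as the second token.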

@lierdakil changed the title from "Surprising lexer behaviour when there are no callbacks" to "Surprising lexer behaviour when input is shorter than the shortest token" on Nov 9, 2023