
Shorter match ignored, because longer potential match fails #315

Open · m0rphism opened this issue Jun 20, 2023 · 5 comments

m0rphism commented Jun 20, 2023

The following minimized example

use logos::Logos;

#[derive(Logos, Clone, Copy, Debug, PartialEq)]
pub enum Token {
    #[token("a")] A,
    #[token("b")] B,
    #[regex(r"[ab]*c")] ABC,
}

fn main() {
    for (tok, span) in Token::lexer("aba").spanned() {
        println!("{:?}\t {:?}", span, tok);
    }
}

yields the output

0..3  Err(())

while I'd expect the output

0..1  Ok(Token::A)
1..2  Ok(Token::B)
2..3  Ok(Token::A)

However, changing the ABC rule to

    #[regex(r"abc")] ABC,

yields the expected output (but of course also changes the lexed language).

Is this a bug or a known limitation of the implementation?

In case it's the latter, is there any way of lexing the first language (with [ab]*c)?

Apart from this limitation, I'm really enjoying the library so far! Seeing rustc segfault when I messed up my regexes was a bit surprising, but simply annotating the token definition is just as nice and concise as it gets :)

edit: using logos = "0.13.0"

m0rphism reopened this Jun 20, 2023

maciejhirsz (Owner) commented Jun 20, 2023

This might be fixable, but I'm afraid it will need to wait for the rewrite mentioned in last release notes. Also see #291.

m0rphism (Author) commented Jun 20, 2023

Alrighty, cool. Thanks for the quick reply! <3

I also saw #291, but wasn't sure if it's the same underlying issue.

Are you by any chance aware of a quick workaround to get this kind of scenario working with the current Logos implementation? If not, that's also fine; I just thought I'd quickly ask, in case you've seen this before :)

maciejhirsz (Owner) commented Jun 20, 2023

There isn't an easy one. You could try to do some manual parsing with callbacks for the offending tokens; see Lexer::remainder and Lexer::bump (and the main docs for callbacks and return types).
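
For illustration, here's a minimal sketch of that idea applied to the example from this issue, assuming logos 0.13. It matches a single [abc] character per token and does the [ab]*c lookahead by hand in a callback via Lexer::remainder and Lexer::bump; the Token::Chunk, Kind, and scan names are made up for this sketch.

use logos::{Lexer, Logos};

#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    A,
    B,
    Abc,
}

// Manual lookahead: if the rest of an `[ab]*` run is followed by `c`,
// extend the current token over it and classify the whole run as Abc.
fn scan(lex: &mut Lexer<Token>) -> Kind {
    if lex.slice() == "c" {
        // a bare `c` is already a full `[ab]*c` match (with zero `[ab]`s)
        return Kind::Abc;
    }
    let rest = lex.remainder();
    let run = rest.bytes().take_while(|&b| b == b'a' || b == b'b').count();
    if rest.as_bytes().get(run) == Some(&b'c') {
        lex.bump(run + 1); // consume the rest of the run plus the `c`
        Kind::Abc
    } else if lex.slice() == "a" {
        Kind::A
    } else {
        Kind::B
    }
}

#[derive(Logos, Debug, PartialEq)]
enum Token {
    // one pattern covers `a`, `b`, and `c`; the callback decides which it is
    #[regex("[abc]", scan)]
    Chunk(Kind),
}

fn main() {
    for (tok, span) in Token::lexer("aba").spanned() {
        println!("{:?}\t {:?}", span, tok);
    }
}

On "aba" this should print spans 0..1, 1..2, 2..3 with Chunk(A), Chunk(B), Chunk(A), and an input like "abac" should come out as a single 0..4 Chunk(Abc). The trade-off is that the lookahead is hand-written rather than generated, so this only really scales to a few troublesome patterns.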

zacryol commented Sep 11, 2023

I've also encountered this bug; it seems to occur primarily when a repetition starts matching but a required later symbol is not present.

use logos::Logos;

#[derive(Logos, Debug)]
enum Token {
    #[regex("ab+c")]
    //#[regex("abc")]
    Str,
    #[regex(".")]
    Other,
}

fn main() {
    dbg!(Token::lexer("ab").collect::<Result<Vec<_>, _>>());
}

With this code as written, the output is an error: once a single b starts matching the b+ segment, the lexer "commits" to that regex rule (although it shouldn't), and the missing trailing c then causes the whole slice to lex as an error.

If the alternative (commented-out) regex rule is used instead, the missing c after the initial ab is handled properly: the lexer falls back to the other branch and produces two Other tokens, as would be expected in either case.

tesujimath added a commit to tesujimath/beancount-parser-lima that referenced this issue Feb 5, 2024
The problem arises from maciejhirsz/logos#315

This work-around is a temporary solution until the upcoming Logos rewrite.

Fixes #4

tesujimath commented Feb 5, 2024

I found quite a clean work-around for this problem which may be useful to others.

My work-around is to define a subset of Token, which I call RecoveryToken, and when lexing fails, to call Logos::lexer again on the failed span, using the subset of tokens which excludes the troublesome one.

In my case, the problem is the overlap between the currency regex and other shorter ones. Simplified:

#[derive(Logos, Clone, Debug, PartialEq, Eq)]
#[logos(error = LexerError)] // lexing errors surface as LexerError
#[logos(subpattern currency = r"[A-Z][A-Z0-9'\._-]*|/[0-9'\._-]*[A-Z][A-Z0-9'\._-]*")]
#[logos(subpattern date = r"\d{4}[\-/]\d{2}[\-/]\d{2}")]
#[logos(subpattern number = r"\d+(,\d{3})*(\.\d+)?")]
pub enum Token<'a> {
    #[regex(r#"(?&currency)"#, |lex| super::Currency::try_from(lex.slice()) )]
    Currency(super::Currency<'a>),

    #[token("-")]
    Minus,

    #[token("/")]
    Slash,

    #[regex(r"(?&date)", |lex| parse_date(lex.slice()))]
    Date(Date),

    #[regex(r"(?&number)", |lex| parse_number(lex.slice()))]
    Number(Decimal),

    // errors are returned as an error token
    Error(LexerError),
}

// just the subset of `Token` which could occur within a currency
#[derive(Logos, Clone, Debug, PartialEq, Eq)]
#[logos(error = LexerError)] // same error type, so errors convert to Token::Error
#[logos(subpattern date = r"\d{4}[\-/]\d{2}[\-/]\d{2}")]
#[logos(subpattern number = r"\d+(,\d{3})*(\.\d+)?")]
enum RecoveryToken {
    #[token("-")]
    Minus,

    #[token("/")]
    Slash,

    #[regex(r"(?&date)", |lex| parse_date(lex.slice()))]
    Date(Date),

    #[regex(r"(?&number)", |lex| parse_number(lex.slice()))]
    Number(Decimal),
}

impl<'a> From<RecoveryToken> for Token<'a> {
    fn from(value: RecoveryToken) -> Self {
        use RecoveryToken::*;

        match value {
            Minus => Token::Minus,
            Slash => Token::Slash,
            Date(date) => Token::Date(date),
            Number(decimal) => Token::Number(decimal),
        }
    }
}

fn attempt_recovery<'a>(
    failed_span: Range<usize>,
    s: &'a str,
) -> impl Iterator<Item = (Token<'a>, Range<usize>)> + 'a {
    let failed_token = &s[failed_span.start..failed_span.end];

    RecoveryToken::lexer(failed_token)
        .spanned()
        .map(move |(lexeme, rel_span)| {
            let span = failed_span.start + rel_span.start..failed_span.start + rel_span.end;
            match lexeme {
                Ok(tok) => (tok.into(), span),
                Err(e) => (Token::Error(e), span),
            }
        })
}

and where we use this:

// lex that input:
    Token::lexer(s)
        .spanned()
        .flat_map(|(lexeme, span)| match lexeme {
            Ok(tok) => smallvec![(tok, span)],
            Err(_e) => attempt_recovery(span.clone(), s)
                .collect::<SmallVec<_, 1>>(),
        })
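
For readers who want the complete shape, a minimal self-contained sketch of that driver (with a plain Vec standing in for the SmallVec above, and a made-up lex function name; Token and attempt_recovery are the definitions from this comment) could be:

use std::ops::Range;

use logos::Logos;

// sketch of the lexing entry point; errors from the first pass are
// re-lexed over their span with the RecoveryToken subset
fn lex<'a>(s: &'a str) -> Vec<(Token<'a>, Range<usize>)> {
    Token::lexer(s)
        .spanned()
        .flat_map(|(lexeme, span)| match lexeme {
            // the happy path: pass the token through unchanged
            Ok(tok) => vec![(tok, span)],
            // lexing failed: re-lex the failed span with RecoveryToken
            Err(_e) => attempt_recovery(span, s).collect::<Vec<_>>(),
        })
        .collect()
}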

Complete code is here
