
Shorter match ignored, because longer potential match fails #315

Open · m0rphism opened this issue Jun 20, 2023 · 5 comments

m0rphism commented Jun 20, 2023

The following minimized example

use logos::Logos;

#[derive(Logos, Clone, Copy, Debug, PartialEq)]
pub enum Token {
    #[token("a")] A,
    #[token("b")] B,
    #[regex(r"[ab]*c")] ABC,
}

fn main() {
    for (tok, span) in Token::lexer("aba").spanned() {
        println!("{:?}\t {:?}", span, tok);
    }
}

yields the output

0..3  Err(())

while I'd expect the output

0..1  Ok(Token::A)
1..2  Ok(Token::B)
2..3  Ok(Token::A)

However, changing the ABC rule to

    #[regex(r"abc")] ABC,

yields the expected output (but of course also changes the lexed language).

Is this a bug or a known limitation of the implementation?

In case it's the latter, is there any way of lexing the first language (with [ab]*c)?

Apart from this limitation, I'm really enjoying the library so far! Seeing rustc segfault when I messed up my regexes was a bit surprising, but simply annotating the token definition is just as nice and concise as it gets :)

edit: using logos = "0.13.0"

m0rphism reopened this Jun 20, 2023

maciejhirsz (Owner) commented Jun 20, 2023

This might be fixable, but I'm afraid it will need to wait for the rewrite mentioned in last release notes. Also see #291.

m0rphism (Author) commented Jun 20, 2023

Alrighty, cool. Thanks for the quick reply! <3

I also saw #291, but wasn't sure if it's the same underlying issue.

Are you by any chance aware of a quick workaround to get this kind of scenario working with the current Logos implementation? If not, that's also fine; I just thought I'd quickly ask, in case you've seen this before :)

maciejhirsz (Owner) commented Jun 20, 2023

There isn't an easy one. You could try to do some manual parsing with callbacks for the offending tokens; see Lexer::remainder and Lexer::bump (and the main docs for callbacks and return types).
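
For illustration, here's a minimal sketch of that idea applied to the example from this issue, assuming logos 0.13. It matches a single [abc] character per token and does the [ab]*c lookahead by hand in a callback via Lexer::remainder and Lexer::bump; the Token::Chunk, Kind, and scan names are made up for this sketch.

use logos::{Lexer, Logos};

#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    A,
    B,
    Abc,
}

// Manual lookahead: if the rest of an `[ab]*` run is followed by `c`,
// extend the current token over it and classify the whole run as Abc.
fn scan(lex: &mut Lexer<Token>) -> Kind {
    if lex.slice() == "c" {
        // a bare `c` is already a full `[ab]*c` match (with zero `[ab]`s)
        return Kind::Abc;
    }
    let rest = lex.remainder();
    let run = rest.bytes().take_while(|&b| b == b'a' || b == b'b').count();
    if rest.as_bytes().get(run) == Some(&b'c') {
        lex.bump(run + 1); // consume the rest of the run plus the `c`
        Kind::Abc
    } else if lex.slice() == "a" {
        Kind::A
    } else {
        Kind::B
    }
}

#[derive(Logos, Debug, PartialEq)]
enum Token {
    // one pattern covers `a`, `b`, and `c`; the callback decides which it is
    #[regex("[abc]", scan)]
    Chunk(Kind),
}

fn main() {
    for (tok, span) in Token::lexer("aba").spanned() {
        println!("{:?}\t {:?}", span, tok);
    }
}

On "aba" this should print spans 0..1, 1..2, 2..3 with Chunk(A), Chunk(B), Chunk(A), and an input like "abac" should come out as a single 0..4 Chunk(Abc). The trade-off is that the lookahead is hand-written rather than generated, so this only really scales to a few troublesome patterns.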

zacryol commented Sep 11, 2023

I've also encountered this bug; it seems to occur primarily when a repetition starts matching but a required later symbol is not present.

use logos::Logos;

#[derive(Logos, Debug)]
enum Token {
    #[regex("ab+c")]
    //#[regex("abc")]
    Str,
    #[regex(".")]
    Other,
}

fn main() {
    dbg!(Token::lexer("ab").collect::<Result<Vec<_>, _>>());
}

With this code as written, the output is an error: once a single b starts matching the b+ segment, the lexer "commits" to that regex rule (although it shouldn't), and the missing trailing c then causes the whole slice to lex as an error.

If the alternative (commented-out) regex rule is used instead, the missing c after the initial ab is handled properly: the lexer falls back to the other branch and produces two Other tokens, as would be expected in either case.

tesujimath added a commit to tesujimath/beancount-parser-lima that referenced this issue Feb 5, 2024
The problem arises from maciejhirsz/logos#315

This work-around is a temporary solution until the upcoming Logos rewrite.

Fixes #4

tesujimath commented Feb 5, 2024

I found quite a clean work-around for this problem which may be useful to others.

My work-around is to define a subset of Token, which I call RecoveryToken, and when lexing fails, to call Logos::lexer again on the failed span, using the subset of tokens which excludes the troublesome one.

In my case, the problem is the overlap between the currency regex and other shorter ones. Simplified:

#[derive(Logos, Clone, Debug, PartialEq, Eq)]
#[logos(error = LexerError)] // lexing errors surface as LexerError
#[logos(subpattern currency = r"[A-Z][A-Z0-9'\._-]*|/[0-9'\._-]*[A-Z][A-Z0-9'\._-]*")]
#[logos(subpattern date = r"\d{4}[\-/]\d{2}[\-/]\d{2}")]
#[logos(subpattern number = r"\d+(,\d{3})*(\.\d+)?")]
pub enum Token<'a> {
    #[regex(r#"(?&currency)"#, |lex| super::Currency::try_from(lex.slice()) )]
    Currency(super::Currency<'a>),

    #[token("-")]
    Minus,

    #[token("/")]
    Slash,

    #[regex(r"(?&date)", |lex| parse_date(lex.slice()))]
    Date(Date),

    #[regex(r"(?&number)", |lex| parse_number(lex.slice()))]
    Number(Decimal),

    // errors are returned as an error token
    Error(LexerError),
}

// just the subset of `Token` which could occur within a currency
#[derive(Logos, Clone, Debug, PartialEq, Eq)]
#[logos(error = LexerError)] // same error type, so errors convert to Token::Error
#[logos(subpattern date = r"\d{4}[\-/]\d{2}[\-/]\d{2}")]
#[logos(subpattern number = r"\d+(,\d{3})*(\.\d+)?")]
enum RecoveryToken {
    #[token("-")]
    Minus,

    #[token("/")]
    Slash,

    #[regex(r"(?&date)", |lex| parse_date(lex.slice()))]
    Date(Date),

    #[regex(r"(?&number)", |lex| parse_number(lex.slice()))]
    Number(Decimal),
}

impl<'a> From<RecoveryToken> for Token<'a> {
    fn from(value: RecoveryToken) -> Self {
        use RecoveryToken::*;

        match value {
            Minus => Token::Minus,
            Slash => Token::Slash,
            Date(date) => Token::Date(date),
            Number(decimal) => Token::Number(decimal),
        }
    }
}

fn attempt_recovery<'a>(
    failed_span: Range<usize>,
    s: &'a str,
) -> impl Iterator<Item = (Token<'a>, Range<usize>)> + 'a {
    let failed_token = &s[failed_span.start..failed_span.end];

    RecoveryToken::lexer(failed_token)
        .spanned()
        .map(move |(lexeme, rel_span)| {
            let span = failed_span.start + rel_span.start..failed_span.start + rel_span.end;
            match lexeme {
                Ok(tok) => (tok.into(), span),
                Err(e) => (Token::Error(e), span),
            }
        })
}

and where we use this:

// lex that input:
    Token::lexer(s)
        .spanned()
        .flat_map(|(lexeme, span)| match lexeme {
            Ok(tok) => smallvec![(tok, span)],
            Err(_e) => attempt_recovery(span.clone(), s)
                .collect::<SmallVec<_, 1>>(),
        })
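
For readers who want the complete shape, a minimal self-contained sketch of that driver (with a plain Vec standing in for the SmallVec above, and a made-up lex function name; Token and attempt_recovery are the definitions from this comment) could be:

use std::ops::Range;

use logos::Logos;

// sketch of the lexing entry point; errors from the first pass are
// re-lexed over their span with the RecoveryToken subset
fn lex<'a>(s: &'a str) -> Vec<(Token<'a>, Range<usize>)> {
    Token::lexer(s)
        .spanned()
        .flat_map(|(lexeme, span)| match lexeme {
            // the happy path: pass the token through unchanged
            Ok(tok) => vec![(tok, span)],
            // lexing failed: re-lex the failed span with RecoveryToken
            Err(_e) => attempt_recovery(span, s).collect::<Vec<_>>(),
        })
        .collect()
}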

Complete code is here
