Shorter match ignored, because longer potential match fails #315
Comments
This might be fixable, but I'm afraid it will need to wait for the rewrite mentioned in the last release notes. Also see #291.
Alrighty, cool. Thanks for the quick reply! <3 I also saw #291, but wasn't sure if it's the same underlying issue. Are you by any chance aware of a quick workaround to get this kind of scenario working with the current Logos implementation? If not, that's also fine, I just thought I'd quickly ask in case you've seen this before :)
There isn't an easy one. You could try to do some manual parsing with callbacks for offending tokens, see
I've also encountered this bug, and it seems to primarily occur when a repetition starts matching, but a required later symbol is not present.

use logos::Logos;

#[derive(Logos, Debug)]
enum Token {
    #[regex("ab+c")]
    //#[regex("abc")]
    Str,
    #[regex(".")]
    Other,
}

fn main() {
    dbg!(Token::lexer("ab").collect::<Result<Vec<_>, _>>());
}

With this code as written, the output is an error, as after one If the alternative (commented out) regex rule is used instead, then the lack of an ending
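For comparison, here is a minimal sketch of the behaviour one would expect for the two rules above (`ab+c` and `.`): when the longer rule fails partway through, the lexer falls back to the longest rule that did match at the same position. This is a hand-written, std-only illustration and not how Logos is implemented; `Kind`, `match_str`, `match_other` and `lex` are hypothetical names introduced for this example, and `match_other` accepts any character, whereas a regex `.` usually excludes newlines.

```rust
/// Token kinds mirroring the two rules in the example above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Str,   // corresponds to `ab+c`
    Other, // corresponds to `.`
}

/// Byte length of a match of `ab+c` at the start of `s`, if there is one.
fn match_str(s: &str) -> Option<usize> {
    let rest = s.strip_prefix('a')?;
    let bs = rest.bytes().take_while(|&b| b == b'b').count();
    if bs == 0 {
        return None;
    }
    // The trailing `c` is required: without it the whole rule fails,
    // but other rules must still get their chance at this position.
    rest[bs..].starts_with('c').then_some(1 + bs + 1)
}

/// Byte length of a single character at the start of `s`, if there is one.
fn match_other(s: &str) -> Option<usize> {
    s.chars().next().map(char::len_utf8)
}

/// Longest-match lexing: every rule is tried at each position and the
/// longest successful match wins (the earlier rule wins on ties).
fn lex(input: &str) -> Vec<(Kind, &str)> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let rest = &input[pos..];
        let candidates = [
            (Kind::Str, match_str(rest)),
            (Kind::Other, match_other(rest)),
        ];
        let mut best: Option<(Kind, usize)> = None;
        for (kind, len) in candidates {
            if let Some(len) = len {
                if best.map_or(true, |(_, b)| len > b) {
                    best = Some((kind, len));
                }
            }
        }
        match best {
            Some((kind, len)) => {
                out.push((kind, &rest[..len]));
                pos += len;
            }
            None => break, // no rule matched; a real lexer would report an error
        }
    }
    out
}
```

With this sketch, `lex("ab")` produces two `Other` tokens instead of an error, and `lex("abbc")` produces a single `Str` token.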
The problem arises from maciejhirsz/logos#315. This work-around is a temporary solution until the upcoming Logos rewrite. Fixes #4
I found quite a clean work-around to this problem which may be useful to others. My work-around is to define a subset of `Token` (`RecoveryToken` below) and to re-lex any span that failed to lex with that subset.

In my case, the problem is the overlap between the currency regex and other shorter ones. Simplified:

#[derive(Logos, Clone, Debug, PartialEq, Eq)]
#[logos(subpattern currency = r"[A-Z][A-Z0-9'\._-]*|/[0-9'\._-]*[A-Z][A-Z0-9'\._-]*")]
#[logos(subpattern date = r"\d{4}[\-/]\d{2}[\-/]\d{2}")]
#[logos(subpattern number = r"\d+(,\d{3})*(\.\d+)?")]
pub enum Token<'a> {
    #[regex(r#"(?&currency)"#, |lex| super::Currency::try_from(lex.slice()))]
    Currency(super::Currency<'a>),
    #[token("-")]
    Minus,
    #[token("/")]
    Slash,
    #[regex(r"(?&date)", |lex| parse_date(lex.slice()))]
    Date(Date),
    #[regex(r"(?&number)", |lex| parse_number(lex.slice()))]
    Number(Decimal),
    // errors are returned as an error token
    Error(LexerError),
}
// just the subset of `Token` which could occur within a currency
#[derive(Logos, Clone, Debug, PartialEq, Eq)]
#[logos(subpattern date = r"\d{4}[\-/]\d{2}[\-/]\d{2}")]
#[logos(subpattern number = r"\d+(,\d{3})*(\.\d+)?")]
enum RecoveryToken {
    #[token("-")]
    Minus,
    #[token("/")]
    Slash,
    #[regex(r"(?&date)", |lex| parse_date(lex.slice()))]
    Date(Date),
    #[regex(r"(?&number)", |lex| parse_number(lex.slice()))]
    Number(Decimal),
}

impl<'a> From<RecoveryToken> for Token<'a> {
    fn from(value: RecoveryToken) -> Self {
        use RecoveryToken::*;
        match value {
            Minus => Token::Minus,
            Slash => Token::Slash,
            Date(date) => Token::Date(date),
            Number(decimal) => Token::Number(decimal),
        }
    }
}

fn attempt_recovery(
    failed_span: Range<usize>,
    s: &str,
) -> impl Iterator<Item = (Token, Range<usize>)> {
    let failed_token = &s[failed_span.start..failed_span.end];
    RecoveryToken::lexer(failed_token)
        .spanned()
        .map(move |(lexeme, rel_span)| {
            // spans from the recovery lexer are relative to the failed slice
            let span = failed_span.start + rel_span.start..failed_span.start + rel_span.end;
            match lexeme {
                Ok(tok) => (tok.into(), span),
                Err(e) => (Token::Error(e), span),
            }
        })
}

and where we use this:

// lex that input:
Token::lexer(s)
    .spanned()
    .flat_map(|(lexeme, span)| match lexeme {
        Ok(tok) => smallvec![(tok, span)],
        Err(_e) => attempt_recovery(span.clone(), s)
            .collect::<SmallVec<_, 1>>(),
    })

Complete code is here
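The key detail in the work-around above is the span arithmetic: the recovery lexer only sees the failed slice, so its spans are relative and have to be shifted by the start of the failed span. Below is a small std-only sketch of just that step. It does not use Logos; `relex_failed_span`, `Recovery` and `toy_recovery_lexer` are hypothetical stand-ins for `attempt_recovery` and `RecoveryToken`, and the toy lexer assumes ASCII input.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the recovery token type.
#[derive(Debug, PartialEq, Eq)]
enum Recovery {
    Number,
    Slash,
    Minus,
    Unknown,
}

/// Toy recovery lexer (ASCII only): digit runs, `/`, `-`, anything else.
/// The spans it returns are relative to the start of `s`.
fn toy_recovery_lexer(s: &str) -> Vec<(Recovery, Range<usize>)> {
    let bytes = s.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        let kind = match bytes[i] {
            b'0'..=b'9' => {
                while i < bytes.len() && bytes[i].is_ascii_digit() {
                    i += 1;
                }
                Recovery::Number
            }
            b'/' => {
                i += 1;
                Recovery::Slash
            }
            b'-' => {
                i += 1;
                Recovery::Minus
            }
            _ => {
                i += 1;
                Recovery::Unknown
            }
        };
        out.push((kind, start..i));
    }
    out
}

/// Re-lex only the failed slice of `input`, then shift every relative
/// span by `failed.start` so it is valid for the whole input again.
fn relex_failed_span<T>(
    input: &str,
    failed: Range<usize>,
    sub_lexer: impl Fn(&str) -> Vec<(T, Range<usize>)>,
) -> Vec<(T, Range<usize>)> {
    sub_lexer(&input[failed.clone()])
        .into_iter()
        .map(|(tok, rel)| (tok, failed.start + rel.start..failed.start + rel.end))
        .collect()
}
```

For example, calling `relex_failed_span("x 12/34 y", 2..7, toy_recovery_lexer)` re-lexes only `12/34` and returns spans `2..4`, `4..5` and `5..7`, which index into the full input rather than the slice.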
The following minimized example
yields the output
while I'd expect the output
However, changing the `ABC` rule to
yields the expected output (but of course also changes the lexed language).
Is this a bug or a known limitation of the implementation?
In case it's the latter, is there any way of lexing the first language (with `[ab]*c`)?

Apart from this limitation, I'm really enjoying the library so far! Seeing rustc segfault when I messed up my regexes was a bit surprising, but simply annotating the token definition is just as nice and concise as it gets :)

edit: using `logos = "0.13.0"`