Flex segfaults after reading EOF in input() #636

Open
nxg opened this issue Mar 19, 2024 · 2 comments
Comments

@nxg

nxg commented Mar 19, 2024

The program below works as expected when reading from stdin, but segfaults when it is instead lexing a buffer.

The key thing about this example is that one of the rules uses input() to gobble from "!" to EOF (yes, it looks as if I could use a "!".* pattern, but that doesn't produce the intended results in the real case; the lexer needs to balance braces, and if I hit EOF when trying to do that, I want to recover gracefully).

When run, reading from stdin, I get

$ flex -o eof.c eof.lex
$ cc -o eof eof.c
$ echo -n 'one two!three four' | ./eof
word:<one>
-> 1
-> 2
word:<two>
-> 1
buf=<three four>
-> 3

That's fine, but when I instead ./eof 'one two !three four', which scans the contents of a buffer set up by yy_scan_string, I get identical program output, followed by a segfault inside yy_get_next_buffer.

I can't work out which part of the flex manual is telling me I should expect that to happen.

The sequence of events seems to be that the lexer is finding its way to the end of file, as expected (and an <<EOF>> action confirms this), but not stopping there, despite the presence of the noyywrap option, and collapsing when it can't find a ‘next’ buffer.

Points:

  • Option -d doesn't illuminate.
  • It is, of course, a little hard to follow what the generated code is doing, but looking at the location of the segfault, it is indeed around the place where the code is checking for yywrap, so it should be getting the message that there is no more input coming.
  • The only real illustration of using input(), in the flex manual, is in a case where hitting EOF is reported as an error. Here, I'm doing essentially the same as in that example, but regarding EOF as an acceptable end of the scan.
  • The same behaviour appears when using a reentrant scanner.
  • It's worth noting that input() returns 0, not EOF, at EOF, despite what Sect.8 illustrates (cf. flex repo issue, and links there), and despite the rather mysterious note about a ‘“real” end-of-file’ in Sect.20. I have a suspicion that this remark in Sect.20 is telling me something terribly important, but I can't work out what.
  • This is with flex 2.6.4 and clang 15 on macOS, and 2.6.4 and gcc on (a RHEL-derived) Linux (I can confirm the precise gcc version if that would be helpful, but this doesn't look obviously compiler-dependent).

Program:

ALPHABETIC  [a-zA-Z]
WS      [^a-zA-Z!]

%option noyywrap nounput

%%

{ALPHABETIC}+   {
    printf("word:<%s>\n", yytext);
    return 1;
}
{WS}+   {
    return 2;
}

"!"         {  // gobble to end of input
    char buf[80];
    for (int idx=0; (buf[idx] = input()); idx++) /* empty */ ;
    printf("buf=<%s>\n", buf);
    // YY_FLUSH_BUFFER; /* makes no difference */
    return 3;
}

%%
int main(int argc, char** argv)
{
    switch (argc) {
      case 1: break;
      case 2:
        yy_scan_string(argv[1]);
        break;
      default:
        fprintf(stderr, "Usage: %s [string]\n", argv[0]);
        exit(1);
    }

    int token;
    while ((token = yylex()) != 0) {
        printf("-> %d\n", token);
    }
}
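
As an aside on the gobble loop in the "!" rule: it relies on input() returning 0 at end of input, and it writes into a fixed 80-byte buffer with no bounds check. The following is a minimal sketch of a bounded variant in plain C. The names `struct source`, `source_from`, `next_char` and `gobble` are hypothetical and not part of flex; `next_char` is only a stand-in that mimics the reported behaviour of input() (0 at end of input, not EOF), so the loop can be exercised without a generated scanner.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for flex's input(): hands out the bytes of a
 * string one at a time, then returns 0 at the end, which is what the
 * report observes from input() in flex 2.6.4. */
struct source {
    const char *text;
    size_t pos;
    size_t len;
};

struct source source_from(const char *text)
{
    struct source s = { text, 0, strlen(text) };
    return s;
}

int next_char(struct source *s)
{
    if (s->pos >= s->len)
        return 0;                       /* end of input reported as 0 */
    return (unsigned char)s->text[s->pos++];
}

/* Copy characters into buf until the source is exhausted or buf is full.
 * Always NUL-terminates; returns the number of characters stored.
 * When buf fills up, no further character is consumed from the source. */
size_t gobble(struct source *s, char *buf, size_t cap)
{
    size_t n = 0;
    int c;

    assert(cap > 0);
    while (n + 1 < cap && (c = next_char(s)) != 0)
        buf[n++] = (char)c;
    buf[n] = '\0';
    return n;
}
```

Inside the "!" action, the same loop shape would presumably call input() where this sketch calls next_char(). Note that a literal NUL byte in the input would end the loop as well, since it is indistinguishable from the end-of-input value.
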
@Mightyjo
Contributor

I can't find a spot in the docs that explains this behavior clearly. The best hints I could find are in the sections on multiple buffers, yywrap, and EOF rules.

You need an <<EOF>> rule that calls yyterminate() or sets up the next buffer. That rule will take the place of yywrap() in your use case.

I'm away from my computer but I'll post an example when I'm back.
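
For reference, an `<<EOF>>` rule of the kind described in this comment might look like the fragment below, placed in the rules section of the scanner from the original report. This is only a sketch of the syntax: it has not been tested against the reported crash, and nothing in this thread confirms that it prevents the segfault in flex 2.6.4.

```lex
<<EOF>>     {
    /* End of the current buffer: stop scanning, so that yylex()
     * returns 0 to its caller. To carry on with further input
     * instead, switch to another buffer here (for example with
     * yy_switch_to_buffer()) rather than terminating. */
    yyterminate();
}
```

With this rule in place, the `while ((token = yylex()) != 0)` loop in main() would be expected to end when the rule fires.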

@nxg
Author

nxg commented Mar 25, 2024

Thanks for clarifying.

In case it's useful when thinking about the docs, my mental model, when writing what I did, was that when I arrive at EOF using input(), I'm doing so ‘legitimately’ (ie, as opposed to my being illegitimately creative with yyinput, or something like that). It was on that basis that I presumed yywrap would Do The Right Thing, and that when flex subsequently asked for more input from input(), it would be told calmly ‘no’.

Or, put another way, my mental model is that flex is itself using input() to get input, or something equivalent to that, so that I'm working in concert with it if I read from it separately.

If those are bad intuitions, it might be useful for the docs to disabuse the reader fairly explicitly.
