
can't enlarge buffer, but no REJECT #620

Open
jklowden opened this issue Jan 27, 2024 · 11 comments


@jklowden

I am using flex 2.6.4. I compile with -Ca, following random advice on the Internet, because at some earlier juncture it solved some internal error. I also use:

%option debug noyywrap stack yylineno case-insensitive

My scanner uses yyless(), but not yymore() or REJECT. It has 54 start conditions and reads only one input file.

When confronted with input of about 40 KB of ASCII spaces (0x20), it produces the message:

can't enlarge buffer because scanner uses REJECT

It looks like the code that would expand the buffer is not generated:

int num_to_read =
  YY_CURRENT_BUFFER_LVALUE->yy_buf_size - number_to_move - 1;

while ( num_to_read <= 0 )
  { /* Not enough room in the buffer - grow it. */

    YY_FATAL_ERROR(
                   "input buffer overflow, can't enlarge buffer because scanner uses REJECT" );

  }

My scanner has 249 rules involving whitespace. Fortunately, whitespace is never returned to the parser. While it's not a great deal of work to convert every [[:space:]]+ to [[:space:]]{1,4096} or similar, the result is a bit awkward and counterintuitive. Nor do I want to just redefine the buffer size statically, because some other file will have even more whitespace.

Why does the scanner think REJECT is being used? How can I convince it to expand the buffer as needed?

@Mightyjo
Contributor

You have defined "variable length" rules when you use constructions like [[:space:]]+ or [[:space:]]{1,4096}. Variable length rules use some of the same machinery as REJECT. When we hit this corner case, we can't tell whether it's because of REJECT or a variable length rule, so the error message covers the more common case.

As you've found, the input buffer is big but not huge. The challenge is biting off chunks of whitespace small enough that they fit in the buffer without needing to enlarge it. I'd refactor your scanner to use scopes, with yybegin(SCOPENAME) in your action, to eat whitespace similar to the way strings are consumed in examples/string2.lex. That way you can keep track of which rule you're consuming whitespace in, if that's important.
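
A minimal sketch of that start-condition approach might look like the following (the state name and the 64-character chunk size are illustrative, and the `<<EOF>>` handling is omitted for brevity). Because no single match can exceed 64 characters, no token can outgrow the default buffer:

```lex
%x EAT_WS

%%
[[:space:]]                { BEGIN(EAT_WS); }   /* first space: enter the whitespace-eating state */
<EAT_WS>[[:space:]]{1,64}  ;                    /* discard whitespace in bounded chunks */
<EAT_WS>[^[:space:]]       { yyless(0); BEGIN(INITIAL); }  /* put the non-space back and resume */
```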

@jklowden
Author

I understand your answer to mean the behavior I described is expected. Thank you; I suspected as much.

IMO there are problems with the documentation, and I'm not sure I understand your recommendation.

The term "variable length rule" is not defined in the manual. There are warnings against variable-length trailing context, as a performance consideration. There is no indication that a simple "+" on any regex may cause a problem of any kind.

(I have several persistent warnings from flex about "dangerous trailing context" that appear only in the context of a complex flex input file. My attempts to reduce the reported cases to a minimal example invariably result in a file for which flex reports no problem. For example, see https://gitlab.cobolworx.com/COBOLworx/gcc-cobol/-/blob/master+cobol/gcc/cobol/scan.l?ref_type=heads)

By "scope" I understand you to mean "start condition". By yybegin, which is not documented, I assume you mean BEGIN.

I do not know what examples/string2.lex refers to. No such file appears in https://github.com/westes/flex/tree/master/examples.

Is there any safe extent of "variable length" rules? Is [[:space:]]{1,2} potentially a problem if the pattern matches as its first space the last character of the input buffer? For the regex [[:space:]]{1,x}, what is the safe maximum value of x?

If the pattern x+ is always unsafe because a single x may match at the end of the input buffer, IMO that would present a problem to a great many programs using flex.

@Mightyjo
Contributor

You're right, I left out a path component: examples/manual/string2.lex.

@Mightyjo
Contributor

Answering your broader question about safe lengths. You're safe using variable length repetitions if you can reasonably expect them to match text that's about the length of your other tokens.

Whitespace, comments, and similar throw-away strings are hard to tokenize because they can run on so long. About the best we can do is delete them a character at a time or scan them in manageable chunks.

What else...

Yes, I said scope but meant start condition.

yybegin and BEGIN are the same action. Flex user functions and actions are often prefixed with yy and the names are used interchangeably. You can set your own prefix that will replace yy if you ever need two independent scanners. But that's getting off topic.
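
For reference, the prefix is set with `%option prefix`; the name below is purely illustrative:

```lex
%option prefix="cobol_"
    /* the generated scanner then exposes cobol_lex(), cobol_text, cobol_in, ... */
```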

@jklowden
Author

Thank you, I think I see my way home now. Just one more question.

One useful tactic has been to remove trailing whitespace prior to scanning. If blank lines consist only of a single \n, it takes a lot of them to bugger up the scanner.

Unfortunately, my scanner must contend with many multi-string tokens such as END[[:space:]]+PROGRAM. Nothing prevents the pathological programmer from using an unnatural number of spaces there, and transitioning between SCs in mid-token is too terrible to contemplate. OTOH, we might indeed "reasonably expect" the number of mid-token spaces to be a handful. Giant amounts of whitespace are more likely to appear between statements, not within tokens. So there is a place for your SC recommendation.

That said, I wonder if the problem can be minimized by throwing memory at it. This problem is not, I think you agree, anything fundamental about regular expressions. After all,

$ yes | head -1000000 | tee /tmp/y | tr \\n \  | sed -E 's/(y )+/OK\n/' && wc /tmp/y
OK
1000000 1000000 2000000 /tmp/y

works just fine. The problem is introduced, isn't it, by the buffer itself? Would I be better off setting the buffer size to, say, 10 MB, and telling the user that more than, oh, 5 MB of whitespace is disallowed?

@Mightyjo
Contributor

I'd think of the problem differently. Let Flex scan the words END and PROGRAM as tokens. Ignore the whitespace between them for a moment. Use your parser to match the token sequence, END PROGRAM, to whatever semantics you like.

Now you can tell Flex to match and discard each single whitespace character it encounters. (Which results in Flex discarding all the whitespace it encounters.) Totally solves the problem of users trying to embed Brady & Morris Whitespace programs in their COBOL.
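
As a sketch (the token names are illustrative; the parser side would declare `%token END PROGRAM` and match the two-token sequence itself):

```lex
%%
"END"          { return END; }
"PROGRAM"      { return PROGRAM; }
[[:space:]]    ;   /* match one whitespace character and discard it */
```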

@jklowden
Author

Brady & Morris Whitespace takes me back. :-)

It's a perfectly reasonable suggestion. What I have discovered over the years, though, is that END serves numerous purposes in the COBOL grammar. As I'm sure you appreciate, returning a unique END_PROGRAM token to the parser simplifies the grammar.

There are many such, including LESS THAN OR EQUAL TO, which returns LE. If the individual strings are returned as tokens, OR winds up both in the midst of relation expressions and as their master in logical expressions. Trapping the whole business in the scanner as a single string (and token) was the way out of days spent in a maze of twisty shift/reduce errors, all alike.

I was serious, though. What harm in defining the input buffer 1000x bigger than currently? Memory is not a constraint. In my system, YY_INPUT is defined over a function that begins with mmap(2) over the whole input. It's a shame the input need be copied to the input buffer at all; it would be more efficient to use pointers into the mapped file.
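
For what it's worth, a YY_INPUT over an mmap'd region might be sketched like this (the map_base/map_len/map_off globals are assumed to be initialized elsewhere, after the mmap(2) call):

```lex
%{
#include <string.h>

/* Assumed to be set up from mmap(2) before scanning begins. */
extern const char *map_base;
extern size_t      map_len, map_off;

#define YY_INPUT(buf, result, max_size)                          \
    do {                                                         \
        size_t n = map_len - map_off;                            \
        if (n > (size_t)(max_size)) n = (size_t)(max_size);      \
        memcpy((buf), map_base + map_off, n);                    \
        map_off += n;                                            \
        (result) = (n == 0) ? YY_NULL : (int)n;                  \
    } while (0)
%}
```

The copy into the scanner's buffer still happens here; flex does offer yy_scan_buffer() for scanning a buffer in place, though it requires the final two bytes of the buffer to be YY_END_OF_BUFFER_CHAR (ASCII NUL), which an mmap of the raw file does not provide.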

@Mightyjo
Contributor

Mightyjo commented Feb 14, 2024 via email

@jklowden
Author

jklowden commented Mar 2, 2024

I've been thinking about your reply, Joseph. It doesn't make sense to me, literally. And I wonder if the defaults are really what anyone wants. I want to enlarge my buffers, and I want to check with you guys if I'm going to regret it.

In the generated scanner, we have this definition:

#ifdef __ia64__
/* On IA-64, the buffer size is 16k, not 8k.
 * Moreover, YY_BUF_SIZE is 2*YY_READ_BUF_SIZE in the general case.
 * Ditto for the __ia64__ case accordingly.
 */
#define YY_BUF_SIZE 32768

Why for IA64 only? It's hardly the only 64-bit chip, and it's obsolete. I am using

$ arch
aarch64

and I'll bet you there are 1000 of those for every IA64 still in use. Is there something mysterious and arcane that makes it special?

My plan is to define these in my input file:

# define YY_READ_BUF_SIZE 10 * 16 * 1024
# define YY_BUF_SIZE 2 * YY_READ_BUF_SIZE

It's not obvious to me why I should stop at 10; even if we supported a 32-bit processor, our compiler wouldn't mind a 1600 KB input buffer. I have to read the whole file anyway, and it's already mapped into memory (in preparation for scanning).

Is there any reason that would dissuade me from taking that approach? (Kudos, btw, for using macros in this way. I redefined YY_FATAL_ERROR to good effect.)
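
A sketch of that plan with parenthesized macro bodies (safer if the macros are ever expanded inside a larger expression); placing the definitions in a %top block assumes the generated scanner guards its defaults with #ifndef, which is worth verifying against your flex version:

```lex
%top{
#define YY_READ_BUF_SIZE (10 * 16 * 1024)
#define YY_BUF_SIZE (2 * YY_READ_BUF_SIZE)
}
```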

I would also like to renew my complaint, for clarity. It may be that variable length rules use some of the same machinery as REJECT, as you say. But they're hardly a "corner case". Thompson used a(b|c)*d in his seminal paper. If flex users cannot reliably use the * operator, there really should be a disclaimer in the documentation. Regardless, the error message is misleading, and the issue needs better documentation. Asking for a friend, of course.

To be clear, I appreciate my problem domain is unusual. COBOL programs are regularly subjected to preprocessing, long before the lexer sees them. Those preprocessors have a mean streak: they output valid Cobol (FSVO valid), but neither readability nor parsimony are evident design goals. Why one might output 1000 blank lines as part of an interpolation is more than I can understand. But, like so many programmers, they didn't ask me. You might say code flows downhill.

@Mightyjo
Contributor

Mightyjo commented Mar 2, 2024

You can enlarge your buffers. Look at the FAQ section in the manual for the question, "Is backing up a big deal?" There's information about the changes you'll need to make to the scanner and possibly to flexdef.h to support a larger state buffer.

The defaults support systems that were written against lex 30 or 40 years ago, and we still support them. I tried changing them once and got an earful from the community.

I really don't know what's up with the IA64 special case. I suspect it had to do with a default memory page size, but I have no evidence. When that sort of magic number exists it almost always leads back to page sizes or superstition.

Answering your main questions:
Lex/Flex generate table-driven scanners, and YACC/Bison generate LALR(1) parsers. Neither is a general-purpose regular expression engine. Flex supports much of the syntax of regular expressions, but it isn't running a full regular-expression matcher behind the scenes. Thompson's paper is great, but Flex is only supporting the syntax. The documentation addresses this, but not as clearly as I'd like. We welcome PRs with improvements.

Your problem isn't that unusual. You should be able to use a character-at-a-time solution to collecting your strings of whitespace. If you need to return the string of whitespace as a token, you can accumulate it in an action.
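
If the whitespace ever needs to come back as a token, the accumulate-in-an-action idea might be sketched like this (the buffer size and the WS_TOKEN name are illustrative; a real implementation would grow the buffer dynamically and add an `<<EOF>>` rule for the WS state):

```lex
%x WS
%{
static char   wsbuf[8192];   /* illustrative fixed cap; overflow is silently truncated */
static size_t wslen;
%}
%%
[[:space:]]       { wslen = 0; wsbuf[wslen++] = yytext[0]; BEGIN(WS); }
<WS>[[:space:]]   { if (wslen + 1 < sizeof wsbuf) wsbuf[wslen++] = yytext[0]; }
<WS>[^[:space:]]  { yyless(0); wsbuf[wslen] = '\0'; BEGIN(INITIAL); return WS_TOKEN; }
```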

@jklowden
Author

jklowden commented Mar 3, 2024

I reluctantly accept this report will not yield anything, except perhaps (hopefully, still) a limited solution to my particular situation.

Your advice for a misleading message -- and a design choice -- is not to ask of flex what it cannot do. That's clear enough.

It is a bug, though, as you tacitly admit, because flex is not behaving as advertised. The message

"input buffer overflow, can't enlarge buffer 
 because scanner uses REJECT"

does not say what it means; I'm sure you agree. But an accurate replacement,

"input buffer overflow, can't enlarge buffer 
 because scanner uses variable-length rules"

is practically tautological. How can anyone define a too-large token without using the '*' or '+' operator?

You said, "Flex [does not] generate regular expression parsers." You will forgive my confusion, because the casual observer might expect otherwise,

$ find * -name '?fa.*'
src/nfa.c
src/dfa.c
$ grep -cE '\b[nd]fa\b' src/flexdef.h 
21

Here's the thing: the user who gets bitten by this is blindsided. It crops up long after the project has committed to flex and has composed a substantial ruleset. Although the user assiduously avoids REJECT, it's to no avail because the extend-the-buffer logic is moribund. Nothing but nothing warns him. In fact, under "The Default Memory Management" he is told the opposite.

At least, he should be warned.

Workarounds

I don't think "Is backing up a big deal?" is the FAQ you mean. It references the 32000 state limit. Nor do I find any discussion of the input buffer size in flexdef.h.

On my system, "unnamed-faq-82" is the only one that mentions YY_BUF_SIZE. As I read the code, though, it's not enough to change only that one macro. I'm guessing that answer, 26 years old, is out of date.

I would nevertheless like to pursue that course, unless there is good reason not to.

I appreciate your frankness regarding IA64. It sounds to me like "once burnt, twice shy", and it's made you reluctant to change any basic parameter. But changing a default is always dicey. Making it configurable shouldn't be. Why is YY_BUF_SIZE a macro, if it's never to be changed?

As for what I am able to do, are you sure? Of course, in the context of flex, dealing with whitespace "by hand" is one workaround. But are you prepared to tell the Bison user that his 1800-rule parser needs a few hundred more to deal with 42 multi-string tokens? And to help out with the ensuing shift/reduce ambiguities?

I'd like to point out that on one hand, you call this a "corner case", while on the other, you say it's not that unusual. But there is no such thing as a corner case that's not unusual.

This is flex's job, I'm sure you agree: recognize regular expressions, and act on them. Your answer to my bug report is that

 IS[[:space:]]+EQUAL

will not work if the input it matches exceeds N characters, for some unspecified value of N. That you don't consider that a defect utterly astounds me.
