Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Differing output for VM vs. IR codegen #340

Open
katef opened this issue Apr 14, 2021 · 6 comments
Open

Differing output for VM vs. IR codegen #340

katef opened this issue Apr 14, 2021 · 6 comments

Comments

@katef
Copy link
Owner

katef commented Apr 14, 2021

For the following retest case:

^abc$
+abc
-qqq\nabc
-abc\nzzz
-qqq\nabc\nzzz

I believe the IR-generated C is correct:

; bmake -r CC=clang DEBUG=1 ASAN=1 && ./build/bin/retest -O1 -l c ./x.tst
[OK    ] line 2: pcre regexp /^abc$/ matched "abc".
[OK    ] line 3: pcre regexp /^abc$/ did not match "qqq\nabc"
[OK    ] line 4: pcre regexp /^abc$/ did not match "abc\nzzz"
[OK    ] line 5: pcre regexp /^abc$/ did not match "qqq\nabc\nzzz"
./x.tst: 1 regexps, 4 test cases
./x.tst: 0 re errors, 0 errors
no errors

and the VM code (either interpreted, vmc or other VM-based codegen) is incorrect:

; bmake -r CC=clang DEBUG=1 ASAN=1 && ./build/bin/retest -O1 -l vm ./x.tst
[OK    ] line 2: pcre regexp /^abc$/ matched "abc".
[OK    ] line 3: pcre regexp /^abc$/ did not match "qqq\nabc"
[NOT OK] line 4: pcre regexp /^abc$/ expected to not match "abc\nzzz", but did
[OK    ] line 5: pcre regexp /^abc$/ did not match "qqq\nabc\nzzz"
./x.tst: 1 regexps, 4 test cases
./x.tst: 0 re errors, 1 errors
[   1/   1] (./x.tst:4) pcre regexp /^abc$/ should not match "abc\nzzz" but does
1 errors

This is on the kate/ci-retest-pcre-suite branch, where 227ca72 introduced fsm_minimise() as well as fsm_determinise() for retest.

@katef
Copy link
Owner Author

katef commented Apr 14, 2021

image

@katef
Copy link
Owner Author

katef commented Apr 14, 2021

and the vm-generated code:

; ./build/bin/re -r pcre -pl vmc '^abc$'       

int
fsm_main(int (*fsm_getc)(void *opaque), void *opaque)
{
	int c;

	if (c = fsm_getc(opaque), c == EOF) return -1;
	if (c != 'a') return -1;
	if (c = fsm_getc(opaque), c == EOF) return -1;
	if (c != 'b') return -1;
	if (c = fsm_getc(opaque), c == EOF) return -1;
	if (c != 'c') return -1;
	if (c = fsm_getc(opaque), c == EOF) return 0x1; /* "^abc$" */
	if (c != '\n') return -1;
	if (c = fsm_getc(opaque), c == EOF) return 0x1; /* "^abc$" */
	return 0x1; /* "^abc$" */
}

@sfstewman
Copy link
Collaborator

sfstewman commented Apr 14, 2021

Interesting. I'm not finding any errors:

% bmake -r CC=clang DEBUG=1 ASAN=1 && ./build/bin/retest -O1 -l vm ./x.tst

[OK    ] line 2: pcre regexp /^abc$/ matched "abc".
[OK    ] line 3: pcre regexp /^abc$/ did not match "qqq\nabc"
[OK    ] line 4: pcre regexp /^abc$/ did not match "abc\nzzz"
[OK    ] line 5: pcre regexp /^abc$/ did not match "qqq\nabc\nzzz"
./x.tst: 1 regexps, 4 test cases
./x.tst: 0 re errors, 0 errors
no errors

Here's my info:

% uname -a
Linux DESKTOP-NM4RCJ5 4.19.104-microsoft-standard #1 SMP Wed Feb 19 06:37:35 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

% clang --version
clang version 10.0.0-4ubuntu1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

I'm running off of main, I think up to date:

% git log --oneline | head -5
d29af720 Merge pull request #339 from katef/kate/empty-ir
39d8ac0e Don't emit unused variables for an empty IR.
21594554 Return -1 for no matches. This really ought to be a user-supplied macro instead, but another time.
9bf5a665 Construct an empty `struct ir` for an FSM with no start state.
2ed683e3 Merge pull request #338 from katef/kate/group-numbering

@sfstewman
Copy link
Collaborator

Asking retest to dump the vmir encoding with retest -L ir -l vm ./x.tst, I don't see what you're generating with vmdot. I see:

0x631000000810 ;;; state 0 (index: 0, asm_index: 0) (start)
0x631000000810 |      0 |      0 |  FETCHF                                   ; incoming: 0, state: 0
0x631000000910 |      1 |      4 |  BREQ 'a', [st=1]                         ; incoming: 0, dest: 4 (0x6310000008d0)

0x631000000990 ;;; state 2 (index: 2, asm_index: 12)
0x631000000990 |      2 |     12 |  FETCHF                                   ; incoming: 5, state: 2
0x631000000850 |      3 |     13 |  BR  [st=2]                               ; incoming: 0, dest: 2 (0x631000000990)

0x6310000008d0 ;;; state 1 (index: 4, asm_index: 6)
0x6310000008d0 |      4 |      6 |  FETCHF                                   ; incoming: 1, state: 1
0x6310000009d0 |      5 |     11 |  BRNE 'b', [st=2]                         ; incoming: 0, dest: 2 (0x631000000990)

0x631000000890 ;;; state 3 (index: 6, asm_index: 14)
0x631000000890 |      6 |     14 |  FETCHF                                   ; incoming: 0, state: 3
0x631000000b10 |      7 |     19 |  BRNE 'c', [st=2]                         ; incoming: 0, dest: 2 (0x631000000990)

0x631000000ad0 ;;; state 4 (index: 8, asm_index: 20)
0x631000000ad0 |      8 |     20 |  FETCHS                                   ; incoming: 0, state: 4
0x631000000bd0 |      9 |     25 |  BRNE 0x0a, [st=2]                        ; incoming: 0, dest: 2 (0x631000000990)

0x631000000b90 ;;; state 5 (index: 10, asm_index: 26)
0x631000000b90 |     10 |     26 |  FETCHS                                   ; incoming: 0, state: 5
0x631000000a50 |     11 |     27 |  BR  [st=2]                               ; incoming: 0, dest: 2 (0x631000000990)

@sfstewman
Copy link
Collaborator

If I modify the src/retest/runner.c code to print the C code generated by vmc to stderr, code is also quite different and very similar to the IR code above:

int
fsm_main(const char *b, const char *e)
{
        const char *p;
        int c;

        p = b;

        if (p == e) return -1;

        c = (unsigned char) *p++;
        if (c == 'a') goto l1;

l0:
        if (p == e) return -1;

        c = (unsigned char) *p++;
        goto l0;

l1:
        if (p == e) return -1;

        c = (unsigned char) *p++;
        if (c != 'b') goto l0;
        if (p == e) return -1;

        c = (unsigned char) *p++;
        if (c != 'c') goto l0;
        if (p == e) return 4;

        c = (unsigned char) *p++;
        if (c != '\n') goto l0;
        if (p == e) return 5;

        c = (unsigned char) *p++;
        goto l0;
}

@sfstewman
Copy link
Collaborator

Okay, if I add an fsm_minimise() step into retest, I can trigger the error.

I've tracked it down to how the VM IR is generated from IR_NONE nodes. We should fail if we read another character because there are no transitions out of the state. However, the IR_NONE node for abc\n is an accepting state. We changed the behavior in b465407 to accept if the node is an accepting state and to fail if it is not.

I think the old behavior was correct, although that commit indicates that there's at least one corner case it didn't handle (/^$|^$/, apparently?).

This then raises two questions:

  1. How do we fix the behavior to handle the corner case and to ensure that /^abc$/ will fail on abc\nzzz?

  2. How did you trigger the bug from within retest? The main branch doesn't minimize the dfa before converting it to IR and the VM IR. Perhaps it should? I'm fairly perplexed by this.

This also raises an interesting potential peephole optimization. In the non-minimized VM IR code, when the code enters state 2, it will fail to match, but it first consumes all characters in the input and only then fails. This case can/should be detected and optimized to a STOPF instruction. If this is done before branch optimization, it would replace a lot of branches with STOPxx instructions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants