Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementing replaceAll with generated matchers? #357

Open
dkegel-fastly opened this issue Jun 1, 2021 · 3 comments
Open

Implementing replaceAll with generated matchers? #357

dkegel-fastly opened this issue Jun 1, 2021 · 3 comments

Comments

@dkegel-fastly
Copy link

dkegel-fastly commented Jun 1, 2021

I'm trying to port a use of golang's https://golang.org/pkg/regexp/#Regexp.ReplaceAllString to tinygo, where
regexp doesn't work just yet. libfsm works nicely, i.e. I can use

re -l go -k str  "ab"

to generate a nice matcher. But it isn't especially helpful for implementing ReplaceAllString, as it returns
neither the starting nor the ending index.

Returning starting and ending indices seems easy, e.g.

--- a/src/libfsm/print/go.c
+++ b/src/libfsm/print/go.c
@@ -92,14 +92,14 @@ print_end(FILE *f, const struct dfavm_op_ir *op, const struct fsm_options *opt,
        enum dfavm_op_end end_bits, const struct ir *ir)
 {
        if (end_bits == VM_END_FAIL) {
-               fprintf(f, "{\n\t\treturn -1\n\t}\n");
+               fprintf(f, "{\n\t\treturn -1, -1\n\t}\n");
                return;
        }
 
        if (opt->endleaf != NULL) {
                opt->endleaf(f, op->ir_state->end_ids, opt->endleaf_opaque);
        } else {
-               fprintf(f, "{\n\t\treturn %lu\n\t}\n", (unsigned long) (op->ir_state - ir->states));
+               fprintf(f, "{\n\t\treturn int(start_idx), int(idx+1)\n\t}");
        }
 }
 
@@ -175,6 +175,9 @@ fsm_print_gofrag(FILE *f, const struct ir *ir, const struct fsm_options *opt,
                if (op->num_incoming > 0) {
                        print_label(f, op, opt);
                        fprintf(f, "\n");
+                       if (op->index == 0) {
+                               fprintf(f, "\tvar start_idx = idx + 1\n");
+                       }
                }
 
                fprintf(f, "\t");
@@ -257,7 +260,7 @@ fsm_print_go(FILE *f, const struct fsm *fsm)
 
        switch (fsm->opt->io) {
        case FSM_IO_PAIR:
-               fprintf(f, "(data []byte) int {\n");
+               fprintf(f, "(data []byte) (start, end int) {\n");
                if (ir->n > 0) {
                        /* start idx at -1 unsigned so after first increment we're correct at index 0 */
                        fprintf(f, "\tvar idx = ^uint(0)\n");
@@ -266,7 +269,7 @@ fsm_print_go(FILE *f, const struct fsm *fsm)
                break;
 
        case FSM_IO_STR:
-               fprintf(f, "(data string) int {\n");
+               fprintf(f, "(data string) (start, end int) {\n");
                if (ir->n > 0) {
                        /* start idx at -1 unsigned so after first increment we're correct at index 0 */
                        fprintf(f, "\tvar idx = ^uint(0)\n");

and would let me implement a ReplaceAllString.

Is there an easier way? I feel like I haven't yet found the appropriate doc...

@dkegel-fastly
Copy link
Author

dkegel-fastly commented Jun 2, 2021

Here's a demo of what I'm trying to do. Not quite happy yet...

package main
  
import (
        "fmt"
        "regexp"
)

func main() {
        input := "aaa aaa"

        // Porting FindStringIndex() to fsm
        r := regexp.MustCompile(`a{2,3}`)
        match := r.FindStringIndex(input)
        fmt.Printf("regexp: r.FindStringIndex returns %d, %d\n", match[0], match[1])
        start, end := Match(input)
        fmt.Printf("fsm: Match returns %d, %d\n", start, end)

        // Porting ReplaceAll to fsm
        out := r.ReplaceAllString(input, "b")
        fmt.Printf("regexp: ReplaceAllString returns %s\n", out)
        out = myReplaceAll(input, "b")
        fmt.Printf("fsm: myReplaceAll returns %s\n", out)
}

// myReplaceAll finds all sequences in value that Match, and replaces them with repl.
// Any matches that might result from the replacement are ignored.
func myReplaceAll(value, repl string) string {
        remaining := value
        out := ""
        for remaining != "" {
                start, end := Match(remaining)
                if start == -1 {
                        out = out + remaining
                        remaining = ""
                } else {
                        println("match", remaining[start:end], "start", start, "end", end)
                        out = out + remaining[0:start] + repl
                        remaining = remaining[end:]
                }
        }
        return out
}

Save that as main.go, then create match.go with something like

re -k str -l go -r pcre 'a{2,3}' | sed 's/fsm_fsm/main/' > match.go

and run with

go run main.go match.go

Currently, it outputs

regexp: r.FindStringIndex returns 0, 3
fsm: Match returns 0, 2
regexp: ReplaceAllString returns b b
match aa start 0 end 2
match aa start 2 end 4
fsm: myReplaceAll returns ba ba

Examining the generated code, fsm expands the quantifier a{2,3} to return immediately after the 2nd a, without even trying to consume a 3rd one.
(I filed #358 for that earlier, but may not have provided enough context.)

This is a fine optimization for detecting a match, so evidently my hack to return the end offset of the match is breaking an assumption libfsm made, and/or I'm misusing libfsm in some other way.

@dhobsd
Copy link
Collaborator

dhobsd commented Jun 2, 2021 via email

@dkegel-fastly
Copy link
Author

Thanks. I updated the problem description to address your last suggestion about induced matches.
I'm not worried about the inefficiency of using string for the moment.

I am very interested in suggestions for coaxing libfsm into actually finishing the match of a{2,3} instead
of calling it a day after aa.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants