Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix quadratic behavior in yaml_parser_fetch_more_tokens #286

Closed
wants to merge 1 commit into from

Conversation

dtolnay
Copy link

@dtolnay dtolnay commented Mar 17, 2024

The following program reproduces scan time that is quadratic in the size of the input document.

#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <yaml.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "usage: %s 100000\n", argv[0]);
    return 1;
  }

  errno = 0;
  char *rest;
  const long n = strtol(argv[1], &rest, 10);
  if (errno != 0 || rest == argv[1] || isspace(*argv[1]) || *rest != '\0') {
    fprintf(stderr, "invalid argument\n");
    return 1;
  }

  unsigned char *yaml = (unsigned char *)malloc(n);
  if (!yaml) {
    return 1;
  }

  yaml_parser_t parser;
  if (!yaml_parser_initialize(&parser)) {
    free(yaml);
    return 1;
  }

  memset(yaml, '[', n);
  yaml_parser_set_input_string(&parser, yaml, n);

  int finished = 0;
  do {
    yaml_token_t token;
    if (!yaml_parser_scan(&parser, &token)) {
      break;
    }
    finished = token.type == YAML_STREAM_END_TOKEN;
    yaml_token_delete(&token);
  } while (!finished);

  yaml_parser_delete(&parser);
  free(yaml);
  return !finished;
}

Before: With each doubling of input size, the runtime increases by a factor of four.

`time ./repro 10000`   0m0.169s
              20000    0m0.580s
              40000    0m2.244s
              80000    0m8.940s
              160000   0m39.625s

After: Runtime is linear in the input size. Program can handle inputs which are two orders of magnitude larger than before.

`time ./repro 160000`  0m0.490s
              1600000   0m4.710s
              16000000   0m46.292s

The following program reproduces scan time that is quadratic in the size
of the input document.

    #include <ctype.h>
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <yaml.h>

    int main(int argc, char **argv) {
      if (argc != 2) {
        fprintf(stderr, "usage: %s 100000\n", argv[0]);
        return 1;
      }

      errno = 0;
      char *rest;
      const long n = strtol(argv[1], &rest, 10);
      if (errno != 0 || rest == argv[1] || isspace(*argv[1]) || *rest != '\0') {
        fprintf(stderr, "invalid argument\n");
        return 1;
      }

      unsigned char *yaml = (unsigned char *)malloc(n);
      if (!yaml) {
        return 1;
      }

      yaml_parser_t parser;
      if (!yaml_parser_initialize(&parser)) {
        free(yaml);
        return 1;
      }

      memset(yaml, '[', n);
      yaml_parser_set_input_string(&parser, yaml, n);

      int finished = 0;
      do {
        yaml_token_t token;
        if (!yaml_parser_scan(&parser, &token)) {
          break;
        }
        finished = token.type == YAML_STREAM_END_TOKEN;
        yaml_token_delete(&token);
      } while (!finished);

      yaml_parser_delete(&parser);
      free(yaml);
      return !finished;
    }

**Before;** With each doubling of input size, the runtime increases by a
factor of four.

    `time ./repro 10000`   0m0.169s
                  20000    0m0.580s
                  40000    0m2.244s
                  80000    0m8.940s
                  160000   0m39.625s

**After:** Runtime is linear in the input size. Program can handle
inputs which are two orders of magnitude larger than before.

    `time ./repro 160000`  0m0.490s
                  1600000   0m4.710s
                  16000000   0m46.292s
@perlpunk
Copy link
Member

Thanks!
This reduces the runtime. I'm don't know the simplekeys code completely, so I would like to have a closer look.
There are still cases where the runtime gets quadratic (when it's not simply {{{), I'm currently running some benchmarks.

I'm wondering how this ABI change can be realeased. I guess we would need at least a new minor version and change the so file version in order to not break existing bindings?

@dtolnay
Copy link
Author

dtolnay commented Jun 1, 2024

To also handle [{"":[[{"":[[{"":[..., a different solution is going to be needed.

@dtolnay dtolnay closed this Jun 1, 2024
@dtolnay dtolnay deleted the simplekeys branch June 1, 2024 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants