Chunked parsing issue #189

skapix · 2021-07-06T09:26:42Z

The input is parsed in chunks with the following code:

#include <algorithm>
#include <string>

#include <myhtml/api.h>

// Feeds `body` to the parser in pieces of at most `chunk_sz` bytes.
// Returns the resulting tree, or nullptr if any chunk fails to parse.
myhtml_tree_t* Parse(myhtml_t* myhtml, const std::string& body,
                     size_t chunk_sz) {
  myhtml_tree_t* tree = myhtml_tree_create();
  myhtml_tree_init(tree, myhtml);
  size_t body_chunk_pos = 0;
  while (body_chunk_pos < body.size()) {
    // The last chunk may be shorter than chunk_sz.
    size_t current_chunk_sz = std::min(chunk_sz, body.size() - body_chunk_pos);
    mystatus_t parse_status = myhtml_parse_chunk_single(
        tree, body.c_str() + body_chunk_pos, current_chunk_sz);
    if (parse_status != MyHTML_STATUS_OK) {
      myhtml_tree_destroy(tree);
      return nullptr;
    }
    body_chunk_pos += current_chunk_sz;
  }
  return tree;
}

It is called with the following arguments:

myhtml_t* myhtml = myhtml_create();
myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);
std::string body = "<html><head><style>a</style></head><body>f</body></html>";
size_t chunk_sz = 13;
myhtml_tree_t* tree = Parse(myhtml, body, chunk_sz);

The result varies depending on build options.
In some cases the serialized tree looks like this:

<html><head><style>a</style></head><body>f</body></html></style></head><body></body></html>

In other cases it looks like this:

<html><head><style></style></head></html>

The expected output is:

<html><head><style>a</style></head><body>f</body></html>

After some investigation, I found that the issue is inside myhtml_tokenizer_state_rawtext_end_tag_name and involves token_node->raw_begin.
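To make the reproduction easier to reason about, here is a minimal sketch that splits the same body with the same arithmetic as Parse() and shows where the chunk boundaries fall. SplitChunks is a hypothetical helper written for this illustration; it is not part of MyHTML and does not call the parser.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of MyHTML): splits `body` into consecutive
// pieces of at most `chunk_sz` bytes, using the same arithmetic as Parse().
std::vector<std::string> SplitChunks(const std::string& body,
                                     std::size_t chunk_sz) {
  std::vector<std::string> chunks;
  if (chunk_sz == 0) return chunks;  // a zero chunk size would never advance
  std::size_t pos = 0;
  while (pos < body.size()) {
    // The last piece may be shorter than chunk_sz.
    std::size_t len = std::min(chunk_sz, body.size() - pos);
    chunks.push_back(body.substr(pos, len));
    pos += len;
  }
  return chunks;
}
```

For the body above and chunk_sz = 13, the pieces are `<html><head><`, `style>a</styl`, `e></head><bod`, `y>f</body></h` and `tml>`. The second piece ends in the middle of the `</style>` end tag name, which may be why the RAWTEXT end tag name state is the one affected.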


skapix · 2021-07-06T09:52:19Z

It looks like the Lexbor project does not have a similar issue. But it would still be nice to have this fixed here, since this is a standalone HTML5 parser.
