Element wrong location level error handling #132

Open
elekt opened this issue Mar 19, 2018 · 11 comments

Comments

@elekt

elekt commented Mar 19, 2018

I am working on a project that parses html and replaces href attributes.
If the HTML is invalid because an <a> tag appears where a table cell (e.g. <td>) is expected, myhtml_insertion_mode_in_table handles the parse error by "foster parenting" and calls myhtml_insertion_mode_in_body with the <a> token.

The problem is that, as a result, the node seems to be added twice when I loop through the tree's nodes. The clone is added in myhtml_tree_active_formatting_reconstruction.

See the minimal html to reproduce:
testminimal_github.txt

In my application I throw away the copy of the node, but when this happens the href (link 1 in the example) for some reason remains unchanged. It also disrupts the order in which I get the nodes with node = myhtml_node_next(node). I would like to fix this bug in myhtml, and I would appreciate some help.

I am not looking to fix the invalid HTML, but to make sure that every href link is changed and the structure stays the same.

@lexborisov
Owner

Hi!
I'll deal with this soon.
Thanks!

@lexborisov
Owner

I'm trying to understand the problem, but I do not see it yet.
Actually, the specification requires this behavior. Try to see how this example is handled in a modern browser.

@EmielBruijntjes
Contributor

Elekt is my colleague. Our use case is the following:

  1. Parse input HTML
  2. Modify some attributes
  3. Regenerate the HTML, but keep it as close to the original input HTML as possible (so without fixing it, adding more nodes, et cetera)
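One way to keep step 3 close to the input is to patch the original buffer instead of re-serializing the tree. The following is only a sketch under that assumption: `splice_edit` and `splice_apply` are hypothetical names, not part of myhtml, and the byte offsets are assumed to come from the source positions the parser records for attribute values.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical edit record: replace `length` bytes at offset `begin` with `text`. */
typedef struct {
    size_t      begin;
    size_t      length;
    const char *text;
} splice_edit;

/* Apply edits to src. The edits must be sorted by `begin` and must not overlap.
 * Returns a malloc'ed NUL-terminated string, or NULL if allocation fails. */
static char *splice_apply(const char *src, const splice_edit *edits, size_t n)
{
    size_t src_len = strlen(src);
    size_t out_len = src_len;
    for (size_t i = 0; i < n; i++)
        out_len = out_len - edits[i].length + strlen(edits[i].text);

    char *out = malloc(out_len + 1);
    if (out == NULL)
        return NULL;

    size_t in = 0;   /* read offset in src  */
    size_t pos = 0;  /* write offset in out */
    for (size_t i = 0; i < n; i++) {
        /* Copy the untouched bytes before this edit verbatim. */
        size_t keep = edits[i].begin - in;
        memcpy(out + pos, src + in, keep);
        pos += keep;

        /* Write the replacement and skip the replaced bytes in src. */
        size_t text_len = strlen(edits[i].text);
        memcpy(out + pos, edits[i].text, text_len);
        pos += text_len;
        in = edits[i].begin + edits[i].length;
    }

    /* Copy whatever follows the last edit. */
    memcpy(out + pos, src + in, src_len - in);
    out[out_len] = '\0';
    return out;
}
```

Because everything outside the edited ranges is copied byte for byte, invalid markup in the input survives unchanged. The approach only works if each edit is applied once, which is why duplicated nodes have to be recognized and skipped.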

Is there a way to find out whether a node was artificially added by myhtml? We currently check whether position.length == 0, but this does not work in the example given above.

@EmielBruijntjes
Contributor

EmielBruijntjes commented Mar 28, 2018

There is a "flags" member in myhtml_tree_node_t, but it looks like it is not really in use. It would be nice if this member could be set to a special value that user space programs can inspect to check whether a node is (for example):

  • a real node that comes from the input HTML
  • an artificial node that was created by myhtml to fix a broken tree
  • a node that was moved to a different location in the tree to fix things
  • a node that was duplicated and added to the tree to fix things (like the links in the above example)
  • a node that was later modified by the user space program (like having a modified attribute)
  • a mismatched node (like not linked to a closing node)
  • a node that was opened and closed in one tag (like <br/>)
  • et cetera

For our own use case it would already be very helpful if we could recognize "artificial" nodes, so that we can skip them when we regenerate the source code.
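The categories listed above could be modeled as a bitmask, so that one node can carry several of them at once. This is a minimal sketch of the idea only: every name below is hypothetical and none of them exists in myhtml.

```c
#include <stdbool.h>

/* Hypothetical provenance flags, one bit per category from the list above. */
typedef enum {
    NODE_ORIGIN_SOURCE      = 0,      /* real node from the input HTML         */
    NODE_ORIGIN_ARTIFICIAL  = 1 << 0, /* created by the parser to fix the tree */
    NODE_ORIGIN_MOVED       = 1 << 1, /* relocated to fix the tree             */
    NODE_ORIGIN_CLONED      = 1 << 2, /* duplicate of another node             */
    NODE_ORIGIN_USER_EDIT   = 1 << 3, /* modified by the user space program    */
    NODE_ORIGIN_MISMATCHED  = 1 << 4, /* no matching closing tag               */
    NODE_ORIGIN_SELF_CLOSED = 1 << 5  /* opened and closed in one tag          */
} node_origin_flags;

/* Decide whether a node should be written out when regenerating the source.
 * Artificial and cloned nodes have no text of their own in the input, so
 * they are skipped; all other categories are kept. */
static bool node_should_be_emitted(unsigned flags)
{
    return (flags & (NODE_ORIGIN_ARTIFICIAL | NODE_ORIGIN_CLONED)) == 0;
}
```

With such a scheme, a moved node that the application later edited would carry `NODE_ORIGIN_MOVED | NODE_ORIGIN_USER_EDIT` and would still be emitted.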

@lexborisov
Owner

lexborisov commented Apr 3, 2018

I found a bug. For a cloned element we need pos.len = 0, but at the moment it contains garbage. This needs to be fixed.

@elekt
Author

elekt commented Apr 10, 2018

Can you elaborate a bit more?
I assume it needs to be set in myhtml_tree_node_clone.

@lexborisov
Owner

It seems not. I will try to deal with this today.
We need to understand at what point the cloned nodes get garbage in their position values.
Position values in cloned nodes must be zero.
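The requirement that cloned nodes carry zeroed positions can be illustrated with a small stand-alone sketch. The types and functions here (`src_pos`, `demo_node`, `demo_node_clone`, `demo_node_is_synthetic`) are hypothetical stand-ins and do not reflect the actual myhtml structures.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical source position: where a token sits in the input buffer. */
typedef struct {
    size_t begin;
    size_t length;  /* 0 means "no corresponding source text" */
} src_pos;

/* Hypothetical, heavily simplified tree node. */
typedef struct {
    int     tag_id;
    src_pos raw_pos;      /* position of the whole tag */
    src_pos element_pos;  /* position of the tag name  */
} demo_node;

/* Clone a node for tree repair. The copy starts fully zeroed, so no
 * position field can hold leftover garbage; only the semantic fields
 * are copied from the original afterwards. */
static demo_node demo_node_clone(const demo_node *src)
{
    demo_node copy;
    memset(&copy, 0, sizeof(copy));
    copy.tag_id = src->tag_id;
    return copy;
}

/* A node without source text is treated as parser-generated. */
static int demo_node_is_synthetic(const demo_node *node)
{
    return node->raw_pos.length == 0;
}
```

Zeroing first and copying selectively is a deliberate choice here: a field added to the struct later is zero in clones by default, rather than inheriting a stale value.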

@EmielBruijntjes
Contributor

Hello @lexborisov, do you need more info or help in any form?

@lexborisov
Owner

@EmielBruijntjes
I understand the task, but it will take time. In enum myhtml_tree_node_flags we need to create something like MyHTML_TREE_NODE_CLONE and MyHTML_TREE_NODE_MOVED.

For use:

if (node->type & (MyHTML_TREE_NODE_CLONE|MyHTML_TREE_NODE_MOVED)) {
...
}
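To show how a user space program might use such flags, here is a stand-alone sketch of a tree walk. It is an assumption-laden illustration: the `DEMO_` names, their values, and the `walk_node` struct are hypothetical, and the flags are read from a `flags` member (as suggested earlier in this thread) rather than from `type`.

```c
#include <stddef.h>

/* Stand-ins for the proposed flags; the values are arbitrary. */
enum demo_tree_node_flags {
    DEMO_TREE_NODE_UNDEF = 0,
    DEMO_TREE_NODE_CLONE = 1 << 4,
    DEMO_TREE_NODE_MOVED = 1 << 5
};

/* Hypothetical, heavily simplified tree node. */
typedef struct walk_node {
    unsigned          flags;
    struct walk_node *child;  /* first child   */
    struct walk_node *next;   /* next sibling  */
} walk_node;

/* Count the nodes that originate from the input, depth first.
 * A cloned element is not counted itself, but its children are still
 * visited, because they may be real nodes from the source. Moved nodes
 * are counted, since they exist in the input at another location. */
static size_t count_source_nodes(const walk_node *node)
{
    size_t count = 0;
    for (; node != NULL; node = node->next) {
        if ((node->flags & DEMO_TREE_NODE_CLONE) == 0)
            count++;
        count += count_source_nodes(node->child);
    }
    return count;
}
```

Whether children of a clone should be visited depends on how the parser attaches them; this sketch assumes they come from the input.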

@EmielBruijntjes
Contributor

@lexborisov Is there anything that I can do to help you here? It's a feature that we would really like to have.

@lexborisov
Owner

Sorry, but I cannot do anything about this in the current project. The cloned elements would just need to be marked somehow, but I do not want to spend the time on that.
