Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XPATH3.1: mimic handling of multiple root element nodes #2351

Open
wants to merge 28 commits into
base: master
Choose a base branch
from

Conversation

Constantin1489
Copy link
Contributor

@Constantin1489 Constantin1489 commented May 7, 2024

Obviously, some web server provides broken html.
The lxml and libxml2 fix it. It's good and indeed great!!! (We have been happy for decades!)

But, at the point, the error I want to solve occurs, the elementpath describes the DOM structure. it's because sometimes lxml or libxml2 returns multiple root element nodes when using html parser. (This could be a remnant of the browser wars. I don't remember the article but there were four kinds of html parser rules because of four major browsers.)

See also, https://gitlab.gnome.org/GNOME/libxml2/-/issues/716

So I mimicked it.
The test I included describes the point.

fixes #2318

@Constantin1489 Constantin1489 changed the title mimic several root element nodes handling mimic multiple root element nodes handling May 7, 2024
@Constantin1489 Constantin1489 changed the title mimic multiple root element nodes handling XPATH3.1: mimic multiple root element nodes handling May 7, 2024
requirements.txt Outdated
@@ -55,7 +55,7 @@ beautifulsoup4
lxml >=4.8.0,<6

# XPath 2.0-3.1 support - 4.2.0 broke something?
elementpath==4.1.5
elementpath==4.4.0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is time to upgrade?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is time to upgrade?

Sure, if the tests pass it's OK

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change was required to fix this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR(#2351) uses fragment=True option, >=4.1.5 won't work. and 4.2.0 has another problem. So minimum is 4.2.1

new_root.append(node)
return new_root, True

return root, False
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, existing working xpath: won't be affected

@Constantin1489 Constantin1489 marked this pull request as draft May 7, 2024 16:33
@Constantin1489 Constantin1489 changed the title XPATH3.1: mimic multiple root element nodes handling XPATH3.1: mimic handling of multiple root element nodes May 7, 2024
@Constantin1489 Constantin1489 marked this pull request as ready for review May 7, 2024 16:59
])
def test_broken_DOM_01(html_content, xpath, answer):
# In normal situation, DOM's root element node is only one. So when DOM violation happens, Exception occurs.
with pytest.raises(Exception):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally add this test to reproduce the problem.
And, in the future, libxml2 may implement "html5"(https://gitlab.gnome.org/GNOME/libxml2/-/issues/211). As I posted the issue, this problem will be gone, and this test will fail. The day, please remove these tests.

@pytest.mark.parametrize("html_content", [DOM_violation_two_html_root_element])
@pytest.mark.parametrize("xpath, answer", [
("/html/body/p[1]", "First paragraph."),
("/html/body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the critical point. why do I choose one element in the browser inspect window, but lxml returns two? Because there are two html tag elements and two body tag elements.

<p>First paragraph.</p>
</body>
</html>
<html>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second html root element.

@dgtlmoon
Copy link
Owner

dgtlmoon commented May 8, 2024

As an idea, what about having this enabled by default as a config option?

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented May 13, 2024

Valid HTML DOM are all alike; every non-valid HTML DOM is not valid and unhappy in its own way.

Since the suggestion will add another synthetic root element by default, the point is the initial context item.

For that, three possible options exist.

  1. the first is that (fragment True and the context item is the new root element).

  2. the second is that (fragment True and, the context item is dynamic). The latter means that if only one root element exists, then that is the context item. when multiple root elements exist, the context item will be the new_root element. So the second is meaningless. it would be the same with current PR.

  3. the third one is that fragment True and don't care about the number of root element as a context item and select one of them. It will reduce the accessible scope of information for multiple root element cases. this is unacceptable.

One may say that I chose the wrong html parser and I should choose etree.HTML. But, in this case the "html" tag root element node can have html tag as a child. In this case "/html/html"

To be honest, all sorts of solutions seem intuitively unfair even with mine. But libxml2 will develop html5.

As I mentioned, selecting one of the html "root element nodes" as a context item is not an option.

So, in the first method, the context item is one depth deeper(?) for all cases. (https://github.com/dgtlmoon/changedetection.io/actions/runs/9057420996/job/24881401637)
e.g. manager[@name = 'Godot'] -> branches_to_visit/manager[@name = 'Godot']
So, making the default new root element node for all cases makes sense when the context item is the "new_root" element.

@dgtlmoon
Copy link
Owner

do we need this one? it could help #2175 , thoughts?

@Constantin1489
Copy link
Contributor Author

the minimum version of elementpath is 4.2.1 because of https://github.com/Constantin1489/changedetection.io/actions

@dgtlmoon
Copy link
Owner

Looks OK to me, i guess it doubles the CPU usage for checking a watch right?

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented May 17, 2024

Good point. Since the new function will inevitably increase the usage, I chose just another method to increase speed.
in my test, 507427 function calls (502074 primitive calls) in 0.374 seconds becomes 12909 function calls (12291 primitive calls) in 0.007 seconds

In the 138 tests in CI, it was 0.39 sec(https://github.com/dgtlmoon/changedetection.io/actions/runs/9107046053/job/25035224418#step:9:3365). now it has become 0.31 sec(https://github.com/dgtlmoon/changedetection.io/actions/runs/9126805653/job/25095800223?pr=2351#step:9:3367) like this function doesn't exist.(https://github.com/dgtlmoon/changedetection.io/actions/runs/9124493905/job/25088778195#step:9:3292)

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented May 20, 2024

BTW, I couldn't find any evidence that lxml parses the content again.
Could you provide me an example of double CPU usage?

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented May 20, 2024

BTW, the reason why I don't do lazy import is because it is Python. I'm not an expert of php but a SOF user said this point (https://stackoverflow.com/a/10084940/20307768)

@dgtlmoon
Copy link
Owner

could you merge current master into this branch so we can test again? thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

'str' object has no attribute '__name__' error on some xpath filters
2 participants