CSS selector finds nothing with invalid HTML #26

Closed
marciof opened this issue Jul 17, 2013 · 1 comment

marciof commented Jul 17, 2013

Since this example has invalid HTML, feel free to ignore this issue.

Anyway, here it is (simplified from http://www.weheart.co.uk/2013/02/18/alley-oop-design-exhibition/):

import cssselect
import lxml.html

d = lxml.html.document_fromstring('''
<!DOCTYPE html>
<html/>
<body></body>
''')

t = cssselect.HTMLTranslator()

print(d.xpath(t.css_to_xpath('body')))
print(d.xpath(t.css_to_xpath('body', prefix='//')))

Just a bit unexpected that the first XPath query doesn't find anything.

@SimonSapin
Contributor

All cssselect is doing is giving you an XPath string:

>>> t.css_to_xpath('body')
'descendant-or-self::body'
>>> t.css_to_xpath('body', prefix = '//')
'//body'

These results are as expected: with the default prefix, running this selector on an element that is not the root limits the results to that element and its descendants.

Everything else is in lxml and libxml2, not cssselect. What matters for XPath is not whether the original HTML source is valid, but what the parsed tree looks like. Here you’re using libxml2’s HTML parser, which gives you a tree in a weird state.

>>> d
<Element html at 0x7f8bcc0e5b90>
>>> list(d)
[]
>>> d.getparent() is None
True
>>> d.getnext()
<Element html at 0x7f8bcc110350>
>>> list(d.getnext())
[<Element body at 0x7f8bcc110050>]

d is the root element of the tree (it has no parent, as expected), but it also has a sibling! (Very much unexpected.) This looks like a bug in libxml2’s parser.
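One way to detect this situation in your own code is to check whether the root element has element siblings. The helper below is a hypothetical sketch, not part of lxml. It skips top-level comments and processing instructions, which lxml legitimately exposes as siblings of the root.

```python
import lxml.html


def stray_root_siblings(root):
    """Return elements that sit next to a parentless root element.

    Hypothetical helper for illustration; not an lxml API.
    """
    if root.getparent() is not None:
        return []
    siblings = list(root.itersiblings(preceding=True)) + list(root.itersiblings())
    # Comments and processing instructions have a non-string .tag, so this
    # keeps only real elements.
    return [s for s in siblings if isinstance(s.tag, str)]


# A well-formed document should have no stray siblings next to <html>.
well_formed = lxml.html.document_fromstring('<html><body></body></html>')
print(stray_root_siblings(well_formed))
```

Whether the snippet from the original report still produces a sibling `html` element depends on the libxml2 version in use, so the sketch only demonstrates the check on a well-formed document.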

In the meantime, try using the html5lib parser instead:

import html5lib

d = html5lib.parse('''
<!DOCTYPE html>
<html>
<body></body>
''', treebuilder='lxml', namespaceHTMLElements=False).getroot()

(I’m disabling namespaces here because of cssselect bug #9.)
