CSS selector finds nothing with invalid HTML #26

marciof · 2013-07-17T08:58:05Z

Since this example has invalid HTML, feel free to ignore this issue.

Anyway, here it is (simplified from http://www.weheart.co.uk/2013/02/18/alley-oop-design-exhibition/):

import cssselect
import lxml.html

d = lxml.html.document_fromstring('''
<!DOCTYPE html>
<html/>
<body></body>
''')

t = cssselect.HTMLTranslator()

print(d.xpath(t.css_to_xpath('body')))
print(d.xpath(t.css_to_xpath('body', prefix = '//')))

Just a bit unexpected that the first XPath query doesn't find anything.


SimonSapin · 2013-07-17T09:49:07Z

All cssselect is doing is giving you an XPath string:

>>> t.css_to_xpath('body')
'descendant-or-self::body'
>>> t.css_to_xpath('body', prefix = '//')
'//body'

These results are as expected: with the default prefix, a selector run on an element that is not the root only matches that element and its descendants.

Everything else is in lxml and libxml2, not cssselect. What matters for XPath is not whether the original HTML source is valid, but what the parsed tree looks like. Here you’re using libxml2’s HTML parser, which gives you a tree in a weird state.

>>> d
<Element html at 0x7f8bcc0e5b90>
>>> list(d)
[]
>>> d.getparent() is None
True
>>> d.getnext()
<Element html at 0x7f8bcc110350>
>>> list(d.getnext())
[<Element body at 0x7f8bcc110050>]

d is the root element of the tree (it has no parent, as expected), but it also has a sibling! (Very much unexpected.) This looks like a bug in libxml2’s parser.

In the meantime, try using the html5lib parser instead:

import html5lib

d = html5lib.parse('''
<!DOCTYPE html>
<html>
<body></body>
''', treebuilder='lxml', namespaceHTMLElements=False).getroot()

(I’m disabling namespaces here because of cssselect bug #9.)

SimonSapin closed this as completed Jul 17, 2013
