Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cssselect can't work on firefox #27

Open
bukzor opened this issue Jul 30, 2013 · 11 comments
Open

cssselect can't work on firefox #27

bukzor opened this issue Jul 30, 2013 · 11 comments
Labels

Comments

@bukzor
Copy link

bukzor commented Jul 30, 2013

Firefox is unlike other xpath implementations in that name() returns an upper-cased string. cssselect's translation of the nth-child selector (for example) uses "name() = 'foo'" which will never possibly match, due to the above oddity. One workaround is to set HTMLTranslator.lower_case_element_names to False, and write selectors like 'LI:nth-child(2)', which will result in a working xpath for firefox, but won't work on any other xpath implementation.

I see two possible solutions:

  1. Call lower-case() wherever name() is called.
  2. Factor out the use of name() entirely, replacing [name() = 'foo'] with [self::foo].

Demonstration of the problem and solution here: (the contrast between chrome and firefox is stark, ie is like chrome)
http://fiddle.jshell.net/J7VrG/10/show/light/

@SimonSapin
Copy link
Contributor

First, I know this is probably not gonna be helpful, but I’m curious why you want to use cssselect to use XPath in a browser that supports document.querySelectorAll()?

As to the bug itself, I will gladly review and merge a pull request. But to be honest, I’m a bit tired of XPath. There are so many corner cases like this that are just wrong in cssselect that I don’t really want to invest any more time in the "Selectors → XPath" approach.

@bukzor
Copy link
Author

bukzor commented Jul 30, 2013

Thanks for replying.

Our use case is a bit involved. We use selenium for testing, which only supports css2 selectors. It also supports xpath, so our attempt to bolt-on css3 support was to use cssselect, but the above bug prevents us from using this approach.

I have a branch which replaces the use of [name() = 'e'] with [self::e], but it causes xpath to explode for tag names such as "h\a0 ref". I personally think this is reasonable and appropriate, but wanted to get your feedback since it seems someone went through some trouble to make sure pathological tags wouldn't crash.

The branch in question is my "no_name" branch. All tests pass, but I had to convert some assertions to assertRaises.

@bukzor
Copy link
Author

bukzor commented Jul 31, 2013

I've been looking at this a bit more.
I believe the core of the issue is the is_safe_name regex, which is intentionally written as a too-strict check for the xml Name (http://www.w3.org/TR/REC-xml/#NT-Name).

I believe the Right fix to this issue, which doesn't use name(), is case-insensitive, and doesn't allow cssselect to generated invalid xpath expressions is to update this regex to fully conform to the xml Name spec, and throw an ExpressionError when it doesn't match.

Please give me your opinions of this idea.
If you agree, I'll provide the patch and tests.

@SimonSapin
Copy link
Contributor

I don’t see how changing is_safe_name would help, this is not the only cases when name() is used. (See add_name_test().)

@bukzor
Copy link
Author

bukzor commented Aug 1, 2013

The name-test uses name() == 'x' because we can't be sure that self::x will be valid xpath for all x, such as u'h\xa0ref' or u'h]ref'.

However, if I improve the safe_name regex to match the standard, we can detect invalid names, and use self::x syntax with confidence. The improved regex would be used to throw a cssselect.xpath.ExpressionError for identifiers such as 'h]ref'.

I just want some sign that you'd consider such a change before I do the work.

@SimonSapin
Copy link
Contributor

Again, this doesn't seem like it would fix the original issue since this not the only cases where name() is used.

@SimonSapin
Copy link
Contributor

How about using upper case when HTMLTranslator.lower_case_element_name is set to 'upper', or something like that?

@bukzor
Copy link
Author

bukzor commented Aug 5, 2013

It would need to vary upper and lower depending on which xpath implementation the expression will be sent to. I'd rather not go down that path if possible.

It sounds like you don't hate the idea, you're just not sure it will work. Let me work up a rough branch and show my work. I think all the uses of name() can be refactored, if we have a regex that can reliably determine whether a string can be used in an expression like self::x.

@SimonSapin
Copy link
Contributor

To clarify: I’m pretty sure it will not work in all cases (especially in very much non-obvious corner cases), and I don’t like the idea as I find XML’s Name grammar production very silly.

@bukzor
Copy link
Author

bukzor commented Aug 6, 2013

I agree that it will be hard to validate xml Qnames in all the edge cases, but I believe I can do it using a similar style to what you've done for the css grammar. I should be able to find a good listing of edge-case Qnames in some open-source project's test suite (libxml2?).

@SimonSapin
Copy link
Contributor

It's not that it's hard, it's that it doesn't fix your issue.

@Gallaecio Gallaecio added the bug label May 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants