Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Searching for the term C++ is broken #8123

Closed
hofmandl1 opened this issue Aug 14, 2020 · 5 comments
Closed

Searching for the term C++ is broken #8123

hofmandl1 opened this issue Aug 14, 2020 · 5 comments

Comments

@hofmandl1
Copy link
Contributor

hofmandl1 commented Aug 14, 2020

Describe the bug

Instead of searching for C++ sphinx searches for C (two spaces after the C)

I've done some investigation on the matter:

var key = jQuery.urldecode(tmp[0]);

These two lines

    var key = jQuery.urldecode(tmp[0]);
    var value = jQuery.urldecode(tmp[1]);

do too much decoding. Instead I suppose that https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/decodeURIComponent should be used.

To Reproduce

Go to https://www.sphinx-doc.org/en/master/search.html?q=C%2B%2B

On the sphinx search page In your browser's Javascript Console, type

jQuery.urldecode('C%2B%2B') // e.g. a user wants to search for the language `C++`, leading for the URI to be search.html?q=C%2B%2B

-> C
so sphinx will actually search for C instead of C++.

Expected behavior

I would like sphinx to search for C++

@hofmandl1
Copy link
Contributor Author

Hmm .. I just saw, that this happens on purpose, somewhat above:

jQuery.urldecode = function(x) {
  return decodeURIComponent(x).replace(/\+/g, ' ');
};

@hofmandl1
Copy link
Contributor Author

When trying to fix it locally, I ran into:

jquery.js:2 Uncaught SyntaxError: Invalid regular expression: /c++/: Nothing to repeat
    at String.match (<anonymous>)
    at Object.performTermsSearch (searchtools.js:396)
    at Object.query (searchtools.js:205)
    at Object.performSearch (searchtools.js:146)
    at Object.init (searchtools.js:79)
    at HTMLDocument.<anonymous> (searchtools.js:503)
    at e (jquery.js:2)
    at t (jquery.js:2)

which I fixed using the attached patch

searchtools.js.escape_search_regex.patch.txt

@hofmandl1 hofmandl1 changed the title Small bug when decoding query string for search page in doctools.js %2B is decoded to Space when decoding query string for search page in doctools.js Aug 14, 2020
@hofmandl1
Copy link
Contributor Author

hofmandl1 commented Aug 14, 2020

Ok, Now I see, why

jQuery.urldecode = function(x) {
  return decodeURIComponent(x).replace(/\+/g, ' ');
};

was done.
It needs to be just the other way round:

jQuery.urldecode = function(x) {
  return decodeURIComponent(x.replace(/\+/g, ' '));
};

also see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/decodeURIComponent#Decoding_query_parameters_from_a_URL

@hofmandl1 hofmandl1 changed the title %2B is decoded to Space when decoding query string for search page in doctools.js Searching for the term C++ is broken Aug 14, 2020
@tk0miya
Copy link
Member

tk0miya commented Aug 14, 2020

Thank you for reporting. But, unfortunately, our indexer does not support a word containing "+". So it will be failed even if we fixed the decoding logic because search-index does not contain "C++" as a keyword.

@hofmandl1
Copy link
Contributor Author

we have defined a custom search-language with a custom splitter that is very simple and doesn't throw away punctutations, thus in our (commercial) project we can now nicely search for C++:

    def split(self, content_snippet: str) -> List[str]:
        """Especially the "splitting" of "en" is way too agressive for
           technical documentation. It splits search terms on all kinds
           of punctuations, that all too often are parts of search-terms
           our customers use."""

        # !!! This must match the javascript implementation as closely as possible !!!
        return [x for x in content_snippet.split(' ') if x]

we also are locally applying 2 more patches to searchtools.js removing the feature that excludes searchterms with a leading - and the one that excludes digit-only search-terms in order to make the search more useful for us.

The leading - feature bit us hard, because we want to document things like command-line switches. Digit-only search terms are also quite common I guess e.g. when searching for a keyword in combination with a version number or similar.

I would be willing to start a PR dumping in these changes if someone is interested, however I don't think I'd have the time to get this merge-ready.

hofmandl1 added a commit to hofmandl1/sphinx that referenced this issue Aug 19, 2020
…ic html theme search

Note, that the default splitter will not index +, so this isn't of much of much use, unless
the splitter of the search-language is reconfigured.
@tk0miya tk0miya added this to the 3.5.0 milestone Jan 25, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants