
HTML search results: identical summaries displayed for different link targets #11943

Closed
jayaddison opened this issue Feb 4, 2024 · 4 comments · Fixed by #11944

Comments

@jayaddison
Contributor

Describe the bug

My understanding (which could be mistaken) of the html_show_search_summary feature (enabled by default) is that each search result displayed to the user should be accompanied by a relevant snippet of the target content, to help the user determine which result(s) are most relevant.

When the feature is enabled, the search page makes an HTTP request for each result's HTML page and uses JavaScript to extract and display a portion of its content.
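That per-result flow can be sketched roughly as follows. This is a hypothetical illustration with invented helper names (`displaySummaries`, `fetchText`, `fakeFetch`), not the actual code in Sphinx's `searchtools.js`, and it is synchronous for simplicity where the real code fetches pages asynchronously:

```javascript
// Hypothetical sketch of the per-result flow (synchronous for illustration;
// the real Sphinx code fetches each page asynchronously).
// fetchText is injected so the example runs without a network.
const displaySummaries = (resultUrls, fetchText) =>
  resultUrls.map((url) => {
    const html = fetchText(url);                    // one HTTP request per result
    const text = html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
    return text.slice(0, 240);                      // crude snippet for illustration
  });

// Stand-in fetcher returning canned HTML:
const fakeFetch = (url) => `<p>Contents of ${url} mentioning test.</p>`;

console.log(displaySummaries(["a.html", "b.html"], fakeFetch));
// [ 'Contents of a.html mentioning test.', 'Contents of b.html mentioning test.' ]
```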

How to Reproduce

Build the Sphinx documentation locally and then run a basic webserver to host it:

sphinx.git $ sphinx-build -b html doc _build 
sphinx.git $ cd _build
sphinx.git $ python -m http.server -b 127.0.0.1

Open http://127.0.0.1:8000 in a web browser, and perform a search for the query term test.

Observe that multiple results with separate HTML anchor targets appear in the search results. However, they all display the same search snippet; this is not useful, and it is the bug.

(Screenshot omitted: search results showing the same snippet repeated for different anchor targets.)

Environment Information

Platform:              linux; (Linux-6.6.13-amd64-x86_64-with-glibc2.37)
Python version:        3.11.7 (main, Dec  8 2023, 14:22:46) [GCC 13.2.0]
Python implementation: CPython
Sphinx version:        7.2.6
Docutils version:      0.20.1
Jinja2 version:        3.1.3
Pygments version:      2.17.2

Sphinx extensions

No response

Additional context

Discovered during investigation of #11942.

@jayaddison
Contributor Author

Here's the code for the relevant helper method:

/**
* helper function to return a node containing the
* search summary for a given text. keywords is a list
* of stemmed words.
*/
makeSearchSummary: (htmlText, keywords) => {
const text = Search.htmlToText(htmlText);
if (text === "") return null;
const textLower = text.toLowerCase();
const actualStartPosition = [...keywords]
.map((k) => textLower.indexOf(k.toLowerCase()))
.filter((i) => i > -1)
.slice(-1)[0];
const startWithContext = Math.max(actualStartPosition - 120, 0);
const top = startWithContext === 0 ? "" : "...";
const tail = startWithContext + 240 < text.length ? "..." : "";
let summary = document.createElement("p");
summary.classList.add("context");
summary.textContent = top + text.substr(startWithContext, 240).trim() + tail;
return summary;
},
};
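To make the symptom concrete: the summary is computed from the page text and the keywords only, and the per-result anchor never enters the computation, so two hits that differ only by anchor display the identical snippet. A toy model of this (hypothetical code with invented names such as `summaryFor` and `snippetForHit`, not Sphinx's implementation):

```javascript
// Toy model of the current behaviour (hypothetical, not Sphinx code):
// the summary depends only on the page HTML and the keywords.
const summaryFor = (pageHtml, keywords) => {
  const text = pageHtml.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
  const hit = text.toLowerCase().indexOf(keywords[0].toLowerCase());
  const from = Math.max(hit - 120, 0);
  return text.slice(from, from + 240);
};

const pageHtml =
  "<h1>Testing</h1><p>Run the tests.</p><h2 id='advanced'>Advanced</h2><p>Advanced test setup.</p>";

// Two hits on the same page, differing only in their anchor:
const hitA = { url: "testing.html", anchor: "" };
const hitB = { url: "testing.html", anchor: "#advanced" };
const snippetForHit = (hit, keywords) => summaryFor(pageHtml, keywords); // hit.anchor is unused
console.log(snippetForHit(hitA, ["test"]) === snippetForHit(hitB, ["test"])); // true
```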

@jayaddison
Contributor Author

I had optimistically hoped that this issue might be a regression somewhere in the makeSearchSummary code (for example, that anchors were no longer identified correctly). That doesn't appear to be the case; we don't attempt to identify anchor elements in the returned result-page-HTML data currently.

That could make sense; parsing the HTML of multiple results pages client-side could be an expensive operation. But it does mean that locating the anchors -- and relevant text near them -- would be required, and could require some care.
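One possible shape for anchor-aware extraction is sketched below. This is purely illustrative (the helper name `textNearAnchor` is invented, and a browser implementation would more likely use DOMParser); it also assumes the anchor id contains no regex metacharacters:

```javascript
// Hypothetical sketch of anchor-aware snippet extraction (not Sphinx code).
// Locates the element carrying the anchor id and returns the text that
// follows it, so each result could get an anchor-specific snippet.
const textNearAnchor = (html, anchor, width = 240) => {
  const match = new RegExp(`id=["']${anchor}["']`).exec(html);
  if (match === null) return null;
  const afterTag = html.indexOf(">", match.index) + 1; // skip past the anchored tag
  const text = html
    .slice(afterTag)
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  return text.slice(0, width);
};

const page =
  "<h1 id='intro'>Intro</h1><p>First part.</p><h2 id='details'>Details</h2><p>Second part.</p>";
console.log(textNearAnchor(page, "details")); // "Details Second part."
```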

I think I'll pause at this point and ask for advice from more experienced Sphinx developers.

@wlach
Contributor

wlach commented Feb 4, 2024

> That could make sense; parsing the HTML of multiple results pages client-side could be an expensive operation. But it does mean that locating the anchors -- and relevant text near them -- would be required, and could require some care.

We do this already, so I think it's fine (the amount of computation/time needed to parse out the HTML is almost always going to be overwhelmed by the latency in requesting the HTML). See #11944 for what I think is the right solution.

@jayaddison
Contributor Author

> > That could make sense; parsing the HTML of multiple results pages client-side could be an expensive operation. But it does mean that locating the anchors -- and relevant text near them -- would be required, and could require some care.
>
> We do this already, so I think it's fine (the amount of computation/time needed to parse out the HTML is almost always going to be overwhelmed by the latency in requesting the HTML). See #11944 for what I think is the right solution.

Brilliant - I didn't realize that we already did that. I'll add a question about the content we retrieve relative to each anchor in that pull request.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 29, 2024