HTML Search: Fix partial matches overwriting full matches #11958

wlach · 2024-02-06T14:25:10Z

Subject: Fix partial matches overwriting full matches for the 0th document

Feature or Bugfix

Bugfix

Purpose

Fixes match scores being artificially lower than they should be.

Detail

Due to a small bug, we would overwrite full matches with the partial match score for terms that occur only in the first document (index 0). This is a trivial bug, but it makes testing more difficult.

Relates

Closes HTML Search: Partial matches can overwrite main matches in certain rare cases #11957

wlach · 2024-02-06T14:25:38Z

sphinx/themes/basic/static/searchtools.js

@@ -466,14 +466,18 @@ const Search = {
      // add support for partial matches
      if (word.length > 2) {
        const escapedWord = _escapeRegExp(word);
-        Object.keys(terms).forEach((term) => {
-          if (term.match(escapedWord) && !terms[word])


terms[word] will be falsy (0) for the first document.

@wlach based on index-format findings (and test coverage), should this be rephrased to:

terms[word] will be falsy (0) for a word that only exists in document zero

(kinda wordy, but captures the fact that [0, 1], for example, is not falsy)

(edited with a correction and for conciseness)
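A minimal sketch of the distinction discussed in this comment, assuming a toy index in which a term maps either to a single document id or to an array of ids. The helper names `looksMissing` and `isMissing` and the sample terms are hypothetical and do not come from searchtools.js.

```javascript
// Hypothetical miniature index: a term maps to one document id,
// or to an array of ids when it occurs in several documents.
const terms = { type: 0, you: [0, 3, 11], other: 7 };

// Truthiness check (the pre-fix style): a stored id of 0 looks like "missing".
const looksMissing = (index, word) => !index[word];

// Ownership check (the style used by the fix): independent of the stored value.
const isMissing = (index, word) =>
  !Object.prototype.hasOwnProperty.call(index, word);

console.log(looksMissing(terms, "type"), isMissing(terms, "type"));     // true false
console.log(looksMissing(terms, "you"), isMissing(terms, "you"));       // false false
console.log(looksMissing(terms, "absent"), isMissing(terms, "absent")); // true true
```

Only the first case differs: a term whose sole document has id 0 is treated as absent by the truthiness check, which is how a partial-match score could replace the full-match score.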

wlach · 2024-02-06T14:26:13Z

sphinx/themes/basic/static/searchtools.js

-          if (term.match(escapedWord) && !titleTerms[word])
-            arr.push({ files: titleTerms[word], score: Scorer.partialTitle });
-        });
+        if (!terms.hasOwnProperty(word)) {


It is marginally more efficient to do this check outside the loop too.

Unless the performance improvement is significant, I think it'd be better to leave the conditional check after the && clause.

(That's based on the fact that if the first condition fails, the clause after the && isn't evaluated. If moving the property check eliminates a lot of regex matches, then fair enough, but this seems intended primarily as a correctness fix, so I'd lean towards making the logical diff as small as possible.)

I'm going to push back on this a bit: I think this small improvement is in scope (since we're changing the logic anyway) and the diff here is relatively small and easy to understand. Moving the property check should indeed reduce the number of regex matches a fair bit.

Ok, no problem. I think we'll have to agree to disagree on that.

I'd prefer if you could avoid the extraction of an outer if statement here. When viewing the diff, the larger set of line changes obscures what the bugfix is; and that's based on my assessment today, after having understood the fix a couple of weeks ago. It'd be trickier for someone who wasn't familiar with the original pull request, or looking at it again after a longer duration of time.

Yes, I understood your point, I just don't agree with it in this case. I have more to say on this, but I think we've gone out of scope of a review. If you want to talk about it more, feel free to email me.

Thanks for explaining. From my perspective, bug report, code review and related discussion are often useful information when trying to understand a commit/change, so I'd prefer to keep the discussion public; in this case I do think that the size of the line diff is relevant to code review.

Would you be OK with me drafting an alternative PR? I don't know if it'd be more likely to be accepted for merge, and it's OK if it isn't, but I'd like to present it as an alternative.

I don't think I have more to say on the technical front. But just to state it one more time so it's clear: I think this (small) PR as it stands fixes the issue in a more fundamental way and improves performance so it's my strong recommendation that it be the final state of the code. I don't feel the benefit of breaking it up into two PRs is worth the hassle and churn (leaving aside that this discussion has now occupied more effort than that would entail...).

It's not clear to me when someone who can actually merge the code will look at these changes. I can't stop you from proposing an alternative, but I think the best thing is probably just to wait for a maintainer to break our impasse. I am willing to make the change you suggest if it will actually get this merged and my PR is open to updates from the owners of this repository.

Ok, after finding more time to understand the code, I concede that the performance improvements here are important. I've opened #12045 with a description of what I believe the problem is.

Also, compared to my suggested alternative #12040, I think that the use of the JavaScript hasOwnProperty function is probably better/safer code style, making this pull request preferable.

@wlach could you edit the description of this pull request (#11958) to add 'Resolves #12045'?

(and thanks for your patience!)

I wouldn't worry much about this. Here, it's more important to use hasOwnProperty instead of !terms[word], which could lead to incorrect boolean evaluations (and personally, if we can get an easy-to-write improvement by hoisting a condition out of a loop, that's fine with me).
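The performance point debated in this thread can be sketched as follows. This is an illustration under stated assumptions, not the searchtools.js code: the function names, the `regexCalls` counter and the sample index are hypothetical, and the words are plain strings with no regex metacharacters, so no escaping step is shown.

```javascript
// Variant A: ownership check inside the loop, after the regex match.
function partialMatchesInLoop(index, word) {
  let regexCalls = 0;
  const hits = [];
  Object.keys(index).forEach((term) => {
    regexCalls += 1; // one regex match per indexed term, always
    if (term.match(word) && !Object.prototype.hasOwnProperty.call(index, word))
      hits.push(term);
  });
  return { hits, regexCalls };
}

// Variant B: ownership check hoisted outside the loop.
function partialMatchesHoisted(index, word) {
  let regexCalls = 0;
  const hits = [];
  if (!Object.prototype.hasOwnProperty.call(index, word)) {
    Object.keys(index).forEach((term) => {
      regexCalls += 1; // only reached when the word has no full match
      if (term.match(word)) hits.push(term);
    });
  }
  return { hits, regexCalls };
}

const index = { type: 0, types: [1, 2], typing: 3, other: 4 };
console.log(partialMatchesInLoop(index, "type"));  // no hits, 4 regex calls
console.log(partialMatchesHoisted(index, "type")); // no hits, 0 regex calls
console.log(partialMatchesHoisted(index, "typ"));  // 3 hits, 4 regex calls
```

Both variants return the same hits; the hoisted form simply skips the whole scan when the word already has a full match, which is the saving argued for above.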

wlach · 2024-02-06T14:26:28Z

tests/js/searchtools.js

@@ -21,7 +21,7 @@ describe('Basic html theme search', function() {
        "&lt;no title&gt;",
        "",
        null,
-        2,
+        5,


With this change, the score gets boosted to 5 (what it should be).

jayaddison · 2024-02-27T18:07:15Z

@wlach could you merge the latest changes from the master branch into this one? That should resolve the Windows test failures. From what I remember of the plan, this is to be merged before #11942.

wlach · 2024-02-27T19:55:35Z

Done

jayaddison · 2024-03-01T17:30:42Z

I think that the problem here is due to an incorrect test fixture in our test suite; I've added one or two more details in the linked bugreport (#11957) and as a result am closing this too. Please re-open with supporting details if I'm mistaken.

wlach · 2024-03-01T23:29:54Z

@jayaddison It looks to me like the index generated by Sphinx only uses an array for the term entries if there are multiple matches. See this example:

https://repo-parser-demo.netlify.app/searchindex.js

And note the output here:

{ ..., "terms": {"thi": [0, 3, 4, 5, 10, 12], "i": [0, 6, 10, 12], "an": [0, 7], "type": 0, "you": [0, 3, 11], "might": 0, "abl": 0, ... }

I haven't tested, but it looks to me that the code that writes it out is here:

sphinx/sphinx/search/__init__.py, line 371 in 3596590:

def get_terms(self, fn2index: dict) -> tuple[dict[str, list[str]], dict[str, list[str]]]:

I presume it's done this way as a space-saving strategy.
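As a rough sketch of what this format means for a consumer, the snippet below normalizes an entry to an array of document ids. The helper `toDocIds` is hypothetical and is not taken from Sphinx; the sample data is copied from the index excerpt above.

```javascript
// Excerpt of the "terms" mapping shown above: a bare number when the term
// occurs in one document, an array when it occurs in several.
const sampleTerms = { thi: [0, 3, 4, 5, 10, 12], an: [0, 7], type: 0, might: 0 };

// Hypothetical helper: always return an array of document ids.
function toDocIds(entry) {
  if (entry === undefined) return [];          // term not in the index
  return Array.isArray(entry) ? entry : [entry]; // wrap single-document ids
}

console.log(toDocIds(sampleTerms.type));   // [ 0 ]
console.log(toDocIds(sampleTerms.an));     // [ 0, 7 ]
console.log(toDocIds(sampleTerms.absent)); // []
```

Testing for `undefined` rather than truthiness keeps the single-document entry `0` distinguishable from a missing term.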

Unfortunately I don't have the rights to re-open this PR and issue, could you please do it for me?

jayaddison · 2024-03-02T00:31:23Z

Yikes - OK. Thank you @wlach. That's a cleverer/more-adaptive format than I'd expected, and I should have checked for single-document terms. Yep, I'll re-open these.

wlach · 2024-03-02T15:03:24Z

sphinx/themes/basic/static/searchtools.js

+        if (!titleTerms.hasOwnProperty(word)) {
+          Object.keys(titleTerms).forEach((term) => {
+            if (term.match(escapedWord))
+              arr.push({ files: titleTerms[word], score: Scorer.partialTitle });


Is the access to titleTerms[word] here a (separate) bug?

I think #11957 as described is broad enough to describe both. It's the same error: a partial match overwriting a full one.

Oh wait, I see what you mean now given https://github.com/sphinx-doc/sphinx/pull/11958/files#r1509999419. Yeah, this looks like a separate bug strictly speaking. This is such a rabbit hole. 😭

I'd be inclined to fix it in a new PR with its own test (maybe depending on this one).

Yep, no problem, there's a lot going on here :) As mentioned though, I feel good about the effect on quality that all this will have.

Agreed; I'll create a separate bugreport for this, and then request review for #12037 after this one is merged.

Could you clarify in the description of #11957 that it relates solely to terms that only occur in the zeroth document in the index? (if my understanding of that is true?)

Edit: fix pr/issues references

Arg, sorry - those linked issues were wrong; I meant to request a change for the description of #11957 - to make the nature of the bug more precise.

Reported separately as #12040.
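A small sketch of the question raised in this thread, assuming a toy title index. Inside the branch guarded by the ownership check, `word` is by definition not a key of `titleTerms`, so `titleTerms[word]` is always undefined there. The function name `partialTitleFiles`, its flag and the sample data are hypothetical; looking up the matched `term` instead is shown only for contrast and is not confirmed by this thread as the intended fix.

```javascript
const titleTerms = { install: 0, installation: [1, 2] };
const word = "instal"; // not an own key, so only partial matches apply

function partialTitleFiles(lookupByTerm) {
  const found = [];
  if (!Object.prototype.hasOwnProperty.call(titleTerms, word)) {
    Object.keys(titleTerms).forEach((term) => {
      if (term.match(word))
        // titleTerms[word] cannot exist in this branch; titleTerms[term] can.
        found.push(lookupByTerm ? titleTerms[term] : titleTerms[word]);
    });
  }
  return found;
}

console.log(partialTitleFiles(false)); // two undefined entries
console.log(partialTitleFiles(true));  // the ids stored for each matched title term
```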

jayaddison · 2024-03-03T13:14:51Z

Note: if reviewing this pull request using the GitHub web viewer, then it may be easier to inspect by requesting the diff with whitespace-changes ignored: https://github.com/sphinx-doc/sphinx/pull/11958/files?w=1 (the w=1 query-string param)

picnixz · 2024-03-03T14:18:12Z

Thank you!

jayaddison · 2024-03-03T14:19:00Z

Thank you both @wlach @picnixz!

wlach · 2024-03-03T15:57:19Z

Such a relief to finally get this in, thanks all!

This was referenced Feb 6, 2024

HTML Search: Fix multiple term matching edge case #11960

Merged

HTML Search: Fix duplicate results #11942

Closed

wlach marked this pull request as ready for review February 6, 2024 16:09

picnixz added the html search label Feb 7, 2024

wlach mentioned this pull request Feb 7, 2024

Upstream issue #11961: alternative de-duplication approach wlach/sphinx#1

Closed

wlach force-pushed the html-search-issue-11950 branch from b7bd5f8 to e2695c7 Compare February 27, 2024 19:55

wlach added 2 commits February 27, 2024 19:06

HTML Search: Fix partial matches overwriting full matches

76d171a

Add changelog entry

a3bbf41

wlach force-pushed the html-search-issue-11950 branch from e2695c7 to a3bbf41 Compare February 28, 2024 00:07

jayaddison closed this Mar 1, 2024

jayaddison mentioned this pull request Mar 1, 2024

Test suite: search: test fixture does not accurately represent the JS search index format #12028

Closed

jayaddison reopened this Mar 2, 2024

jayaddison mentioned this pull request Mar 2, 2024

[HTML search] tests: bugfix: correction for test index fixture format #12029

Closed

jayaddison reviewed Mar 2, 2024

jayaddison mentioned this pull request Mar 2, 2024

[HTML search] correction for scoring of search terms that only appear in document indexed with id zero. #12037

Closed

jayaddison added the type:bug label Mar 2, 2024

jayaddison reviewed Mar 2, 2024

This was referenced Mar 2, 2024

HTML Search: partially-matched titles are not included in search results. #12040

Closed

HTML Search: fixup: include partially-matched document titles in search results. #12041

Merged

jayaddison added javascript type:performance priority:high labels Mar 3, 2024

jayaddison requested a review from picnixz March 3, 2024 13:10

picnixz approved these changes Mar 3, 2024

Merge branch 'master' into html-search-issue-11950

9ebdbf9

picnixz merged commit 1e4f80d into sphinx-doc:master Mar 3, 2024
23 checks passed

jayaddison mentioned this pull request Mar 3, 2024

[HTML search] optimization: don't loop over all document terms and title terms during partial-matching. #12045

Closed

wlach deleted the html-search-issue-11950 branch March 3, 2024 15:57

picnixz removed the priority:high label Mar 17, 2024

github-actions bot locked as resolved and limited conversation to collaborators Apr 18, 2024

AA-Turner added this to the 7.3.0 milestone Jul 13, 2024
