- Sponsor
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML Search: Fix partial matches overwriting full matches #11958
Conversation
@@ -466,14 +466,18 @@ const Search = { | |||
// add support for partial matches | |||
if (word.length > 2) { | |||
const escapedWord = _escapeRegExp(word); | |||
Object.keys(terms).forEach((term) => { | |||
if (term.match(escapedWord) && !terms[word]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
terms[word]
will be falsey (0
) for the first document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wlach based on index-format findings (and test coverage), should this be rephrased to:
terms[word]
will be falsey (0
) for a word that only exists in document zero
(kinda wordy, but captures the fact that [0, 1]
, for example, is not falsy)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(edited with a correction and for conciseness)
if (term.match(escapedWord) && !titleTerms[word]) | ||
arr.push({ files: titleTerms[word], score: Scorer.partialTitle }); | ||
}); | ||
if (!terms.hasOwnProperty(word)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is marginally more efficient to do this check outside the loop too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless the performance improvement is significant, I think it'd be better to leave the conditional check after the &&
clause.
(that's based on considering that if the first condition fails, then the clause after the &&
isn't evaluated. if moving the property-check eliminates a lot of regex matches, then fair enough, but this seems intended to be primarily a correctness fixup, so I'd lean towards making the logical diff as small as possible)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to push back on this a bit: I think this small improvement is in scope (since we're changing the logic anyway) and the diff here is relatively small and easy to understand. Moving the property check should indeed reduce the number of regex matches a fair bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, no problem. I think we'll have to agree to disagree on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer if you could avoid the extraction of an outer if
statement here. When viewing the diff, the larger set of line changes obscure what the bugfix is; and that's based on my assessment today, after having understood the fix a couple of weeks ago. It'd be trickier for someone who wasn't familiar with the original pull request, or looking at it again after a longer duration of time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I understood your point, I just don't agree with it in this case. I have more to say on this, but I think we've gone out of scope of a review. If you want to talk about it more, feel free to email me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining. From my perspective, bug report, code review and related discussion are often useful information when trying to understand a commit/change, so I'd prefer to keep the discussion public; in this case I do think that the size of the line diff is relevant to code review.
Would you be OK with me drafting an alternative PR? I don't know if it'd be more likely to be accepted for merge, and it's OK if it isn't, but I'd like to present it as an alternative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining. From my perspective, bug report, code review and related discussion are often useful information when trying to understand a commit/change, so I'd prefer to keep the discussion public; in this case I do think that the size of the line diff is relevant to code review.
I don't think I have more to say on the technical front. But just to state it one more time so it's clear: I think this (small) PR as it stands fixes the issue in a more fundamental way and improves performance so it's my strong recommendation that it be the final state of the code. I don't feel the benefit of breaking it up into two PRs is worth the hassle and churn (leaving aside that this discussion has now occupied more effort than that would entail...).
Would you be OK with me drafting an alternative PR? I don't know if it'd be more likely to be accepted for merge, and it's OK if it isn't, but I'd like to present it as an alternative.
It's not clear to me when someone who can actually merge the code will look at these changes. I can't stop you from proposing an alternative, but I think the best thing is probably just to wait for a maintainer to break our impasse. I am willing to make the change you suggest if it will actually get this merged and my PR is open to updates from the owners of this repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, after finding more time to understand the code, I concede that the performance improvements here are important. I've opened #12045 with a description of what I believe the problem is.
Also, compared to my suggested alternative #12040, I think that the use of the JavaScript hasOwnProperty
function is probably better/safer code style, making this pull request preferable.
@wlach could you edit the description of this pull request (#11958) to add 'Resolves #12045'?
(and thanks for your patience!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't bother much about this. Here, it's more important to use hasOwnProperty
instead of !terms[word]
which could lead to incorrect boolean evaluations (and personally if we can have an easy-to-write improvement because we can isolate a condition from outside a loop, it's fine for me).
@@ -21,7 +21,7 @@ describe('Basic html theme search', function() { | |||
"<no title>", | |||
"", | |||
null, | |||
2, | |||
5, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change, the score gets boosted to 5 (what it should be).
b7bd5f8
to
e2695c7
Compare
e2695c7
to
a3bbf41
Compare
I think that the problem here is due to an incorrect test fixture in our test suite; I've added one or two more details in the linked bugreport (#11957) and as a result am closing this too. Please re-open with supporting details if I'm mistaken. |
@jayaddison It looks to me like the index generated by Sphinx only uses an array for the term entries if there are multiple matches. See this example: https://repo-parser-demo.netlify.app/searchindex.js And note the output here:
I haven't tested, but it looks to me that the code that writes it out is here: sphinx/sphinx/search/__init__.py Line 371 in 3596590
I presume it's done this way as a space-saving strategy. Unfortunately I don't have the rights to re-open this PR and issue, could you please do it for me? |
Yikes - OK. Thank you @wlach. That's a cleverer/more-adaptive format than I'd expected, and I should have checked for single-document terms. Yep, I'll re-open these. |
if (!titleTerms.hasOwnProperty(word)) { | ||
Object.keys(titleTerms).forEach((term) => { | ||
if (term.match(escapedWord)) | ||
arr.push({ files: titleTerms[word], score: Scorer.partialTitle }); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the access to titleTerms[word] here a (separate) bug?
I think #11957 as described is broad enough to describe both. It's the same error: a partial match overwriting a full one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wait, I see what you mean now given https://github.com/sphinx-doc/sphinx/pull/11958/files#r1509999419. Yeah, this looks like a separate bug strictly speaking. This is such a rabbit hole. 😭
I'd be inclined to fix it in a new PR with its own test (maybe depending on this one).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, no problem, there's a lot going on here :) As mentioned though, I feel good about the effect on quality that all this will have.
Agreed; I'll create a separate bugreport for this, and then request review for #12037 after this one is merged.
Could you clarify in the description of #11957 that it relates solely to terms that only occur in the zeroth document in the index? (if my understanding of that is true?)
Edit: fix pr/issues references
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arg, sorry - those linked issues were wrong; I meant to request a change for the description of #11957 - to make the nature of the bug more precise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reported separately as #12040.
if (!titleTerms.hasOwnProperty(word)) { | ||
Object.keys(titleTerms).forEach((term) => { | ||
if (term.match(escapedWord)) | ||
arr.push({ files: titleTerms[word], score: Scorer.partialTitle }); |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
Note: if reviewing this pull request using the GitHub web viewer, then it may be easier to inspect by requesting the diff with whitespace-changes ignored: https://github.com/sphinx-doc/sphinx/pull/11958/files?w=1 (the |
Thank you! |
Such a relief to finally get this in, thanks all! |
Subject: Fix partial matches overwriting full matches for the 0th document
Feature or Bugfix
Purpose
Fixes matches being artificially lower than they should be.
Detail
Due to a small bug, we would overwrite full matches with the partial match score for the first document. This is a trivial bug, but makes testing more difficult.
Relates