Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML Search: Fix partial matches overwriting full matches #11958

Merged
merged 3 commits into from Mar 3, 2024

Conversation

wlach
Copy link
Contributor

@wlach wlach commented Feb 6, 2024

Subject: Fix partial matches overwriting full matches for the 0th document

Feature or Bugfix

  • Bugfix

Purpose

Fixes matches being artificially lower than they should be.

Detail

Due to a small bug, we would overwrite full matches with the partial match score for the first document. This is a trivial bug, but makes testing more difficult.

Relates

@@ -466,14 +466,18 @@ const Search = {
// add support for partial matches
if (word.length > 2) {
const escapedWord = _escapeRegExp(word);
Object.keys(terms).forEach((term) => {
if (term.match(escapedWord) && !terms[word])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

terms[word] will be falsey (0) for the first document.

Copy link
Contributor

@jayaddison jayaddison Mar 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wlach based on index-format findings (and test coverage), should this be rephrased to:

terms[word] will be falsey (0) for a word that only exists in document zero

(kinda wordy, but captures the fact that [0, 1], for example, is not falsy)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(edited with a correction and for conciseness)

if (term.match(escapedWord) && !titleTerms[word])
arr.push({ files: titleTerms[word], score: Scorer.partialTitle });
});
if (!terms.hasOwnProperty(word)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is marginally more efficient to do this check outside the loop too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless the performance improvement is significant, I think it'd be better to leave the conditional check after the && clause.

(that's based on considering that if the first condition fails, then the clause after the && isn't evaluated. if moving the property-check eliminates a lot of regex matches, then fair enough, but this seems intended to be primarily a correctness fixup, so I'd lean towards making the logical diff as small as possible)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to push back on this a bit: I think this small improvement is in scope (since we're changing the logic anyway) and the diff here is relatively small and easy to understand. Moving the property check should indeed reduce the number of regex matches a fair bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, no problem. I think we'll have to agree to disagree on that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if you could avoid the extraction of an outer if statement here. When viewing the diff, the larger set of line changes obscure what the bugfix is; and that's based on my assessment today, after having understood the fix a couple of weeks ago. It'd be trickier for someone who wasn't familiar with the original pull request, or looking at it again after a longer duration of time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understood your point, I just don't agree with it in this case. I have more to say on this, but I think we've gone out of scope of a review. If you want to talk about it more, feel free to email me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining. From my perspective, bug report, code review and related discussion are often useful information when trying to understand a commit/change, so I'd prefer to keep the discussion public; in this case I do think that the size of the line diff is relevant to code review.

Would you be OK with me drafting an alternative PR? I don't know if it'd be more likely to be accepted for merge, and it's OK if it isn't, but I'd like to present it as an alternative.

Copy link
Contributor Author

@wlach wlach Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining. From my perspective, bug report, code review and related discussion are often useful information when trying to understand a commit/change, so I'd prefer to keep the discussion public; in this case I do think that the size of the line diff is relevant to code review.

I don't think I have more to say on the technical front. But just to state it one more time so it's clear: I think this (small) PR as it stands fixes the issue in a more fundamental way and improves performance so it's my strong recommendation that it be the final state of the code. I don't feel the benefit of breaking it up into two PRs is worth the hassle and churn (leaving aside that this discussion has now occupied more effort than that would entail...).

Would you be OK with me drafting an alternative PR? I don't know if it'd be more likely to be accepted for merge, and it's OK if it isn't, but I'd like to present it as an alternative.

It's not clear to me when someone who can actually merge the code will look at these changes. I can't stop you from proposing an alternative, but I think the best thing is probably just to wait for a maintainer to break our impasse. I am willing to make the change you suggest if it will actually get this merged and my PR is open to updates from the owners of this repository.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, after finding more time to understand the code, I concede that the performance improvements here are important. I've opened #12045 with a description of what I believe the problem is.

Also, compared to my suggested alternative #12040, I think that the use of the JavaScript hasOwnProperty function is probably better/safer code style, making this pull request preferable.

@wlach could you edit the description of this pull request (#11958) to add 'Resolves #12045'?

(and thanks for your patience!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't bother much about this. Here, it's more important to use hasOwnProperty instead of !terms[word] which could lead to incorrect boolean evaluations (and personally if we can have an easy-to-write improvement because we can isolate a condition from outside a loop, it's fine for me).

@@ -21,7 +21,7 @@ describe('Basic html theme search', function() {
"<no title>",
"",
null,
2,
5,
Copy link
Contributor Author

@wlach wlach Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, the score gets boosted to 5 (what it should be).

@jayaddison
Copy link
Contributor

@wlach could you merge the latest changes from the master branch into this one? That should resolve the Windows test failures. From what I remember of the plan, this is to be merged before #11942.

@wlach
Copy link
Contributor Author

wlach commented Feb 27, 2024

@wlach could you merge the latest changes from the master branch into this one? That should resolve the Windows test failures. From what I remember of the plan, this is to be merged before #11942.

Done

@jayaddison
Copy link
Contributor

I think that the problem here is due to an incorrect test fixture in our test suite; I've added one or two more details in the linked bugreport (#11957) and as a result am closing this too. Please re-open with supporting details if I'm mistaken.

@wlach
Copy link
Contributor Author

wlach commented Mar 1, 2024

I think that the problem here is due to an incorrect test fixture in our test suite; I've added one or two more details in the linked bugreport (#11957) and as a result am closing this too. Please re-open with supporting details if I'm mistaken.

@jayaddison It looks to me like the index generated by Sphinx only uses an array for the term entries if there are multiple matches. See this example:

https://repo-parser-demo.netlify.app/searchindex.js

And note the output here:

{ ..., "terms": {"thi": [0, 3, 4, 5, 10, 12], "i": [0, 6, 10, 12], "an": [0, 7], "type": 0, "you": [0, 3, 11], "might": 0, "abl": 0, ... }

I haven't tested, but it looks to me that the code that writes it out is here:

def get_terms(self, fn2index: dict) -> tuple[dict[str, list[str]], dict[str, list[str]]]:

I presume it's done this way as a space-saving strategy.

Unfortunately I don't have the rights to re-open this PR and issue, could you please do it for me?

@jayaddison
Copy link
Contributor

Yikes - OK. Thank you @wlach. That's a cleverer/more-adaptive format than I'd expected, and I should have checked for single-document terms. Yep, I'll re-open these.

if (!titleTerms.hasOwnProperty(word)) {
Object.keys(titleTerms).forEach((term) => {
if (term.match(escapedWord))
arr.push({ files: titleTerms[word], score: Scorer.partialTitle });

This comment was marked as resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the access to titleTerms[word] here a (separate) bug?

I think #11957 as described is broad enough to describe both. It's the same error: a partial match overwriting a full one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, I see what you mean now given https://github.com/sphinx-doc/sphinx/pull/11958/files#r1509999419. Yeah, this looks like a separate bug strictly speaking. This is such a rabbit hole. 😭

I'd be inclined to fix it in a new PR with its own test (maybe depending on this one).

Copy link
Contributor

@jayaddison jayaddison Mar 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, no problem, there's a lot going on here :) As mentioned though, I feel good about the effect on quality that all this will have.

Agreed; I'll create a separate bugreport for this, and then request review for #12037 after this one is merged.

Could you clarify in the description of #11957 that it relates solely to terms that only occur in the zeroth document in the index? (if my understanding of that is true?)

Edit: fix pr/issues references

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arg, sorry - those linked issues were wrong; I meant to request a change for the description of #11957 - to make the nature of the bug more precise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reported separately as #12040.

if (!titleTerms.hasOwnProperty(word)) {
Object.keys(titleTerms).forEach((term) => {
if (term.match(escapedWord))
arr.push({ files: titleTerms[word], score: Scorer.partialTitle });

This comment was marked as outdated.

@jayaddison jayaddison requested a review from picnixz March 3, 2024 13:10
@jayaddison
Copy link
Contributor

Note: if reviewing this pull request using the GitHub web viewer, then it may be easier to inspect by requesting the diff with whitespace-changes ignored: https://github.com/sphinx-doc/sphinx/pull/11958/files?w=1 (the w=1 query-string param)

@picnixz picnixz merged commit 1e4f80d into sphinx-doc:master Mar 3, 2024
23 checks passed
@picnixz
Copy link
Member

picnixz commented Mar 3, 2024

Thank you!

@jayaddison
Copy link
Contributor

Thank you both @wlach @picnixz!

@wlach
Copy link
Contributor Author

wlach commented Mar 3, 2024

Such a relief to finally get this in, thanks all!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 18, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
html search javascript Pull requests that update Javascript code type:bug type:performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTML Search: Partial matches can overwrite main matches in certain rare cases
3 participants