
Merge in upstream changes from mkdocs-material #338

Open · jbms wants to merge 19 commits into main from merge-upstream

Conversation

@jbms (Owner) commented Apr 11, 2024

This replaces the modified Sphinx search implementation with the lunr.js-based search implementation from the upstream mkdocs-material theme.

Maintaining the existing custom search was proving to be untenable. Using the upstream search implementation greatly reduces the size of the diff and should make it much easier to incorporate future changes.

The downsides are:

  • Search index is significantly bigger. However, individual pages no longer need to be fetched to display detailed search result info; consequently, the net effect may not be that large.
  • The Sphinx search implementation had special support for sphinx object descriptions which was especially helpful for API documentation.

Remaining work before merging this:

  • Update documentation to describe new features/changes
  • Fix search index for incremental builds
  • Better integration of Sphinx "objects" into the lunr.js-based search:
    • Typing the name of an API symbol often does not give that API symbol as the first result; we will need to fix that.
    • Search results for Sphinx objects were displayed specially (e.g. tagged with "Python function") in the old search implementation. We should add something similar to the new search implementation.

jbms requested a review from 2bndy5 on April 11, 2024 04:24
jbms force-pushed the merge-upstream branch 3 times, most recently from 7b53c8b to 593856d on April 11, 2024 05:49

Commit: This replaces the modified Sphinx search implementation with the lunr.js-based search implementation from the upstream mkdocs-material theme.

@2bndy5 (Collaborator) commented Apr 11, 2024

I hope you don't mind if I push some commits to your branch. I won't force push anything, so you can keep squashing the history as you see fit.

@jbms (Owner, Author) commented Apr 11, 2024

> I hope you don't mind if I push some commits to your branch. I won't force push anything, so you can keep squashing the history as you see fit.

Yes please go ahead!

Commit: Maybe there's a more elegant way to approach this, but I'm just hacking this into something passable for CI
@2bndy5 (Collaborator) commented Apr 11, 2024

> I won't force push anything

I lied. I screwed up the typing on a commit I didn't run through mypy first...


Commit: add info to doc and mention the `hide-edit-link` metadata for overriding per page
@jbms (Owner, Author) commented Apr 11, 2024

I noticed you fixed some lint issues in the search plugin (which I copied unmodified from upstream, except for replacing mkdocs with mkdocs_compat) --- unless they are necessary bug fixes, it would be better to revert those to make it easier to merge changes in the future.

Similarly we can just check for the HTML builder in our own search extension sphinx_immaterial/search.py rather than modifying the upstream plugin.


Commit: probably a relic from the merge script rebase

Commit: also ran `npm audit fix`, which bumped `node_modules/tar` and `tar` in package-lock
2bndy5 linked an issue on Apr 12, 2024 that may be closed by this pull request
@2bndy5 (Collaborator) commented Apr 12, 2024

I bumped mermaid to 10.7.0 (as is used in upstream). This fixed the mermaid problem and resolves #328.

I also ran `npm audit fix`, which bumped a couple of package versions in the lock file (both had tar in the name). I hope that's ok. I can revert the audit fix if that was unwanted.

@jbms (Owner, Author) commented Apr 12, 2024

Bumping dependencies is fine in general; the one tricky thing is that we have to be careful about cssnano/postcss-related deps, because newer versions result in OOM (this also applies to the upstream theme) and I don't yet know of a solution. The only way I got this merge to work was to copy the lockfile from the upstream theme and then run `npm install` to integrate our additional deps.

@2bndy5 (Collaborator) left a review comment

Honestly, this works well enough for me as it is now.

I'm going to focus on the feat: upstream issues, though.

I haven't fully wrapped my head around the ported search implementation. My main confusion comes from unfamiliarity with lunr.js (or any search implementation for that matter). But I can try to help if this PR gets too stale.

FWIW, I really don't use the search functionality on websites unless I'm in a hurry or completely lost for more than 3-5 minutes. I typically find the right page and use Ctrl+F instead.

@2bndy5 (Collaborator) commented Apr 13, 2024

A few other things I've noticed (none of which bother me):

  • Enhanced tooltips are not applied to graphviz diagrams. This likely isn't a simple CSS fix that could be added to our _graphviz.scss because the placement of enhanced tooltips seems to be calculated by JS.

  • The hlist directive throws enhanced tooltip placement off terribly.

  • The newer navigation.prune feature doesn't seem to have an effect, although I thought we were already doing this by default.

  • The toc object icons are aligned to the top of multi-line entries, whereas before they were vertically center-aligned.

    (before/after screenshots omitted)

Comment on lines +136 to +139

```jinja
{% if page.meta.git_revision_date_localized %}
{% set updated = page.meta.git_revision_date_localized %}
{% elif page.meta.revision_date %}
{% set updated = page.meta.revision_date %}
```
@2bndy5 (Collaborator) commented Apr 13, 2024

I just noticed that we've been feeding Sphinx's last_updated context into this template:

```python
"meta": {"hide": [], "revision_date": context.get("last_updated"), "meta": []},
```

But this doesn't actually describe the last revision date: every page gets the last build date, even if the doc wasn't actually altered on the date it was built. This inaccuracy is not caused by the shallow checkout used in CI builds; I'm also seeing the same behavior locally.

From yesterday's RTD build for this PR (in task_lists.html): (screenshot omitted)

This template feature would be better served by #341, but I guess this behavior would suffice, since it has been this way. We could instead get the info manually using

```python
time.strftime("%x", time.localtime(os.stat(doc2path(pagename)).st_mtime))
```

but again, the checkout date is not the same as the last modified date, since git checkout doesn't preserve file timestamps.

@jbms (Owner, Author) replied:

Git doesn't store any times at the per-file level, and git checkout / git clone sets the modification time to the time of the checkout (for an existing work tree, only the modified files get updated time stamps). Therefore we have to explicitly query git to find the most recent commit that modified the file as in #341, because the only timestamps stored by git are commit timestamps. For that we will need a non-shallow checkout but that probably isn't too big of a deal in practice.
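
For illustration, that git query could look something like this (a minimal sketch, not the #341 implementation; the helper name is hypothetical):

```python
import subprocess

def last_commit_date(path: str) -> str | None:
    # Ask git for the committer date (YYYY-MM-DD) of the most recent commit
    # touching `path`. Requires a non-shallow clone; returns None if the file
    # has no commits (e.g. untracked).
    out = subprocess.run(
        ["git", "log", "-1", "--format=%cs", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return out or None
```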

@2bndy5 (Collaborator) replied:

I ran across a reference in the upstream docs that recommends sparse checkouts, but I didn't see it get used in their CI (at least in the community edition docs build).

@jbms (Owner, Author) commented Apr 13, 2024

> Honestly, this works well enough for me as it is now.
>
> I'm going to focus on the feat: upstream issues, though.
>
> I haven't fully wrapped my head around the ported search implementation. My main confusion comes from unfamiliarity with lunr.js (or any search implementation for that matter). But I can try to help if this PR gets too stale.
>
> FWIW, I really don't use the search functionality on websites unless I'm in a hurry or completely lost for more than 3-5 minutes. I typically find the right page and use Ctrl+F instead.

I think the usefulness of search may depend on a lot of factors (including how good the search results are!) but for API documentation in particular a common use case is that you want to find the documentation for a specific symbol, and for that search is very useful. Suppose I want to find out about the tensorstore_demo.IndexDomain class. Compare our old search results:

https://jbms.github.io/sphinx-immaterial/?q=IndexDomain

with the new search results:

https://sphinx-immaterial--338.org.readthedocs.build/en/338/?q=IndexDomain

As for how search works, basically:

  1. The upstream search plugin parses the HTML of each page to split it into sections.
  2. Each section is then treated as a separate "document" for the purpose of searching. All HTML tags except a few that are supported for the "rich search results" are stripped out. Each document has a number of fields in addition to the main text/html content: the currently-used fields are title and tags, I believe. (Tags refers to the upstream tag feature that we don't currently support.) Documents can also have an additional numeric "boost" metadata property that affects how it is ranked. Each section with its fields is then serialized directly as JSON without further processing.
  3. The client JavaScript reads the JSON containing all of the sections and builds a lunr.js search index. When doing a search, the results are ranked based on which field the terms matched against --- tags are the highest, then title, then main body. Then the additional boost property is also factored into the rank.
  4. After performing the search, multiple documents from the same page are then grouped together when displaying results.
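
For illustration, one serialized section document might look like the following (hypothetical values, shaped like the real entries quoted later in this thread; tags and boost are optional):

```python
# Hypothetical section document as serialized into the search index JSON.
section_document = {
    "location": "demo_api.html#some-section",  # page URL plus anchor (anchor is illustrative)
    "title": "Some section title",
    "text": "Plain text of the section, with most HTML tags stripped.",
    "tags": ["example-tag"],  # upstream tags feature; not currently supported here
    "boost": 2,               # optional multiplier factored into the lunr.js ranking
}
```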

To improve results I think we should:

  • Modify the section parsing to ensure Sphinx object descriptions are treated as separate sections
  • Include some metadata in the search documents generated from sphinx object descriptions to indicate the object description type (e.g. "Python class"), as we did before, and find a way to display it nicely (e.g. as text or as some type of icon) in the results. (Kind of optional, but pretty helpful.)
  • Sphinx has a concept of search weighting for object description types. We should include that as the boost value for the sections corresponding to object descriptions. That allows, for example, object descriptions for individual function parameters to be downweighted in the results.

It is possible we may need other changes to make API documentation searching work well, but those things would be a start.

@2bndy5 (Collaborator) commented Apr 13, 2024

Superb explanation! 🚀 Doing another code dive...

@2bndy5 (Collaborator) commented Apr 13, 2024

> Modify the section parsing to ensure Sphinx object descriptions are treated as separate sections

Just a gut reaction here: To avoid any deviation from the inherited search plugin, we could instead derive from it and make any API changes that way.

```python
def _make_indexer(app: sphinx.application.Sphinx):
    return search_plugin.SearchIndex(**dict(_get_search_config(app)))
```

To compensate for this complexity, our search.py module could be organized as a sub-package. Also, support for other features (like tags) would then be contained/scoped to search-specific implementations.

@jbms (Owner, Author) commented Apr 13, 2024

>> Modify the section parsing to ensure Sphinx object descriptions are treated as separate sections
>
> Just a gut reaction here: To avoid any deviation from the inherited search plugin, we could instead derive from it and make any API changes that way.
>
>     def _make_indexer(app: sphinx.application.Sphinx):
>         return search_plugin.SearchIndex(**dict(_get_search_config(app)))
>
> To compensate for this complexity, our search.py module could be organized as a sub-package. Also, support for other features (like tags) would then be contained/scoped to search-specific implementations.

Yeah, potentially we could indeed monkey-patch some of the classes (e.g. SearchIndex and Parser) rather than modifying the file itself --- once we know what modifications are needed, we can figure out the best way to make them.

Note that the upstream search plugin is already used in a hacky way from sphinx_immaterial/search.py, since it is intended as a mkdocs plugin, not a Sphinx extension. In particular, the only portion of the SearchPlugin class itself that we are using is on_config, for filling in default values of the config. Then we use the SearchIndex class directly.
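
A minimal sketch of that subclass/monkey-patch route, assuming the vendored plugin module is importable as search_plugin (as in the snippet above; the import path and subclass name are hypothetical):

```python
from sphinx_immaterial import search_plugin  # hypothetical import path

class ApiAwareSearchIndex(search_plugin.SearchIndex):
    """Sketch: extend the vendored SearchIndex without editing the vendored file."""

    def create_entry_for_section(self, *args, **kwargs):
        # Defer to upstream's own signature, then adjust the entry for Sphinx
        # object descriptions, e.g. attach a "boost" value or object-type label.
        entry = super().create_entry_for_section(*args, **kwargs)
        return entry
```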

@2bndy5 (Collaborator) commented Apr 14, 2024

Still investigating an adequate approach, so this post is just an organization of my thoughts/findings.

> Sphinx has a concept of search weighting for object description types. We should include that as the boost value for the sections corresponding to object descriptions

I found that Sphinx aggregates a prio in search.IndexBuilder.get_objects(). Assuming prio means "priority", this prio is returned by <derived-Domain()>.get_objects() (i.e. here for python & here for cpp). Luckily, this prio is returned in a tuple that includes the object ID as used in the HTML elements' id attribute.

To supplement our SearchIndex.add_entry_from_context() results with this info, we actually need to augment upstream's SearchIndex.create_entry_for_section(). Each entry's "boost" field can be set according to the object's id; I'm thinking 1 + (0.5 * prio). However, fetching the object description's ID is not straightforward for our apigen outputs.

  • For autodoc (& manual invocation of sphinx domain directives), the id is appended to the entry's "location" field (as a URL hash location).
    [
      {
        "location": "demo_api.html#test-py-module",
        "title": "<code>test_py_module</code>",
        "text": "omitted for brevity"
      },
    ]
  • For apigen, the id may not be present in the "location" field, because each documented member gets its own page, resolving to just the page URL without a hash. This case is a stumper (at least as I'm writing this).
    [
      {
        "location": "python_apigen_generated/IndexDomain.size-aea1f878.html",
        "title": "tensorstore_demo.IndexDomain.size",
        "text": "omitted for brevity"
      },
    ]

It would be nicer to have a 1-to-1 map of HTML-to-doctree to properly & robustly set the boost appropriately for API descriptions.

Initially, I don't think we need to monkeypatch anything to achieve this. It could probably be done in a separate function that _html_page_context() calls. But here's where we have to consider performance: if we can extract the CSS classes of the API description signature, then we can avoid traversing all domains in the builder env and, for example, just focus on the py domain if "py" is in the list of classes.


I'm partial to opening another PR that merges into this branch, so we don't have to play around with this branch's git history to keep my proposal's reviews/changes "clean". The remaining goals blocking this PR are not simple objectives.

PS - I also thought about leveraging upstream's tag plugin with Sphinx domains, but I don't see that being very feasible since tags are really specific to an entire page (not sections of a page).

@jbms (Owner, Author) commented Apr 14, 2024

I think we can map back to the Sphinx object description info from the HTML, much as the old search_adapt.py did its own mapping:

We can first iterate over all objects to compute a map of (html_path, anchorname) -> (domain, object_type, any other info) and then use that when processing the HTML. The builder has a method to get the HTML path from the docname.
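
A rough sketch of building that map (the helper name and the stored tuple are illustrative; get_objects() and get_target_uri() are the standard Sphinx APIs mentioned above):

```python
import sphinx.application

def build_object_map(app: sphinx.application.Sphinx) -> dict:
    """Map (html_path, anchor) -> (domain, object_type, priority) for all objects."""
    object_map = {}
    for domain in app.env.domains.values():
        # Each Domain.get_objects() yields
        # (name, dispname, type, docname, anchor, priority), where priority is
        # Sphinx's search weight (0 important, 1 default, 2 unimportant,
        # -1 hidden from search).
        for _name, _disp, objtype, docname, anchor, prio in domain.get_objects():
            html_path = app.builder.get_target_uri(docname)
            object_map[(html_path, anchor)] = (domain.name, objtype, prio)
    return object_map
```

Since lower Sphinx priority values mean more important, a boost derived from prio would presumably need to decrease as prio increases.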

@jbms (Owner, Author) commented Apr 14, 2024

Note that most object descriptions do not result in a section based on the current section parsing logic in the search plugin --- we would have to change the parsing, or modify the HTML before feeding it to the parser, to fix that. With apigen we do get a section because there is a separate page, but I think otherwise we don't.

@2bndy5 (Collaborator) commented Apr 14, 2024

Just as a quick hack, I added "dl" to the Parser.keep attr, and I get much better API results. 😆 They still need to be boosted, though.
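
A sketch of that hack as a monkey-patch, assuming Parser.keep is a list attribute on the vendored parser class (an assumption; the actual hack presumably edited the vendored file directly):

```python
from sphinx_immaterial import search_plugin  # hypothetical import path, as above

# Keep <dl> elements (Sphinx object descriptions render as definition lists)
# among the tags the parser preserves for rich search results.
if "dl" not in search_plugin.Parser.keep:
    search_plugin.Parser.keep = [*search_plugin.Parser.keep, "dl"]
```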


Successfully merging this pull request may close these issues:

  • Broken link to fontawesome/brands icons
  • Feature request: Mermaid js 10.x