Add a new references builder #12190

Open
wants to merge 2 commits into
base: master

chrisjsewell
Member

@chrisjsewell chrisjsewell commented Mar 23, 2024

This PR adds a new builder, references, which builds a single references.json providing a mapping for almost* all targets available to reference in the project, including:

  1. Internal domain objects, generated within the current project
  2. External domain objects, loaded from the objects.inv configured via intersphinx_mapping (when using the sphinx.ext.intersphinx extension)

* I say almost because this assumes that the objects returned from domain.get_objects account for the majority of referenceable items in a project. There are currently some notable exceptions, like the math domain not returning any (but that is for another PR to fix).

This partially addresses #12152, to give users a clear way to understand:

  1. What targets are available for them to reference
  2. How to reference these targets

Crucially, the references.json includes the mapping of object type to role names (this can be one-to-many),
since a role name is required for the reference syntax, not the object type.
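
To illustrate why the role mapping matters, here is a minimal sketch of what a fragment of references.json might look like and how reference syntax could be derived from it. The nesting (domain, then object type, then `roles` and `items`) follows the code excerpts quoted in the review further down. The `mypkg.Widget` entry, its `document` value, and the `reference_for` helper are hypothetical and not part of the PR.

```python
import json

# Hypothetical fragment: the nesting mirrors
# data[domain][object_type] = {'roles': [...], 'items': {...}}
SAMPLE = json.loads("""
{
  "py": {
    "class": {
      "roles": ["class", "exc", "obj"],
      "items": {
        "mypkg.Widget": {"type": "local", "document": "/docs/api.rst"}
      }
    }
  }
}
""")


def reference_for(data, domain, otype, name):
    """Return one reST reference per role that can target ``name``."""
    entry = data[domain][otype]
    if name not in entry["items"]:
        raise KeyError(name)
    # An object type may have no matching roles, hence the default.
    return [f":{domain}:{role}:`{name}`" for role in entry.get("roles", [])]


for ref in reference_for(SAMPLE, "py", "class", "mypkg.Widget"):
    print(ref)
```

The object type (`class`) alone would not be enough to write the reference: the one-to-many `roles` list is what supplies the usable role names.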

I would also envisage that other tools (like a VS Code extension) could utilise this to provide things like auto-completion and "jump to target/references".


Some considerations:

  1. I feel a builder is really the only way to do this comprehensively; having a standalone CLI (like the current python -m sphinx.ext.intersphinx) can only get you so far, and then you will have to start re-implementing features of a normal Sphinx build (like reading configuration, etc.)

  2. Perhaps in a follow-up PR I could introduce a complementary CLI that reads the references.json and allows users to quickly generate references. Something like sphinx-ref find 're.Match' returning :class:`~re.Match` (see https://github.com/orgs/sphinx-doc/discussions/12152#discussioncomment-8862652)

  3. There are cases where an object type has no matching role names; this PR does not address that (although I want to eventually)

  4. As I mention in Make intersphinx (a.k.a. external references) more user friendly #12152, it would be ideal for this to include, not just the document path where a local target is defined, but also the line number (if available). But this is not within the scope of this PR

  5. Creating a singular references.json is probably the simplest way to do this. But, it could get rather large, for a large project, or one with lots of intersphinx mappings.
    Is this ok, or do we think another format would be better, like one JSON file per domain / object type, or even something like an sqlite database file?

  6. The other thing not included in this PR is any addition to the documentation; I could do this here or in a follow-up PR

@chrisjsewell
Member Author

(cc also @webknjaz, as I can't add you as a reviewer)

@picnixz
Member

picnixz commented Mar 23, 2024

(test failure is likely because of a side effect)

@chrisjsewell
Member Author

(test failure is likely because of a side effect)

yeh hmm works locally (when calling the singular test), but perhaps I can't "piggy-back" on the existing test-basic folder

@chrisjsewell
Member Author

anyway, whilst I fix that, interested to hear your thoughts

Member

@picnixz picnixz left a comment

A bit of comments (I'll be less available from now)

Member

For new files like that, could we have explicit __all__ (empty by default if possible, since we don't really know what should be public or not).


class LocalReference(TypedDict, total=False):
    type: Literal['local']
    document: str
Member

In general, we use document for something else, I think. Would it be possible to use docname instead? (I haven't checked yet, but is it a full path or not?) If so, I'd suggest path instead of filepath, because document is generally... the document node.

Member Author

Fair, it's the full path as that's more helpful for users, so yeh could change to filepath

from __future__ import annotations

import json
from os import path
Member

Could you use os.path instead?

Member

(Actually, it's essentially to reduce the possibility of having a variable shadowing the import)
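
A small sketch of the shadowing risk mentioned here. The function names and the `outdir` argument are made up for illustration; the only point is what happens when a local variable is also called `path`.

```python
import os
from os import path  # the import form used in the PR


def with_from_import(outdir):
    path = outdir  # shadows the imported module inside this function
    try:
        # This is now str.join(), not os.path.join()
        return path.join(outdir, "references.json")
    except TypeError:
        return None


def with_module_import(outdir):
    path = outdir  # harmless: os.path is still reachable through `os`
    return os.path.join(path, "references.json")


print(with_from_import("_build"))    # None, the call hit str.join()
print(with_module_import("_build"))  # joined path, separator depends on the OS
```

Since `path` is a very natural name for a local variable, spelling the calls as `os.path.join(...)` keeps the module reachable regardless of local names.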

sphinx/builders/references.py
data[domainname][otype_name] = {'items': {}}
if otype := domain.object_types.get(otype_name):
    data[domainname][otype_name]['roles'] = list(otype.roles)
data.setdefault(domainname, {}).setdefault(otype_name, {})['items'].setdefault(
Member

Suggested change
- data.setdefault(domainname, {}).setdefault(otype_name, {})['items'].setdefault(
+ data[domainname].setdefault(otype_name, {})['items'].setdefault(

Member

Or better: use intermediate variables here (because it's a bit hard to parse).
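
One possible shape for the intermediate-variable version, as a sketch only. The quoted line is truncated, so what the final `setdefault` stores is an assumption here (the item name mapped to its entry), and the `add_item` helper is hypothetical.

```python
def add_item(data, domainname, otype_name, name, entry):
    """Store ``entry`` at data[domain][otype]['items'][name] unless already set."""
    domain_data = data.setdefault(domainname, {})
    otype_data = domain_data.setdefault(otype_name, {})
    items = otype_data.setdefault("items", {})
    # setdefault keeps the first entry seen for a given name
    return items.setdefault(name, entry)


data = {}
add_item(data, "py", "class", "mypkg.Widget", {"type": "local"})
add_item(data, "py", "class", "mypkg.Widget", {"type": "remote"})  # ignored
print(data)
```

Each step names one level of the nesting, and the `'items'` key is created on demand instead of being assumed to exist.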

'url': url,
}
# only add dispname if it is set and not the same as name
if not (dispname == name or not dispname or dispname == '-'):
Member

same as above

otype := local_domain.object_types.get(otype_name)
):
data[domainname][otype_name]['roles'] = list(otype.roles)
data.setdefault(domainname, {}).setdefault(otype_name, {})[
Member

idem

# test the content of the reference file
content = (app.outdir / 'references.json').read_text('utf-8')
data = json.loads(content)
assert data == {
Member

Ok, this one might need a factory for the sake of readability.
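
A sketch of what such a factory could look like. The `local_ref` and `object_type` helpers are hypothetical, the keys follow the `LocalReference` excerpt above (so `document` may well be renamed), and the expected values are invented for illustration.

```python
def local_ref(document):
    """Expected entry for a target defined in the current project."""
    return {"type": "local", "document": document}


def object_type(roles, items):
    """Expected mapping for one object type: its roles plus its items."""
    return {"roles": list(roles), "items": dict(items)}


# Hypothetical expectation, shorter than spelling out every nested dict.
expected = {
    "std": {
        "label": object_type(
            ["ref", "keyword"],
            {name: local_ref("index.rst") for name in ("genindex", "search")},
        ),
    },
}
print(expected)
```

The test body would then reduce to `assert data == expected`, with the repeated structure living in one place.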

# test the content of the reference file
content = (app.outdir / 'references.json').read_text('utf-8')
data = json.loads(content)
assert data == {
Member

@picnixz picnixz Mar 23, 2024

idem

@picnixz
Member

picnixz commented Mar 23, 2024

yeh hmm works locally (when calling the singular test), but perhaps I can't "piggy-back" on the existing test-basic folder

If you are worried about that, use srcdir=os.urandom(16).hex() in the sphinx marker. It's a way to isolate your test so that you don't have weird surprises (well you could still have surprises but you should be VERY unlucky (or lucky, if you were an adversary targeting AES-128)).
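
A sketch of that suggestion. Only the name generation is executed here; the marker line is shown as a comment, and both the builder name `'references'` and `testroot='basic'` are assumptions based on this thread.

```python
import os


def unique_srcdir():
    """Return a 32-character hex name backed by 128 random bits."""
    return os.urandom(16).hex()


# In the test module this would be passed to the marker, for example:
#   @pytest.mark.sphinx('references', testroot='basic', srcdir=unique_srcdir())
name = unique_srcdir()
print(name, len(name))
```

With 16 random bytes, two tests ending up in the same source directory would require a 128-bit collision, which is the "very unlucky" case referred to above.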

@chrisjsewell
Member Author

chrisjsewell commented Mar 23, 2024

A bit of comments (I'll be less available from now)

Thanks for the review @picnixz, but perhaps I could nudge you for some quick general feedback on the concept 😅
Do you agree that this is a "good" thing to introduce? any thoughts on the references.json format?

@bskinn
Contributor

bskinn commented Mar 24, 2024

Read through the new references.py builder. I'm weak on some of the technical details and Sphinx internals, so I can't speak strongly there.

But, here are some other thoughts.

Reaction to the 'generate a complete local & remote references list' idea --- +0.25.

It might be helpful having all targets, local or intersphinx, in one artifact? But after thinking about it, I don't think it's very important to me, personally---and, it seems to me the bigger problem is the object-type lossiness of the current v2 objects.inv format. (Or, at least the way in which Sphinx currently builds to that format.)

I think I would rather have better/more accurate information about the targets in my intersphinx-referenced docsets---which would require a new inventory format, as best I figure---than a list of all local and remote references, where the info I get on the remote references in that all-in-one artifact requires as much work to transform into a working cross-reference as the info I can get out of sphobjinv does.

If I'm trying to reference something in another project, I know which project it is, and I don't mind pointing a single-docset tool at that project's docs. (And, there's a good chance I might prefer that single-docset tool if I don't have to mess with an intermediate data file as part of the process.) The 'all in one place' aspect of this may have a broad appeal, but it's less important to me, personally.

Reaction to the layout of references.json --- overall +0.5 or so, with thoughts/caveats.

For automated ingestion of reference data, this schema seems great. 👍

Coming from a sphobjinv-biased perspective, my primary use case is, "I have this thing X that I want to cross-reference; how do I do that?"

So, from a data mapping perspective, what I want to be able to do with the output of this is to walk from [object name] -> [object reference].

I like the sound of the sphinx-ref find ... tool you proposed, but what happens if it doesn't do the search I want?

The current semantics of references.json are exactly backward for manual REPL exploration: it'll take a beefy, nested list comprehension to search through it for target names.

That said, using the right tool -- jsonpath-ng, say -- probably would make that search relatively straightforward. (Though, it would be more complex if the JSON gets broken up into multiple files.)
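
For a sense of what the manual search involves, here is a sketch using only the standard library. The sample data is hypothetical and assumes the domain, object type, `roles`/`items` nesting seen in the reviewed code. `find_nested` is the nested comprehension mentioned above, and `build_index` inverts the mapping once so that later lookups go straight from object name to roles.

```python
from collections import defaultdict

# Hypothetical data in the assumed references.json layout
SAMPLE = {
    "py": {
        "class": {"roles": ["class", "exc", "obj"],
                  "items": {"re.Match": {}, "re.Pattern": {}}},
        "function": {"roles": ["func", "obj"], "items": {"re.compile": {}}},
    },
    "std": {
        "label": {"roles": ["ref", "keyword"], "items": {"re-syntax": {}}},
    },
}


def find_nested(data, needle):
    """Substring search over every target name (the 'beefy' comprehension)."""
    return [
        (domain, otype, name)
        for domain, otypes in data.items()
        for otype, entry in otypes.items()
        for name in entry["items"]
        if needle in name
    ]


def build_index(data):
    """Invert the mapping: name -> [(domain, object type, roles), ...]."""
    index = defaultdict(list)
    for domain, otypes in data.items():
        for otype, entry in otypes.items():
            for name in entry["items"]:
                index[name].append((domain, otype, tuple(entry.get("roles", ()))))
    return dict(index)


print(find_nested(SAMPLE, "Match"))
print(build_index(SAMPLE)["re.compile"])
```

The inverted index is the [object name] -> [object reference] direction, at the cost of building it after loading the file.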

Choice of references.<ext> Format

If there's eventually a sphinx-ref find, I don't think the format of the output matters too much. As long as it's a standard, open format, anybody who wants to can interface with it. Format thoughts:

  • JSON would probably be the simplest format
    • Likely the easiest for manual exploration
    • Though the filesize question is real for large docsets, especially given that references.json would include all transitive references to intersphinx projects
      • All targets from the entire Python docs included in every references.json built...
  • SQLite does seem like a good option, giving a more compact file and sqlite is in stdlib
    • Manual exploration would be considerably more cumbersome, though
    • The schema would take some figuring out -- performance isn't a huge issue
      • One giant table, with domain and object_type columns?
      • One table per domain, with object_type columns? (Probably best?)
      • One table per domain/object_type combo? (Probably way too many tables)
  • Maybe tinydb? SQLite-like, but document database
    • Not in the stdlib, so it'd be a dependency both for Sphinx and for anyone trying to read it independently
    • But it fits the hierarchical data shape better, and it would be easier for manual exploration
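
To make the SQLite option more concrete, here is a sketch of the "one giant table" variant using the stdlib `sqlite3` module. The table and column names (`refs`, `roles`, `location`) and the inserted rows are invented for illustration; roles get their own table because one object type can map to several role names.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE refs (
            domain      TEXT NOT NULL,
            object_type TEXT NOT NULL,
            name        TEXT NOT NULL,
            location    TEXT,
            PRIMARY KEY (domain, object_type, name)
        );
        CREATE TABLE roles (
            domain      TEXT NOT NULL,
            object_type TEXT NOT NULL,
            role        TEXT NOT NULL
        );
    """)
    conn.executemany(
        "INSERT INTO refs VALUES (?, ?, ?, ?)",
        [("py", "class", "re.Match", "library/re.html#re.Match"),
         ("py", "function", "re.compile", "library/re.html#re.compile")],
    )
    conn.executemany(
        "INSERT INTO roles VALUES (?, ?, ?)",
        [("py", "class", "class"), ("py", "class", "exc"), ("py", "class", "obj"),
         ("py", "function", "func"), ("py", "function", "obj")],
    )
    return conn


def roles_for(conn, name):
    """Walk from an object name to every (domain, object type, role) for it."""
    rows = conn.execute(
        "SELECT r.domain, r.object_type, ro.role "
        "FROM refs AS r JOIN roles AS ro "
        "ON ro.domain = r.domain AND ro.object_type = r.object_type "
        "WHERE r.name = ? ORDER BY ro.role",
        (name,),
    )
    return rows.fetchall()


print(roles_for(build_db(), "re.Match"))
```

A name lookup becomes a single query, though exploring the file by hand needs a SQLite client instead of a text editor.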

@picnixz
Member

picnixz commented Mar 24, 2024

I'll comment tomorrow (for this one, I need a bit of sleep)

@jakobandersen
Contributor

A core problem is the use of domain.get_objects(). As alluded to in https://github.com/orgs/sphinx-doc/discussions/12152#discussioncomment-8877586 there is an inherent problem in intersphinx in that it assumes it knows how to write and read declared entities from each domain. The reading was mostly delegated to the domains, but the writing has not been yet.
Essentially I think we should figure out this delegation, including a new inventory/references format, before building more on top of the old problematic formats.

Currently get_objects() is used for only two purposes, as far as I can see: creating the index and creating inventories. The former is fine, as the fullname and dispname are only used for display purposes.
For inventories the fullname needs to encode all information about the entity in a string, so it can be loaded in again. This is not convenient for languages like C++ where the scoping information can be rather complex.
If I'm not mistaken, then this references builder is very similar to the inventory generation in its use of get_objects().

@picnixz
Member

picnixz commented Mar 24, 2024

Since we are talking about a new Intersphinx format, I would like you to also think about how to serialize the entries in the inventories, especially concerning #11932. After reading Jakob's argument, I also think that domains should be responsible for serializing their intersphinx part however they see fit. It could also solve multiple issues that I did not necessarily spot when implementing #11932; if each domain knows how to properly represent its references in intersphinx, it would be better.

In addition, we could change the format of a specific domain (e.g., if there are bugs) without affecting the format of other domains. I suggest using the same approach as for ELF where there is a header section containing the location of each program section. Then each domain would serialize its own intersphinx inventory and intersphinx would only be responsible for merging the parts together. Then, each domain would deserialize its dedicated section and recover its references mapping.

The references builder you are suggesting would be responsible for normalizing each domain's output into a more human-readable format. In the JSON output, you would include "human-readable" entries plus an offset and buffer size pointing to the serialized data in the objects.inv binary file. That way, you can use it to recover a single referenceable entity, and use it in a standalone manner as well.
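
A minimal sketch of the ELF-like layout described above, under invented assumptions: the `SREF` magic, the fixed 16-byte section names, and the `pack`/`read_section` helpers are all hypothetical and do not correspond to any existing Sphinx format. Each domain's payload is treated as opaque bytes that only the domain itself would know how to decode.

```python
import struct

HEADER = struct.Struct("<4sI")    # magic, number of sections
ENTRY = struct.Struct("<16sQQ")   # domain name (NUL padded), offset, size
MAGIC = b"SREF"                   # hypothetical magic value


def pack(sections):
    """Merge per-domain payloads behind a header table of (offset, size)."""
    offset = HEADER.size + ENTRY.size * len(sections)
    table, body = [], []
    for name, payload in sections.items():
        table.append(ENTRY.pack(name.encode("ascii"), offset, len(payload)))
        body.append(payload)
        offset += len(payload)
    return HEADER.pack(MAGIC, len(sections)) + b"".join(table) + b"".join(body)


def read_section(blob, wanted):
    """Return the raw bytes of one domain's section without decoding the rest."""
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a container in the sketched format")
    for i in range(count):
        raw, offset, size = ENTRY.unpack_from(blob, HEADER.size + i * ENTRY.size)
        if raw.rstrip(b"\0").decode("ascii") == wanted:
            return blob[offset:offset + size]
    raise KeyError(wanted)


blob = pack({"py": b"py-data", "cpp": b"\x00\x01cpp"})
print(read_section(blob, "cpp"))
```

The same (offset, size) pairs are what a JSON entry would carry if it pointed back into the binary file, so a single section or entity could be read without parsing everything else.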

@chrisjsewell
Member Author

Essentially I think we should figure out this delegation, including a new inventory/references format, before building more on top of the old problematic formats.
Since we are talking about a new Intersphinx format

See https://github.com/orgs/sphinx-doc/discussions/12204

@AA-Turner AA-Turner changed the title from "✨ Add references builder" to "Add a new references builder" Apr 9, 2024

5 participants