Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow using alternative markdown renderers #455

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

huonw
Copy link
Contributor

@huonw huonw commented May 6, 2020

Shelling out to pandoc twice for each markdown cell can be slow. Using a single-process/in-memory renderer for any markdown can be noticably faster.

This adds support for using commonmark-py via `nbsphinx_markdown_renderer = "commonmark".

For instance, builds of https://github.com/stellargraph/stellargraph's docs go from ~2 minutes to 40 seconds (stellargraph/stellargraph#1517).

This is a draft PR because there's several open questions, that would be good to collaborate on/clarify before I spend more effort on this:

  • is this a useful feature at all?
  • is there an alternative (or additional) Python library we can use? Preferably one that covers more common extensions like tables (Support Tables readthedocs/commonmark.py#28) and equations (Math equations readthedocs/commonmark.py#49). I briefly investigated:
    • mistune: the rST support and table extensions seem to only be available in the 2.0.0 alphas, but this cannot be installed as nbconvert depends on version 0.8.4
    • Python-Markdown: has a lot of extensions, but seems to only support targeting HTML
    • m2r: not maintained
    • m2rr: maybe not popular enough?
    • recommonmark: unclear to me how to use this for this particular case
  • is it worth generalising this even more? e.g. this could allow nbsphinx_markdown_renderer to be set to any subclass of nbsphinx.MarkdownRenderer (in addition to a str name) to let users customise

(NB. the diff here is much smaller ignoring whitespace, to skip over the changed indentation of the markdown2rst: https://github.com/spatialaudio/nbsphinx/pull/455/files?w=1)

Shelling out to pandoc twice for each markdown cell can be very slow. Using a
single-process/in-memory renderer for any markdown can be noticably faster.

For instance, builds of https://github.com/stellargraph/stellargraph's docs go
from ~2 minutes to 40 seconds.
@pep8speaks
Copy link

Hello @huonw! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 752:80: E501 line too long (102 > 79 characters)
Line 1307:80: E501 line too long (170 > 79 characters)
Line 1314:80: E501 line too long (80 > 79 characters)
Line 1322:80: E501 line too long (81 > 79 characters)
Line 1323:80: E501 line too long (84 > 79 characters)
Line 1363:80: E501 line too long (82 > 79 characters)
Line 1414:1: E305 expected 2 blank lines after class or function definition, found 1
Line 1414:80: E501 line too long (97 > 79 characters)

@chrisjsewell
Copy link
Contributor

chrisjsewell commented May 6, 2020

I would point out: https://github.com/executablebooks/markdown-it-py, which has tables, among other extensions, and is already used in https://myst-parser.readthedocs.io, https://myst-nb.readthedocs.io/en/latest/ and https://jupyterbook.org 😁

@mgeier
Copy link
Member

mgeier commented May 6, 2020

is this a useful feature at all?

Yes, definitely!
But I'd like to do it a bit differently.

Instead of converting from Markdown to reST and then from reST to a docutils "doctree" object, I would like to do the conversion in one step, see #36.

is there an alternative (or additional) Python library we can use? Preferably one that covers more common extensions like tables (readthedocs/commonmark.py#28) and equations (readthedocs/commonmark.py#49).

Ideally this should be a generic extension point, so we wouldn't have to support a limited set of parsers.

is it worth generalising this even more? e.g. this could allow nbsphinx_markdown_renderer to be set to any subclass of nbsphinx.MarkdownRenderer (in addition to a str name) to let users customise

Yes, I think it should have at least that level of generality.

I'm not sure a class is the best API, though. I think a function should suffice.

I think that's what nbsphinx should do in the future:

  • create an empty "doctree"
  • iterate through all cells of the notebook
    • if it's a code cell: add the necessary nodes to the doctree
    • if it's a Markdown cell: call the configurable Markdown parser, which adds the necessary nodes to the doctree

This means, instead of Markdown "renderers" which generate reST code, we would need Markdown "parsers" that can append to a doctree.

A very important feature of those parsers would be to support piecemeal parsing, while not destroying the section hierarchy.

For example, one Markdown cell might "open" a section of a certain level.
If a code cell follows, this code cell would have to be added to the same section level.
And if another Markdown cell follows after that which has only text (and no section headers), this text should also be added to the right section level.

I don't know if the existing Markdown/CommonMark parsers with "doctree" output do support this feature.

@chrisjsewell Can MyST-Parser parse partial Markdown documents?

I think the most promising option for a parser would be something between markdown-it-py and MyST-Parser. Something that supports parsing into docutils doctrees but doesn't have the MyST extensions enabled by default.

@huonw
Copy link
Contributor Author

huonw commented May 10, 2020

Instead of converting from Markdown to reST and then from reST to a docutils "doctree" object, I would like to do the conversion in one step, see #36.

That issue has been open for a while, has there been more progress other than the comments on it? If not, is it worth pursuing this PR as an intermediate step?

One concern that I guess you might have is migration: at the moment I guess switching to docutils directly would be unobservable to a user (except for differences to Pandoc's rendering), whereas with this change (with the arbitrary-function version) it would be observable to users who have a custom def nbsphinx_markdown_renderer(markdown): ... function?

Ideally this should be a generic extension point, so we wouldn't have to support a limited set of parsers.

If this is generalised, would you want built-in support for any parsers other than Pandoc? At least with the current MD -> reST approach, the function required for other libraries is tiny (e.g. the core of the CommonMark one is two lines, and mistune would be similar).

I would point out: https://github.com/executablebooks/markdown-it-py, which has tables, among other extensions, and is already used in https://myst-parser.readthedocs.io, https://myst-nb.readthedocs.io/en/latest/ and https://jupyterbook.org 😁

@chrisjsewell I think this doesn't work with the current setup of nbsphinx, for the same reason as Python-Markdown (and mistune < 2): no reStructuredText rendering?

@mgeier
Copy link
Member

mgeier commented May 11, 2020

Instead of converting from Markdown to reST and then from reST to a docutils "doctree" object, I would like to do the conversion in one step, see #36.

That issue has been open for a while, has there been more progress other than the comments on it?

Not that I know of, except the things already discussed in this PR.

We would need an appropriate Markdown parser before #36 can be tackled.

The MyST/markdown-it-py project looks promising, but I think it doesn't have all the needed features (yet?).

If not, is it worth pursuing this PR as an intermediate step?

No, not when it exposes an implementation detail like the intermediate reST representation.

But if it just "invisibly" switches the Markdown parser, I'm all for it!

One concern that I guess you might have is migration: at the moment I guess switching to docutils directly would be unobservable to a user (except for differences to Pandoc's rendering), whereas with this change (with the arbitrary-function version) it would be observable to users who have a custom def nbsphinx_markdown_renderer(markdown): ... function?

Exactly.

Ideally this should be a generic extension point, so we wouldn't have to support a limited set of parsers.

If this is generalised, would you want built-in support for any parsers other than Pandoc?

I guess there should be some parser selected by default. I don't care which one, as long as it has the right features.
And ideally, Pandoc wouldn't be involved at all in the default workflow.

At least with the current MD -> reST approach, the function required for other libraries is tiny (e.g. the core of the CommonMark one is two lines, and mistune would be similar).

You would still have to take care of the Markdown extensions used in Jupyter.

Do you already have a feature-complete parser to replace Pandoc?

I have no problem with switching from Pandoc to another parser, as long as we don't expose the intermediate reST representation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants