[LaTex] code-block printed out of margin #8849

Closed
sebastien-riou opened this issue Feb 7, 2021 · 4 comments · Fixed by #8854

Comments

@sebastien-riou

@jfbu
Long hex strings are currently not handled correctly:
[Screenshot from 2021-02-07 12-37-17]

code:

DryGASCON128k56:

.. code-block:: shell

   $ python3 -m drysponge.drygascon128_aead e 000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F3031323334353637 000102030405060708090A0B0C0D0E0F "" ""
   28830FE67DE9772201D254ABE4C9788D

link to rst file: examples_cli.rst

Originally posted by @sebastien-riou in #8686 (comment)

@jfbu
Contributor

jfbu commented Feb 7, 2021

Thanks for the report. The mechanism for wrapping long code lines does not work here. It combines several distinct techniques:

  • take advantage of the Pygments mark-up and modify its meaning to insert potential linebreaks,
  • handle the space character in a special TeXnical way,
  • handle a few extra characters, not escaped by Pygments, in a similar TeXnical way.

For the digits 0123456789 and the letters ABCDEF, although it is possible in small TeX files to imitate what is done in the last two items, in real life this is simply a no-go.

(For example, the digits 0 to 9 appear in color specifications in the Pygments mark-up; if we start making them behave specially we immediately break the \color macro. And although the Pygments mark-up does not use the letters A..F in macro names, the font files loaded by LaTeX when switching fonts do; in particular \ProvidesFile has an F.)

So the only solution I see is for the Pygments library itself to identify hex strings and add dummy LaTeX mark-up around each digit and letter within them. Then we can achieve the desired result. Something like:

\PYGZhexdigit{7}\PYGZhexdigit{F}\PYGZhexdigit{0}

Short of that, I currently see no way. In a pure LaTeX workflow, where you write the document manually, a single macro \makethisbreakableateachcharacter{7F0314...} is enough. But it is even easier if the mark-up is already in place hex digit by hex digit, and what is tedious for a human is fine for a program!
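
A minimal Python sketch of the per-digit mark-up idea, for illustration only. The macro name \PYGZhexdigit is the hypothetical one from the comment above (Pygments does not define it), and both the helper name `mark_hex_digits` and the "at least 16 hex characters" threshold are assumptions made for this example. It works on plain text, not on real Pygments output.

```python
import re

# Hypothetical macro name taken from the comment above; Pygments does not define it.
HEX_MACRO = r"\PYGZhexdigit"


def mark_hex_digits(text, min_len=16):
    """Wrap every character of a long hex run in the dummy macro.

    A "hex run" is assumed to be at least `min_len` consecutive hex
    characters delimited by word boundaries; shorter runs are left alone.
    """
    pattern = re.compile(r"\b[0-9A-Fa-f]{%d,}\b" % min_len)

    def wrap(match):
        # One macro call per character, so LaTeX could allow a break after each.
        return "".join(f"{HEX_MACRO}{{{ch}}}" for ch in match.group(0))

    return pattern.sub(wrap, text)


if __name__ == "__main__":
    print(mark_hex_digits("e 7F0", min_len=3))
```

With such mark-up in place, the LaTeX side would only need to define the dummy macro so that it permits a line break after its argument.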

I suggest you open a ticket at their repo: https://github.com/pygments/pygments linking to this issue.
(I admit I have only partial knowledge of the LaTeX side of Pygments; perhaps it already has an option for this, which would be great.)

@sebastien-riou
Author

Thanks for the quick response. I think the problem is more general than hex strings. A code block may contain arbitrary strings, some of them very long, for example a base64-encoded message.
I have zero knowledge of the Sphinx to PDF conversion, but I think that if all the smart mechanisms in place fail to find a line break, a last-resort mechanism should "simply" break the line at the last character before hitting the margin.
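
The last-resort idea can be sketched in Python on plain text, assuming a fixed-width font and a known width in characters (neither is known to Sphinx in practice, as discussed further down). The function name `wrap_with_fallback` and the set of preferred break characters are assumptions for this illustration, not Sphinx behaviour.

```python
# Characters after which a break is considered "nice"; an assumption for this sketch.
BREAK_AFTER = set(" .,;:/-_")


def wrap_with_fallback(line, width):
    """Split `line` into chunks of at most `width` characters.

    Prefer to break right after the last "nice" character that fits;
    if there is none, cut at exactly `width` characters as a last resort.
    """
    chunks = []
    while len(line) > width:
        head = line[:width]
        # Position just after the last preferred break character, or 0 if none.
        cut = max((i + 1 for i, ch in enumerate(head) if ch in BREAK_AFTER), default=0)
        if cut == 0:
            cut = width  # last resort: hard cut at the margin
        chunks.append(line[:cut])
        line = line[cut:]
    chunks.append(line)
    return chunks


if __name__ == "__main__":
    for chunk in wrap_with_fallback("e 000102030405060708090A0B0C0D0E0F", 12):
        print(repr(chunk))
```

Joining the chunks always gives back the original line, so nothing is lost; only the break positions change.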

@jfbu
Contributor

jfbu commented Feb 7, 2021

Yes, you are right that the problem is more general. I also suspect we are unlikely to get this feature from Pygments: here the "shell" lexer is used, and what is a "string" for it? Not obvious. You don't want to cut in the middle of a shell command, when there may be a great many of them separated by ; and any word can be a shell command. Or it can be an innocent argument that we are allowed to cut into two pieces.

"simply" breaking the line could work but it will not be simple! To give some context here is how your input is converted:

\PYGZdl{} python3 \PYGZhy{}m drysponge.drygascon128\PYGZus{}aead e 000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F3031323334353637 000102030405060708090A0B0C0D0E0F \PYG{l+s+s2}{\PYGZdq{}\PYGZdq{}} \PYG{l+s+s2}{\PYGZdq{}\PYGZdq{}}
28830FE67DE9772201D254ABE4C9788D

You see the mark-up macros. They aren't too numerous in this case. Counting characters is not straightforward because of them, and we don't want to cut right inside their names. On the other hand, because we tell TeX that backslash, { and } have their usual meaning, it is not a catastrophe if the "cut" falls in the middle of an argument, i.e. {...} enclosed stuff: the unpaired { on line N forces TeX to keep reading until the matching } on the next line. So the typesetting macro gets almost its correct argument, except for a line break inside it. In normal TeX that would give a space token, but here the end of line has a special meaning, so the output will have the line break. Like this:
[Screenshot, 2021-02-07 at 22:44:28]
I manually split after A1B1. I think I can arrange for the continuation symbol you see at the start of the other lines to appear there too. There are technicalities here.

For standard code lines which are long for reasons other than "strings", the current approach, which lets TeX itself do the line breaks via its paragraph-building algorithm, works fine in general. There is #8686. If we apply the "simply pre-cut the line" approach, it will not completely escape #8686: as I explained above, some syntax highlighting macro may still fetch the whole thing, which will then be rendered in one "horizontal box", only with twice the normal vertical height. It will not be two stacked horizontal boxes, so it can't allow a page break.

Currently the main issue I see is that we don't know the line width in advance. The user can change the font, so we cannot know the character width for sure (it can be determined dynamically, but I am talking here about the Python side of things; I am focusing on Python parsing because doing it entirely on the LaTeX side could be feasible but would be complex). Code-blocks can appear in indented contexts, even in table cells or narrow columns. The Sphinx LaTeX builder has no means to know in advance how many characters make a line in the output. Even if we knew it, say the target width is 66 characters, it would take some work to identify, in the Pygments output including its LaTeX mark-up, where we can legitimately break.
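
To make the counting problem concrete, here is a rough Python sketch that counts the visible characters of a line of Pygments LaTeX output like the one shown above. It assumes only the two mark-up shapes seen in that sample: escapes of the form \PYGZxx{} (one printed character each) and styled runs of the form \PYG{style}{...}. The regular expression and the name `visible_length` are assumptions for this illustration; real output may contain other constructs.

```python
import re

# Tokenizer for the two mark-up shapes seen in the sample output above.
TOKEN = re.compile(
    r"(?P<escape>\\PYGZ[a-z]+\{\})"   # e.g. \PYGZdl{} : prints one character
    r"|(?P<open>\\PYG\{[^}]*\}\{)"    # e.g. \PYG{l+s+s2}{ : prints nothing
    r"|(?P<close>\})"                 # end of a styled run: prints nothing
    r"|(?P<char>.)",                  # any other character prints as itself
    re.DOTALL,
)


def visible_length(latex_line):
    """Count the characters that end up printed, ignoring the mark-up."""
    count = 0
    for match in TOKEN.finditer(latex_line):
        if match.lastgroup in ("escape", "char"):
            count += 1
    return count


if __name__ == "__main__":
    sample = r"\PYGZdl{} python3 \PYGZhy{}m drysponge.drygascon128\PYGZus{}aead e"
    print(visible_length(sample), len("$ python3 -m drysponge.drygascon128_aead e"))
```

The count is in characters only; turning it into a physical width still requires knowing the font and the available line width, which is the difficulty described above.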

Ideally we should also count how deeply we are nested inside braces at the cut location. We should add as many closing braces as needed, then re-insert the nested formatting macros at the start of the second part of what was split. If we do that, each successive part will occupy a "horizontal box" of its own, and page breaks will work.
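
A possible Python sketch of this brace-counting split, under the same assumptions about the mark-up as the sample output above (escapes \PYGZxx{} and styled runs \PYG{style}{...}) and with a width given in characters. The function name `split_balanced` is hypothetical, and this is not what Sphinx does; it only illustrates closing the open runs at the cut and reopening them on the next piece.

```python
import re

TOKEN = re.compile(
    r"(?P<escape>\\PYGZ[a-z]+\{\})"   # escaped character, one visible char
    r"|(?P<open>\\PYG\{[^}]*\}\{)"    # start of a styled run
    r"|(?P<close>\})"                 # end of a styled run
    r"|(?P<char>.)",                  # ordinary visible character
    re.DOTALL,
)


def split_balanced(latex_line, width):
    """Cut after every `width` visible characters, keeping braces balanced.

    At each cut, all currently open styled runs are closed, and the same
    runs are reopened at the start of the next piece.
    """
    pieces = []    # finished pieces
    current = []   # tokens of the piece being built
    stack = []     # openers of the styled runs we are currently inside
    visible = 0    # visible characters in the current piece
    for match in TOKEN.finditer(latex_line):
        kind, text = match.lastgroup, match.group(0)
        if kind == "open":
            stack.append(text)
            current.append(text)
        elif kind == "close":
            if stack:
                stack.pop()
            current.append(text)
        else:
            if visible == width:
                current.append("}" * len(stack))  # close everything still open
                pieces.append("".join(current))
                current = list(stack)             # reopen the same runs
                visible = 0
            current.append(text)
            visible += 1
    pieces.append("".join(current))
    return pieces


if __name__ == "__main__":
    for piece in split_balanced(r"\PYG{k}{abcdef} 0123456789", 4):
        print(piece)
```

The sketch cuts blindly at the given width, so it can still break in the middle of a keyword, and a run opened right at a cut leaves an empty group behind. It only shows that each piece can be made self-contained.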

Basically we need a parser for the Pygments LaTeX output... something could be done. The most satisfying result would be to let LaTeX itself do this parsing. No easy task, but it is the only way to adapt well to the line width.

But whether in Python or LaTeX, it will be very difficult not to break in the middle of a keyword. How do we distinguish things we can't break from things we can? The current approach allows breaks only at some punctuation and other special characters, and we can tell TeX to break preferably before or after them, at the cost of some space left at the end of the line.

edit: I see a way for the LaTeX code to detect that the native TeX process could not find a good break point; it would then be possible to fall back to the "cut at any cost" approach. This could try the ambitious method of counting braces and adding them, together with the syntax highlighting macros, at suitable points, or perhaps work with some TeX box manipulation. I will think about it.

The problem you raise is to allow more breakpoints. This is not feasible by handing the whole thing over to the TeX paragraph builder, because it is simply impossible to let digits and A..F become "active"; the only way would be a pre-analysis of the line. So the breaking must be done either on the Python side or on the LaTeX side, via some pre-parsing or some multi-pass approach (LaTeX has no try: mechanism, errors are always terminal...); pre-parsing could possibly locate the Pygments "keyword" mark-up and avoid breaking there.

Ideally I wish I could transfer the hard work to Pygments: its lexers can know the unbreakable keywords, and they could arrange that a break never falls in the middle of a keyword and that all nested formatting is closed at the break and restarted on the next line. The user would tell Pygments: my target width is 80 characters.

(I have edited to be a bit clearer, and also to be less verbose. It is hard because verbosity is my usual way to prepare to solve the problem... eventually, perhaps. It is true that #8686 is closely related to this: my latest thoughts on how to solve #8686 would be to write all the necessary LaTeX code myself rather than depend on the fancyvrb LaTeX package, and it is possible that some things coming to my mind for the current issue can also be done more easily with fresh code.)

@jfbu
Contributor

jfbu commented Feb 8, 2021

Turns out I may have a working solution. It will require some testing. See #8854.
