Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syntax highlighting skipped on literal_block with unicode #4225

Closed
abooij opened this issue Nov 8, 2017 · 14 comments
Closed

Syntax highlighting skipped on literal_block with unicode #4225

abooij opened this issue Nov 8, 2017 · 14 comments

Comments

@abooij
Copy link

abooij commented Nov 8, 2017

When using unicode in a language that accepts it, on a block of code that is accepted by pygments, the doc build breaks.

Problem

I am trying to get syntax highlighting for the following code:

Definition logeq_both_false {X Y : UU} : ¬X -> ¬Y -> (X <-> Y).

As you can see, this uses the unicode symbol ¬.

Procedure to reproduce the problem

As a test case, I try to compile the following index.rst in an otherwise vanilla sphinx project:

This breaks:

.. code-block:: coq

    Definition logeq_both_false {X Y : UU} : ¬X -> ¬Y -> (X <-> Y).

But this works:

.. code-block:: coq

    Definition logeq_both_false {X Y : UU} : X -> Y -> (X <-> Y).

Error logs / results

$ make html                     
Running Sphinx v1.6.5
loading pickled environment... done
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 1 source files that are out of date
updating environment: 0 added, 1 changed, 0 removed
reading sources... [100%] index                                                                          
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [100%] index                                                                           
/tmp/test/index.rst:3: WARNING: Could not lex literal_block as "coq". Highlighting skipped.
generating indices... genindex
writing additional pages... search
copying static files... done
copying extra files... done
dumping search index in English (code: en) ... done
dumping object inventory... done
build succeeded, 1 warning.

Build finished. The HTML pages are in _build/html.

Expected results

The original code is processed fine by pygments directly, hence sphinx should, too.
And if sphinx insists on breaking, it should tell me why.

Environment info

  • OS: linux
  • Python version: 3.6.2
  • Sphinx version: 1.6.5
  • Pygments version: 2.2.0
@jfbu
Copy link
Contributor

jfbu commented Nov 8, 2017

Additional info: the html produced from pygmentize -v -l coq -f html -o temp.coqout temp.coq looks like this

<div class="highlight"><pre><span></span><span class="kn">Definition</span> <span class="n">logeq_both_false</span> <span class="o">{</span><span class="n">X</span> <span class="n">Y</span> <span class="o">:</span> <span class="n">UU</span><span class="o">}</span> <span class="o">:</span> <span class="err">¬</span><span class="n">X</span> <span class="o">-&gt;</span> <span class="err">¬</span><span class="n">Y</span> <span class="o">-&gt;</span> <span class="o">(</span><span class="n">X</span> <span class="o">&lt;-&gt;</span> <span class="n">Y</span><span class="o">).</span>
</pre></div>

The bits <span class="err">¬</span> do not look encouraging... does err stand for "erroneous input"? I inserted this manually in some HTML produced by Sphinx (I used your working input and injected directly in html the above) and here is what it gives;

capture du 2017-11-08 17-03-21

Similarly I tried with the latex formatter of pygmentize, I picked up its produced code, replaced all \PY macros by \PYG as used by Sphinx, inserted it by manual surgery in a Sphinx LaTeX document and this gives this PDF output:

capture du 2017-11-08 16-56-29

So, it does seem that even on Pygmentize side the Unicode code-points are treated as some sort of error.

@jfbu
Copy link
Contributor

jfbu commented Nov 8, 2017

Even if Pygmentize 2.2.0's COQ lexer does not yet handle such input, the question remains why Sphinx would abort syntax highlighting altogether rather than keeping the "err"̀ class tags, which may be better than nothing as user can override the CSS to remove visual display of an error condition of pygmentize.

@jfbu
Copy link
Contributor

jfbu commented Nov 8, 2017

about the good first issue tag, I don't think we have a definitive policy here, so in the end I removed it as I don't like so much the idea of it actually... ;-)

@tk0miya
Copy link
Member

tk0miya commented Nov 11, 2017

Now sphinx uses raiseonerror filter of pygments. On my local, pygments considers the unicode characters as an error.

bash-3.2$ cat failed.coq
Definition logeq_both_false {X Y : UU} : ¬X -> ¬Y -> (X <-> Y).
bash-3.2$ pygmentize -l coq -F raiseonerror < failed.coq

*** Error while highlighting:
ErrorToken: \xac
   (file "/Users/tkomiya/work/sphinx/.tox/py27/lib/python2.7/site-packages/pygments/filters/__init__.py", line 196, in filter)
*** If this is a bug you want to report, please rerun with -v.
Definition logeq_both_false {X Y : UU} : 
bash-3.2$ cat suceeded.coq
Definition logeq_both_false {X Y : UU} : X -> Y -> (X <-> Y).
bash-3.2$ pygmentize -l coq -F raiseonerror < suceeded.coq
Definition logeq_both_false {X Y : UU} : X -> Y -> (X <-> Y).

@jfbu
Copy link
Contributor

jfbu commented Nov 11, 2017

Oh, I had missed the -F raiseonerror pygmentize option. If Sphinx has no way to instruct Pygments to distinguish Unicode character caused error from other types of error then prospects are bleak here.

@tk0miya
Copy link
Member

tk0miya commented Nov 11, 2017

I don't know why pygments raises an error because I don't know about coq language. Anyway, current Sphinx always uses raiseonerror mode of pygments. So it is nice to change this settings through options.

@abooij
Copy link
Author

abooij commented Nov 20, 2017

The code is valid coq as it is accepted by the compiler. Does that mean this should be additionally be reported with pygments?

@jfbu
Copy link
Contributor

jfbu commented Nov 20, 2017

@abooij it does look as being primarily a Pygments problem.

@abooij
Copy link
Author

abooij commented Nov 20, 2017

@jfbu, why do you say that this is "primarily" a pygments problem? After all, pygments' own output is usable, although indeed not perfect.

@jfbu
Copy link
Contributor

jfbu commented Nov 20, 2017

@abooij 1) the "raiseonerror" Pygments filter does not seem to have great granularity, it generates an exception when the lexer generates an error token, and 2) if Unicode characters are ok in Coq input (I don't know), then why does the Coq lexer generate an error token? Except if something is wrong in the way Sphinx calls Pygments, this puts most of the problem on the Pygments side.

@abooij
Copy link
Author

abooij commented Nov 20, 2017

@jfbu Well indeed there is a problem on the pygments side. However, I highly doubt pygments will ever be able to get Coq syntax (or any other moderately complicated modern syntax, for that matter) completely right. In those cases, it would be preferable if sphinx generated something workable, rather than give up completely as soon as some subtle aspect of a language's syntax is used.

(Unicode is OK in coq and widely used.)

@jfbu
Copy link
Contributor

jfbu commented Nov 20, 2017

I think @tk0miya concurred that an option to let Sphinx not use raiseonerror Pygments filter could be added to Sphinx. But then the user still has to use CSS for html or a custom redefinition in LaTeX preamble of \PYG@tok@err macro to decide what to do with the error highlighting itself.

Issue #4249 explains that Pygments' LaTeX formatter has a bug in that respect because it uses a \strut inside an \fcolorbox for highlighting error tokens and this causes an increased total height+depth to larger than normal baseline (also the character width is increased). That issue remains invisible currently because Sphinx renounces using the Pygments output if there are token errors reported by the lexer.

@asmeurer
Copy link
Contributor

Can this option to disable raiseonerror be added, or alternatively, could Sphinx just not use it at all? I want to syntax highlight some IPython console code and if the output contains a Unicode character (like it does with %timeit, with ±), the lexer produces an error token and it fails. But it could easily just syntax highlight it ignoring the error token.

This is how, for instance, the highlighter on GitHub works, and also virtually every Markdown parser I've ever used. If you put some text in ```python that contains some not valid Python, it just highlights what it knows and ignores the rest, e.g.,

In [1]: %timeit sum([i for i in range(10000)])
335 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

It seems that in general it should be possible to choose a highlighter and have it just do the best it can with the input. It will at least definitely color the parts of it that definitely are Python (or whatever language), but the way it works now, a single "bad" character means it doesn't color anything.

@AA-Turner
Copy link
Member

Fixed by 7d8df06.

A

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants