[linkcheck] PDF anchor (...pdf#anchor) leads to 'utf-8' codec can't decode byte ... #11041

Open
goekce opened this issue Dec 20, 2022 · 3 comments · May be fixed by #12197

Comments

goekce commented Dec 20, 2022

Describe the bug

Related to #7694.

Note that the query symbol ? is not required when using an anchor (i.e., #fragment).
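As a quick illustration of the note above, Python's standard `urllib.parse` splits both URL forms into the same fragment; the bare `?` only yields an empty query string. This is a sketch using one of the URLs from the reproduction, and it says nothing about how linkcheck itself parses them.

```python
from urllib.parse import urlsplit

with_query = urlsplit("https://docs.python.org/3/whatsnew/3.11.html?#whatsnew311-pep654")
without_query = urlsplit("https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-pep654")

# The bare "?" produces an empty query; the fragment is identical either way.
print(repr(with_query.query), with_query.fragment)
print(repr(without_query.query), without_query.fragment)
```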

How to Reproduce

index.rst:

`link1 <https://wci.llnl.gov/sites/wci/files/2020-08/LLNL-SM-654357.pdf?#page=226>`_
`link2 <https://wci.llnl.gov/sites/wci/files/2020-08/LLNL-SM-654357.pdf#page=226>`_
`link3 <https://docs.python.org/3/whatsnew/3.11.html?#whatsnew311-pep654>`_
`link4 <https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-pep654>`_

$ sphinx-build -b linkcheck . build

Output:

(           index: line    1) ok        https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-pep654
(           index: line    1) redirect  https://docs.python.org/3/whatsnew/3.11.html?#whatsnew311-pep654 - with unknown code to https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-pep654
(           index: line    1) broken    https://wci.llnl.gov/sites/wci/files/2020-08/LLNL-SM-654357.pdf#page=226 - 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
(           index: line    1) broken    https://wci.llnl.gov/sites/wci/files/2020-08/LLNL-SM-654357.pdf?#page=226 - 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Environment Information

Platform:              linux; (Linux-6.0.12-arch1-1-x86_64-with-glibc2.36)
Python version:        3.10.8 (main, Nov  1 2022, 14:18:21) [GCC 12.2.0])
Python implementation: CPython
Sphinx version:        5.3.0
Docutils version:      0.19
Jinja2 version:        3.1.2

andy-maier commented Jan 21, 2023

I have the same issue when linking to https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G53253.
Fails on all of our test environments, both with latest and with minimum versions: https://github.com/pywbem/nocasedict/actions/runs/3916927936

Circumvented by removing the #G53253 anchor.
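A possible alternative that keeps the anchor in the source is to switch off anchor checking with the `linkcheck_anchors` option in `conf.py`. This is only a sketch: it assumes the response body is read solely for anchor checks, it was not verified against the failing builds above, and it disables anchor validation for every link in the project.

```python
# conf.py (sketch): turn off anchor checking for the linkcheck builder.
# Trade-off: no #fragment in any document is verified any more.
linkcheck_anchors = False
```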

Quite some irony that linking to the Unicode standard runs into a UTF-8 issue :-) :-)

@jayaddison
Contributor

I'm not sure what to suggest as a remedy for this, but from investigating why it occurs: this problem is due to the anchor-checking mechanism expecting to parse HTML content.

When anchor-checking is enabled and content with a binary header is retrieved from a URI with an anchor fragment (#....), decoding of that data is likely to fail, unless it happens to be a format whose bytes are also valid UTF-8 text.
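The failure can be reproduced without Sphinx or the network. The sketch below assumes a typical PDF header: a version line followed by a comment line of high-bit bytes that marks the file as binary. The exact bytes of the PDFs linked in this issue were not inspected, so `pdf_head` is illustrative only.

```python
# Assumption: a typical PDF header, i.e. a version line followed by a comment
# line of high-bit bytes. Not the actual bytes of the files linked above.
pdf_head = b"%PDF-1.5\n%\xe2\xe3\xcf\xd3\n"

decode_error = None
try:
    # Treating the binary header as UTF-8 text, as an HTML parser would.
    pdf_head.decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc
    print(exc)
```

With these bytes, 0xe2 sits at offset 10 and is followed by a byte that is not a UTF-8 continuation byte, which has the same shape as the error linkcheck reports.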

In other words: it's not exactly the hash fragment in the URI that causes the problem, but it is a requirement for the problem to occur -- because for links without anchors, we don't need to read the HTTP response content, only the status line.

I think the most difficult decision for a fix is: what do we do for non-HTML formats like PDF when anchor-checking is enabled, and the hyperlink contains a fragment? Is it better to consider the result working (despite not checking for existence of a matching anchor destination), ignored/unchecked (informationally wasteful given that we have made a network request), or do something else?
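To make the options concrete, here is a minimal sketch of the first one: inspect the `Content-Type` header and only attempt anchor parsing for HTML responses. The names `HTML_TYPES`, `can_check_anchor`, and `classify`, and the status labels, are hypothetical; this is not Sphinx's implementation and not necessarily what #12197 does.

```python
# Hypothetical helper names and status labels, for illustration only.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def can_check_anchor(content_type):
    """Return True if the response looks like HTML that can be parsed for anchors."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_TYPES


def classify(content_type, anchor):
    """Decide what to do with a link whose request already succeeded."""
    if not anchor:
        return "working"
    if can_check_anchor(content_type):
        return "check-anchor"
    # Non-HTML body (for example a PDF): report success without reading it.
    return "working-anchor-unchecked"


print(classify("application/pdf", "page=226"))
print(classify("text/html; charset=utf-8", "whatsnew311-pep654"))
```

Swapping the last return value for an "ignored" label would model the second option; the header check itself stays the same.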

@jayaddison
Contributor

> only the status line

Self-nitpick: and sometimes response headers, I suppose - for example to handle rate-limiting.
