Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Close #6525: linkcheck: Add linkcheck_ignore_redirects and linkcheck_warn_redirects #9234

Merged
merged 11 commits into from Jul 6, 2021
4 changes: 4 additions & 0 deletions CHANGES
Expand Up @@ -43,6 +43,10 @@ Features added
text
* #9176: i18n: Emit a debug message if message catalog file not found under
:confval:`locale_dirs`
* #6525: linkcheck: Add :confval:`linkcheck_warn_redirects` to emit a warning
when the hyperlink is redirected
* #6525: linkcheck: Add :confval:`linkcheck_allowed_redirects` to mark
hyperlinks that are redirected to expected URLs as "working"
* #1874: py domain: Support union types using ``|`` in info-field-list
* #9097: Optimize the paralell build
* #9131: Add :confval:`nitpick_ignore_regex` to ignore nitpicky warnings using
Expand Down
30 changes: 30 additions & 0 deletions doc/usage/configuration.rst
Expand Up @@ -2527,6 +2527,28 @@ Options for the linkcheck builder

.. versionadded:: 1.1

.. confval:: linkcheck_allowed_redirects

A dictionary that maps a pattern of the source URI to a pattern of the canonical
URI. The linkcheck builder treats the redirected link as "working" when:

- the link in the document matches the source URI pattern, and
- the redirect location matches the canonical URI pattern.
tk0miya marked this conversation as resolved.
Show resolved Hide resolved

Example:

.. code-block:: python

linkcheck_allowed_redirects = {
# All HTTP redirections from the source URI to the canonical URI will be treated as "working".
r'https://sphinx-doc\.org/.*': r'https://sphinx-doc\.org/en/master/.*'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's no regex capture then this isn't checking what it implies it's checking. To allow for a more complete example without comparing the source and destination paths, I've added an allowance for redirecting to www. but this could be removed too.

Suggested change
r'https://sphinx-doc\.org/.*': r'https://sphinx-doc\.org/en/master/.*'
r'https://sphinx-doc\.org/': r'https://(www\.)?sphinx-doc\.org/.*'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the example should use patterns for both source and canonical URIs. So I think the current one is enough to describe the feature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is not just allowing redirects to the canonical version of a URL, it's allowing any redirects.

You may as well use r'https://sphinx-doc\.org/.*': r'https://sphinx-doc\.org/.*' because there isn't really any validation of the destination.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? The current rule verifies the pages under https://sphinx-doc.org/ to the URL under https://sphinx-doc.org/en/master/. It does not mean "any redirects".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's allowing practically any redirect, so not what I would consider a good example.

It's not far off allowing https://www\.example\.com/.* to https://www\.example\.com/.* which then causes linkcheck to ignore removed pages that just redirect to the home page.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If https://sphinx-doc.org/foo/bar redirects to https://sphinx-doc.org/other/path, it's not "allowed". Is this allowing any redirect? I'm confused.

}

It's helpful to enable :confval:`linkcheck_warn_redirects` to warn for URIs
causing unexpected HTTP redirection.

.. versionadded:: 4.1

.. confval:: linkcheck_request_headers

A dictionary that maps baseurls to HTTP request headers.
Expand Down Expand Up @@ -2647,6 +2669,14 @@ Options for the linkcheck builder

.. versionadded:: 3.4

.. confval:: linkcheck_warn_redirects

If true, emits a warning when the response for a hyperlink is a redirect.
It's useful to detect unexpected redirects under :option:`the warn-is-error
mode <sphinx-build -W>`. Default is ``False``.

.. versionadded:: 4.1
tk0miya marked this conversation as resolved.
Show resolved Hide resolved


Options for the XML builder
---------------------------
Expand Down
39 changes: 35 additions & 4 deletions sphinx/builders/linkcheck.py
Expand Up @@ -272,8 +272,12 @@ def process_result(self, result: CheckResult) -> None:
except KeyError:
text, color = ('with unknown code', purple)
linkstat['text'] = text
logger.info(color('redirect ') + result.uri +
color(' - ' + text + ' to ' + result.message))
if self.config.linkcheck_warn_redirects:
logger.warning('redirect ' + result.uri + ' - ' + text + ' to ' +
result.message, location=(filename, result.lineno))
else:
logger.info(color('redirect ') + result.uri +
color(' - ' + text + ' to ' + result.message))
self.write_entry('redirected ' + text, result.docname, filename,
result.lineno, result.uri + ' to ' + result.message)
else:
Expand Down Expand Up @@ -494,13 +498,23 @@ def check_uri() -> Tuple[str, str, int]:
new_url = response.url
if anchor:
new_url += '#' + anchor
# history contains any redirects, get last
if response.history:

if allowed_redirect(req_url, new_url):
return 'working', '', 0
tk0miya marked this conversation as resolved.
Show resolved Hide resolved
elif response.history:
# history contains any redirects, get last
code = response.history[-1].status_code
return 'redirected', new_url, code
else:
return 'redirected', new_url, 0

def allowed_redirect(url: str, new_url: str) -> bool:
for from_url, to_url in self.config.linkcheck_allowed_redirects.items():
if from_url.match(url) and to_url.match(new_url):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specification from the issue also asks for a mean to reuse match groups from the from_url in the to_url.

IIUC, the issue asks something similar to:

                from_match = from_url.match(url)
                if from_match:
                    to_url = from_match.expand(to_url)
                    if to_url.match(new_url):
                        return True

I am unsure if that is feasible at all.

From the sub() documentation (used by expand()):

Unknown escapes of ASCII letters are reserved for future use and treated as errors.
For example:

>>> import re
>>> to_url = r"^https://example.org/\w{2}/\1"
>>> match = re.match(r'https://example\.org/(.*)', 'https://example.org/the-real-page/')
>>> match.expand(to_url)
Traceback (most recent call last):
  File "/usr/lib/python3.9/sre_parse.py", line 1039, in parse_template
    this = chr(ESCAPES[this][1])
KeyError: '\\w'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.9/re.py", line 322, in _expand
    template = sre_parse.parse_template(template, pattern)
  File "/usr/lib/python3.9/sre_parse.py", line 1042, in parse_template
    raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \w at position 21

There’s always the option of documenting that the canonical URI cannot contain special sequences (e.g. \d, \w, etc.), but that’s sure to bite most people.


Another idea to tighten the validation would be to use named groups in both the source URI and canonical URI, and compare the named groups. Something like:

# conf.py
linkcheck_ignore_redirects = {
    'https://sphinx-doc.org/en/(?P<page>.*)': 'https://sphinx-doc.org/en/master/(?P<page>.*)',
}
# linkcheck
from_match = from_url.match(url)
to_match = to_url.match(to)
if from_match.groupdict() == to_match.group_dict():
    return True

That may not be as robust as one might expect. For example, I can imagine somebody documenting their regexp with capturing groups. IDK if the extra validation is worth the effort?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If https://www.sphinx-doc.org/en/stable/usage/quickstart.html redirects to something other than https://www.sphinx-doc.org/en/master/usage/quickstart.html then it implies that the page no longer exists in the stable version and it should be possible to consider that an error. The validation should be worth the effort if someone is taking the effort to link to /stable/ versions (although I'm not linking to readthedocs websites like this myself).

I would be inclined to use re.sub() and document it. Alternatively, do the group dict checking but only if the group name exists in both matches?

There may also need to be a regex and non-regex version because the need to escape every . for an accurate match may confuse people. Any examples should definitely use r'...' strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If https://www.sphinx-doc.org/en/stable/usage/quickstart.html redirects to something other than https://www.sphinx-doc.org/en/master/usage/quickstart.html then it implies that the page no longer exists in the stable version and it should be possible to consider that an error.

For a page that does not exist, the server should return a 404 in the first place.

I understand not all web servers behave that way, but let’s keep in mind we’re dealing with poorly configured web servers here. I wouldn’t advise implementing a brittle solution to work around bad behaving servers. I consider the re.sub() solution to be brittle:

  • disallowing \ in the to_url is sure to trip a good portion of users,
  • re.sub() was design to perform replacements in a template string, not in a regular expression. There’s no guarantee the regular expression will be valid after the templating. The \ is a symptom of it, but there may be others.

The comparison of groups seems more robust to me. One can adjust the groups to capture the same portion on both sides. It needs some work to experiment (e.g. try to implement the validation in the #6525 with that system). I’ll try it in the days to come.

There may also need to be a regex and non-regex version because the need to escape every . for an accurate match may confuse people.

I think it would mostly be confusing. There would be two ways of doing the same thing. Besides, there’s a strong probability the unexpected redirect URL will be different enough that the regexp won’t match, even if r"." is used instead r"\.". Not discussing that r"\." is tighter and correct, simply trying to explain why a second setting without regexp seems overkill to me.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @nomis says, the idea here is that it is possible to accept redirect targets matching a certain pattern to be considered "working" and, if they don't match, then they should still show up as "redirected".

I've had a play with the example that @francoisfreitag posted and think that we can make this work by escaping non-backreference escapes:

>>> import re
>>> to_url = r"^https://example.org/\w{2}/\1"
>>> match = re.match(r'https://example\.org/(.*)', 'https://example.org/the-real-page/')
>>> match.expand(to_url)
Traceback (most recent call last):
  File "/usr/lib/python3.9/sre_parse.py", line 1039, in parse_template
    this = chr(ESCAPES[this][1])
KeyError: '\\w'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.9/re.py", line 322, in _expand
    template = sre_parse.parse_template(template, pattern)
  File "/usr/lib/python3.9/sre_parse.py", line 1042, in parse_template
    raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \w at position 21
>>> to_url2 = re.sub(r'(\\[^0-9g])', r'\\\1', to_url)
>>> to_url
'^https://example.org/\\w{2}/\\1'
>>> to_url2
'^https://example.org/\\\\w{2}/\\1'
>>> match.expand(to_url2)
'^https://example.org/\\w{2}/the-real-page/'

So re.sub(r'(\\[^0-9g])', r'\\\1', to_url) seems to do the trick, escaping non-backreference escapes. It seems that .expand() also reverses this extra escaping for us.

return True

return False

def check(docname: str) -> Tuple[str, str, int]:
# check for various conditions without bothering the network
if len(uri) == 0 or uri.startswith(('#', 'mailto:', 'tel:')):
Expand Down Expand Up @@ -642,11 +656,25 @@ def run(self, **kwargs: Any) -> None:
hyperlinks[uri] = uri_info


def compile_linkcheck_allowed_redirects(app: Sphinx, config: Config) -> None:
"""Compile patterns in linkcheck_allowed_redirects to the regexp objects."""
for url, pattern in list(app.config.linkcheck_allowed_redirects.items()):
try:
app.config.linkcheck_allowed_redirects[re.compile(url)] = re.compile(pattern)
except re.error as exc:
logger.warning(__('Failed to compile regex in linkcheck_allowed_redirects: %r %s'),
exc.pattern, exc.msg)
finally:
# Remove the original regexp-string
app.config.linkcheck_allowed_redirects.pop(url)


def setup(app: Sphinx) -> Dict[str, Any]:
app.add_builder(CheckExternalLinksBuilder)
app.add_post_transform(HyperlinkCollector)

app.add_config_value('linkcheck_ignore', [], None)
app.add_config_value('linkcheck_allowed_redirects', {}, None)
app.add_config_value('linkcheck_auth', [], None)
app.add_config_value('linkcheck_request_headers', {}, None)
app.add_config_value('linkcheck_retries', 1, None)
Expand All @@ -657,6 +685,9 @@ def setup(app: Sphinx) -> Dict[str, Any]:
# commonly used for dynamic pages
app.add_config_value('linkcheck_anchors_ignore', ["^!"], None)
app.add_config_value('linkcheck_rate_limit_timeout', 300.0, None)
app.add_config_value('linkcheck_warn_redirects', False, None)

app.connect('config-inited', compile_linkcheck_allowed_redirects, priority=800)

return {
'version': 'builtin',
Expand Down
@@ -0,0 +1 @@
exclude_patterns = ['_build']
@@ -0,0 +1,2 @@
`local server1 <http://localhost:7777/path1>`_
`local server2 <http://localhost:7777/path2>`_
58 changes: 56 additions & 2 deletions tests/test_build_linkcheck.py
Expand Up @@ -23,6 +23,7 @@
import requests

from sphinx.builders.linkcheck import HyperlinkAvailabilityCheckWorker, RateLimit
from sphinx.testing.util import strip_escseq
from sphinx.util.console import strip_colors

from .utils import CERT_FILE, http_server, https_server
Expand Down Expand Up @@ -250,7 +251,7 @@ def log_date_time_string(self):


@pytest.mark.sphinx('linkcheck', testroot='linkcheck-localserver', freshenv=True)
def test_follows_redirects_on_HEAD(app, capsys):
def test_follows_redirects_on_HEAD(app, capsys, warning):
with http_server(make_redirect_handler(support_head=True)):
app.build()
stdout, stderr = capsys.readouterr()
Expand All @@ -265,10 +266,11 @@ def test_follows_redirects_on_HEAD(app, capsys):
127.0.0.1 - - [] "HEAD /?redirected=1 HTTP/1.1" 204 -
"""
)
assert warning.getvalue() == ''


@pytest.mark.sphinx('linkcheck', testroot='linkcheck-localserver', freshenv=True)
def test_follows_redirects_on_GET(app, capsys):
def test_follows_redirects_on_GET(app, capsys, warning):
with http_server(make_redirect_handler(support_head=False)):
app.build()
stdout, stderr = capsys.readouterr()
Expand All @@ -284,6 +286,58 @@ def test_follows_redirects_on_GET(app, capsys):
127.0.0.1 - - [] "GET /?redirected=1 HTTP/1.1" 204 -
"""
)
assert warning.getvalue() == ''


@pytest.mark.sphinx('linkcheck', testroot='linkcheck-localserver-warn-redirects',
freshenv=True, confoverrides={'linkcheck_warn_redirects': True})
def test_linkcheck_warn_redirects(app, warning):
with http_server(make_redirect_handler(support_head=False)):
app.build()
assert ("index.rst.rst:1: WARNING: redirect http://localhost:7777/path1 - with Found to "
"http://localhost:7777/?redirected=1\n" in strip_escseq(warning.getvalue()))
assert ("index.rst.rst:1: WARNING: redirect http://localhost:7777/path2 - with Found to "
"http://localhost:7777/?redirected=1\n" in strip_escseq(warning.getvalue()))
assert len(warning.getvalue().splitlines()) == 2


@pytest.mark.sphinx('linkcheck', testroot='linkcheck-localserver-warn-redirects',
freshenv=True, confoverrides={
'linkcheck_allowed_redirects': {'http://localhost:7777/.*1': '.*'}
})
def test_linkcheck_allowed_redirects(app, warning):
with http_server(make_redirect_handler(support_head=False)):
app.build()

with open(app.outdir / 'output.json') as fp:
records = [json.loads(l) for l in fp.readlines()]

assert len(records) == 2
result = {r["uri"]: r["status"] for r in records}
assert result["http://localhost:7777/path1"] == "working"
assert result["http://localhost:7777/path2"] == "redirected"
assert warning.getvalue() == ''


@pytest.mark.sphinx('linkcheck', testroot='linkcheck-localserver-warn-redirects',
freshenv=True, confoverrides={
'linkcheck_allowed_redirects': {'http://localhost:7777/.*1': '.*'},
'linkcheck_warn_redirects': True,
})
def test_linkcheck_allowed_redirects_and_linkcheck_warn_redirects(app, warning):
with http_server(make_redirect_handler(support_head=False)):
app.build()

with open(app.outdir / 'output.json') as fp:
records = [json.loads(l) for l in fp.readlines()]

assert len(records) == 2
result = {r["uri"]: r["status"] for r in records}
assert result["http://localhost:7777/path1"] == "working"
assert result["http://localhost:7777/path2"] == "redirected"
assert ("index.rst.rst:1: WARNING: redirect http://localhost:7777/path2 - with Found to "
"http://localhost:7777/?redirected=1\n" in strip_escseq(warning.getvalue()))
assert len(warning.getvalue().splitlines()) == 1


class OKHandler(http.server.BaseHTTPRequestHandler):
Expand Down