Close #6525: linkcheck: Add linkcheck_ignore_redirects and linkcheck_warn_redirects #9234

tk0miya · 2021-05-15T17:28:42Z

Feature or Bugfix

Feature

Purpose

Add a new confval; linkcheck_warn_redirects to emit a warning when
the hyperlink is redirected. It's useful to detect unexpected redirects
under the warn-is-error mode.
Add a new confval; linkcheck_ignore_redirects to ignore hyperlinks
that are redirected as expected.
refs: Link checker should be able to prohibit unknown redirects #6525

Add a new confval; `linkcheck_warn_redirects` to emit a warning when the hyperlink is redirected. It's useful to detect unexpected redirects under the warn-is-error mode.

sphinx/builders/linkcheck.py

Add a new confval; linkcheck_ignore_redirects to ignore hyperlinks that are redirected as expected.

doc/usage/configuration.rst

francoisfreitag

Overall the change looks good to me and is definitely an improvement. 👍

Perhaps add two other tests, with a non-matching key and a non-matching value. In both tests, assert the redirection triggers a warning?

doc/usage/configuration.rst

francoisfreitag · 2021-05-16T10:16:10Z

sphinx/builders/linkcheck.py

                    code = response.history[-1].status_code
                    return 'redirected', new_url, code
                else:
                    return 'redirected', new_url, 0

+        def ignored_redirect(url: str, new_url: str) -> bool:
+            for from_url, to_url in self.config.linkcheck_ignore_redirects.items():
+                if from_url.match(url) and to_url.match(new_url):


The specification from the issue also asks for a mean to reuse match groups from the from_url in the to_url.

IIUC, the issue asks something similar to:

from_match = from_url.match(url) if from_match: to_url = from_match.expand(to_url) if to_url.match(new_url): return True

I am unsure if that is feasible at all.

From the sub() documentation (used by expand()):

Unknown escapes of ASCII letters are reserved for future use and treated as errors.
For example:

>>> import re >>> to_url = r"^https://example.org/\w{2}/\1" >>> match = re.match(r'https://example\.org/(.*)', 'https://example.org/the-real-page/') >>> match.expand(to_url) Traceback (most recent call last): File "/usr/lib/python3.9/sre_parse.py", line 1039, in parse_template this = chr(ESCAPES[this][1]) KeyError: '\\w' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib/python3.9/re.py", line 322, in _expand template = sre_parse.parse_template(template, pattern) File "/usr/lib/python3.9/sre_parse.py", line 1042, in parse_template raise s.error('bad escape %s' % this, len(this)) re.error: bad escape \w at position 21

There’s always the option of documenting that the canonical URI cannot contain special sequences (e.g. \d, \w, etc.), but that’s sure to bite most people.

Another idea to tighten the validation would be to use named groups in both the source URI and canonical URI, and compare the named groups. Something like:

# conf.py linkcheck_ignore_redirects = { 'https://sphinx-doc.org/en/(?P<page>.*)': 'https://sphinx-doc.org/en/master/(?P<page>.*)', } # linkcheck from_match = from_url.match(url) to_match = to_url.match(to) if from_match.groupdict() == to_match.group_dict(): return True

That may not be as robust as one might expect. For example, I can imagine somebody documenting their regexp with capturing groups. IDK if the extra validation is worth the effort?

If https://www.sphinx-doc.org/en/stable/usage/quickstart.html redirects to something other than https://www.sphinx-doc.org/en/master/usage/quickstart.html then it implies that the page no longer exists in the stable version and it should be possible to consider that an error. The validation should be worth the effort if someone is taking the effort to link to /stable/ versions (although I'm not linking to readthedocs websites like this myself).

I would be inclined to use re.sub() and document it. Alternatively, do the group dict checking but only if the group name exists in both matches?

There may also need to be a regex and non-regex version because the need to escape every . for an accurate match may confuse people. Any examples should definitely use r'...' strings.

If https://www.sphinx-doc.org/en/stable/usage/quickstart.html redirects to something other than https://www.sphinx-doc.org/en/master/usage/quickstart.html then it implies that the page no longer exists in the stable version and it should be possible to consider that an error.

For a page that does not exist, the server should return a 404 in the first place.

I understand not all web servers behave that way, but let’s keep in mind we’re dealing with poorly configured web servers here. I wouldn’t advise implementing a brittle solution to work around bad behaving servers. I consider the re.sub() solution to be brittle:

disallowing \ in the to_url is sure to trip a good portion of users,

re.sub() was design to perform replacements in a template string, not in a regular expression. There’s no guarantee the regular expression will be valid after the templating. The \ is a symptom of it, but there may be others.

The comparison of groups seems more robust to me. One can adjust the groups to capture the same portion on both sides. It needs some work to experiment (e.g. try to implement the validation in the #6525 with that system). I’ll try it in the days to come.

There may also need to be a regex and non-regex version because the need to escape every . for an accurate match may confuse people.

I think it would mostly be confusing. There would be two ways of doing the same thing. Besides, there’s a strong probability the unexpected redirect URL will be different enough that the regexp won’t match, even if r"." is used instead r"\.". Not discussing that r"\." is tighter and correct, simply trying to explain why a second setting without regexp seems overkill to me.

As @nomis says, the idea here is that it is possible to accept redirect targets matching a certain pattern to be considered "working" and, if they don't match, then they should still show up as "redirected".

I've had a play with the example that @francoisfreitag posted and think that we can make this work by escaping non-backreference escapes:

>>> import re >>> to_url = r"^https://example.org/\w{2}/\1" >>> match = re.match(r'https://example\.org/(.*)', 'https://example.org/the-real-page/') >>> match.expand(to_url) Traceback (most recent call last): File "/usr/lib/python3.9/sre_parse.py", line 1039, in parse_template this = chr(ESCAPES[this][1]) KeyError: '\\w' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib/python3.9/re.py", line 322, in _expand template = sre_parse.parse_template(template, pattern) File "/usr/lib/python3.9/sre_parse.py", line 1042, in parse_template raise s.error('bad escape %s' % this, len(this)) re.error: bad escape \w at position 21 >>> to_url2 = re.sub(r'(\\[^0-9g])', r'\\\1', to_url) >>> to_url '^https://example.org/\\w{2}/\\1' >>> to_url2 '^https://example.org/\\\\w{2}/\\1' >>> match.expand(to_url2) '^https://example.org/\\w{2}/the-real-page/'

So re.sub(r'(\\[^0-9g])', r'\\\1', to_url) seems to do the trick, escaping non-backreference escapes. It seems that .expand() also reverses this extra escaping for us.

tests/test_build_linkcheck.py

doc/usage/configuration.rst

ngnpope

Thanks for taking this idea forward.

sphinx/builders/linkcheck.py

doc/usage/configuration.rst

ngnpope · 2021-05-17T10:03:32Z

sphinx/builders/linkcheck.py

                    code = response.history[-1].status_code
                    return 'redirected', new_url, code
                else:
                    return 'redirected', new_url, 0

+        def ignored_redirect(url: str, new_url: str) -> bool:
+            for from_url, to_url in self.config.linkcheck_ignore_redirects.items():
+                if from_url.match(url) and to_url.match(new_url):


As @nomis says, the idea here is that it is possible to accept redirect targets matching a certain pattern to be considered "working" and, if they don't match, then they should still show up as "redirected".

I've had a play with the example that @francoisfreitag posted and think that we can make this work by escaping non-backreference escapes:

>>> import re >>> to_url = r"^https://example.org/\w{2}/\1" >>> match = re.match(r'https://example\.org/(.*)', 'https://example.org/the-real-page/') >>> match.expand(to_url) Traceback (most recent call last): File "/usr/lib/python3.9/sre_parse.py", line 1039, in parse_template this = chr(ESCAPES[this][1]) KeyError: '\\w' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib/python3.9/re.py", line 322, in _expand template = sre_parse.parse_template(template, pattern) File "/usr/lib/python3.9/sre_parse.py", line 1042, in parse_template raise s.error('bad escape %s' % this, len(this)) re.error: bad escape \w at position 21 >>> to_url2 = re.sub(r'(\\[^0-9g])', r'\\\1', to_url) >>> to_url '^https://example.org/\\w{2}/\\1' >>> to_url2 '^https://example.org/\\\\w{2}/\\1' >>> match.expand(to_url2) '^https://example.org/\\w{2}/the-real-page/'

So re.sub(r'(\\[^0-9g])', r'\\\1', to_url) seems to do the trick, escaping non-backreference escapes. It seems that .expand() also reverses this extra escaping for us.

francoisfreitag · 2021-05-17T12:36:42Z

So re.sub(r'(\[^0-9g])', r'\\1', to_url) seems to do the trick, escaping non-backreference escapes. It seems that .expand() also reverses this extra escaping for us.

That seems to work in my testing 👍, I had not thought of that option. It’s quite clean, with an identified set of patterns to look for.

I’m not yet convinced the approach of mangling the user regexp is valid and worth the complexity. Unknown parts come from the URL, the content would also need escaping. Consider the following example:

>>> match = re.match(r'https://example\.org/(.*)', 'https://example.org/how-to-make-real-$$$/')
>>> match.expand(r"^https://example.org/\1")
'^https://example.org/how-to-make-real-$$$/'
>>> to_url = match.expand(r"^https://example.org/\1")
>>> re.match(to_url, 'https://example.org/how-to-make-real-$$$/')

Doesn’t match because $ has a different semantic in the regexp.

It looks like applying re.escape() on all match.groups() would help, but I’m not sure how to re-inject that in the user regexp.

A minor note, if we keep going down the route of crafting regular expressions, I think we should remove the regexp compilation step for the to_url.

Edit: Accidental send caused an intermediary version of the answer to be sent.

francoisfreitag · 2021-05-18T08:36:54Z

The escaping approach prevents placing backrefs in the to_url regexp. Not sure if that is an issue in practice tough.

Example regexp:

>>> import re
>>> re.match(r'https://example.org/([\w-]+)/\1/', 'https://example.org/foo/foo/')
<re.Match object; span=(0, 28), match='https://example.org/foo/foo/'>
>>> re.match(r'https://example.org/([\w-]+)/\1/', 'https://example.org/foo/bar/')
>>>

tk0miya · 2021-05-20T17:06:18Z

I updated this PR except for the capturing feature. I think it's another topic. So I'd like to separate it to another PR (and it's helpful if somebody will post it :-)

nomis · 2021-05-20T18:09:37Z

doc/usage/configuration.rst

+
+      linkcheck_working_redirects = {
+          # All HTTP redirections from the source URI to the canonical URI will be treated as "working".
+          r'https://sphinx-doc\.org/.*': r'https://sphinx-doc\.org/en/master/.*'


If there's no regex capture then this isn't checking what it implies it's checking. To allow for a more complete example without comparing the source and destination paths, I've added an allowance for redirecting to www. but this could be removed too.

Suggested change

r'https://sphinx-doc\.org/.*': r'https://sphinx-doc\.org/en/master/.*'

r'https://sphinx-doc\.org/': r'https://(www\.)?sphinx-doc\.org/.*'

I think the example should use patterns for both source and canonical URIs. So I think the current one is enough to describe the feature.

But this is not just allowing redirects to the canonical version of a URL, it's allowing any redirects.

You may as well use r'https://sphinx-doc\.org/.*': r'https://sphinx-doc\.org/.*' because there isn't really any validation of the destination.

What do you mean? The current rule verifies the pages under https://sphinx-doc.org/ to the URL under https://sphinx-doc.org/en/master/. It does not mean "any redirects".

It's allowing practically any redirect, so not what I would consider a good example.

It's not far off allowing https://www\.example\.com/.* to https://www\.example\.com/.* which then causes linkcheck to ignore removed pages that just redirect to the home page.

If https://sphinx-doc.org/foo/bar redirects to https://sphinx-doc.org/other/path, it's not "allowed". Is this allowing any redirect? I'm confused.

francoisfreitag

I believe this patch is an improvement to the linkcheck builder.

Before, it was not possible to treat redirects as errors. This patch offers the tools to do it. It also offers basic tools to verify the redirection occurs within the expected domain and path.

The redirection validation tools can be improved (e.g. by comparing match groups, or injecting backreferences captured from the source URI to the canonical URI). IMO, that should not prevent this first step from landing.

Thanks @tk0miya!

doc/usage/configuration.rst

Co-authored-by: François Freitag <mail@franek.fr>

doc/usage/configuration.rst

tk0miya · 2021-05-30T18:11:24Z

Now I'm debating to withdraw linkcheck_warn_redirects. After renaming to linkcheck_allowed_redirects, redirection that is not matched to the configuration is already considered as "disallowed". It's enough to emit a warning without linkcheck_warn_redirects.

I believe it's reasonable that the linkcheck builder emits warnings if "disallowed" redirection detected. In other words, it emits warnings if 1) linkcheck_allowed_redirects is configured and 2) a hyperlink having "redirected" status found.

Now linkcheck builder integrates `linkcheck_warn_redirects` into `linkcheck_allowed_redirects`. As a result, linkcheck builder will emit a warning when "disallowed" redirection detected via `linkcheck_allowed_redirects`.

tk0miya · 2021-07-06T17:20:44Z

It's time to release. I'm merging this. Please try the new option and let me know if something not working well.

Hopefully, it would be nice if somebody will send us a pull request that uses regexp as discussed above.

tk0miya added type:enhancement enhance or introduce a new feature builder:linkcheck labels May 15, 2021

tk0miya added this to the 4.1.0 milestone May 15, 2021

Close sphinx-doc#6525: linkcheck: Add linkcheck_warn_redirects

05eb2ca

Add a new confval; `linkcheck_warn_redirects` to emit a warning when the hyperlink is redirected. It's useful to detect unexpected redirects under the warn-is-error mode.

tk0miya force-pushed the 6525_linkcheck_warn_redirects branch 3 times, most recently from a975934 to 0515e8d Compare May 15, 2021 17:42

tk0miya commented May 15, 2021

View reviewed changes

sphinx/builders/linkcheck.py Show resolved Hide resolved

tk0miya marked this pull request as ready for review May 15, 2021 17:44

tk0miya requested a review from francoisfreitag May 15, 2021 17:44

tk0miya mentioned this pull request May 15, 2021

Link checker should be able to prohibit unknown redirects #6525

Closed

Close sphinx-doc#6525: linkcheck: Add linkcheck_ignore_redirects

707319a

Add a new confval; linkcheck_ignore_redirects to ignore hyperlinks that are redirected as expected.

tk0miya force-pushed the 6525_linkcheck_warn_redirects branch from 0515e8d to 707319a Compare May 15, 2021 17:48

nomis reviewed May 15, 2021

View reviewed changes

doc/usage/configuration.rst Outdated Show resolved Hide resolved

francoisfreitag reviewed May 16, 2021

View reviewed changes

nomis reviewed May 16, 2021

View reviewed changes

doc/usage/configuration.rst Outdated Show resolved Hide resolved

ngnpope reviewed May 17, 2021

View reviewed changes

tk0miya added 3 commits May 20, 2021 23:26

Rename linkcheck_ignore_redirects to linkcheck_allowed_redirects

ce9e2e6

Update docs

c97e9b1

Add testcase for combination of allow_redirects and warn_redirects

4201a84

nomis reviewed May 20, 2021

View reviewed changes

francoisfreitag approved these changes May 25, 2021

View reviewed changes

doc/usage/configuration.rst Outdated Show resolved Hide resolved

tk0miya and others added 2 commits May 31, 2021 02:54

Fix typo in doc/usage/configuration.rst

fe89a07

Co-authored-by: François Freitag <mail@franek.fr>

Merge branch '4.x' into 6525_linkcheck_warn_redirects

676834b

tk0miya commented May 30, 2021

View reviewed changes

doc/usage/configuration.rst Outdated Show resolved Hide resolved

doc: Fix indentation

48c80b1

tk0miya commented May 30, 2021

View reviewed changes

doc/usage/configuration.rst Outdated Show resolved Hide resolved

doc: Fix indentation

2887dd0

tk0miya added 2 commits May 31, 2021 21:53

linkcheck: Emit a warning for disallowed redirects

988a79d

Now linkcheck builder integrates `linkcheck_warn_redirects` into `linkcheck_allowed_redirects`. As a result, linkcheck builder will emit a warning when "disallowed" redirection detected via `linkcheck_allowed_redirects`.

Merge branch '4.x' into 6525_linkcheck_warn_redirects

5e5bca9

tk0miya merged commit b09acab into sphinx-doc:4.x Jul 6, 2021

tk0miya deleted the 6525_linkcheck_warn_redirects branch July 6, 2021 17:20

github-actions bot locked as resolved and limited conversation to collaborators Aug 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Close #6525: linkcheck: Add linkcheck_ignore_redirects and linkcheck_warn_redirects #9234

Close #6525: linkcheck: Add linkcheck_ignore_redirects and linkcheck_warn_redirects #9234

tk0miya commented May 15, 2021

francoisfreitag left a comment

francoisfreitag May 16, 2021

nomis May 16, 2021

francoisfreitag May 17, 2021

ngnpope May 17, 2021

ngnpope left a comment

ngnpope May 17, 2021

francoisfreitag commented May 17, 2021 •

edited

francoisfreitag commented May 18, 2021 •

edited

tk0miya commented May 20, 2021

nomis May 20, 2021

tk0miya May 22, 2021

nomis May 22, 2021

tk0miya May 23, 2021

nomis May 23, 2021

tk0miya May 23, 2021

francoisfreitag left a comment

tk0miya commented May 30, 2021

tk0miya commented Jul 6, 2021

	r'https://sphinx-doc\.org/.': r'https://sphinx-doc\.org/en/master/.'
	r'https://sphinx-doc\.org/': r'https://(www\.)?sphinx-doc\.org/.*'

Close #6525: linkcheck: Add linkcheck_ignore_redirects and linkcheck_warn_redirects #9234

Close #6525: linkcheck: Add linkcheck_ignore_redirects and linkcheck_warn_redirects #9234

Conversation

tk0miya commented May 15, 2021

Feature or Bugfix

Purpose

francoisfreitag left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ngnpope left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

francoisfreitag commented May 17, 2021 • edited

francoisfreitag commented May 18, 2021 • edited

tk0miya commented May 20, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

francoisfreitag left a comment

Choose a reason for hiding this comment

tk0miya commented May 30, 2021

tk0miya commented Jul 6, 2021

francoisfreitag commented May 17, 2021 •

edited

francoisfreitag commented May 18, 2021 •

edited