Link checker should be able to prohibit unknown redirects #6525

Closed
nomis opened this issue Jun 27, 2019 · 6 comments
Labels
builder:linkcheck type:enhancement enhance or introduce a new feature
Milestone

Comments

@nomis
Contributor

nomis commented Jun 27, 2019

Is your feature request related to a problem? Please describe.
A lot of links become stale or move. Good websites will provide redirects to the correct new location or return an HTTP error code. Bad websites will redirect to an unrelated page or the root of the website.

Preventing all redirects does not allow links to URLs like https://www.sphinx-doc.org/ which redirects to https://www.sphinx-doc.org/en/master/. It needs to be possible to allow these redirects but disallow others.

Describe the solution you'd like
It should be possible to prohibit unknown redirects by listing all of the allowed redirects as pairs of URLs.

Describe alternatives you've considered
Post-process linkcheck/output.txt by removing filenames and line numbers then sorting it and comparing it with known good output.
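A minimal sketch of that post-processing workaround follows. It assumes each line of linkcheck/output.txt has the shape "<file>:<line>: [<status>] <details>"; that shape and the helper names `normalise` and `unexpected_entries` are illustrative assumptions, not part of Sphinx. It also de-duplicates entries, which goes slightly beyond the sorting described above.

```python
import re

# Assumed line shape: "<file>:<line>: [<status>] <details>".
_PREFIX = re.compile(r"^.*?:\d+: (?=\[)")


def normalise(lines):
    """Strip the 'file:line: ' prefix, drop blanks, de-duplicate and sort."""
    stripped = (_PREFIX.sub("", line.strip()) for line in lines)
    return sorted({entry for entry in stripped if entry})


def unexpected_entries(current_lines, known_good_lines):
    """Return entries reported now that are absent from the known-good list."""
    known = set(normalise(known_good_lines))
    return [entry for entry in normalise(current_lines) if entry not in known]
```

Any entry returned by `unexpected_entries` is a redirect (or other result) that was not in the reviewed baseline, so a CI job could fail when the list is non-empty.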

Additional context
A link to https://blogs.windows.com/buildingapps/2016/12/02/symlinks-windows-10/ (which used to work) now redirects to https://blogs.windows.com/windowsdeveloper/. Linkcheck allows this but the original link is not valid and needs to be updated to the article's new URL of https://blogs.windows.com/windowsdeveloper/2016/12/02/symlinks-windows-10/.

Linkcheck should be able to report an error for this redirect.

@nomis nomis added the type:enhancement enhance or introduce a new feature label Jun 27, 2019
@tk0miya tk0miya added this to the some future version milestone Jun 28, 2019
@francoisfreitag
Contributor

If one can tell in advance where they are redirected, might as well use the direct link in the docs and skip the redirect.
Perhaps a step forward would be a new setting to treat redirects as errors?

@nomis
Contributor Author

nomis commented Nov 22, 2020

I provided a reason why I want to be able to link to a redirect, unless you think the base URL of sphinx itself should not be linkable?

@francoisfreitag
Contributor

I misread the issue originally. I was hoping all redirects could be replaced by the final version of the URL, but that’s not always possible.
In the provided example, sphinx-doc.org could redirect to a different language based on your language preferences. Replacing the link with the final version would force users to visit the English version of the page.

What do you think of a mapping in the config: {"original_URL": final_url}, perhaps named linkcheck_validate_redirects?
The behavior upon redirect would be:

  • original URL present in the mapping, verify the final URL matches the value from linkcheck_validate_redirects,
  • original URL not present, mark link as broken.
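The two rules above can be sketched as a small predicate. This is only an illustration of the proposal: the function name `redirect_is_allowed` and its parameters are hypothetical, and the mapping stands in for the suggested linkcheck_validate_redirects value.

```python
def redirect_is_allowed(original_url, final_url, validate_redirects):
    """Return True only if the redirect matches an entry in the mapping.

    validate_redirects maps an original URL to the one final URL it is
    expected to redirect to (the proposed config value, assumed to be a dict).
    """
    if original_url not in validate_redirects:
        # Rule 2: an unlisted original URL that redirects is treated as broken.
        return False
    # Rule 1: a listed original URL must land on the recorded final URL.
    return validate_redirects[original_url] == final_url
```

With `{"https://www.sphinx-doc.org/": "https://www.sphinx-doc.org/en/master/"}` as the mapping, the Sphinx home page redirect passes, while the blogs.windows.com redirect from the issue description fails because it is not listed.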

@nomis
Contributor Author

nomis commented Dec 13, 2020

For the sphinx-doc.org case I would not expect to specify the exact final URL because I don't care where it redirects to when I link to /. (It may decide that CI in another country should get a different language by default.)

If final_url could be None to allow any final URL, that would appear to work but I'd really want it to redirect within the same domain. If https://www.sphinx-doc.org/ redirects to https://this-domain-is-for-sale.example.com/sphinx-doc.org then the link is broken.

So final_url could be None, a string or a regex.

{"https://www.sphinx-doc.org/": None}
{"https://www.sphinx-doc.org/": "https://www.sphinx-doc.org/en/master/"}
import re
{"https://www.sphinx-doc.org/": re.compile(r"^https://www\.sphinx-doc\.org/.*$")}

Of course, when you start allowing regex in the final_url you might want to allow regex in the original_url and group references:

import re
{re.compile(r"^https://sphinx-doc\.org/(.*)$"): re.compile(r"^https://(?:www\.)?sphinx-doc\.org/\1$")}

There may be multiple conflicting mappings, if any one of them matches then the link is ok.
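As a rough sketch of the None / string / regex idea with "any one match is ok" semantics: the names `target_matches` and `any_rule_allows` are hypothetical, and the rules are passed as a list of pairs (an assumption made here because a dict cannot hold several entries for the same original URL). Regexes on the original URL side are not handled in this sketch.

```python
import re


def target_matches(expected, final_url):
    """Check one expected target: None, an exact string, or a compiled regex."""
    if expected is None:
        return True  # any final URL is acceptable
    if isinstance(expected, re.Pattern):
        return expected.match(final_url) is not None
    return expected == final_url


def any_rule_allows(original_url, final_url, rules):
    """rules is an iterable of (original_url, expected) pairs.

    Several pairs may share the same original URL; one match is enough.
    """
    return any(
        source == original_url and target_matches(expected, final_url)
        for source, expected in rules
    )
```

The regex form is what keeps a redirect inside the same domain: a pattern such as `^https://www\.sphinx-doc\.org/.*$` accepts any language or version path but rejects a redirect to an unrelated host.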

@ngnpope

ngnpope commented Apr 28, 2021

This is something I have just come across myself, and such a setting would be helpful to ignore the fact that a redirect happened - in other words, set the state as "working" instead of "redirected" as long as the target page is available.

Another example of a case where this would be helpful is wanting to ignore redirects in the case of documentation versions, e.g. .../en/stable/ redirecting to .../en/3.2/. In this case it is preferable to always link to the latest/stable version via a URL rewrite.

I could see a configuration along the following lines (very much what @nomis has specified above):

# Check that the link is "working" but don't flag as "redirected" unless the target doesn't match.
linkcheck_redirects_ignore = {
    r'^https://([^/?#]+)/$': r'^https://\1/(?:home|index)\.html?$',
    r'^https://(nodejs\.org)/$': r'^https://\1/[-a-z]+/$',
    r'^https://(pip\.pypa\.io)/$': r'^https://\1/[-a-z]+/stable/$',
    r'^https://(www\.sphinx-doc\.org)/$': r'^https://\1/[-a-z]+/master/$',
    r'^https://(pytest\.org)/$': r'^https://docs\.\1/[-a-z]+/\d+\.\d+\.x/$',
    r'^https://github\.com/([^/?#]+)/([^/?#]+)/blob/(.*)$': r'^https://github\.com/\1/\2/tree/\3$',
    r'^https://([^/?#\.]+)\.readthedocs\.io/$': r'^https://\1\.readthedocs\.io/[-a-z]+/(?:master|latest|stable)/$',
    r'^https://dev\.mysql\.com/doc/refman/': r'^https://dev\.mysql\.com/doc/refman/\d+\.\d+/',
    r'^https://docs\.djangoproject\.com/': r'^https://docs\.djangoproject\.com/[-a-z]+/\d+\.\d+/',
    r'^https://docs\.djangoproject\.com/([-a-z]+)/stable/': r'^https://docs\.djangoproject\.com/\1/\d+\.\d+/',
}
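One way such a mapping could be evaluated is sketched below. It assumes that `\1`, `\2`, ... in the target pattern refer to groups captured from the original URL, which is how the example above reads; the helper names `expand_target` and `redirect_is_ignored` are hypothetical and this is not how Sphinx itself implements anything.

```python
import re

# A backslash followed by digits, e.g. \1 or \2 (but not \d or \.).
_GROUP_REF = re.compile(r"\\(\d+)")


def expand_target(target_pattern, source_match):
    """Replace \\N in the target pattern with the escaped text of source group N."""
    def substitute(ref):
        captured = source_match.group(int(ref.group(1))) or ""
        return re.escape(captured)
    return _GROUP_REF.sub(substitute, target_pattern)


def redirect_is_ignored(original_url, final_url, rules):
    """Return True if any source pattern matches and its target accepts final_url."""
    for source_pattern, target_pattern in rules.items():
        source_match = re.match(source_pattern, original_url)
        if source_match is None:
            continue
        if re.match(expand_target(target_pattern, source_match), final_url):
            return True
    return False
```

The captured text is passed through `re.escape` before being spliced in, so a host such as pytest.org is matched literally rather than with "." acting as a wildcard.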

@tk0miya tk0miya modified the milestones: some future version, 4.1.0 May 9, 2021
tk0miya added a commit to tk0miya/sphinx that referenced this issue May 9, 2021
Add a new confval; `linkcheck_warn_redirects` to emit a warning when
the hyperlink is redirected.  It's useful to detect unexpected redirects
under the warn-is-error mode.
tk0miya added a commit to tk0miya/sphinx that referenced this issue May 15, 2021
Add a new confval; linkcheck_ignore_redirects to ignore hyperlinks
that are redirected as expected.
@tk0miya
Member

tk0miya commented May 15, 2021

I have now posted #9234 to resolve this issue. Please let me know what you think if you have time.

@tk0miya tk0miya closed this as completed in 05eb2ca Jul 6, 2021
tk0miya added a commit that referenced this issue Jul 6, 2021
Close #6525: linkcheck: Add linkcheck_ignore_redirects and linkcheck_warn_redirects
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 6, 2021