
dead links galore #174

Open
anarcat opened this issue Jan 5, 2020 · 21 comments

Comments

@anarcat
Contributor

anarcat commented Jan 5, 2020

it seems dead links are not checked when changes are pushed to this repository, or at least there isn't a job doing that regularly, because I can easily find some. ;) I don't remember the ones I found the last time (and unfortunately did not report), but today I found two in the whatis page:

There might be other dead links on the site worth fixing.

@rhatdan
Member

rhatdan commented Jan 6, 2020

Thanks @anarcat, would you like to open a PR to fix the links?

@anarcat
Contributor Author

anarcat commented Jan 6, 2020

i'm kind of busy right now, and this was more of a meta-issue than a report about those two specific links... i would suggest creating a CI step that checks the links on push, so that this doesn't happen again. while i could fix those links with a rather small PR, the problem will recur unless such a process is set up, and that part i'm not familiar enough with to fix.

anarcat added a commit to anarcat/podman.io that referenced this issue Jan 6, 2020
Those were found by accident, there are probably others on site.

See containers#174.
@anarcat
Contributor Author

anarcat commented Jan 6, 2020

@rhatdan PR in #175 but the broader issue will need more work.

@anarcat
Contributor Author

anarcat commented Jan 6, 2020

@TomSweeneyRedHat asked in #175 which tools could be used to automate such checks... since this is a static website, what you want is a link checker. i happen to have inherited the maintenance of such a tool, called exactly that: linkchecker. it's kind of clunky and old, but it generally works. by default, it spiders the whole site but doesn't check external URLs, so it doesn't find the broken links I reported in #175 (because they are external).

anarcat@curie:~(master)$ LANG=C.UTF-8 linkchecker https://podman.io/
INFO linkcheck.cmdline 2020-01-06 15:49:01,728 MainThread Checking intern URLs only; use --check-extern to check extern URLs.
LinkChecker 9.4.0              Copyright (C) 2000-2014 Bastian Kleineidam
LinkChecker comes with ABSOLUTELY NO WARRANTY!
This is free software, and you are welcome to redistribute it
under certain conditions. Look at the file `LICENSE' within this
distribution.
Get the newest version at http://wummel.github.io/linkchecker/
Write comments and bugs to https://github.com/wummel/linkchecker/issues
Support this project at http://wummel.github.io/linkchecker/donations.html

Start checking at 2020-01-06 15:49:01-004
10 threads active,    40 links queued,   50 links in 100 URLs checked, runtime 1 seconds
10 threads active,   102 links queued,  125 links in 242 URLs checked, runtime 6 seconds
10 threads active,    91 links queued,  137 links in 243 URLs checked, runtime 11 seconds
10 threads active,    74 links queued,  172 links in 261 URLs checked, runtime 16 seconds
10 threads active,    63 links queued,  214 links in 292 URLs checked, runtime 21 seconds
10 threads active,    47 links queued,  230 links in 292 URLs checked, runtime 26 seconds
10 threads active,    34 links queued,  243 links in 292 URLs checked, runtime 31 seconds
10 threads active,    21 links queued,  288 links in 324 URLs checked, runtime 36 seconds
10 threads active,     8 links queued,  318 links in 341 URLs checked, runtime 41 seconds
 3 threads active,     0 links queued,  359 links in 367 URLs checked, runtime 46 seconds

Statistics:
Downloaded: 963.18KB.
Content types: 5 image, 121 text, 0 video, 0 audio, 5 application, 2 mail and 229 other.
URL lengths: min=15, max=215, avg=52.

That's it. 362 links in 367 URLs checked. 0 warnings found. 0 errors found.
Stopped checking at 2020-01-06 15:49:48-004 (47 seconds)
anarcat@curie:~(master)$ 

it's possible to tell linkchecker to crawl external links, but then it becomes a web crawler and can potentially crawl the entire universe.

the way I use it for my site is to run this for every modified $URL:

linkchecker --check-extern --no-robots --recursion-level 1 --quiet --no-status $URL

so, for example, in the case of the affected page:

anarcat@curie:~(master)$ LANG=C.UTF-8 linkchecker --check-extern --no-robots --recursion-level 1 https://podman.io/whatis.html
LinkChecker 9.4.0              Copyright (C) 2000-2014 Bastian Kleineidam
LinkChecker comes with ABSOLUTELY NO WARRANTY!
This is free software, and you are welcome to redistribute it
under certain conditions. Look at the file `LICENSE' within this
distribution.
Get the newest version at http://wummel.github.io/linkchecker/
Write comments and bugs to https://github.com/wummel/linkchecker/issues
Support this project at http://wummel.github.io/linkchecker/donations.html

Start checking at 2020-01-06 15:50:33-004
 4 threads active,     0 links queued,    4 links in   8 URLs checked, runtime 1 seconds

URL        `https://github.com/containers/libpod/blob/master/docs/podman-generate-kube.1.md'
Name       `podman-generate-kube'
Parent URL https://podman.io/whatis.html, line 56, col 3
Real URL   https://github.com/containers/libpod/blob/master/docs/podman-generate-kube.1.md
Check time 1.398 seconds
Result     Error: 404 Not Found

URL        `https://github.com/containers/libpod/blob/master/docs/podman-play-kube.1.md'
Name       `podman-play-kube'
Parent URL https://podman.io/whatis.html, line 54, col 3
Real URL   https://github.com/containers/libpod/blob/master/docs/podman-play-kube.1.md
Check time 1.858 seconds
Result     Error: 404 Not Found

Statistics:
Downloaded: 3KB.
Content types: 2 image, 6 text, 0 video, 0 audio, 0 application, 0 mail and 0 other.
URL lengths: min=29, max=79, avg=44.

That's it. 8 links in 9 URLs checked. 0 warnings found. 2 errors found.
Stopped checking at 2020-01-06 15:50:35-004 (2 seconds)
anarcat@curie:~(master)$ 

the w3c also maintains its own crawler, called w3c-linkchecker, although I have less experience with it. i started using linkchecker because:

  1. w3c-linkchecker respects robots.txt, and I wanted to bypass that: even if I'm a bot, I should be able to check whether a resource exists at all
  2. w3c-linkchecker was very unlikely to accept a patch to change that, for obvious reasons
  3. I am tired of Perl, and linkchecker was written in Python

anyways, long story short: use a linkchecker, any linkchecker. :)
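To make the per-page invocation above concrete, it could be wrapped in a small script that a CI job runs over changed pages only. This is a sketch, not a tested implementation: `page_url` is a hypothetical helper, and the real mapping from a changed source file to its published URL depends on how Jekyll routes the site.

```shell
#!/bin/sh
# Sketch: check external links only on pages that changed.
# page_url is a hypothetical mapping from a changed source file
# (e.g. docs/whatis.md) to its published URL; adjust it to match
# how the site is actually built.
page_url() {
    echo "https://podman.io/$(basename "$1" .md).html"
}

# For each changed file passed as an argument, check just that page
# (recursion depth 1), including external links, ignoring robots.txt.
for f in "$@"; do
    linkchecker --check-extern --no-robots --recursion-level 1 \
        --quiet --no-status "$(page_url "$f")"
done
```

A CI job could feed this script the output of `git diff --name-only` against the base branch, so only touched pages are re-checked on each push.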

rhatdan pushed a commit that referenced this issue Jan 7, 2020
Those were found by accident, there are probably others on site.

See #174.
@dschier-wtd

Another solution would be to use a tool like textlint for markdown files. This can also be used for many other use cases, like line length, trailing whitespace, wording, spell checks, etc.

If you want to stick with Ruby (because of Jekyll), there is a tool called htmlproofer, which can be run right after the Jekyll build to check the generated HTML for validity.

@rhatdan
Member

rhatdan commented Jan 15, 2020

I don't think we are against any of these tools. If we get contributors who want to add PRs to verify the content, then we would definitely consider it.

@rhatdan
Member

rhatdan commented Apr 18, 2020

@anarcat @daniel-wtd Did you guys ever work on this?

@dschier-wtd

Nope, I have started to work on another issue. If you want, I can have a look at some basic testing afterwards.

@rhatdan
Member

rhatdan commented Apr 18, 2020

Well, any help you can give is appreciated. I don't prioritize one over the other.

@dschier-wtd

dschier-wtd commented May 15, 2020

Phew, that's a toooon of links. Is there currently any automation process to run checks on pull requests, like travis-ci or similar?

@TomSweeneyRedHat
Member

@cevich @edsantiago are either of you aware of automatic link checks we could use, per @daniel-wtd's question above?

@dschier-wtd

the link check can be provided by me. My interest is more like:

"what kind of automation options do we have for checks, based on pull requests" ;)

There is a ton of stuff out there like travis-ci, circle-ci, cirrus, etc. Depending on your preferences, we can use one of them, or I can provide some simple tests to be run manually.

@cevich
Member

cevich commented May 15, 2020

There is a ton of stuff out there like travis-ci, circle-ci, cirrus, etc.

My preference would be to use Cirrus-CI since it's already in such widespread use. Running tasks in containers doesn't require any special setup. In fact, before I go on PTO, I'll add this repo to the GitHub permissions list...

@cevich
Member

cevich commented May 15, 2020

...it's done. All you need is a .cirrus.yml file.
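For reference, a minimal `.cirrus.yml` for such a check might look like the sketch below. The task name, container image, and install step are assumptions, not a tested configuration; it installs linkchecker from PyPI and runs the same invocation discussed earlier in this thread.

```yaml
# Hypothetical .cirrus.yml -- task name, image, and commands are
# illustrative only, not a tested configuration.
linkcheck_task:
  container:
    image: python:3.8
  install_script: pip install linkchecker
  check_script: >
    linkchecker --check-extern --no-robots --recursion-level 1
    https://podman.io/
```

Cirrus CI picks up any top-level key ending in `_task` and runs each `*_script` step in order inside the named container.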

@dschier-wtd

@cevich thank you a ton :) I will start this weekend with some initial markdown / link checks.

@TomSweeneyRedHat
Member

@cevich @daniel-wtd AWESOME! Both of you get a Gold Star today! 🥇

@dschier-wtd

dschier-wtd commented May 18, 2020

Just a short note for me/for anybody who may be interested. I am working on implementing some basic checks as described here:

@parkr
Contributor

parkr commented Jan 9, 2021

This code worked well for me:

~/code/podman.io#master$ cat test_site_links.sh
#!/bin/bash

set -ex

docker run --rm \
  --volume="$PWD:/srv/jekyll" \
  -it jekyll/jekyll:pages \
  jekyll build

export HTML_PROOFER_VERSION=3.18.5
docker run --rm \
  --volume="$PWD/_site:/srv/podman.io" \
  "parkr/html-proofer:$HTML_PROOFER_VERSION" \
  /srv/podman.io

Running on macOS.

@parkr
Contributor

parkr commented Jan 9, 2021

This could be fairly easily converted into a GitHub Actions workflow if that is an acceptable platform to use for the Containers org.
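As a sketch of that conversion (the workflow file name and trigger are assumptions, not a tested config), the two docker commands above map directly onto workflow steps:

```yaml
# Hypothetical .github/workflows/linkcheck.yml -- untested sketch
# derived from the script in the previous comment.
name: Check site links
on: [pull_request]
jobs:
  htmlproofer:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build the site with Jekyll
        run: |
          docker run --rm --volume="$PWD:/srv/jekyll" \
            jekyll/jekyll:pages jekyll build
      - name: Check the generated HTML
        run: |
          docker run --rm --volume="$PWD/_site:/srv/podman.io" \
            parkr/html-proofer:3.18.5 /srv/podman.io
```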

@dschier-wtd

Oh, I totally forgot about this in the meantime. Thanks for the reminder. I will plan it for January.

@rhatdan
Member

rhatdan commented Jan 10, 2021

I am open to ideas on how to keep these blogs working. So whatever the community decides is fine with me.
