BillDietrich/linkcheckerhtml

HTML / XML / RSS link checker

VSCode extension that checks for broken links in an HTML, XML, RSS, PHP, or Markdown file.

Functionality

Checks for broken links in anchor-href, link-href, img-src, and script-src tags in the currently open HTML or PHP file. It checks HTTP/HTTPS links by trying to access them over the internet, and checks relative links (../folder/file.html) by checking whether the file exists on the local file system.
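As a rough illustration of the scan described above, here is a minimal sketch that finds href/src values line by line and decides how each would be checked. It is not the extension's actual code: the names findLinks, classifyTarget, FoundLink, and TAG_LINK are hypothetical, and the sketch assumes double-quoted attribute values with the tag name and attribute on the same line.

```typescript
// Hypothetical sketch; not the extension's real parser.
type LinkKind = "remote" | "local" | "other";

interface FoundLink {
  line: number;   // 0-based line index in the document
  target: string; // raw value of the href or src attribute
  kind: LinkKind;
}

// Matches <a href="...">, <link href="...">, <img src="...">, <script src="...">
// when tag name and attribute are on one line and the value is double-quoted.
const TAG_LINK = /<(?:a|link)\b[^>]*?\shref="([^"]*)"|<(?:img|script)\b[^>]*?\ssrc="([^"]*)"/gi;

function classifyTarget(target: string): LinkKind {
  if (/^https?:\/\//i.test(target)) return "remote";       // would be fetched over the network
  if (/^[a-z][a-z0-9+.-]*:/i.test(target)) return "other"; // mailto:, ftp:, and so on
  if (target.charAt(0) === "#") return "other";            // anchor inside the current file
  return "local";                                          // would be looked up on disk
}

function findLinks(text: string): FoundLink[] {
  const found: FoundLink[] = [];
  const lines = text.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    TAG_LINK.lastIndex = 0; // a global regex keeps its position between uses
    let m: RegExpExecArray | null;
    while ((m = TAG_LINK.exec(lines[i])) !== null) {
      const target = m[1] !== undefined ? m[1] : m[2];
      found.push({ line: i, target: target, kind: classifyTarget(target) });
    }
  }
  return found;
}
```

Because the scan works one line at a time, a tag split across two lines produces no match, which is consistent with the same-line restriction listed under Limitations.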

Checks both clearnet and onion (Tor) links.

Also checks for badly-formatted mailto links, and duplicate local anchors (anchor-name, anchor-id).

Also checks for working HTTPS equivalents of HTTP links.

Also checks for broken links in the currently open XML, RSS, or Markdown file.

Optionally checks for invalid characters and common mistakes (missing tag content, empty attribute value, more).

Also checks for errors in a small subset of semantic HTML tags (in HTML and PHP files): checks that each page has header, main, footer; checks that each heading is inside a section, article, or aside; checks that each section/article/aside has exactly one heading in it; checks that heading values are nested properly.

Use

Open an editor window on an HTML, XML, RSS, PHP, or Markdown file, and then press Alt+H.

Broken links are reported via the standard error/warning/information diagnostic icons in the lower-left of the UI.

Click on the diagnostic icons and numbers to open the diagnostics pane.

Click on a diagnostic line to see that link highlighted in the source file, then press Alt+T to open that URL in your browser.

If it's an HTTP link, press Alt+M to try to open the HTTPS equivalent of that URL in your browser.

Press Alt+L to clear all diagnostic messages generated by this extension.

Tip: After you press Alt+H and get diagnostics, work through the problems from bottom (last diagnostic) to top (first diagnostic). That way, the line numbers in the remaining diagnostics stay correct as you delete or add lines in the source.

To see/change settings for this extension, open Settings (Ctrl+,) / Extensions / "HTML / XML / RSS link checker".

To change the key-combinations for this extension, open File / Preferences / Keyboard Shortcuts and search for Alt+H or Alt+T or Alt+M or Alt+L.

Onion (Tor) links

Onion URLs look like https://1234567890123456.onion/something (16 chars before '.onion') or https://12345678901234567890123456789012345678901234567890123456.onion/something (56 chars before '.onion'). They are used to access dark-web sites through Tor Browser (usually).
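The length rule above can be sketched as a small check. This is only an illustration under assumptions: onionLabel and hasLegalOnionLength are hypothetical names, and the sketch looks only at the host part of the URL, which is stricter than the ".onion anywhere" test described under Quirks.

```typescript
// Hypothetical sketch of the 16-or-56-character rule for onion hostnames.

// Returns the label immediately before ".onion" in the host part of a URL,
// or null if the host does not end in ".onion".
function onionLabel(url: string): string | null {
  const m = /^[a-z][a-z0-9+.-]*:\/\/([^\/?#]+)/i.exec(url);
  if (m === null) return null;
  const host = m[1].toLowerCase().replace(/:\d+$/, ""); // drop any port
  const suffix = ".onion";
  if (host.length <= suffix.length) return null;
  if (host.substring(host.length - suffix.length) !== suffix) return null;
  const labels = host.substring(0, host.length - suffix.length).split(".");
  return labels[labels.length - 1];
}

// True when the label before ".onion" is 16 or 56 characters long.
function hasLegalOnionLength(url: string): boolean {
  const label = onionLabel(url);
  return label !== null && (label.length === 16 || label.length === 56);
}
```

For example, hasLegalOnionLength("https://1234567890123456.onion/something") is true, while a URL such as "https://short.onion/" fails the check.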

Checking validity of Onion (Tor) links

To use Alt+H to check onion links, you must have a Tor/socks proxy listening on 127.0.0.1:9050. On Linux:

sudo systemctl status tor   # should show an active Tor service
# if it's not active, try:
sudo systemctl start tor

sudo ss -lptu | grep :9050  # should show an active Tor listener

For more information see https://github.com/talmobi/tor-request#requirements

If you don't have a Tor/socks proxy listening, each onion link will give an error "Can't check onion URLs: no Tor/socks service listening on 127.0.0.1:9050".

While checking links, it doesn't matter whether the Tor Browser is running. Only the proxy is used.

Opening Onion (Tor) links in Tor Browser

[THIS FEATURE SEEMS TO BE BROKEN]

To use Alt+T or Alt+M to open onion links in the Tor Browser, you must have Tor Browser installed and running already. You have to launch it yourself; this extension won't launch it.

Also, on Linux, you must install "xdotool":

sudo apt install xdotool
xdotool --version

Then, for any bad onion link reported in the diagnostics, press Alt+T on it. If it's an "http://" onion link (illegal, I think), you can also press Alt+M on it. You should see focus switch to the Tor Browser, and the URL will be typed into the address bar and then accessed.

The connection to Tor Browser is not 100% reliable. The extension uses xdotool to send key-presses to the Tor Browser, which is fairly timing-dependent and one-way. If your system is busy, or Tor Browser is busy, or something else goes wrong, you may see the wrong things happen in Tor Browser (chars missing from the URL, or some dialogs popping open).

Semantic HTML

The body of the HTML page is expected to be structured like:

<body>
<header>STUFF</header>
<main>

<section>
<h1>HEADING</h1>
CONTENT

<section>
<h2>HEADING</h2>
MORE CONTENT
</section>

...

</section>

</main>
<footer>STUFF</footer>
</body>

This structure should improve the SEO and accessibility of your web pages.

If your pages are not structured like this, or you just don't want to bother checking Semantic HTML, change the setting "reportSemanticErrors" to "Don't report".

If a heading outside of any section and outside of main is found, it is assumed that your page is not using Semantic HTML at all, and no further checking of Semantic HTML is done.
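To make two of the rules concrete (each heading inside a section/article/aside, and exactly one heading per section/article/aside), here is a minimal stack-based sketch. It is an assumption about how such a check could work, not the extension's implementation: checkSectionHeadings is a hypothetical name, and the sketch ignores header/main/footer and does not stop early when a heading is found outside main.

```typescript
// Hypothetical sketch: counts headings that sit directly inside each
// section/article/aside and reports the ones that break the rules.
function checkSectionHeadings(html: string): string[] {
  const problems: string[] = [];
  const stack: { tag: string; headings: number }[] = [];
  // "h[1-6]" does not match <header> or <hr>, because a digit must follow the "h".
  const token = /<(\/?)(section|article|aside|h[1-6])\b[^>]*>/gi;
  let m: RegExpExecArray | null;
  while ((m = token.exec(html)) !== null) {
    const closing = m[1] === "/";
    const tag = m[2].toLowerCase();
    if (tag.charAt(0) === "h") {
      if (closing) continue; // only opening heading tags are counted
      if (stack.length === 0) {
        problems.push(`<${tag}> is outside any section/article/aside`);
      } else {
        stack[stack.length - 1].headings++;
      }
    } else if (!closing) {
      stack.push({ tag: tag, headings: 0 });
    } else {
      const top = stack.pop();
      if (top === undefined) {
        problems.push(`</${tag}> has no matching opening tag`);
      } else if (top.headings !== 1) {
        problems.push(`<${top.tag}> has ${top.headings} headings, expected exactly 1`);
      }
    }
  }
  for (let i = 0; i < stack.length; i++) {
    problems.push(`<${stack[i].tag}> is never closed`);
  }
  return problems;
}
```

Run against the example structure shown above, this sketch reports no problems; a section with no heading, or a heading outside every section, produces one message each.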

Settings

  • addExtensionToLocalURLsWithNone: If a local file URL has no extension, add this extension to the filename before checking (default is ""; don't include "." in the setting).

  • checkInternalLinks: Check #name links to targets inside current file (default is true).

  • checkMailtoDestFormat: Check format of email addresses in mailto links (default is true).

  • dontCheckURLsThatStartWith: Don't check URLs that start with any sequence in this comma-separated list (default is "127.,192.,localhost,[::1],[FC00:,[FD00:").

  • localRoot: String prepended to links that start with "/" (default is ".").

  • maxParallelThreads: Maximum number of links to check in parallel (range is 1 to 20; default is 20).

  • processIdAttributeInAnyTag: #name link can be to any tag with ID attribute inside current file (default is true).

  • reportBadChars: Report possible bad characters ? (default is [check and report] "as Information")

  • patternBadChars: RegEx pattern to match possible bad characters (default is "[^\\\\x09-\\\\x7E]"; if you use lots of non-English characters, maybe use "[\\x7F-\\x9F]" instead; thought "[\\x00-\\x08,\\x0E-\\x1F,\\x7F-\\xFF]" would be good but it fails).

  • reportHTTPSAvailable: Report if HTTP links have HTTPS equivalents that work ? (default is [check and report] "as Information")

  • reportNonHandledSchemes: Report links with URI schemes not checked by the checker, such as FTP and Telnet (default is "as Information").

  • reportPossibleMistakes: Report possible mistakes such as empty tags or attributes ? (default is [check and report] "as Warning")

  • patternsPossibleMistakes: Comma-separated list of RegEx patterns to match possible mistakes (default is " href=\"\", src=\"\", hreef=\",\"></a>,<h1></h1>,<h2></h2>,<h3></h3>,<h4></h4>,<b></b>,<i></i>,<u></u>").

  • reportRedirect: Report links that get redirected (default is "as Warning").

  • reportSemanticErrors: Report errors in semantic HTML tags such as main, section, article, aside, h1, etc (default is "as Information").

  • timeout: Timeout (seconds) for accessing a link (range is 5 to 30; default is 15).

  • torOpenURLCmd1: command (1) to open a URL in Tor Browser ('URL' will be appended; default is "xdotool search --onlyvisible --name 'Tor Browser' windowactivate --sync key --clearmodifiers --window 0 ctrl+t type --delay 100 ").

  • torOpenURLCmd2: command (2) to open a URL in Tor Browser (default is "xdotool search --onlyvisible --name 'Tor Browser' windowactivate --sync key --clearmodifiers --window 0 Return").

  • userAgent: User-Agent value used in GET requests (default is "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0").
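As an illustration of how the dontCheckURLsThatStartWith setting might be applied, here is a minimal sketch. It follows the rule stated under Quirks ("://" is prepended to each item before matching), but it is not the extension's actual code: shouldSkipURL and DEFAULT_DONT_CHECK are hypothetical names, and the case-insensitive comparison is an assumption.

```typescript
// Default value of the setting, copied from the list above.
const DEFAULT_DONT_CHECK = "127.,192.,localhost,[::1],[FC00:,[FD00:";

// Hypothetical sketch: true if the URL contains "://" followed by any item
// from the comma-separated list. Case folding is an assumption.
function shouldSkipURL(url: string, dontCheckList: string): boolean {
  const lowerURL = url.toLowerCase();
  const prefixes = dontCheckList.split(",");
  for (let i = 0; i < prefixes.length; i++) {
    const prefix = prefixes[i].trim().toLowerCase();
    if (prefix.length === 0) continue; // ignore empty items such as a trailing comma
    if (lowerURL.indexOf("://" + prefix) !== -1) return true;
  }
  return false;
}
```

With the default list, shouldSkipURL("http://localhost:8080/x", DEFAULT_DONT_CHECK) is true and shouldSkipURL("https://example.com/", DEFAULT_DONT_CHECK) is false.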

Limitations

  • HTML and PHP: Tag name and href/src/id attribute must be on the same line.

  • XML and RSS: Entire tag (for link, guid, and url tags) must be on the same line.

  • Doesn't know about comments; will find and check tags inside comments.

  • Checks "#name" links to targets in current file, but not in other local or remote files.

  • Doesn't check EVERY detail of the email address spec in mailto links. Just a cursory check.

  • XML: There are no standard tag and attribute names, so some links may not be checked.
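To give a feel for what a cursory mailto check could look like, here is a minimal sketch. It reflects points stated in this README (nothing after "?" is examined, and the domain needs at least one "."), but the remaining rules and the name mailtoLooksValid are assumptions, not the extension's code.

```typescript
// Hypothetical sketch of a cursory mailto format check.
function mailtoLooksValid(link: string): boolean {
  if (link.substring(0, 7).toLowerCase() !== "mailto:") return false;
  let address = link.substring(7);
  const q = address.indexOf("?");
  if (q !== -1) address = address.substring(0, q); // ignore "?subject=..." and the like
  if (/\s/.test(address)) return false;            // assumption: no whitespace allowed
  const at = address.indexOf("@");
  // exactly one "@", with a non-empty local part before it
  if (at <= 0 || at !== address.lastIndexOf("@")) return false;
  const domain = address.substring(at + 1);
  const dot = domain.indexOf(".");
  // at least one "." in the domain, and not as its first or last character
  return dot > 0 && domain.charAt(domain.length - 1) !== ".";
}
```

Under these assumptions, "mailto:a@b.com?subject=xyz" passes and "mailto:a@b" fails.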

Note that checking for broken links is more of an art than a science. Some sites don't actually return 404, but send you to a landing page. For example, Azure.com works this way. You can go to https://Azure.com/foo/bar and it will happily redirect you to a sub-page of https://azure.microsoft.com/, with no 404 status returned. So take a status of "OK" with a grain of salt - you may not be arriving at the page you intend.

Also, browsers seem to be more tolerant than the library used by this extension. This extension will report a lot of certificate-errors and such that browsers mostly ignore.

And checking is getting harder, with more URLs redirecting through GDPR-consent or cookie pages and such, or redirecting to same URL with a tracking parameter added, causing false positives.
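One way to reduce the tracking-parameter false positives mentioned above is to compare the requested and final URLs with their query strings removed. The sketch below is an assumption about how that comparison could be done, not the extension's logic; isUninterestingRedirect and withoutQueryAndFragment are hypothetical names.

```typescript
// Hypothetical helper: drops everything from the first "?" or "#" onward.
function withoutQueryAndFragment(url: string): string {
  const cut = url.search(/[?#]/);
  return cut === -1 ? url : url.substring(0, cut);
}

// Hypothetical sketch: a redirect is treated as not worth reporting when the
// final URL is identical, or is the requested URL with only a query added.
function isUninterestingRedirect(requested: string, redirected: string): boolean {
  if (requested === redirected) return true;
  const requestedHasQuery = requested.indexOf("?") !== -1;
  return !requestedHasQuery &&
    withoutQueryAndFragment(requested) === withoutQueryAndFragment(redirected);
}
```

A redirect from https://example.com/a to https://example.com/a?utm_source=feed would be treated as uninteresting, while a redirect from http:// to https:// would still be reported.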

Quirks

  • If there are multiple identical tags with identical link-targets on the same line (for example two Anchor tags with identical href targets), clicking on the diagnostic for any of them takes you to the first one in the source line.

  • Doesn't check ANY of the email address format after "?", as in "mailto:a@b.com?subject=xyz".

  • "://" is prepended to items in dontCheckURLsThatStartWith before matching; e.g. if you specify "localhost" the code searches for "://localhost" in URLs.

  • The checking in XML and RSS files is permissive, allowing known stuff from RSS, and likely stuff that could be in XML. Any attribute of the form *url="something" or *href="something" will be checked, as well as the standard RSS tags: link, guid, url.

  • Onion: a URL is considered "onion" if it starts with "https://" and contains ".onion" ANYWHERE in it.

  • Onion: if a URL starts with "http://", it will be treated as non-Onion, and the "https://" form of it will be checked as non-Onion too.

  • Semantic HTML: assumes that a section/article/aside will have a heading in it before it has any sub-section/article/aside.

  • Semantic HTML: section/article/aside without heading will be flagged (correctly) but may screw up the tracking of headings from that point on. Fix first such error and then scan again.

  • PHP: links are found if they look like links in HTML. For example, the PHP code is expected to look something like:

    echo '<a href="https://example.com">test1</a><br />';

    and not like:

    echo '<a href="https://' . $theDomainName . '">test1</a><br />';

    But the following would be okay:

    echo '<a href="https://example.com">' . $theTextOfTheLink . '</a><br />';

  • Markdown: in reference-style links, the URL will be checked, but there will be no check that both halves of a reference-style link exist. A reference-style link looks like:

    [hobbit-hole] [1]
    
    [1]: <https://example.com/hobbithole.html> "Hobbit hole"
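The permissive XML and RSS matching described in the list above (attributes ending in url or href, plus the link, guid, and url tags) could be sketched like this. It is only an illustration: findXmlLinkTargets is a hypothetical name, the regular expressions are assumptions, and the sketch works on a single line because the whole tag is expected to be on one line.

```typescript
// Hypothetical sketch: collects candidate link targets from one line of XML/RSS.
function findXmlLinkTargets(line: string): string[] {
  const targets: string[] = [];
  let m: RegExpExecArray | null;
  // Attributes whose name ends in "url" or "href", such as url="..." or xlink:href="..."
  const attr = /[\w:.-]*(?:url|href)="([^"]*)"/gi;
  while ((m = attr.exec(line)) !== null) {
    targets.push(m[1]);
  }
  // Text content of <link>, <guid>, or <url> opened and closed on this line
  const elem = /<(link|guid|url)\b[^>]*>([^<]*)<\/\1\s*>/gi;
  while ((m = elem.exec(line)) !== null) {
    const value = m[2].trim();
    if (value.length > 0) targets.push(value);
  }
  return targets;
}
```

Attribute values are collected first and element contents second, so the returned order does not necessarily match the order in the line.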
    

Install

From the Marketplace

Open Visual Studio Code and press F1; a field will appear at the top of the window. Type ext install linkcheckerhtml, press Enter, and reload the window to enable the extension.

From VSIX file

Either:

  • In CLI, do
code --install-extension linkcheckerhtml-n.n.n.vsix

or

  • In VSCode GUI, in the Extensions view "..." drop-down, select the "Install from VSIX" command.

From source code

  • Do a git clone to copy the source code to "linkcheckerhtml" in your home directory.
  • In CLI, cd linkcheckerhtml and then ./CopyToHomeToRunInNormal.sh

Releases

0.2.0

  • Copied from "Microsoft / linkcheckermd" and then greatly modified.
  • Extension works, but probably has memory leaks, not much testing.

0.3.0

0.4.0

  • Finally nailed that hang bug.
  • Added setting for timeout.
  • Fixed timeout and redirect settings.

0.5.0

  • Added Alt+T to open a URL in a browser.
  • First release with a VSIX file.

0.6.0

  • Got rid of: "href" or "src" has to be first attribute in the tag.
  • Require at least one "." in mailto address's domain.
  • Try to dispose memory properly to avoid leaks.
  • Handle local files with "?args" on the end.

0.7.0

  • Added localRoot setting.
  • Fixed mailto that ends with "?".
  • Added userAgent setting, and it definitely makes some sites happier.

1.0.0

  • Increased default timeout to 12.
  • Check local anchors (#name) in current file.
  • Support anchor-id (HTML5) as well as anchor-name.

1.1.0

  • Added settings about checking local anchors (#name) and ID attributes in current file.

1.2.0

  • Moved repeated add-diagnostic code into a function.

1.3.0

  • Added setting and code to check if HTTPS equivalent exists for HTTP address.
  • Added Alt+M to open current HTTP URL as an HTTPS URL in browser.

1.4.0

  • Briefly tested IPv6 addresses to see that at least they don't cause anything to blow up.
  • Set default user-agent string to latest Firefox.
  • Added dontCheckURLsThatStartWith setting and code.
  • Increased default timeout to 15.

1.5.0

  • Added "Using the extension" image.
  • Better message when 0 files left to do.
  • Added addExtensionToLocalURLsWithNone setting and code.

1.6.0

  • Added Alt+L to clear all diagnostics belonging to this extension.
  • Changed my email address.

1.7.0

  • Fixed README.

2.0.0

  • Added support for XML and RSS files.

2.1.0

  • Changed to Axios 0.19.0.
  • On redirected link, give new URL.

2.2.0

  • Updated package dependencies because of security warnings.
  • Don't report link that redirects to same link (but rare, usually something is different).
  • Don't report link that redirects to same link with a tracking parameter added (but rare, usually something else is different too).
  • Fix status when file contains zero links.

3.0.0

  • Added support for checking onion links. Simple pass or fail, consider redirect as pass, no way to control timeout or user-agent.
  • Onto new versions of VSCode and npm and node.
  • Updated default user-agent string to Firefox 76.

3.1.0

  • Made Alt+T or Alt+M on onion link open it in Tor Browser, using xdotool.

3.2.0

  • Flag onion links where domain name is illegal length.
  • Moved xdotool command line strings into settings. (Wayland will use ydotool ?)
  • Treat onion links that start with "http:" as clearnet links.

3.3.0

  • Somehow using xdotool to send onion links to Tor Browser has stopped working.

4.0.0

  • Added checking for bad characters and possible mistakes.
  • Onto new versions of VSCode and npm and node.
  • Updated default user-agent string to Firefox 83.
  • Various code cleanup.
  • Made regexes case-insensitive.

5.0.0

  • Added checking for semantic HTML errors.
  • Updated default user-agent string to Firefox 84.

5.1.0

  • Added support of PHP files as if they were HTML files.
  • Updated default user-agent string to Firefox 86.

6.0.0

  • Added support of Markdown files.

6.1.0

  • Added support for local Markdown links to heading IDs as in [link1](#heading1).
  • Added support for Markdown headings automatically becoming local IDs, with spaces converted to dashes.
  • Updated default user-agent string to Firefox 89.

6.2.0

  • Tweaked support for Markdown headings automatically becoming local IDs: take a heading, remove any leading spaces, change it to lowercase, remove everything not letter digit hyphen space, then change spaces to hyphens.
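The steps listed in this entry can be sketched as a single function. This is a minimal illustration, not the extension's code: headingToId is a hypothetical name, the input is assumed to be the heading text without the leading "#" markers, and "letter" is assumed to mean ASCII a-z.

```typescript
// Hypothetical sketch of the heading-to-ID steps described above.
function headingToId(heading: string): string {
  return heading
    .replace(/^ +/, "")            // remove any leading spaces
    .toLowerCase()                 // change to lowercase
    .replace(/[^a-z0-9\- ]/g, "")  // keep only letters, digits, hyphens, spaces
    .replace(/ /g, "-");           // change spaces to hyphens
}
```

For example, headingToId("  Getting Started!") gives "getting-started". Each space becomes its own hyphen, so a heading like "HTML / XML" becomes "html--xml" once the slash is removed.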

6.3.0

  • In Markdown, added requirement for [identifier at start of link.
  • Updated npm and modules.

Development

To-Do list

  • In Markdown, prevent collisions when generating implicit header IDs ? "# H" twice should generate IDs "h" and "h-1" ?
  • Add tasks to open and close all HTML files in directory, so linter reports any errors.
  • Maybe new axios has broken timeout ?
  • Somehow using xdotool to open onion link in Tor Browser has gotten broken.
  • Test onion links a lot more, maybe indicate redirects, any way to control timeout, set user-agent.
  • Better way to open onion link in Tor Browser ?
  • Way to open onion link in Tor Browser on Windows ?
  • Add setting "do/don't check onion links".
  • Snap version of VSCode uses Alt+H for Help menu.
  • Create automated tests.
  • Extension really is supposed to remove each diagnostic line after the corresponding source line is edited.
  • Bundle extension to make it smaller/faster ? https://code.visualstudio.com/api/working-with-extensions/bundling-extension https://webpack.js.org/guides/getting-started/
  • Can't really test IPv6 because my system and ISP have it turned off.
  • Allow single-quotes on attributes ? I thought HTMLHint didn't allow them, so I didn't support them.
  • Don't check a link if it has rel="nofollow" ? Probably should leave it as-is: check it.
  • Any way to do retries inside axios ? Apparently not.
  • Memory leaks ? Doesn't seem to be any tool to check an extension for leaking. Maybe not possible, since extensions are running inside a huge framework of Electron or Node or something.
  • Display a "busy" cursor ? Can't. Window.withProgress could put up a dialog, but then user would have to close the dialog manually every time, don't want that. Doesn't seem to be a way to close that dialog programmatically.
  • Click on diagnostic, do Alt+T or Alt+M to browser, come back to VSCode, cursor is in filter field of diagnostics pane instead of in source file. More convenient if in source file. But seems to be no way to do it.
  • Multi-line tag (tag name and href/src attribute on different lines) silently ignored. Would be a lot of work to deal with, given the simple way the code does parsing.

Development Environment

I'm no expert on this stuff, maybe I'm doing some things stupidly.

Now using:

  • Fedora 34 KDE with X.
  • VSCode deb 1.58.2 (which says Node.js: 14.16.0)
  • node --version # v14.17.0
  • npm --version # 6.14.13
  • axios
  • path
  • fs
  • tor-request

I did:

  • sudo apt install npm

  • sudo npm install -g vsce

In project directory:

  • npm install

  • npm audit

  • npm audit fix

  • sudo npm -g install --save axios

GitHub repo for this extension

Visual Studio Marketplace page for this extension

My web site


Privacy Policy

This extension doesn't collect, store or transmit your identity or personal information in any way. All it does is read the current editor window, do existence-tests on local files, open links to internet sites, and send internet links to your browser.