Introduce fragment checking for links to markdown files. #1126

HU90m · 2023-06-27T14:31:16Z

This covers part of what has been asked for in #185, specifically checking fragments for links to markdown files.

I've introduced the concept of a fragment generator because there doesn't appear to be a standard way in which fragments/anchors are generated from markdown files. HeadingAttributes uses the heading attribute syntax. UniqueKebabCase uses the method used by GitHub and mdBook to generate fragments.

At the moment, all fragment generators ignore id attributes within inline HTML. These could be included using html5ever in the same way as is done with links. It may make sense to add support for fragment checks for html pages at the same time. NB: this comment.

This probably needs a little polishing. Opening a draft PR because I wanted some feedback before doing any more.

One change that is definitely needed is some unit tests for extract_markdown_fragments.

The general approach is simple but as a result the file being linked to is parsed for fragment link for each link. This is probably worth the simplicity because parsing a file isn't very slow compared to a network access.

lychee-bin/src/options.rs

lychee-lib/src/extract/markdown.rs

lychee-lib/src/client.rs

mre · 2023-06-28T06:11:03Z

Awesome progress and a nice PR! Thanks for looking into this @HU90m.

I like your approach and that you created a draft first, to discuss the overall design.
The first draft looks good, and I'd love to move forward with this.
My main concern is the repeated file parsing, which you already mentioned; but I guess there is a workaround, which is relatively easy and performant by using a FragmentCache (see my comments for more info).

MichaIng · 2023-06-28T19:30:55Z

Many thanks for working on this. Just to assure and/or clarify:

This is to check fragments if the "target" of a link is a Markdown file, right?
So the use case is when you run lychee on the Markdown source code of e.g. an MkDocs site where links are defined to other Markdown files, while building the actual website translates them to HTML page links, like the website itself?

There are two relevant Markdown extension I am aware of:

And of course raw HTML can be usually embedded. Hence the ways and options about how IDs are defined and how the resulting ID in HTML looks like are quite many. Hence a concern I have is that implementing fragment checks for Markdown targets is either very complex (hence the multiple options/formats you planned), or it will only work for a small number of users.

But that does not mean that I am not fully supporting it. It is always good to have a start done for the use case/format you see most relevant. It can be extended at a later time after/if receiving feedback/requests about it.

I personally run lychee on the built HTML files of our MkDocs site, to have consistently HTML input and link targets checked. Those have the benefit of a consistent element ID syntax/standard which can be easily extracted. So it would work for everyone, and for external links as well, implied that the whole HTML document is downloaded. However, I know link checks are often done on source files with sometimes mostly internal links. We might even change our CI to benefit from this feature, if the markdown.extensions.toc auto-generated header IDs were supported.

HU90m · 2023-06-29T19:00:32Z

Hey @MichaIng , thanks for sharing your thoughts.

Yep this is to check fragments if the "target" of a link is a Markdown file, right

So the use case is when you run lychee on the Markdown source code of e.g. an MkDocs site where links are defined to other Markdown files, while building the actual website translates them to HTML page links, like the website itself?

Spot on, but I would like to be able to check html directly as well in the future.

I tend to put fragment generation in one of two categories: explicit and automatic. Heading attributes seems by far the most popular explicit method. It is what is an option in this PR at the moment (HeadingAttributes). This is just using the implementation in pulldown-cmark, which is compatible with the https://python-markdown.github.io/extensions/attr_list/ so long as you don't use the optional :, e.g. { #thing } not {: #thing }.

It looks like https://python-markdown.github.io/extensions/toc/ gives a lot of configuration options. I couldn't quite work out the default settings. Could you describe them?

I've laid out the big open questions (as I see them) in #185 . I thought I would move higher level discussions back to the issue. Then when we have a rough consensus of what is wanted, we can review implementation details in this (or possibly a different) PR.

mre · 2023-07-07T22:51:47Z

@HU90m, I just found this old branch, which contains some test cases for HTML fragment extraction. Feel free to copy some of that stuff over into your PR if you like. master...anchors

HU90m · 2023-07-12T17:55:22Z

Removed fragment generator options. Both unque kebab case and heading attribute fragments are always used.
Now using convert_case for kebab case conversion.
Added fragment cache.

Will add some tests tomorrow.

@HU90m, I just found this old branch, which contains some test cases for HTML fragment extraction. Feel free to copy some of that stuff over into your PR if you like.

Thank you! This looks useful.

HU90m · 2023-07-12T18:36:28Z

Rebased PR

mre · 2023-07-12T21:33:59Z

Awesome work! This will be extremely helpful once merged. I will do a review as soon as I get around.

mre

Great progress!
I've added a bunch of comments; mostly around splitting the logic into separate structs for separation of concerns. Not all of these suggestions have to be implemented for this PR to land, though. 😃

lychee-lib/src/client.rs

lychee-lib/src/extract/markdown.rs

lychee-lib/src/utils/request.rs

HU90m · 2023-07-13T19:07:44Z

FragmentChecker lives!

HU90m · 2023-07-13T19:30:33Z

Rebased. Also squashed a lot of my commits to make the rebase simpler.

HU90m · 2023-07-14T16:40:23Z

Fixed the failing tests

mre · 2023-07-17T11:10:52Z

I made a few adjustments to (hopefully) make the code a bit more idiomatic in 1001ccd.
I tried to write an integration test in 52b6dbf, but it's failing right now; don't know why.

mre · 2023-07-17T12:42:21Z

Sorry for the additional commits. We can squash them in the end.

HU90m · 2023-07-20T08:29:50Z

Thanks for adding some tests! This saved me quite a bit of time getting my head around the testing infrastructure.

The heading attributes test was failing because:

I hadn't set the parser up for the heading attributes extension
The pulldown-cmark parser doesn't parse a heading attribute if it's on the last line. This is probably a bug. To fix it, I moved the heading with an attributes so it wasn't on the last line.

HU90m · 2023-07-20T09:03:08Z

Rebased on master

HU90m · 2023-07-21T12:46:22Z

I was using lychee with this patch and noticed the convert_case crate doesn't kebab-ify in the same way as mdBook, GitHub and GitLab. It replaces _ with - whereas the others leave _ be, e.g. Hello_there friend becomes hello-there-friend as oppose to hello_there-friend.

The hand rolled algorithm we had before is the same as that used by mdBook. Should I revert the change to using the convert_case crate.

mre · 2023-07-21T13:26:42Z

Oh. Nice catch. Yes, maybe we should revert it then. We could still take over some of those tests for edge cases from the convert_case crate. If you find the time, you could also open an upstream issue or maybe even contribute your algorithm there.

mre

fyi @HU90m: I went ahead and changed expect to ErrorKind. This way, we can handle errors a little more up the call chain. We don't differentiate between the errors at the moment, but technically we could do so in the future. Right now, I've added to separate variants, InvalidFilenameUnicode and NoFileName, but we can also merge it into one and call it InvalidFile or so.

mre · 2023-07-21T22:03:53Z

lychee-lib/src/types/error.rs

+            (Self::TooManyRedirects(e1), Self::TooManyRedirects(e2)) => {
+                e1.to_string() == e2.to_string()
+            }
+            (Self::BasicAuthExtractorError(e1), Self::BasicAuthExtractorError(e2)) => {
+                e1.to_string() == e2.to_string()
+            }
+            (Self::Cookies(e1), Self::Cookies(e2)) => e1 == e2,


These changes aren't related to fragment checking, but I added the missing branches anyway while I was at it. I keep forgetting to do that in a separate PR. 😅

mre · 2023-07-28T08:58:08Z

Looks like you need to rebase on top of master one more time. Other than that, this should be good to go, right? Or do you plan any additional changes?

Oh! And maybe we can mention fragments in the feature comparison table in the README. ✨ It would be required to check if the other link checkers mentioned in the table have this feature, but it might be quick?

These fragments are ignored during the file link check. This commit allows their use in specific cases in the future.

Fragments/anchors are generated from markdown files using *unique kebab case* and the *heading attributes* commonmark extension.

- Enabled the extension in the parser. - Moved the heading with an attribute in the test file, because the parser doesn't pick them up if they are at the end of a file.

HU90m · 2023-07-30T13:38:14Z

I went ahead and changed expect to ErrorKind.

I never said. Thanks for doing this!

Looks like you need to rebase on top of master one more time. Other than that, this should be good to go, right? Or do you plan any additional changes?

Just rebased. Yep it's all good to go 😃

Oh! And maybe we can mention fragments in the feature comparison table in the README. sparkles It would be required to check if the other link checkers mentioned in the table have this feature, but it might be quick?

Yeah, it shouldn't take too long. I started doing it, but stopped because we don't have fragment checks up and running for html and inline html yet. I feel like we should wait until we've got those PRs in before advertising this feature.

mre · 2023-07-31T14:01:45Z

I decided to merge all file errors into InvalidFile as they are very similar.
This is now ready for prime time! 😎 🍿
Thanks for the PR and congrats on shipping this big feature. You 🪨 !

HU90m mentioned this pull request Jun 27, 2023

Add support for anchor tags #185

Closed

mre reviewed Jun 28, 2023

View reviewed changes

HU90m force-pushed the md_frags branch from 020572b to f300527 Compare July 12, 2023 18:36

HU90m mentioned this pull request Jul 13, 2023

[doc] Link to the section downloads the .hjson lowRISC/opentitan#19183

Closed

mre reviewed Jul 13, 2023

View reviewed changes

lychee-lib/src/client.rs Outdated Show resolved Hide resolved

lychee-lib/src/client.rs Show resolved Hide resolved

lychee-lib/src/extract/markdown.rs Outdated Show resolved Hide resolved

lychee-lib/src/utils/request.rs Outdated Show resolved Hide resolved

HU90m force-pushed the md_frags branch from 2f3f33a to af7db48 Compare July 13, 2023 19:29

HU90m force-pushed the md_frags branch from af7db48 to d0392dd Compare July 14, 2023 16:39

HU90m marked this pull request as ready for review July 14, 2023 16:39

HU90m force-pushed the md_frags branch 2 times, most recently from 48d4972 to 845f48f Compare July 14, 2023 17:28

HU90m force-pushed the md_frags branch from ae7efcd to e3c21de Compare July 20, 2023 09:02

HU90m force-pushed the md_frags branch 3 times, most recently from a63a7da to 918e708 Compare July 20, 2023 10:01

HU90m mentioned this pull request Jul 21, 2023

[doc] Transitioning more register tables to CMDGEN lowRISC/opentitan#19266

Merged

mre reviewed Jul 21, 2023

View reviewed changes

HU90m and others added 12 commits July 30, 2023 14:29

Fragments are now included in file links.

90b09d4

These fragments are ignored during the file link check. This commit allows their use in specific cases in the future.

Links to markdown files with fragments now have the fragments checked.

c3dc559

Fragments/anchors are generated from markdown files using *unique kebab case* and the *heading attributes* commonmark extension.

Make code more idiomatic

62cca36

Add integration test

b3941f1

Update docs

373ea86

Better default

3ff992f

Add missing CLI param for fragment test

35f9068

Remove html anchor from fragment test

30e2c48

Fixed heading attributes fragments.

4bacb3f

- Enabled the extension in the parser. - Moved the heading with an attribute in the test file, because the parser doesn't pick them up if they are at the end of a file.

Fragments weren't being ignored for non markdown files.

d1eeb29

Replace expect with ErrorKind

c743735

Changed to GitHab complient fragment generator

a082398

HU90m force-pushed the md_frags branch from e50d692 to a082398 Compare July 30, 2023 13:31

Combine file errors into one ErrorKind::InvalidFile variant

633a2d1

mre merged commit 8e63693 into lycheeverse:master Jul 31, 2023
6 checks passed

HU90m mentioned this pull request Aug 6, 2023

Check fragments in HTML files #1198

Merged

mre added the enhancement New feature or request label Aug 24, 2023

mre mentioned this pull request Oct 27, 2023

Many links in the table of contents are not working analysis-tools-dev/static-analysis#1439

Closed

BrewTestBot mentioned this pull request Jan 5, 2024

lychee 0.14.0 Homebrew/homebrew-core#159076

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce fragment checking for links to markdown files. #1126

Introduce fragment checking for links to markdown files. #1126

HU90m commented Jun 27, 2023

mre commented Jun 28, 2023

MichaIng commented Jun 28, 2023 •

edited

HU90m commented Jun 29, 2023 •

edited

mre commented Jul 7, 2023

HU90m commented Jul 12, 2023

HU90m commented Jul 12, 2023

mre commented Jul 12, 2023

mre left a comment

HU90m commented Jul 13, 2023 •

edited

HU90m commented Jul 13, 2023

HU90m commented Jul 14, 2023 •

edited

mre commented Jul 17, 2023

mre commented Jul 17, 2023

HU90m commented Jul 20, 2023

HU90m commented Jul 20, 2023

HU90m commented Jul 21, 2023 •

edited

mre commented Jul 21, 2023

mre left a comment

mre Jul 21, 2023

mre commented Jul 28, 2023

HU90m commented Jul 30, 2023

mre commented Jul 31, 2023

Introduce fragment checking for links to markdown files. #1126

Introduce fragment checking for links to markdown files. #1126

Conversation

HU90m commented Jun 27, 2023

mre commented Jun 28, 2023

MichaIng commented Jun 28, 2023 • edited

HU90m commented Jun 29, 2023 • edited

mre commented Jul 7, 2023

HU90m commented Jul 12, 2023

HU90m commented Jul 12, 2023

mre commented Jul 12, 2023

mre left a comment

Choose a reason for hiding this comment

HU90m commented Jul 13, 2023 • edited

HU90m commented Jul 13, 2023

HU90m commented Jul 14, 2023 • edited

mre commented Jul 17, 2023

mre commented Jul 17, 2023

HU90m commented Jul 20, 2023

HU90m commented Jul 20, 2023

HU90m commented Jul 21, 2023 • edited

mre commented Jul 21, 2023

mre left a comment

Choose a reason for hiding this comment

mre Jul 21, 2023

Choose a reason for hiding this comment

mre commented Jul 28, 2023

HU90m commented Jul 30, 2023

mre commented Jul 31, 2023

MichaIng commented Jun 28, 2023 •

edited

HU90m commented Jun 29, 2023 •

edited

HU90m commented Jul 13, 2023 •

edited

HU90m commented Jul 14, 2023 •

edited

HU90m commented Jul 21, 2023 •

edited