Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce fragment checking for links to markdown files. #1126

Merged
merged 13 commits into from
Jul 31, 2023

Conversation

HU90m
Copy link
Contributor

@HU90m HU90m commented Jun 27, 2023

This covers part of what has been asked for in #185, specifically checking fragments for links to markdown files.

I've introduced the concept of a fragment generator because there doesn't appear to be a standard way in which fragments/anchors are generated from markdown files. HeadingAttributes uses the heading attribute syntax. UniqueKebabCase uses the method used by GitHub and mdBook to generate fragments.

At the moment, all fragment generators ignore id attributes within inline HTML. These could be included using html5ever in the same way as is done with links. It may make sense to add support for fragment checks for html pages at the same time. NB: this comment.

This probably needs a little polishing. Opening a draft PR because I wanted some feedback before doing any more.

One change that is definitely needed is some unit tests for extract_markdown_fragments.

The general approach is simple but as a result the file being linked to is parsed for fragment link for each link. This is probably worth the simplicity because parsing a file isn't very slow compared to a network access.

lychee-bin/src/options.rs Outdated Show resolved Hide resolved
lychee-lib/src/extract/markdown.rs Show resolved Hide resolved
lychee-lib/src/extract/markdown.rs Outdated Show resolved Hide resolved
lychee-lib/src/extract/markdown.rs Outdated Show resolved Hide resolved
lychee-lib/src/client.rs Outdated Show resolved Hide resolved
lychee-lib/src/client.rs Outdated Show resolved Hide resolved
@mre
Copy link
Member

mre commented Jun 28, 2023

Awesome progress and a nice PR! Thanks for looking into this @HU90m.

I like your approach and that you created a draft first, to discuss the overall design.
The first draft looks good, and I'd love to move forward with this.
My main concern is the repeated file parsing, which you already mentioned; but I guess there is a workaround, which is relatively easy and performant by using a FragmentCache (see my comments for more info).

@MichaIng
Copy link
Member

MichaIng commented Jun 28, 2023

Many thanks for working on this. Just to assure and/or clarify:

  • This is to check fragments if the "target" of a link is a Markdown file, right?
  • So the use case is when you run lychee on the Markdown source code of e.g. an MkDocs site where links are defined to other Markdown files, while building the actual website translates them to HTML page links, like the website itself?

There are two relevant Markdown extension I am aware of:

And of course raw HTML can be usually embedded. Hence the ways and options about how IDs are defined and how the resulting ID in HTML looks like are quite many. Hence a concern I have is that implementing fragment checks for Markdown targets is either very complex (hence the multiple options/formats you planned), or it will only work for a small number of users.

But that does not mean that I am not fully supporting it. It is always good to have a start done for the use case/format you see most relevant. It can be extended at a later time after/if receiving feedback/requests about it.

I personally run lychee on the built HTML files of our MkDocs site, to have consistently HTML input and link targets checked. Those have the benefit of a consistent element ID syntax/standard which can be easily extracted. So it would work for everyone, and for external links as well, implied that the whole HTML document is downloaded. However, I know link checks are often done on source files with sometimes mostly internal links. We might even change our CI to benefit from this feature, if the markdown.extensions.toc auto-generated header IDs were supported.

@HU90m
Copy link
Contributor Author

HU90m commented Jun 29, 2023

Hey @MichaIng , thanks for sharing your thoughts.

Yep this is to check fragments if the "target" of a link is a Markdown file, right

So the use case is when you run lychee on the Markdown source code of e.g. an MkDocs site where links are defined to other Markdown files, while building the actual website translates them to HTML page links, like the website itself?

Spot on, but I would like to be able to check html directly as well in the future.

I tend to put fragment generation in one of two categories: explicit and automatic. Heading attributes seems by far the most popular explicit method. It is what is an option in this PR at the moment (HeadingAttributes). This is just using the implementation in pulldown-cmark, which is compatible with the https://python-markdown.github.io/extensions/attr_list/ so long as you don't use the optional :, e.g. { #thing } not {: #thing }.

It looks like https://python-markdown.github.io/extensions/toc/ gives a lot of configuration options. I couldn't quite work out the default settings. Could you describe them?

I've laid out the big open questions (as I see them) in #185 . I thought I would move higher level discussions back to the issue. Then when we have a rough consensus of what is wanted, we can review implementation details in this (or possibly a different) PR.

@mre
Copy link
Member

mre commented Jul 7, 2023

@HU90m, I just found this old branch, which contains some test cases for HTML fragment extraction. Feel free to copy some of that stuff over into your PR if you like. master...anchors

@HU90m
Copy link
Contributor Author

HU90m commented Jul 12, 2023

  • Removed fragment generator options. Both unque kebab case and heading attribute fragments are always used.
  • Now using convert_case for kebab case conversion.
  • Added fragment cache.

Will add some tests tomorrow.

@HU90m, I just found this old branch, which contains some test cases for HTML fragment extraction. Feel free to copy some of that stuff over into your PR if you like.

Thank you! This looks useful.

@HU90m
Copy link
Contributor Author

HU90m commented Jul 12, 2023

Rebased PR

@mre
Copy link
Member

mre commented Jul 12, 2023

Awesome work! This will be extremely helpful once merged. I will do a review as soon as I get around.

Copy link
Member

@mre mre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great progress!
I've added a bunch of comments; mostly around splitting the logic into separate structs for separation of concerns. Not all of these suggestions have to be implemented for this PR to land, though. 😃

lychee-lib/src/client.rs Outdated Show resolved Hide resolved
lychee-lib/src/client.rs Show resolved Hide resolved
lychee-lib/src/extract/markdown.rs Outdated Show resolved Hide resolved
lychee-lib/src/utils/request.rs Outdated Show resolved Hide resolved
@HU90m
Copy link
Contributor Author

HU90m commented Jul 13, 2023

FragmentChecker lives!

@HU90m
Copy link
Contributor Author

HU90m commented Jul 13, 2023

Rebased. Also squashed a lot of my commits to make the rebase simpler.

@HU90m HU90m marked this pull request as ready for review July 14, 2023 16:39
@HU90m
Copy link
Contributor Author

HU90m commented Jul 14, 2023

Fixed the failing tests

@HU90m HU90m force-pushed the md_frags branch 2 times, most recently from 48d4972 to 845f48f Compare July 14, 2023 17:28
@mre
Copy link
Member

mre commented Jul 17, 2023

I made a few adjustments to (hopefully) make the code a bit more idiomatic in 1001ccd.
I tried to write an integration test in 52b6dbf, but it's failing right now; don't know why.

@mre
Copy link
Member

mre commented Jul 17, 2023

Sorry for the additional commits. We can squash them in the end.

@HU90m
Copy link
Contributor Author

HU90m commented Jul 20, 2023

Thanks for adding some tests! This saved me quite a bit of time getting my head around the testing infrastructure.

The heading attributes test was failing because:

  1. I hadn't set the parser up for the heading attributes extension
  2. The pulldown-cmark parser doesn't parse a heading attribute if it's on the last line. This is probably a bug. To fix it, I moved the heading with an attributes so it wasn't on the last line.

@HU90m
Copy link
Contributor Author

HU90m commented Jul 20, 2023

Rebased on master

@HU90m HU90m force-pushed the md_frags branch 3 times, most recently from a63a7da to 918e708 Compare July 20, 2023 10:01
@HU90m
Copy link
Contributor Author

HU90m commented Jul 21, 2023

I was using lychee with this patch and noticed the convert_case crate doesn't kebab-ify in the same way as mdBook, GitHub and GitLab. It replaces _ with - whereas the others leave _ be, e.g. Hello_there friend becomes hello-there-friend as oppose to hello_there-friend.

The hand rolled algorithm we had before is the same as that used by mdBook. Should I revert the change to using the convert_case crate.

@mre
Copy link
Member

mre commented Jul 21, 2023

Oh. Nice catch. Yes, maybe we should revert it then. We could still take over some of those tests for edge cases from the convert_case crate. If you find the time, you could also open an upstream issue or maybe even contribute your algorithm there.

Copy link
Member

@mre mre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi @HU90m: I went ahead and changed expect to ErrorKind. This way, we can handle errors a little more up the call chain. We don't differentiate between the errors at the moment, but technically we could do so in the future. Right now, I've added to separate variants, InvalidFilenameUnicode and NoFileName, but we can also merge it into one and call it InvalidFile or so.

Comment on lines +241 to +247
(Self::TooManyRedirects(e1), Self::TooManyRedirects(e2)) => {
e1.to_string() == e2.to_string()
}
(Self::BasicAuthExtractorError(e1), Self::BasicAuthExtractorError(e2)) => {
e1.to_string() == e2.to_string()
}
(Self::Cookies(e1), Self::Cookies(e2)) => e1 == e2,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes aren't related to fragment checking, but I added the missing branches anyway while I was at it. I keep forgetting to do that in a separate PR. 😅

@mre
Copy link
Member

mre commented Jul 28, 2023

Looks like you need to rebase on top of master one more time. Other than that, this should be good to go, right? Or do you plan any additional changes?

Oh! And maybe we can mention fragments in the feature comparison table in the README. ✨ It would be required to check if the other link checkers mentioned in the table have this feature, but it might be quick?

HU90m and others added 12 commits July 30, 2023 14:29
These fragments are ignored during the file link check.
This commit allows their use in specific cases in the future.
Fragments/anchors are generated from markdown files using
*unique kebab case* and the *heading attributes* commonmark extension.
- Enabled the extension in the parser.
- Moved the heading with an attribute in the test file, because the
  parser doesn't pick them up if they are at the end of a file.
@HU90m
Copy link
Contributor Author

HU90m commented Jul 30, 2023

I went ahead and changed expect to ErrorKind.

I never said. Thanks for doing this!

Looks like you need to rebase on top of master one more time. Other than that, this should be good to go, right? Or do you plan any additional changes?

Just rebased. Yep it's all good to go 😃

Oh! And maybe we can mention fragments in the feature comparison table in the README. sparkles It would be required to check if the other link checkers mentioned in the table have this feature, but it might be quick?

Yeah, it shouldn't take too long. I started doing it, but stopped because we don't have fragment checks up and running for html and inline html yet. I feel like we should wait until we've got those PRs in before advertising this feature.

@mre
Copy link
Member

mre commented Jul 31, 2023

I decided to merge all file errors into InvalidFile as they are very similar.
This is now ready for prime time! 😎 🍿
Thanks for the PR and congrats on shipping this big feature. You 🪨 !

@mre mre merged commit 8e63693 into lycheeverse:master Jul 31, 2023
6 checks passed
@mre mre added the enhancement New feature or request label Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants