Double encoding of special chars in URL doesn't work well in all cases #45

NicolasMassart · 2021-03-11T11:18:45Z

See for instance tcort/markdown-link-check#155
A user wants to check https://en.wikipedia.org/wiki/%3F:
But link-check decodes the %3F to ? and does not re-encode it, because ? is a legitimate URL character; it simply has the special meaning of starting the query part. So https://en.wikipedia.org/wiki/%3F: becomes https://en.wikipedia.org/wiki/?: which means https://en.wikipedia.org/wiki/ with the query string :.
We need a way to handle encoded characters without these issues. The test suite already contains many tests for specific cases. They have to keep passing, but the risk for now is ending up with a pile of special cases. A generic way to deal with this would be nice.
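
The behaviour described above can be reproduced outside link-check. This is a minimal sketch, assuming the decode/re-encode expression quoted from lib/proto/http.js later in this thread; the helper name legacyNormalize is made up for illustration:

```javascript
// Hypothetical helper mirroring the decode/re-encode expression from lib/proto/http.js.
function legacyNormalize(link, baseUrl) {
    return encodeURI(decodeURIComponent(new URL(link, baseUrl).toString()));
}

const checked = legacyNormalize("https://en.wikipedia.org/wiki/%3F:");
const parsed = new URL(checked);

console.log(checked);          // https://en.wikipedia.org/wiki/?:
console.log(parsed.pathname);  // /wiki/
console.log(parsed.search);    // ?:
```

decodeURIComponent turns %3F into ?, and encodeURI leaves ? alone because it is a reserved character, so the page name ends up in the query string.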


jan-guenter · 2023-04-18T10:05:30Z

I encountered the same issue: markdown-link-check fails on the URL https://libraries.io/npm/@action-class%2Fcore/tree

Because %2F is re-encoded as /, the request results in a 404: the server's path parsing now receives an additional path component.

In my opinion, the manual re-encoding is unnecessary, since the new URL() call already normalizes the URL.

Example:

new URL("https://example.com/foo%2Fbar/foo bar/?test=arg%20with+spaces&test2=arg unencoded").toString()

results in

https://example.com/foo%2Fbar/foo%20bar/?test=arg%20with+spaces&test2=arg%20unencoded
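
As a quick check of the same idea against the two URLs reported in this thread, the sketch below runs them through new URL() alone and compares the result with the input. It only uses the standard WHATWG URL class available in Node.js and browsers; the variable names are illustrative:

```javascript
// The two URLs reported in this thread; both contain an encoded reserved character.
const reported = [
    "https://en.wikipedia.org/wiki/%3F:",
    "https://libraries.io/npm/@action-class%2Fcore/tree",
];

// Normalize with the URL class only, without decodeURIComponent/encodeURI.
const results = reported.map((link) => ({
    link,
    normalized: new URL(link).toString(),
}));

for (const { link, normalized } of results) {
    console.log(normalized === link, normalized);
}
```

Both lines print true, i.e. %3F and %2F are left encoded, while the previous example shows that unencoded spaces still get escaped.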

So my suggestion would be to completely remove the encodeURI and decodeURIComponent calls from https://github.com/tcort/link-check/blob/master/lib/proto/http.js#L40 and rely on the normalization of JavaScript's URL class.

diff --git a/lib/proto/http.js b/lib/proto/http.js
index f6530a4..548e7a8 100644
--- a/lib/proto/http.js
+++ b/lib/proto/http.js
@@ -31,13 +31,8 @@ module.exports = {
 
         let user_agent = opts.user_agent || `${pkg.name}/${pkg.version}`;
 
-        // Decoding and encoding is required to prevent encoding already encoded URLs
-        // We decode using the decodeURIComponent as it will decode a wider range of 
-        // characters that were not necessary to be encoded at first, then we re-encode
-        // only the required ones using encodeURI.
-        // Note that we don't use encodeURIComponents as it adds too much non-necessary encodings
-        // see "Not Escaped" list in https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent#description
-        const url = encodeURI(decodeURIComponent(new URL(link, opts.baseUrl).toString()));
+        // rebase relative urls and normalize url encoding
+        const url = new URL(link, opts.baseUrl).toString();

         const options = {
             user_agent: user_agent,

Alternatively, adding an option parameter to skip the re-encoding would be highly appreciated. That way, a per-URL option could be added to the markdown-link-check config to disable this 'feature' for problematic URLs.
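
A rough sketch of what such an opt-out could look like. The option name skipReencoding and the helper buildRequestUrl are hypothetical; neither exists in link-check today:

```javascript
// Sketch only: `skipReencoding` is a hypothetical option name, not an existing link-check option.
function buildRequestUrl(link, opts = {}) {
    // Rebase relative URLs and apply the URL class's own normalization.
    const resolved = new URL(link, opts.baseUrl).toString();
    if (opts.skipReencoding) {
        return resolved;
    }
    // Default: keep the current decode/re-encode behaviour.
    return encodeURI(decodeURIComponent(resolved));
}

const link = "https://libraries.io/npm/@action-class%2Fcore/tree";
const legacy = buildRequestUrl(link);
const skipped = buildRequestUrl(link, { skipReencoding: true });

console.log(legacy);   // https://libraries.io/npm/@action-class/core/tree
console.log(skipped);  // https://libraries.io/npm/@action-class%2Fcore/tree
```

Keeping the current behaviour as the default would leave the existing test cases untouched while letting problematic URLs opt out.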

NicolasMassart self-assigned this Mar 11, 2021

NicolasMassart added the bug label Mar 11, 2021

NicolasMassart added this to New issues in Issues progress via automation Mar 11, 2021

NicolasMassart moved this from New issues to In Progress in Issues progress Mar 11, 2021

NicolasMassart removed their assignment May 10, 2023
