fix(core): avoid hash collision when generating chunk names #8538

Josh-Cena · 2023-01-10T20:21:02Z

Pre-flight checklist

I have read the Contributing Guidelines on pull requests.
If this is a code change: I have written unit tests and/or added dogfooding pages to fully verify the new behavior.
If this is a new API or substantial change: the PR has an accompanying issue (closes Same post shown twice in the blog list #8536) and the maintainers have approved on my working plan.

Motivation

Test Plan

Added a unit test that didn't pass on master. Ideally we would have dogfooding as well, but this is hard because we have to make sure that (1) two blog posts have the same chunk name (2) they appear on the same page in the paginated view. It's possible if we create another blog plugin solely for this test, but that doesn't seem worthwhile.

Test links

Deploy preview: https://deploy-preview-_____--docusaurus-2.netlify.app/

Related issues/PRs

netlify · 2023-01-10T20:26:00Z

✅ [V2]

Name	Link
🔨 Latest commit	`f12075b`
🔍 Latest deploy log	https://app.netlify.com/sites/docusaurus-2/deploys/63c630204a28cf00099eaf11
😎 Deploy Preview	https://deploy-preview-8538--docusaurus-2.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

github-actions · 2023-01-10T20:26:31Z

⚡️ Lighthouse report for the deploy preview of this PR

URL	Performance	Accessibility	Best Practices	SEO	PWA	Report
/	🟠 85	🟢 97	🟢 92	🟢 100	🟢 90	Report
/docs/installation	🟠 75	🟢 100	🟢 100	🟢 100	🟢 90	Report

github-actions · 2023-01-10T20:38:25Z

Size Change: 0 B

Total Size: 890 kB

ℹ️ View Unchanged

Filename	Size
`website/.docusaurus/globalData.json`	71 kB
`website/build/assets/css/styles.********.css`	112 kB
`website/build/assets/js/main.********.js`	666 kB
`website/build/index.html`	40.9 kB

_{compressed-size-action}

Zwyx · 2023-01-11T04:20:25Z

packages/docusaurus/src/server/routes.ts

      const name = str === '/' ? 'index' : docuHash(str);
      chunkName = prefix ? `${prefix}---${name}` : name;
    }
+    const seenCount = (chunkNameCount.get(chunkName) ?? 0) + 1;
+    if (seenCount > 1) {
+      chunkName += seenCount;
+    }


Ah yes so it was 1 chance out of only 4096 to get a collision (simpleHash(modulePath, 3) returning 3 hex characters).

Not sure how much you want to refactor/clean this, I'll just add a few notes (feel free to ignore of course, I don't even know how implicated you are in the project 🙂 ):

if we go for a counter, should we keep the hash? It seems unnecessary to have both.

it seems unnecessary to do a docuHash after having done a simpleHash: if two simpleHashs produce the same output (c99 in our case), then the two docuHashs will produce the same output as well (485).

docuHash tests if str equals /, so we don't need to do str === '/' ? 'index' : docuHash(str) here.

and a more general question: why don't we do simpleHash(modulePath, 8) in dev too?

For anyone discovering this PR, I've put together a small script to reproduce the issue we're having. Just run this with ts-node:

Details

import { createHash } from "crypto"; export function md5Hash(str: string): string { return createHash("md5").update(str).digest("hex"); } export function docuHash(str: string): string { if (str === "/") { return "index"; } const shortHash = simpleHash(str, 3); const parsedPath = `${str}-${shortHash}`; console.log(`parsed path: ${parsedPath}`); return parsedPath; } export function simpleHash(str: string, length: number): string { return md5Hash(str).substring(0, length); } export function genChunkName( modulePath: string, prefix?: string, preferredName?: string, shortId: boolean = process.env.NODE_ENV === "production", ): string { let str = modulePath; if (preferredName) { const shortHash = simpleHash(modulePath, 3); console.log(`short hash of ${modulePath}: ${shortHash}`); str = `${preferredName}${shortHash}`; } const name = str === "/" ? "index" : docuHash(str); console.log(`docuHash of ${str}: ${name}`); let chunkName = prefix ? `${prefix}---${name}` : name; return chunkName; } console.log( genChunkName( "@site/blog/2022-11-18-bye-medium/index.mdx?truncated=true", "content", "/blog", ), ); console.log( genChunkName( "@site/blog/2019-10-05-react-nfc/index.mdx?truncated=true", "content", "/blog", ), );

Output:

short hash of @site/blog/2022-11-18-bye-medium/index.mdx?truncated=true: c99 parsed path: /blogc99-485 docuHash of /blogc99: /blogc99-485 content---/blogc99-485 <----- SAME RESULT HERE short hash of @site/blog/2019-10-05-react-nfc/index.mdx?truncated=true: c99 parsed path: /blogc99-485 docuHash of /blogc99: /blogc99-485 content---/blogc99-485 <----- AND HERE

Thanks for the review and the thoughts put into them.

I don't even know how implicated you are in the project

I'm a community contributor just like you, so of course I'll take them into account seriously :)

if we go for a counter, should we keep the hash?

The hash is also to make the IDs have a uniform and reasonably-short length. The ID is directly present in the client data, so any bytes we can shave off here is a net gain for bundle size.

I anticipate the counter to be very rarely used, and even if it is used, it would only add two bytes, which is fine.

it seems unnecessary to do a docuHash after having done a simpleHash

docuHash's name is confusing: hashing is probably the least that it does. It also normalizes the chunk name's capitalization and truncates the name, which makes sure the generated files can be properly emitted.

docuHash tests if str equals /, so we don't need to do str === '/' ? 'index' : docuHash(str) here.

Thanks! This code has been there since forever (early in the conception of Docusaurus 2), so things probably have evolved since then.

why don't we do simpleHash(modulePath, 8) in dev too?

For debuggability. The chunk name has some information of where it's coming from instead of an unreadable hash.

Excellent. Thanks a lot for your answers, it all makes sense.

slorber · 2023-01-18T18:17:37Z

Thanks 👍 looks good enough for now, but maybe we could do better.

Agree this historical fn is a bit confusing 🤪

Hopefully we won't get 36 collisions at once. We could as well use 4 instead of 3 for the hash 🤷‍♂️

Josh-Cena · 2023-01-19T07:37:11Z

Hash collision is inevitable—by the pigeonhole principle, even for a 4-char hash, collision is guaranteed when there are $16^{4} + 1 = 65537$ chunks. (No website is this big, but this is just a worst-case upper bound, and you can expect it to happen far below that scale—birthday paradox.) At this level, the average length of the 3-char + counter hash is $3 + \log_{36} \frac{65537}{4096} = 3.77$ , still smaller than 4. We are using a smaller price to avoid an otherwise non-avoidable issue.

Josh-Cena · 2023-01-19T21:48:19Z

For my own sake of mathematical sanity: I neglected the fact that the length must be an integer, so for the vast number of cases where there are less than 36 collisions the hash still reaches a length of 4. However, even if we approximate this as a binomial distribution independent for each of the 4096 hashes, the expected hash length is still

$E = 3 + P (2 \leq X \leq 36) + 2 P (37 \leq X \leq 1296) + \dots$

where

$P (X = x) = (\begin{matrix} 65537 \\ x \end{matrix}) {(\frac{1}{4096})}^{x} {(\frac{4095}{4096})}^{65537 - x}$

Some calculations show me $4.000 < E < 4.001$ , which means at least we aren't doing much worse. This is because despite there being a tiny, tiny ( $0.0001$ ) chance of >36 collisions, there's also a chance of there being no collisions at all, offsetting the difference.

(I hope I get it right this time—have to live up to my expertise.)

slorber · 2023-01-20T11:47:26Z

🤯

fix(core): avoid hash collision when generating chunk names

d5d7f7f

Josh-Cena added pr: bug fix to backport labels Jan 10, 2023

Josh-Cena requested review from slorber and lex111 as code owners January 10, 2023 20:21

facebook-github-bot added the CLA Signed label Jan 10, 2023

Zwyx reviewed Jan 11, 2023

View reviewed changes

Josh-Cena added 2 commits January 10, 2023 23:42

Remove unnecessary condition

5fc8107

Test both conditions

f12075b

slorber merged commit 00023c2 into main Jan 18, 2023

slorber deleted the jc/avoid-chunk-clash branch January 18, 2023 18:19

slorber pushed a commit that referenced this pull request Jan 26, 2023

fix(core): avoid hash collision when generating chunk names (#8538)

41d0043

slorber added backported and removed to backport labels Jan 27, 2023

slorber mentioned this pull request Oct 19, 2023

chore: v3.0.0-rc.0 release #9418

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(core): avoid hash collision when generating chunk names #8538

fix(core): avoid hash collision when generating chunk names #8538

Josh-Cena commented Jan 10, 2023

netlify bot commented Jan 10, 2023 •

edited

Loading

github-actions bot commented Jan 10, 2023 •

edited

Loading

github-actions bot commented Jan 10, 2023 •

edited

Loading

Zwyx Jan 11, 2023 •

edited

Loading

Josh-Cena Jan 11, 2023 •

edited

Loading

Zwyx Jan 11, 2023

slorber commented Jan 18, 2023

Josh-Cena commented Jan 19, 2023 •

edited

Loading

Josh-Cena commented Jan 19, 2023 •

edited

Loading

slorber commented Jan 20, 2023

fix(core): avoid hash collision when generating chunk names #8538

fix(core): avoid hash collision when generating chunk names #8538

Conversation

Josh-Cena commented Jan 10, 2023

Pre-flight checklist

Motivation

Test Plan

Test links

Related issues/PRs

netlify bot commented Jan 10, 2023 • edited Loading

✅ [V2]

github-actions bot commented Jan 10, 2023 • edited Loading

⚡️ Lighthouse report for the deploy preview of this PR

github-actions bot commented Jan 10, 2023 • edited Loading

Zwyx Jan 11, 2023 • edited Loading

Choose a reason for hiding this comment

Josh-Cena Jan 11, 2023 • edited Loading

Choose a reason for hiding this comment

Zwyx Jan 11, 2023

Choose a reason for hiding this comment

slorber commented Jan 18, 2023

Josh-Cena commented Jan 19, 2023 • edited Loading

Josh-Cena commented Jan 19, 2023 • edited Loading

slorber commented Jan 20, 2023

netlify bot commented Jan 10, 2023 •

edited

Loading

github-actions bot commented Jan 10, 2023 •

edited

Loading

github-actions bot commented Jan 10, 2023 •

edited

Loading

Zwyx Jan 11, 2023 •

edited

Loading

Josh-Cena Jan 11, 2023 •

edited

Loading

Josh-Cena commented Jan 19, 2023 •

edited

Loading

Josh-Cena commented Jan 19, 2023 •

edited

Loading