-
-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(core): avoid hash collision when generating chunk names #8538
Conversation
✅ [V2]
To edit notification comments on pull requests, go to your Netlify site settings. |
⚡️ Lighthouse report for the deploy preview of this PR
|
Size Change: 0 B Total Size: 890 kB ℹ️ View Unchanged
|
const name = str === '/' ? 'index' : docuHash(str); | ||
chunkName = prefix ? `${prefix}---${name}` : name; | ||
} | ||
const seenCount = (chunkNameCount.get(chunkName) ?? 0) + 1; | ||
if (seenCount > 1) { | ||
chunkName += seenCount; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes so it was 1 chance out of only 4096 to get a collision (simpleHash(modulePath, 3)
returning 3 hex characters).
Not sure how much you want to refactor/clean this, I'll just add a few notes (feel free to ignore of course, I don't even know how implicated you are in the project 🙂 ):
- if we go for a counter, should we keep the hash? It seems unnecessary to have both.
- it seems unnecessary to do a
docuHash
after having done asimpleHash
: if twosimpleHash
s produce the same output (c99
in our case), then the twodocuHash
s will produce the same output as well (485
). docuHash
tests ifstr
equals/
, so we don't need to dostr === '/' ? 'index' : docuHash(str)
here.- and a more general question: why don't we do
simpleHash(modulePath, 8)
in dev too?
For anyone discovering this PR, I've put together a small script to reproduce the issue we're having. Just run this with ts-node
:
Details
import { createHash } from "crypto";
export function md5Hash(str: string): string {
return createHash("md5").update(str).digest("hex");
}
export function docuHash(str: string): string {
if (str === "/") {
return "index";
}
const shortHash = simpleHash(str, 3);
const parsedPath = `${str}-${shortHash}`;
console.log(`parsed path: ${parsedPath}`);
return parsedPath;
}
export function simpleHash(str: string, length: number): string {
return md5Hash(str).substring(0, length);
}
export function genChunkName(
modulePath: string,
prefix?: string,
preferredName?: string,
shortId: boolean = process.env.NODE_ENV === "production",
): string {
let str = modulePath;
if (preferredName) {
const shortHash = simpleHash(modulePath, 3);
console.log(`short hash of ${modulePath}: ${shortHash}`);
str = `${preferredName}${shortHash}`;
}
const name = str === "/" ? "index" : docuHash(str);
console.log(`docuHash of ${str}: ${name}`);
let chunkName = prefix ? `${prefix}---${name}` : name;
return chunkName;
}
console.log(
genChunkName(
"@site/blog/2022-11-18-bye-medium/index.mdx?truncated=true",
"content",
"/blog",
),
);
console.log(
genChunkName(
"@site/blog/2019-10-05-react-nfc/index.mdx?truncated=true",
"content",
"/blog",
),
);
Output:
short hash of @site/blog/2022-11-18-bye-medium/index.mdx?truncated=true: c99
parsed path: /blogc99-485
docuHash of /blogc99: /blogc99-485
content---/blogc99-485 <----- SAME RESULT HERE
short hash of @site/blog/2019-10-05-react-nfc/index.mdx?truncated=true: c99
parsed path: /blogc99-485
docuHash of /blogc99: /blogc99-485
content---/blogc99-485 <----- AND HERE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review and the thoughts put into them.
I don't even know how implicated you are in the project
I'm a community contributor just like you, so of course I'll take them into account seriously :)
if we go for a counter, should we keep the hash?
The hash is also to make the IDs have a uniform and reasonably-short length. The ID is directly present in the client data, so any bytes we can shave off here is a net gain for bundle size.
I anticipate the counter to be very rarely used, and even if it is used, it would only add two bytes, which is fine.
it seems unnecessary to do a
docuHash
after having done asimpleHash
docuHash
's name is confusing: hashing is probably the least that it does. It also normalizes the chunk name's capitalization and truncates the name, which makes sure the generated files can be properly emitted.
docuHash
tests ifstr
equals/
, so we don't need to dostr === '/' ? 'index' : docuHash(str)
here.
Thanks! This code has been there since forever (early in the conception of Docusaurus 2), so things probably have evolved since then.
why don't we do
simpleHash(modulePath, 8)
in dev too?
For debuggability. The chunk name has some information of where it's coming from instead of an unreadable hash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent. Thanks a lot for your answers, it all makes sense.
Thanks 👍 looks good enough for now, but maybe we could do better. Agree this historical fn is a bit confusing 🤪 Hopefully we won't get 36 collisions at once. We could as well use 4 instead of 3 for the hash 🤷♂️ |
Hash collision is inevitable—by the pigeonhole principle, even for a 4-char hash, collision is guaranteed when there are |
For my own sake of mathematical sanity: I neglected the fact that the length must be an integer, so for the vast number of cases where there are less than 36 collisions the hash still reaches a length of 4. However, even if we approximate this as a binomial distribution independent for each of the 4096 hashes, the expected hash length is still where Some calculations show me (I hope I get it right this time—have to live up to my expertise.) |
🤯 |
Pre-flight checklist
Motivation
Test Plan
Added a unit test that didn't pass on master. Ideally we would have dogfooding as well, but this is hard because we have to make sure that (1) two blog posts have the same chunk name (2) they appear on the same page in the paginated view. It's possible if we create another blog plugin solely for this test, but that doesn't seem worthwhile.
Test links
Deploy preview: https://deploy-preview-_____--docusaurus-2.netlify.app/
Related issues/PRs