Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(core): avoid hash collision when generating chunk names #8538

Merged
merged 3 commits into from Jan 18, 2023

Conversation

Josh-Cena
Copy link
Collaborator

Pre-flight checklist

Motivation

Test Plan

Added a unit test that didn't pass on master. Ideally we would have dogfooding as well, but this is hard because we have to make sure that (1) two blog posts have the same chunk name (2) they appear on the same page in the paginated view. It's possible if we create another blog plugin solely for this test, but that doesn't seem worthwhile.

Test links

Deploy preview: https://deploy-preview-_____--docusaurus-2.netlify.app/

Related issues/PRs

@Josh-Cena Josh-Cena added pr: bug fix This PR fixes a bug in a past release. to backport This PR is planned to be backported to a stable version of Docusaurus labels Jan 10, 2023
@facebook-github-bot facebook-github-bot added the CLA Signed Signed Facebook CLA label Jan 10, 2023
@netlify
Copy link

netlify bot commented Jan 10, 2023

[V2]

Name Link
🔨 Latest commit f12075b
🔍 Latest deploy log https://app.netlify.com/sites/docusaurus-2/deploys/63c630204a28cf00099eaf11
😎 Deploy Preview https://deploy-preview-8538--docusaurus-2.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@github-actions
Copy link

github-actions bot commented Jan 10, 2023

⚡️ Lighthouse report for the deploy preview of this PR

URL Performance Accessibility Best Practices SEO PWA Report
/ 🟠 85 🟢 97 🟢 92 🟢 100 🟢 90 Report
/docs/installation 🟠 75 🟢 100 🟢 100 🟢 100 🟢 90 Report

@github-actions
Copy link

github-actions bot commented Jan 10, 2023

Size Change: 0 B

Total Size: 890 kB

ℹ️ View Unchanged
Filename Size
website/.docusaurus/globalData.json 71 kB
website/build/assets/css/styles.********.css 112 kB
website/build/assets/js/main.********.js 666 kB
website/build/index.html 40.9 kB

compressed-size-action

Comment on lines 83 to 89
const name = str === '/' ? 'index' : docuHash(str);
chunkName = prefix ? `${prefix}---${name}` : name;
}
const seenCount = (chunkNameCount.get(chunkName) ?? 0) + 1;
if (seenCount > 1) {
chunkName += seenCount;
}
Copy link
Contributor

@Zwyx Zwyx Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes so it was 1 chance out of only 4096 to get a collision (simpleHash(modulePath, 3) returning 3 hex characters).

Not sure how much you want to refactor/clean this, I'll just add a few notes (feel free to ignore of course, I don't even know how implicated you are in the project 🙂 ):

  • if we go for a counter, should we keep the hash? It seems unnecessary to have both.
  • it seems unnecessary to do a docuHash after having done a simpleHash: if two simpleHashs produce the same output (c99 in our case), then the two docuHashs will produce the same output as well (485).
  • docuHash tests if str equals /, so we don't need to do str === '/' ? 'index' : docuHash(str) here.
  • and a more general question: why don't we do simpleHash(modulePath, 8) in dev too?

For anyone discovering this PR, I've put together a small script to reproduce the issue we're having. Just run this with ts-node:

Details
import { createHash } from "crypto";

export function md5Hash(str: string): string {
	return createHash("md5").update(str).digest("hex");
}

export function docuHash(str: string): string {
	if (str === "/") {
		return "index";
	}
	const shortHash = simpleHash(str, 3);
	const parsedPath = `${str}-${shortHash}`;
	console.log(`parsed path: ${parsedPath}`);
	return parsedPath;
}

export function simpleHash(str: string, length: number): string {
	return md5Hash(str).substring(0, length);
}

export function genChunkName(
	modulePath: string,
	prefix?: string,
	preferredName?: string,
	shortId: boolean = process.env.NODE_ENV === "production",
): string {
	let str = modulePath;

	if (preferredName) {
		const shortHash = simpleHash(modulePath, 3);

		console.log(`short hash of ${modulePath}: ${shortHash}`);

		str = `${preferredName}${shortHash}`;
	}

	const name = str === "/" ? "index" : docuHash(str);

	console.log(`docuHash of ${str}: ${name}`);

	let chunkName = prefix ? `${prefix}---${name}` : name;

	return chunkName;
}

console.log(
	genChunkName(
		"@site/blog/2022-11-18-bye-medium/index.mdx?truncated=true",
		"content",
		"/blog",
	),
);

console.log(
	genChunkName(
		"@site/blog/2019-10-05-react-nfc/index.mdx?truncated=true",
		"content",
		"/blog",
	),
);

Output:

short hash of @site/blog/2022-11-18-bye-medium/index.mdx?truncated=true: c99
parsed path: /blogc99-485
docuHash of /blogc99: /blogc99-485
content---/blogc99-485   <----- SAME RESULT HERE
short hash of @site/blog/2019-10-05-react-nfc/index.mdx?truncated=true: c99
parsed path: /blogc99-485
docuHash of /blogc99: /blogc99-485
content---/blogc99-485   <----- AND HERE

Copy link
Collaborator Author

@Josh-Cena Josh-Cena Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review and the thoughts put into them.

I don't even know how implicated you are in the project

I'm a community contributor just like you, so of course I'll take them into account seriously :)

if we go for a counter, should we keep the hash?

The hash is also to make the IDs have a uniform and reasonably-short length. The ID is directly present in the client data, so any bytes we can shave off here is a net gain for bundle size.

I anticipate the counter to be very rarely used, and even if it is used, it would only add two bytes, which is fine.

it seems unnecessary to do a docuHash after having done a simpleHash

docuHash's name is confusing: hashing is probably the least that it does. It also normalizes the chunk name's capitalization and truncates the name, which makes sure the generated files can be properly emitted.

docuHash tests if str equals /, so we don't need to do str === '/' ? 'index' : docuHash(str) here.

Thanks! This code has been there since forever (early in the conception of Docusaurus 2), so things probably have evolved since then.

why don't we do simpleHash(modulePath, 8) in dev too?

For debuggability. The chunk name has some information of where it's coming from instead of an unreadable hash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent. Thanks a lot for your answers, it all makes sense.

@slorber
Copy link
Collaborator

slorber commented Jan 18, 2023

Thanks 👍 looks good enough for now, but maybe we could do better.

Agree this historical fn is a bit confusing 🤪

Hopefully we won't get 36 collisions at once. We could as well use 4 instead of 3 for the hash 🤷‍♂️

@slorber slorber merged commit 00023c2 into main Jan 18, 2023
@slorber slorber deleted the jc/avoid-chunk-clash branch January 18, 2023 18:19
@Josh-Cena
Copy link
Collaborator Author

Josh-Cena commented Jan 19, 2023

Hash collision is inevitable—by the pigeonhole principle, even for a 4-char hash, collision is guaranteed when there are $16^4 + 1 = 65537$ chunks. (No website is this big, but this is just a worst-case upper bound, and you can expect it to happen far below that scale—birthday paradox.) At this level, the average length of the 3-char + counter hash is $\displaystyle 3+\log_{36}\frac{65537}{4096}=3.77$, still smaller than 4. We are using a smaller price to avoid an otherwise non-avoidable issue.

@Josh-Cena
Copy link
Collaborator Author

Josh-Cena commented Jan 19, 2023

For my own sake of mathematical sanity: I neglected the fact that the length must be an integer, so for the vast number of cases where there are less than 36 collisions the hash still reaches a length of 4. However, even if we approximate this as a binomial distribution independent for each of the 4096 hashes, the expected hash length is still

$$E=3 + P(2\le X\le 36) + 2P(37\le X\le 1296) + \dots$$

where

$$P(X = x) = \begin{pmatrix} 65537\\x \end{pmatrix} \left(\frac{1}{4096}\right)^x \left(\frac{4095}{4096}\right)^{65537-x} $$

Some calculations show me $4.000 &lt; E &lt; 4.001$, which means at least we aren't doing much worse. This is because despite there being a tiny, tiny ($0.0001%$) chance of >36 collisions, there's also a chance of there being no collisions at all, offsetting the difference.

(I hope I get it right this time—have to live up to my expertise.)

@slorber
Copy link
Collaborator

slorber commented Jan 20, 2023

🤯

@slorber slorber added backported This PR has been backported to a stable version of Docusaurus and removed to backport This PR is planned to be backported to a stable version of Docusaurus labels Jan 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported This PR has been backported to a stable version of Docusaurus CLA Signed Signed Facebook CLA pr: bug fix This PR fixes a bug in a past release.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Same post shown twice in the blog list
4 participants