Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(core): avoid hash collision when generating chunk names #8538

Merged
merged 3 commits into from Jan 18, 2023
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
17 changes: 17 additions & 0 deletions packages/docusaurus/src/server/__tests__/routes.test.ts
Expand Up @@ -56,6 +56,23 @@ describe('genChunkName', () => {
});
expect(genChunkName('d', undefined, undefined, true)).toBe('8277e091');
});

// https://github.com/facebook/docusaurus/issues/8536
it('avoids hash collisions', () => {
expect(
genChunkName(
'@site/blog/2022-11-18-bye-medium/index.mdx?truncated=true',
'content',
'blog',
),
).not.toBe(
genChunkName(
'@site/blog/2019-10-05-react-nfc/index.mdx?truncated=true',
'content',
'blog',
),
);
});
});

describe('handleDuplicateRoutes', () => {
Expand Down
6 changes: 6 additions & 0 deletions packages/docusaurus/src/server/routes.ts
Expand Up @@ -51,6 +51,7 @@ function indent(str: string) {
}

const chunkNameCache = new Map<string, string>();
const chunkNameCount = new Map<string, number>();

/**
* Generates a unique chunk name that can be used in the chunk registry.
Expand Down Expand Up @@ -82,7 +83,12 @@ export function genChunkName(
const name = str === '/' ? 'index' : docuHash(str);
chunkName = prefix ? `${prefix}---${name}` : name;
}
const seenCount = (chunkNameCount.get(chunkName) ?? 0) + 1;
if (seenCount > 1) {
chunkName += seenCount;
}
Copy link
Contributor

@Zwyx Zwyx Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes so it was 1 chance out of only 4096 to get a collision (simpleHash(modulePath, 3) returning 3 hex characters).

Not sure how much you want to refactor/clean this, I'll just add a few notes (feel free to ignore of course, I don't even know how implicated you are in the project 🙂 ):

  • if we go for a counter, should we keep the hash? It seems unnecessary to have both.
  • it seems unnecessary to do a docuHash after having done a simpleHash: if two simpleHashs produce the same output (c99 in our case), then the two docuHashs will produce the same output as well (485).
  • docuHash tests if str equals /, so we don't need to do str === '/' ? 'index' : docuHash(str) here.
  • and a more general question: why don't we do simpleHash(modulePath, 8) in dev too?

For anyone discovering this PR, I've put together a small script to reproduce the issue we're having. Just run this with ts-node:

Details
import { createHash } from "crypto";

export function md5Hash(str: string): string {
	return createHash("md5").update(str).digest("hex");
}

export function docuHash(str: string): string {
	if (str === "/") {
		return "index";
	}
	const shortHash = simpleHash(str, 3);
	const parsedPath = `${str}-${shortHash}`;
	console.log(`parsed path: ${parsedPath}`);
	return parsedPath;
}

export function simpleHash(str: string, length: number): string {
	return md5Hash(str).substring(0, length);
}

export function genChunkName(
	modulePath: string,
	prefix?: string,
	preferredName?: string,
	shortId: boolean = process.env.NODE_ENV === "production",
): string {
	let str = modulePath;

	if (preferredName) {
		const shortHash = simpleHash(modulePath, 3);

		console.log(`short hash of ${modulePath}: ${shortHash}`);

		str = `${preferredName}${shortHash}`;
	}

	const name = str === "/" ? "index" : docuHash(str);

	console.log(`docuHash of ${str}: ${name}`);

	let chunkName = prefix ? `${prefix}---${name}` : name;

	return chunkName;
}

console.log(
	genChunkName(
		"@site/blog/2022-11-18-bye-medium/index.mdx?truncated=true",
		"content",
		"/blog",
	),
);

console.log(
	genChunkName(
		"@site/blog/2019-10-05-react-nfc/index.mdx?truncated=true",
		"content",
		"/blog",
	),
);

Output:

short hash of @site/blog/2022-11-18-bye-medium/index.mdx?truncated=true: c99
parsed path: /blogc99-485
docuHash of /blogc99: /blogc99-485
content---/blogc99-485   <----- SAME RESULT HERE
short hash of @site/blog/2019-10-05-react-nfc/index.mdx?truncated=true: c99
parsed path: /blogc99-485
docuHash of /blogc99: /blogc99-485
content---/blogc99-485   <----- AND HERE

Copy link
Collaborator Author

@Josh-Cena Josh-Cena Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review and the thoughts put into them.

I don't even know how implicated you are in the project

I'm a community contributor just like you, so of course I'll take them into account seriously :)

if we go for a counter, should we keep the hash?

The hash is also to make the IDs have a uniform and reasonably-short length. The ID is directly present in the client data, so any bytes we can shave off here is a net gain for bundle size.

I anticipate the counter to be very rarely used, and even if it is used, it would only add two bytes, which is fine.

it seems unnecessary to do a docuHash after having done a simpleHash

docuHash's name is confusing: hashing is probably the least that it does. It also normalizes the chunk name's capitalization and truncates the name, which makes sure the generated files can be properly emitted.

docuHash tests if str equals /, so we don't need to do str === '/' ? 'index' : docuHash(str) here.

Thanks! This code has been there since forever (early in the conception of Docusaurus 2), so things probably have evolved since then.

why don't we do simpleHash(modulePath, 8) in dev too?

For debuggability. The chunk name has some information of where it's coming from instead of an unreadable hash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent. Thanks a lot for your answers, it all makes sense.

chunkNameCache.set(modulePath, chunkName);
chunkNameCount.set(chunkName, seenCount);
}
return chunkName;
}
Expand Down