
Currently no way to use zlib optimally in Node.js #47709

Closed
JoshuaWise opened this issue Apr 24, 2023 · 9 comments
Labels: feature request (Issues that request new features to be added to Node.js), zlib (Issues and PRs related to the zlib subsystem)

Comments

@JoshuaWise

JoshuaWise commented Apr 24, 2023

What is the problem this feature will solve?

Zlib compression is inherently a CPU-bound task. Therefore, it's only logical that when you need to compress lots of things, you'd want to utilize all your CPUs. However, Node.js's built-in zlib module does its compression within libuv's thread pool, which is designed for I/O-bound tasks, not CPU-bound ones. In fact, libuv spawns exactly 4 threads by default; the pool size has no correlation to the number of CPUs available.

Node.js does provide synchronous convenience methods which do the work in the current thread. This, in combination with worker_threads, allows you to actually utilize all your CPUs for bulk compression tasks. Unfortunately, these synchronous methods are unable to process chunks piecewise; if you need to compress a large file, you have to put the entire thing into memory.
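
For example, this is roughly the pattern available today (a sketch, not part of the proposal; a real version would cap the number of workers at the CPU count):

const { Worker, isMainThread, workerData } = require('node:worker_threads');
const zlib = require('node:zlib');
const fs = require('node:fs');

if (isMainThread) {
	// One worker per file for brevity; a real version would pool workers.
	for (const filename of process.argv.slice(2)) {
		new Worker(__filename, { workerData: filename });
	}
} else {
	// The limitation: the entire file must be held in memory.
	const data = fs.readFileSync(workerData);
	fs.writeFileSync(workerData + '.gz', zlib.gzipSync(data));
}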

Therefore, if you need to compress many large files, Node.js provides no public API for doing it in an optimal way.

What is the feature you are proposing to solve the problem?

I propose a new API within the built-in zlib module which allows you to synchronously compress data, chunk-by-chunk. The API can mirror the existing Hash class that's provided by the crypto module.

const zlib = require('node:zlib');

const gzip = new zlib.GzipSync();

// Each update() call returns the available output, if any (could be an empty Buffer)
const compressedOutput1 = gzip.update(chunk1);
const compressedOutput2 = gzip.update(chunk2);

// The digest() call flushes the output and returns the last output chunk
const compressedOutput3 = gzip.digest();

// The complete result is just the concatenation of all output Buffers
const result = Buffer.concat([compressedOutput1, compressedOutput2, compressedOutput3]);

An example of real-world usage might look like this:

const fs = require('node:fs'); 
const zlib = require('node:zlib');

const CHUNK_SIZE = 1024 * 16;

function* gzipFile(filename, zlibOptions) {
	const gzip = new zlib.GzipSync(zlibOptions);
	const fd = fs.openSync(filename);
	try {
		while (true) {
			const buffer = Buffer.alloc(CHUNK_SIZE);
			const bytesRead = fs.readSync(fd, buffer, 0, CHUNK_SIZE);
			if (bytesRead > 0) {
				yield gzip.update(buffer.subarray(0, bytesRead));
			}
			if (bytesRead < CHUNK_SIZE) {
				yield gzip.digest();
				break;
			}
		}
	} finally {
		fs.closeSync(fd);
	}
}
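
Consuming the generator then composes naturally with ordinary synchronous writes (file names here are just examples):

const out = fs.openSync('input.txt.gz', 'w');
try {
	for (const chunk of gzipFile('input.txt')) {
		fs.writeSync(out, chunk);
	}
} finally {
	fs.closeSync(out);
}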

What alternatives have you considered?

I have searched npm and found only two packages that claim to achieve this while using native zlib bindings.

Both of these packages rely on very scary, dirty hacks: calling the undocumented _processChunk() method and manipulating the undocumented _handle property (which holds the native zlib instance) that Node.js uses internally. I don't find these solutions to be remotely safe or future-proof.

@JoshuaWise added the feature request label Apr 24, 2023
@mscdex
Contributor

mscdex commented Apr 25, 2023

Therefore, if you need to compress many large files, Node.js provides no public API for doing it in an optimal way.

You can always change the size of the libuv thread pool (by setting the UV_THREADPOOL_SIZE environment variable) to whatever makes sense for your hardware and/or application.
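
For example, a small launcher can re-exec the actual workload with the pool sized to the machine (a sketch; the file name is illustrative):

const { spawnSync } = require('node:child_process');
const os = require('node:os');

spawnSync(process.execPath, ['compress-files.js'], {
	stdio: 'inherit',
	env: { ...process.env, UV_THREADPOOL_SIZE: String(os.cpus().length) },
});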

@JoshuaWise
Author

Therefore, if you need to compress many large files, Node.js provides no public API for doing it in an optimal way.

You can always change the size of the libuv thread pool (by setting the UV_THREADPOOL_SIZE environment variable) to whatever makes sense for your hardware and/or application.

I did consider this, and it may be an acceptable solution in simple situations where you have total control over where your code runs. In my case, though, the code is going to run within many different Node programs across many teams in a large organization. It's unreasonable for me to expect that I can force every person in the organization to set this environment variable, especially when the programs in question are hosted across a wide variety of infrastructure.

In other words, it should be possible to create an npm package that optimally compresses files, without forcing end-users to set an environment variable which affects their entire program.

@bnoordhuis
Member

It's already possible to do what you're proposing by compiling zlib to wasm or using an existing JS compression library.

You're welcome to open a pull request but it's a niche enough feature that I won't guarantee it's going to get merged.
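
For example, pako already exposes a chunked synchronous interface; roughly this (a sketch of its documented API, worth double-checking against its docs):

const pako = require('pako');

const deflator = new pako.Deflate({ gzip: true });
deflator.push(chunk1, false); // chunk1, chunk2: Uint8Arrays of input data
deflator.push(chunk2, true);  // true = final chunk, finish the stream

if (deflator.err) throw new Error(deflator.msg);
const compressed = deflator.result; // Uint8Array of the whole gzip stream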

@VoltrexKeyva added the zlib label Apr 25, 2023
@JoshuaWise
Author

JoshuaWise commented Apr 25, 2023

It's already possible to do what you're proposing by compiling zlib to wasm or using an existing JS compression library.

You're welcome to open a pull request but it's a niche enough feature that I won't guarantee it's going to get merged.

Sure, but wasm and JS are slower than using native zlib. They're not optimal. And it seems silly to reach for a third-party library for zlib when Node.js is supposed to come with it out-of-the-box.

I think if Node.js had started out with worker threads, there's a good chance the zlib module would never have been written to use the libuv thread pool. Even ignoring the issue of pool size, the libuv thread pool isn't meant for CPU-bound tasks, so the current zlib implementation seems like an anti-pattern to me. With the current API, if I want to perform CPU-bound tasks like compression, it will actually block I/O-bound tasks such as reading files, because all the pool threads will be busy doing actual work (instead of just sitting around waiting for I/O, which is what they're designed for).
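
This is easy to demonstrate (a sketch; with the default pool of 4 threads, the readFile is queued behind the compression jobs):

const zlib = require('node:zlib');
const fs = require('node:fs');

const big = Buffer.alloc(64 * 1024 * 1024, 'x');
for (let i = 0; i < 4; i++) zlib.gzip(big, () => {}); // occupy all 4 pool threads

const t0 = Date.now();
fs.readFile(__filename, () => {
	console.log(`readFile waited ${Date.now() - t0} ms behind compression`);
});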

The idea that the current API is an anti-pattern is reinforced by the Node.js docs themselves, which warn about memory fragmentation arising from threadpool usage: https://nodejs.org/api/zlib.html#threadpool-usage-and-performance-considerations.

Compression is inherently a synchronous job, so it makes sense (to me) to optimize the API around using it synchronously.

I'm happy to create a PR.

@bnoordhuis
Member

Sure, but wasm and JS are slower than using native zlib. They're not optimal.

You don't know that until you measure it. Reviewers will ask you to quantify performance gains compared to the status quo.

if I want to perform CPU-bound tasks like compression, it will actually block I/O-bound tasks such as reading files

That seems like an argument for smarter thread pool scheduling to me, not a reason to work around it by adding sync compression.

The idea that the current API is an anti-pattern is reinforced by the Node.js docs themselves, which warn about memory fragmentation due to using the event loop

It's a lot more subtle than that, see #8871 (comment).

The docs are not intended to be an exposé on the intricacies of glibc, so the warning was reduced to something that lacks nuance, but saying it's an anti-pattern is factually wrong.

@timotejroiko

timotejroiko commented Apr 27, 2023

Hello, I am the author of one of the mentioned libraries and I just want to drop my two cents on this topic.

It appears to me that the current zlib implementation is geared towards compressing files and other sources of large quantities of data. However, one very common use case that is currently suboptimal to implement is transport compression for network streams (TCP and WebSocket).

This type of compression requires keeping a shared zlib context alive for as long as the connection lives, and most of that time is spent processing small chunks of data. This can potentially be done more efficiently in a synchronous fashion, which also avoids the issues discussed in #8871 as well as in ws#1369.

As an example, the Discord API uses this type of compression in its websocket connections, and most of its Node.js client libraries ended up relying on third-party solutions such as https://github.com/abalabahaha/zlib-sync or one of the libraries mentioned above to provide this functionality instead of using the built-in zlib.
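
For reference, zlib-sync's usage looks roughly like this (from memory of its README, so details may differ):

const zlibSync = require('zlib-sync');

// One context per connection, kept alive for the connection's lifetime.
const inflate = new zlibSync.Inflate({ chunkSize: 65535 });

// For each compressed websocket message received from the socket:
inflate.push(message, zlibSync.Z_SYNC_FLUSH);
if (inflate.err) throw new Error(inflate.msg);
const decompressed = inflate.result;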

Even though my use case is a bit different from Joshua's, I would personally love to see a synchronous shared-context zlib class implemented some day, even if just for the sake of giving developers more options, just as many other Node APIs have both sync and async variants.

@bnoordhuis
Member

@JoshuaWise @timotejroiko you may want to coordinate amongst yourselves. This is basically at the "pull request welcome and we'll take it from there" stage. No promises but no outright "no" either.

@silverwind
Contributor

silverwind commented May 24, 2023

UV_THREADPOOL_SIZE

Can this be set from within a running node application? I assume the answer is "no"?

Edit: If this is to be believed, it should be possible to set it before any async work happens. A real API for it would still be nice, I guess.
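
Something like this, at the very top of the entry point (a sketch; it only works if nothing has touched the pool yet, and per the reply below it is not guaranteed):

// First statements of the entry point, before loading anything that
// might schedule work on the libuv pool.
const os = require('node:os');
process.env.UV_THREADPOOL_SIZE = String(os.cpus().length);

const zlib = require('node:zlib'); // loaded only after the variable is set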

@bnoordhuis
Member

Not reliably, no.

I don't think there's anything left to discuss, we're basically waiting for a pull request to materialize, so I'm going to go ahead and close the issue.

@bnoordhuis closed this as not planned on May 25, 2023