
Currently no way to use zlib optimally in Node.js #47709

Closed
JoshuaWise opened this issue Apr 24, 2023 · 9 comments
Labels: feature request (Issues that request new features to be added to Node.js), zlib (Issues and PRs related to the zlib subsystem)

Comments

@JoshuaWise

JoshuaWise commented Apr 24, 2023

What is the problem this feature will solve?

Zlib compression is inherently a CPU-bound task. Therefore, it's only logical that when you need to compress lots of things, you'd want to utilize all your CPUs. However, Node.js's built-in zlib module does its compression within libuv's thread pool, which is designed for I/O-bound tasks, not CPU-bound ones. In fact, libuv spawns exactly 4 threads by default; the pool size has no correlation to the number of CPUs available.

Node.js does provide synchronous convenience methods which do the work in the current thread. This, in combination with worker_threads, allows you to actually utilize all your CPUs for bulk compression tasks. Unfortunately, these synchronous methods are unable to process chunks piecewise; if you need to compress a large file, you have to put the entire thing into memory.
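
For example, this is roughly the pattern available today (a sketch, not part of the proposal; a real version would cap the number of workers at the CPU count):

const { Worker, isMainThread, workerData } = require('node:worker_threads');
const zlib = require('node:zlib');
const fs = require('node:fs');

if (isMainThread) {
	// One worker per file for brevity; a real version would pool workers.
	for (const filename of process.argv.slice(2)) {
		new Worker(__filename, { workerData: filename });
	}
} else {
	// The limitation: the entire file must be held in memory.
	const data = fs.readFileSync(workerData);
	fs.writeFileSync(workerData + '.gz', zlib.gzipSync(data));
}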

Therefore, if you need to compress many large files, Node.js provides no public API for doing it in an optimal way.

What is the feature you are proposing to solve the problem?

I propose a new API within the built-in zlib module which allows you to synchronously compress data, chunk-by-chunk. The API can mirror the existing Hash class that's provided by the crypto module.

const zlib = require('node:zlib');

const gzip = new zlib.GzipSync();

// Each update() call returns the available output, if any (could be an empty Buffer)
const compressedOutput1 = gzip.update(chunk1);
const compressedOutput2 = gzip.update(chunk2);

// The digest() call flushes the output and returns the last output chunk
const compressedOutput3 = gzip.digest();

// The complete result is just the concatenation of all output Buffers
const result = Buffer.concat([compressedOutput1, compressedOutput2, compressedOutput3]);

An example of real-world usage might look like this:

const fs = require('node:fs'); 
const zlib = require('node:zlib');

const CHUNK_SIZE = 1024 * 16;

function* gzipFile(filename, zlibOptions) {
	const gzip = new zlib.GzipSync(zlibOptions);
	const fd = fs.openSync(filename);
	try {
		while (true) {
			const buffer = Buffer.alloc(CHUNK_SIZE);
			const bytesRead = fs.readSync(fd, buffer, 0, CHUNK_SIZE);
			if (bytesRead > 0) {
				yield gzip.update(buffer.subarray(0, bytesRead));
			}
			if (bytesRead < CHUNK_SIZE) {
				yield gzip.digest();
				break;
			}
		}
	} finally {
		fs.closeSync(fd);
	}
}
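
Consuming the generator then composes naturally with ordinary synchronous writes (file names here are just examples):

const out = fs.openSync('input.txt.gz', 'w');
try {
	for (const chunk of gzipFile('input.txt')) {
		fs.writeSync(out, chunk);
	}
} finally {
	fs.closeSync(out);
}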

What alternatives have you considered?

I have searched npm and found only two packages that claim to achieve this while using native zlib bindings.

Both of these packages rely on very scary, dirty hacks: calling the undocumented _processChunk() method and manipulating the undocumented _handle property (which holds the native zlib instance) that Node.js uses internally. I don't find these solutions to be remotely safe or future-proof.

@JoshuaWise added the feature request label Apr 24, 2023
@mscdex
Contributor

mscdex commented Apr 25, 2023

Therefore, if you need to compress many large files, Node.js provides no public API for doing it in an optimal way.

You can always change the size of the libuv thread pool (by setting the UV_THREADPOOL_SIZE environment variable) to whatever makes sense for your hardware and/or application.
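
For example, a small launcher can re-exec the actual workload with the pool sized to the machine (a sketch; the file name is illustrative):

const { spawnSync } = require('node:child_process');
const os = require('node:os');

spawnSync(process.execPath, ['compress-files.js'], {
	stdio: 'inherit',
	env: { ...process.env, UV_THREADPOOL_SIZE: String(os.cpus().length) },
});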

@JoshuaWise
Author

Therefore, if you need to compress many large files, Node.js provides no public API for doing it in an optimal way.

You can always change the size of the libuv thread pool (by setting the UV_THREADPOOL_SIZE environment variable) to whatever makes sense for your hardware and/or application.

I did consider this, and it may be an acceptable solution in simple situations where you have total control over where your code runs. In my case, though, the code is going to run within many different Node programs across many teams in a large organization. It's unreasonable for me to expect that I can force every person in the organization to set this environment variable, especially when the programs in question are hosted across a wide variety of infrastructure.

In other words, it should be possible to create an npm package that optimally compresses files, without forcing end-users to set an environment variable which affects their entire program.

@bnoordhuis
Member

It's already possible to do what you're proposing by compiling zlib to wasm or using an existing JS compression library.

You're welcome to open a pull request but it's a niche enough feature that I won't guarantee it's going to get merged.
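
For example, pako already exposes a chunked synchronous interface; roughly this (a sketch of its documented API, worth double-checking against its docs):

const pako = require('pako');

const deflator = new pako.Deflate({ gzip: true });
deflator.push(chunk1, false); // chunk1, chunk2: Uint8Arrays of input data
deflator.push(chunk2, true);  // true = final chunk, finish the stream

if (deflator.err) throw new Error(deflator.msg);
const compressed = deflator.result; // Uint8Array of the whole gzip stream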

@VoltrexKeyva added the zlib label Apr 25, 2023
@JoshuaWise
Author

JoshuaWise commented Apr 25, 2023

It's already possible to do what you're proposing by compiling zlib to wasm or using an existing JS compression library.

You're welcome to open a pull request but it's a niche enough feature that I won't guarantee it's going to get merged.

Sure, but wasm and JS are slower than using native zlib. They're not optimal. And it seems silly to reach for a third-party library for zlib when Node.js is supposed to come with it out-of-the-box.

I think if Node.js had started out with worker threads, there's a good chance the zlib module would never have been written to use the libuv thread pool. Even ignoring the issue of pool size, the libuv thread pool isn't meant for CPU-bound tasks, so the current zlib implementation seems like an anti-pattern to me. With the current API, if I want to perform CPU-bound tasks like compression, it will actually block I/O-bound tasks such as reading files, because all the pool threads will be busy doing actual work (instead of just sitting around waiting for I/O, which is what they're designed for).
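
This is easy to demonstrate (a sketch; with the default pool of 4 threads, the readFile is queued behind the compression jobs):

const zlib = require('node:zlib');
const fs = require('node:fs');

const big = Buffer.alloc(64 * 1024 * 1024, 'x');
for (let i = 0; i < 4; i++) zlib.gzip(big, () => {}); // occupy all 4 pool threads

const t0 = Date.now();
fs.readFile(__filename, () => {
	console.log(`readFile waited ${Date.now() - t0} ms behind compression`);
});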

The idea that the current API is an anti-pattern is reinforced by the Node.js docs themselves, which warn about memory fragmentation arising from threadpool usage: https://nodejs.org/api/zlib.html#threadpool-usage-and-performance-considerations.

Compression is inherently a synchronous job, so it makes sense (to me) to optimize the API around using it synchronously.

I'm happy to create a PR.

@bnoordhuis
Member

Sure, but wasm and JS are slower than using native zlib. They're not optimal.

You don't know that until you measure it. Reviewers will ask you to quantify performance gains compared to the status quo.

if I want to perform CPU-bound tasks like compression, it will actually block I/O-bound tasks such as reading files

That seems like an argument for smarter thread pool scheduling to me, not a reason to work around it by adding sync compression.

The idea that the current API is an anti-pattern is reinforced by the Node.js docs themselves, which warn about memory fragmentation due to using the event loop

It's a lot more subtle than that, see #8871 (comment).

The docs are not intended to be an exposé on the intricacies of glibc, so the warning was reduced to something that lacks nuance, but saying it's an anti-pattern is factually wrong.

@timotejroiko

timotejroiko commented Apr 27, 2023

Hello, I am the author of one of the mentioned libraries and I just want to drop my two cents on this topic.

It appears to me that the current zlib implementation is geared towards compressing files and other sources of large quantities of data. However, one very common use case that is currently suboptimal to implement is transport compression for network streams (TCP and WebSocket).

This type of compression requires keeping a shared zlib context alive for as long as the connection lives, and most of that time is spent processing small chunks of data. This can potentially be done more efficiently in a synchronous fashion, which also avoids the issues discussed in #8871 as well as in ws#1369.

As an example, the Discord API uses this type of compression in its websocket connections, and most of its Node.js client libraries ended up relying on third-party solutions such as https://github.com/abalabahaha/zlib-sync or one of the libraries mentioned above to provide this functionality instead of using the built-in zlib.
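
For reference, zlib-sync's usage looks roughly like this (from memory of its README, so details may differ):

const zlibSync = require('zlib-sync');

// One context per connection, kept alive for the connection's lifetime.
const inflate = new zlibSync.Inflate({ chunkSize: 65535 });

// For each compressed websocket message received from the socket:
inflate.push(message, zlibSync.Z_SYNC_FLUSH);
if (inflate.err) throw new Error(inflate.msg);
const decompressed = inflate.result;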

Even though my use case is a bit different from Joshua's, I would personally love to see a synchronous shared-context zlib class implemented some day, even if just for the sake of giving developers more options, just as many other Node APIs have both sync and async variants.

@bnoordhuis
Member

@JoshuaWise @timotejroiko you may want to coordinate amongst yourselves. This is basically at the "pull request welcome and we'll take it from there" stage. No promises but no outright "no" either.

@silverwind
Contributor

silverwind commented May 24, 2023

UV_THREADPOOL_SIZE

Can this be set from within a running node application? I assume the answer is "no"?

Edit: If this is to be believed, it should be possible to set it before any async work happens. A real API for it would still be nice, I guess.
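
Something like this, at the very top of the entry point (a sketch; it only works if nothing has touched the pool yet, and per the reply below it is not guaranteed):

// First statements of the entry point, before loading anything that
// might schedule work on the libuv pool.
const os = require('node:os');
process.env.UV_THREADPOOL_SIZE = String(os.cpus().length);

const zlib = require('node:zlib'); // loaded only after the variable is set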

@bnoordhuis
Member

Not reliably, no.

I don't think there's anything left to discuss, we're basically waiting for a pull request to materialize, so I'm going to go ahead and close the issue.

@bnoordhuis closed this as not planned on May 25, 2023