
busboy seems to read ahead too fast for many small (< 16KiB) files #346

Open
KernelDeimos opened this issue Oct 8, 2023 · 5 comments
KernelDeimos commented Oct 8, 2023

short version

When uploading many files under 16KiB, busboy will read ahead as fast as possible, emitting to all the .on('file', ...) listeners, even if the streams for previous files haven't been read yet.

long version

For large files I have a mechanism in place to apply backpressure through a consistent sleep time after each chunk; the sleep time is based on data about uploads of "parts" of the file to the destination (e.g. S3). The plan is to support multiple destinations, which is why this rate is per-file (each file's stream has potentially a different upload rate).

For very small files this mechanism is effectively useless, which is what I expect. However, I also expected that after busboy gives me some number of small files (say 64 files of 1KiB each, since 64KiB was the mode chunk size I measured while uploading large files), and I haven't read the streams for any of these files yet (streams which, if read, would yield a single 1KiB chunk and then end immediately), busboy would pause the stream to apply backpressure. This does not happen.

However, for files in the range [~16KiB, 64KiB] backpressure appears to work, even though I'm only reading one chunk from each file. I'm not sure what makes the difference between [1B, ~16KiB) and [~16KiB, 64KiB], so I suspect that 16KiB might be a significant value somewhere in busboy.

mscdex (Owner) commented Oct 8, 2023

> even if the streams for previous files haven't been read yet

This is due to how node.js streams work. They buffer up to a highWaterMark amount of bytes (16KB is currently the default for readable streams, which is what file streams are). The defaults used by this library are the same as the node.js streams defaults. You can, however, configure them to whatever you like via the fileHwm option for individual file streams and highWaterMark for the parser.

mscdex added the question label Oct 8, 2023
KernelDeimos (Author) commented
That makes sense, but why would more than 16KiB worth of files be emitted before a previous file's stream has ended?

mscdex (Owner) commented Oct 9, 2023

If you're suggesting that busboy stop parsing immediately when it sees the end of a part (storing the current chunk somewhere temporarily) until the file stream has been completely read: that would complicate things, especially because writable streams do not have an unshift() method like readable streams do (and that approach would cause performance issues anyway). More importantly, it could potentially break things for existing users who may not be expecting that kind of behavior.

My suggestion would be to override one or both of the limits I previously mentioned.
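Concretely, the overrides mentioned above might look like the following sketch. The option names highWaterMark and fileHwm come from the busboy README; the values are illustrative, and the constructor/request wiring is shown in comments because it depends on your server setup:

```javascript
// Illustrative busboy limits; tune the values to your workload.
const parserOptions = {
  // headers: req.headers,    // supplied by your HTTP request handler
  highWaterMark: 64 * 1024,   // parser-side buffer, in bytes
  fileHwm: 1024,              // per-file stream highWaterMark, in bytes
                              // (default matches Node's 16KiB readable default)
};
// const bb = require('busboy')(parserOptions);
```

Lowering fileHwm makes each small file's stream fill (and apply backpressure) sooner; raising highWaterMark does the opposite for the parser as a whole.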

KernelDeimos (Author) commented
That almost sounds like what I was thinking, but the description of when to pause sounds different, and I don't understand why it implies a chunk needs to be stored somewhere temporarily.

Although highWaterMark defaults to 16KiB, I get a lot more than 16KiB worth of files after a file whose stream is still unread. I imagined that busboy would emit file events only for the chunks it has consumed; that would be 16 (maybe 32 or 48) file events for my 1KiB files, right? Or, since the mode chunk size I measured from file streams was 64KiB, it would be 64 file events (assuming the request stream produces chunks of the same size). I can also see why it might be 128 file events, because grabbing the next chunk before any of the previous chunk's file streams are read might be more performant.

Instead, I get file events for all the files in my test case (8000 files) when I've only read the streams for the first <100 files. Given this result, I assume busboy would read ahead indefinitely and consume all server memory if the request contained enough very small files (as it is, my test case consumes 8MB of server memory immediately).

mscdex (Owner) commented Oct 9, 2023

I think if you really want to minimize this sort of thing, the best you're going to be able to do is just have something that stops writing to the busboy instance until you've sufficiently determined that it's ok to continue parsing.

This could be accomplished by creating a custom Transform stream (or similar) that sits between your upstream and the busboy instance. It would check how many file streams have not been completely read and then either continue passing data along or pause until they have all been read (or apply whatever logic you want).
