
Access the raw buffer instead of stream #226

Closed
Kostanos opened this issue Nov 28, 2020 · 10 comments

@Kostanos

Is it possible to access the raw buffer of the uploaded files instead of the stream?

As the file is already stored in a buffer, it is a waste of resources to convert the readable stream back into a buffer again. It would be great to have direct access to the buffer instead.

@jaydenseric
Owner

jaydenseric commented Nov 29, 2020

@mike-marcacci is the best person to answer this question.

I'm not 100% following what you mean. Usually you want a stream so the file can be processed as it's streaming in; e.g. streaming up to cloud storage. I'm not sure why you would want to buffer the whole file in memory in your resolver, unless you are doing some sort of editing of the file in the resolver code.
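
For illustration, streaming an upload straight to cloud storage in a resolver might look something like the sketch below. This assumes the AWS SDK v2 S3 client; the bucket name and mutation shape are hypothetical, while the { createReadStream, filename, mimetype } shape is what this library's Upload scalar resolves to.

const AWS = require('aws-sdk'); // assumption: AWS SDK v2
const s3 = new AWS.S3();

// Resolver for a hypothetical mutation `uploadFile(file: Upload!): String`.
async function uploadFile(parent, { file }) {
  // The Upload scalar resolves to the file's metadata plus a
  // createReadStream function.
  const { createReadStream, filename, mimetype } = await file;

  // s3.upload accepts a readable stream as Body, so the file is sent to
  // cloud storage as it arrives rather than being held whole in memory.
  const { Location } = await s3
    .upload({
      Bucket: 'my-bucket', // placeholder
      Key: filename,
      Body: createReadStream(),
      ContentType: mimetype,
    })
    .promise();

  return Location;
}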

@mike-marcacci
Collaborator

Hi @Kostanos, to be clear here: this library actually abstracts away the details of uploads so that resolvers can use them without special caveats around coordinating the processing of streams, etc. Without this abstraction, it would be exceedingly difficult to write resolvers that exhibit predictable behavior in any but the most simple upload scenarios.

This kind of abstraction would be fairly trivial to implement by reading all uploads into node Buffer objects before calling the GraphQL schema, and IIRC this was actually the strategy used in the very first prototype of this library. However, this had some significant drawbacks:

  1. All the uploads needed to finish before any resolvers were called
  2. All the uploads needed to fit in memory

These were such significant problems that the library was rewritten to use streams, which are designed for cases where data may not all be available at once and which can handle quantities of data in excess of system memory. However, once a chunk of data flows through a stream, there is no way to "replay" it, because it isn't actually stored anywhere. This means that two resolvers using the same upload variable would have to start reading in the same tick of the event loop. Additionally, an HTTP request is a single stream of data, so if the same request contains two uploads and they are used by the query in the opposite order to that in which they were sent, the desired behavior becomes impossible.

To solve this problem, I built fs-capacitor which eagerly begins buffering the upload to the filesystem, while allowing resolvers to read the bytes as soon as they become available.

So I think you were asking for a node Buffer of the upload, to which the answer is: there is no such thing to make accessible. However, you essentially already have access to the upload's bytes, which are also being "buffered" to the filesystem. If you read these into a Buffer you will not be duplicating the bytes in memory, as they aren't in memory... however, I suspect that most things you would want to do with an upload can be better achieved with streams.

Does this answer your question?

@Kostanos
Author

Thank you for your explanation.
In my particular situation, I need a buffer only to upload the file to a third-party S3-like service, but this service does not work with S3.upload, only with S3.putObject, which does not accept a stream for some reason.
I managed to convert a stream to a buffer on my side.
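
Roughly, the workaround looks like the following sketch (assuming the AWS SDK v2 client; the bucket name is a placeholder):

// Collect the upload stream into a Buffer, then hand it to putObject,
// which (unlike s3.upload) needs the whole body up front.
async function putUpload(s3, file) {
  const { createReadStream, filename, mimetype } = await file;

  const chunks = [];
  for await (const chunk of createReadStream()) chunks.push(chunk);
  const body = Buffer.concat(chunks);

  await s3
    .putObject({
      Bucket: 'my-bucket', // placeholder
      Key: filename,
      Body: body,
      ContentType: mimetype,
    })
    .promise();
}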

From what you say, it seems to me that storing uploads in a file could be optimized by storing them in memory instead, don't you think?

@jaydenseric
Owner

@Kostanos

does not accept a stream for some reason

I wouldn’t give up on streams without knowing exactly why they are not working.

it seems to me that storing uploads in a file could be optimized by storing them in memory instead

Servers generally have much more filesystem space available than memory.

@mike-marcacci
Collaborator

From what you say, it seems to me that storing uploads in a file could be optimized by storing them in memory instead, don't you think?

I suppose the question is really, "optimized for what?" While not as fast as memory, modern SSDs are very, very fast. A server needs to have sufficient buffer space for all concurrent uploads, and a slow client might take many minutes to make a single large request. Unless you're primarily accepting requests from other servers within the same datacenter, it's very unlikely that your SSD is going to be the bottleneck. The default "general purpose" SSDs of AWS/Google/Azure perform at several Gbps, and all providers offer optimized SSDs (named things like "local SSDs" etc) that are designed for this purpose and outperform the network. A 256GB SSD is orders of magnitude cheaper than 256GB of memory, and while the latter would certainly win a benchmark, it would cost an enormous amount more with no detectable real-world benefit.

@jaydenseric perhaps we should add a note on the readme suggesting that folks use SSDs in production? My understanding is that this is the default choice for most people though (we no longer use spinning disks for anything as "archival" functionalities are far better served by cloud storage services) so it's probably unnecessary. @Kostanos are you seeing actual performance issues with uploads, and if so are you using spinning disks?

@Kostanos
Author

I think you are forgetting the main variable with SSDs that still makes RAM the preferable choice for small, short-lived temporary data: the number of write cycles. As far as I know, SSDs are limited to a maximum of around 100,000 write cycles, at least for consumer models, and that is with a good, expensive SSD.
Even on my dev PC, which runs hundreds of unit tests per day with this library, I'm forced to work around this by configuring an in-memory temp disk in Docker. That's not a big problem for me, as I already do it for the DB engine in unit tests.

And bringing it to the server, on heavily loaded sites (like the one I'm currently working on) or on testing machines, using memory instead of disk for this data would extend the SSD's lifespan a lot, I assume.
I don't think performance is the issue here, as you mentioned.

@mike-marcacci
Collaborator

Ah yes, haha well I suppose we are looking at server hardware from different perspectives: the owner vs the renter. My perspective here is as someone who primarily uses cloud (aka rented) servers in production environments. I pay the same for an hour of SSD time whether I write 1 or 1000 times. (And honestly it would still be far less expensive to replace owned SSDs than purchase and run the equivalent amount of modern RAM.)

Of course, as you mentioned, it's fairly trivial to use memory drives or spinning drives for temp storage, so that option is always there for those with special use cases.

Again, though – regardless of how you choose to back the filesystem – the bytes are made available via a stream, such that there is minimal duplication of stored data.

I know there has been some interest in an implementation of this library that better supports the lambda/functions use case given that the serverless framework pre-loads the entire request into memory before simulating the "http request".

If avoiding the disk is important to you and you're willing to sacrifice variable reuse and order independence, and to limit yourself to your working memory, then the graphql-upload-minimal fork might be for you. I generally discourage its use because those caveats are significant, but there clearly are workloads where that trade-off might make sense.

@olso

olso commented Feb 26, 2021

@jaydenseric @mike-marcacci

I just ran into this while writing glue code around https://github.com/imagemin/imagemin, which only accepts a Buffer.

@mike-marcacci
Collaborator

mike-marcacci commented Feb 26, 2021

Hi @olso, I would probably encourage you to avoid image processors that don't have streaming APIs. The vast majority of common image transformations can be done with streams, avoiding the need to store the entire pre- and post-transformation images in memory. If you're really set on using something like this, you can simply read the contents of the stream into a buffer (which is what anything that reads an HTTP request body into a buffer does under the hood). There are dozens of approaches, from using a pre-made library, to writing your own WritableStream, to just listening to events:

const chunks = [];
stream.on('data', (chunk) => { chunks.push(chunk); });
stream.on('end', () => {
  const buffer = Buffer.concat(chunks);
  // ... do something with that buffer
});

(Note the lack of error handling in the excerpt above.)
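
A promise-wrapped variant with basic error handling might look something like this (a sketch, not part of this library):

// Resolve with a single Buffer containing the whole stream, or reject on error.
function streamToBuffer(stream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    stream.on('data', (chunk) => chunks.push(chunk));
    stream.on('error', reject);
    stream.on('end', () => resolve(Buffer.concat(chunks)));
  });
}

// Usage: const buffer = await streamToBuffer(createReadStream());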

Again, a stream is the "raw" form here, which can trivially be read into a single buffer if needed; going the other direction would have enormous performance/scalability consequences, hence the design choice.

@olso

olso commented Feb 26, 2021

@mike-marcacci Thanks for the answer, I already did what you suggested. I'm now trying to find something like imagemin but with streams.
