Setting appropriate endByte for range download request #3513
I also observed through testing that if the offset is 0, the Range header is never set, because the offset is checked for truthiness (and 0 is falsy) rather than checked against None:
This causes a full object request on the S3 side regardless of the read size.
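A minimal sketch of the truthiness pitfall described above (function names are hypothetical, not Girder's actual code): a check like `if offset:` treats an offset of 0 as "no offset", so no Range header is sent and the server returns the full object.

```python
def build_range_header(offset, end_byte=None):
    """Build HTTP headers for a ranged GET (buggy version)."""
    headers = {}
    if offset:  # BUG: falsy when offset == 0, so no Range header is sent
        headers['Range'] = 'bytes=%d-%s' % (
            offset, '' if end_byte is None else end_byte - 1)
    return headers


def build_range_header_fixed(offset, end_byte=None):
    """Build HTTP headers for a ranged GET, also ranging reads from byte 0."""
    headers = {}
    if offset or end_byte is not None:
        headers['Range'] = 'bytes=%d-%s' % (
            offset, '' if end_byte is None else end_byte - 1)
    return headers
```

With the buggy check, `build_range_header(0, 1024)` returns no headers at all, while the fixed version returns `Range: bytes=0-1023` (HTTP ranges are inclusive of the last byte).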
Thanks for the thorough report, @codybum. To make sure I'm on the same page: in the DSA, you are using the … Looking at how that's implemented, the behavior you've documented is expected, and if it's suboptimal for your use case due to unexpectedly closed connections, I think it'd be better to use a different entry point altogether, as …
Could you point me to the place in DSA where you're invoking these calls? That would help me figure out how best to augment the file and assetstore interface to support your use case.
@zachmullen I am using the standard DSA container launched through the provided compose file, with an S3 assetstore. I have been digging around in the code trying to figure out why the S3 servers were getting hit so hard by Girder/DSA. I would never have noticed this behavior if we were not developing our own S3 frontends for distributed backends. That said, AWS and others that use distributed storage and systems to provide S3 will have the same backend issue.

The fundamental issue is that no matter how much data is read from the S3 object, the full remaining size of the file (from the last seek) is requested from the S3 server. The S3 and other adapters have provisions for start- and end-byte requests; however, since the S3 stream is only configured on seek and not on read, there is no way at read time to tell the S3 server how much data to send. So if you make N requests for parts of a file, which is normal for DSA, you will request most of the file from the backend server N times.

This might not be an obvious issue for DSA or Girder, but it could be a major one for S3 providers that use distributed storage networks to provide S3 services. From a distributed-systems standpoint (backend S3) this is very problematic: the remote stream has to be terminated when the socket is abruptly closed (frontend S3), and the delay in transit creates a significant amount of in-flight data (S3 network) that will never be used.

If the S3 stream were configured on both seek (no endByte) and read (endByte set), I think this would solve the issue. It is not clear to me where this might cause you problems, as the virtual pointer in the abstract store would be in the same position at the end of the read either way, but it would prevent unnecessary transfers on S3 storage networks.
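To make the scale of the waste concrete, here is a back-of-the-envelope sketch (illustrative numbers, not measurements from DSA) comparing the bytes requested from the backend when each ranged read defaults its endByte to the file size versus when it requests exactly the bytes it needs:

```python
def bytes_requested(file_size, reads, end_byte_is_eof):
    """Total bytes asked of the S3 backend for a list of (offset, size) reads.

    If end_byte_is_eof, every request spans from the offset to the end of
    the file (the behavior described above); otherwise each request spans
    only the bytes actually read.
    """
    total = 0
    for offset, size in reads:
        end = file_size if end_byte_is_eof else min(offset + size, file_size)
        total += end - offset
    return total


# Three small reads against a 1000-byte file:
reads = [(0, 10), (100, 10), (500, 10)]
eof_total = bytes_requested(1000, reads, end_byte_is_eof=True)    # 2400 bytes
exact_total = bytes_requested(1000, reads, end_byte_is_eof=False)  # 30 bytes
```

Even with only three 10-byte reads, the offset-to-EOF behavior asks the backend for 2400 bytes instead of 30; real tiled-image workloads issue far more reads, so the gap grows accordingly.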
Fundamentally, the …
I would tend to agree, except that the code that reads the images expects the files to be on a local block device. The current DSA implementation uses FUSE -> the Girder assetstore, so whatever the client libraries want to do gets passed through. While I have dug into the implementation trying to find the issue, this is DSA code as released.

Cody
@manthey any thoughts on where downstream this fix belongs? Should it be in the FUSE adapter or the DSA itself? |
I see the call that sets the download stream has been moved from the seek function to the read function, which gets us almost all the way there:
We have the following changes in place, which set the endByte when a size is provided on read. Setting the endByte when configuring the stream lets the S3 adapter make appropriately ranged requests. This does not appear to negatively impact local assetstores, but testing has been limited.
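As a rough illustration of that change (hypothetical class and names; this is not the actual Girder adapter code), the idea is that seek() only moves the virtual pointer, while read() opens the backend stream with an endByte covering exactly the requested bytes, so the server is never asked for data that will not be consumed:

```python
class RangedFileHandle:
    """Toy file handle that defers the backend request to read time."""

    def __init__(self, data):
        self._data = data           # stands in for the remote S3 object
        self._pos = 0
        self.requested_ranges = []  # records what the "server" was asked for

    def seek(self, offset):
        # No backend request here; just move the virtual pointer.
        self._pos = offset

    def read(self, size=None):
        start = self._pos
        end = len(self._data) if size is None else min(start + size, len(self._data))
        # A real adapter would issue a GET with "Range: bytes=start-(end-1)".
        self.requested_ranges.append((start, end))
        chunk = self._data[start:end]
        self._pos = end
        return chunk


handle = RangedFileHandle(b'0123456789')
handle.seek(2)
chunk = handle.read(3)  # asks the backend for bytes [2, 5) only
```

Because the pointer still ends up at `offset + size` after each read, the externally visible seek/read semantics are unchanged; only the size of each backend transfer differs.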
I have been chasing down a develop issue related to DSA and a local S3 server as documented here: DigitalSlideArchive/digital_slide_archive#318
We observe a full object download followed by numerous ranged requests as part of the DSA file import process. The ranged requests have various offsets but the same endByte (the size of the file), regardless of the read size, and in each ranged request Girder closes the socket before the transfer is complete. On closer inspection, Girder sets the appropriate offset on seek, which configures the stream for subsequent reading, but does not set the endByte, which defaults to the size of the file. On read, the S3 server therefore expects to transfer all data from the offset to the end of the file (endByte), but the connection is closed as soon as the ranged read is completed on the Girder side. Ideally, the appropriate range for each read would be passed to the S3 backend, allowing appropriate buffer allocation and graceful handling of sockets.
There might be a good reason for the current implementation, but perhaps there is a better way to configure the S3 stream.
It appears that the requested offset is correctly set on seek in abstract_assetstore_adapter.py, where no endByte can be provided:
girder/girder/utility/abstract_assetstore_adapter.py
Line 106 in 89ab997
When no endByte is provided, a default endByte is set based on the size of the entire file:
girder/girder/utility/s3_assetstore_adapter.py
Line 416 in 89ab997
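A minimal sketch of that defaulting behavior (hypothetical function, not the adapter's actual code): when no endByte is supplied, the resulting Range header spans from the offset all the way to the end of the file, so every read becomes an offset-to-EOF request.

```python
def s3_range(offset, file_size, end_byte=None):
    """Compute the HTTP Range header value for a ranged S3 GET.

    When end_byte is None it defaults to the full file size, which is
    the behavior described above: the request covers offset..EOF even
    if only a small read follows.
    """
    if end_byte is None or end_byte > file_size:
        end_byte = file_size  # default: request through the end of the file
    # HTTP byte ranges are inclusive, hence end_byte - 1.
    return 'bytes=%d-%d' % (offset, end_byte - 1)
```

For example, an 8 KiB read at offset 4096 in a 1 MB file produces `bytes=4096-999999` under the default, versus `bytes=4096-12287` when the endByte is passed through.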
If the stream was configured on read it would be possible to set the appropriate endByte value:
girder/girder/utility/abstract_assetstore_adapter.py
Line 61 in 89ab997