Replace interior Reader without resetting decompress data #276

Open
jnbooth opened this issue Jul 9, 2021 · 9 comments

@jnbooth

jnbooth commented Jul 9, 2021

If I'm parsing a continuous encoded stream that is (unfortunately) divided up across several successive inputs, my only option appears to be using the Decompress object directly, which is a huge hassle without access to the zio:: functions.

It seems like it would be easy to create a "replace" method identical to the various "reset" methods, but with the reset_decoder_data(self); call removed. That way, I could switch out one Read object for the next while preserving the stream state.
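For reference, the Decompress-based workaround looks roughly like this (a minimal sketch, assuming a raw deflate stream split across chunks; the chunking and buffer sizes are arbitrary):

```rust
use flate2::{Decompress, FlushDecompress, Status};

// Sketch of the direct-Decompress workaround: one Decompress object lives
// across all input chunks, so the stream state is preserved between them.
fn decode_chunks(chunks: &[&[u8]]) -> Result<Vec<u8>, flate2::DecompressError> {
    // `false` = raw deflate without a zlib header; adjust for the actual stream.
    let mut decomp = Decompress::new(false);
    let mut out = Vec::new();
    for chunk in chunks {
        let mut pos = 0;
        while pos < chunk.len() {
            // decompress_vec writes into the Vec's spare capacity, so make
            // sure there is room before every call.
            out.reserve(32 * 1024);
            let before = decomp.total_in();
            let status = decomp.decompress_vec(&chunk[pos..], &mut out, FlushDecompress::None)?;
            pos += (decomp.total_in() - before) as usize;
            if status == Status::StreamEnd {
                return Ok(out);
            }
        }
    }
    Ok(out)
}
```

It works, but all the buffering and bookkeeping that the zio:: helpers normally handle has to be re-done by hand.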

@Gelbpunkt

I would love this exact same feature, but for ZlibEncoder - right now I can't reuse a ZlibEncoder object, and using Compress directly (https://gist.github.com/Gelbpunkt/44a6fb4cfe15f15e4c3757416b7021b1) doesn't work when I reuse the object.
It does work fine when calling reset on the Compress object before compressing the next payload, but that does not seem to preserve the encoding parameters: zlib reports invalid stored block lengths when the same decoder (in Python) is used to decode multiple payloads that were created by encoding and then resetting the Compress object.

@Byron
Member

Byron commented Sep 13, 2023

I am probably missing some intricacies of the problem, but I wonder whether it's possible to create a custom Write/Read type that is Clone and uses shared ownership internally. That way it would be possible to keep a handle to it after passing it to types in this crate, and to use interior mutability to swap out any state as you see fit.

I'd find that, in theory, a more suitable first step towards solving this particular problem, as it's much less involved. Of course, it also leaves the hard part of managing the internal decompressor state correctly to the caller, which certainly wouldn't be trivial, but it should be possible.
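Something along these lines, just as a sketch (the names are made up):

```rust
use std::io::{self, Read};
use std::sync::{Arc, Mutex};

// A cloneable reader whose inner source can be swapped at any time without
// touching the decoder that wraps a clone of it.
#[derive(Clone)]
struct SwappableReader {
    inner: Arc<Mutex<Box<dyn Read + Send>>>,
}

impl SwappableReader {
    fn new<R: Read + Send + 'static>(r: R) -> Self {
        SwappableReader { inner: Arc::new(Mutex::new(Box::new(r))) }
    }

    // Swap the underlying source; the decoder keeps reading from whatever is current.
    fn swap<R: Read + Send + 'static>(&self, r: R) {
        *self.inner.lock().unwrap() = Box::new(r);
    }
}

impl Read for SwappableReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.inner.lock().unwrap().read(buf)
    }
}
```

One clone would go into GzDecoder::new(...) while the caller keeps the other and calls swap() whenever the next input arrives; how the decompressor copes with the transition is the part that would need figuring out.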

@nc-x

nc-x commented Jun 5, 2024

I tried swapping out the internal reader of GzDecoder using interior mutability, but it does not work.

  • You have to swap when the first internal reader is completely read.
  • However, when the first reader is completely read, you get an error: corrupt deflate stream (because the stream reader only had partial data).
  • In this case, ignoring the error and swapping the reader does nothing, as GzDecoder is internally in an error state and it goes into an infinite loop.
  • Also, if there were a way to know the decoded size of a reader, we could have used read_exact. But knowing it in advance is impossible, so...

I don't think this can be done without additional support from the library itself.

@Byron
Member

Byron commented Jun 5, 2024

Couldn't a multi-member (continuous) stream be handled by a MultiGzDecoder?
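If it fits, the usage would be as simple as this (a minimal sketch):

```rust
use std::io::Read;
use flate2::read::MultiGzDecoder;

// Sketch: MultiGzDecoder keeps decoding across concatenated gzip members,
// where a plain GzDecoder would stop after the first member.
fn decode_all<R: Read>(r: R) -> std::io::Result<Vec<u8>> {
    let mut out = Vec::new();
    MultiGzDecoder::new(r).read_to_end(&mut out)?;
    Ok(out)
}
```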

I think what would really help here is an example program with real (but minimal) input data.

@nc-x

nc-x commented Jun 5, 2024

AFAICT, MultiGzDecoder is for a gzip file that contains multiple members, which is not the use case here. The use case here, for example, is downloading huge gzip files using range requests. Reqwest's Response object implements Read, so it can be used with GzDecoder or MultiGzDecoder, but once a given response is decoded you make another range request to get the next piece of data and decode it in a streaming fashion, presumably because the data is huge and you want to filter out some of it or parse it on the fly, etc.

But because every range request yields a different response, the ability to swap the internal reader is required while keeping the internal state intact.

I do have some code for this, but I don't have a public gzip URL I can try it on. Let me see if I can do something about the example code.

@the8472
Member

the8472 commented Jun 5, 2024

However, when the first reader is completely read, you get an error: corrupt deflate stream (because the stream reader only had partial data).

Perhaps handling this as a WouldBlock error like this test does could work?

flate2-rs/src/gz/read.rs

Lines 332 to 350 in 1a0daec

impl Read for BlockingCursor {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        // use the cursor, except it turns eof into blocking error
        let r = self.cursor.read(buf);
        match r {
            Err(ref err) => {
                if err.kind() == ErrorKind::UnexpectedEof {
                    return Err(ErrorKind::WouldBlock.into());
                }
            }
            Ok(0) => {
                // regular EOF turned into blocking error
                return Err(ErrorKind::WouldBlock.into());
            }
            Ok(_n) => {}
        }
        return r;
    }
}

Then on error the next chunk can be filled in.
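Roughly like this (a sketch; the refill goes through get_mut() on the decoder, and whether the decoder state survives a WouldBlock mid-stream is exactly what would need verifying):

```rust
use std::io::{Cursor, ErrorKind, Read, Result};
use flate2::read::GzDecoder;

// Inner reader: a cursor over the bytes received so far; running out of data
// is reported as WouldBlock instead of EOF.
struct BlockingChunks {
    cursor: Cursor<Vec<u8>>,
}

impl Read for BlockingChunks {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        match self.cursor.read(buf) {
            Ok(0) => Err(ErrorKind::WouldBlock.into()), // out of data, not a real EOF
            other => other,
        }
    }
}

// Driver: decode chunk by chunk, appending the next chunk whenever the
// decoder reports WouldBlock.
fn decode(chunks: &[&[u8]]) -> Result<Vec<u8>> {
    let mut remaining = chunks.iter();
    let first = remaining.next().map(|c| c.to_vec()).unwrap_or_default();
    let mut dec = GzDecoder::new(BlockingChunks { cursor: Cursor::new(first) });
    let mut out = Vec::new();
    let mut buf = [0u8; 8 * 1024];
    loop {
        match dec.read(&mut buf) {
            Ok(0) => return Ok(out), // the gzip stream really ended
            Ok(n) => out.extend_from_slice(&buf[..n]),
            Err(e) if e.kind() == ErrorKind::WouldBlock => match remaining.next() {
                // Append the next chunk behind the cursor; it keeps reading
                // from its current position into the newly added bytes.
                Some(chunk) => dec.get_mut().cursor.get_mut().extend_from_slice(chunk),
                None => return Ok(out), // ran out of input mid-stream
            },
            Err(e) => return Err(e),
        }
    }
}
```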

@jongiddy
Contributor

jongiddy commented Jun 5, 2024

Either:

  1. Create your own struct that implements Read. When it has no data it performs the next Range request to fill its buffer. It only returns 0 when there are no more ranges to request. Submit this struct as the reader to read::GzDecoder::new. (I believe this is the solution referred to in Byron's comment above). It is probably more efficient to implement BufRead and use bufread::GzDecoder.
  2. Use a write::GzDecoder. Create a loop to read Range requests and write the returned data to the write::GzDecoder. This is the only solution if the Range requests use async code.
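A minimal sketch of option 2, with the Range requests stubbed out as an iterator of byte chunks:

```rust
use std::io::Write;
use flate2::write::GzDecoder;

// Sketch of option 2: push each downloaded chunk into a write::GzDecoder,
// which writes the decoded bytes into the inner writer (a Vec here).
fn decode_ranges<I>(chunks: I) -> std::io::Result<Vec<u8>>
where
    I: IntoIterator<Item = Vec<u8>>,
{
    let mut decoder = GzDecoder::new(Vec::new());
    for chunk in chunks {
        // `chunk` is whatever the next Range request returned.
        decoder.write_all(&chunk)?;
    }
    // finish() flushes the remaining state and hands back the decoded data.
    decoder.finish()
}
```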

@nc-x

nc-x commented Jun 6, 2024

I have created a repo with an example program: https://github.com/nc-x/flate-276

Create your own struct that implements Read. When it has no data it performs the next Range request to fill its buffer. It only returns 0 when there are no more ranges to request. Submit this struct as the reader to read::GzDecoder::new. (I believe this is the solution referred to in #276 (comment) above). It is probably more efficient to implement BufRead and use bufread::GzDecoder.

The issue is that when you try reading from GzDecoder, it will simply read past the end of the partial data it was provided and fail, which is expected, because technically the input is incomplete and hence invalid.

So basically the first read itself fails, and you never get to the point where you could swap the underlying readers or return 0, etc. AFAICT end users of GzDecoder cannot do anything about this; there needs to be a way to tell GzDecoder that it will be given partial data, so it shouldn't fail.

I have not tried your 2nd solution and am busy for a couple of days, but presumably it would run into the same issue as well?

@jongiddy
Contributor

jongiddy commented Jun 8, 2024

I made a pull request to your repo to demonstrate the correct solution using a Read implementation.

The Read implementation of Download turns the chunked range requests into a single Read stream. The GzDecoder is outside Download and only has its reader set once in the new method. It never replaces the reader.
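In outline it looks like this (a sketch, with the actual Range request abstracted behind a closure rather than reqwest):

```rust
use std::io::{Read, Result};

// Sketch of a Download-style reader: it presents the chunked range responses
// as one continuous byte stream. The fetch step is abstracted as a closure
// returning the next chunk, or None once all ranges have been requested.
struct Download<F: FnMut() -> Option<Vec<u8>>> {
    fetch_next: F,
    buf: Vec<u8>,
    pos: usize,
}

impl<F: FnMut() -> Option<Vec<u8>>> Read for Download<F> {
    fn read(&mut self, out: &mut [u8]) -> Result<usize> {
        // Refill the buffer from the next range request when it runs dry.
        while self.pos == self.buf.len() {
            match (self.fetch_next)() {
                Some(chunk) => {
                    self.buf = chunk;
                    self.pos = 0;
                }
                None => return Ok(0), // no more ranges: this is the real EOF
            }
        }
        let n = out.len().min(self.buf.len() - self.pos);
        out[..n].copy_from_slice(&self.buf[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}
```

The GzDecoder wraps this once and reads it to the end; it never sees an EOF until the last range has been served.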
