Cloud compatibility #85

Farkal · 2020-08-03T20:14:06Z

I open this issue to speak about the best way to have cloud compatibility.

What does it mean ? Simply that we should be able to decode the header step by step. Because the ifd can be anywhere in the file we should be able to decode ifd one by one and set the data to decode each time.

Example:

I get the first 1024 bytes of a file
Get the first ifd and decode it (here we could keep the current init method that decode the header and the first ifd)
If there is another ifd and it's outside the 1024 bytes i should be able to call next_image(new_bytes) and get an error if we can't decode the next ifd (better would be to tell the size we need = 2 + the entry count at the begining of the IFD * 12 )

If there is a way to dynamicaly add data to cursor and fake the position we should be able to already have something working but i didn't find any way to do that. I wait for your advices about this use case before making a PR 😉

fintelia · 2020-08-03T20:32:42Z

I'm not sure if there is a crate for it, but it should be possible to make your own struct that takes a Reader and implements Read+Seek by buffering all data that is read. Something like:

struct BufferCursor<R: Read> { 
    r: R,
    data: Vec<u8>
    cursor: usize,
}
impl Read for BufferCursor {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        while cusor >= data.len() { 
             // read data into buffer
         }
        // copy data from self.data into buf
    }
}
impl Seek for BufferCursor {
    fn seek(&mut self, ...) {
         // Update cursor, but don't modify self.data or self.r
    }
}

If you do implement something like this, you should totally release it on crates.io though!

The bigger question is whether image-tiffwill actually do something sane when passed a reader like this. From skimming a bit of the code it looks plausible, but I can't promise that the decoder won't seek to the end of the stream without warning/before it is actually necessary.

Farkal · 2020-08-07T13:58:53Z

After some investigation I found that the best way was to add a new limit var:

/// Decoding limits
#[derive(Clone, Debug)]
pub struct Limits {
    /// The maximum size of any `DecodingResult` in bytes, the default is
    /// 256MiB. If the entire image is decoded at once, then this will
    /// be the maximum size of the image. If it is decoded one strip at a
    /// time, this will be the maximum size of a strip.
    pub decoding_buffer_size: usize,
    /// The maximum size of any ifd value in bytes, the default is
    /// 1MiB.
    pub ifd_value_size: usize,
    // Allow to specify a file limit to raise an error if decoding of the data is impossible (for cloud range request purpose)
    pub file_limit: Option<u64>,
    /// The purpose of this is to prevent all the fields of the struct from
    /// being public, as this would make adding new fields a major version
    /// bump.
    _non_exhaustive: (),
}

Then on each call of the goto_offset_u64 i check the value and raise if it's upper the file_limit value

#[inline]
    pub fn goto_offset_u64(&mut self, offset: u64) -> TiffResult<()> {
        if let Some(file_limit) = self.limits.file_limit {
            if (file_limit as u64) < offset {
                return Err(TiffError::DataUnreachable(offset))
            }
        }
        self.reader
            .seek(io::SeekFrom::Start(offset))?;
        Ok(())
    }

I do the same in the read_ifd function

let num_tags = if self.bigtiff { self.read_long8()? } else { self.read_short()?.into() };

        if let Some(file_limit) = self.limits.file_limit {
            let tag_size = if self.bigtiff { 20 } else { 12 };
            if (file_limit as u64) < offset + num_tags * tag_size {
                return Err(TiffError::DataUnreachable(offset + num_tags * tag_size))
            }
        }

And i will add something on the decoding_offset function also.
Then the lib that is calling and fixing the limit can manage how to handle this. The buffer cursor is not very usefull if we have ifd at the end of the file, we will need to fetch lot of data for nothing.

HeroicKatora · 2020-08-08T21:55:58Z

If there is another ifd and it's outside the 1024 bytes i should be able to call next_image(new_bytes) and get an error if we can't decode the next ifd (better would be to tell the size we need = 2 + the entry count at the begining of the IFD * 12 )

Could you explain how this error has to differ from the one already produced by any beyond the end seek or read? What doesn't exist is a recovery after such an error, in other words reading a strip or ifd isn't transactional and any error may leave it in an improper state for resumption.

Farkal · 2020-08-10T08:08:51Z

There is two issues.
The first is that the error leave the reader in an improper state for resumption so you have to decode everything again.
The second (and what i am trying to fix here) is that you don't know where and how many bytes you need to read. But everything is in the ifd.
For each ifd you know that a tag is 12 bytes (20 for bigtiff) and you have the number of tags at the begining of the ifd so you can easily find if you will be able to read the whole tags or if you need to fetch more data.
For each offset of a tag you have the type of the value (BYTE, SBYTE, SHORT, LONG, FLOAT...) and the count, so you can also find if you will be able to access the whole data just by reading the ifd.
I have implemented this in my custom branch and here is the commit about managing offset -> Farkal@2adf263

We could solve the first issue by using custom reader and if the seek or the read fail we could fetch missing data BUT we don't know how much we need and on high resolution image this could lead to lot of requests 😕
That's why i think returning an error with the data needed could solve the issue and keep the decoder in good state (because we check before seeking).
This work great when all the ifd and tileoffset sor stripoffsets are packed at the beginning of the file (that the case in cloud optimised geotiff) but if we have ifd1 image1 ifd2 image2 ifd3 image3 we will need to fetch also the images to get all the ifd and to solve this case we should be able to provide the buffer when we run next_image (it would be better to have decode_ifd(data: Option<Vec<u8>>) if the data is set we use it else we fetch the file or having kind of CouldDecoder).

Farkal · 2020-10-14T16:22:22Z

Also for better cloud support we should provide a step between the parsing of the tag type and the decoding of their values.
For example TileOffset can have lot of data and we don't want to get all the TileOffset but we prefer to know the index and the size of the values so then we can get the tile index easily for 2D image:

// tile_offset { index: 1253, value_size: 4 }
let tiles_per_row = (level.width as f32 / level.tile_width as f32).ceil() as u32;
let index = tile_offset.index + (y*tiles_per_row + x) * tile_offset.value_size

If we just add a method get_ifd() on Decoder this should solve this issue

HeroicKatora mentioned this issue Oct 14, 2020

Ghost header or; data outside IFDs and Images #99

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cloud compatibility #85

Cloud compatibility #85

Farkal commented Aug 3, 2020

fintelia commented Aug 3, 2020

Farkal commented Aug 7, 2020

HeroicKatora commented Aug 8, 2020

Farkal commented Aug 10, 2020

Farkal commented Oct 14, 2020

Cloud compatibility #85

Cloud compatibility #85

Comments

Farkal commented Aug 3, 2020

fintelia commented Aug 3, 2020

Farkal commented Aug 7, 2020

HeroicKatora commented Aug 8, 2020

Farkal commented Aug 10, 2020

Farkal commented Oct 14, 2020