
How one can cache Dataset #425

Open
REASY opened this issue Jul 21, 2023 · 10 comments

REASY commented Jul 21, 2023

Hello, team,

I have a tile server that serves slippy map tiles, implemented as an HTTP server using gdal-rs. The actual rasters are partitioned into many Cloud Optimized GeoTIFF (COG) files with overviews. At a high level, I extract tile information from a request of the form /:prefix/:layer/:z/:x/:y and map it to an overview and an offset to read from the COG. My COG files are stored in S3 and I use vsis3. At the beginning of a request I open a Dataset; at the end it is implicitly closed by drop. Interestingly, if I query the same slippy tile twice, only the first request has high latency; the second one is much faster (is that because of the VSI cache?):

2023-07-21T02:13:12.185123Z  INFO tokio-runtime-worker ThreadId(03) qartez_slippy_server::routes: src/routes.rs:169: Read and prepared a tile for .../20/179207/418903.png from /vsis3/.../color/geotiff/5600_13090.tif in 475 ms
2023-07-21T02:13:42.265197Z  INFO tokio-runtime-worker ThreadId(02) qartez_slippy_server::routes: src/routes.rs:169: Read and prepared a tile for .../20/179207/418903.png from /vsis3/.../color/geotiff/5600_13090.tif in 3 ms

Does it make sense in such a scenario to cache the underlying C dataset handle and reuse it? Or should VSI_CACHE_SIZE together with GDAL_CACHEMAX be enough?

Thank you.
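(For context, the z/x/y extraction described above can be sketched with plain std Rust; parse_tile_path is a hypothetical helper invented for this sketch, not anything from gdal-rs, and the real routing code isn't shown in this issue.)

```rust
/// Hypothetical helper: extract (z, x, y) from a slippy-tile request path
/// of the form /:prefix/:layer/:z/:x/:y(.ext).
fn parse_tile_path(path: &str) -> Option<(u8, u32, u32)> {
    let mut parts = path.trim_start_matches('/').split('/');
    let _prefix = parts.next()?;
    let _layer = parts.next()?;
    let z: u8 = parts.next()?.parse().ok()?;
    let x: u32 = parts.next()?.parse().ok()?;
    // The y segment may carry an extension like ".png"; strip it first.
    let y: u32 = parts.next()?.split('.').next()?.parse().ok()?;
    Some((z, x, y))
}

fn main() {
    // Mirrors the tile from the log lines above (bucket/layer names made up).
    assert_eq!(
        parse_tile_path("/tiles/color/20/179207/418903.png"),
        Some((20, 179207, 418903))
    );
}
```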

lnicola (Member) commented Jul 21, 2023

Yeah, it's a bit unfortunate. GDAL doesn't allow you to read from a Dataset from multiple threads at once, even though cURL could probably support it just fine.

So I think your options are to either:

  • open and close the dataset on each read, which will incur a good bit of overhead (the TLS handshake and reading the IFDs, I guess)
  • have a thread or pool of threads where each opens the file, gets a read request from a channel, does the actual read, sends the results back, then loops; this should work pretty well, but you'll be storing duplicate data in the GDAL cache

I should probably ask on the mailing list for clarification, though.
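The second option, a worker thread that owns its own dataset handle and serves reads over a channel, can be sketched with std::sync::mpsc. In this sketch, read_block is a stand-in closure for the actual GDAL raster read (in the real server each worker would call Dataset::open("/vsis3/...") once and keep the handle for its lifetime), so the example runs without GDAL:

```rust
use std::sync::mpsc;
use std::thread;

// A request sent to the worker that owns the (single-threaded) dataset.
struct ReadRequest {
    window: (usize, usize, usize, usize), // (x, y, width, height)
    reply: mpsc::Sender<Vec<u8>>,         // channel to send the pixels back on
}

// Spawn one worker; `read_block` stands in for the actual GDAL read.
fn spawn_worker(
    read_block: impl Fn((usize, usize, usize, usize)) -> Vec<u8> + Send + 'static,
) -> mpsc::Sender<ReadRequest> {
    let (tx, rx) = mpsc::channel::<ReadRequest>();
    thread::spawn(move || {
        // Loop: take a request, do the read, send the result back.
        for req in rx {
            let data = read_block(req.window);
            let _ = req.reply.send(data);
        }
    });
    tx
}

fn main() {
    // Stand-in "dataset": returns width * height zero bytes per read.
    let tx = spawn_worker(|(_, _, w, h)| vec![0u8; w * h]);
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(ReadRequest { window: (0, 0, 256, 256), reply: reply_tx })
        .unwrap();
    let tile = reply_rx.recv().unwrap();
    assert_eq!(tile.len(), 256 * 256);
}
```

A pool would be several such workers behind a shared work queue; as noted above, the trade-off is duplicated data in the GDAL block cache, one copy per open handle.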

rouault (Contributor) commented Aug 19, 2023

Starting with GDAL 3.6.0, if the GDAL_NUM_THREADS config option is set, reading a window of interest in a TIFF/COG file that intersects multiple tiles at once will use multi-threaded decompression (cf. https://github.com/OSGeo/gdal/blob/v3.6.0/NEWS.md), and in GDAL 3.7.0 this was further improved to trigger parallel network requests.

lnicola (Member) commented Aug 20, 2023

I don't think multi-threaded decoding helps in this case (a tile server), since each request will read a single block if everything is set up properly. But we can't have everything just yet :-).

metasim (Contributor) commented Aug 21, 2023

@REASY

Not sure if this could be considered canonical or even acceptable (YMMV), but we have a production tile server written with Axum + georust/gdal and have been caching datasets without problems using this (GdalPath is an internal type which basically combines a GDAL VSI path + band specifiers):

use crate::raster::GdalPath;
use crate::Error;
use gdal::Dataset;
use moka::sync::Cache;
use once_cell::sync::Lazy;
use std::ops::Deref;
use std::sync::{Arc, Mutex};
use std::time::Duration;

pub(crate) struct DatasetCache(Cache<GdalPath, Arc<Mutex<Dataset>>>);

static INSTANCE: Lazy<DatasetCache> = Lazy::new(DatasetCache::new);

impl DatasetCache {
    fn new() -> Self {
        Self(
            Cache::builder()
                .time_to_idle(Duration::from_secs(3600))
                .max_capacity(5)
                .build(),
        )
    }
    pub(crate) fn dataset_for(path: &GdalPath) -> crate::Result<Arc<Mutex<Dataset>>> {
        let ds = INSTANCE.0.try_get_with(path.clone(), || {
            let ds: Result<Dataset> = path.open(); // presumably gdal::errors::Result
            ds.map(|d| Arc::new(Mutex::new(d)))
                .map_err(|e| e.to_string())
        });
        ds.map_err(|e| Error::Unexpected(e.deref().clone()))
    }
}

ChristianBeilschmidt (Contributor) commented Jan 30, 2024

Isn't the problem that Dataset is not Send? You can add a Mutex around it so that it is Sync, but you cannot enforce Send.

There are shared datasets in GDAL, but we haven't implemented them since they cannot simply be used with all the stuff currently implemented for a dataset.

We have done the thread + channel thing that @lnicola mentioned 😆 .

EDIT: I was wrong, they are Send, but subtypes like bands aren't. So for datasets, you are good to go.

lnicola (Member) commented Jan 30, 2024

Yeah, IIRC shared datasets are actually the opposite of the "open the file multiple times" trick. Instead, you (probably) get a mutex around each access, but end up with better cache utilization.

> At the beginning of a request I open a Dataset; at the end it is implicitly closed by drop.

You can stick them in an Arc<Mutex<HashMap>> or something, of course. They don't have to disappear at the end of the scope.
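That idea can be sketched with std only. Here DatasetStandIn and open are stand-ins for gdal::Dataset and Dataset::open so the example runs without GDAL, and the bucket path is made up; an open counter shows that each path is opened exactly once:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

// Stand-in for gdal::Dataset; the real code would cache
// Arc<Mutex<gdal::Dataset>> and call Dataset::open(path).
struct DatasetStandIn {
    #[allow(dead_code)]
    path: String,
}

// Counts how many times the expensive open actually happens.
static OPEN_COUNT: AtomicUsize = AtomicUsize::new(0);

fn open(path: &str) -> DatasetStandIn {
    OPEN_COUNT.fetch_add(1, Ordering::SeqCst);
    DatasetStandIn { path: path.to_string() }
}

type Cache = Arc<Mutex<HashMap<String, Arc<Mutex<DatasetStandIn>>>>>;

// Return the cached dataset for `path`, opening it only on first use.
fn dataset_for(cache: &Cache, path: &str) -> Arc<Mutex<DatasetStandIn>> {
    let mut map = cache.lock().unwrap();
    map.entry(path.to_string())
        .or_insert_with(|| Arc::new(Mutex::new(open(path))))
        .clone()
}

fn main() {
    let cache: Cache = Arc::new(Mutex::new(HashMap::new()));
    let a = dataset_for(&cache, "/vsis3/bucket/5600_13090.tif");
    let b = dataset_for(&cache, "/vsis3/bucket/5600_13090.tif");
    assert!(Arc::ptr_eq(&a, &b)); // same cached handle both times
    assert_eq!(OPEN_COUNT.load(Ordering::SeqCst), 1); // opened once
}
```

Unlike the moka example above, this never evicts anything, so it only fits a bounded set of files; the per-dataset Mutex still serializes reads, matching GDAL's one-thread-at-a-time rule.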

rouault (Contributor) commented Jan 30, 2024

> Yeah, IIRC shared datasets are actually the opposite of the "open the file multiple times" trick. Instead, you (probably) get a mutex around each access, but end up with better cache utilization.

No, you don't. You just get the same dataset (if calling GDALOpenShared() from the same thread the initial one was opened from; otherwise you'll get a different instance).

lnicola (Member) commented Jan 30, 2024

Oh, right. Well, that's an argument for Dataset not being Send, because otherwise you could open a shared one twice and pass it to a different thread, which is bad.

ChristianBeilschmidt (Contributor) commented Feb 4, 2024

You can't call GDALOpenShared with this library at the moment. That's why we can say that Dataset: Send.

There would need to be a second type of dataset, e.g. SharedDataset, which would call GDALOpenShared under the hood but then not be Send.
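That split can be sketched with marker types. DatasetStub and SharedDatasetStub below are hypothetical stand-ins invented for this sketch, not types from georust/gdal; a PhantomData<*mut ()> field is the idiomatic way to opt a type out of Send (and Sync) without unsafe code:

```rust
use std::marker::PhantomData;

// Stand-in for the regular Dataset: safe to move to another thread.
struct DatasetStub {
    path: String,
}

// Stand-in for a hypothetical SharedDataset (GDALOpenShared under the
// hood): the raw-pointer PhantomData makes the type !Send, so it cannot
// leave the thread it was opened on.
#[allow(dead_code)]
struct SharedDatasetStub {
    path: String,
    _not_send: PhantomData<*mut ()>,
}

// Compile-time check that a type is Send.
fn assert_send<T: Send>() {}

fn main() {
    assert_send::<DatasetStub>(); // compiles: DatasetStub is Send
    // assert_send::<SharedDatasetStub>(); // would be a compile error
    let d = DatasetStub { path: "a.tif".into() };
    assert_eq!(d.path, "a.tif");
}
```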

lnicola (Member) commented Feb 4, 2024

You're right, there's even a note in the docs:

> Note that the GDAL_OF_SHARED option is removed from the set of allowed options because it subverts the Send implementation that allows passing the dataset to another thread. See #154.
