Using direct I/O on Linux #808

Open · adamreichold opened this issue May 12, 2024 · 4 comments

adamreichold (Contributor) commented May 12, 2024

Understanding that redb manages its own user-space page cache, I used the small diff below to run the lmdb_benchmark with direct I/O.

Changes to make the lmdb_benchmark use direct I/O:

```diff
diff --git a/Cargo.toml b/Cargo.toml
index 649c21b..0819edd 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -20,6 +20,7 @@ pyo3-build-config = { version = "0.20.0", optional = true }
 [dependencies]
 log = {version = "0.4.17", optional = true }
 pyo3 = {version = "0.20.0", features=["extension-module", "abi3-py37"], optional = true }
+rustix = { version = "0.38.34", features = ["fs"] }
 
 [target.'cfg(unix)'.dependencies]
 libc = "0.2.104"
diff --git a/benches/lmdb_benchmark.rs b/benches/lmdb_benchmark.rs
index 53b79b1..1cf2f88 100644
--- a/benches/lmdb_benchmark.rs
+++ b/benches/lmdb_benchmark.rs
@@ -297,7 +297,25 @@ fn main() {
         benchmark(table)
     };
 
-    let lmdb_results = {
+    let redb_directio_results = {
+        let tmpfile: NamedTempFile = NamedTempFile::new_in(&tmpdir).unwrap();
+        use std::fs::OpenOptions;
+        use std::os::unix::fs::OpenOptionsExt;
+        let file = OpenOptions::new()
+            .read(true)
+            .write(true)
+            .custom_flags(libc::O_DIRECT)
+            .open(tmpfile.path())
+            .unwrap();
+        let db = redb::Database::builder()
+            .set_cache_size(4 * 1024 * 1024 * 1024)
+            .create_file(file)
+            .unwrap();
+        let table = RedbBenchDatabase::new(&db);
+        benchmark(table)
+    };
+
+    /*let lmdb_results = {
         let tmpfile: TempDir = tempfile::tempdir_in(&tmpdir).unwrap();
         let env = lmdb::Environment::new().open(tmpfile.path()).unwrap();
         env.set_map_size(4096 * 1024 * 1024).unwrap();
@@ -325,7 +343,7 @@ fn main() {
         let db = sanakirja::Env::new(tmpfile.path(), 4096 * 1024 * 1024, 2).unwrap();
         let table = SanakirjaBenchDatabase::new(&db);
         benchmark(table)
-    };
+    };*/
 
     fs::remove_dir_all(&tmpdir).unwrap();
 
@@ -337,10 +355,11 @@ fn main() {
 
     for results in [
         redb_latency_results,
-        lmdb_results,
+        redb_directio_results,
+        /*lmdb_results,
         rocksdb_results,
         sled_results,
-        sanakirja_results,
+        sanakirja_results,*/
     ] {
         for (i, (_benchmark, duration)) in results.iter().enumerate() {
             rows[i].push(format!("{}ms", duration.as_millis()));
diff --git a/src/tree_store/page_store/file_backend/unix.rs b/src/tree_store/page_store/file_backend/unix.rs
index db24c48..b49cfb6 100644
--- a/src/tree_store/page_store/file_backend/unix.rs
+++ b/src/tree_store/page_store/file_backend/unix.rs
@@ -19,6 +19,8 @@ use std::os::wasi::{fs::FileExt, io::AsRawFd};
 #[derive(Debug)]
 pub struct FileBackend {
     file: File,
+    mem_align: usize,
+    offset_align: usize,
 }
 
 impl FileBackend {
@@ -33,6 +35,15 @@ impl FileBackend {
     /// Creates a new backend which stores data to the given file.
     #[cfg(unix)] // remove this line when wasi-libc gets flock
     pub fn new(file: File) -> Result<Self, DatabaseError> {
+        let stat = rustix::fs::statx(
+            &file,
+            "",
+            rustix::fs::AtFlags::EMPTY_PATH,
+            rustix::fs::StatxFlags::DIOALIGN,
+        )
+        .unwrap();
+        let mem_align = stat.stx_dio_mem_align as usize;
+        let offset_align = stat.stx_dio_offset_align as usize;
         let fd = file.as_raw_fd();
         let result = unsafe { libc::flock(fd, libc::LOCK_EX | libc::LOCK_NB) };
         if result != 0 {
@@ -43,7 +54,11 @@ impl FileBackend {
                 Err(err.into())
             }
         } else {
-            Ok(Self { file })
+            Ok(Self {
+                file,
+                mem_align,
+                offset_align,
+            })
         }
     }
 }
@@ -54,7 +69,10 @@ impl StorageBackend for FileBackend {
     }
 
     fn read(&self, offset: u64, len: usize) -> Result<Vec<u8>, io::Error> {
+        assert!(offset % self.offset_align as u64 == 0);
+        assert!(len % self.offset_align == 0);
         let mut buffer = vec![0; len];
+        assert!(buffer.as_ptr() as usize % self.mem_align == 0);
         self.file.read_exact_at(&mut buffer, offset)?;
         Ok(buffer)
     }
@@ -83,6 +101,9 @@ impl StorageBackend for FileBackend {
     }
 
     fn write(&self, offset: u64, data: &[u8]) -> Result<(), io::Error> {
+        assert!(offset % self.offset_align as u64 == 0);
+        assert!(data.len() % self.offset_align == 0);
+        assert!(data.as_ptr() as usize % self.mem_align == 0);
         self.file.write_all_at(data, offset)
     }
 }
diff --git a/src/tree_store/page_store/header.rs b/src/tree_store/page_store/header.rs
index cd4be9d..4809a6a 100644
--- a/src/tree_store/page_store/header.rs
+++ b/src/tree_store/page_store/header.rs
@@ -53,7 +53,7 @@ const REGION_TRACKER_PAGE_NUMBER_OFFSET: usize =
 const TRANSACTION_SIZE: usize = 128;
 const TRANSACTION_0_OFFSET: usize = 64;
 const TRANSACTION_1_OFFSET: usize = TRANSACTION_0_OFFSET + TRANSACTION_SIZE;
-pub(super) const DB_HEADER_SIZE: usize = TRANSACTION_1_OFFSET + TRANSACTION_SIZE;
+pub(super) const DB_HEADER_SIZE: usize = 512;
 
 // God byte flags
 const PRIMARY_BIT: u8 = 1;
```

This appears to work, but it significantly regresses performance, especially for the "batch writes" scenario, which became more than three times slower (7s versus 24s) on my system.

While this is obviously not a supported use case, and hence this is meant as a question rather than a bug report, the result did surprise me. Maybe someone else has an idea why bypassing the kernel's page cache actually slows redb down?
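
As an aside on the patch itself: `vec![0; len]` only guarantees the alignment of `u8`, so the `mem_align` assertion in `read` holds only because the allocator happens to return sufficiently aligned memory for these allocation sizes. A minimal sketch of an explicitly aligned buffer (the `AlignedBuf` helper is hypothetical, not part of redb or the patch):

```rust
// Sketch only: a zeroed buffer aligned to the stx_dio_mem_align value
// reported by statx(2), suitable for O_DIRECT reads.
use std::alloc::{alloc_zeroed, dealloc, Layout};

struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    fn new(len: usize, mem_align: usize) -> Self {
        assert!(len > 0, "zero-sized allocations are not allowed here");
        // mem_align is a power of two per the statx(2) contract.
        let layout = Layout::from_size_align(len, mem_align).expect("bad layout");
        // SAFETY: layout has non-zero size.
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        Self { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: ptr is valid for layout.size() bytes and uniquely owned.
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: ptr was allocated with exactly this layout.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}
```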

adamreichold (Contributor, Author) commented May 12, 2024

So with the changes from #809, direct I/O still ends up being slightly slower for me:

| benchmark | buffered | direct |
|---|---|---|
| bulk load | 4369ms | 4639ms |
| individual writes | 385ms | 436ms |
| batch writes | 6668ms | 6989ms |
| len() | 0ms | 0ms |
| random reads | 1065ms | 1071ms |
| random reads | 1062ms | 1067ms |
| random range reads | 2759ms | 2765ms |
| random range reads | 2750ms | 2770ms |
| random reads (4 threads) | 287ms | 287ms |
| random reads (8 threads) | 169ms | 169ms |
| random reads (16 threads) | 148ms | 149ms |
| random reads (32 threads) | 127ms | 127ms |
| removals | 3401ms | 4981ms |

except for removals, which still takes almost 50% longer (3401ms buffered versus 4981ms direct).

Any ideas where these remaining effects are coming from?

cberner (Owner) commented May 12, 2024

Hmm, I haven't played with direct I/O a whole lot, but here are a few hypotheses:

  1. Write-back buffering and coalescing. Without direct I/O, all the write()s, which are mostly small, go into the page cache and are flushed to disk before/at the call to fsync(), so they can be coalesced into larger writes. With O_DIRECT, I believe every call to write() goes straight to disk, so it is likely issuing more write operations to the device (see the sketch after this list).
  2. It might have something to do with the user-space cache invalidation. I invalidate entries on any free, so in a delete-heavy workload like removals it will free a page, evict it from the cache, and then immediately recreate it in the cache.
  3. Maybe it has something to do with readahead? But since the read benchmarks aren't affected, I think this is unlikely.
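
To make hypothesis 1 concrete: a user-space backend could recover some of the lost coalescing by merging adjacent pending writes before issuing them. A minimal sketch (the `coalesce` helper is hypothetical, not redb code):

```rust
// Sketch only: merge runs of contiguous (offset, data) writes into single
// larger writes, mimicking part of what the kernel page cache does for
// buffered I/O before fsync().
fn coalesce(mut writes: Vec<(u64, Vec<u8>)>) -> Vec<(u64, Vec<u8>)> {
    writes.sort_by_key(|&(offset, _)| offset);
    let mut merged: Vec<(u64, Vec<u8>)> = Vec::new();
    for (offset, data) in writes {
        match merged.last_mut() {
            // Extend the previous run if this write starts exactly where it ends.
            Some((last_offset, last_data))
                if *last_offset + last_data.len() as u64 == offset =>
            {
                last_data.extend_from_slice(&data);
            }
            _ => merged.push((offset, data)),
        }
    }
    merged
}
```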

adamreichold (Contributor, Author) commented May 13, 2024

> With O_DIRECT, I believe every call to write() goes straight to disk, so it's likely doing more write operations to disk

Not necessarily, or rather only if O_DSYNC is passed, which makes things significantly slower again.

But I agree that the opportunities for coalescing might be reduced to whatever is queued in the I/O layer. I guess direct I/O would only really make sense combined with asynchronous I/O, so that the program can issue all of its I/O "immediately" and the block layer can do almost as much coalescing as the page cache could have done before the fsync hits.
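
For illustration, one way to get many O_DIRECT writes in flight at once on Linux is to batch them through io_uring. A rough sketch using the io-uring crate (not something redb does today; the exact builder API may differ between crate versions, and the buffers must already satisfy the O_DIRECT alignment requirements):

```rust
// Sketch only: submit a batch of writes in one io_uring syscall so the block
// layer sees them all at once and can merge adjacent requests.
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, types, IoUring};

fn write_batch(file: &std::fs::File, writes: &[(u64, Vec<u8>)]) -> std::io::Result<()> {
    let mut ring = IoUring::new(writes.len() as u32)?;
    let fd = types::Fd(file.as_raw_fd());
    for (i, (offset, data)) in writes.iter().enumerate() {
        let sqe = opcode::Write::new(fd, data.as_ptr(), data.len() as u32)
            .offset(*offset)
            .build()
            .user_data(i as u64);
        // SAFETY: the buffers outlive submission and completion below.
        unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    }
    // One syscall submits everything; then wait for all completions.
    ring.submit_and_wait(writes.len())?;
    for cqe in ring.completion() {
        assert!(cqe.result() >= 0, "write failed: {}", -cqe.result());
    }
    Ok(())
}
```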

> it might have something to do with the user space cache invalidation. I invalidate entries on any free, so in a delete heavy workload like removals it'll free a page, evict it from the cache, and then immediately recreate it in the cache.

So this sounds like something that could be optimized? For a freed page, its contents should not matter, should they? So it could just be declared clean and put into the read cache, maybe after zeroing out the in-memory representation? We would just need to be sure that when the supposedly clean page is evicted and later recreated from disk, that leads to the same result.
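
A minimal sketch of that idea (all names hypothetical; this is not redb's actual cache): on free, instead of evicting the entry, zero it and mark it clean, so an immediately following re-allocation of the same page hits in memory:

```rust
// Sketch only: a toy page cache where freeing a page keeps a zeroed, clean
// copy instead of evicting it. "Clean" promises that the cached bytes match
// what a fresh read from disk would return, so this is only sound if the
// contents of freed pages are never observed (or are zeroed on disk too).
use std::collections::HashMap;

struct CachedPage {
    data: Vec<u8>,
    dirty: bool,
}

struct PageCache {
    pages: HashMap<u64, CachedPage>,
}

impl PageCache {
    fn free(&mut self, page_number: u64) {
        if let Some(page) = self.pages.get_mut(&page_number) {
            // The page's contents no longer matter, so zero it and declare it
            // clean rather than paying an eviction plus a re-read.
            page.data.fill(0);
            page.dirty = false;
        }
    }
}
```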

I think this is also the main benefit of running with direct I/O: it shines a light on hidden inefficiencies which the kernel's page cache currently smooths over. But on a fully loaded production system without cache memory to spare, not depending on the page cache's help is presumably rather helpful.

cberner (Owner) commented May 15, 2024

> So this sounds like something that could be optimized?

It's less trivial than it seems: because pages can be allocated at different orders, the size of the cached page might change. If you want to hack it into the benchmark to see whether it has an effect, though, it's probably safe to just assume the page size doesn't change, since I think the benchmark only writes small values.
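
Extending the toy sketch from above with that caveat in mind (still hypothetical): a freed entry kept in the cache is only reusable if the page is re-allocated at the same order, i.e. with the same size:

```rust
// Sketch only, continuing the toy PageCache above: drop the stale entry
// when the re-allocation has a different order (and hence a different size).
impl PageCache {
    fn allocate(&mut self, page_number: u64, order: u8) -> &mut CachedPage {
        let expected_len = 4096usize << order; // assumed base page size
        if self
            .pages
            .get(&page_number)
            .map_or(false, |p| p.data.len() != expected_len)
        {
            // Size changed: the zeroed entry kept by `free` cannot be reused.
            self.pages.remove(&page_number);
        }
        self.pages.entry(page_number).or_insert_with(|| CachedPage {
            data: vec![0; expected_len],
            dirty: false,
        })
    }
}
```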
