
Pool::close does not always wait for all connections to close #3217

Open
madadam opened this issue Apr 29, 2024 · 1 comment

madadam commented Apr 29, 2024

Bug Description

Despite what the documentation says, calling Pool::close does not always wait for all the connections to close before returning. This is caused by a bug / race condition in PoolInner::close.

This issue is hard to detect because it has almost no observable effects. Probably the only db engine where it is detectable is sqlite. We use sqlite in WAL mode, which means there are two additional files next to the main database file - one whose name ends in -wal and the other in -shm. According to the sqlite documentation, those should be deleted when the last connection closes (unless the last connection is read-only). We use a setup with a pool of read-only connections and a separate, single read-write connection. We first close the read-only pool using Pool::close and only when it returns do we close the read-write connection. The expectation is that the read-write connection should be the last connection at that point, so it should perform the WAL checkpoint and delete the two auxiliary files. We noticed this wasn't always the case. Eventually we traced the issue to a bug (actually, two bugs) in PoolInner::close which in some cases didn't wait for all the connections to close, so there were still some read-only connections open by the time we closed the read-write one. This prevented it from running the WAL checkpoint and deleting the auxiliary files.
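
For context, the setup looks roughly like this (a minimal sketch rather than our actual code; the file name, pool size and use of tokio's main macro are just for illustration):

    use sqlx::sqlite::{SqliteConnectOptions, SqliteJournalMode, SqlitePoolOptions};
    use sqlx::{ConnectOptions, Connection};
    use std::str::FromStr;

    #[tokio::main]
    async fn main() -> Result<(), sqlx::Error> {
        // WAL journal mode on a shared database file.
        let options = SqliteConnectOptions::from_str("sqlite://app.db")?
            .journal_mode(SqliteJournalMode::Wal)
            .create_if_missing(true);

        // Single read-write connection.
        let writer = options.clone().connect().await?;

        // Pool of read-only connections.
        let readers = SqlitePoolOptions::new()
            .max_connections(4)
            .connect_with(options.read_only(true))
            .await?;

        // ... use `writer` and `readers` ...

        // Close the read-only pool first. Per the docs this should only
        // return once every pooled connection has actually closed.
        readers.close().await;

        // Closing the read-write connection last should make it the last
        // open connection, so it performs the WAL checkpoint and deletes
        // the -wal and -shm files.
        writer.close().await?;

        Ok(())
    }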

The first bug is a race condition: idle_conns is not always empty by the time the loop finishes. This means any connections still in idle_conns are not closed until the pool itself is dropped.

The second bug is that when idle_conns is drained, the idle connections are temporarily treated as live, which means a semaphore permit is created for them out of thin air. When such a permit is released, the semaphore ends up with more available permits than it had initially, which, on the last iteration of the loop, causes this acquire to complete prematurely.
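
sqlx's pool uses its own semaphore type internally, but the general failure mode can be illustrated with a plain tokio::sync::Semaphore (a standalone sketch, not sqlx code):

    use std::sync::Arc;
    use tokio::sync::Semaphore;

    #[tokio::main]
    async fn main() {
        // Pool-style semaphore: one permit per connection slot.
        let semaphore = Arc::new(Semaphore::new(2));

        // A checked-out connection holds one permit until it is returned.
        let checked_out = semaphore.clone().acquire_owned().await.unwrap();

        // Analogue of the bug: a permit is minted "out of thin air" for a
        // drained idle connection and later released back, inflating the
        // number of available permits.
        semaphore.add_permits(1);

        // close() waits for all permits, expecting that to mean every
        // connection has been returned. The extra permit lets it complete
        // while `checked_out` is still live.
        let _all = semaphore.acquire_many(2).await.unwrap();
        println!("close() would return here, with a connection still open");

        drop(checked_out);
    }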

Info

  • SQLx version: 0.7.4
  • SQLx features enabled: runtime-tokio, sqlite
  • Database server and version: sqlite 3.45.3
  • Operating system: linux
  • rustc --version: rustc 1.77.1 (7cf61ebde 2024-03-27)
madadam added the bug label Apr 29, 2024

madadam commented Apr 29, 2024

This is my attempt at a fix which seems to work for us:

    pub(super) fn close<'a>(self: &'a Arc<Self>) -> impl Future<Output = ()> + 'a {
        self.mark_closed();

        async move {
            // Acquire all permits up front, so this only proceeds once every
            // checked-out connection has been returned or closed.
            let _permits = self.semaphore.acquire(self.options.max_connections).await;

            // Then drain and close the remaining idle connections.
            while let Some(idle) = self.idle_conns.pop() {
                let _ = idle.live.raw.close().await;
            }

            self.num_idle.store(0, Ordering::Release);
            self.size.store(0, Ordering::Release);
        }
    }
