
Pool::close does not always wait for all connections to close #3217

Open
madadam opened this issue Apr 29, 2024 · 1 comment

madadam commented Apr 29, 2024

Bug Description

Despite what the documentation says, calling Pool::close does not always wait for all the connections to close before returning. This is caused by a bug / race condition in PoolInner::close.

This issue is hard to detect because it has almost no observable effects. Probably the only db engine where it is detectable is sqlite. We use sqlite in WAL mode, which means there are two additional files next to the main database file - one whose name ends in -wal and the other in -shm. According to the sqlite documentation, those should be deleted when the last connection closes (unless the last connection is read-only). We use a setup with a pool of read-only connections and a separate, single read-write connection. We first close the read-only pool using Pool::close and only when it returns do we close the read-write connection. The expectation is that the read-write connection should be the last connection at that point, so it should perform the WAL checkpoint and delete the two auxiliary files. We noticed this wasn't always the case. Eventually we traced the issue to a bug (actually, two bugs) in PoolInner::close which in some cases didn't wait for all the connections to close, so there were still some read-only connections open by the time we closed the read-write one. This prevented it from running the WAL checkpoint and deleting the auxiliary files.
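
For context, the setup looks roughly like this (a minimal sketch rather than our actual code; the file name, pool size and use of tokio's main macro are just for illustration):

    use sqlx::sqlite::{SqliteConnectOptions, SqliteJournalMode, SqlitePoolOptions};
    use sqlx::{ConnectOptions, Connection};
    use std::str::FromStr;

    #[tokio::main]
    async fn main() -> Result<(), sqlx::Error> {
        // WAL journal mode on a shared database file.
        let options = SqliteConnectOptions::from_str("sqlite://app.db")?
            .journal_mode(SqliteJournalMode::Wal)
            .create_if_missing(true);

        // Single read-write connection.
        let writer = options.clone().connect().await?;

        // Pool of read-only connections.
        let readers = SqlitePoolOptions::new()
            .max_connections(4)
            .connect_with(options.read_only(true))
            .await?;

        // ... use `writer` and `readers` ...

        // Close the read-only pool first. Per the docs this should only
        // return once every pooled connection has actually closed.
        readers.close().await;

        // Closing the read-write connection last should make it the last
        // open connection, so it performs the WAL checkpoint and deletes
        // the -wal and -shm files.
        writer.close().await?;

        Ok(())
    }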

The first bug is a race condition: idle_conns is not always empty by the time the loop finishes. This means any connections still in idle_conns are not closed until the pool itself is dropped.

The second bug is that when idle_conns is drained, the idle connections are temporarily treated as live, which means a semaphore permit is created for them out of thin air. When such a permit is released, the semaphore ends up with more available permits than it had initially, which, on the last iteration of the loop, causes this acquire to complete prematurely.
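
sqlx's pool uses its own semaphore type internally, but the general failure mode can be illustrated with a plain tokio::sync::Semaphore (a standalone sketch, not sqlx code):

    use std::sync::Arc;
    use tokio::sync::Semaphore;

    #[tokio::main]
    async fn main() {
        // Pool-style semaphore: one permit per connection slot.
        let semaphore = Arc::new(Semaphore::new(2));

        // A checked-out connection holds one permit until it is returned.
        let checked_out = semaphore.clone().acquire_owned().await.unwrap();

        // Analogue of the bug: a permit is minted "out of thin air" for a
        // drained idle connection and later released back, inflating the
        // number of available permits.
        semaphore.add_permits(1);

        // close() waits for all permits, expecting that to mean every
        // connection has been returned. The extra permit lets it complete
        // while `checked_out` is still live.
        let _all = semaphore.acquire_many(2).await.unwrap();
        println!("close() would return here, with a connection still open");

        drop(checked_out);
    }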

Info

  • SQLx version: 0.7.4
  • SQLx features enabled: runtime-tokio, sqlite
  • Database server and version: sqlite 3.45.3
  • Operating system: linux
  • rustc --version: rustc 1.77.1 (7cf61ebde 2024-03-27)
madadam added the bug label Apr 29, 2024

madadam commented Apr 29, 2024

This is my attempt at a fix which seems to work for us:

    pub(super) fn close<'a>(self: &'a Arc<Self>) -> impl Future<Output = ()> + 'a {
        self.mark_closed();

        async move {
            // Acquire all permits up front, so this only proceeds once every
            // checked-out connection has been returned or closed.
            let _permits = self.semaphore.acquire(self.options.max_connections).await;

            // Then drain and close the remaining idle connections.
            while let Some(idle) = self.idle_conns.pop() {
                let _ = idle.live.raw.close().await;
            }

            self.num_idle.store(0, Ordering::Release);
            self.size.store(0, Ordering::Release);
        }
    }
