
ActiveRecord::ConnectionTimeoutError "all pooled connections were in use" after enabling concurrency #469

Open
MattFenelon opened this issue Mar 19, 2024 · 8 comments · Fixed by #470 · May be fixed by #472
@MattFenelon
Contributor

I recently enabled concurrency on a system and it resulted in a lot of ActiveRecord::ConnectionTimeoutError errors.

I believe it's because ActiveRecord has a 1 thread = 1 connection policy and the database connection pool for this system is limited to the number of puma threads. As soon as the system receives enough requests to use up the connection pool, the concurrently side-loaded resource threads are unable to get a connection from the pool, and vice versa.

But I'm surprised no one has come across this issue before so perhaps I've misunderstood the problem.

One solution might be to increase the system's database connection pool size. This would give Graphiti's threads enough connections to avoid exhausting the connection pool. The difficulty would be in sizing the db pool. Given each request can sideload x resources, the total number of database connections required at any given time would be x * the number of concurrent requests.

Ideally though, I think it would make sense to allow the thread pool size to be configured. Concurrent::Promise, used to side-load resources, defaults to a thread pool whose maximum size is Java's int max, i.e. effectively unbounded.
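As a rough illustration of what a configurable pool could look like, here is a hedged sketch using concurrent-ruby (the constant name, pool size, and wiring are made up for illustration, not an existing Graphiti API at this point):

    require "concurrent"

    # Hypothetical bounded executor for sideload resolution. By contrast,
    # Concurrent::Promise's default executor is effectively unbounded.
    SIDELOAD_POOL = Concurrent::ThreadPoolExecutor.new(
      min_threads: 0,
      max_threads: 4, # the value that would become configurable
      max_queue: 0    # 0 = unbounded queue; extra sideloads wait for a free thread
    )

    # Sideload work would then be scheduled onto the bounded pool:
    promise = Concurrent::Promise.execute(executor: SIDELOAD_POOL) do
      # resolve the side-loaded resource here
      :resolved
    end
    promise.wait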

What do you think?

vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:128:in `block in wait_poll': could not obtain a connection from the pool within 5.000 seconds (waited 5.005 seconds); all pooled connections were in use (ActiveRecord::ConnectionTimeoutError)
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:117:in `loop'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:117:in `wait_poll'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:78:in `internal_poll'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:204:in `internal_poll'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:73:in `block in poll'
    from vendor/ruby-3.2.2/lib/ruby/3.2.0/monitor.rb:202:in `synchronize'
    from vendor/ruby-3.2.2/lib/ruby/3.2.0/monitor.rb:202:in `mon_synchronize'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:82:in `synchronize'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:73:in `poll'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool.rb:663:in `acquire_connection'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool.rb:353:in `checkout'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool.rb:182:in `connection'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_handler.rb:246:in `retrieve_connection'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_handling.rb:287:in `retrieve_connection'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_handling.rb:254:in `connection'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/querying.rb:62:in `_query_by_sql'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:998:in `block in exec_main_query'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:1018:in `skip_query_cache_if_necessary'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:984:in `exec_main_query'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:962:in `block in exec_queries'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:1018:in `skip_query_cache_if_necessary'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:956:in `exec_queries'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:742:in `load'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:264:in `records'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:259:in `to_ary'
    from lib/graphiti_adapters/an_adapter.rb:19:in `resolve'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/resource.rb:117:in `resolve'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/scope.rb:22:in `block in resolve'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/scope.rb:81:in `block in broadcast_data'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti.rb:65:in `block in broadcast'
    from vendor/bundle/ruby/3.2.0/gems/activesupport-7.1.3/lib/active_support/notifications.rb:206:in `block in instrument'
    from vendor/bundle/ruby/3.2.0/gems/activesupport-7.1.3/lib/active_support/notifications/instrumenter.rb:58:in `instrument'
    from vendor/bundle/ruby/3.2.0/gems/activesupport-7.1.3/lib/active_support/notifications.rb:206:in `instrument'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti.rb:64:in `broadcast'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/scope.rb:80:in `broadcast_data'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/scope.rb:20:in `resolve'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/resource_proxy.rb:68:in `data'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/sideload.rb:231:in `load'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/sideload.rb:299:in `resolve'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/scope.rb:47:in `block (2 levels) in resolve_sideloads'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/safe_task_executor.rb:24:in `block in execute'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:48:in `block in synchronize'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:48:in `synchronize'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:48:in `synchronize'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/safe_task_executor.rb:22:in `execute'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/promise.rb:564:in `block in realize'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:359:in `run_task'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:350:in `block (3 levels) in create_worker'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:341:in `loop'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:341:in `block (2 levels) in create_worker'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:340:in `catch'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:340:in `block in create_worker'

🎉 This issue has been resolved in version 1.6.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@jkeen
Collaborator

jkeen commented Mar 26, 2024

@MattFenelon moving our convo from #470 into here.

In short, I'm still having a ton of trouble with web server lockups after this change and it's hard to figure out what exactly is going on… but it seems like without this change everything is a-ok. My DB pool values and the sideload thread max values seem reasonable, and yet sometimes things will just hang, causing the webserver to not respond to requests.

I think we need to roll this thing back and investigate further on an unreleased branch. The part that worries me most is that there's no error message or any sort of clue that this is the issue, meaning there could be people out there with apps breaking without a clear-cut cause. Also, for me, getting things running properly again has required a server restart, which is a real bummer.

I'm going to put out a new patch release without these changes, and then maybe we can work together to try and resolve what's going on by working off a branch?

jkeen added a commit that referenced this issue Mar 26, 2024
…n what's causing thread pool hangs. refs #469

revert: "thread pool scope and mutex need to be global across all instances of Scope for it to be a global thread pool (#471)"
revert: "add thread pool and concurrency_max_threads configuration option (#470)"

This reverts commit 51fb51c.
This reverts commit 697d761.
github-actions bot pushed a commit that referenced this issue Mar 26, 2024
## [1.6.3](v1.6.2...v1.6.3) (2024-03-26)

### Bug Fixes

* Remove thread pool executor logic until we get a better handle on what's causing thread pool hangs. refs [#469](#469) ([7941b6f](7941b6f)), closes [#471](#471) [#470](#470)
@MattFenelon
Contributor Author

Sorry to hear that, @jkeen. I see you've reverted; that makes sense until we can clear this up.

In terms of diagnosing:

  1. What are your puma threads and workers set to?
  2. Did you use the default concurrency_max_threads setting?
  3. What's pool set to in database.yml?
  4. Does your database have a connection limit? If so, what is it?
  5. How many dynos/containers/vms do you have running?

MattFenelon added a commit to MattFenelon/graphiti that referenced this issue Mar 27, 2024
…handle on what's causing thread pool hangs. refs graphiti-api#469"

This reverts commit 7941b6f.
MattFenelon added a commit to MattFenelon/graphiti that referenced this issue Mar 27, 2024
This option allows limiting the maximum number of resources that can be
sideloaded concurrently. With a properly configured connection pool,
this ensures that ActiveRecord's connection pool is not exhausted by
the sideloading process.

The thread pool configuration is based on ActiveRecord's
global_thread_pool_async_query_executor.

This was previously attempted but there were reports of deadlocks. This
code differs from the original by using Concurrent::Delay assigned to a
constant instead of a regular Mutex. The Delay+constant is how
concurrent-ruby sets up its global thread pools, so it's more likely to
be correct.

Closes graphiti-api#469

See: https://github.com/rails/rails/blob/94faed4dd4c915829afc2698e055a0265d9b5773/activerecord/lib/active_record.rb#L279-L287
See: graphiti-api#470
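For reference, a hedged sketch of the Delay-plus-constant pattern that commit message describes, modeled loosely on Rails' global_thread_pool_async_query_executor (the module layout, accessor name, and env-var default below are illustrative, not the actual patch):

    require "concurrent"

    module Graphiti
      # Concurrent::Delay builds its value lazily, exactly once, and is itself
      # thread-safe, so no separate Mutex is needed around pool creation.
      THREAD_POOL = Concurrent::Delay.new do
        Concurrent::ThreadPoolExecutor.new(
          min_threads: 0,
          max_threads: Integer(ENV.fetch("GRAPHITI_CONCURRENCY_MAX_THREADS", 4)),
          max_queue: 0,
          fallback_policy: :caller_runs
        )
      end

      def self.thread_pool
        THREAD_POOL.value # first caller builds the pool; later callers reuse it
      end
    end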
@MattFenelon
Contributor Author

The fact that it's locking up completely makes me think it's a deadlock related to the use of the Mutex. I've gone with a different approach in #472 but I haven't been able to test those changes yet. Do you want to give it a try? I can test them in about a week or so.

jkeen added a commit to jkeen/graphiti that referenced this issue Mar 27, 2024
…n what's causing thread pool hangs. refs graphiti-api#469

revert: "thread pool scope and mutex need to be global across all instances of Scope for it to be a global thread pool (graphiti-api#471)"
revert: "add thread pool and concurrency_max_threads configuration option (graphiti-api#470)"

This reverts commit 51fb51c.
This reverts commit 697d761.
jkeen pushed a commit to jkeen/graphiti that referenced this issue Mar 27, 2024
## [1.6.3](graphiti-api/graphiti@v1.6.2...v1.6.3) (2024-03-26)

### Bug Fixes

* Remove thread pool executor logic until we get a better handle on what's causing thread pool hangs. refs [graphiti-api#469](graphiti-api#469) ([7941b6f](graphiti-api@7941b6f)), closes [graphiti-api#471](graphiti-api#471) [graphiti-api#470](graphiti-api#470)
@jkeen jkeen added bug and removed released labels Mar 27, 2024
@jkeen
Collaborator

jkeen commented Mar 27, 2024

For the staging site that kept running into issues yesterday:

DB_POOL = 30
RAILS_MAX_THREADS = 10
GRAPHITI_CONCURRENCY_MAX_THREADS = 3
WEB_CONCURRENCY = 2

It's running on a VPS, and when inspecting the process list it didn't seem like any processes had spiked CPU, or anything else that might usually jump out at me as a problem. I wasn't even sure what I should be looking for to spot hung threads from the command line. Any pointers there?

I can give #472 a try this week sometime. Appreciate the follow up

@MattFenelon
Contributor Author

I think I've found the issue. It only seems to happen when side-loading children of child resources. Does that tally up with the side loading you're doing, e.g. include=resource1.resource2(.resource3)+? I don't think this happens when the includes are only 1 level deep, e.g. include=resource1,resource2,(resource3)+.

For the sake of an example, let's imagine a thread pool of 1 and an unbounded queue (a runnable sketch follows the list):

  1. The parent resource creates Promise 1 to load child-resource :authors
  2. Waits for Promise 1 to finish
  3. Thread 1 is leased to run Promise 1. There are no more threads available in the pool.
  4. Thread 1 - Promise 1 - the authors resource creates Promise 2 to load child-resource :books
  5. Thread 1 - Promise 1 - waits for Promise 2 to finish
  6. Promise 2 can't run because the thread pool is exhausted. Thread 1 can't be released because Promise 1 is waiting on Promise 2. This is a deadlock.
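The scenario above can be reproduced with concurrent-ruby directly. A minimal, hedged sketch (the resource names are illustrative, and it waits with value! rather than graphiti's sleep):

    require "concurrent"

    pool = Concurrent::ThreadPoolExecutor.new(min_threads: 0, max_threads: 1, max_queue: 0)

    # Promise 1: the :authors sideload, which itself sideloads :books.
    authors = Concurrent::Promise.execute(executor: pool) do
      books = Concurrent::Promise.execute(executor: pool) { "books" }
      # Promise 1 blocks on Promise 2, but the pool's only thread is busy
      # running Promise 1, so Promise 2 never leaves the queue.
      books.value!
      "authors"
    end

    authors.value! # hangs forever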

Though it's technically a deadlock, the graphiti code uses sleep to wait for the promises, so it manifests itself as a hang. In my PR I replaced the sleep with Concurrent::Promise.zip(*promises).value!, which connects the threads together, allowing Ruby to detect the deadlock and raise a fatal exception ("No live threads left. Deadlock?").

I think it can be fixed by making the child resource sideload logic non-blocking. That would remove the need for the child resources to wait, which would then free up the threads to be used for other promises in the queue. But it probably needs some kind of queue to allow the parent resource to know when all child resources are loaded (sketched after the walkthrough below).

  1. The parent resource creates Promise 1 to load child-resource :authors
  2. Waits for all promises in the queue to finish
  3. Thread 1 is leased to run Promise 1. There are no more threads available in the pool.
  4. Thread 1 - Promise 1 - the authors resource creates Promise 2 to load child-resource :books
  5. Thread 1 - Promise 1 finishes. Thread 1 is added back to the thread pool.
  6. Thread 1 is leased to run Promise 2.
  7. Thread 1 - Promise 2 - the books resource is loaded. There are no other resources to load.
  8. Thread 1 - Promise 2 finishes.
  9. The parent resource stops waiting as all promises in the queue have finished
  10. The parent resource returns. All resources have been loaded at this point.
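A hedged sketch of that queue idea, just to make the shape concrete (the method and constant names are made up, not the eventual implementation):

    require "concurrent"

    POOL  = Concurrent::ThreadPoolExecutor.new(min_threads: 0, max_threads: 1, max_queue: 0)
    QUEUE = Concurrent::Array.new # every nesting level pushes its promise here

    def enqueue_sideload(&block)
      promise = Concurrent::Promise.execute(executor: POOL, &block)
      QUEUE << promise
      promise
    end

    # Promise 1 (:authors) schedules Promise 2 (:books) but does not wait on it,
    # so the single worker thread is released as soon as :authors itself resolves.
    enqueue_sideload do
      enqueue_sideload { "books" }
      "authors"
    end

    # The parent waits on the whole queue. Children push themselves onto QUEUE
    # before their parent promise completes, so once everything reports complete,
    # nothing new can still appear.
    sleep 0.01 until QUEUE.all?(&:complete?)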

A few ifs there but I'm going to try this approach and see how I get on.

I've added a successfully failing test case that shows the deadlock: e4de93e

@jkeen
Collaborator

jkeen commented Apr 12, 2024

And just for the sake of clarity, how does the scenario you outlined play out in the currently released version of graphiti?

I guess what I'm trying to understand is: if the strategy you're describing works, does it yield performance gains, or does it yield a safer approach than what is currently happening by default?

@MattFenelon
Contributor Author

Sure, I can do that. There's no (real) limit to the number of threads the released version of graphiti can use. Instead, the problem is that the unbounded threads can eventually exhaust the available database connection pool, since ActiveRecord has a 1 thread = 1 connection policy.

For this failure example, we have a puma thread pool of 1 and a database connection pool of 1 (a runnable sketch follows the steps):

  1. Thread 1 - The parent resource is accessed.
  2. Thread 1 - The parent resource loads data from ActiveRecord. ActiveRecord leases the thread a connection from the connection pool. The database connection pool is now empty
  3. Thread 1 - The parent resource creates Thread 2 to load the :authors resource
  4. Thread 1 - waits on thread 2.
  5. Thread 2 - Attempts to load author data from ActiveRecord. AR attempts to lease a database connection to the thread but can't because thread 1 has the only available connection.
  6. Thread 2 - raises ActiveRecord::ConnectionTimeoutError "all pooled connections were in use".
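This can be reproduced outside graphiti with plain ActiveRecord. A hedged sketch, assuming the sqlite3 adapter is available and using a short checkout_timeout so it fails fast:

    require "active_record"

    ActiveRecord::Base.establish_connection(
      adapter: "sqlite3",
      database: ":memory:",
      pool: 1,            # the 1-connection pool from the scenario above
      checkout_timeout: 1 # fail after 1 second instead of the default 5
    )

    ActiveRecord::Base.connection_pool.with_connection do
      # Thread 1 holds the only connection while it waits on Thread 2,
      # just like the parent resource waiting on its :authors sideload.
      sideload = Thread.new do
        # Thread 2 needs its own connection, but the pool is empty.
        ActiveRecord::Base.connection_pool.with_connection { |conn| conn.execute("SELECT 1") }
      end
      sideload.join # re-raises ActiveRecord::ConnectionTimeoutError from Thread 2
    end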

In the real world this is less likely to happen because there have to be enough web threads running at the same time to exhaust the database pool.

For example, say we have 5 puma threads and a database connection pool of 5:

  1. Thread 1 - The parent resource is accessed.
  2. Thread 1 - The parent resource loads data from ActiveRecord. ActiveRecord leases the thread a connection from the connection pool. There are 4 connections left in the pool
  3. Thread 1 - The parent resource creates Thread 2 to load the :authors resource
  4. Thread 1 - waits on thread 2.
  5. Thread 2 - Attempts to load author data from ActiveRecord. ActiveRecord leases the thread a connection from the connection pool. There are 3 connections left in the pool
  6. Thread 3 - a new request comes in. ActiveRecord leases the thread a connection from the connection pool. There are 2 connections left in the pool
  7. Thread 2 - finishes. The connection is made available to the pool again. There are 3 connections left in the pool
  8. Thread 1 - finishes waiting. The connection is made available to the pool again. There are 4 connections left in the pool

And so on. If someone has limited traffic, limited sideloads, or a large database connection pool, they may never see this issue.

A naive way to solve it is to just increase the database connection pool, but the AR error will rear its head eventually with enough traffic and sideloads.
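For a rough sense of the sizing involved, here is a back-of-envelope worst case per process, assuming every puma thread has a request in flight and each request holds its own connection plus one per concurrent sideload (numbers are illustrative):

    puma_threads          = 5 # RAILS_MAX_THREADS
    sideloads_per_request = 3 # "x" from the original description above

    # Each request can hold 1 connection itself plus x for its sideloads.
    connections_needed = puma_threads * (1 + sideloads_per_request)
    puts connections_needed # => 20, versus a pool sized to just 5 by default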
