Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect and down suspended process, #30323 #30335

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

patriknw
Copy link
Member

Early draft for discussion...

References #30323

@akka-ci akka-ci added the validating PR is currently being validated by Jenkins label Jun 21, 2021
if (suspendDetector.wasSuspended(suspendTimeout)) {
// note that suspend detection is running on all nodes
suspendDetected()
} else {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes here is mostly indentation, added above if suspended case

log.warning(
ClusterLogMarker.sbrInstability,
"SBR detected that the process was suspended too long and will down itself.")
actOnDecision(DowningStrategy.DownSelfSuspended)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how the logs from a kill -STOP followed by kill -CONT looks like:

[2021-06-21 12:17:18,570] [WARN] [akka.dispatch.SuspendDetector] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-32] - Process was suspended for [21 seconds]
[2021-06-21 12:17:18,575] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrInstability] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR detected that the process was suspended too long and will down itself.
[2021-06-21 12:17:18,575] [WARN] [akka.cluster.ClusterHeartbeat] [akka://ShoppingCartService@127.0.0.1:2551] [akkaHeartbeatStarvation] [ShoppingCartService-akka.actor.default-dispatcher-32] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [21634] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, CPU overload, or GC.
[2021-06-21 12:17:18,584] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrDowning] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR took decision DownSelfSuspended and is downing [akka://ShoppingCartService@127.0.0.1:2551] including myself,, [0] unreachable of [2] members, all members in DC [Member(akka://ShoppingCartService@127.0.0.1:2551, Up), Member(akka://ShoppingCartService@127.0.0.1:2552, Up)], full reachability status: []
[2021-06-21 12:17:18,585] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrDowningNode] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR is downing [UniqueAddress(akka://ShoppingCartService@127.0.0.1:2551,-2763929823204594617)]
[2021-06-21 12:17:18,585] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrInstability] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR detected that the process was suspended too long and will down itself.
[2021-06-21 12:17:18,585] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrDowning] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR took decision DownSelfSuspended and is downing [akka://ShoppingCartService@127.0.0.1:2551] including myself,, [0] unreachable of [2] members, all members in DC [Member(akka://ShoppingCartService@127.0.0.1:2551, Up), Member(akka://ShoppingCartService@127.0.0.1:2552, Up)], full reachability status: []
[2021-06-21 12:17:18,585] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [akkaMemberChanged] [ShoppingCartService-akka.actor.default-dispatcher-32] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - Marking node [akka://ShoppingCartService@127.0.0.1:2551] as [Down]
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrDowningNode] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR is downing [UniqueAddress(akka://ShoppingCartService@127.0.0.1:2551,-2763929823204594617)]
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.sharding.ShardRegion] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - ShoppingCart: Self downed, stopping ShardRegion [akka://ShoppingCartService/system/sharding/ShoppingCart]
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - Self downed, stopping
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - event MemberDowned(Member(akka://ShoppingCartService@127.0.0.1:2551, Down))
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - Singleton manager stopping singleton actor [akka://ShoppingCartService/system/sharding/ShoppingCartCoordinator/singleton]
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - event LeaderChanged(Some(akka://ShoppingCartService@127.0.0.1:2552))
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - This node is not the leader any more and not responsible for taking SBR decisions.
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - event RoleLeaderChanged(dc-default,Some(akka://ShoppingCartService@127.0.0.1:2552))
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - ClusterSingletonManager state change [Oldest -> Stopping]
[2021-06-21 12:17:18,588] [DEBUG] [shopping.cart.ShoppingCart$] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - PostStop cart [cart2]
[2021-06-21 12:17:18,589] [DEBUG] [akka.cluster.sharding.Shard] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - ShoppingCart: Shard [506] shutting down
[2021-06-21 12:17:18,591] [DEBUG] [akka.cluster.sharding.DDataShardCoordinator] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - ShoppingCart: Received termination message.
[2021-06-21 12:17:18,592] [DEBUG] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Unwatching: [Actor[akka://ShoppingCartService/system/sharding/ShoppingCartCoordinator/singleton/coordinator#1190468820] -> Actor[akka://ShoppingCartService@127.0.0.1:2552/system/sharding/ShoppingCart#279229579]]
[2021-06-21 12:17:18,592] [DEBUG] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Cleanup self watch of [akka://ShoppingCartService@127.0.0.1:2552/system/sharding/ShoppingCart]
[2021-06-21 12:17:18,593] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://ShoppingCartService@127.0.0.1:2551] [akkaClusterSingletonTerminated] [ShoppingCartService-akka.actor.default-dispatcher-32] - Singleton actor [akka://ShoppingCartService/system/sharding/ShoppingCartCoordinator/singleton] was terminated
[2021-06-21 12:17:19,575] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrInstability] [ShoppingCartService-akka.actor.default-dispatcher-35] - SBR detected that the process was suspended too long and will down itself.
[2021-06-21 12:17:19,575] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - is no longer leader
[2021-06-21 12:17:19,575] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrDowning] [ShoppingCartService-akka.actor.default-dispatcher-35] - SBR took decision DownSelfSuspended and is downing [akka://ShoppingCartService@127.0.0.1:2551] including myself,, [0] unreachable of [2] members, all members in DC [Member(akka://ShoppingCartService@127.0.0.1:2551, Up), Member(akka://ShoppingCartService@127.0.0.1:2552, Up)], full reachability status: []
[2021-06-21 12:17:19,575] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrDowningNode] [ShoppingCartService-akka.actor.default-dispatcher-35] - SBR is downing [UniqueAddress(akka://ShoppingCartService@127.0.0.1:2551,-2763929823204594617)]
[2021-06-21 12:17:19,576] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - Node has been marked as DOWN. Shutting down myself
[2021-06-21 12:17:19,577] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - Shutting down...
[2021-06-21 12:17:19,577] [DEBUG] [akka.cluster.typed.internal.receptionist.ClusterReceptionist] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.internal-dispatcher-14] - ClusterReceptionist [akka://ShoppingCartService@127.0.0.1:2551] - terminated/removed
[2021-06-21 12:17:19,578] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - Successfully shut down
[2021-06-21 12:17:19,579] [DEBUG] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-21] - Publish AddressTerminated [akka://ShoppingCartService@127.0.0.1:2552]
[2021-06-21 12:17:19,580] [DEBUG] [akka.cluster.typed.internal.receptionist.ClusterReceptionist] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.internal-dispatcher-22] - Cancel all timers
[2021-06-21 12:17:19,583] [INFO] [akka.actor.LocalActorRef] [akka://ShoppingCartService@127.0.0.1:2551] [akkaDeadLetter] [ShoppingCartService-akka.actor.default-dispatcher-35] - Message [akka.cluster.ClusterUserAction$Down] to Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] was not delivered. [1] dead letters encountered. If this is not an expected behavior then Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2021-06-21 12:17:19,583] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Running CoordinatedShutdown with reason [ClusterDowningReason]
[2021-06-21 12:17:19,583] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [before-service-unbind] with [0] tasks
[2021-06-21 12:17:19,584] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [service-unbind] with [1] tasks.
[2021-06-21 12:17:19,586] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [http-unbind-/127.0.0.1:8101] in CoordinatedShutdown phase [service-unbind]
[2021-06-21 12:17:19,588] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Performing phase [service-requests-done] with [1] tasks.
[2021-06-21 12:17:19,588] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Performing task [http-terminate-/127.0.0.1:8101] in CoordinatedShutdown phase [service-requests-done]
[2021-06-21 12:17:19,589] [INFO] [akka.actor.LocalActorRef] [akka://ShoppingCartService@127.0.0.1:2551] [akkaDeadLetter] [ShoppingCartService-akka.actor.default-dispatcher-35] - Message [akka.cluster.GossipEnvelope] from Actor[akka://ShoppingCartService@127.0.0.1:2552/system/cluster/core/daemon#1146845275] to Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] was not delivered. [2] dead letters encountered. If this is not an expected behavior then Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2021-06-21 12:17:19,590] [INFO] [akka.actor.LocalActorRef] [akka://ShoppingCartService@127.0.0.1:2551] [akkaDeadLetter] [ShoppingCartService-akka.actor.default-dispatcher-35] - Message [akka.cluster.GossipEnvelope] from Actor[akka://ShoppingCartService@127.0.0.1:2552/system/cluster/core/daemon#1146845275] to Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] was not delivered. [3] dead letters encountered. If this is not an expected behavior then Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [service-stop] with [0] tasks
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [before-cluster-shutdown] with [0] tasks
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-sharding-shutdown-region] with [1] tasks.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [region-shutdown] in CoordinatedShutdown phase [cluster-sharding-shutdown-region]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-leave] with [1] tasks.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [leave] in CoordinatedShutdown phase [cluster-leave]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-exiting] with [4] tasks.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [wait-exiting] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [wait-singleton-exiting] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [singleton-exiting-2] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [singleton-exiting-1] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-exiting-done] with [1] tasks.
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [exiting-completed] in CoordinatedShutdown phase [cluster-exiting-done]
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-shutdown] with [1] tasks.
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [wait-shutdown] in CoordinatedShutdown phase [cluster-shutdown]
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [before-actor-system-terminate] with [0] tasks
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [actor-system-terminate] with [1] tasks.
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [terminate-system] in CoordinatedShutdown phase [actor-system-terminate]
[2021-06-21 12:17:19,594] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Shutting down remote daemon.
[2021-06-21 12:17:19,595] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Remote daemon shut down; proceeding with flushing remote transports.
[2021-06-21 12:17:19,631] [WARN] [akka.stream.Materializer] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - [outbound connection to [akka://ShoppingCartService@127.0.0.1:2552], control stream] Upstream failed, cause: StreamTcpException: The connection has been aborted
[2021-06-21 12:17:19,631] [WARN] [akka.stream.Materializer] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - [outbound connection to [akka://ShoppingCartService@127.0.0.1:2552], message stream] Upstream failed, cause: StreamTcpException: The connection has been aborted
[2021-06-21 12:17:19,639] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Remoting shut down.
[2021-06-21 12:17:19,646] [INFO] [com.zaxxer.hikari.HikariDataSource] [] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - db - Shutdown initiated...
[2021-06-21 12:17:19,650] [INFO] [com.zaxxer.hikari.HikariDataSource] [] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - db - Shutdown completed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to note is that sharding already acts quickly on the downing, see

ShoppingCart: Self downed, stopping ShardRegion
PostStop cart [cart2]

import SuspendDetector.SuspendDetected

// FIXME config
private val tickInterval = 100.millis
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not have to be this frequently, because I think the scheduled task will trigger immediately when the process is woken up.

@akka-ci akka-ci added needs-attention Indicates a PR validation failure (set by CI infrastructure) and removed validating PR is currently being validated by Jenkins labels Jun 21, 2021
@akka-ci
Copy link

akka-ci commented Jun 21, 2021

Test FAILed.

Copy link
Member

@johanandren johanandren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I guess it's mostly just missing config for the various timeouts?

@patriknw
Copy link
Member Author

patriknw commented Jun 21, 2021

I guess it's mostly just missing config for the various timeouts?

yeah, I wanted to throw out the idea before doing too much. Also thinking if there are more places where it should check? In persistent actors, or should that be in the plugins? Thinking it could abort any updates in flight (that haven't been sent to db yet).

@johanandren
Copy link
Member

Hmm, yeah, for persistence would be to avoid delayed writes when the actor was already started on another node?
If we want it for persistence I think it should be in the Akka Persistence journal infra and not in each individual journal plugin.

I can't think of any other place where it is immediately obvious to me that it would be good, I think I'll have to "marinate" the idea for a while and see what comes up.

@volatile private var aliveTime = System.nanoTime()
@volatile private var suspendDetectedTime = aliveTime - 1.day.toNanos

system.scheduler.scheduleWithFixedDelay(tickInterval, tickInterval) { () =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this approach (scheduling checkTime on the scheduler and dispatcher) yield a false positive in cases of thread starvation within the application? Thread starvation like that is definitely https://www.youtube.com/watch?v=MpUWrl3-mc8, but the response of pre-emptively tearing things down might be out of proportion; so we have a very loose deadline.

It seems like the intent is to capture scenarios where nothing in the JVM process gets scheduled (suspended process/VM, but also sufficiently-high levels of host load or priority inversions etc.), so maybe having the check run in a thread that is nearly always sleeping would be more effective and could have a faster reaction time? My understanding is that the kernel schedulers tend to be eager to give CPU to processes that haven't been runnable in a long time (e.g. because they're sleeping).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-attention Indicates a PR validation failure (set by CI infrastructure)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants