
[DocDB] Redundant RPCs in conflicting workloads caused due to current approach of tracking wait-for dependencies #22426

Open
basavaraj29 opened this issue May 16, 2024 · 1 comment

@basavaraj29
Contributor

basavaraj29 commented May 16, 2024

Jira Link: DB-11334

Description

cd5905e seems to have caused a regression for conflicting workloads.

[Image: locking benchmark latency comparison across builds]

In one of the locking benchmarks, the average latency is close to 3x what it was earlier. The spike is first observed between 2.23.0.0-b189 and 2.23.0.0-b190.

Looking at the Grafana metrics, we observed an increased number of TabletServerService RPCs. Additional investigation is needed into where the extra RPCs were introduced.

[Image: Grafana metrics showing increased TabletServerService RPCs]

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@basavaraj29 basavaraj29 self-assigned this May 16, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels May 16, 2024
@basavaraj29 basavaraj29 changed the title [DocDB] Redundant RPC [DocDB] Redundant RPCs in conflicting workloads caused due to current approach of tracking wait-for dependencies May 16, 2024
@yugabyte-ci yugabyte-ci added 2024.1_blocker and removed status/awaiting-triage Issue awaiting triage labels May 17, 2024
basavaraj29 added a commit that referenced this issue May 23, 2024
…ng high latencies in conflicting workloads

Summary:
Commit cd5905e introduced a regression that caused latency spikes close to 3x in conflicting workloads.

**A brief on how deadlock detection is done today**
The deadlock detection algorithm has two components: the local waiting transaction registry and the detector itself.

Every node has a local waiting transaction registry that tracks wait-for dependencies on that node. On receiving a wait-for dependency of a transaction from any tablet peer on that node, the registry stores it locally (aggregated by status tablet) and forwards the information to the blocker’s detector. This is treated as a partial update from the source registry. Additionally, the registry periodically sends full updates to each detector, containing all active wait-for dependencies collected for the corresponding status tablet.
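
To make the registry side concrete, here is a minimal C++ sketch under simplified assumptions; `WaitForEdge`, `LocalWaitingTxnRegistry`, and the `SendToDetector` callback are illustrative names, not the actual YugabyteDB classes or RPC plumbing:

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-ins; the real code uses transaction ids and protobuf messages.
struct WaitForEdge {
  std::string waiter_txn;
  std::string blocker_txn;
  std::string blocker_status_tablet;  // identifies the blocker's detector
};

class LocalWaitingTxnRegistry {
 public:
  // Callback that ships a (partial or full) update to a status tablet's detector.
  using SendToDetector = std::function<void(
      const std::string& status_tablet, const std::vector<WaitForEdge>& edges, bool full_update)>;

  explicit LocalWaitingTxnRegistry(SendToDetector send) : send_(std::move(send)) {}

  // Called when a tablet peer on this node reports that a waiter started blocking.
  void RecordWaitFor(const WaitForEdge& edge) {
    edges_by_status_tablet_[edge.blocker_status_tablet].push_back(edge);
    // Partial update: forward just this new dependency to the blocker's detector.
    send_(edge.blocker_status_tablet, {edge}, /*full_update=*/false);
  }

  // Periodically invoked: resend every active dependency, grouped by status tablet.
  void SendFullUpdates() {
    for (const auto& [status_tablet, edges] : edges_by_status_tablet_) {
      send_(status_tablet, edges, /*full_update=*/true);
    }
  }

 private:
  SendToDetector send_;
  std::map<std::string, std::vector<WaitForEdge>> edges_by_status_tablet_;
};
```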

Every status tablet maintains a list of wait-for dependencies keyed by the `<waiter txn, source tserver uuid>` pair in `waiters_`. On a full update, we erase all dependencies from that tserver and repopulate `waiters_`. On every partial update:
1. Pre `cd5905e`, the detector checks the time of the incoming probe and overwrites the information only if the timestamp of the probe is newer.
2. Post `cd5905e`, the detector keeps appending the wait-for dependencies. Here’s why: approach 1 could lose wait-for dependencies in the following way. If a waiter txn is waiting at multiple tablets on a tserver, and is waiting on blockers that share the same status tablet, the detector receives two partial updates; these race, and only the last incoming probe is preserved in memory. This can leave true deadlocks undetected. Hence, the detector doesn’t overwrite information on partial updates, but merges it in. Refer to the commit description for further details. (See the sketch below.)
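
A simplified sketch of this bookkeeping, assuming illustrative names (`WaiterKey`, `Detector::HandleFullUpdate`, `Detector::HandlePartialUpdate`) rather than the real implementation:

```cpp
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <tuple>

// Illustrative sketch of the detector-side bookkeeping; types and names are simplified.
struct WaiterKey {
  std::string waiter_txn;
  std::string source_tserver_uuid;
  bool operator<(const WaiterKey& o) const {
    return std::tie(waiter_txn, source_tserver_uuid) <
           std::tie(o.waiter_txn, o.source_tserver_uuid);
  }
};

class Detector {
 public:
  // Full update: drop everything previously reported by this tserver, then repopulate.
  void HandleFullUpdate(const std::string& tserver_uuid,
                        const std::map<std::string, std::set<std::string>>& waiter_to_blockers) {
    for (auto it = waiters_.begin(); it != waiters_.end();) {
      it = (it->first.source_tserver_uuid == tserver_uuid) ? waiters_.erase(it) : std::next(it);
    }
    for (const auto& [waiter, blockers] : waiter_to_blockers) {
      waiters_[{waiter, tserver_uuid}] = blockers;
    }
  }

  // Partial update, post cd5905e: merge the new blockers into the existing entry so a
  // racing second partial update cannot wipe out the first one.
  void HandlePartialUpdate(const std::string& tserver_uuid, const std::string& waiter,
                           const std::set<std::string>& new_blockers) {
    auto& blockers = waiters_[{waiter, tserver_uuid}];
    blockers.insert(new_blockers.begin(), new_blockers.end());
    // Pre cd5905e this was effectively an overwrite, gated on the probe's timestamp:
    //   blockers = new_blockers;
  }

 private:
  std::map<WaiterKey, std::set<std::string>> waiters_;
};
```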

On a partial update, the detector launches probes for just the new dependencies. The probe forwarding algorithm then kicks in: each node, on receiving a probe, forwards it along its known blockers. For instance, probe[w1 -> b1] is forwarded to b1's detector, and b1's detector in turn forwards the same probe along b1's blockers.
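
A rough, single-process sketch of the forwarding idea; the real system distributes each hop across detectors via `ProbeTransactionDeadlock` RPCs, and `ProbeGraph`/`Forward` are illustrative names only:

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Single-process model of probe forwarding; in reality every hop is an RPC to the
// detector that owns the blocker's status tablet.
struct Probe {
  std::string origin_waiter;  // the waiter the probe was originally launched for
  uint64_t probe_num = 0;     // identifies the probe across hops
};

class ProbeGraph {
 public:
  void AddEdge(const std::string& waiter, const std::string& blocker) {
    blockers_of_[waiter].insert(blocker);
  }

  // Forward probe[... -> blocker]: if the blocker is the probe's origin, the probe has
  // come back around and we have a cycle (a deadlock). Otherwise forward the same probe
  // along the blocker's own known blockers.
  bool Forward(const Probe& probe, const std::string& blocker, std::set<std::string>* visited) {
    if (blocker == probe.origin_waiter) {
      return true;  // deadlock detected
    }
    if (!visited->insert(blocker).second) {
      return false;  // this blocker was already explored for this probe
    }
    for (const auto& next_blocker : blockers_of_[blocker]) {
      if (Forward(probe, next_blocker, visited)) {
        return true;
      }
    }
    return false;
  }

 private:
  std::map<std::string, std::set<std::string>> blockers_of_;
};
```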

The detector prunes the dependencies periodically when it triggers full probe scans, i.e., when it launches probes for all existing wait-for dependencies at the detector.

**Why cd5905e caused a regression in conflicting workloads**
For simplicity, assume a universe with just 1 status tablet and 1 tserver, and a scenario where hundreds of waiters are trying to update the same row. `txn1` replicates its intents and `txn2` … `txn100` enter the wait queue with `txn1` as the blocker. So the detector would be tracking the following dependencies: `txn2 -> txn1`, ..., `txn100 -> txn1`. Now when `txn1` commits, all the waiters in the wait queue at the tablet are signaled. `txn2` replicates its intents, and the rest re-enter the wait queue with `txn2` as the blocker.

The detector would now receive probes `txn3 -> txn2`, ..., `txn100 -> txn2`. Pre `cd5905e`, this new partial update would have erased `txn3 -> txn1`, ..., `txn100 -> txn1`. But post `cd5905e`, the list is updated with the new information. So `txn[3-100]` would launch redundant probes to `txn1` on every subsequent partial update (until a full update from the tserver comes in). Additionally, `txn[3-100] -> txn2` would also indirectly result in redundant `txn2 -> txn1` probes being launched.
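
To illustrate the accumulation (not the exact RPC pattern), here is a toy snippet showing how the merged entry for one waiter keeps growing between full updates; `txn50` and the generation count are made up for the example:

```cpp
#include <iostream>
#include <set>
#include <string>

// Tiny illustration of how the merge semantics accumulate stale blockers for a single
// waiter across blocker "generations" (txn1 resolves, then txn2, then txn3, ...).
int main() {
  std::set<std::string> blockers_of_txn50;  // the detector's entry for one waiter
  for (int generation = 1; generation <= 5; ++generation) {
    // Post cd5905e: the new blocker is merged in; stale entries linger until a full update.
    blockers_of_txn50.insert("txn" + std::to_string(generation));
    // Pre cd5905e this would have been an overwrite with just the latest blocker.
    std::cout << "after generation " << generation << ": " << blockers_of_txn50.size()
              << " blocker(s) tracked for txn50\n";
  }
  // Each stale entry is a candidate for a redundant probe RPC until the next full
  // update from the tserver trims it.
  return 0;
}
```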

**Solution**
The diff addresses the issue by marking inactive probes and preventing redundant RPCs along these probes. Additionally, when we receive a partial update, we prune inactive wait-for relationships while forming the new list of dependencies.

So in the above example, once `txn2` resolves and `txn3` blocks the remaining transactions, no RPCs would be launched along `-> txn1`, since those probes would have been marked inactive as part of the probe forwarding mechanism while the transactions were waiting on `txn2`. It follows that no RPCs would be launched along `-> txn2` when transactions are waiting on `txn4`, and so on.

Note that when probe `txn3 -> txn2` results in probe `txn2 -> txn1`, and the latter probe is inactive, the local probe processor marks `txn2 -> txn1` alone as inactive. This "inactiveness" isn't propagated back along the callback of `txn3 -> txn2`, which is the expected and correct behavior.
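
A minimal sketch of the pruning idea under simplified assumptions; `InactiveProbeTracker`, `MarkInactive`, and `PruneInactive` are illustrative names, not the diff's actual classes:

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch: an edge is a (waiter txn, blocker txn) wait-for pair.
using Edge = std::pair<std::string, std::string>;

class InactiveProbeTracker {
 public:
  // Called when probe processing determines a wait-for edge is no longer active
  // (e.g. the blocker has resolved), so future RPCs along it can be skipped.
  void MarkInactive(const Edge& edge) { inactive_edges_.insert(edge); }

  // Should we still launch/forward an RPC along this probe?
  bool IsActive(const Edge& edge) const { return inactive_edges_.count(edge) == 0; }

  // On a partial update, form the new dependency list from active edges only; inactive
  // wait-for relationships are pruned instead of being re-probed.
  std::vector<Edge> PruneInactive(const std::vector<Edge>& incoming) const {
    std::vector<Edge> active;
    for (const auto& edge : incoming) {
      if (IsActive(edge)) {
        active.push_back(edge);
      }
    }
    return active;
  }

 private:
  std::set<Edge> inactive_edges_;
};
```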

**Upgrade/Rollback safety**
Introduced a new field `should_erase_probe` in `ProbeTransactionDeadlockResponsePB`. For the duration of the upgrade/downgrade, nodes won't prune inactive wait-for dependencies if either they are running older versions or the RPC responses are from nodes running older versions. Hence, they might launch redundant `ProbeTransactionDeadlock` RPCs, leading to increased latencies in conflicting workloads, but this shouldn't lead to any correctness issues.
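
A hedged sketch of how the version-gated check on the response path might look; `FakeProbeResponse` and `ShouldPruneDependency` are stand-ins for illustration only, not the actual protobuf or code:

```cpp
#include <optional>

// Stand-in for ProbeTransactionDeadlockResponsePB: an unset optional models a response
// from a node running an older version that doesn't know about should_erase_probe.
struct FakeProbeResponse {
  std::optional<bool> should_erase_probe;
};

// Only prune the corresponding wait-for dependency when the responder explicitly says
// the probe can be erased. Old-version responders never set the field, so the dependency
// is kept: possibly redundant RPCs, but no correctness issue.
bool ShouldPruneDependency(const FakeProbeResponse& resp) {
  return resp.should_erase_probe.value_or(false);
}
```
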
Jira: DB-11334

Test Plan:
Jenkins

Verified that the revision fixes the regression by running the locking semantics workload with the changes.

1. This is with the first diff in this revision, where we see that latencies are back to their old values.
{F179112}

2. Though there aren't many changes between the third and first diffs in the revision (aside from correctness fixes), there seems to be a slight increase in latency; this should be run-to-run variance. These comparisons are from multiple runs against the third revision.
{F179315}
{F179316}

Reviewers: rthallam, pjain, sergei

Reviewed By: sergei

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D35149
svarnau pushed a commit that referenced this issue May 25, 2024
…ng high latencies in conflicting workloads

@rthallamko3
Contributor

Reactivating for 2.20 backport.

@rthallamko3 rthallamko3 reopened this May 28, 2024
svarnau pushed a commit that referenced this issue May 29, 2024
…adlock rpcs causing high latencies in conflicting workloads

Summary:
Original commit: 5ff199c / D35149

Reviewers: rthallam, pjain, sergei

Reviewed By: rthallam

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35365
svarnau pushed a commit that referenced this issue May 29, 2024
…Deadlock rpcs causing high latencies in conflicting workloads

Summary:
Original commit: 1302ccd / D35149

Reviewers: rthallam, pjain, sergei

Reviewed By: sergei

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35302
svarnau pushed a commit that referenced this issue May 29, 2024
…adlock rpcs causing high latencies in conflicting workloads

Summary:
Original commit: 1302ccd / D35149

Reviewers: rthallam, pjain, sergei

Reviewed By: sergei

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35303
Projects
Status: Backporting
3 participants