[DocDB] Redundant RPCs in conflicting workloads caused due to current approach of tracking wait-for dependencies #22426
Labels
2.20 Backport Required, 2024.1 Backport Required, 2024.1_blocker, area/docdb (YugabyteDB core features), kind/bug (This issue is a bug), priority/medium (Medium priority issue)
Comments
basavaraj29 added the area/docdb (YugabyteDB core features), status/awaiting-triage (Issue awaiting triage), 2.20 Backport Required, and 2024.1 Backport Required labels on May 16, 2024
yugabyte-ci added the kind/bug (This issue is a bug) and priority/medium (Medium priority issue) labels on May 16, 2024
basavaraj29 changed the title from "[DocDB] Redundant RPC" to "[DocDB] Redundant RPCs in conflicting workloads caused due to current approach of tracking wait-for dependencies" on May 16, 2024
yugabyte-ci added the 2024.1_blocker label and removed the status/awaiting-triage (Issue awaiting triage) label on May 17, 2024
basavaraj29 added a commit that referenced this issue on May 23, 2024:
…ng high latencies in conflicting workloads

Summary: Commit [[ cd5905e | cd5905e ]] introduced a regression that caused latency spikes close to 3x in conflicting workloads.

**A brief on how deadlock detection is done today**

The deadlock detection algorithm has two components: the local waiting txn registry and the detector itself. Every node has a local waiting transaction registry that tracks wait-for dependencies on that node. On receiving a wait-for dependency of a transaction from any tablet peer on that node, in addition to storing it locally (aggregated by status tablet), the registry forwards the information to the blocker's detector. This is treated as a partial update from the source registry. Additionally, the registry periodically sends full updates to each detector, including all active wait-for dependencies collected for the corresponding status tablet.

Every status tablet maintains a list of wait-for dependencies keyed by the `<waiter txn, source tserver uuid>` pair in `waiters_`. On a full update, we erase all dependencies from that tserver and populate `waiters_` again. On every partial update:
1. Pre `cd5905e`, the detector checks the time of the incoming probe and overwrites the information only if the timestamp of the probe is newer.
2. Post `cd5905e`, the detector keeps appending the wait-for dependencies.

Here's why: approach 1 could lose wait-for dependencies. If a waiter txn is waiting at multiple tablets on a tserver, on blockers with the same status tablet, the detector would receive two partial updates; a race means only the last incoming probe is preserved in memory, so true deadlocks can go undetected. Hence, the detector doesn't overwrite information on partial updates, but merges it. Refer to the commit description for further details. On a partial update, the detector launches probes for just the new dependencies alone.
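The bookkeeping described above can be sketched as a minimal Python model (the actual implementation is C++ inside YugabyteDB; class and method names here are illustrative, not the real API):

```python
from collections import defaultdict

class DetectorModel:
    """Toy model of per-status-tablet wait-for tracking in the detector."""

    def __init__(self):
        # (waiter_txn, source_tserver_uuid) -> set of blocker txns
        self.waiters = defaultdict(set)

    def full_update(self, tserver, deps):
        # Erase everything previously reported by this tserver, then
        # repopulate from the full snapshot of active dependencies.
        for key in [k for k in self.waiters if k[1] == tserver]:
            del self.waiters[key]
        for waiter, blocker in deps:
            self.waiters[(waiter, tserver)].add(blocker)

    def partial_update(self, tserver, deps):
        # Post-cd5905e behavior: merge rather than overwrite, so a racing
        # partial update cannot drop dependencies. Probes are launched
        # only for the dependencies that are actually new.
        new_deps = []
        for waiter, blocker in deps:
            if blocker not in self.waiters[(waiter, tserver)]:
                self.waiters[(waiter, tserver)].add(blocker)
                new_deps.append((waiter, blocker))
        return new_deps
```

Note how merging on partial updates keeps stale entries around until the next full update erases them; that is the behavior the regression below hinges on.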
Then the probe forwarding algorithm kicks in: each node, on receiving a probe, forwards it along its known blockers. For instance, probe `[w1 -> b1]` is forwarded to b1's detector, and b1's detector in turn forwards the same probe to b1's blockers. The detector prunes the dependencies periodically when it triggers full probe scans, i.e., when it launches probes for all existing wait-for dependencies at the detector.

**Why cd5905e caused a regression in conflicting workloads**

For simplicity, assume a universe with just one status tablet and one tserver, and a scenario where hundreds of waiters are trying to update the same row. `txn1` replicates intents, and `txn2` … `txn100` enter the wait queue with `txn1` as the blocker. The detector would be tracking the dependencies `txn2 -> txn1`, ..., `txn100 -> txn1`. When `txn1` commits, all the waiters in the wait queue at the tablet are signaled. `txn2` replicates its intents, and the rest enter the wait queue with `txn2` as the blocker. The detector now receives probes `txn3 -> txn2`, ..., `txn100 -> txn2`. Pre `cd5905e`, this new partial update would have erased `txn3 -> txn1`, ..., `txn100 -> txn1`. Post `cd5905e`, the list is updated with the new information, so `txn[3-100]` would launch redundant probes to `txn1` on every subsequent partial update (until a full update from the tserver comes in). Additionally, `txn[3-100] -> txn2` would also indirectly result in redundant `txn2 -> txn1` probes being launched.

**Solution**

The diff addresses the issue by marking inactive probes and preventing redundant RPCs along those probes. Additionally, when we receive a partial update, while forming the new list of dependencies, we prune inactive wait-for relationships.
So in the above example, once `txn2` resolves and `txn3` blocks the remaining transactions, no RPCs would be launched along `-> txn1`, since those probes were already marked inactive by the probe forwarding mechanism while the transactions were waiting on `txn2`. It follows that no RPCs would be launched along `-> txn2` when transactions are waiting on `txn4`, and so on. Note that when probe `txn3 -> txn2` results in probe `txn2 -> txn1`, and the latter probe is inactive, the local probe processor marks `txn2 -> txn1` alone as inactive. This "inactiveness" isn't sent back along the callback of `txn3 -> txn2`, which is the expected and correct behavior.

**Upgrade/Rollback safety**

Introduced a new field `should_erase_probe` in `ProbeTransactionDeadlockResponsePB`. For the duration of the upgrade/downgrade, nodes won't prune inactive wait-for dependencies if either they are running older versions or the RPC responses are from nodes running older versions. Hence, they might launch redundant `ProbeTransactionDeadlock` RPCs, leading to increased latencies in conflicting workloads, but this shouldn't lead to any correctness issues.

Jira: DB-11334

Test Plan: Jenkins. Verified that the revision fixes the regression by running the locking semantics workload with the changes.
1. With the first diff in this revision, latencies are back to the old values. {F179112}
2. Though there aren't many changes between the third and first diffs in the revision (except correctness fixes), there seems to be a slight increase in latency; this is likely run-to-run variance. These comparisons are from multiple runs against the third revision. {F179315} {F179316}

Reviewers: rthallam, pjain, sergei
Reviewed By: sergei
Subscribers: ybase
Differential Revision: https://phorge.dev.yugabyte.com/D35149
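A minimal sketch of the fix's bookkeeping (illustrative Python with assumed names, not the actual C++ code): probes along resolved blockers are marked inactive, RPCs along inactive probes are skipped, and inactive entries are pruned when the merged dependency list is rebuilt on a partial update.

```python
class ProbeTrackerModel:
    """Toy model of the inactive-probe bookkeeping described above."""

    def __init__(self):
        self.inactive = set()  # (waiter, blocker) pairs known to be stale

    def mark_inactive(self, waiter, blocker):
        # Called when probe forwarding discovers the blocker has resolved.
        self.inactive.add((waiter, blocker))

    def probes_to_launch(self, deps):
        # Skip redundant RPCs along probes already marked inactive.
        return [d for d in deps if d not in self.inactive]

    def merge_partial_update(self, existing, new_deps):
        # While forming the merged dependency list on a partial update,
        # prune inactive wait-for relationships instead of carrying them.
        return {d for d in list(existing) + list(new_deps)
                if d not in self.inactive}
```

In the `txn1`/`txn2` scenario above, once `(txn3, txn1)` is marked inactive, later partial updates neither re-probe it nor keep it in the merged list.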
svarnau pushed a commit that referenced this issue on May 25, 2024:
…ng high latencies in conflicting workloads (same commit message as above; Differential Revision: https://phorge.dev.yugabyte.com/D35149)
Reactivating for 2.20 backport.
svarnau pushed a commit that referenced this issue on May 29, 2024:
…adlock rpcs causing high latencies in conflicting workloads (same commit message as above; Original commit: 5ff199c / D35149; Reviewed By: rthallam; Tags: #jenkins-ready; Differential Revision: https://phorge.dev.yugabyte.com/D35365)
svarnau pushed a commit that referenced this issue on May 29, 2024:
…Deadlock rpcs causing high latencies in conflicting workloads (same commit message as above; Original commit: 1302ccd / D35149; Reviewed By: sergei; Tags: #jenkins-ready; Differential Revision: https://phorge.dev.yugabyte.com/D35302)
svarnau pushed a commit that referenced this issue on May 29, 2024:
…adlock rpcs causing high latencies in conflicting workloads (same commit message as above; Original commit: 1302ccd / D35149; Reviewed By: sergei; Tags: #jenkins-ready; Differential Revision: https://phorge.dev.yugabyte.com/D35303)
Jira Link: DB-11334
Description
cd5905e seems to have caused a regression in conflicting workloads.
In one of the locking benchmarks, we can see that the average latency is close to 3x of what it was earlier. The spike is first observed between 2.23.0.0-b189 and 2.23.0.0-b190.
Looking at the Grafana metrics, we observed an increased number of TabletServerService RPCs. Needs additional investigation into where the additional RPCs were introduced.
Issue Type: kind/bug
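For context, the RPCs in question come from the deadlock detector's probe forwarding: a probe `[waiter -> blocker]` is forwarded along each blocker's own known blockers, and a deadlock is declared if a probe reaches its originating waiter again. A rough model of that traversal (hypothetical names; the real system forwards probes via RPC between detectors rather than walking a local map):

```python
def has_deadlock(wait_for, origin):
    """wait_for: txn -> set of blocker txns (the detector's merged view).

    Each stale edge retained in wait_for translates into extra probe
    forwarding RPCs in the real system, which is the regression at issue.
    """
    stack = list(wait_for.get(origin, ()))
    seen = set()
    while stack:
        txn = stack.pop()
        if txn == origin:
            return True        # the probe came back to its originator
        if txn in seen:
            continue           # already forwarded along this txn
        seen.add(txn)
        stack.extend(wait_for.get(txn, ()))
    return False
```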