-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VReplication: Improve handling of vplayer stalls #15797
base: main
Are you sure you want to change the base?
Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
fe9ed6d
to
03a5b03
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
03a5b03
to
bee30d4
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15797 +/- ##
==========================================
- Coverage 68.47% 68.23% -0.25%
==========================================
Files 1562 1541 -21
Lines 197083 197127 +44
==========================================
- Hits 134962 134517 -445
- Misses 62121 62610 +489 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
cb5fc2e
to
b78059d
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
b78059d
to
d4517f2
Compare
…eout Signed-off-by: Matt Lord <mattalord@gmail.com>
@rohit-nayak-ps I started on that work here: 7b5c4c5 There will still be more work to do, so I will let you know when it's ready for another review. Until then, you can see where I'm going if you like and let me know if you perhaps changed your mind (it's certainly much more involved this way, but I do think it's better overall). |
2e54673
to
fb0f29e
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
@rohit-nayak-ps after spending quite a bit of time on it, I ended up reverting the work: b060e09 There was still more work to do there, and in working on it, it became more clear to me that this really is a vplayer component setting rather than a workflow option. So passing a duration all the way through from client flags, for each type of stream, just to configure the vplayer on the tablet, seemed like a lot of work today and into the future — any command that does use a vplayer that we want to be able to configure this would need to have it too, e.g. OnlineDDL and ApplySchema also need them today as the original issue seen was with OnlineDDL. In such I think it makes sense for it to be a vttablet flag. If you feel strongly about it, however, I can pick that work back up. It was nearly to the point where you could set it, view it, and update it in the various client commands. From there we just need to read and unmarshal the value from the _vt.vreplication record in the vreplication engine and thread it through the controller, vreplicator, and vplayer. |
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@rohit-nayak-ps I could also just remove the transaction duration (stallHandler) part and stick with the simpler heartbeat based check. I do believe that will still be able to catch all of the cases, including the one seen in the original production issue. In that case the position was not getting updated via the replicated transaction or the heartbeat recording, so we should still have gotten the stalled vttablet log and the issue noted in the workflow's message field. In any event, I did another test with the latest state:
|
This might indeed be a good path to choose, since your other change is catching the known issues. Also while goroutines are lightweight, generating goroutines for every transaction could also lead to unpredictable performance impacts due to garbage collection, memory usage, scheduling issues etc especially since we will be enabling it in a high-qps situation. |
6050c19
to
d9a276c
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
d9a276c
to
3839e90
Compare
@rohit-nayak-ps Having more than 1 active/concurrent goroutine per I've added comments and a new unit test that demonstrate it now working as intended: 3839e90 I wanted to get it working so that even if I do end up discarding it here we'll still have it in the PR history if we ever want to revive it or reuse it for something else. With all that being said, let me know what you think. I'm fine removing it for now since it was only well into my testing and debugging that I realized the existing heartbeat mechanism combined with the |
6f9d494
to
d1f65b6
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
d1f65b6
to
d2fea8d
Compare
@mattlord now that's it's been said, it cannot be unsaid... :) |
Signed-off-by: Matt Lord <mattalord@gmail.com>
3ebfc1d
to
c666b14
Compare
@rohit-nayak-ps and @shlomi-noach this removes all of the What we lose w/o it is the ability to perform out-of-band monitoring and errors. Meaning that the heartbeat method will only detect a stall when it was due to a failure to commit the transaction which updates the timestamp for the workflow (whether it was done on its own or as part of replicating user generated events). If you both prefer that then I'll update the PR description accordingly. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattlord thank you, I think we should go with this change, seeing that it's so succinct.
Description
Please see the issue for details about the problems we're attempting to solve/improve in this PR.
The improvements here are about detecting when we're not making any progress and showing/logging meaningful errors to replace the eventual generic EOF errors.
There are two types of stalls that we now detect and error on:
Related Issue(s)
Checklist