VReplication: Improve handling of vplayer stalls #15797

mattlord · 2024-04-25T15:58:28Z

Description

Please see the issue for details about the problems we're attempting to solve/improve in this PR.

The improvements here are about detecting when we're not making any progress and showing/logging meaningful errors to replace the eventual generic EOF errors.

There are two types of stalls that we now detect and error on:

1. Being unable to record heartbeats in the stream
   - This is a simple check and is always enabled.
2. Being unable to complete/commit a transaction consisting of replicated user events, which includes updating the vreplication record to record the latest position upon commit
   - This is disabled by default: the new flag that controls the timeout defaults to 0s. This is a more complicated check and, given that the scenario we're improving is (to the best of my knowledge) very rare, I did not want to potentially introduce new edge cases by default, especially since the heartbeat change should catch many of them.

Related Issue(s)

Fixes: Bug Report: VPlayer does not detect stalls #15974

Checklist

"Backport to:" labels have been added if this change should be back-ported to release branches
If this change is to be back-ported to previous releases, a justification is included in the PR description
Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

vitess-bot · 2024-04-25T15:58:30Z


codecov · 2024-04-25T16:26:38Z

Codecov Report

Attention: Patch coverage is 64.28571%, with 5 lines in your changes missing coverage. Please review.

Project coverage is 68.23%. Comparing base (3b09eb2) to head (c666b14).
Report is 20 commits behind head on main.

Files	Patch %	Lines
.../vt/vttablet/tabletmanager/vreplication/vplayer.go	64.28%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #15797      +/-   ##
==========================================
- Coverage   68.47%   68.23%   -0.25%     
==========================================
  Files        1562     1541      -21     
  Lines      197083   197127      +44     
==========================================
- Hits       134962   134517     -445     
- Misses      62121    62610     +489



mattlord · 2024-05-24T20:57:59Z

> Instead of a tablet level flag, what do you think about adding a workflow option (in the new json object). That way we can allow it to be updated using WorkflowUpdate and it can be dynamically set/reset if required when the vplayer restarts and loads the settings?

That's a good idea! With the new json field that will be much simpler and I like that better.

@rohit-nayak-ps I started on that work here: 7b5c4c5

There will still be more work to do, so I will let you know when it's ready for another review. Until then, you can see where I'm going if you like and let me know if you perhaps changed your mind (it's certainly much more involved this way, but I do think it's better overall).


mattlord · 2024-05-25T06:13:21Z

> Instead of a tablet level flag, what do you think about adding a workflow option (in the new json object). That way we can allow it to be updated using WorkflowUpdate and it can be dynamically set/reset if required when the vplayer restarts and loads the settings?

That's a good idea! With the new json field that will be much simpler and I like that better.

@rohit-nayak-ps after spending quite a bit of time on it, I ended up reverting the work: b060e09

There was still more work to do there, and while working on it, it became clearer to me that this really is a vplayer component setting rather than a workflow option. Passing a duration all the way through from client flags, for each type of stream, just to configure the vplayer on the tablet seemed like a lot of work, both today and in the future: any command that uses a vplayer that we want to be able to configure would need the option too. For example, OnlineDDL and ApplySchema would need it today, as the original issue was seen with OnlineDDL. As such, I think it makes sense for it to be a vttablet flag. If you feel strongly about it, however, I can pick that work back up. It was nearly at the point where you could set it, view it, and update it in the various client commands. From there we would just need to read and unmarshal the value from the _vt.vreplication record in the vreplication engine and thread it through the controller, vreplicator, and vplayer.


mattlord · 2024-05-25T06:55:54Z

@rohit-nayak-ps I could also just remove the transaction duration (stallHandler) part and stick with the simpler heartbeat based check. I do believe that will still be able to catch all of the cases, including the one seen in the original production issue. In that case the position was not getting updated via the replicated transaction or the heartbeat recording, so we should still have gotten the stalled vttablet log and the issue noted in the workflow's message field.

In any event, I did another test with the latest state:

❯ grep "stall handler" /opt/vtdataroot/tmp/*
...
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.532924   12895 vplayer.go:897] StallHandler-2527617803385045679: After stopping the stall handler goroutine for 15s, total: 422, active: 0 -- totalStallHandlerCount: 36
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537210   12895 vplayer.go:878] StallHandler-2527617803385045679: Resetting the stall handler timer to 15s
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537223   12895 vplayer.go:893] StallHandler-2527617803385045679: Starting the stall handler goroutine for 15s, total: 423, active: 1, totalStallHandlerCount: 36
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537449   12895 vplayer.go:878] StallHandler-2527617803385045679: Resetting the stall handler timer to 15s
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537781   12895 vplayer.go:917] StallHandler-2527617803385045679: Stopping the stall handler timer
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537791   12895 vplayer.go:905] StallHandler-2527617803385045679: the stall handler timer was stopped
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537793   12895 vplayer.go:897] StallHandler-2527617803385045679: After stopping the stall handler goroutine for 15s, total: 423, active: 0 -- totalStallHandlerCount: 36

rohit-nayak-ps · 2024-05-25T18:14:26Z

> @rohit-nayak-ps I could also just remove the transaction duration (stallHandler) part and stick with the simpler heartbeat based check.

This might indeed be a good path to choose, since your other change is catching the known issues.

Also, while goroutines are lightweight, generating goroutines for every transaction could lead to unpredictable performance impacts due to garbage collection, memory usage, scheduling issues, etc., especially since we will be enabling it in a high-QPS situation.


mattlord · 2024-05-26T06:10:12Z

> @rohit-nayak-ps I could also just remove the transaction duration (stallHandler) part and stick with the simpler heartbeat based check.

> This might indeed be a good path to choose, since your other change is catching the known issues.

> Also, while goroutines are lightweight, generating goroutines for every transaction could lead to unpredictable performance impacts due to garbage collection, memory usage, scheduling issues, etc., especially since we will be enabling it in a high-QPS situation.

@rohit-nayak-ps Having more than 1 active/concurrent goroutine per vplayer/stallHandler was a bug. The intention was for this to be a lock-free, idempotent stall handler that was very difficult to mess up, precisely because it's a high-QPS component with a somewhat awkward way of managing transactions (you may, for example, do multiple BEGINs before a COMMIT). There should only ever be 0 or 1 goroutines active/running per vplayer/stallHandler instance. After spending so much time on the investigation/testing/debugging part of the work, I was a bit too eager to mark this as ready for review once my manual test and the new unit test were passing. That was sloppy on my part: I missed the bug in my self-review and I clearly didn't have adequate testing. I apologize. Your thorough review was much appreciated, though, as you caught it. ❤️

I've added comments and a new unit test that demonstrate it now working as intended: 3839e90

I wanted to get it working so that even if I do end up discarding it here we'll still have it in the PR history if we ever want to revive it or reuse it for something else.

With all that being said, let me know what you think. I'm fine with removing it for now, since it was only well into my testing and debugging that I realized the existing heartbeat mechanism, combined with the vreplicationMinimumHeartbeatUpdateInterval flag/variable, was already supposed to be serving the role of detecting stalls. The problem was a bug in that handling: we cleared the pending heartbeat counter before we had successfully saved/recorded the heartbeat (we assumed it would be immediately successful). So now that I'm well aware of it and have also confirmed that the mechanism is working as it should (it's used in the ROW based test case in the new TestPlayerStalls unit test) I think that the stallHandler mechanism is at least largely redundant. If I'm being honest, the main reason I would not want to delete it is that I spent some time on it. 🙂 That's obviously not a great reason to keep it, and now that it's working and saved somewhere (the commit history here), I'm totally fine with removing it if you and others prefer that. Thanks again!


shlomi-noach · 2024-05-27T11:59:52Z

> I think that the stallHandler mechanism is at least largely redundant. If I'm being honest, the main reason I would not want to delete it is that I spent some time on it.

@mattlord now that it's been said, it cannot be unsaid... :)
Looking at the PR changes, the stall handler takes a considerable part of the actual logic changes. So I'd be curious to see a solution where the stall handler does not exist.


mattlord · 2024-05-28T16:52:06Z

@rohit-nayak-ps and @shlomi-noach this removes all of the stallHandler related code: c666b14

Without it, we lose the ability to perform out-of-band monitoring and error reporting. This means that the heartbeat method will only detect a stall when it is due to a failure to commit the transaction that updates the timestamp for the workflow (whether that was done on its own or as part of replicating user-generated events).

If you both prefer that then I'll update the PR description accordingly. Thanks!

shlomi-noach

@mattlord thank you, I think we should go with this change, seeing that it's so succinct.

rohit-nayak-ps · 2024-05-29T09:40:19Z

> @mattlord thank you, I think we should go with this change, seeing that it's so succinct.

Same here. Thanks @mattlord.

mattlord added this to In progress in VReplication via automation Apr 25, 2024

github-actions bot added this to the v20.0.0 milestone Apr 25, 2024

mattlord force-pushed the vplayer_batch_trx_timeout branch 2 times, most recently from fe9ed6d to 03a5b03 Compare April 25, 2024 16:03

Improve handling of vplayer stalls

bee30d4

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the vplayer_batch_trx_timeout branch from 03a5b03 to bee30d4 Compare April 25, 2024 16:09

mattlord added 3 commits May 15, 2024 22:06

Add progress timer and test

6e929fe

Signed-off-by: Matt Lord <mattalord@gmail.com>

Fix table name logging

5f05388

Signed-off-by: Matt Lord <mattalord@gmail.com>

Make error handling concurrency safe

e9e5fc5

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord added the Type: Enhancement label (logical improvement, somewhere between a bug and feature) and removed the Type: Bug label May 16, 2024

mattlord added 3 commits May 16, 2024 11:07

More tweaks from self review

b72bfbe

Signed-off-by: Matt Lord <mattalord@gmail.com>

Lower progress timeout as the overall stall time is a multiple

ae275a5

Signed-off-by: Matt Lord <mattalord@gmail.com>

Move progress tracking to vplayer

99e2d86

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the vplayer_batch_trx_timeout branch from cb5fc2e to b78059d Compare May 17, 2024 00:31

Only use the stall detection for replicated user events

d4517f2

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the vplayer_batch_trx_timeout branch from b78059d to d4517f2 Compare May 17, 2024 01:27

Merge remote-tracking branch 'origin/main' into vplayer_batch_trx_timeout

c54c507

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord requested review from ajm188 and notfelineit as code owners May 24, 2024 20:56

mattlord requested a review from GuptaManan100 as a code owner May 24, 2024 23:19

WiP

fb0f29e

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the vplayer_batch_trx_timeout branch from 2e54673 to fb0f29e Compare May 25, 2024 05:57

Revert move to workflow option work

b060e09

Signed-off-by: Matt Lord <mattalord@gmail.com>

Reapply vplayer improvement reverted from flag work

bfd67e5

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord removed request for ajm188, harshit-gangal, notfelineit and GuptaManan100 May 25, 2024 06:19

Reapply one other comment improvement

eeba51b

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the vplayer_batch_trx_timeout branch from 6050c19 to d9a276c Compare May 26, 2024 05:48

Add leak checking unit test

3839e90

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the vplayer_batch_trx_timeout branch from d9a276c to 3839e90 Compare May 26, 2024 06:01

mattlord force-pushed the vplayer_batch_trx_timeout branch from 6f9d494 to d1f65b6 Compare May 26, 2024 14:10

Add a few more expected TestMain goroutines to ignore list

d2fea8d

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the vplayer_batch_trx_timeout branch from d1f65b6 to d2fea8d Compare May 26, 2024 14:44

Remove stallHandler related code

c666b14

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the vplayer_batch_trx_timeout branch from 3ebfc1d to c666b14 Compare May 28, 2024 16:48

shlomi-noach reviewed May 29, 2024


shlomi-noach approved these changes May 29, 2024

