Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Schedule tasks by timestamp #18508

Merged
merged 15 commits into from
May 27, 2024
Merged

Conversation

korthout
Copy link
Member

@korthout korthout commented May 14, 2024

Description

Unflake tests that rely on time travel to trigger timers.

This was unstable because the due date checker was scheduled with a delay, which is used to determine an actual deadline later during the actor's scheduling of the TimerSubscription. The instability arose because the task can be scheduled async, which means that the deadline was calculated async as well, so it could happen that the deadline was determined after the test had already traveled through time.

This problem is resolved by introducing a new runAt method, which allows scheduling a task at a specific timestamp. When the async scheduling is postponed until after the time traveling, the task will trigger immediately if the timestamp is now in the past.

This changes the following modules:

  • schedule introduces a new runAt method that can schedule Timer tasks (TimerSubscription) by timestamp
  • stream-platform introduces a new runAt method for the engine to schedule tasks at a timestamp that can produce processing results
  • engine uses the new runAt method to (re)schedule the DueDateChecker

Related issues

closes #17254
closes #10169
closes #18462

If the scheduling service is not yet up-and-running and it is used to
schedule a task, the task is simply ignored and nothing is scheduled.

While that could make sense in some cases, it could also hide a bug. It
therefore makes sense to log such cases at warn level.
The implementation of the TimerSubscription is specific to delayed
timers, where the scheduling in ActorTimerQueue determines the actual
timestamp to use.

I want to add a new TimerSubscription implementation that isn't based on
delays but simply uses a static timestamp as deadline.

Initially, I looked into replacing the delay code of the
TimerSubscription, but this isn't possible as it also allows recurring
timers using the same delay for each trigger.

Therefore, I chose to extract an interface, allowing me to more
re-implement the class differently without affecting existing behavior.
The deadline field of this class is actually a delay rather than a
timestamp. To clarify that the deadline can be calculated from it, it
makes sense to rename it.
The ActorTimerQueue was specific to DelayedTimerSubscription because
when scheduling, it would consider deadline to be a delay. However, if
we want to add other TimerSubscription implementations that are not
delay based, we'll need the TimerSubscription to be able to determine
the exact deadline itself.

In order to do so DelayedTimerSubscription needs the actor clock to
determine the deadline. It's okay to pass the clock as a parameter, even
though other implementation may not need it. In the end, we're just
asking for the moment it should be run.
This adds a new feature to the ActorControl: scheduling a timer task at
a specific timestamp.

Note that due to the way the ActorTimerQueue is implemented
(DeadlineTimerWheel), there is no guarantee that the task is run AT the
timestamp.
It's no longer a rule, but I didn't want to clutter the previous commit
with this rename.
I want to add more tests, and keep the class a bit organized
Inline the scheduleTimer methods. It only deduplicated a single line.
I'd rather remove the indirection.
Mostly just a copy of the DelayedTimerTests, but adjusted for running at
a timestamp.
Adds the ability to the stream-platform's ProcessingScheduleService to
run tasks at a timestamp. The execution is not guaranteed to happen
exactly at, but atleast not before the timestamp.

Inline with other task scheduling abilities of the
ProcessingScheduleService, this adds a runAtAsync method as well. It
functions similarly to the other async task scheduling capabilities.
This is to unflake tests that expect to trigger a timer through the due
date checker after traveling through time.

This was unstable because the due date checker was scheduled with a
delay, where this delay is used to determine an actual deadline at a
later point in time (during scheduling of the TimerSubscription by the
actor). The unstability arose because the task can be scheduled async
which means that the deadline was calculated async as well, so it could
happen that the deadline was determined after the test had already
travelled through time.

This problem is resolved by using the new runAt method, which allows
scheduling a task at a specific timestamp. When the scheduling is
postponed until after the time traveling, the task will trigger
immediately if the timestamp is now in the past.
@github-actions github-actions bot added the component/zeebe Related to the Zeebe component/team label May 14, 2024
@korthout korthout marked this pull request as ready for review May 14, 2024 16:24
@korthout korthout requested a review from a team May 14, 2024 16:24
@korthout korthout changed the title Schedule processing tasks by timestamp Schedule tasks by timestamp May 14, 2024
@Zelldon
Copy link
Member

Zelldon commented May 17, 2024

I haven't reviewed the PR yet sorry @korthout just to get it right and playing a bit devils advocate here.

  1. That is not an actual production issues right?
  2. We just do this for our test cases in the engine is this correct?
  3. Wouldn't we also solve this if we would use larger timeouts?
  4. Wouldn't this be also solved if the Engine "unit" tests would be decoupled from the platform like proposed here Proposal: Improve/Iterate over Engine Tests #10004
  5. Meaning providing its own simple scheduling and simple implementation (as JUNIt extension) that just gives the follow-up records into the engine again after being produced? 🤔

@korthout
Copy link
Member Author

All good @Zelldon. I'm also interested in @oleschoenburg's perspective when they return. Let me try to answer your questions in the meantime:

  1. That is not an actual production issues right?

This is meant to resolve a testing issue. However, the change does affect production behavior.

Since the actor scheduler was previously unable to schedule tasks at a specific timestamp, a small delay was always added on top of the requested delay when scheduling occurred async (i.e., the time between requesting the scheduling and executing this request). Note that async scheduling of processing tasks is the default behavior.

While this delay is typically tiny or unnoticeable, it could lead to additional delays in triggering timer events. Using the timestamp of a timer event is more accurate than calculating a delay from it and then postponing when that delay is applied to determine a timestamp again.

Using a timestamp leads to more accurate timer event triggering.

  1. We just do this for our test cases in the engine is this correct?

See above answer.

  1. Wouldn't we also solve this if we would use larger timeouts?

Yes. But this means the test also takes more time to run when it encounters the time travel race condition.

In the same way, the tests can be made to pass by reducing the duration of the timer events in the test cases. This has the same downside.

An additional downside for both cases is that we have to change our test cases specifically for this implementation limitation. I'd rather ensure we have tests for short and long-living timers as they both are reasonable use cases.

  1. Wouldn't this be also solved if the Engine "unit" tests would be decoupled from the platform like proposed here Proposal: Improve/Iterate over Engine Tests #10004

Absolutely, if the processing scheduling tasks are not scheduled asynchronously, this problem doesn't exist. However, on production we do schedule processing tasks asynchronously, so we may want to also make that available in our tests regardless.

I am in favor of decoupling the engine tests from the rest of the platform and immediately taking a JUnit5 approach. That does not stop us from adding this functionality as well. I see value in this pull request because it removes an entire class of flaky tests while also making timer event triggers more accurate on production.

  1. Meaning providing its own simple scheduling and simple implementation (as JUNIt extension) that just gives the follow-up records into the engine again after being produced? 🤔

See above answer.

Copy link
Member

@oleschoenburg oleschoenburg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thanks for tackling this 🥇 Before a do a full review, I'd like to suggest three things:

🔧 I understand that you split TimerSubscription into Delayed and Stamped because recurring timers need to reschedule with the same delay again. I feel like that's a lot of code duplication and would suggest a slightly different approach where a single TimerSubscription treats deadline as a fixed point in time or a delay based on isRecurring.

🔧 Even without the refactoring suggested above, I'd like to always use the Stamped variant except for recurring timers. The old ActorControl#schedule method can take care of converting a delay into a fixed timestamp.

🔧 Remove the old runDelayed methods from ProcessingScheduleService. Let's try and keep that interface small.

@korthout
Copy link
Member Author

Thanks @oleschoenburg 👍 I think all of these are valuable but also a lot of extra effort.

I chose not to change the behavior of parts I don't know. Therefore, I added duplication between the two types of timer subscriptions. If you think this can all be unduplicated again, then I'd hope that someone a bit closer to the scheduler can do this.

I'd be willing to contribute to removing the undelayed from the ProcessingScheduleService as this would directly affect the engine. However, I wonder whether we want to expand this already large pull request. Shall I open an issue for this instead?

My hope is that we can unflake these tests soon.

Copy link
Member

@oleschoenburg oleschoenburg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, then let's merge it as is and maybe tackle my suggestions as follow-up 👍

@korthout
Copy link
Member Author

Thanks @oleschoenburg 🙇

As this resolves flakiness, should we backport it? 🤔

@oleschoenburg
Copy link
Member

I'd say we can at least try it out.

@korthout korthout added backport stable/8.2 Backport a pull request to 8.2.x backport stable/8.3 Backport a pull request to 8.3.x backport stable/8.4 Backport a pull request to 8.4.x backport stable/8.5 Backport a pull request to stable/8.5 labels May 23, 2024
@korthout korthout added this pull request to the merge queue May 23, 2024
This name is more fitting as we don't guarantee that the scheduled task
is ran at the provided timestamp. The only guarantee we provide is that
it is ran at some point at or after the timestamp.
@korthout korthout force-pushed the korthout-17254-schedule-by-timestamp branch from 8666086 to 176af26 Compare May 24, 2024 09:30
This reverts commit 176af26.

It's not necessary to be so explicit, as schedulers can never really
achieve running at a specific timestamp.
@korthout korthout enabled auto-merge May 24, 2024 11:48
@korthout korthout added this pull request to the merge queue May 24, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 24, 2024
@korthout korthout added this pull request to the merge queue May 24, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 24, 2024
@korthout korthout added this pull request to the merge queue May 27, 2024
Merged via the queue into main with commit 3cf914c May 27, 2024
38 checks passed
@korthout korthout deleted the korthout-17254-schedule-by-timestamp branch May 27, 2024 08:53
@backport-action
Copy link
Collaborator

Backport failed for stable/8.2, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin stable/8.2
git worktree add -d .worktree/backport-18508-to-stable/8.2 origin/stable/8.2
cd .worktree/backport-18508-to-stable/8.2
git switch --create backport-18508-to-stable/8.2
git cherry-pick -x baea47045d49041773604d5be3f814fbe00a7488 5d7066c73e1b47d583ba5cac741b95fc74d4a523 1bbd61ed54a4ed2d7fc1151cbcd26661b3d3acdf bf81f0bb7ce1fd785cb2441a8a41510ea71cc6cd 05a3b3fc763947b0dcadc7dae74aa2de40d64ae3 b88304339a41b06884a1d34a3377f1600bfc1ac6 96a818bc79fb11f74a2ee6f93ca5f3de1792f893 eefa7c7092b432fca96de7f6414c0c9c3abb6641 c1c47a6bfc3ab7dd2c6c3edc5e82c318285c8ae7 e0b8c00083ef98372b06dbcb1d2093fb15bb2bc5 a31d0e42fb8bac4594584347ea657133143cbe1e 41ac7ff200fe435b34e2794bb7601b197dd250c4 5a0e14339770a63780ed6ff61dd5e4d78b4984a6 176af26a00e91d690f30c25de00cbca67dd27a8a 893875c30427416cdbce3f82e208176306257da8

@backport-action
Copy link
Collaborator

Backport failed for stable/8.3, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin stable/8.3
git worktree add -d .worktree/backport-18508-to-stable/8.3 origin/stable/8.3
cd .worktree/backport-18508-to-stable/8.3
git switch --create backport-18508-to-stable/8.3
git cherry-pick -x baea47045d49041773604d5be3f814fbe00a7488 5d7066c73e1b47d583ba5cac741b95fc74d4a523 1bbd61ed54a4ed2d7fc1151cbcd26661b3d3acdf bf81f0bb7ce1fd785cb2441a8a41510ea71cc6cd 05a3b3fc763947b0dcadc7dae74aa2de40d64ae3 b88304339a41b06884a1d34a3377f1600bfc1ac6 96a818bc79fb11f74a2ee6f93ca5f3de1792f893 eefa7c7092b432fca96de7f6414c0c9c3abb6641 c1c47a6bfc3ab7dd2c6c3edc5e82c318285c8ae7 e0b8c00083ef98372b06dbcb1d2093fb15bb2bc5 a31d0e42fb8bac4594584347ea657133143cbe1e 41ac7ff200fe435b34e2794bb7601b197dd250c4 5a0e14339770a63780ed6ff61dd5e4d78b4984a6 176af26a00e91d690f30c25de00cbca67dd27a8a 893875c30427416cdbce3f82e208176306257da8

@backport-action
Copy link
Collaborator

Backport failed for stable/8.4, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin stable/8.4
git worktree add -d .worktree/backport-18508-to-stable/8.4 origin/stable/8.4
cd .worktree/backport-18508-to-stable/8.4
git switch --create backport-18508-to-stable/8.4
git cherry-pick -x baea47045d49041773604d5be3f814fbe00a7488 5d7066c73e1b47d583ba5cac741b95fc74d4a523 1bbd61ed54a4ed2d7fc1151cbcd26661b3d3acdf bf81f0bb7ce1fd785cb2441a8a41510ea71cc6cd 05a3b3fc763947b0dcadc7dae74aa2de40d64ae3 b88304339a41b06884a1d34a3377f1600bfc1ac6 96a818bc79fb11f74a2ee6f93ca5f3de1792f893 eefa7c7092b432fca96de7f6414c0c9c3abb6641 c1c47a6bfc3ab7dd2c6c3edc5e82c318285c8ae7 e0b8c00083ef98372b06dbcb1d2093fb15bb2bc5 a31d0e42fb8bac4594584347ea657133143cbe1e 41ac7ff200fe435b34e2794bb7601b197dd250c4 5a0e14339770a63780ed6ff61dd5e4d78b4984a6 176af26a00e91d690f30c25de00cbca67dd27a8a 893875c30427416cdbce3f82e208176306257da8

@backport-action
Copy link
Collaborator

Successfully created backport PR for stable/8.5:

@korthout
Copy link
Member Author

I'll retry the backport once #18860 is merged, to reduce the manual effort.

@korthout
Copy link
Member Author

/backport

@backport-action
Copy link
Collaborator

Created backport PR for stable/8.2:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-18508-to-stable/8.2
git worktree add --checkout .worktree/backport-18508-to-stable/8.2 origin/stable/8.2
cd .worktree/backport-18508-to-stable/8.2
git reset --hard HEAD^
git cherry-pick -x 5d7066c73e1b47d583ba5cac741b95fc74d4a523 1bbd61ed54a4ed2d7fc1151cbcd26661b3d3acdf bf81f0bb7ce1fd785cb2441a8a41510ea71cc6cd 05a3b3fc763947b0dcadc7dae74aa2de40d64ae3 b88304339a41b06884a1d34a3377f1600bfc1ac6 96a818bc79fb11f74a2ee6f93ca5f3de1792f893 eefa7c7092b432fca96de7f6414c0c9c3abb6641 c1c47a6bfc3ab7dd2c6c3edc5e82c318285c8ae7 e0b8c00083ef98372b06dbcb1d2093fb15bb2bc5 a31d0e42fb8bac4594584347ea657133143cbe1e 41ac7ff200fe435b34e2794bb7601b197dd250c4 5a0e14339770a63780ed6ff61dd5e4d78b4984a6 176af26a00e91d690f30c25de00cbca67dd27a8a 893875c30427416cdbce3f82e208176306257da8
git push --force-with-lease

@backport-action
Copy link
Collaborator

Created backport PR for stable/8.3:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-18508-to-stable/8.3
git worktree add --checkout .worktree/backport-18508-to-stable/8.3 origin/stable/8.3
cd .worktree/backport-18508-to-stable/8.3
git reset --hard HEAD^
git cherry-pick -x 5d7066c73e1b47d583ba5cac741b95fc74d4a523 1bbd61ed54a4ed2d7fc1151cbcd26661b3d3acdf bf81f0bb7ce1fd785cb2441a8a41510ea71cc6cd 05a3b3fc763947b0dcadc7dae74aa2de40d64ae3 b88304339a41b06884a1d34a3377f1600bfc1ac6 96a818bc79fb11f74a2ee6f93ca5f3de1792f893 eefa7c7092b432fca96de7f6414c0c9c3abb6641 c1c47a6bfc3ab7dd2c6c3edc5e82c318285c8ae7 e0b8c00083ef98372b06dbcb1d2093fb15bb2bc5 a31d0e42fb8bac4594584347ea657133143cbe1e 41ac7ff200fe435b34e2794bb7601b197dd250c4 5a0e14339770a63780ed6ff61dd5e4d78b4984a6 176af26a00e91d690f30c25de00cbca67dd27a8a 893875c30427416cdbce3f82e208176306257da8
git push --force-with-lease

@backport-action
Copy link
Collaborator

Created backport PR for stable/8.4:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-18508-to-stable/8.4
git worktree add --checkout .worktree/backport-18508-to-stable/8.4 origin/stable/8.4
cd .worktree/backport-18508-to-stable/8.4
git reset --hard HEAD^
git cherry-pick -x 5d7066c73e1b47d583ba5cac741b95fc74d4a523 1bbd61ed54a4ed2d7fc1151cbcd26661b3d3acdf bf81f0bb7ce1fd785cb2441a8a41510ea71cc6cd 05a3b3fc763947b0dcadc7dae74aa2de40d64ae3 b88304339a41b06884a1d34a3377f1600bfc1ac6 96a818bc79fb11f74a2ee6f93ca5f3de1792f893 eefa7c7092b432fca96de7f6414c0c9c3abb6641 c1c47a6bfc3ab7dd2c6c3edc5e82c318285c8ae7 e0b8c00083ef98372b06dbcb1d2093fb15bb2bc5 a31d0e42fb8bac4594584347ea657133143cbe1e 41ac7ff200fe435b34e2794bb7601b197dd250c4 5a0e14339770a63780ed6ff61dd5e4d78b4984a6 176af26a00e91d690f30c25de00cbca67dd27a8a 893875c30427416cdbce3f82e208176306257da8
git push --force-with-lease

@backport-action
Copy link
Collaborator

Git push to origin failed for stable/8.5 with exitcode 1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport stable/8.2 Backport a pull request to 8.2.x backport stable/8.3 Backport a pull request to 8.3.x backport stable/8.4 Backport a pull request to 8.4.x backport stable/8.5 Backport a pull request to stable/8.5 component/zeebe Related to the Zeebe component/team
Projects
None yet
4 participants