Terminate failed repair jobs #3806

Open
Michal-Leszczynski opened this issue Apr 17, 2024 · 1 comment

@Michal-Leszczynski
Collaborator

Right now we don't terminate failed repair jobs by default - the problem is that they might have failed because of a timeout on our side and in fact still be running. This causes two problems:

  • In case of a timeout, SM believes that the task has failed and stopped running, so it schedules new repair jobs for the "released" hosts. This can break the rule of one job per host.
  • Repair jobs that were not terminated and keep running after the SM task has ended might make it impossible to retry the SM task until they finish (see https://github.com/scylladb/scylla-enterprise/issues/4055).
Michal-Leszczynski added the bug and repair labels on Apr 17, 2024
@karol-kokoszka
Collaborator

Grooming notes

The goal is to call the Scylla API to kill a repair job whose status check timed out, to ensure that the Scylla server is no longer handling that job.

The timeout for the repair status check is currently set to 30 minutes.
@asias Any idea what the best timeout for waiting on the repair job status would be?
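
To make the intended behavior concrete, here is a minimal sketch of "wait for the status, and kill the job if we time out". Everything in it is an assumption made for illustration: `RepairClient`, `repair_status`, `kill_repair`, the status strings, and the simple polling loop are hypothetical and do not reflect the real Scylla API or SM's actual status check.

```python
import time
from typing import Callable, Protocol


class RepairClient(Protocol):
    """Hypothetical wrapper around the Scylla REST API; not a real library."""

    def repair_status(self, job_id: int) -> str: ...  # "RUNNING", "SUCCESSFUL" or "FAILED"
    def kill_repair(self, job_id: int) -> None: ...


def wait_or_terminate(
    client: RepairClient,
    job_id: int,
    timeout_s: float,
    poll_interval_s: float = 1.0,
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the job status; if it is still running at the deadline, kill it."""
    deadline = now() + timeout_s
    while now() < deadline:
        status = client.repair_status(job_id)
        if status != "RUNNING":
            return status
        sleep(poll_interval_s)
    # Timed out on our side: the job may still be running on the node,
    # so terminate it explicitly before the host is treated as released.
    client.kill_repair(job_id)
    return "TERMINATED"


class StuckJobClient:
    """In-memory stand-in for a node whose repair job never finishes."""

    def __init__(self) -> None:
        self.killed: list[int] = []

    def repair_status(self, job_id: int) -> str:
        return "FAILED" if job_id in self.killed else "RUNNING"

    def kill_repair(self, job_id: int) -> None:
        self.killed.append(job_id)
```

With `StuckJobClient`, a call such as `wait_or_terminate(client, 7, timeout_s=0.05, poll_interval_s=0.01)` ends in the kill path, which is the case this issue is about.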

We need an integration test covering this scenario:

  1. Timeout the repair job
  2. Assert that the job is terminated and no longer running on the Scylla server.
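
The two steps above could take roughly the following shape. This is only a sketch: `FakeScyllaNode` and `run_repair_job` are hypothetical placeholders, and a real integration test would run against a test cluster and the actual SM repair code rather than an in-memory fake.

```python
import unittest


class FakeScyllaNode:
    """Hypothetical in-memory node; a real test would talk to a test cluster."""

    def __init__(self) -> None:
        self.running_jobs: set[int] = set()
        self._next_id = 1

    def start_repair(self) -> int:
        job_id = self._next_id
        self._next_id += 1
        self.running_jobs.add(job_id)
        return job_id

    def wait_for_status(self, job_id: int, timeout_s: float) -> str:
        # Simulates the status check timing out while the job keeps running.
        raise TimeoutError(f"no status for job {job_id} within {timeout_s}s")

    def kill_repair(self, job_id: int) -> None:
        self.running_jobs.discard(job_id)


def run_repair_job(node: FakeScyllaNode, timeout_s: float) -> str:
    """Placeholder for the code under test: kill the job when the check times out."""
    job_id = node.start_repair()
    try:
        return node.wait_for_status(job_id, timeout_s)
    except TimeoutError:
        node.kill_repair(job_id)
        return "TERMINATED"


class TerminateOnTimeoutTest(unittest.TestCase):
    def test_timed_out_job_is_terminated(self) -> None:
        node = FakeScyllaNode()
        # Step 1: time out the repair job.
        result = run_repair_job(node, timeout_s=0.01)
        # Step 2: the job must no longer be running on the node.
        self.assertEqual(result, "TERMINATED")
        self.assertEqual(node.running_jobs, set())
```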

This may require making the timeout value configurable, via YAML or other configuration.
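
As an illustration of a configurable timeout, the sketch below reads the value from an already-loaded configuration mapping and falls back to the current 30 minutes. The option name `status_timeout`, the `RepairConfig` class, and the duration format (`90s`, `30m`, `2h`) are assumptions made for this example, not existing SM settings.

```python
import re
from dataclasses import dataclass

_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert strings such as '90s', '30m' or '2h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


@dataclass(frozen=True)
class RepairConfig:
    # Hypothetical option; the default mirrors the current 30-minute timeout.
    status_timeout_s: int = 30 * 60

    @classmethod
    def from_mapping(cls, raw: dict) -> "RepairConfig":
        """Build the config from a dict, e.g. one loaded from a YAML file."""
        value = raw.get("status_timeout")
        if value is None:
            return cls()
        return cls(status_timeout_s=parse_duration(str(value)))
```

A short timeout set this way would also let the integration test trigger the timeout path quickly instead of waiting 30 minutes.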

The issue describes just a corner case.
