Possibility for repairs to never be triggered #264

masokol · 2022-02-10T14:00:53Z

Since ecchronos assumes tables are repaired when there's no repair history, it's possible that repairs will never be triggered if ecchronos restarts/crashes once every repair interval before repair is actually triggered.
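
A minimal sketch of the scenario described above, not ecChronos code: `next_repair` and `count_repairs` are hypothetical names, and the model simply assumes that an empty history is treated as "repaired now" on every start.

```python
from datetime import datetime, timedelta


def next_repair(last_repaired, now, interval):
    # Hypothetical model: with no history the table is assumed repaired "now".
    base = last_repaired if last_repaired is not None else now
    return base + interval


def count_repairs(interval, uptime, restarts, start):
    """Simulate `restarts` process lifetimes, each lasting `uptime`."""
    last_repaired = None  # empty repair history
    repairs = 0
    now = start
    for _ in range(restarts):
        # The due time is recomputed from scratch on every start.
        due = next_repair(last_repaired, now, interval)
        stop = now + uptime
        if due <= stop:  # the process lived long enough to reach the due time
            last_repaired = due
            repairs += 1
        now = stop  # crash/restart
    return repairs


start = datetime(2024, 1, 1)
week = timedelta(days=7)
# Restarting once every 6 days with a 7-day interval never reaches a due time.
never = count_repairs(week, timedelta(days=6), restarts=10, start=start)
# With 8 days of uptime each lifetime reaches its due time.
healthy = count_repairs(week, timedelta(days=8), restarts=10, start=start)
print(never, healthy)
```

In this model the first call reports no repairs at all, however many restarts are simulated, because nothing is ever written to the history.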

VictorCavichioli · 2024-01-30T05:39:48Z

Possible TestCase:

Configure the ecChronos schedule interval to use minutes instead of days (the default configuration).
Start ecChronos with automatic repair enabled.
During the schedule interval, force a restart. We do this in two steps, killing the Java process and then starting it again:
Kill the Java process
```
kill -15 <pid>
```
Restart ecChronos using ecChronos binary jar
```
nohup java -jar ecchronos-binary-5.0.1-SNAPSHOT.jar > /dev/null 2>&1 &
```

But if you're running in a container you can just execute

docker restart <container_name/id>

Check the repair schedule

Some considerations:

I believe this case of ecChronos restarting during the interval is valid, but I don't see how all ecChronos instances could always be restarting. If one is, the others will still be running repairs on their nodes, and repair_history will receive data.

DanielwEriksson · 2024-02-21T11:02:01Z

Where can this jar be found? It does not exist in my environment.

epkdaek@elx721027t9:/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$ pwd
/home/epkdaek/cassandra/ecchronos-binary-5.0.1-SNAPSHOT
epkdaek@elx721027t9:/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$ find . -name ec*.jar
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$

epkdaek@elx721027t9:/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$ ls lib/e*
lib/ecchronos-binary-5.0.1-SNAPSHOT.pom lib/error_prone_annotations-2.18.0.jar
epkdaek@elx721027t9:/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$

DanielwEriksson · 2024-02-21T11:11:37Z

Should the restart suggested to trigger the scenario be equivalent to starting ecc via

./bin/ecctool start -f

then stopping it via Ctrl-C and starting it again with the same command?

I assume the change from days to minutes is made in /conf/ecc.yml,
but in which part? This section?
repair:
  ##
  ## A class for providing repair configuration for tables.
  ## The default FileBasedRepairConfiguration uses a schedule.yml file to define per-table configurations.
  ##
  provider: com.ericsson.bss.cassandra.ecchronos.application.FileBasedRepairConfiguration
  ##
  ## How often repairs should be triggered for tables.
  ##
  interval:
    time: 7
    unit: days
?

VictorCavichioli · 2024-02-23T13:31:58Z

Yes, I figured out that it really does not exist. You can try it the way you suggested.

DanielwEriksson · 2024-02-26T10:09:34Z

After discussion with the author, this is how to reproduce the issue.

Run "ecctool schedules" and note the "Completed at" and "Next repair" values:

epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$ ./bin/ecctool schedules
Snapshot as of 2024-02-26 10:48:55

| Id | Keyspace | Table | Status | Repaired(%) | Completed at | Next repair | Repair type |

| 1c2acb70-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock | COMPLETED | 100.00 | 2024-02-26 09:59:30 | 2024-03-04 09:58:25 | VNODE |
| 1ca2e1a0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | on_demand_repair_status | COMPLETED | 100.00 | 2024-02-26 10:01:22 | 2024-03-04 10:00:15 | VNODE |
| 1c6674e0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | reject_configuration | COMPLETED | 100.00 | 2024-02-26 10:03:15 | 2024-03-04 10:02:08 | VNODE |
| 1c4a8870-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock_priority | COMPLETED | 100.00 | 2024-02-26 10:05:09 | 2024-03-04 10:04:01 | VNODE |
| 1cbc0ef0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | repair_history | COMPLETED | 100.00 | 2024-02-26 10:07:03 | 2024-03-04 10:01:38 | VNODE |

Summary: 5 completed, 0 on time, 0 blocked, 0 late, 0 overdue
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$

Then truncate repair_history, restart ecChronos, and run "ecctool schedules" again:

epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$ ./bin/ecctool schedules
Snapshot as of 2024-02-26 10:50:42

| Id | Keyspace | Table | Status | Repaired(%) | Completed at | Next repair | Repair type |

| 1c2acb70-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock | COMPLETED | 100.00 | 2024-02-20 10:50:13 | 2024-02-27 10:50:13 | VNODE |
| 1c4a8870-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock_priority | COMPLETED | 100.00 | 2024-02-20 10:50:13 | 2024-02-27 10:50:13 | VNODE |
| 1ca2e1a0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | on_demand_repair_status | COMPLETED | 100.00 | 2024-02-20 10:50:14 | 2024-02-27 10:50:14 | VNODE |
| 1c6674e0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | reject_configuration | COMPLETED | 100.00 | 2024-02-20 10:50:14 | 2024-02-27 10:50:14 | VNODE |
| 1cbc0ef0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | repair_history | COMPLETED | 100.00 | 2024-02-20 10:50:14 | 2024-02-27 10:50:14 | VNODE |

Summary: 5 completed, 0 on time, 0 blocked, 0 late, 0 overdue
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$

After every restart these times are recalculated without the repair job being executed.
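
As a quick check of that observation, the timestamps of the ecchronos.lock row in the two snapshots can be compared directly. This is only arithmetic on the values printed above; the variable names are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def parse(ts):
    return datetime.strptime(ts, FMT)


# Values copied from the ecchronos.lock row in the two snapshots above.
before = {"completed": parse("2024-02-26 09:59:30"), "next": parse("2024-03-04 09:58:25")}
after = {"completed": parse("2024-02-20 10:50:13"), "next": parse("2024-02-27 10:50:13")}
snapshot_after = parse("2024-02-26 10:50:42")

# After the restart the two columns are exactly one interval apart.
window_after = after["next"] - after["completed"]
# "Completed at" now lies several days before the snapshot was taken.
backdated_by = snapshot_after - after["completed"]
# "Completed at" moved backwards compared with the first snapshot.
moved_back = after["completed"] < before["completed"]

print(window_after, backdated_by, moved_back)
```

After the restart, "Next repair" minus "Completed at" is exactly 7 days, and "Completed at" is earlier than the value reported before the truncate. That is consistent with both values being derived at startup rather than read from a recorded repair.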

DanielwEriksson · 2024-02-27T11:06:14Z

The repair jobs seem to execute just fine. I changed the repair schedule to 10 minutes in ecc.yml before starting.

epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$ ./bin/ecctool start -f
ecc started with pid 1066218

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::               (v2.7.17)

11:46:28.754 [main] INFO c.e.b.c.e.a.spring.SpringBooter - Starting SpringBooter using Java 11.0.21 on elx721027t9 with PID 1066218 (/home/epkdaek/cassandra/ecchronos-binary-5.0.0-SNAPSHOT/lib/application-5.0.0-SNAPSHOT.jar started by epkdaek in /home/epkdaek/cassandra/ecchronos-binary-5.0.0-SNAPSHOT)
11:46:28.757 [main] INFO c.e.b.c.e.a.spring.SpringBooter - No active profile set, falling back to 1 default profile: "default"
11:46:30.103 [main] INFO o.s.b.w.e.tomcat.TomcatWebServer - Tomcat initialized with port(s): 8080 (http)
11:46:30.112 [main] INFO o.a.coyote.http11.Http11NioProtocol - Initializing ProtocolHandler ["http-nio-127.0.0.1-8080"]
11:46:30.114 [main] INFO o.a.catalina.core.StandardService - Starting service [Tomcat]
11:46:30.114 [main] INFO o.a.catalina.core.StandardEngine - Starting Servlet engine: [Apache Tomcat/9.0.82]
11:46:30.241 [main] INFO o.a.c.c.C.[Tomcat].[localhost].[/] - Initializing Spring embedded WebApplicationContext
11:46:30.242 [main] INFO o.s.b.w.s.c.ServletWebServerApplicationContext - Root WebApplicationContext: initialization completed in 1429 ms
11:46:30.487 [main] INFO c.e.b.c.e.a.DefaultNativeConnectionProvider - Connecting through CQL using localhost:9042, authentication: false, tls: false
11:46:33.472 [s1-admin-0] INFO c.e.b.c.e.c.DataCenterAwarePolicy - Using provided data-center name 'datacenter1' for DataCenterAwareLoadBalancingPolicy
11:46:33.487 [main] INFO c.e.b.c.e.a.DefaultJmxConnectionProvider - Connecting through JMX using localhost:7100, authentication: false, tls: false
11:46:33.728 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.lock is new, next repair 2024-02-27 11:56:33
11:46:33.784 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.lock_priority is new, next repair 2024-02-27 11:56:33
11:46:33.818 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.on_demand_repair_status is new, next repair 2024-02-27 11:56:33
11:46:33.870 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.reject_configuration is new, next repair 2024-02-27 11:56:33
11:46:33.895 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.repair_history is new, next repair 2024-02-27 11:56:33
11:46:34.283 [main] INFO o.s.b.a.e.web.EndpointLinksResolver - Exposing 1 endpoint(s) beneath base path '/actuator'
11:46:34.310 [main] INFO o.a.coyote.http11.Http11NioProtocol - Starting ProtocolHandler ["http-nio-127.0.0.1-8080"]
11:46:34.325 [main] INFO o.s.b.w.e.tomcat.TomcatWebServer - Tomcat started on port(s): 8080 (http) with context path ''
11:46:34.344 [main] INFO c.e.b.c.e.a.spring.SpringBooter - Started SpringBooter in 6.0 seconds (JVM running for 6.44)
11:46:42.231 [http-nio-127.0.0.1-8080-exec-1] INFO o.a.c.c.C.[Tomcat].[localhost].[/] - Initializing Spring DispatcherServlet 'dispatcherServlet'
11:46:42.232 [http-nio-127.0.0.1-8080-exec-1] INFO o.s.web.servlet.DispatcherServlet - Initializing Servlet 'dispatcherServlet'
11:46:42.233 [http-nio-127.0.0.1-8080-exec-1] INFO o.s.web.servlet.DispatcherServlet - Completed initialization in 1 ms
11:57:03.807 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.lock
11:58:39.425 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.lock_priority
12:00:13.321 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.on_demand_repair_status
12:01:45.794 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.reject_configuration
12:03:18.858 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.repair_history
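
To read the gap between the planned and the actual start out of a log like the one above, a small parser along these lines could be used. It is a sketch only: the two regular expressions are written against the message texts shown in this log, `repair_delays` is a hypothetical helper, and it assumes the task runs on the same calendar day as the planned time.

```python
import re
from datetime import datetime, timedelta

ASSUME = re.compile(
    r"Assuming the table (\S+) is new, next repair (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)
RUN = re.compile(r"^(\d{2}:\d{2}:\d{2})\.\d{3} .*Running task: VNODE repair group of (\S+)")

# Two lines copied from the log above.
LOG = """\
11:46:33.728 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.lock is new, next repair 2024-02-27 11:56:33
11:57:03.807 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.lock
"""


def repair_delays(log):
    """Map each table to (actual task start - planned next repair)."""
    planned, delays = {}, {}
    for line in log.splitlines():
        match = ASSUME.search(line)
        if match:
            planned[match.group(1)] = datetime.strptime(match.group(2), "%Y-%m-%d %H:%M:%S")
            continue
        match = RUN.search(line)
        if match and match.group(2) in planned:
            due = planned[match.group(2)]
            # The log prefix has no date, so reuse the date of the planned time.
            clock = datetime.strptime(match.group(1), "%H:%M:%S").time()
            delays[match.group(2)] = datetime.combine(due.date(), clock) - due
    return delays


print(repair_delays(LOG))
```

For ecchronos.lock the task starts 30 seconds after the planned time, which matches the observation that the repair does run when the process stays up for the whole 10-minute interval.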

jwaeab · 2024-02-28T09:31:25Z

Since ecchronos always assumes a repair is successful if the history is empty, I don't see why this would be considered a bug if ecchronos is crashed/restarted before the interval is reached - once every interval. Sounds to me the actual bug to investigate is why ecchronos crashes/restarts all the time. ;-)

masokol · 2024-02-28T12:59:42Z

> Since ecchronos always assumes a repair is successful if the history is empty, I don't see why this would be considered a bug if ecchronos is crashed/restarted before the interval is reached - once every interval. Sounds to me the actual bug to investigate is why ecchronos crashes/restarts all the time. ;-)

Whether you consider this a bug or an enhancement doesn't matter. This is a scenario that can occur in the real world, and you won't even get any alarms, since ecChronos still thinks everything is repaired because the repair history is empty.

DanielwEriksson · 2024-02-28T13:17:33Z

> > Since ecchronos always assumes a repair is successful if the history is empty, I don't see why this would be considered a bug if ecchronos is crashed/restarted before the interval is reached - once every interval. Sounds to me the actual bug to investigate is why ecchronos crashes/restarts all the time. ;-)
>
> Whether you consider this a bug or enhancement doesn't matter. This is a scenario that can occur in real world and you won't even get any alarms since ecChronos still thinks everything is repaired because repair history is empty.

What is the expected behaviour?
Right now it expects the next repair to execute at "now" + "configured schedule time", as shown by this log entry:

12:21:13.488 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table

is new, next repair .

jwaeab · 2024-02-29T05:27:48Z

@masokol
Then there are two options, I guess, to fix it:

1. Repair immediately if history is empty
2. Put in a fake repair date in the history - if empty. If the empty table is assumed to reflect a proper repair, why not populate it with history reflecting the same? As a matter of fact, as soon as a table is detected as new, it should populate the history with the current date.

DanielwEriksson · 2024-02-29T09:03:07Z

#2 is almost how it is done today, except that the date is recalculated if ecChronos is restarted again.
Who can decide between #1 and #2, or whether there are more options?

jwaeab · 2024-02-29T09:44:56Z

Yes, and that's masokol's point; i.e. the next repair date will keep moving forward if the history is empty and ecchronos is restarted.
My two cents: if an empty repair history is assumed to mean "repaired", ecchronos should probably timestamp it as such in the history, unless there are other problems with that.

masokol · 2024-03-01T08:20:37Z

> @masokol Then there are two options, I guess, to fix it;
>
> 1. Repair immediately if history is empty
> 2. Put in a fake repair date in the history - if empty. If the empty table is assumed to reflect a proper repair, why not populate it with history reflecting the same? As a matter of fact, as soon as a table is detected as new, it should populate the history with the current date.

From what I've understood, the assumption that everything is repaired if the repair history is empty was made to avoid option 1. So maybe option 2, or some completely new solution. One could argue that having an explicit delay, like a repair_delay for each schedule, might be a better option. Anyway, whichever solution you choose, I think ecChronos should know whether it's the first time it's starting or not.

DanielwEriksson · 2024-03-04T12:32:25Z

The working assumption that was decided is that when a new table is found, ecChronos should update the history to record that a repair has been done now, without actually doing the repair, so that there is history information the next time it starts.
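
A rough sketch of that working assumption, under stated assumptions: `RepairHistory` is an in-memory stand-in for the persistent repair history, and `on_start` is a hypothetical name, neither taken from the ecChronos code base.

```python
from datetime import datetime, timedelta


class RepairHistory:
    """In-memory stand-in for the persistent repair history (hypothetical)."""

    def __init__(self):
        self._last = {}

    def last_repaired(self, table):
        return self._last.get(table)

    def record(self, table, when):
        self._last[table] = when


def on_start(history, table, now, interval):
    """Compute the next repair time when the process starts."""
    last = history.last_repaired(table)
    if last is None:
        # New table: record "repaired now" without running a repair,
        # so a later restart finds this entry instead of starting over.
        history.record(table, now)
        last = now
    return last + interval


history = RepairHistory()  # survives the simulated restart below
interval = timedelta(days=7)
t0 = datetime(2024, 3, 4, 12, 0, 0)

first = on_start(history, "ecchronos.lock", t0, interval)
# Restart six days later: the due time is unchanged.
second = on_start(history, "ecchronos.lock", t0 + timedelta(days=6), interval)
print(first, second)
```

Because the entry is written on first sight of the table, the due time stays fixed across restarts instead of moving forward each time.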

more to come

ECCTOOL_EXAMPLES.md remains to be updated

masokol changed the title ~~Possible for repairs to never be triggered~~ Possibility for repairs to never be triggered Feb 10, 2022

masokol added the bug Something isn't working label Feb 22, 2022

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Mar 19, 2024

Possibility for repairs to never be triggered Ericsson#264

95991d3

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Mar 19, 2024

Possibility for repairs to never be triggered Ericsson#264

afb56a4

DanielwEriksson self-assigned this Mar 20, 2024

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Mar 20, 2024

Issue Ericsson#264 - TC issues

d4872b1

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Mar 20, 2024

Issue Ericsson#264 - TC issues

ef915dc

DanielwEriksson closed this as completed Mar 20, 2024

DanielwEriksson reopened this Mar 20, 2024

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 2, 2024

Issue Ericsson#264 - review comments

d3148a7

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 2, 2024

Issue Ericsson#264 - github environment

d8d2822

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 2, 2024

Issue Ericsson#264 - github environment

75b820e

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 3, 2024

Issue Ericsson#264 - missed review comments

23484f0

more to come

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 15, 2024

Issue Ericsson#264 - Review comments

f5d6af0

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 15, 2024

Issue Ericsson#264 - Review comments

485568b

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 15, 2024

Issue Ericsson#264 - Review comments

e4af00f

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 17, 2024

Issue Ericsson#264 - TC fixes

c230afe

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 22, 2024

Issue Ericsson#264 - python TC fixes

dcbf7e7

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 23, 2024

Issue Ericsson#264 - python TC fixes

c2a410e

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 23, 2024

Issue Ericsson#264 - Review comments

ac7845e

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 24, 2024

Issue Ericsson#264 - More review comments

061102f

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 24, 2024

Issue Ericsson#264 - More review comments

65260eb

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 25, 2024

Issue Ericsson#264 - More review comments

8bb1cd5

ECCTOOL_EXAMPLES.md remains to be updated

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 25, 2024

Issue Ericsson#264 - More review comments

1f25162

ECCTOOL_EXAMPLES.md remains to be updated

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 26, 2024

Issue Ericsson#264 - More review comments

6b00df3

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 29, 2024

Issue Ericsson#264 - method naming

d408a43

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 29, 2024

Issue Ericsson#264 - More review comments

b261687

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 29, 2024

Issue Ericsson#264 - More review comments

518707f

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 30, 2024

Issue Ericsson#264 - More review comments

1591fcd

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue May 2, 2024

Issue Ericsson#264 - More review comments

e1326e0

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue May 2, 2024

Issue Ericsson#264 - More review comments

10fcbdf

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue May 2, 2024

Issue Ericsson#264 - More review comments

ec4e415

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue May 3, 2024

Issue Ericsson#264 - More review comments 2024-05-03

2bff986

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue May 6, 2024

Issue Ericsson#264 - More review comments 2024-05-06

b2928f2

jwaeab closed this as completed May 20, 2024
