Possibility for repairs to never be triggered #264

Closed
masokol opened this issue Feb 10, 2022 · 14 comments

@masokol
Contributor

masokol commented Feb 10, 2022

Since ecchronos assumes tables are repaired when there's no repair history, it's possible that repairs will never be triggered if ecchronos restarts/crashes once every repair interval before repair is actually triggered.
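
A minimal sketch of this scenario, not ecChronos code. It assumes a simplified rule taken from the description above: with an empty repair history, every startup treats the table as repaired and schedules the next repair one interval after the startup time. The function name `first_repair_time` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def first_repair_time(interval, restart_every, restarts, start):
    """Return when the first repair would run, or None if it never does.

    Simplified model (an assumption): nothing is written to the repair
    history until a repair actually runs, so each startup recalculates
    the next repair as startup time + interval.
    """
    now = start
    for _ in range(restarts):
        next_repair = now + interval        # recalculated on every startup
        next_restart = now + restart_every
        if next_repair <= next_restart:     # the process lives long enough
            return next_repair
        now = next_restart                  # restart with history still empty
    return None


start = datetime(2024, 2, 26, 10, 0, 0)
week = timedelta(days=7)

# Restarting slightly more often than the interval: no repair is ever triggered.
print(first_repair_time(week, timedelta(days=6), restarts=52, start=start))  # None
# Restarting less often than the interval: the repair runs after one interval.
print(first_repair_time(week, timedelta(days=8), restarts=52, start=start))  # 2024-03-04 10:00:00
```

In this model, a restart period shorter than the repair interval is enough to postpone the repair indefinitely, and nothing in the schedule looks late while it happens.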

@masokol changed the title from "Possible for repairs to never be triggered" to "Possibility for repairs to never be triggered" Feb 10, 2022
@masokol masokol added the bug Something isn't working label Feb 22, 2022
@VictorCavichioli
Contributor

Possible test case:

  1. Configure the ecChronos repair interval to use minutes instead of days (the default configuration).
  2. Start ecChronos with automatic repair enabled.
  3. During the schedule interval, force a restart. We do this by killing the Java process and then starting it again:
    Kill the Java process:
    kill -15 <pid>

    Restart ecChronos using the ecChronos binary jar:
    nohup java -jar ecchronos-binary-5.0.1-SNAPSHOT.jar > /dev/null 2>&1 &

    If you're running in a container, you can instead execute:

    docker restart <container_name/id>
  4. Check the repair schedule.

Some considerations:

I believe this case of ecChronos restarting during the interval is valid, but I don't see how all ecChronos instances could always be restarting. If one is, the others will still be running repairs on their nodes, and repair_history will receive data.

@DanielwEriksson
Contributor

Where can this jar be found? It does not exist in my environment:

epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$ pwd
/home/epkdaek/cassandra/ecchronos-binary-5.0.1-SNAPSHOT
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$ find . -name ec*.jar
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$ ls lib/e*
lib/ecchronos-binary-5.0.1-SNAPSHOT.pom lib/error_prone_annotations-2.18.0.jar
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$

@DanielwEriksson
Contributor

Should the restart suggested to trigger the scenario be the same as starting ecc via

./bin/ecctool start -f

then stopping it with Ctrl-C and starting it again with the same command?

I assume the change from days to minutes is made in /conf/ecc.yml, but in which part? This section?

repair:
  ##
  ## A class for providing repair configuration for tables.
  ## The default FileBasedRepairConfiguration uses a schedule.yml file to define per-table configurations.
  ##
  provider: com.ericsson.bss.cassandra.ecchronos.application.FileBasedRepairConfiguration
  ##
  ## How often repairs should be triggered for tables.
  ##
  interval:
    time: 7
    unit: days

@VictorCavichioli
Contributor

VictorCavichioli commented Feb 23, 2024

Yes, I figured out that it really does not exist. You can try it the way you've suggested.

@DanielwEriksson
Contributor

After discussion with the author, this is how to reproduce the issue/bug.

Run "ecctool schedules" and note the "Completed at" and "Next repair" columns:

epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$ ./bin/ecctool schedules
Snapshot as of 2024-02-26 10:48:55

| Id | Keyspace | Table | Status | Repaired(%) | Completed at | Next repair | Repair type |

| 1c2acb70-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock | COMPLETED | 100.00 | 2024-02-26 09:59:30 | 2024-03-04 09:58:25 | VNODE |
| 1ca2e1a0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | on_demand_repair_status | COMPLETED | 100.00 | 2024-02-26 10:01:22 | 2024-03-04 10:00:15 | VNODE |
| 1c6674e0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | reject_configuration | COMPLETED | 100.00 | 2024-02-26 10:03:15 | 2024-03-04 10:02:08 | VNODE |
| 1c4a8870-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock_priority | COMPLETED | 100.00 | 2024-02-26 10:05:09 | 2024-03-04 10:04:01 | VNODE |
| 1cbc0ef0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | repair_history | COMPLETED | 100.00 | 2024-02-26 10:07:03 | 2024-03-04 10:01:38 | VNODE |

Summary: 5 completed, 0 on time, 0 blocked, 0 late, 0 overdue
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$

-- truncate repair_history and restart ecchronos

epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$ ./bin/ecctool schedules
Snapshot as of 2024-02-26 10:50:42

| Id | Keyspace | Table | Status | Repaired(%) | Completed at | Next repair | Repair type |

| 1c2acb70-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock | COMPLETED | 100.00 | 2024-02-20 10:50:13 | 2024-02-27 10:50:13 | VNODE |
| 1c4a8870-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock_priority | COMPLETED | 100.00 | 2024-02-20 10:50:13 | 2024-02-27 10:50:13 | VNODE |
| 1ca2e1a0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | on_demand_repair_status | COMPLETED | 100.00 | 2024-02-20 10:50:14 | 2024-02-27 10:50:14 | VNODE |
| 1c6674e0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | reject_configuration | COMPLETED | 100.00 | 2024-02-20 10:50:14 | 2024-02-27 10:50:14 | VNODE |
| 1cbc0ef0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | repair_history | COMPLETED | 100.00 | 2024-02-20 10:50:14 | 2024-02-27 10:50:14 | VNODE |

Summary: 5 completed, 0 on time, 0 blocked, 0 late, 0 overdue
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$

After every restart, these times are recalculated without the repair job being executed.
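
The timestamps in the second output above can be checked with a few lines of arithmetic. This sketch only restates the values shown for ecchronos.lock; the helper `parse` and the variable names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def parse(ts):
    """Parse a timestamp in the format printed by `ecctool schedules`."""
    return datetime.strptime(ts, FMT)


# Values for ecchronos.lock, copied from the second output above.
snapshot = parse("2024-02-26 10:50:42")
completed_at = parse("2024-02-20 10:50:13")
next_repair = parse("2024-02-27 10:50:13")

print(next_repair - completed_at)  # 7 days, 0:00:00
print(next_repair - snapshot)      # 23:59:31
print(snapshot - completed_at)     # 6 days, 0:00:29
```

"Completed at" is exactly one 7-day interval before "Next repair", and both are derived from the restart time rather than from a repair that actually ran. The first output, by contrast, showed completion times from earlier the same day.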

@DanielwEriksson
Contributor

The repair jobs seem to execute just fine. I changed the repair interval to 10 minutes in ecc.yml before starting.

epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$ ./bin/ecctool start -f
ecc started with pid 1066218

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::               (v2.7.17)

11:46:28.754 [main] INFO c.e.b.c.e.a.spring.SpringBooter - Starting SpringBooter using Java 11.0.21 on elx721027t9 with PID 1066218 (/home/epkdaek/cassandra/ecchronos-binary-5.0.0-SNAPSHOT/lib/application-5.0.0-SNAPSHOT.jar started by epkdaek in /home/epkdaek/cassandra/ecchronos-binary-5.0.0-SNAPSHOT)
11:46:28.757 [main] INFO c.e.b.c.e.a.spring.SpringBooter - No active profile set, falling back to 1 default profile: "default"
11:46:30.103 [main] INFO o.s.b.w.e.tomcat.TomcatWebServer - Tomcat initialized with port(s): 8080 (http)
11:46:30.112 [main] INFO o.a.coyote.http11.Http11NioProtocol - Initializing ProtocolHandler ["http-nio-127.0.0.1-8080"]
11:46:30.114 [main] INFO o.a.catalina.core.StandardService - Starting service [Tomcat]
11:46:30.114 [main] INFO o.a.catalina.core.StandardEngine - Starting Servlet engine: [Apache Tomcat/9.0.82]
11:46:30.241 [main] INFO o.a.c.c.C.[Tomcat].[localhost].[/] - Initializing Spring embedded WebApplicationContext
11:46:30.242 [main] INFO o.s.b.w.s.c.ServletWebServerApplicationContext - Root WebApplicationContext: initialization completed in 1429 ms
11:46:30.487 [main] INFO c.e.b.c.e.a.DefaultNativeConnectionProvider - Connecting through CQL using localhost:9042, authentication: false, tls: false
11:46:33.472 [s1-admin-0] INFO c.e.b.c.e.c.DataCenterAwarePolicy - Using provided data-center name 'datacenter1' for DataCenterAwareLoadBalancingPolicy
11:46:33.487 [main] INFO c.e.b.c.e.a.DefaultJmxConnectionProvider - Connecting through JMX using localhost:7100, authentication: false, tls: false
11:46:33.728 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.lock is new, next repair 2024-02-27 11:56:33
11:46:33.784 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.lock_priority is new, next repair 2024-02-27 11:56:33
11:46:33.818 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.on_demand_repair_status is new, next repair 2024-02-27 11:56:33
11:46:33.870 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.reject_configuration is new, next repair 2024-02-27 11:56:33
11:46:33.895 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.repair_history is new, next repair 2024-02-27 11:56:33
11:46:34.283 [main] INFO o.s.b.a.e.web.EndpointLinksResolver - Exposing 1 endpoint(s) beneath base path '/actuator'
11:46:34.310 [main] INFO o.a.coyote.http11.Http11NioProtocol - Starting ProtocolHandler ["http-nio-127.0.0.1-8080"]
11:46:34.325 [main] INFO o.s.b.w.e.tomcat.TomcatWebServer - Tomcat started on port(s): 8080 (http) with context path ''
11:46:34.344 [main] INFO c.e.b.c.e.a.spring.SpringBooter - Started SpringBooter in 6.0 seconds (JVM running for 6.44)
11:46:42.231 [http-nio-127.0.0.1-8080-exec-1] INFO o.a.c.c.C.[Tomcat].[localhost].[/] - Initializing Spring DispatcherServlet 'dispatcherServlet'
11:46:42.232 [http-nio-127.0.0.1-8080-exec-1] INFO o.s.web.servlet.DispatcherServlet - Initializing Servlet 'dispatcherServlet'
11:46:42.233 [http-nio-127.0.0.1-8080-exec-1] INFO o.s.web.servlet.DispatcherServlet - Completed initialization in 1 ms
11:57:03.807 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.lock
11:58:39.425 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.lock_priority
12:00:13.321 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.on_demand_repair_status
12:01:45.794 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.reject_configuration
12:03:18.858 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.repair_history

@jwaeab
Collaborator

jwaeab commented Feb 28, 2024

Since ecchronos always assumes a repair is successful if the history is empty, I don't see why this would be considered a bug if ecchronos is crashed/restarted before the interval is reached - once every interval. Sounds to me the actual bug to investigate is why ecchronos crashes/restarts all the time. ;-)

@masokol
Contributor Author

masokol commented Feb 28, 2024

> Since ecchronos always assumes a repair is successful if the history is empty, I don't see why this would be considered a bug if ecchronos is crashed/restarted before the interval is reached - once every interval. Sounds to me the actual bug to investigate is why ecchronos crashes/restarts all the time. ;-)

Whether you consider this a bug or enhancement doesn't matter. This is a scenario that can occur in the real world and you won't even get any alarms since ecChronos still thinks everything is repaired because the repair history is empty.

@DanielwEriksson
Contributor

> > Since ecchronos always assumes a repair is successful if the history is empty, I don't see why this would be considered a bug if ecchronos is crashed/restarted before the interval is reached - once every interval. Sounds to me the actual bug to investigate is why ecchronos crashes/restarts all the time. ;-)
>
> Whether you consider this a bug or enhancement doesn't matter. This is a scenario that can occur in the real world and you won't even get any alarms since ecChronos still thinks everything is repaired because the repair history is empty.

What is the expected behaviour?
Right now ecChronos expects the next repair to execute at "now" + "configured schedule time", as shown by this log entry:

12:21:13.488 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table <table> is new, next repair <time>

@jwaeab
Collaborator

jwaeab commented Feb 29, 2024

@masokol
Then there are two options, I guess, to fix it;

  1. Repair immediately if history is empty
  2. Put in a fake repair date in the history - if empty. If the empty table is assumed to reflect a proper repair, why not populate it with history reflecting the same? As a matter of fact, as soon as a table is detected as new, it should populate the history with the current date.

@DanielwEriksson
Contributor

Option 2 is almost how it is done today, except that the date is recalculated every time ecChronos is restarted.
Who can decide between option 1 and option 2, or whether there are more options?

@jwaeab
Collaborator

jwaeab commented Feb 29, 2024

Yes, and that's masokol's point; i.e. the next repair date will keep moving forward if the history is empty and ecchronos is restarted.
My two cents are, if an empty repair history is assumed to be "repaired", I suggest that ecchronos should probably timestamp it as such in the history then - unless there are any other problems with that.

@masokol
Contributor Author

masokol commented Mar 1, 2024

> @masokol Then there are two options, I guess, to fix it;
>
>   1. Repair immediately if history is empty
>   2. Put in a fake repair date in the history - if empty. If the empty table is assumed to reflect a proper repair, why not populate it with history reflecting the same? As a matter of fact, as soon as a table is detected as new, it should populate the history with the current date.

From what I've understood, the assumption that everything is repaired if the repair history is empty was made to avoid option 1. So maybe option 2, or some completely new solution. One could argue that having an explicit delay, like repair_delay for each schedule, might be a better option. Anyway, whichever solution you choose, I think ecChronos should know whether it's the first time it's starting or not.

@DanielwEriksson
Contributor

The working assumption that was decided on is that when a new table is found, ecChronos should record in the history that a repair was done now, without actually running the repair, so that history information exists the next time it starts.
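
A rough sketch of what this decision changes, under stated assumptions: the `Scheduler` class, the `record_new_tables` flag and the in-memory `history` dict are all hypothetical stand-ins, with the dict playing the role of the persistent repair_history table. It is not the actual ecChronos implementation.

```python
from datetime import datetime, timedelta


class Scheduler:
    """Toy scheduler; `history` stands in for the persistent repair_history table."""

    def __init__(self, history, interval, record_new_tables):
        self.history = history                  # dict: table name -> last repaired time
        self.interval = interval
        self.record_new_tables = record_new_tables

    def next_repair(self, table, now):
        last = self.history.get(table)
        if last is None:
            # New table: assume it is repaired as of now.
            last = now
            if self.record_new_tables:
                # Decided behaviour: store that assumption in the history
                # so the next startup does not recalculate it.
                self.history[table] = now
        return last + self.interval


def next_repair_after_restarts(record_new_tables):
    history = {}                                # survives restarts
    interval = timedelta(days=7)
    now = datetime(2024, 2, 26, 10, 0, 0)
    result = None
    for _ in range(3):                          # three startups, two days apart
        scheduler = Scheduler(history, interval, record_new_tables)
        result = scheduler.next_repair("ecchronos.lock", now)
        now += timedelta(days=2)
    return result


print(next_repair_after_restarts(False))  # 2024-03-08 10:00:00
print(next_repair_after_restarts(True))   # 2024-03-04 10:00:00
```

Without the recorded entry, the next repair follows the latest startup and keeps moving forward. With it, the date stays fixed at one interval after the first startup, no matter how often the process restarts.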

DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Mar 19, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Mar 19, 2024
@DanielwEriksson DanielwEriksson self-assigned this Mar 20, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Mar 20, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Mar 20, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 2, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 2, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 2, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 3, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 15, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 15, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 15, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 17, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 22, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 23, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 23, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 24, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 24, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 25, 2024
ECCTOOL_EXAMPLES.md remains to be updated
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 25, 2024
ECCTOOL_EXAMPLES.md remains to be updated
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 26, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 29, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 29, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 29, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue Apr 30, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue May 2, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue May 2, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue May 2, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue May 3, 2024
DanielwEriksson added a commit to DanielwEriksson/ecchronos that referenced this issue May 6, 2024
@jwaeab jwaeab closed this as completed May 20, 2024