Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and rethink the coordinator/monitor implementation #1744

Open
ph opened this issue Aug 15, 2022 · 2 comments · May be fixed by #3131
Open

Refactor and rethink the coordinator/monitor implementation #1744

ph opened this issue Aug 15, 2022 · 2 comments · May be fixed by #3131
Assignees
Labels
Team:Fleet Label for the Fleet team

Comments

@ph
Copy link
Contributor

ph commented Aug 15, 2022

When an elastic policy is dispatched to fleet server, one of the fleet server will be elected and take take the control of some of the management task, like when to unenroll agent after a last seen timeout. I've fixed a race condition issue discovered in #1738. When I've jumped the code I found it a bit harder to ready to really understand the flow of events and when goroutine were created and removed.

We need to evaluate if we need to add logic to this area of the code, if this is the case we should really invest some time in refactoring the logic, here a few things to consider changing:

  • We create a goroutine per agent policy, as the number of policy is low this is perfectly fine, but I think that logic could be handled by a single cleanup event loop in a single goroutine.
  • We are exposing internal fields from monitorT object in the test suite, we should hide all access to the internal field using accessor even if this is only for testing. This allow a single locking logic.
  • Looking at the code, It look like possible the usage of multiple internal fields into a watcher struct that would encapsulate more logic.
  • Internal state of the monitor bleeds into the goroutine execution, this make it harder to lock or prevent concurrent access to the resource. Encapsulating that logic into his own object would make it simple to test and verifies.
@ph ph added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Aug 15, 2022
@jen-huang jen-huang added Team:Fleet Label for the Fleet team and removed Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Sep 12, 2022
@kpollich
Copy link
Member

This issue is a little out of my wheelhouse in terms of expertise, but it seems like a very nuanced technical debt issue. I think the best path forward is to keep this near the top of the backlog and look to take this on during feature freeze or ON week in the near future.

@michel-laterman - let me know if you have other thoughts here. Curious about how mission critical this refactor might be or if this would solve any major issues we have with the Fleet Server codebase today.

@kpollich kpollich removed their assignment Oct 18, 2022
@joshdover
Copy link
Contributor

We ran into the coordinator again in #2606

I think it will make sense to remove it at some point.

@michel-laterman michel-laterman self-assigned this Dec 18, 2023
@michel-laterman michel-laterman linked a pull request Dec 18, 2023 that will close this issue
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Fleet Label for the Fleet team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants