
Remove the coordinator #3131

Open · wants to merge 13 commits into main
Conversation

@michel-laterman (Contributor) commented Nov 29, 2023

What is the problem this PR solves?

The fleet-server coordinator and leader-election mechanisms are no-ops that add unneeded complexity to the codebase.

How does this PR solve the problem?

Remove the policy coordinator and policy leader-election mechanisms from fleet-server. Deprecate the coordinator_idx value in fleet-server's JSON schema and remove coordinator_idx references when processing policies.
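To illustrate the shape of the change, here is a minimal Go sketch; the struct and helper names are assumptions based on this description, not the actual fleet-server source:

type Policy struct {
	PolicyID    string `json:"policy_id"`
	RevisionIdx int64  `json:"revision_idx"`
	// Deprecated: no longer consulted when processing policies; kept in
	// the JSON schema only for compatibility with existing documents.
	CoordinatorIdx int64 `json:"coordinator_idx"`
}

// Before this PR, a policy was served only once the (no-op) coordinator
// had "prepared" it by bumping coordinator_idx above zero.
func servableOld(p Policy) bool {
	return p.RevisionIdx > 0 && p.CoordinatorIdx > 0
}

// After this PR, any indexed revision is served directly.
func servableNew(p Policy) bool {
	return p.RevisionIdx > 0
}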

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail-safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related Issues

@michel-laterman added the Team:Fleet (Label for the Fleet team) and tech debt labels Nov 29, 2023
Remove the policy coordinator and policy leader election mechanisms
from fleet-server. Deprecate the coordinator_idx value in
fleet-server's json schema and remove coordinator_idx references when
processing policies.
Comment on lines -17 to -18
FleetPoliciesLeader = ".fleet-policies-leader"
FleetServers = ".fleet-servers"
@michel-laterman (Contributor Author)
I've removed the functions that create entries in policies-leader as well as the one that writes to .fleet-servers. Is there any other component that uses the .fleet-servers entries?

Contributor

We should create a follow-up issue to check this index and clean it up in the next major if it is not used.

@nchaulet (Member) commented Dec 14, 2023

@michel-laterman the Fleet status API relies on the .fleet-servers entries (https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/server/routes/setup/handlers.ts#L22), so we should change that in Kibana before going forward with this PR; otherwise the Fleet UI will be broken and users will not be able to add agents.

@michel-laterman (Contributor Author)

Do we want to change Kibana's behaviour as well, or change fleet-server to register itself somewhere else?

Member

I think for this one it's probably okay to change the Kibana behavior, we can rely on retrieving agents with fleet-server installed instead.

internal/pkg/dl/migration.go (outdated; resolved)
@michel-laterman michel-laterman marked this pull request as ready for review November 30, 2023 22:15
@michel-laterman michel-laterman requested a review from a team as a code owner November 30, 2023 22:15
mergify bot (Contributor) commented Dec 6, 2023

This pull request now has conflicts. Could you fix it @michel-laterman? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b remove-coordinator upstream/remove-coordinator
git merge upstream/main
git push upstream remove-coordinator

@juliaElastic (Contributor) left a comment

Changes LGTM.
Is there anything that can go wrong if there are multiple fleet-server instances, some of them on an older version with a coordinator?

@michel-laterman (Contributor Author)

If an older version is running at the same time as a newer version, the older instance will only serve policies where the coordinator_idx has been updated to be greater than 0, while a newer instance may serve the same policy with a coordinator_idx of 0.
If an agent gets its policy from a new fleet-server and then checks in with an older instance, it may immediately get a POLICY_CHANGE action in which only the coordinator_idx value has changed.
The rest of the policy data will not change.
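A hedged sketch of that comparison (function and parameter names are assumptions for illustration, not the actual fleet-server code):

// An older fleet-server considers a policy changed when either index
// moves, so a coordinator bumping coordinator_idx from 0 to 1 re-sends
// a POLICY_CHANGE even though the policy body is identical.
func changedOld(servedRev, servedCoord, latestRev, latestCoord int64) bool {
	return latestRev > servedRev ||
		(latestRev == servedRev && latestCoord > servedCoord)
}

// A fleet-server with this PR only compares revisions, so the
// duplicate action is not produced.
func changedNew(servedRev, latestRev int64) bool {
	return latestRev > servedRev
}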

@cmacknz, would this sequence present an issue to the agent?

@cmacknz (Member) commented Dec 11, 2023

Just to confirm: this only duplicates the POLICY_CHANGE action and not other action types?

If the policy is exactly the same, I think this is fine; the agent should realize there are no changes to the running set of components and make no changes. I would recommend testing this quickly to confirm, though.

There is no actual requirement for independent Fleet Servers to run the same version, is there? In practice this is often the case, but nothing enforces it.

I am wondering if we could get into an edge case where the agent continuously re-processes a policy change while bouncing between two Fleet Servers with different versions. Is this possible, or something we need to guard against? Does Fleet Server itself tolerate the coordinator_idx constantly toggling?

@michel-laterman (Contributor Author)

OK, I did some testing, and removing the coordinator_idx value should not affect the agent.

I tested the following sequence:

  1. agent enrolls in a fleet-server with a coordinator
  2. the fleet-server coordinator is removed
  3. the policy is updated
  4. a fleet-server with a coordinator replaces the fleet-server instance

The agent behaved normally throughout.
I also tried the sequence with a second fleet-server instance (with a coordinator) that did not serve the agent but did update the policies that the agent uses; it did not affect the other fleet-server instance.

mergify bot (Contributor) commented Dec 15, 2023

This pull request now has conflicts. Could you fix it @michel-laterman? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b remove-coordinator upstream/remove-coordinator
git merge upstream/main
git push upstream remove-coordinator

@michel-laterman (Contributor Author)

We need to hold off on merging this until the following Kibana issues are resolved:

mergify bot (Contributor) commented Dec 26, 2023

This pull request now has conflicts. Could you fix it @michel-laterman? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b remove-coordinator upstream/remove-coordinator
git merge upstream/main
git push upstream remove-coordinator

@jlind23 (Contributor) commented Apr 2, 2024

@michel-laterman does this one also close elastic/kibana#173538?

The v8.5 migration generated new output keys for an agent by forcing the policy outputs to be prepared, which was done by incrementing the coordinator_idx for the policy. The behaviour was changed to instead detect whether any single output has no api_key value (an empty string) and, if so, subscribe with a revision of 0.
Comment on lines +278 to +287
	// use revision_idx=0 if any of the agent's outputs has no API key defined
	// This will force the policy monitor to emit a new policy to regenerate API keys
	revID := agent.PolicyRevisionIdx
	for _, output := range agent.Outputs {
		if output.APIKey == "" {
			revID = 0
			break
		}
	}
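For context on why a revision of 0 forces a resend (a hedged sketch; the real policy monitor's dispatch logic differs and these names are illustrative):

// dispatch decides whether the monitor should emit the latest stored
// policy to a subscriber. A subscription made with revIdx=0 (set above
// when any output is missing its API key) is always behind the latest
// stored revision, so the policy is re-sent and serving it regenerates
// the API keys.
func dispatch(subscribedRevIdx, latestRevIdx int64) bool {
	return latestRevIdx > subscribedRevIdx
}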

@michel-laterman (Contributor Author)

The coordinator incrementation in dl/migration.go has been removed; this should result in the same behaviour and remove the need for elastic/kibana#173538.

internal/pkg/server/fleet_integration_test.go (outdated; resolved)
@jen-huang

> We need to hold off on merging this until the following Kibana issues are resolved:

FYI elastic/kibana#173537 was recently completed and merged, so this should be unblocked now. Are there any BWC scenarios we need to consider?

Labels
Team:Fleet (Label for the Fleet team), tech debt
Development

Successfully merging this pull request may close these issues.

Refactor and rethink the coordinator/monitor implementation
7 participants