
Remove the coordinator #3131

Open · wants to merge 13 commits into main
Conversation

@michel-laterman (Contributor) commented Nov 29, 2023

What is the problem this PR solves?

The fleet-server coordinator and leader-election mechanisms are no-ops that add unneeded complexity to the codebase.

How does this PR solve the problem?

Remove the policy coordinator and policy leader-election mechanisms from fleet-server. Deprecate the coordinator_idx value in fleet-server's JSON schema and remove coordinator_idx references when processing policies.
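To illustrate the shape of the change, here is a minimal Go sketch; the struct and helper names are assumptions based on this description, not the actual fleet-server source:

type Policy struct {
	PolicyID    string `json:"policy_id"`
	RevisionIdx int64  `json:"revision_idx"`
	// Deprecated: no longer consulted when processing policies; kept in
	// the JSON schema only for compatibility with existing documents.
	CoordinatorIdx int64 `json:"coordinator_idx"`
}

// Before this PR, a policy was served only once the (no-op) coordinator
// had "prepared" it by bumping coordinator_idx above zero.
func servableOld(p Policy) bool {
	return p.RevisionIdx > 0 && p.CoordinatorIdx > 0
}

// After this PR, any indexed revision is served directly.
func servableNew(p Policy) bool {
	return p.RevisionIdx > 0
}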

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail-safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related Issues

@michel-laterman added the Team:Fleet (Label for the Fleet team) and tech debt labels Nov 29, 2023
Remove the policy coordinator and policy leader election mechanisms
from fleet-server. Deprecate the coordinator_idx value in
fleet-server's json schema and remove coordinator_idx references when
processing policies.
Comment on lines -17 to -18
FleetPoliciesLeader = ".fleet-policies-leader"
FleetServers = ".fleet-servers"
@michel-laterman (Contributor Author)
I've removed the functions that create entries in policies-leader as well as the one that writes to .fleet-servers. Is there any other component that uses the .fleet-servers entries?

Contributor

We should create a follow-up issue to check this index and clean it up in the next major if it is not used.

@nchaulet (Member) commented Dec 14, 2023

@michel-laterman the Fleet status API relies on the .fleet-servers entries (https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/server/routes/setup/handlers.ts#L22), so we should change that in Kibana before going forward with this PR; otherwise the Fleet UI will be broken and users will not be able to add agents.

@michel-laterman (Contributor Author)

Do we want to change Kibana's behaviour as well, or change fleet-server to register itself somewhere else?

Member

I think for this one it's probably okay to change the Kibana behavior, we can rely on retrieving agents with fleet-server installed instead.

internal/pkg/dl/migration.go (outdated; resolved)
@michel-laterman michel-laterman marked this pull request as ready for review November 30, 2023 22:15
@michel-laterman michel-laterman requested a review from a team as a code owner November 30, 2023 22:15
mergify bot (Contributor) commented Dec 6, 2023

This pull request now has conflicts. Could you fix it @michel-laterman? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b remove-coordinator upstream/remove-coordinator
git merge upstream/main
git push upstream remove-coordinator

@juliaElastic (Contributor) left a comment

Changes LGTM.
Is there anything that can go wrong if there are multiple fleet-server instances, some of them on an older version with a coordinator?

@michel-laterman (Contributor Author)

If an older version is running at the same time as a newer version, the older instance will only serve policies where the coordinator_idx has been updated to be greater than 0, while a newer instance may serve the same policy with a coordinator_idx of 0.
If an agent gets its policy from a new fleet-server and then checks in with an older instance, it may immediately get a POLICY_CHANGE action in which only the coordinator_idx value has changed.
The rest of the policy data will not change.
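A hedged sketch of that comparison (function and parameter names are assumptions for illustration, not the actual fleet-server code):

// An older fleet-server considers a policy changed when either index
// moves, so a coordinator bumping coordinator_idx from 0 to 1 re-sends
// a POLICY_CHANGE even though the policy body is identical.
func changedOld(servedRev, servedCoord, latestRev, latestCoord int64) bool {
	return latestRev > servedRev ||
		(latestRev == servedRev && latestCoord > servedCoord)
}

// A fleet-server with this PR only compares revisions, so the
// duplicate action is not produced.
func changedNew(servedRev, latestRev int64) bool {
	return latestRev > servedRev
}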

@cmacknz, would this sequence present an issue to the agent?

@cmacknz (Member) commented Dec 11, 2023

Just to confirm: this only duplicates the POLICY_CHANGE action and not other action types?

If the policy is exactly the same, I think this is fine; the agent should realize there are no changes to the running set of components and make no changes. I would recommend testing this quickly to confirm, though.

There is no actual requirement for independent Fleet Servers to run the same version, is there? In practice this is often the case, but nothing enforces it.

I am wondering if we could get into an edge case where the agent continuously re-processes a policy change while bouncing between two Fleet Servers with different versions. Is this possible, or something we need to guard against? Does Fleet Server itself tolerate the coordinator_idx constantly toggling?

@michel-laterman (Contributor Author)

OK, I did some testing, and removing the coordinator_idx value should not affect the agent.

I tested the following sequence:

  1. agent enrolls in a fleet-server with a coordinator
  2. the fleet-server coordinator is removed
  3. the policy is updated
  4. a fleet-server with a coordinator replaces the fleet-server instance

The agent behaved normally throughout.
I also tried the sequence with a second fleet-server instance (with a coordinator) that did not serve the agent but did update the policies that the agent uses; it did not affect the other fleet-server instance.

mergify bot (Contributor) commented Dec 15, 2023

This pull request now has conflicts. Could you fix it @michel-laterman? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b remove-coordinator upstream/remove-coordinator
git merge upstream/main
git push upstream remove-coordinator

@michel-laterman (Contributor Author)

We need to hold off on merging this until the following Kibana issues are resolved:

mergify bot (Contributor) commented Dec 26, 2023

This pull request now has conflicts. Could you fix it @michel-laterman? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b remove-coordinator upstream/remove-coordinator
git merge upstream/main
git push upstream remove-coordinator

@jlind23 (Contributor) commented Apr 2, 2024

@michel-laterman does this one also close elastic/kibana#173538?

The v8.5 migration generated new output keys for an agent by forcing the policy outputs to be prepared, which was done by incrementing the coordinator_idx for the policy. The behaviour was changed to instead detect whether any single output has no api_key value (an empty string) and, if so, subscribe with a revision of 0.
Comment on lines +278 to +287
	// use revision_idx=0 if any of the agent's outputs has no API key defined
	// This will force the policy monitor to emit a new policy to regenerate API keys
	revID := agent.PolicyRevisionIdx
	for _, output := range agent.Outputs {
		if output.APIKey == "" {
			revID = 0
			break
		}
	}
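For context on why a revision of 0 forces a resend (a hedged sketch; the real policy monitor's dispatch logic differs and these names are illustrative):

// dispatch decides whether the monitor should emit the latest stored
// policy to a subscriber. A subscription made with revIdx=0 (set above
// when any output is missing its API key) is always behind the latest
// stored revision, so the policy is re-sent and serving it regenerates
// the API keys.
func dispatch(subscribedRevIdx, latestRevIdx int64) bool {
	return latestRevIdx > subscribedRevIdx
}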

@michel-laterman (Contributor Author)

The coordinator incrementation in dl/migration.go has been removed; this should result in the same behaviour and remove the need for elastic/kibana#173538.

internal/pkg/server/fleet_integration_test.go (outdated; resolved)
@jen-huang

> We need to hold off on merging this until the following Kibana issues are resolved:

FYI elastic/kibana#173537 was recently completed and merged, so this should be unblocked now. Are there any BWC scenarios we need to consider?

Labels
Team:Fleet (Label for the Fleet team), tech debt
Development

Successfully merging this pull request may close these issues.

Refactor and rethink the coordinator/monitor implementation
7 participants