-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable no-errors-in-logs check by default, and extend it to all Cilium components #2184
Conversation
9872c09
to
6d354a1
Compare
6d354a1
to
87e9ede
Compare
87e9ede
to
2ed0125
Compare
Currently, we check for errors in Cilium agent logs only when unsafe tests are enabled. Let's promote this check to be always performed for Cilium v1.14 and subsequent (earlier versions are not compatible with the Cilium CLI v0.15 series), to ensure that we catch unexpected errors early (as unsafe tests are only enabled in a limited set of CI workflows) given that it does not perform any unsafe operation. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Don't additionally print the errors possibly detected by the no-errors-in-logs check, as already logged by the action itself. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, the no-errors-in-logs check asserts that errors are not found in the main cilium agent container. Let's extend this check to also cover the init containers in the cilium agent pods, as well as other cilium components, such as the operator and the clustermesh-apiserver. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
2ed0125
to
74a8181
Compare
Additionally tested in cilium/cilium#30604 and cilium/cilium#29890 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@giorio94 Nice work Marco!
@@ -1232,8 +1232,10 @@ func Run(ctx context.Context, ct *check.ConnectivityTest, extra Hooks) error { | |||
return err | |||
} | |||
|
|||
if ct.Params().IncludeUnsafeTests { | |||
ct.NewTest("check-log-errors").WithScenarios(tests.NoErrorsInLogs(ct.CiliumVersion)) | |||
if versioncheck.MustCompile(">=1.14.0")(ct.CiliumVersion) || ct.Params().IncludeUnsafeTests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be an expensive call if an operator runs this wide open on a larger cluster. ie scanning logs for errors = #cluster*#pod*#cos*#lines. Perhaps this should be an opt-in flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this test is potentially expensive. The reasoning for enabling it by default (opposed to it being behind a special flag) is that we want to run it for all CI workflows (which are the main consumers of the connectivity tests in any case), and making it opt-in would make that really tricky to achieve. That said, it is still possible for a user to either disable this check through the --test
flag, or specify a custom label selection which matches only a subset of nodes. This is consistent with the behavior of other expensive checks, like the health one.
74a8181
to
62f259a
Compare
62f259a
to
eb1ed91
Compare
Otherwise it is impossible to coalescence multiple occurrences of the same error. While being there, let's change the GetLogs helper to directly take a PodLogOptions parameter instead of listing all possible subfields, for increased clarity and generality. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Slightly change the no-errors-in-logs check to create a separate action for each container, in order to improve progress reporting. Additionally, configure the test to collect a single sysdump regardless of the number of log errors detected, given that it already includes the information from all the Cilium containers. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
eb1ed91
to
f7ba331
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@giorio94 Thanks for the updates!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very valuable improvement, looks good aside from some questions.
t.NewGenericAction(n, id).Run(func(a *check.Action) { | ||
logs, err := client.GetLogs(ctx, pod.Namespace, pod.Name, container, opts) | ||
if err != nil { | ||
a.Fatalf("Error reading Cilium logs: %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering, do we want to fatal out here? seems like we could still proceed with other checks even if this fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have strong opinions honestly. The current implementation mimics the previous one in this respect, which already fataled out in case we failed reading the logs (with the caveat that the test suite then recovers from the panic, and continues with the execution of the other scenarios). And this is also consistent with other checks which also fatal out in case something goes wrong during retrieval of information (e.g., drop counts).
IMHO, we could likely improve this logic to be more resilient in case of agent restart and/or spot instances rotation, but that looks a more generic and long term effort to me, which is unrelated from this specific PR.
} | ||
failureMsgs := strings.Join(failures, "\n") | ||
t.Failf("Found %d logs matching list of errors that must be investigated:\n%s", len(uniqueFailures), failureMsgs) | ||
a.Failf("Found %d logs in %s matching list of errors that must be investigated:%s", len(uniqueFailures), id, failures.String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
q: do we actually match this anywhere, it appears like we just report on unique errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm not sure I'm fully understanding your question here. If you refer to the matching list wording, that refers to the fact that we check each log entry against a set of known unexpected messages (with possible exceptions), and trigger a failure if any of them is found. Multiple occurrences of the same entry are then coalescenced for the sake of brevity.
GitHub Pull Request: #623 chore(deps): update dependency cilium/cilium-cli to v0.15.23 (v1.15) [![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com) This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [cilium/cilium-cli](https://togithub.com/cilium/cilium-cli) | | patch | `v0.15.22` -> `v0.15.23` | | [cilium/cilium-cli](https://togithub.com/cilium/cilium-cli) | action | patch | `v0.15.22` -> `v0.15.23` | --- ### Release Notes <details> <summary>cilium/cilium-cli (cilium/cilium-cli)</summary> ### [`v0.15.23`](https://togithub.com/cilium/cilium-cli/releases/tag/v0.15.23) [Compare Source](https://togithub.com/cilium/cilium-cli/compare/v0.15.22...v0.15.23) #### What's Changed - gateway: Upgrade API version by [@&cilium#8203;sayboras](https://togithub.com/sayboras) in [cilium/cilium-cli#2285 - chore(deps): update dependency kubernetes-sigs/kind to v0.21.0 by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2284 - IPsec key rotation command. by [@&cilium#8203;viktor-kurchenko](https://togithub.com/viktor-kurchenko) in [cilium/cilium-cli#2266 - IPsec key status command implementation. by [@&cilium#8203;viktor-kurchenko](https://togithub.com/viktor-kurchenko) in [cilium/cilium-cli#2287 - AWS OIDC instead of access key. by [@&cilium#8203;viktor-kurchenko](https://togithub.com/viktor-kurchenko) in [cilium/cilium-cli#2297 - Remove no longer necessary step from the external workloads installation script generation process by [@&cilium#8203;giorio94](https://togithub.com/giorio94) in [cilium/cilium-cli#2275 - Enable no-errors-in-logs check by default, and extend it to all Cilium components by [@&cilium#8203;giorio94](https://togithub.com/giorio94) in [cilium/cilium-cli#2184 - chore(deps): update golangci/golangci-lint-action action to v4 by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2295 - chore(deps): update helm/kind-action action to v1.9.0 by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2296 - chore(deps): update golang docker tag to v1.22.0 by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2289 - fix(deps): update module golang.org/x/mod to v0.15.0 by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2294 - chore(deps): update go to v1.22.0 (minor) by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2293 - chore(deps): update all github action dependencies (patch) by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2286 - chore(deps): update golangci/golangci-lint docker tag to v1.56.1 by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2290 - IPsec key rotation with algorithm change support. by [@&cilium#8203;viktor-kurchenko](https://togithub.com/viktor-kurchenko) in [cilium/cilium-cli#2291 - chore: Amend connectivity tests for OpenShift by [@&cilium#8203;fgiloux](https://togithub.com/fgiloux) in [cilium/cilium-cli#2303 - Increase timeouts in AKS and GKE GHA workflows by [@&cilium#8203;giorio94](https://togithub.com/giorio94) in [cilium/cilium-cli#2307 - gha: increase GKE disk size in external workloads workflow to 15GB by [@&cilium#8203;giorio94](https://togithub.com/giorio94) in [cilium/cilium-cli#2301 - Prepare for v0.15.23 release by [@&cilium#8203;michi-covalent](https://togithub.com/michi-covalent) in [cilium/cilium-cli#2302 #### New Contributors - [@&cilium#8203;fgiloux](https://togithub.com/fgiloux) made their first contribution in [cilium/cilium-cli#2303 **Full Changelog**: cilium/cilium-cli@v0.15.22...v0.15.23 </details> --- ### Configuration 📅 **Schedule**: Branch creation - "on monday" (UTC), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about these updates again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/aanm/cilium). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4yMjAuMiIsInVwZGF0ZWRJblZlciI6IjM3LjIyMC4yIiwidGFyZ2V0QnJhbmNoIjoidjEuMTUifQ==--> Patch-set: 1 Change-id: I515325620aa4905df61d31323487f6936237e3a5 Subject: chore(deps): update dependency cilium/cilium-cli to v0.15.23 Branch: refs/heads/v1.15 Status: new Topic: Commit: 7c37f7d Tag: autogenerated:gerrit:newPatchSet Groups: 7c37f7d Private: false Work-in-progress: false
Currently, we check for errors in Cilium agent logs only when unsafe tests are enabled. Let's promote this check to be always performed for Cilium v1.14 and subsequent (earlier versions are not compatible with the Cilium CLI v0.15 series), to ensure that we catch unexpected errors early (as unsafe tests are only enabled in a limited set of CI workflows) given that it does not perform any unsafe operation. Additionally, let's extend it to also cover the init containers in the cilium agent pods, as well as other cilium components, such as the operator and the clustermesh-apiserver.