Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable no-errors-in-logs check by default, and extend it to all Cilium components #2184

Merged
merged 5 commits into from
Feb 13, 2024

Conversation

giorio94
Copy link
Member

@giorio94 giorio94 commented Dec 14, 2023

Currently, we check for errors in Cilium agent logs only when unsafe tests are enabled. Let's promote this check to be always performed for Cilium v1.14 and subsequent (earlier versions are not compatible with the Cilium CLI v0.15 series), to ensure that we catch unexpected errors early (as unsafe tests are only enabled in a limited set of CI workflows) given that it does not perform any unsafe operation. Additionally, let's extend it to also cover the init containers in the cilium agent pods, as well as other cilium components, such as the operator and the clustermesh-apiserver.

Currently, we check for errors in Cilium agent logs only when unsafe
tests are enabled. Let's promote this check to be always performed
for Cilium v1.14 and subsequent (earlier versions are not compatible
with the Cilium CLI v0.15 series), to ensure that we catch unexpected
errors early (as unsafe tests are only enabled in a limited set of CI
workflows) given that it does not perform any unsafe operation.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Don't additionally print the errors possibly detected by the
no-errors-in-logs check, as already logged by the action itself.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, the no-errors-in-logs check asserts that errors are not found
in the main cilium agent container. Let's extend this check to also cover
the init containers in the cilium agent pods, as well as other cilium
components, such as the operator and the clustermesh-apiserver.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94
Copy link
Member Author

giorio94 commented Feb 2, 2024

Additionally tested in cilium/cilium#30604 and cilium/cilium#29890

@giorio94 giorio94 marked this pull request as ready for review February 2, 2024 16:15
@giorio94 giorio94 requested review from a team as code owners February 2, 2024 16:15
Copy link
Contributor

@derailed derailed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@giorio94 Nice work Marco!

k8s/client.go Outdated Show resolved Hide resolved
@@ -1232,8 +1232,10 @@ func Run(ctx context.Context, ct *check.ConnectivityTest, extra Hooks) error {
return err
}

if ct.Params().IncludeUnsafeTests {
ct.NewTest("check-log-errors").WithScenarios(tests.NoErrorsInLogs(ct.CiliumVersion))
if versioncheck.MustCompile(">=1.14.0")(ct.CiliumVersion) || ct.Params().IncludeUnsafeTests {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be an expensive call if an operator runs this wide open on a larger cluster. ie scanning logs for errors = #cluster*#pod*#cos*#lines. Perhaps this should be an opt-in flag?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this test is potentially expensive. The reasoning for enabling it by default (opposed to it being behind a special flag) is that we want to run it for all CI workflows (which are the main consumers of the connectivity tests in any case), and making it opt-in would make that really tricky to achieve. That said, it is still possible for a user to either disable this check through the --test flag, or specify a custom label selection which matches only a subset of nodes. This is consistent with the behavior of other expensive checks, like the health one.

connectivity/tests/errors.go Show resolved Hide resolved
connectivity/tests/errors.go Outdated Show resolved Hide resolved
connectivity/tests/errors.go Show resolved Hide resolved
Otherwise it is impossible to coalescence multiple occurrences of the
same error. While being there, let's change the GetLogs helper to
directly take a PodLogOptions parameter instead of listing all possible
subfields, for increased clarity and generality.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Slightly change the no-errors-in-logs check to create a separate action
for each container, in order to improve progress reporting. Additionally,
configure the test to collect a single sysdump regardless of the number
of log errors detected, given that it already includes the information
from all the Cilium containers.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Copy link
Contributor

@derailed derailed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@giorio94 Thanks for the updates!

Copy link
Contributor

@tommyp1ckles tommyp1ckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very valuable improvement, looks good aside from some questions.

t.NewGenericAction(n, id).Run(func(a *check.Action) {
logs, err := client.GetLogs(ctx, pod.Namespace, pod.Name, container, opts)
if err != nil {
a.Fatalf("Error reading Cilium logs: %s", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering, do we want to fatal out here? seems like we could still proceed with other checks even if this fails?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong opinions honestly. The current implementation mimics the previous one in this respect, which already fataled out in case we failed reading the logs (with the caveat that the test suite then recovers from the panic, and continues with the execution of the other scenarios). And this is also consistent with other checks which also fatal out in case something goes wrong during retrieval of information (e.g., drop counts).

IMHO, we could likely improve this logic to be more resilient in case of agent restart and/or spot instances rotation, but that looks a more generic and long term effort to me, which is unrelated from this specific PR.

}
failureMsgs := strings.Join(failures, "\n")
t.Failf("Found %d logs matching list of errors that must be investigated:\n%s", len(uniqueFailures), failureMsgs)
a.Failf("Found %d logs in %s matching list of errors that must be investigated:%s", len(uniqueFailures), id, failures.String())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

q: do we actually match this anywhere, it appears like we just report on unique errors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not sure I'm fully understanding your question here. If you refer to the matching list wording, that refers to the fact that we check each log entry against a set of known unexpected messages (with possible exceptions), and trigger a failure if any of them is found. Multiple occurrences of the same entry are then coalescenced for the sake of brevity.

@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Feb 13, 2024
@michi-covalent michi-covalent merged commit 1fc317d into main Feb 13, 2024
13 checks passed
@michi-covalent michi-covalent deleted the pr/giorio94/fail-on-log-errors branch February 13, 2024 15:42
aanm pushed a commit to aanm/cilium that referenced this pull request Mar 7, 2024
GitHub Pull Request: #623

chore(deps): update dependency cilium/cilium-cli to v0.15.23 (v1.15)

[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [cilium/cilium-cli](https://togithub.com/cilium/cilium-cli) |  | patch | `v0.15.22` -> `v0.15.23` |
| [cilium/cilium-cli](https://togithub.com/cilium/cilium-cli) | action | patch | `v0.15.22` -> `v0.15.23` |

---

### Release Notes

<details>
<summary>cilium/cilium-cli (cilium/cilium-cli)</summary>

### [`v0.15.23`](https://togithub.com/cilium/cilium-cli/releases/tag/v0.15.23)

[Compare Source](https://togithub.com/cilium/cilium-cli/compare/v0.15.22...v0.15.23)

#### What's Changed

-   gateway: Upgrade API version by [@&cilium#8203;sayboras](https://togithub.com/sayboras) in [cilium/cilium-cli#2285
-   chore(deps): update dependency kubernetes-sigs/kind to v0.21.0 by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2284
-   IPsec key rotation command. by [@&cilium#8203;viktor-kurchenko](https://togithub.com/viktor-kurchenko) in [cilium/cilium-cli#2266
-   IPsec key status command implementation. by [@&cilium#8203;viktor-kurchenko](https://togithub.com/viktor-kurchenko) in [cilium/cilium-cli#2287
-   AWS OIDC instead of access key. by [@&cilium#8203;viktor-kurchenko](https://togithub.com/viktor-kurchenko) in [cilium/cilium-cli#2297
-   Remove no longer necessary step from the external workloads installation script generation process by [@&cilium#8203;giorio94](https://togithub.com/giorio94) in [cilium/cilium-cli#2275
-   Enable no-errors-in-logs check by default, and extend it to all Cilium components by [@&cilium#8203;giorio94](https://togithub.com/giorio94) in [cilium/cilium-cli#2184
-   chore(deps): update golangci/golangci-lint-action action to v4 by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2295
-   chore(deps): update helm/kind-action action to v1.9.0 by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2296
-   chore(deps): update golang docker tag to v1.22.0 by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2289
-   fix(deps): update module golang.org/x/mod to v0.15.0 by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2294
-   chore(deps): update go to v1.22.0 (minor) by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2293
-   chore(deps): update all github action dependencies (patch) by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2286
-   chore(deps): update golangci/golangci-lint docker tag to v1.56.1 by [@&cilium#8203;renovate](https://togithub.com/renovate) in [cilium/cilium-cli#2290
-   IPsec key rotation with algorithm change support. by [@&cilium#8203;viktor-kurchenko](https://togithub.com/viktor-kurchenko) in [cilium/cilium-cli#2291
-   chore: Amend connectivity tests for OpenShift by [@&cilium#8203;fgiloux](https://togithub.com/fgiloux) in [cilium/cilium-cli#2303
-   Increase timeouts in AKS and GKE GHA workflows by [@&cilium#8203;giorio94](https://togithub.com/giorio94) in [cilium/cilium-cli#2307
-   gha: increase GKE disk size in external workloads workflow to 15GB by [@&cilium#8203;giorio94](https://togithub.com/giorio94) in [cilium/cilium-cli#2301
-   Prepare for v0.15.23 release by [@&cilium#8203;michi-covalent](https://togithub.com/michi-covalent) in [cilium/cilium-cli#2302

#### New Contributors

-   [@&cilium#8203;fgiloux](https://togithub.com/fgiloux) made their first contribution in [cilium/cilium-cli#2303

**Full Changelog**: cilium/cilium-cli@v0.15.22...v0.15.23

</details>

---

### Configuration

📅 **Schedule**: Branch creation - "on monday" (UTC), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about these updates again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/aanm/cilium).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4yMjAuMiIsInVwZGF0ZWRJblZlciI6IjM3LjIyMC4yIiwidGFyZ2V0QnJhbmNoIjoidjEuMTUifQ==-->


Patch-set: 1
Change-id: I515325620aa4905df61d31323487f6936237e3a5
Subject: chore(deps): update dependency cilium/cilium-cli to v0.15.23
Branch: refs/heads/v1.15
Status: new
Topic: 
Commit: 7c37f7d
Tag: autogenerated:gerrit:newPatchSet
Groups: 7c37f7d
Private: false
Work-in-progress: false
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants