Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-9037: Change Canary to use passthrough route #978

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

rfredette
Copy link
Contributor

@rfredette rfredette commented Sep 19, 2023

When the default ingress controller is configured to use mTLS, connecting to edge or reencrypt routes requires the client to have a valid certificate/key pair. There is no way for a user to provide a client certificate or key to the ingress operator, so canary checks using edge encryption fail. This PR changes the canary route to be passthrough and has the canary host handle the TLS handshake, allowing it to function even if mTLS is otherwise required.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 19, 2023
@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from bc30df7 to 5af9bc9 Compare September 25, 2023 19:01
@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from 3aec956 to 591e372 Compare October 3, 2023 16:58
@rfredette
Copy link
Contributor Author

Looking through the logs from the latest e2e-aws-ovn run, I only see a single canary health check failure, which isn't enough to trigger the operator to change its CanaryChecksSucceeding status condition, so I think it's unlikely that that failure or the other e2e-*-ovn are related to these changes. I've rebased in case there were relevant changes my branch is missing, but mostly I think it just needs a retest.

@rfredette
Copy link
Contributor Author

On further investigation, there is a canary related failure in e2e-*-ovn, for test [sig-arch] Managed cluster [It] should expose cluster services outside the cluster [apigroup:route.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]. It expects the canary to return an HTTP code 200, which it seems to from this part of the test output:

Oct 3 18:29:54.955: INFO: stdout: "{\"test\":0,\"rc\":0,\"curl\":{\"code\":200},\"error\":\"\",\"body\":\"SGVhbHRoY2hlY2sgcmVxdWVzdGVkCg==\",\"headers\":\"HTTP/2 200 \\r\\nx-request-port: 8080\\r\\ncontent-type: text/plain; charset=utf-8\\r\\ncontent-length: 22\\r\\ndate: Tue, 03 Oct 2023 18:29:54 GMT\\r\\n\\r\\n\"}\n"

Despite getting what should be a valid response, the test fails.

@rfredette rfredette changed the title WIP: Change Canary to use passthrough route OCPBUGS-9037: Change Canary to use passthrough route Oct 6, 2023
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 6, 2023
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 6, 2023
@openshift-ci-robot
Copy link
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-9037, which is invalid:

  • expected the bug to target the "4.15.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-9037, which is invalid:

  • expected the bug to target the "4.15.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rfredette
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 6, 2023
@openshift-ci-robot
Copy link
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-9037, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from lihongan October 6, 2023 20:06
@rfredette
Copy link
Contributor Author

the ovn failures require the fix in openshift/origin#28301 or something similar, but e2e-hypershift's failure seems unrelated.
/test e2e-hypershift

@rfredette
Copy link
Contributor Author

The origin test fix has merged
/retest

@frobware
Copy link
Contributor

/retest

1 similar comment
@rfredette
Copy link
Contributor Author

/retest

@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from 591e372 to c350a7e Compare October 19, 2023 19:57
@frobware
Copy link
Contributor

frobware commented Oct 20, 2023

Given the change from re-encrypt to passthrough why do we need to do the things you highlight in commit c350a7e? What breaks, given the switch to passhthrough, if you don't do the additional reconciliation logic?

@frobware
Copy link
Contributor

/assign @frobware

@rfredette
Copy link
Contributor Author

/retest

@frobware
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 29, 2023
@Miciah
Copy link
Contributor

Miciah commented Dec 4, 2023

/assign

@rfredette
Copy link
Contributor Author

/retest

Copy link
Contributor

@Miciah Miciah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. No major issues; just some comments around update edge-cases, possible races in test code, and stylistic things.

test/e2e/canary_test.go Outdated Show resolved Hide resolved
test/e2e/canary_test.go Outdated Show resolved Hide resolved
test/e2e/canary_test.go Show resolved Hide resolved
test/e2e/canary_test.go Outdated Show resolved Hide resolved
test/e2e/canary_test.go Show resolved Hide resolved
test/e2e/client_tls_test.go Outdated Show resolved Hide resolved
@@ -114,6 +115,20 @@ func New(mgr manager.Manager, config Config) (controller.Controller, error) {
if err := c.Watch(source.Kind(operatorCache, &routev1.Route{}), enqueueRequestForDefaultIngressController(config.Namespace), canaryRoutePredicate, updateFilter); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the logic in updateFilter is incorrect; the watch should ignore the update if only the port changed, and specifically only if the port changed between "8443-tcp" and "8888-tcp". If the port changes between "8080-tcp" and "8443-tcp", the watch shouldn't ignore that update.

For upgrades, this issue is probably masked by all the other watches; something will trigger reconcilation during an upgrade.

Did you test this with the ingress.operator.openshift.io/rotate-canary-route=true IngressController annotation set to make sure that that still works properly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't test with ingress.operator.openshift.io/rotate-canary-route=true earlier, but I've tested with it now and made some changes to updateFilter to only ignore updates if only the port changes, and I've verified that rotating ports doesn't trigger reconciliation.

That said, I don't think it adequately covers the case you mentioned:

If the port changes between "8080-tcp" and "8443-tcp", the watch shouldn't ignore that update.

I'm not 100% clear on the reasoning. If the route was updated from port 8080 to 8443, and assuming that was the only change, I think we would be fine to ignore that update, since 8443 is a valid port in the canary daemonset. However, if it was changed from 8443 to 8080, we would want to reconcile, since 8080 isn't one of the ports that the daemonset exposes.

Or to put it another way, I think we want to ignore updates to the route if only the port changes, and the new port is one of the ones that is exposed by the desired daemonset. Does that make sense, or am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or to put it another way, I think we want to ignore updates to the route if only the port changes, and the new port is one of the ones that is exposed by the desired daemonset. Does that make sense, or am I missing something?

That makes sense. I'm fine with that logic. It does mean that the watch predicate and the canary route are a little more tightly coupled; what do you think about adding a package variable with the set of ports that the canary route uses, using that new variable in the predicate, and adding an assertion in Test_desiredCanaryService that the service has the same set of ports?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that's a good way to go about it. I'll get started on that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Miciah I think I have this implemented in a way that I'm happy with, but I wanted your opinion on part of it. I've included the minimum viable implementation in 28050d6, but if we're going to have a package variable specifying the canary ports, I think it makes more sense to actually use that as the source of truth, so I also pushed ba50ba2 that uses the CanaryPorts variable to generate the desired ports at runtime, and uses CanaryPorts in ListenAndServerTLS instead of hardcoding which ports to serve on in the canary pod.
Let me know what you think of that; if you agree that it's the way to go, I'll squash it into the previous commit, but I wanted to leave it separate in case there's a reason to avoid making that change.

Copy link
Contributor

@frobware frobware Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's supposed to happen when I [now] set:

  annotations:
    ingress.operator.openshift.io/rotate-canary-route: "true"

from the ingress-operator I see:

2024-03-28T13:29:47.934Z        INFO    operator.route_metrics_controller       handler/enqueue_mapped.go:81    queueing ingresscontroller      {"name": "default"}
2024-03-28T13:29:47.934Z        INFO    operator.route_metrics_controller       handler/enqueue_mapped.go:81    queueing ingresscontroller      {"name": "default"}
2024-03-28T13:29:47.934Z        INFO    operator.route_metrics_controller       controller/controller.go:119    reconciling     {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-03-28T13:29:48.457Z        ERROR   operator.canary_controller      wait/backoff.go:226     error performing canary route check     {"error": "canary request received on port 8888, but route specifies 8443"}
2024-03-28T13:29:48.965Z        ERROR   operator.canary_controller      wait/backoff.go:226     error performing canary route check     {"error": "canary request received on port 8888, but route specifies 8443"}

Copy link
Contributor

@frobware frobware Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appear to have new canary pods now listening on 8443/8888:

% oc get pods -n openshift-ingress-canary
NAME                   READY   STATUS    RESTARTS   AGE
ingress-canary-bdpr6   1/1     Running   0          36m
ingress-canary-hzn9q   1/1     Running   0          36m
ingress-canary-rfxtc   1/1     Running   0          36m

% oc rsh -n openshift-ingress-canary ingress-canary-rfxtc
sh-5.1$ lsof -n -P -p 1
COMMAND   PID       USER   FD      TYPE DEVICE SIZE/OFF      NODE NAME
ingress-o   1 1000630000  cwd       DIR  0,116       28 186649044 /
ingress-o   1 1000630000  rtd       DIR  0,116       28 186649044 /
ingress-o   1 1000630000  txt       REG  0,116 96616736 161484348 /usr/bin/ingress-operator
ingress-o   1 1000630000  mem       REG    8,4          161484348 /usr/bin/ingress-operator (path dev=0,116)
ingress-o   1 1000630000  mem       REG    8,4          180430339 /usr/lib64/libz.so.1.2.11 (path dev=0,116)
ingress-o   1 1000630000  mem       REG    8,4           31474507 /usr/lib64/libcrypto.so.3.0.7 (path dev=0,116)
ingress-o   1 1000630000  mem       REG    8,4           31474502 /usr/lib64/libc.so.6 (path dev=0,116)
ingress-o   1 1000630000  mem       REG    8,4           31474566 /usr/lib64/libresolv.so.2 (path dev=0,116)
ingress-o   1 1000630000  mem       REG    8,4           31474498 /usr/lib64/ld-linux-x86-64.so.2 (path dev=0,116)
ingress-o   1 1000630000    0u      CHR    1,3      0t0         6 /dev/null
ingress-o   1 1000630000    1w     FIFO   0,13      0t0    353731 pipe
ingress-o   1 1000630000    2w     FIFO   0,13      0t0    353732 pipe
ingress-o   1 1000630000    3u     IPv6 354868      0t0       TCP *:8443 (LISTEN)
ingress-o   1 1000630000    4u  a_inode   0,14        0        62 [eventpoll:3,5,8]
ingress-o   1 1000630000    5r     FIFO   0,13      0t0    353765 pipe
ingress-o   1 1000630000    6w     FIFO   0,13      0t0    353765 pipe
ingress-o   1 1000630000    8u     IPv6 353135      0t0       TCP *:8888 (LISTEN)

And the response header appears to be correct:

sh-5.1$ curl -v -k  https://localhost:8443 2>&1 | grep x-request
< x-request-port: 8443
sh-5.1$ curl -v -k  https://localhost:8888 2>&1 | grep x-request
< x-request-port: 8888
sh-5.1$ curl -v -k  https://localhost:8443 2>&1 | grep x-request
< x-request-port: 8443
sh-5.1$ curl -v -k  https://localhost:8888 2>&1 | grep x-request
< x-request-port: 8888

but I do repeatedly see the following message from the ingress-operator:

2024-03-28T14:03:28.949Z        ERROR   operator.canary_controller      wait/backoff.go:226     error performing canary route check     {"error": "canary request received on port 8888, but route specifies 8443"}
2024-03-28T14:03:29.464Z        ERROR   operator.canary_controller      wait/backoff.go:226     error performing canary route check     {"error": "canary request received on port 8888, but route specifies 8443"}
2024-03-28T14:03:33.584Z        ERROR   operator.canary_controller      wait/backoff.go:226     error performing canary route check     {"error": "canary request received on port 8443, but route specifies 8888"}
2024-03-28T14:03:34.100Z        ERROR   operator.canary_controller      wait/backoff.go:226     error performing canary route check     {"error": "canary request received on port 8443, but route specifies 8888"}
2024-03-28T14:03:38.234Z        ERROR   operator.canary_controller      wait/backoff.go:226     error performing canary route check     {"error": "canary request received on port 8888, but route specifies 8443"}
2024-03-28T14:03:38.756Z        ERROR   operator.canary_controller      wait/backoff.go:226     error performing canary route check     {"error": "canary request received on port 8888, but route specifies 8443"}
2024-03-28T14:03:39.271Z        ERROR   operator.canary_controller      wait/backoff.go:226     error performing canary route check     {"error": "canary request received on port 8888, but route specifies 8443"}
2024-03-28T14:03:43.388Z        ERROR   operator.canary_controller      wait/backoff.go:226     error performing canary route check     {"error": "canary request received on port 8443, but route specifies 8888"}
2024-03-28T14:03:43.915Z        ERROR   operator.canary_controller      wait/backoff.go:226     error performing canary route check     {"error": "canary request received on port 8443, but route specifies 8888"}
2024-03-28T14:03:44.445Z        ERROR   operator.canary_controller      wait/backoff.go:226     error performing canary route check     {"error": "canary request received on port 8443, but route specifies 8888"}
2024-03-28T14:03:48.607Z        ERROR   operator.canary_controller      wait/backoff.go:226     error performing canary route check     {"error": "canary request received on port 8888, but route specifies 8443"}
2024-03-28T14:03:49.116Z        ERROR   operator.canary_controller      wait/backoff.go:226     error performing canary route check     {"error": "canary request received on port 8888, but route specifies 8443"}

Perhaps this error has been present for a while but given the changes in the PR we should understand why this logs an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can get this error to happen, but not nearly as consistently as you saw it. Here's what I think is happening: the ingress operator rotates the canary route's port, then very soon after it pings the canary. If the ping is fast enough, it could get to the router before the haproxy config is written. In that case, haproxy would send it to the old canary port when the ingress operator expects it to hit the new one, causing this error.
Looking at the code, when port rotation is enabled, the port update does happen almost immediately before the canary probe. I think we could reverse that order and it would work essentially the same, but you'd have much more time for the haproxy config to settle.

pkg/operator/controller/canary/service.go Show resolved Hide resolved
pkg/operator/controller/canary/service.go Outdated Show resolved Hide resolved
pkg/operator/controller/canary/service_test.go Outdated Show resolved Hide resolved
@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from 108bafc to f91f987 Compare February 20, 2024 00:47
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 20, 2024
Copy link
Contributor

openshift-ci bot commented Feb 20, 2024

New changes are detected. LGTM label has been removed.

@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from f91f987 to 2368187 Compare February 20, 2024 01:23
@lihongan
Copy link
Contributor

In the bug https://issues.redhat.com//browse/OCPBUGS-9037 it mentioned both ingress canary check and console health check issue with mTLS, looks this PR just to fix ingress canary check part, how about console route?

@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch 2 times, most recently from 0604d91 to 11b44dd Compare February 26, 2024 19:18
@rfredette
Copy link
Contributor Author

/retest

@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from 1435412 to ce7b911 Compare March 4, 2024 21:00
@rfredette
Copy link
Contributor Author

/retest

@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from ce7b911 to ba50ba2 Compare March 6, 2024 21:04
@frobware
Copy link
Contributor

Given that this change switches the canary route to use passthrough we have a couple of references to edge that we should correct:

% ag 'canary.*edge.*termination'
pkg/operator/controller/canary/http.go
32:     // Use https now that the canary route uses edge termination.
54:             // The canary route uses edge termination and the

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 10, 2024
@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from 958a227 to fbc5adc Compare May 10, 2024 03:16
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 10, 2024
rfredette and others added 4 commits May 10, 2024 15:11
TestClientTLS, TestMTLSWithCRLs, and TestRouterCompressionOperation
queried the canary route to test various router config options. However,
with the canary being switched to use a passthrough route, many router
config options no longer apply. This commit changes each test to deploy
their own backend and route.

TestClientTLS and TestMTLSWithCRLs were both changed to use the echo pod
backend, which echoes the contents of the request back to the client.

TestRouterCompressionOperation requires that the response have a
specific content-type header, which the echo pod doesn't provide. It now
deploys a httpd backend, which does include the content-type header, and
has been updated to enable compression on the appropriate content-type.

This is part of the fix for OCPBUGS-9037
Edge terminated TLS is subject to the ingress controller's client TLS
(mTLS) requirements. When mTLS is required, the client is also required
to provide a TLS certificate and key, but there is currently no way to
provide a client certificate or key to the ingress operator, causing
canary health checks to fail when mTLS is required.

To allow the canary healthchecks to work when mTLS is required, this
commit changes the canary to serve TLS and changes the canary route to
use passthrough encryption, and adds a test to verify that enabling mTLS
doesn't cause canary checks to fail.

The canary service has been updated to have the
service.beta.openshift.io/serving-cert-secret-name annotation, which
prompts the auto-generation of a secret with a certificate and key for
the canary to use for TLS. It also now serves on port 8443 instead of
8080. Previously, reconciliation of the canary service was limited to
making sure it existed, but now the operator checks for changes to the
service's annotations and ports, and updates them to the desired values
if necessary.

In order to make the serving certificate available to the canary, the
canary daemonset has been updated to mount the relevant secret, and set
the environment variables TLS_CERT and TLS_KEY to the full path to the
serving certificate and key, respectively. In order to make sure this
secret is available for the canary process, the reconciliation logic has
also been updated to keep the environment variables, volumes, and volume
mounts at their desired values.

The operator now watches the canary daemonset and service, and
reconciles canary resources when either changes.

TestCanaryWithMTLS enables mTLS on the default ingress controller, then
verifies that the CanaryChecksSucceeding condition remains true.

This is part of the fix for OCPBUGS-9037

Co-authored-by: Andrew McDermott <frobware@users.noreply.github.com>
Use the CanaryPorts service as the source of truth for which ports the
canary serves. This means using CanaryPorts to generate the service
ports at runtime, as well as using it to determine which ports the
canary pod serves on.
This reduces the likelihood that the canary will be polled before the
route port change takes effect.
@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from fbc5adc to 557f3b1 Compare May 13, 2024 19:23
@rfredette
Copy link
Contributor Author

/retest

Copy link
Contributor

openshift-ci bot commented May 15, 2024

@rfredette: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-azure-operator 557f3b1 link true /test e2e-azure-operator
ci/prow/e2e-gcp-operator 557f3b1 link true /test e2e-gcp-operator

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants