OCPBUGS-9037: Change Canary to use passthrough route #978
Force-pushed from bc30df7 to 5af9bc9
Force-pushed from 3aec956 to 591e372
Looking through the logs from the latest
On further investigation, there is a canary related failure in
Despite getting what should be a valid response, the test fails.
@rfredette: This pull request references Jira Issue OCPBUGS-9037, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@rfredette: This pull request references Jira Issue OCPBUGS-9037, which is valid. 3 validations were run on this bug. Requesting review from QA contact.
The OVN failures require the fix in openshift/origin#28301 or something similar, but
The origin test fix has merged.
/retest
/retest
Force-pushed from 591e372 to c350a7e
Given the change from re-encrypt to passthrough, why do we need to do the things you highlight in commit c350a7e? What breaks, given the switch to passthrough, if you don't do the additional reconciliation logic?
/assign @frobware
/retest
/lgtm
/assign
/retest
Looks good overall. No major issues; just some comments around update edge-cases, possible races in test code, and stylistic things.
@@ -114,6 +115,20 @@ func New(mgr manager.Manager, config Config) (controller.Controller, error) {
if err := c.Watch(source.Kind(operatorCache, &routev1.Route{}), enqueueRequestForDefaultIngressController(config.Namespace), canaryRoutePredicate, updateFilter); err != nil {
It seems like the logic in updateFilter is incorrect; the watch should ignore the update if only the port changed, and specifically only if the port changed between "8443-tcp" and "8888-tcp". If the port changes between "8080-tcp" and "8443-tcp", the watch shouldn't ignore that update.
For upgrades, this issue is probably masked by all the other watches; something will trigger reconciliation during an upgrade.
Did you test this with the ingress.operator.openshift.io/rotate-canary-route=true IngressController annotation set to make sure that still works properly?
I didn't test with ingress.operator.openshift.io/rotate-canary-route=true earlier, but I've tested with it now and made some changes to updateFilter so that it only ignores updates where the port is the only thing that changed, and I've verified that rotating ports doesn't trigger reconciliation.
That said, I don't think it adequately covers the case you mentioned:
> If the port changes between "8080-tcp" and "8443-tcp", the watch shouldn't ignore that update.
I'm not 100% clear on the reasoning. If the route was updated from port 8080 to 8443, and assuming that was the only change, I think we would be fine to ignore that update, since 8443 is a valid port in the canary daemonset. However, if it was changed from 8443 to 8080, we would want to reconcile, since 8080 isn't one of the ports that the daemonset exposes.
Or to put it another way: I think we want to ignore updates to the route if only the port changes and the new port is one of the ones exposed by the desired daemonset. Does that make sense, or am I missing something?
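A rough controller-runtime sketch of the predicate logic being proposed here; `canaryRoutePorts`, `routeUpdateFilter`, and the surrounding scaffolding are hypothetical names rather than the PR's actual code, and a complete version would also have to confirm that nothing besides the port changed:

```go
// Illustrative sketch only; canaryRoutePorts and routeUpdateFilter are
// hypothetical identifiers, not the PR's actual code.
package canary

import (
	routev1 "github.com/openshift/api/route/v1"
	"k8s.io/apimachinery/pkg/util/sets"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// canaryRoutePorts is the set of target port names the canary daemonset serves.
var canaryRoutePorts = sets.NewString("8443-tcp", "8888-tcp")

// routeUpdateFilter drops an update only when the route's target port changed
// and the new port is one the daemonset actually exposes.
var routeUpdateFilter = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		oldRoute, okOld := e.ObjectOld.(*routev1.Route)
		newRoute, okNew := e.ObjectNew.(*routev1.Route)
		if !okOld || !okNew {
			return true
		}
		if oldRoute.Spec.Port == nil || newRoute.Spec.Port == nil {
			return true
		}
		// A real implementation would also need to verify that nothing
		// other than the port differs between the old and new specs.
		oldPort := oldRoute.Spec.Port.TargetPort.String()
		newPort := newRoute.Spec.Port.TargetPort.String()
		if oldPort != newPort && canaryRoutePorts.Has(newPort) {
			return false // rotation within the served set: ignore
		}
		return true
	},
}
```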
> Or to put it another way: I think we want to ignore updates to the route if only the port changes and the new port is one of the ones exposed by the desired daemonset. Does that make sense, or am I missing something?

That makes sense. I'm fine with that logic. It does mean that the watch predicate and the canary route are a little more tightly coupled; what do you think about adding a package variable with the set of ports that the canary route uses, using that new variable in the predicate, and adding an assertion in Test_desiredCanaryService that the service has the same set of ports?
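A sketch of what that suggested assertion might look like, assuming `CanaryPorts` ends up as a package-level `[]int32` of port numbers; the variable's type, its example values, and the helper name are all assumptions:

```go
// Hypothetical test helper; CanaryPorts is assumed to be a package-level
// []int32, which the PR itself does not confirm.
package canary

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/sets"
)

var CanaryPorts = []int32{8443, 8888} // assumed shape, for illustration

// assertServiceMatchesCanaryPorts fails the test if the service's ports
// differ from the CanaryPorts package variable.
func assertServiceMatchesCanaryPorts(t *testing.T, svc *corev1.Service) {
	t.Helper()
	want := sets.NewInt32(CanaryPorts...)
	got := sets.NewInt32()
	for _, p := range svc.Spec.Ports {
		got.Insert(p.Port)
	}
	if !got.Equal(want) {
		t.Errorf("canary service ports %v don't match CanaryPorts %v", got.List(), want.List())
	}
}
```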
Yeah, I think that's a good way to go about it. I'll get started on that.
@Miciah I think I have this implemented in a way that I'm happy with, but I wanted your opinion on part of it. I've included the minimum viable implementation in 28050d6, but if we're going to have a package variable specifying the canary ports, I think it makes more sense to actually use that as the source of truth. So I also pushed ba50ba2, which uses the CanaryPorts variable to generate the desired ports at runtime and uses CanaryPorts in ListenAndServeTLS instead of hardcoding which ports to serve on in the canary pod.
Let me know what you think of that; if you agree that it's the way to go, I'll squash it into the previous commit, but I wanted to leave it separate in case there's a reason to avoid making that change.
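Serving TLS on every port in `CanaryPorts` rather than a hardcoded list could look roughly like the sketch below. This is not the PR's actual canary code: the `[]int32` shape of `CanaryPorts` and the handler body are assumptions, with the `x-request-port` header mirroring what the canary is observed to return in the curl output later in this thread, and TLS_CERT/TLS_KEY matching the environment variables described in the commit message below.

```go
// Rough sketch, not the PR's actual canary code.
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

var CanaryPorts = []int32{8443, 8888} // assumed values from the discussion

func main() {
	// TLS_CERT and TLS_KEY hold the full paths to the serving certificate
	// and key, per the commit message in this PR.
	certFile, keyFile := os.Getenv("TLS_CERT"), os.Getenv("TLS_KEY")
	for _, port := range CanaryPorts {
		port := port // capture the per-iteration value for the closure
		mux := http.NewServeMux()
		mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
			// Report which port received the request, mirroring the
			// x-request-port header seen in the curl output below.
			w.Header().Set("x-request-port", fmt.Sprint(port))
			fmt.Fprintln(w, "Healthcheck requested")
		})
		go func(addr string, h http.Handler) {
			log.Fatal(http.ListenAndServeTLS(addr, certFile, keyFile, h))
		}(fmt.Sprintf(":%d", port), mux)
	}
	select {} // block while the listeners serve
}
```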
What's supposed to happen when I [now] set:
annotations:
  ingress.operator.openshift.io/rotate-canary-route: "true"
From the ingress-operator I see:
2024-03-28T13:29:47.934Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:81 queueing ingresscontroller {"name": "default"}
2024-03-28T13:29:47.934Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:81 queueing ingresscontroller {"name": "default"}
2024-03-28T13:29:47.934Z INFO operator.route_metrics_controller controller/controller.go:119 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-03-28T13:29:48.457Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "canary request received on port 8888, but route specifies 8443"}
2024-03-28T13:29:48.965Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "canary request received on port 8888, but route specifies 8443"}
I appear to have new canary pods now listening on 8443/8888:
% oc get pods -n openshift-ingress-canary
NAME READY STATUS RESTARTS AGE
ingress-canary-bdpr6 1/1 Running 0 36m
ingress-canary-hzn9q 1/1 Running 0 36m
ingress-canary-rfxtc 1/1 Running 0 36m
% oc rsh -n openshift-ingress-canary ingress-canary-rfxtc
sh-5.1$ lsof -n -P -p 1
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
ingress-o 1 1000630000 cwd DIR 0,116 28 186649044 /
ingress-o 1 1000630000 rtd DIR 0,116 28 186649044 /
ingress-o 1 1000630000 txt REG 0,116 96616736 161484348 /usr/bin/ingress-operator
ingress-o 1 1000630000 mem REG 8,4 161484348 /usr/bin/ingress-operator (path dev=0,116)
ingress-o 1 1000630000 mem REG 8,4 180430339 /usr/lib64/libz.so.1.2.11 (path dev=0,116)
ingress-o 1 1000630000 mem REG 8,4 31474507 /usr/lib64/libcrypto.so.3.0.7 (path dev=0,116)
ingress-o 1 1000630000 mem REG 8,4 31474502 /usr/lib64/libc.so.6 (path dev=0,116)
ingress-o 1 1000630000 mem REG 8,4 31474566 /usr/lib64/libresolv.so.2 (path dev=0,116)
ingress-o 1 1000630000 mem REG 8,4 31474498 /usr/lib64/ld-linux-x86-64.so.2 (path dev=0,116)
ingress-o 1 1000630000 0u CHR 1,3 0t0 6 /dev/null
ingress-o 1 1000630000 1w FIFO 0,13 0t0 353731 pipe
ingress-o 1 1000630000 2w FIFO 0,13 0t0 353732 pipe
ingress-o 1 1000630000 3u IPv6 354868 0t0 TCP *:8443 (LISTEN)
ingress-o 1 1000630000 4u a_inode 0,14 0 62 [eventpoll:3,5,8]
ingress-o 1 1000630000 5r FIFO 0,13 0t0 353765 pipe
ingress-o 1 1000630000 6w FIFO 0,13 0t0 353765 pipe
ingress-o 1 1000630000 8u IPv6 353135 0t0 TCP *:8888 (LISTEN)
And the response header appears to be correct:
sh-5.1$ curl -v -k https://localhost:8443 2>&1 | grep x-request
< x-request-port: 8443
sh-5.1$ curl -v -k https://localhost:8888 2>&1 | grep x-request
< x-request-port: 8888
sh-5.1$ curl -v -k https://localhost:8443 2>&1 | grep x-request
< x-request-port: 8443
sh-5.1$ curl -v -k https://localhost:8888 2>&1 | grep x-request
< x-request-port: 8888
but I do repeatedly see the following message from the ingress-operator:
2024-03-28T14:03:28.949Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "canary request received on port 8888, but route specifies 8443"}
2024-03-28T14:03:29.464Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "canary request received on port 8888, but route specifies 8443"}
2024-03-28T14:03:33.584Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "canary request received on port 8443, but route specifies 8888"}
2024-03-28T14:03:34.100Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "canary request received on port 8443, but route specifies 8888"}
2024-03-28T14:03:38.234Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "canary request received on port 8888, but route specifies 8443"}
2024-03-28T14:03:38.756Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "canary request received on port 8888, but route specifies 8443"}
2024-03-28T14:03:39.271Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "canary request received on port 8888, but route specifies 8443"}
2024-03-28T14:03:43.388Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "canary request received on port 8443, but route specifies 8888"}
2024-03-28T14:03:43.915Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "canary request received on port 8443, but route specifies 8888"}
2024-03-28T14:03:44.445Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "canary request received on port 8443, but route specifies 8888"}
2024-03-28T14:03:48.607Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "canary request received on port 8888, but route specifies 8443"}
2024-03-28T14:03:49.116Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "canary request received on port 8888, but route specifies 8443"}
Perhaps this error has been present for a while, but given the changes in this PR we should understand why it is logged.
I can get this error to happen, but not nearly as consistently as you saw it. Here's what I think is happening: the ingress operator rotates the canary route's port, then pings the canary very soon after. If the ping is fast enough, it can reach the router before the new haproxy config is written. In that case, haproxy sends the request to the old canary port while the ingress operator expects it to hit the new one, causing this error.
Looking at the code, when port rotation is enabled, the port update does happen almost immediately before the canary probe. I think we could reverse that order and it would work essentially the same, but the haproxy config would have much more time to settle.
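Sketched out, the proposed fix is just swapping two steps in the periodic check loop. The function names here are hypothetical stand-ins for the operator's real code, which uses wait/backoff machinery rather than a plain ticker:

```go
// Illustrative only; probeCanaryRoute and rotateRoutePort are hypothetical
// stand-ins for the operator's actual logic.
package main

import (
	"log"
	"time"
)

const probeInterval = 5 * time.Second

func probeCanaryRoute() error { return nil } // stand-in for the real probe
func rotateRoutePort()        {}             // stand-in for the real rotation

func canaryCheckLoop(rotationEnabled bool) {
	for range time.Tick(probeInterval) {
		// Probe first: the route has pointed at the current port for a
		// full interval, so haproxy has had time to reload its config.
		if err := probeCanaryRoute(); err != nil {
			log.Printf("error performing canary route check: %v", err)
		}
		// Rotate afterwards, giving the router until the next tick to
		// pick up the new target port before it is probed again.
		if rotationEnabled {
			rotateRoutePort()
		}
	}
}

func main() { canaryCheckLoop(true) }
```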
Force-pushed from 108bafc to f91f987
New changes are detected. LGTM label has been removed.
Force-pushed from f91f987 to 2368187
The bug https://issues.redhat.com//browse/OCPBUGS-9037 mentions issues with both the ingress canary check and the console health check under mTLS. It looks like this PR only fixes the ingress canary check part; what about the console route?
Force-pushed from 0604d91 to 11b44dd
/retest
Force-pushed from 1435412 to ce7b911
/retest
Force-pushed from ce7b911 to ba50ba2
Given that this change switches the canary route to use passthrough, we have a couple of references to edge termination that we should correct:
% ag 'canary.*edge.*termination'
pkg/operator/controller/canary/http.go
32: // Use https now that the canary route uses edge termination.
54: // The canary route uses edge termination and the
Force-pushed from 958a227 to fbc5adc
TestClientTLS, TestMTLSWithCRLs, and TestRouterCompressionOperation queried the canary route to test various router config options. However, with the canary being switched to use a passthrough route, many router config options no longer apply. This commit changes each test to deploy its own backend and route.

TestClientTLS and TestMTLSWithCRLs were both changed to use the echo pod backend, which echoes the contents of the request back to the client. TestRouterCompressionOperation requires that the response have a specific content-type header, which the echo pod doesn't provide; it now deploys an httpd backend, which does include the content-type header, and has been updated to enable compression on the appropriate content type.

This is part of the fix for OCPBUGS-9037.
Edge-terminated TLS is subject to the ingress controller's client TLS (mTLS) requirements. When mTLS is required, the client must also provide a TLS certificate and key, but there is currently no way to provide a client certificate or key to the ingress operator, causing canary health checks to fail when mTLS is required. To allow the canary health checks to work when mTLS is required, this commit changes the canary to serve TLS, changes the canary route to use passthrough termination, and adds a test to verify that enabling mTLS doesn't cause canary checks to fail.

The canary service has been updated to have the service.beta.openshift.io/serving-cert-secret-name annotation, which prompts the auto-generation of a secret with a certificate and key for the canary to use for TLS. It also now serves on port 8443 instead of 8080. Previously, reconciliation of the canary service was limited to making sure it existed; now the operator checks for changes to the service's annotations and ports, and updates them to the desired values if necessary.

To make the serving certificate available to the canary, the canary daemonset has been updated to mount the relevant secret and to set the environment variables TLS_CERT and TLS_KEY to the full paths of the serving certificate and key, respectively. To make sure this secret is available to the canary process, the reconciliation logic has also been updated to keep the environment variables, volumes, and volume mounts at their desired values. The operator now watches the canary daemonset and service, and reconciles canary resources when either changes.

TestCanaryWithMTLS enables mTLS on the default ingress controller, then verifies that the CanaryChecksSucceeding condition remains true.

This is part of the fix for OCPBUGS-9037.

Co-authored-by: Andrew McDermott <frobware@users.noreply.github.com>
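As a sketch of the service shape that commit describes: the annotation key, the 8443 port, and the namespace come from this thread, while the secret name "canary-serving-cert" and the remaining metadata are assumptions.

```go
// Hedged sketch of the described canary service; "canary-serving-cert" is an
// assumed secret name, not taken from the PR.
package canary

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func desiredCanaryService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "ingress-canary",
			Namespace: "openshift-ingress-canary",
			Annotations: map[string]string{
				// Asks the service CA operator to generate a secret
				// holding a serving certificate and key for the canary.
				"service.beta.openshift.io/serving-cert-secret-name": "canary-serving-cert",
			},
		},
		Spec: corev1.ServiceSpec{
			Ports: []corev1.ServicePort{{
				Name:       "8443-tcp",
				Port:       8443,
				TargetPort: intstr.FromInt(8443),
				Protocol:   corev1.ProtocolTCP,
			}},
		},
	}
}
```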
Use the CanaryPorts variable as the source of truth for which ports the canary serves. This means using CanaryPorts to generate the service ports at runtime, as well as using it to determine which ports the canary pod serves on.
This reduces the likelihood that the canary will be polled before the route port change takes effect.
Force-pushed from fbc5adc to 557f3b1
/retest
@rfredette: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
When the default ingress controller is configured to use mTLS, connecting to edge or reencrypt routes requires the client to present a valid certificate/key pair. There is no way for a user to provide a client certificate or key to the ingress operator, so canary checks using edge termination fail. This PR changes the canary route to passthrough and has the canary itself handle the TLS handshake, allowing the canary check to function even when mTLS is required.
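To make the core change concrete, a passthrough canary route built with the standard openshift/api route types might look like the following sketch; the object names and target port value are illustrative rather than lifted from the PR.

```go
// Sketch using standard openshift/api route types; names and values are
// illustrative, not the PR's actual manifests.
package canary

import (
	routev1 "github.com/openshift/api/route/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func desiredCanaryRoute() *routev1.Route {
	return &routev1.Route{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "canary",
			Namespace: "openshift-ingress-canary",
		},
		Spec: routev1.RouteSpec{
			To: routev1.RouteTargetReference{
				Kind: "Service",
				Name: "ingress-canary",
			},
			Port: &routev1.RoutePort{TargetPort: intstr.FromString("8443-tcp")},
			TLS: &routev1.TLSConfig{
				// With passthrough termination the router forwards the
				// raw TLS stream to the canary pod, which completes the
				// handshake itself, so the router's client-certificate
				// (mTLS) requirements never apply to this route.
				Termination: routev1.TLSTerminationPassthrough,
			},
		},
	}
}
```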