OCPBUGS-14914: Set tunnel and server timeouts only at the backend level #536
base: master
Conversation
@alebedev87: This pull request references Jira Issue OCPBUGS-14914, which is invalid.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 6a1044e to ef08ed2.
/jira refresh
@alebedev87: This pull request references Jira Issue OCPBUGS-14914, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.

Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
@alebedev87: This pull request references Jira Issue OCPBUGS-14914, which is valid. 3 validation(s) were run on this bug.

Requesting review from QA contact.
Tested it with 4.15.0-0.ci.test-2023-11-16-093712-ci-ln-dr45512-latest:

```
% oc get route
% oc annotate route myedge haproxy.router.openshift.io/timeout-tunnel=70m
% oc -n openshift-ingress rsh router-default-d78ddc6f-bprk5
sh-4.4# wget --no-check-certificate --limit-rate=1000 --delete-after https://myedge-default.apps.ci-ln-9r4m3fk-72292.origin-ci-int-gce.dev.rhcloud.com/oc
...
2023-11-17 04:42:09 (1000 B/s) - Read error at byte 4847481/121359408 (The TLS connection was non-properly terminated.). Giving up.
04:41:53.345283 IP (tos 0x0, ttl 62, id 24165, offset 0, flags [DF], proto TCP (6), length 628)
```
Back to WIP to address the config warning about the missing server timeout.
Force-pushed from 4113831 to 6096a39.
/test e2e-agnostic

/test e2e-metal-ipi-ovn-ipv6
@alebedev87: This pull request references Jira Issue OCPBUGS-14914, which is valid. 3 validation(s) were run on this bug.

Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
I still see these errors in the most recent test results:
Did you mean to remove the WIP label?
/label qe-approved
@alebedev87 could you add an e2e test for this? Also, I think we need to address the "middle server" issue. Could you add some more details in the description explaining the solution design?
Force-pushed from 6096a39 to ff40068.
The warning about the server timeout in the middle backends is addressed by setting them to the maximum timeout among all the routes. This was one of the suggestions from the upstream issue.
@candita: yes, I plan to add a test case for this fix in the cluster-ingress-operator's e2e.
Tested it again with 4.16.0-0.test-2023-12-25-011937-ci-ln-2g1gk12-latest; the WebSocket client was disconnected after about 30 seconds:

```
sh-4.4# ./run-cli.sh
real 0m30.100s
```
Tested it with 4.16.0-0.test-2024-01-03-013347-ci-ln-mfg1zck-latest. The default clientTimeout is 30s and the default serverTimeout is 300s. After configuring clientTimeout and serverTimeout with 2h, the WebSocket client was disconnected after 70m, as expected:

```
% oc -n openshift-ingress-operator get ingresscontroller default -oyaml | grep -A3 tuningOptions:
% oc -n tunnel-timeout get route hello-openshift -oyaml | grep -A1 annotations:
sh-4.4# ./run-cli.sh
real 70m0.112s
% oc get clusterversion
```
/label qe-approved
Tested serverTimeout with 10s, 120s, 600s, and the default; the function worked as expected in a created route:

```
backend be_http:default:unsec-server3 timeout check 5000ms
backend be_http:default:unsec-server3
```
/retitle OCPBUGS-14914: set timeouts on the backend level

Wrong PR was WIPed, sorry.

/label qe-approved
```
@@ -425,6 +424,9 @@ frontend fe_sni
##########################################################################
# backend for when sni does not exist, or ssl term needs to happen on the edge
backend be_no_sni
{{- with $value := maxTimeoutFirstMatchedAndClipped .State "haproxy.router.openshift.io/timeout" $timeSpecPattern (env "ROUTER_DEFAULT_SERVER_TIMEOUT") "30s" }}
```
Why do you call the special `maxTimeoutFirstMatchedAndClipped` here but continue to call `clipHAProxyTimeoutValue` for other backend stanzas?
We want the maximum of all the backend timeouts to be set on the middle backends (`be_sni`, `be_no_sni`). Otherwise we get the warning during the config reloading.
```
@@ -312,6 +308,9 @@ frontend public_ssl
# traffic
##########################################################################
backend be_sni
{{- with $value := maxTimeoutFirstMatchedAndClipped .State "haproxy.router.openshift.io/timeout" $timeSpecPattern (env "ROUTER_DEFAULT_SERVER_TIMEOUT") "30s" }}
  timeout server {{ $value }}
```
Was this annotation ever being applied before?
I didn't understand which annotation you were referring to. `timeout server` was not set in `be_sni` before; it's needed to prevent this warning message.
The middle backends (`be_sni`, `be_no_sni`) are updated with the server timeout set to the maximum value among all routes from the configuration. This prevents a warning message during config reload.
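An illustrative sketch of the resulting HAProxy configuration (backend names and timeout values here are made up for the example, not the exact generated config): with routes whose `haproxy.router.openshift.io/timeout` annotations are 30s and 5m, the middle backends get the maximum, which silences the missing-`timeout server` warning on reload while per-route timeouts still apply on the bottom backends.

```
backend be_sni
    timeout server 5m    # max over all route timeouts (and the default)

backend be_no_sni
    timeout server 5m

backend be_http:default:route-a
    timeout server 30s   # per-route annotation still wins here

backend be_http:default:route-b
    timeout server 5m
```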
Force-pushed from ff40068 to e165421.
#536 (comment) addressed.
@alebedev87: This pull request references Jira Issue OCPBUGS-14914, which is valid. 3 validation(s) were run on this bug.

Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.

If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test e2e-upgrade
@alebedev87: all tests passed!
Tested it with default, 10s, 90s, 70m on 4.16.0-0.test-2024-02-01-013554-ci-ln-g418bqb-latest:

```
% oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
sh-4.4# ./run-cli.sh
real 0m15.130s
real 0m10.132s
real 0m30.128s
sh-4.4# ./run-cli.sh
real 1m30.138s
sh-4.4# ./run-cli.sh
real 70m0.140s
```
/label qe-approved
```
@@ -312,6 +308,9 @@ frontend public_ssl
# traffic
##########################################################################
backend be_sni
{{- with $value := maxTimeoutFirstMatchedAndClipped .State "haproxy.router.openshift.io/timeout" $timeSpecPattern (env "ROUTER_DEFAULT_SERVER_TIMEOUT") "30s" }}
```
As you need the same value in both `be_sni` and `be_no_sni`, you can avoid computing the value twice by using something like `{{- $maxServerTimeout := maxTimeoutFirstMatchedAndClipped .State "haproxy.router.openshift.io/timeout" $timeSpecPattern (env "ROUTER_DEFAULT_SERVER_TIMEOUT") "30s" }}` at the top of the file. (Based on the other review comments, it might also be helpful to add a `{{- /* comment */}}` at the same place explaining why we need to compute the max timeout.)
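Put together, the suggestion might look like this in the template (a sketch assembled from the review comments, not the final committed code):

```
{{- /* Computed once at the top of the template so that be_sni and
       be_no_sni can reuse it. The maximum over all route timeouts is
       needed on the middle backends to avoid the missing
       "timeout server" warning during config reload. */}}
{{- $maxServerTimeout := maxTimeoutFirstMatchedAndClipped .State "haproxy.router.openshift.io/timeout" $timeSpecPattern (env "ROUTER_DEFAULT_SERVER_TIMEOUT") "30s" }}

backend be_sni
  timeout server {{ $maxServerTimeout }}

backend be_no_sni
  timeout server {{ $maxServerTimeout }}
```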
```
timeout server-fin {{ firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_SERVER_FIN_TIMEOUT") "1s" }}
timeout http-request {{ firstMatch $timeSpecPattern (env "ROUTER_SLOWLORIS_TIMEOUT") "10s" }}
timeout http-keep-alive {{ firstMatch $timeSpecPattern (env "ROUTER_SLOWLORIS_HTTP_KEEPALIVE") "300s" }}

# Long timeout for WebSocket connections.
timeout tunnel {{ firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_TUNNEL_TIMEOUT") "1h" }}
```
Why don't you need to specify `timeout tunnel` on `be_sni` or `be_no_sni`? Is HAProxy more lenient about not specifying `timeout tunnel` than it is about not specifying `timeout server`?

Does omitting `timeout tunnel` result in an infinite timeout, similarly to omitting `timeout server`, or does it fall back to using the server timeout as the tunnel timeout? Crucially, if you have a route that has a tunnel timeout that is greater than any route's server timeout, is the tunnel timeout respected, or does the server timeout in `be_sni`/`be_no_sni` cause the connection to be terminated before the route's tunnel timeout?
> Why don't you need to specify timeout tunnel on be_sni or be_no_sni? Is HAProxy more lenient about not specifying timeout tunnel than it is about not specifying timeout server?

Yes. Only the missing server timeout gives a warning message. I'm not sure the warning reports a real problem, though, taking into account that the bottom backends will still have the server timeout set.

> Does omitting timeout tunnel result in an infinite timeout, similarly to omitting timeout server, or does it fall back to using the server timeout as the tunnel timeout? Crucially, if you have a route that has a tunnel timeout that is greater than any route's server timeout, is the tunnel timeout respected, or does the server timeout in be_sni/be_no_sni cause the connection to be terminated before the route's tunnel timeout?

Regarding the link between the server and tunnel timeouts: I discovered that the tunnel timeout supersedes the server timeout in TCP proxies. Note that the middle backends in the router configuration are TCP proxies. So setting the tunnel timeout in `defaults` or the middle backends (`be_sni`/`be_no_sni`) results in interference with the server timeout behavior set on the bottom backend level. I didn't see the opposite, though: the server timeout set on `defaults` or the middle backends didn't influence the tunnel timeout set on the bottom backends.
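The interaction described above can be sketched as an HAProxy configuration fragment (backend names and values are illustrative, not the exact generated config):

```
# In a "mode tcp" middle backend, "timeout tunnel" supersedes
# "timeout server" once the connection is established. Setting it here
# would therefore override the per-route tunnel/server timeouts applied
# on the bottom backends, so it is intentionally left unset; only a
# missing "timeout server" triggers a warning on reload.
backend be_sni
    mode tcp
    timeout server 5m      # needed only to silence the reload warning
    # timeout tunnel ...   # intentionally omitted

backend be_http:default:myroute
    mode http
    timeout server 30s
    timeout tunnel 1h      # per-route tunnel timeout stays in effect
```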
```
@@ -581,10 +583,10 @@ backend {{ genBackendNamePrefix $cfg.TLSTermination }}:{{ $cfgIdx }}
  {{- end }}
  tcp-request content reject if !whitelist
  {{- end }}
- {{- with $value := clipHAProxyTimeoutValue (firstMatch $timeSpecPattern (index $cfg.Annotations "haproxy.router.openshift.io/timeout")) }}
+ {{- with $value := clipHAProxyTimeoutValue (firstMatch $timeSpecPattern (index $cfg.Annotations "haproxy.router.openshift.io/timeout") (env "ROUTER_DEFAULT_SERVER_TIMEOUT") "30s") }}
```
All these additional `env` calls add some overhead. As Andy says, "Trips across the Go/Template boundary can be expensive, particularly when you have many, many routes (10,000+)." Consider adding `{{- $defaultServerTimeout := firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_SERVER_TIMEOUT") "30s" }}` at the top of the template and using `{{- with $value := clipHAProxyTimeoutValue (firstMatch $timeSpecPattern (index $cfg.Annotations "haproxy.router.openshift.io/timeout") $defaultServerTimeout) }}` here.

As noted in #483 (review), `clipHAProxyTimeoutValue (firstMatch $timeSpecPattern foo)` could be simplified as `clipHAProxyTimeoutValue foo`. However, that might be better left as a separate change.

...And actually, now that I look at it again, I believe `clipHAProxyTimeoutValue (firstMatch $timeSpecPattern foo)` will silently ignore invalid values while `clipHAProxyTimeoutValue foo` will log them. Logging errors for invalid values might be a positive change, but it's an additional behavioral change that you might not want to tie to OCPBUGS-14914.

...Although thinking about it still further, I wonder: is `maxTimeoutFirstMatchedAndClipped` going to cause `clipHAProxyTimeoutValue` to log an error for each route that has an invalid annotation value, where before this change those invalid values get filtered out by `firstMatch` before `clipHAProxyTimeoutValue` has a chance to report any errors?
```
{{- with $value := clipHAProxyTimeoutValue (firstMatch $timeSpecPattern (index $cfg.Annotations "haproxy.router.openshift.io/timeout") (env "ROUTER_DEFAULT_SERVER_TIMEOUT") "30s") }}
  timeout server {{ $value }}
{{- end }}
{{- with $value := clipHAProxyTimeoutValue (firstMatch $timeSpecPattern (index $cfg.Annotations "haproxy.router.openshift.io/timeout-tunnel") (index $cfg.Annotations "haproxy.router.openshift.io/timeout") (env "ROUTER_DEFAULT_TUNNEL_TIMEOUT") "1h") }}
```
Please replace `env "ROUTER_DEFAULT_SERVER_TIMEOUT"` and `env "ROUTER_DEFAULT_TUNNEL_TIMEOUT"` each with a variable that is computed just once, if possible.
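The hoisting being asked for might look like this in the template (a sketch following the review comments; variable names are suggestions, not the committed code):

```
{{- /* Evaluated once at the top of the template instead of once per
       route, so env is crossed into only twice. */}}
{{- $defaultServerTimeout := firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_SERVER_TIMEOUT") "30s" }}
{{- $defaultTunnelTimeout := firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_TUNNEL_TIMEOUT") "1h" }}

{{- /* Then, in each backend stanza: */}}
{{- with $value := clipHAProxyTimeoutValue (firstMatch $timeSpecPattern (index $cfg.Annotations "haproxy.router.openshift.io/timeout") $defaultServerTimeout) }}
  timeout server {{ $value }}
{{- end }}
```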
```go
timeout := clipHAProxyTimeoutValue(firstMatch(pattern, append([]string{cfg.Annotations[annotation]}, values...)...))
if timeout != "" {
	// No error handling because clipHAProxyTimeoutValue returns
	// a valid timeout or an empty string. The latter is already handled.
	timeoutDuration, _ := time.ParseDuration(timeout)
```
This is a lot of redundant computation:

- `maxTimeoutFirstMatchedAndClipped` parses each route's timeout annotation value using `firstMatch`, then (if the value is valid) `clipHAProxyTimeoutValue`, then `time.ParseDuration`.
- `fe_sni` calls `maxTimeoutFirstMatchedAndClipped`, so you're parsing each valid value 3 times.
- `fe_no_sni` also calls `maxTimeoutFirstMatchedAndClipped`, so you're parsing each valid value 3 more times.
- The backend calls `clipHAProxyTimeoutValue (firstMatch $timeSpecPattern ...)`, so you're parsing each valid value 2 more times.

Granted, the regexp cache that Andy added in #268 avoids the cost of compiling the same regexp repeatedly, but you're still repeatedly doing a regexp match and then `time.ParseDuration` on the same value, right?
```go
"parseIPList":                      parseIPList,                      //parses the list of IPs/CIDRs (IPv4/IPv6)
"clipHAProxyTimeoutValue":          clipHAProxyTimeoutValue,          //clips extraordinarily high timeout values to be below the maximum allowed timeout value
"maxTimeoutFirstMatchedAndClipped": maxTimeoutFirstMatchedAndClipped, //finds the maximum timeout managed by a given annotation among all the routes, matches it against a given pattern, and clips
```
Keep in mind that this list of helper functions is effectively an API. We cannot rename or remove a helper function without potentially breaking third-party templates, so we tend to prefer defining general and composable helpers. That said, I think it is worth putting this functionality in a single specialized helper to avoid excessive performance impact.
```go
annotation string
pattern    string
```
Could you add test cases with different values for `annotation` and `pattern`?
```go
{
	name: "Route timeout clipped",
	state: map[ServiceAliasConfigKey]ServiceAliasConfig{
		"test:route1": {
			Annotations: map[string]string{
				"haproxy.router.openshift.io/timeout": "999999999s",
			},
		},
	},
	annotation: "haproxy.router.openshift.io/timeout",
	pattern:    `[1-9][0-9]*(us|ms|s|m|h|d)?`,
	values:     []string{"30s"},
	expected:   "2147483647ms",
},
```
How about a test case with `values: []string{"999999999s"}`?
```go
{
	name: "Default timeout is maximum",
	state: map[ServiceAliasConfigKey]ServiceAliasConfig{
		"test:route1": {},
		"test:route2": {},
		"test:route3": {},
	},
	annotation: "haproxy.router.openshift.io/timeout",
	pattern:    `[1-9][0-9]*(us|ms|s|m|h|d)?`,
	values:     []string{"30s"},
	expected:   "30s",
},
```
Should at least one of the routes specify a value? It makes sense to me to have a test case for "some routes specify timeouts, but the default timeout is maximum".
By the way, what happens if you have 0 routes?
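A possible shape for such a zero-routes test case, mirroring the table entries above (the `expected` value here is an assumption about the intended fallback behavior, not a verified result):

```go
{
	name:       "No routes",
	state:      map[ServiceAliasConfigKey]ServiceAliasConfig{},
	annotation: "haproxy.router.openshift.io/timeout",
	pattern:    `[1-9][0-9]*(us|ms|s|m|h|d)?`,
	values:     []string{"30s"},
	expected:   "30s", // assumed: the max over zero routes falls back to the default
},
```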
This PR removes the tunnel and server timeouts from the global `defaults` section and sets them on all the backends. The route annotations for the timeouts continue to work as they did before.

As suggested in the upstream issue, the middle backends (`be_sni` and `be_no_sni`) are set with the maximum of all the route timeouts to avoid a warning message during config reload.

Tested using the haproxy-timeout-tunnel repository.

The e2e test: openshift/cluster-ingress-operator#1013.