
OCPBUGS-23221: internalServiceChanged: Fix target port logic #1052

Open
Miciah wants to merge 13 commits into master from OCPBUGS-23221-internalServiceChanged-fix-target-port-logic

Conversation

@Miciah (Contributor) commented Apr 30, 2024

internalServiceChanged: Fix target port logic

Fix internalServiceChanged to copy target ports from the expected service rather than preserving target ports from the current service. Only copy NodePort and other fields that we want to preserve.

Before this PR, the internalServiceChanged function explicitly preserved the current service's ports' target ports when reconciling the service. However, the function was intended to update the target ports if they changed.

In particular, if the current service specifies a numeric target port for the metrics port, then the operator should replace the port number with a port name so that the service uses the correct port even if the pod container's port number changes. Without this change, updating an IngressController's spec.endpointPublishingStrategy.hostNetwork.statsPort API field can break metrics as the service's metrics port's numeric target port may no longer match the pod container's metrics port.

Follow-up to #864.
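
For illustration, here is a minimal sketch of the merge direction this change establishes; the variable names follow the snippets quoted later in this review, and it is a sketch rather than the exact diff:

// Start from the expected (desired) spec so that managed fields,
// including target ports, come from what the operator computes.
updated := current.DeepCopy()
updated.Spec = expected.Spec

// Copy back only the fields the operator deliberately leaves alone;
// for ports, that is NodePort, which the API server assigns.
for i, updatedPort := range updated.Spec.Ports {
    for _, currentPort := range current.Spec.Ports {
        if currentPort.Name == updatedPort.Name {
            updated.Spec.Ports[i].NodePort = currentPort.NodePort
        }
    }
}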

Test_*Changed: Verify that some update was made

This is a follow-up to the previous change for consistency.

crdChanged, nodePortServiceChanged: Fix preserve logic

Fix the update logic in crdChanged and nodePortServiceChanged to preserve fields that are ignored when checking whether an update is necessary.

Follow-up to #567, #890, and #927.

Set SessionAffinity to avoid confusing diffs

If a service does not specify spec.sessionAffinity, the API sets a default value of "None". This means that when creating or updating a service, the empty value and "None" are equivalent, and so when reconciling a service, cluster-ingress-operator treats these values as equal. This results in the correct update behavior: The operator ignores the API defaulting, but if someone sets any value other than the empty value or "None", the operator reverts the update; and if the operator leaves spec.sessionAffinity empty when reverting the update, the API again defaults it to "None".

However, even though the updating behavior is correct, it can cause confusing log output when the operator logs the diff with the changes: If the operator has the empty value in the updated object but the current value is actually "None", this shows up in the diff, even though the actual update does not change the field value when taking defaulting into account.

This change adds an explicit spec.sessionAffinity: None setting to the services that the operator manages to avoid having the API defaulting show up in these diffs.

Follow-up to #407.
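
A minimal sketch of the kind of change involved, assuming the usual desired-service builder (the exact location in the operator's code is not shown here):

// Set sessionAffinity explicitly so that API defaulting ("" -> "None")
// no longer shows up as a spurious field change in logged diffs.
desired.Spec.SessionAffinity = corev1.ServiceAffinityNone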

Test_desiredLoadBalancerService: Add assertions

Add assertions for service type, externalTrafficPolicy, internalTrafficPolicy, and ports, for consistency with Test_desiredInternalIngressControllerService.

Test_desiredInternalIngressControllerService: Fix godoc

Fix capitalization of the test function name in the godoc.

TestHostNetworkPortBinding: Use t.Log, t.Cleanup

Update TestHostNetworkPortBinding to comply with current conventions. Convert some comments into t.Log statements, replace defer with t.Cleanup, and tweak some error messages.
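
A generic illustration of these conventions; newHostNetworkIngressController is a hypothetical setup helper, not a function from the test suite:

func TestHostNetworkPortBinding(t *testing.T) {
    ic := newHostNetworkIngressController(t) // hypothetical setup helper
    if err := kclient.Create(context.TODO(), ic); err != nil {
        t.Fatalf("failed to create ingresscontroller: %v", err)
    }
    // Register cleanup with t.Cleanup rather than defer, per current conventions.
    t.Cleanup(func() {
        if err := kclient.Delete(context.TODO(), ic); err != nil {
            t.Errorf("failed to delete ingresscontroller %s: %v", ic.Name, err)
        }
    })
    t.Log("waiting for the ingresscontroller deployment to become ready")
    // ... assertions on the deployment and the internal service follow ...
}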

TestHostNetworkPortBinding: Check service ports

Update TestHostNetworkPortBinding to verify that the internal service has the expected ports. In particular, the metrics port should have its target port set to the name "metrics" rather than a port number.

Test_deploymentConfigChanged: Use k8s.io/utils/ptr

Replace use of the deprecated "k8s.io/utils/pointer" package by using "k8s.io/utils/ptr" instead.
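
The change is mechanical; roughly:

// Before, with the deprecated package:
//     port := pointer.Int32(1936)
// After, with k8s.io/utils/ptr:
port := ptr.To[int32](1936)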

Set "service-ca-bundle" volume's default mode

Set an explicit value of 420 decimal (0644 octal) for the router deployment's "service-ca-bundle" volume's default mode.

Previously, the router deployment yaml manifest had defaultMode indented incorrectly, and so the operator did not actually set the intended value for the volume's default mode. Because the intended value is equal to the default value, the volume ultimately ends up with the intended default mode anyway. However, the fact that the operator did not specify the default mode made log output show a bogus diff for the volume when the operator updated the deployment.

After this change, the yaml is correctly indented, and the log output for a deployment update has a less noisy diff.

Follow-up to #399.

Specify default mode on configmap/secret volumes

Set an explicit value of 420 decimal (0644 octal) for the default mode of all configmap or secret volumes in the router deployment that are defined in Go code (as opposed to being defined in the yaml manifest): client-ca, default-certificate, error-pages, metrics-certs, rsyslog-config, service-ca-bundle, and stats-auth.

After this change, the log output for a deployment update has a less noisy diff.
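
A hedged sketch of what the explicit default mode looks like for one of the Go-defined volumes; the names and the secret reference are illustrative, not copied from the operator:

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/utils/ptr"
)

// metricsCertsVolume illustrates setting the default mode explicitly.
func metricsCertsVolume() corev1.Volume {
    return corev1.Volume{
        Name: "metrics-certs",
        VolumeSource: corev1.VolumeSource{
            Secret: &corev1.SecretVolumeSource{
                SecretName: "router-metrics-certs-default", // illustrative name
                // 420 decimal is 0644 octal; this matches the API default but
                // is set explicitly so deployment diffs stay quiet.
                DefaultMode: ptr.To[int32](420),
            },
        },
    }
}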

Specify parameter values for router probes

Set explicit values for the liveness, readiness, and startup probes in the router deployment's yaml manifest.

Previously, the router deployment yaml manifest omitted these values when the intended values were equal to the default values. This meant that the API server would set the default values for these parameters, and so the probes ultimately ended up with the intended parameter values. However, omitting these values in the yaml manifest also had the unintended side-effect of making the operator log bogus diffs for these parameters whenever it updated the deployment.

After this change, the log output for a deployment update has a less noisy diff.

copyProbe: Refactor to group copying of defaults

Group together the assignments for the case that copyDefaultValues is true.
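
One plausible shape of the refactored function, assuming the standard probe defaults (failureThreshold 3, periodSeconds 10, successThreshold 1, timeoutSeconds 1); the operator's actual copyProbe may differ in which fields it handles:

func copyProbe(a, b *corev1.Probe, copyDefaultValues bool) {
    // Fields that are always copied (Port is required; Host defaults to "").
    if a.ProbeHandler.HTTPGet != nil {
        b.ProbeHandler.HTTPGet = &corev1.HTTPGetAction{
            Path: a.ProbeHandler.HTTPGet.Path,
            Port: a.ProbeHandler.HTTPGet.Port,
            Host: a.ProbeHandler.HTTPGet.Host,
        }
    }
    if copyDefaultValues {
        // Fields that are copied even when they equal the defaults, so the
        // object that gets diffed for logging is complete.
        b.FailureThreshold = a.FailureThreshold
        b.PeriodSeconds = a.PeriodSeconds
        b.SuccessThreshold = a.SuccessThreshold
        b.TimeoutSeconds = a.TimeoutSeconds
    } else {
        // For hashing, copy only non-default values so that empty and
        // default values produce the same hash.
        if a.FailureThreshold != 3 {
            b.FailureThreshold = a.FailureThreshold
        }
        if a.PeriodSeconds != 10 {
            b.PeriodSeconds = a.PeriodSeconds
        }
        if a.SuccessThreshold != 1 {
            b.SuccessThreshold = a.SuccessThreshold
        }
        if a.TimeoutSeconds != 1 {
            b.TimeoutSeconds = a.TimeoutSeconds
        }
    }
}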

@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Apr 30, 2024
@openshift-ci-robot (Contributor):

@Miciah: This pull request references Jira Issue OCPBUGS-23221, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

internalServiceChanged: Fix target port logic

Fix internalServiceChanged to copy target ports from the expected service rather than preserving target ports from the current service.

Before this PR, the internalServiceChanged function explicitly preserved the current service's ports' target ports when reconciling the service. However, the function was intended to update the target ports if they changed.

In particular, if the current service specifies a numeric target port for the metrics port, then the operator should replace the port number with a port name so that the service uses the correct port even if the pod container's port number changes. Without this change, updating an IngressController's spec.endpointPublishingStrategy.hostNetwork.statsPort API field can break metrics as the service's metrics port's numeric target port may no longer match the pod container's metrics port.

Follow-up to #864.

Test_*Changed: Verify that some update was made

This is a follow-up to the previous change for consistency.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Apr 30, 2024
Miciah force-pushed the OCPBUGS-23221-internalServiceChanged-fix-target-port-logic branch from 30433aa to 53baea6 on May 1, 2024 at 00:53
@Miciah (Contributor Author) commented May 1, 2024

ci/prow/e2e-aws-ovn-single-node failed with the following error:

 Updating install-config.yaml to a single m5d.2xlarge control plane node and 0 workers
/bin/bash: line 19: pip3: command not found 

I filed OCPBUGS-33162 for this error.

@Miciah (Contributor Author) commented May 1, 2024

ci/prow/e2e-aws-operator and ci/prow/e2e-azure-operator both failed because TestRouteAdmissionPolicy failed, which is a known issue that is being tracked as part of OCPBUGS-26498.

@candita (Contributor) commented May 1, 2024

/assign

@candita (Contributor) commented May 1, 2024

@Miciah will this also need to be backported to 4.12, like #864 (comment) ?

Miciah force-pushed the OCPBUGS-23221-internalServiceChanged-fix-target-port-logic branch from 53baea6 to 584ef42 on May 1, 2024 at 23:14
@openshift-ci-robot (Contributor):

@Miciah: This pull request references Jira Issue OCPBUGS-23221, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

In response to this:

internalServiceChanged: Fix target port logic

Fix internalServiceChanged to copy target ports from the expected service rather than preserving target ports from the current service. Only copy NodePort and other fields that we want to preserve.

Before this PR, the internalServiceChanged function explicitly preserved the current service's ports' target ports when reconciling the service. However, the function was intended to update the target ports if they changed.

In particular, if the current service specifies a numeric target port for the metrics port, then the operator should replace the port number with a port name so that the service uses the correct port even if the pod container's port number changes. Without this change, updating an IngressController's spec.endpointPublishingStrategy.hostNetwork.statsPort API field can break metrics as the service's metrics port's numeric target port may no longer match the pod container's metrics port.

Follow-up to #864.

Test_*Changed: Verify that some update was made

This is a follow-up to the previous change for consistency.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Miciah (Contributor Author) commented May 1, 2024

https://github.com/openshift/cluster-ingress-operator/compare/53baea6f04f80b40128edd4d55cae58e7dd13dc3..584ef427468eac9d8a5a1948e11ae26f673d3114 updates the logic to preserve the same fields that we ignore when checking whether an update is needed, and also adds an E2E test.

@Miciah (Contributor Author) commented May 1, 2024

@Miciah will this also need to be backported to 4.12, like #864 (comment) ?

Yeah, if anyone requests a backport and PM agrees, it makes sense to backport this PR.

Miciah force-pushed the OCPBUGS-23221-internalServiceChanged-fix-target-port-logic branch from 584ef42 to 600ed28 on May 1, 2024 at 23:34
@Miciah (Contributor Author) commented May 1, 2024

https://github.com/openshift/cluster-ingress-operator/compare/584ef427468eac9d8a5a1948e11ae26f673d3114..600ed28e6f713ec4fe0d2e94b37061f96bf61852 adds comments about keeping cmp options and updated.Spec.Foo = current.Spec.Foo assignments consistent and adds some missing assignments.

@Miciah (Contributor Author) commented May 2, 2024

ci/prow/e2e-azure-operator failed because the installer timed out waiting for the API server to respond after bootstrap.

ci/prow/e2e-aws-operator failed because TestRouteAdmissionPolicy failed.

	for i, updatedPort := range updated.Spec.Ports {
		for _, currentPort := range current.Spec.Ports {
			if currentPort.Name == updatedPort.Name {
				updated.Spec.Ports[i].TargetPort = currentPort.TargetPort
				updated.Spec.Ports[i].NodePort = currentPort.NodePort
Contributor:

Doesn't this mean that we are allowing the spec.ports.nodePorts to be changed from what we expect, to what is set in current?

Contributor:

If something/someone changed the nodePort, and only the nodePort, then L147-148 would return "unchanged" because we ignore it. However, if something changed the nodePort, and also changed something we don't ignore, and we get to here, then we change it. Am I missing something?

Contributor Author:

Doesn't this mean that we are allowing the spec.ports.nodePorts to be changed from what we expect, to what is set in current?

It does. This is explained in the comments:

// Ignore fields that the API, other controllers, or user may
// have modified. Note: This list must be kept consistent with
// the updated.Spec.Foo = current.Spec.Foo assignments below!
// Preserve fields that the API, other controllers, or user may have
// modified. Note: This list must be kept consistent with
// serviceCmpOpts above!

In particular, NodePort is set by the API, and so NodePort must be ignored in the comparison and preserved during the update so that the operator does not try to revert the value that the API set.

If something/someone changed the nodePort, and only the nodePort, then L147-148 would return "unchanged" because we ignore it. However, if something changed the nodePort, and also changed something we don't ignore, and we get to here, then we change it. Am I missing something?

updated.Spec.Ports[i].NodePort = currentPort.NodePort copies the value that was observed (currentPort.NodePort) to the final version of the service that we will use for the update (updated). That is, the assignment preserves the current NodePort value for updates.

Note that we initialize updated by copying the spec field from expected:

updated := current.DeepCopy()
updated.Spec = expected.Spec

Then if we want to preserve the current value for some subfield, we need to copy the value from current:

updated.Spec.ClusterIP = current.Spec.ClusterIP
updated.Spec.ClusterIPs = current.Spec.ClusterIPs
updated.Spec.ExternalIPs = current.Spec.ExternalIPs
updated.Spec.HealthCheckNodePort = current.Spec.HealthCheckNodePort
updated.Spec.IPFamilies = current.Spec.IPFamilies
updated.Spec.IPFamilyPolicy = current.Spec.IPFamilyPolicy
	for i, updatedPort := range updated.Spec.Ports {
		for _, currentPort := range current.Spec.Ports {
			if currentPort.Name == updatedPort.Name {
				updated.Spec.Ports[i].NodePort = currentPort.NodePort

At the end of all this, updated is the result of merging current and expected, copying the field values that the operator should update from expected and copying the field values that the operator should preserve (that is, leave alone) from current.

Contributor:

Still,

If something/someone changed the nodePort, and only the nodePort, then L147-148 would return "unchanged" because we ignore it.

Does that mean the change to nodePort won't get saved?

Contributor Author:

The return false, nil statement on line 147/148 means that updateInternalService won't do an update. In that case, there's no need to preserve NodePort's value—we're not doing an update anyway.

Contributor:

Okay. If we never expect current to hold any values that are different for the ignored fields, then why do we have to preserve them later?

@Miciah (Contributor Author), May 3, 2024:

The operator preserves the ignored fields so that it doesn't inadvertently update fields that we intend for it to ignore.

The operator ignores these fields because it isn't supposed to be managing them. In particular, the API sets NodePort, so the operator should never try to update NodePort, even if expected.Spec.Ports[i].NodePort differs from current.Spec.Ports[i].NodePort.

Maybe the word "expected" is misleading. expected has some fields where the operator manages the field and actually expects the field to have a certain value, as well as some fields where the operator is intended to ignore changes to the field and preserve the current value, whatever that value may be.

Ultimately, the operator should update the service if a managed field (for example, spec.internalTrafficPolicy) does not have the expected value, and the operator should update that particular field. However, the operator should not update the service if only ignored fields (for example, nodePort) changed, and the operator should not update an unmanaged field if the operator has to update the service because some managed field value needs an update.
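
For readers less familiar with the pattern, a minimal sketch of how the comparison options and the preserve assignments mirror each other, using github.com/google/go-cmp/cmp and cmp/cmpopts; the operator's actual serviceCmpOpts may list different fields:

// Fields owned by the API or by other controllers are ignored when
// deciding whether an update is needed...
serviceCmpOpts := []cmp.Option{
    cmpopts.IgnoreFields(corev1.ServicePort{}, "NodePort"),
    cmpopts.IgnoreFields(corev1.ServiceSpec{}, "ClusterIP", "ClusterIPs", "IPFamilies", "IPFamilyPolicy"),
}
changed := !cmp.Equal(current.Spec, expected.Spec, serviceCmpOpts...)

// ...and the same fields are copied from current back into updated before
// the Update call, so an update triggered by a managed field never reverts
// them. The two lists must stay in sync, which is what the quoted comments
// above call out.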

@Miciah (Contributor Author) commented May 3, 2024

e2e-gcp-operator succeeded, and the operator logs have the expected output:

% curl -so ingress-operator.log https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1052/pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator/1785815245374099456/artifacts/e2e-gcp-operator/gather-extra/artifacts/pods/openshift-ingress-operator_ingress-operator-58469b5f8b-6dqzj_ingress-operator.log
% grep -e 'created internal ingresscontroller service' -c -- ingress-operator.log                                                                                                                                       
83
% grep -e 'updated internal service' -- ingress-operator.log                                                                                      
2024-05-02T00:36:39.000Z        INFO    operator.ingress_controller     ingress/internal_service.go:44  updated internal service        {"namespace": "openshift-ingress", "name": "router-internal-reconcile-internal-service", "diff": "  &v1.Service{\n  \tTypeMeta:   {},\n  \tObjectMeta: {Name: \"router-internal-reconcile-internal-service\", Namespace: \"openshift-ingress\", UID: \"5ba7f8c8-9af0-4da4-82dd-22dbf88c5f94\", ResourceVersion: \"48575\", ...},\n  \tSpec: v1.ServiceSpec{\n  \t\tPorts: []v1.ServicePort{\n  \t\t\t{Name: \"http\", Protocol: \"TCP\", Port: 80, TargetPort: {Type: 1, StrVal: \"http\"}, ...},\n  \t\t\t{Name: \"https\", Protocol: \"TCP\", Port: 443, TargetPort: {Type: 1, StrVal: \"https\"}, ...},\n  \t\t\t{\n  \t\t\t\t... // 2 identical fields\n  \t\t\t\tAppProtocol: nil,\n  \t\t\t\tPort:        1936,\n  \t\t\t\tTargetPort: intstr.IntOrString{\n- \t\t\t\t\tType:   0,\n+ \t\t\t\t\tType:   1,\n- \t\t\t\t\tIntVal: 12345,\n+ \t\t\t\t\tIntVal: 0,\n- \t\t\t\t\tStrVal: \"\",\n+ \t\t\t\t\tStrVal: \"metrics\",\n  \t\t\t\t},\n  \t\t\t\tNodePort: 0,\n  \t\t\t},\n  \t\t},\n  \t\tSelector:                 {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"reconcile-internal-service\"},\n  \t\tClusterIP:                \"172.30.106.3\",\n  \t\tClusterIPs:               {\"172.30.106.3\"},\n  \t\tType:                     \"ClusterIP\",\n  \t\tExternalIPs:              nil,\n- \t\tSessionAffinity:          \"None\",\n+ \t\tSessionAffinity:          \"\",\n  \t\tLoadBalancerIP:           \"\",\n  \t\tLoadBalancerSourceRanges: nil,\n  \t\t... // 10 identical fields\n  \t},\n  \tStatus: {},\n  }\n"}
% grep -e 'updated internal service' -- ingress-operator.log | sed -e 's/^[^{]*//' | jq -r .diff
  &v1.Service{
        TypeMeta:   {},
        ObjectMeta: {Name: "router-internal-reconcile-internal-service", Namespace: "openshift-ingress", UID: "5ba7f8c8-9af0-4da4-82dd-22dbf88c5f94", ResourceVersion: "48575", ...},
        Spec: v1.ServiceSpec{
                Ports: []v1.ServicePort{
                        {Name: "http", Protocol: "TCP", Port: 80, TargetPort: {Type: 1, StrVal: "http"}, ...},
                        {Name: "https", Protocol: "TCP", Port: 443, TargetPort: {Type: 1, StrVal: "https"}, ...},
                        {
                                ... // 2 identical fields
                                AppProtocol: nil,
                                Port:        1936,
                                TargetPort: intstr.IntOrString{
-                                       Type:   0,
+                                       Type:   1,
-                                       IntVal: 12345,
+                                       IntVal: 0,
-                                       StrVal: "",
+                                       StrVal: "metrics",
                                },
                                NodePort: 0,
                        },
                },
                Selector:                 {"ingresscontroller.operator.openshift.io/deployment-ingresscontroller": "reconcile-internal-service"},
                ClusterIP:                "172.30.106.3",
                ClusterIPs:               {"172.30.106.3"},
                Type:                     "ClusterIP",
                ExternalIPs:              nil,
-               SessionAffinity:          "None",
+               SessionAffinity:          "",
                LoadBalancerIP:           "",
                LoadBalancerSourceRanges: nil,
                ... // 10 identical fields
        },
        Status: {},
  }

% 

SessionAffinity is an unusual case; "None" and empty are equivalent, and the API defaults the value to "None", so it shows up in the diff. I realize the update logic is confusing, and seeing SessionAffinity in the diff may add to the confusion, so I'll push a commit to prevent it.

@Miciah (Contributor Author) commented May 3, 2024

Multiple CI jobs failed because the "Clone the correct source code into an image and tag it as src" step failed with "0/26 nodes are available". This is an issue with the CI cluster; the tests never actually ran.
/retest-required

t.Fatal(err)
}

findPort := func(name string, service *corev1.Service) *corev1.ServicePort {
Contributor:

Shouldn't this e2e test find/update the port on the IngressController, and check to make sure it is updated on the Service?

Contributor Author:

TestHostNetworkPortBinding verifies that updating IngressController.spec.endpointPublishingStrategy.hostNetwork.statsPort updates the deployment. TestReconcileInternalService verifies that the "metrics" port exists, and that the operator is managing it. 🤔... Would it make sense to add a check to TestHostNetworkPortBinding to verify that the service uses "metrics" (and not a number) as the target port?

Contributor Author:

3a56b8d adds an assertion on the expected ports in TestHostNetworkPortBinding.

@candita (Contributor), May 3, 2024:

TestHostNetworkPortBinding would have prevented the original bug if it was testing the same thing as the bug report. It is testing assertContainerHasPort(t, routerContainer, "metrics", 9936). Doesn't it need to assert that the service now has the port?

Contributor Author:

TestHostNetworkPortBinding would have prevented the original bug if it was testing the same thing as the bug report. It is testing assertContainerHasPort(t, routerContainer, "metrics", 9936). Doesn't it need to assert that the service now has the port?

Is the assertion that I added what you expected?

Or, maybe the new assertion for expected ports should have the new port numbers too? https://github.com/openshift/cluster-ingress-operator/pull/1052/files/cfba4d9ad778d01a6ab874682ad0c2d95f81eab0#diff-cf4b4e5424070a666a364d1d1b04011478d888d6d3d65c943ca7783333b7a6e4R1018-R1034

The port changes on the pod, but the port doesn't need to change on the service if the service just references the target port by name instead of by number. That's the idea behind #864. The fault is that the update logic didn't actually work properly, and so the service would still have port 1936 instead of port "metrics".
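
Concretely, the idea behind #864 is that the service references the container port by name, so only the container port number has to change; an illustrative snippet using k8s.io/apimachinery/pkg/util/intstr (not the operator's exact manifest code):

metricsPort := corev1.ServicePort{
    Name:     "metrics",
    Protocol: corev1.ProtocolTCP,
    Port:     1936,
    // A by-name target port resolves to whatever number the container's
    // "metrics" port currently uses, so changing statsPort does not break
    // the service.
    TargetPort: intstr.FromString("metrics"),
}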

t.Fatal("serviceMonitorChanged does not behave as a fixed-point function")
} else {
if updatedChanged, _ := serviceMonitorChanged(sm1, sm3); !updatedChanged {
t.Error("serviceMonitorChangedChanged reported changes but did not make any update")
Contributor:

nit:

Suggested change:
- t.Error("serviceMonitorChangedChanged reported changes but did not make any update")
+ t.Error("serviceMonitorChanged reported changes but did not make any update")

Miciah force-pushed the OCPBUGS-23221-internalServiceChanged-fix-target-port-logic branch from 3a56b8d to 3d06500 on May 3, 2024 at 22:01
Contributor:

nit: you could add a test case for NodePort changes.

Miciah added 3 commits May 3, 2024 18:13
Fix internalServiceChanged to copy target ports from the expected service
rather than preserving target ports from the current service.  Only copy
NodePort and other fields that we want to preserve.

Before this commit, the internalServiceChanged function explicitly
preserved the current service's ports' target ports when reconciling the
service.  However, the function was intended to update the target ports if
they changed.

In particular, if the current service specifies a numeric target port for
the metrics port, then the operator should replace the port number with a
port name so that the service uses the correct port even if the pod
container's port number changes.  Without this change, updating an
IngressController's spec.endpointPublishingStrategy.hostNetwork.statsPort
API field can break metrics as the service's metrics port's numeric target
port may no longer match the pod container's metrics port.

Follow-up to commit 44a60f5
and commit 9c6e368.

This commit fixes OCPBUGS-23221

https://issues.redhat.com/browse/OCPBUGS-23221

* pkg/operator/controller/ingress/internal_service.go
(internalServiceChanged): Fix logic that preserves the current service's
fields not to preserve the ports' target ports and instead to preserve the
same fields that we ignore with serviceCmpOpts.
* pkg/operator/controller/ingress/internal_service_test.go
(Test_internalServiceChanged): Verify that internalServiceChanged actually
updates the service when it reports that the service has changed.  Add a
test case to verify that updates to NodePort are ignored.
* test/e2e/operator_test.go (TestReconcileInternalService): New test to
verify that the operator reconciles the internal service's metrics port's
target port by setting a bogus target port and checking that the operator
reverts the test's update.
* test/e2e/all_test.go (TestAll): Add TestReconcileInternalService to
parallel tests.
This is a follow-up to the previous commit for consistency.

* pkg/operator/controller/canary/daemonset_test.go
(Test_canaryDaemonsetChanged):
* pkg/operator/controller/canary/namespace_test.go
(Test_canaryNamespaceChanged):
* pkg/operator/controller/canary/route_test.go
(Test_canaryRouteChanged):
* pkg/operator/controller/ingress/deployment_test.go
(Test_deploymentConfigChanged):
* pkg/operator/controller/ingress/load_balancer_service_test.go
(Test_loadBalancerServiceChanged)
(Test_loadBalancerServiceAnnotationsChanged):
* pkg/operator/controller/ingress/monitoring_test.go
(Test_serviceMonitorChanged):
* pkg/operator/controller/ingress/namespace_test.go
(Test_routerNamespaceChanged):
* pkg/operator/controller/ingress/nodeport_service_test.go
(Test_nodePortServiceChanged):
* pkg/operator/controller/ingress/poddisruptionbudget_test.go
(Test_podDisruptionBudgetChange):
* pkg/operator/controller/ingressclass/ingressclass_test.go
(Test_ingressClassChanged): Verify that the function made some update to
the object if it reported that the object changed.
Fix the update logic in crdChanged and nodePortServiceChanged to preserve
fields that are ignored when checking whether an update is necessary.

Follow-up to commit 4a306cb,
commit dd962af,
commit 5579aa1, and
commit d7ffd6e.

* pkg/operator/controller/gatewayapi/crds.go (crdChanged): Preserve
spec.Conversion.
* pkg/operator/controller/ingress/nodeport_service.go
(nodePortServiceChanged): Preserve spec.clusterIPs, spec.ipFamilies, and
spec.ipFamilyPolicy.
Set explicit values for the liveness, readiness, and startup probes in the
router deployment's yaml manifest.

Previously, the router deployment yaml manifest omitted these values when
the intended values were equal to the default values.  This meant that the
API server would set the default values for these parameters, and so the
probes ultimately ended up with the intended parameter values.  However,
omitting these values in the yaml manifest also had the unintended
side-effect of making the operator log bogus diffs for these parameters
whenever it updated the deployment.

After this commit, the log output for a deployment update has a less noisy
diff.

* pkg/manifests/assets/router/deployment.yaml: Specify explicit values for
the failureThreshold, httpGet.scheme, periodSeconds, successThreshold, and
timeoutSeconds parameters for the liveness, readiness, and startup probes.
* pkg/operator/controller/ingress/deployment.go (copyProbe): Add a
copyDefaultValues parameter.  If this parameter is true, copyProbe will
copy the value for a field even if it is equal to the default value.
(hashableProbe): Call copyProbe with copyDefaultValues set to false so that
empty and default values result in the same hash.
(deploymentConfigChanged): Call copyProbe with copyDefaultValues set to
true so that default values get copied to the updated object that gets
diffed for logging.
* pkg/operator/controller/ingress/deployment_test.go (checkProbes): New
helper function to verify that the given container has the expected probes
and probe parameters.
(TestDesiredRouterDeploymentSpecAndNetwork):
(TestDesiredRouterDeploymentVariety): Use checkProbes.
Miciah force-pushed the OCPBUGS-23221-internalServiceChanged-fix-target-port-logic branch from 4d0f520 to 2f0e787 on May 17, 2024 at 10:03
@Miciah (Contributor Author) commented May 17, 2024

I realized that I was still missing defaultMode for some configmap and secret volumes, and also that the unit tests were rather lax (and in one case misleading, checking for a nonexistent "error-pages" volume). https://github.com/openshift/cluster-ingress-operator/compare/4d0f52090379819b1b50af60d060d167a1629ba1..2f0e7872151af6cd62bc88ad26403d166f1259c9 addresses defaultMode for some additional volumes and improves test coverage.

@Miciah (Contributor Author) commented May 17, 2024

/test e2e-aws-operator
/test e2e-azure-operator
/test e2e-gcp-operatotr

@Miciah (Contributor Author) commented May 17, 2024

/test e2e-gcp-operator

1 similar comment
@Miciah (Contributor Author) commented May 17, 2024

/test e2e-gcp-operator

@candita (Contributor) commented May 17, 2024

For deployments, the diffs have some noise for the volume DefaultModes parameter and for various probe parameters. I can push a commit like 1badfc6 to fix those if you like.

As this PR is blocked on #1054 anyway, I went ahead and pushed commits to fix these bogus diffs. I'll check the diffs in the ingress-operator logs when we get back the results from a new CI run.

I noticed your changes. I think this is causing an issue in a 4.12 bug that I'm looking into. Posted in slack.

Update: we may need to backport to 4.12.

@candita (Contributor) commented May 17, 2024

Internal service and NodePort diffs I am seeing in 4.12:

- ClusterIPs:               []string{"10.20.134.56"},
+ ClusterIPs:               nil,

- SessionAffinity:          "None",
+ SessionAffinity:          "",

- IPFamilies:                    []v1.IPFamily{"IPv4"},
+ IPFamilies:                    nil,

- IPFamilyPolicy:                &"SingleStack",
+ IPFamilyPolicy:                nil,

- InternalTrafficPolicy:         &"Cluster",
+ InternalTrafficPolicy:         nil,

@candita (Contributor) commented May 17, 2024

Looks like nodes never became available:

level=error msg=ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).

/test e2e-gcp-ovn

@candita (Contributor) commented May 17, 2024

Regarding this failure in e2e-hypershift:

{"level":"error","ts":"2024-05-17T10:20:44Z","msg":"failed to setup shared OIDC provider","error":"failed to create OIDC provider: LimitExceeded: Cannot exceed quota for OpenIdConnectProvidersPerAccount: 100\n\tstatus code: 409, request id: 9f217247-4e4b-456b-9ab8-a8484f6ccb08","stacktrace":"github.com/openshift/hypershift/test/e2e.main\n\t/go/src/github.com/openshift/hypershift/test/e2e/e2e_test.go:170\ngithub.com/openshift/hypershift/test/e2e.TestMain\n\t/go/src/github.com/openshift/hypershift/test/e2e/e2e_test.go:136\nmain.main\n\t_testmain.go:69\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:267"}

Slack thread discussing quota related issues for Hypershift: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1715956957922319

@Miciah (Contributor Author) commented May 18, 2024

For deployments, the diffs have some noise for the volume DefaultModes parameter and for various probe parameters. I can push a commit like 1badfc6 to fix those if you like.

As this PR is blocked on #1054 anyway, I went ahead and pushed commits to fix these bogus diffs. I'll check the diffs in the ingress-operator logs when we get back the results from a new CI run.

I noticed your changes. I think this is causing an issue in a 4.12 bug that I'm looking into. Posted in slack.

Update: we may need to backport to 4.12.

Right, it makes sense to backport at least add17d7 to 4.12 as #864 was backported to 4.12.

Internal service and NodePort diffs I am seeing in 4.12:

- ClusterIPs:               []string{"10.20.134.56"},
+ ClusterIPs:               nil,

- SessionAffinity:          "None",
+ SessionAffinity:          "",

- IPFamilies:                    []v1.IPFamily{"IPv4"},
+ IPFamilies:                    nil,

- IPFamilyPolicy:                &"SingleStack",
+ IPFamilyPolicy:                nil,

Right, these should be fixed by d21abc8 and 1badfc6, so we can backport those too.

- InternalTrafficPolicy:         &"Cluster",
+ InternalTrafficPolicy:         nil,

This should have been fixed by 5579aa1, which is proposed for backport to 4.13. If you're seeing OCPBUGS-13190 on 4.12, please add 4.12 to OCPBUGS-13190's "Affects Versions", and we can propose backporting #927 to 4.12 as well.


The diffs in the ingress-operator logs from the latest e2e-gcp-operator run look really clean now. They are too large to paste in a comment, but you can review them using the following commands:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1052/pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator/1791470779821461504/artifacts/e2e-gcp-operator/gather-extra/artifacts/pods/openshift-ingress-operator_ingress-operator-57f7fb8f95-8p27t_ingress-operator.log' -o ingress-operator.log
% sed < ingress-operator.log -n -e $'/diff/ {s/\\\\n/\\\n/g;s/\\\\t/\t/g;p;}'

However, three anomalies jumped out at me. First, the LoadBalancer-type service for TestAllowedSourceRanges has an inexplicable, delayed second update:

2024-05-17T15:07:39.258Z        INFO    operator.ingress_controller     ingress/load_balancer_service.go:321    updated load balancer service   {"namespace": "openshift-ingress", "name": "router-sourcerange", "diff": "  &v1.Service{
        TypeMeta:   {},
        ObjectMeta: {Name: \"router-sourcerange\", Namespace: \"openshift-ingress\", UID: \"bf35e2d3-bb10-498e-b771-0659315fc65e\", ResourceVersion: \"44908\", ...},
        Spec: v1.ServiceSpec{
                ... // 6 identical fields
                SessionAffinity:          \"None\",
                LoadBalancerIP:           \"\",
-               LoadBalancerSourceRanges: []string{\"127.0.0.0/8\"},
+               LoadBalancerSourceRanges: []string{\"10.0.0.0/8\"},
                ExternalName:             \"\",
                ExternalTrafficPolicy:    \"Local\",
                ... // 8 identical fields
        },
        Status: {LoadBalancer: {Ingress: {{IP: \"34.29.9.29\"}}}},
  }
"}
2024-05-17T15:25:51.134Z        INFO    operator.ingress_controller     ingress/load_balancer_service.go:321    updated load balancer service   {"namespace": "openshift-ingress", "name": "router-sourcerange", "diff": "  &v1.Service{
        TypeMeta:   {},
        ObjectMeta: {Name: \"router-sourcerange\", Namespace: \"openshift-ingress\", UID: \"021a0f6b-e316-4d0a-9aaa-15189dded4f6\", ResourceVersion: \"58305\", ...},
        Spec: v1.ServiceSpec{
                ... // 6 identical fields
                SessionAffinity:          \"None\",
                LoadBalancerIP:           \"\",
-               LoadBalancerSourceRanges: []string{\"127.0.0.0/8\"},
+               LoadBalancerSourceRanges: []string{\"10.0.0.0/8\"},
                ExternalName:             \"\",
                ExternalTrafficPolicy:    \"Local\",
                ... // 8 identical fields
        },
        Status: {LoadBalancer: {Ingress: {{IP: \"34.16.7.141\"}}}},
  }
"}

The second update happens about 18 minutes after the first update. It also happens after the serial tests appear to have finished. I suspect it has something to do with the fact that TestAllowedSourceRanges uses t.Cleanup to delete its IngressController; maybe the cleanup is delayed for that reason, though that doesn't fully explain why the service gets updated a second time (it should just get deleted when the cleanup happens).

The second anomaly that I noticed was this update from the TestRouteHardStopAfterEnableOnIngressConfig test:

                                                        {
                                                                Name:      \"ROUTER_DISABLE_HTTP2\",
-                                                               Value:     \"false\",
+                                                               Value:     \"true\",
                                                                ValueFrom: nil,
                                                        },
                                                        {Name: \"ROUTER_DISABLE_NAMESPACE_OWNERSHIP_CHECK\", Value: \"false\"},
                                                        {Name: \"ROUTER_DOMAIN\", Value: \"apps.ci-op-tg7xmq00-76b3b.origin-ci-int-gce.dev.rhcloud.com\"},
+                                                       {Name: \"ROUTER_HARD_STOP_AFTER\", Value: \"3h20m0s\"},

The "3h20m" value has to be from this update:

if err := hardStopAfterTestIngressConfig(t, kclient, ingressConfig, routerDeployment, (200 * time.Minute).String()); err != nil {

However, TestRouteHardStopAfterEnableOnIngressConfig is a serial test, and so are TestRouteHTTP2EnableAndDisableIngressController and TestRouteHTTP2EnableAndDisableIngressConfig, so it isn't clear to me why a single update would modify both ROUTER_DISABLE_HTTP2 and ROUTER_HARD_STOP_AFTER. @frobware, any thoughts on this?

The third anomaly I noticed was this:

% grep -e 'successfully updated Infra CR with Ingress Load Balancer IPs' -m 1 -- ingress-operator.log       
2024-05-17T14:46:01.434Z	INFO	operator.ingress_controller	ingress/controller.go:326	successfully updated Infra CR with Ingress Load Balancer IPs
% grep -e 'successfully updated Infra CR with Ingress Load Balancer IPs' -c -- ingress-operator.log 
142

#1016 has a logic error, which causes the operator to log this message even when it didn't do an update:

	// If the lbService exists for the "default" IngressController, then update Infra CR's PlatformStatus with the Ingress LB IPs.
	if haveLB && ci.Name == manifests.DefaultIngressControllerName {
		if updated, err := computeUpdatedInfraFromService(lbService, infraConfig); err != nil {
			errs = append(errs, fmt.Errorf("failed to update Infrastructure PlatformStatus: %w", err))
		} else if updated {
			if err := r.client.Status().Update(context.TODO(), infraConfig); err != nil {
				errs = append(errs, fmt.Errorf("failed to update Infrastructure CR after updating Ingress LB IPs: %w", err))
			}
		}
		log.Info("successfully updated Infra CR with Ingress Load Balancer IPs")
	}

(I've actually noticed this before, but I never got around to filing a bug report...). #1016 was backported to 4.15, so it would be nice to fix it and backport the fix to 4.15. It's just log noise, though, so it might not be justifiable to backport on its own. It is rather noisy, though, and it's trivial to fix, so if you like, I can bundle a fix for this log message into this PR.
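
For reference, the trivial fix described here is just moving the log statement inside the branch that actually performs the update; roughly (a sketch, not a committed change):

if haveLB && ci.Name == manifests.DefaultIngressControllerName {
    if updated, err := computeUpdatedInfraFromService(lbService, infraConfig); err != nil {
        errs = append(errs, fmt.Errorf("failed to update Infrastructure PlatformStatus: %w", err))
    } else if updated {
        if err := r.client.Status().Update(context.TODO(), infraConfig); err != nil {
            errs = append(errs, fmt.Errorf("failed to update Infrastructure CR after updating Ingress LB IPs: %w", err))
        } else {
            // Log success only when an update was actually made.
            log.Info("successfully updated Infra CR with Ingress Load Balancer IPs")
        }
    }
}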

@ShudiLi (Member) commented May 20, 2024

/label qe-approved

Tested it on 4.16.0-0.ci.test-2024-05-20-061242-ci-ln-qyfl0pt-latest.

1.
% oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.ci.test-2024-05-20-061242-ci-ln-qyfl0pt-latest   True        False         10m     Cluster version is 4.16.0-0.ci.test-2024-05-20-061242-ci-ln-qyfl0pt-latest

2.
% oc create -f ingress-hostnetwork.yaml 
ingresscontroller.operator.openshift.io/int1 created
% cat ingress-hostnetwork.yaml                                                         
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: int1
  namespace: openshift-ingress-operator
spec:
  endpointPublishingStrategy:
    hostNetwork:
      httpPort: 23080
      httpsPort: 23443
      statsPort: 23936
    type: HostNetwork
  domain: int1.ci-ln-qyfl0pt-c1627.vmc-ci.devcluster.openshift.com

3.
% oc -n openshift-ingress get pods
NAME                             READY   STATUS    RESTARTS   AGE
router-default-74f5cbcf6-qgtqx   1/1     Running   0          38m
router-default-74f5cbcf6-t742h   1/1     Running   0          38m
router-int1-c56c7dd96-4b72b      1/1     Running   0          11m
router-int1-c56c7dd96-4nvb5      1/1     Running   0          11m

4.
% oc -n openshift-ingress get svc router-internal-int1 -o=jsonpath="{.spec.ports}"
[{"name":"http","port":80,"protocol":"TCP","targetPort":"http"},{"name":"https","port":443,"protocol":"TCP","targetPort":"https"},{"name":"metrics","port":1936,"protocol":"TCP","targetPort":"metrics"}]% 

5.
 % oc -n openshift-ingress get ep router-internal-int1 -o=jsonpath="{.subsets[0].ports}"
[{"name":"metrics","port":23936,"protocol":"TCP"},{"name":"https","port":23443,"protocol":"TCP"},{"name":"http","port":23080,"protocol":"TCP"}]% 

6.
% oc -n openshift-ingress get deployment router-int1 -oyaml | grep -E -A9 "ivenessProbe:|readinessProbe:|startupProbe:"
        livenessProbe:
          failureThreshold: 3
          httpGet:
            host: localhost
            path: /healthz
            port: 23936
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          terminationGracePeriodSeconds: 10
--
        readinessProbe:
          failureThreshold: 3
          httpGet:
            host: localhost
            path: /healthz/ready
            port: 23936
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
--
        startupProbe:
          failureThreshold: 120
          httpGet:
            host: localhost
            path: /healthz/ready
            port: 23936
            scheme: HTTP
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 1

7.
% oc -n openshift-ingress get deployment router-int1 -oyaml | grep -A6 " - configMap:"
      - configMap:
          defaultMode: 420
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: service-ca-bundle
          optional: false

8.
% oc -n openshift-ingress get svc router-internal-int1 -o=jsonpath="{.spec.sessionAffinity}" 
None%                    

9.  create a load balancer ingress controller and check its service
% print $(oc -n openshift-ingress get svc router-int2 -o=jsonpath="{.spec.type}\n{.spec.sessionAffinity}")
LoadBalancer
None

@candita (Contributor) commented May 20, 2024

/test e2e-hypershift

@candita (Contributor) commented May 20, 2024

@Miciah thanks for the thorough analysis and for the bug numbers we should plan to backport to fix all the spurious updates seen in 4.12.40. At the moment I think we can safely ignore the logs from CI testing. I have a summary of all the spurious updates seen in 4.12.40 and will coordinate the backports of this bug and any other new ones needed, including one to address #1016.

Contributor:

Thanks! That is a lot of overlooked deployment updates.

@@ -1514,7 +1520,7 @@ func hashableProbe(probe *corev1.Probe) *corev1.Probe {

 	var hashableProbe corev1.Probe

-	copyProbe(probe, &hashableProbe)
+	copyProbe(probe, &hashableProbe, false)
Contributor:

This change will make a big difference.

@@ -1645,7 +1652,7 @@ func copyProbe(a, b *corev1.Probe) {
			Port: a.ProbeHandler.HTTPGet.Port,
			Host: a.ProbeHandler.HTTPGet.Host,
@candita (Contributor), May 20, 2024:

Should the entire function be gated by copyDefaultValues? I see here it still allows changes to HTTPGet and TerminationGracePeriod. In the future, how will developers know which values to allow changes to, and which to prevent changes to?

Contributor Author:

Port is required, and the default value for Host is the empty string, so I believe we can copy them unconditionally.

Contributor:

In that case, if you wouldn't mind, can you add a comment like "Values that are always updated" and group those assignments, and another comment "Values that are only updated when requested" and group those under an if copyDefaultValues { block? I feel like this may help/force future developers to make a decision about what to do with values in copyProbe.

Contributor Author:

Done in 3faf391.

@candita (Contributor) commented May 20, 2024

/test e2e-aws-ovn-single-node

@Miciah (Contributor Author) commented May 21, 2024

we should plan to backport to fix all the spurious updates seen in 4.12.40.

There have been a number of fixes for spurious updates since 4.12 (though not all of these have been confirmed to affect 4.12 generally or 4.12.40 specifically):

At the moment I think we can safely ignore the logs from CI testing.

I don't understand what you mean by that. I think we should always be looking at the logs from CI testing, especially when we are modifying update logic, and most especially when we are specifically trying to address spurious updates. Do you mean that we should ignore the three issues that I mentioned in #1052 (comment) for now and not try to fix the "successfully updated Infra CR with Ingress Load Balancer IPs" log message or diagnose the other anomalies?

I have a summary of all the spurious updates seen in 4.12.40 and will coordinate the backports of this bug and any other new one needed, including one to address #1016

All right, I will hold off on fixing the "successfully updated Infra CR with Ingress Load Balancer IPs" issue until a separate bug report is filed.

In my opinion, we should backport in chronological order (#879 first, then #927, then #947, and so on) in order to keep the history consistent among branches and avoid conflicts as much as possible, but we will encounter merge conflicts along the way in any case with some of these PRs (certainly with #1052).

@candita (Contributor) commented May 21, 2024

I don't understand what you mean by that. I think we should always be looking at the logs from CI testing, especially when we are modifying update logic, and most especially when we are specifically trying to address spurious updates. Do you mean that we should ignore the three issues that I mentioned in #1052 (comment) for now and not try to fix the "successfully updated Infra CR with Ingress Load Balancer IPs" log message or diagnose the other anomalies?

Yes, that is what I meant.

@candita (Contributor) commented May 23, 2024

* pkg/operator/controller/ingress/deployment.go (copyProbe): Group together
the assignments for the case that copyDefaultValues is true.
@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 24, 2024
@openshift-ci-robot (Contributor):

@Miciah: This pull request references Jira Issue OCPBUGS-23221, which is invalid:

  • expected the bug to target either version "4.17." or "openshift-4.17.", but it targets "4.16.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

internalServiceChanged: Fix target port logic

Fix internalServiceChanged to copy target ports from the expected service rather than preserving target ports from the current service. Only copy NodePort and other fields that we want to preserve.

Before this PR, the internalServiceChanged function explicitly preserved the current service's ports' target ports when reconciling the service. However, the function was intended to update the target ports if they changed.

In particular, if the current service specifies a numeric target port for the metrics port, then the operator should replace the port number with a port name so that the service uses the correct port even if the pod container's port number changes. Without this change, updating an IngressController's spec.endpointPublishingStrategy.hostNetwork.statsPort API field can break metrics as the service's metrics port's numeric target port may no longer match the pod container's metrics port.

Follow-up to #864.

Test_*Changed: Verify that some update was made

This is a follow-up to the previous change for consistency.

crdChanged, nodePortServiceChanged: Fix preserve logic

Fix the update logic in crdChanged and nodePortServiceChanged to preserve fields that are ignored when checking whether an update is necessary.

Follow-up to #567, #890, and #927.

Set SessionAffinity to avoid confusing diffs

If a service does not specify spec.sessionAffinity, the API sets a default value of "None". This means that when creating or updating a service, the empty value and "None" are equivalent, and so when reconciling a service, cluster-ingress-operator treats these values as equal. This results in the correct update behavior: The operator ignores the API defaulting, but if someone sets any value other than the empty value or "None", the operator reverts the update; and if the operator leaves spec.sessionAffinity empty when reverting the update, the API again defaults it to "None".

However, even though the updating behavior is correct, it can cause confusing log output when the operator logs the diff with the changes: If the operator has the empty value in the updated object but the current value is actually "None", this shows up in the diff, even though the actual update does not change the field value when taking defaulting into account.

This change adds an explicit spec.sessionAffinity: None setting to the services that the operator manages to avoid having the API defaulting show up in these diffs.

Follow-up to #407.

Test_desiredLoadBalancerService: Add assertions

Add assertions for service type, externalTrafficPolicy, internalTrafficPolicy, and ports, for consistency with Test_desiredInternalIngressControllerService.

Test_desiredInternalIngressControllerService: Fix godoc

Fix capitalization of the test function name in the godoc.

TestHostNetworkPortBinding: Use t.Log, t.Cleanup

Update TestHostNetworkPortBinding to comply with current conventions. Convert some comments into t.Log statements, replace defer with t.Cleanup, and tweak some error messages.

TestHostNetworkPortBinding: Check service ports

Update TestHostNetworkPortBinding to verify that the internal service has the expected ports. In particular, the metrics port should have its target port set to the name "metrics" rather than a port number.

Test_deploymentConfigChanged: Use k8s.io/utils/ptr

Replace use of the deprecated "k8s.io/utils/pointer" package by using "k8s.io/utils/ptr" instead.

Set "service-ca-bundle" volume's default mode

Set an explicit value of 420 decimal (0644 octal) for the router deployment's "service-ca-bundle" volume's default mode.

Previously, the router deployment yaml manifest had defaultMode indented incorrectly, and so the operator did not actually set the intended value for the volume's default mode. Because the intended value is equal to the default value, the volume ultimately ends up with the intended default mode anyway. However, the fact that the operator did not specify the default mode made log output show a bogus diff for the volume when the operator updated the deployment.

After this change, the yaml is correctly indented, and the log output for a deployment update has a less noisy diff.

Follow-up to #399.

Specify default mode on configmap/secret volumes

Set an explicit value of 420 decimal (0644 octal) for the default mode of all configmap or secret volumes in the router deployment that are defined in Go code (as opposed to being defined in the yaml manifest): client-ca, default-certificate, error-pages, metrics-certs, rsyslog-config, service-ca-bundle, and stats-auth.

After this change, the log output for a deployment update has a less noisy diff.

Specify parameter values for router probes

Set explicit values for the liveness, readiness, and startup probes in the router deployment's yaml manifest.

Previously, the router deployment yaml manifest omitted these values when the intended values were equal to the default values. This meant that the API server would set the default values for these parameters, and so the probes ultimately ended up with the intended parameter values. However, omitting these values in the yaml manifest also had the unintended side-effect of making the operator log bogus diffs for these parameters whenever it updated the deployment.

After this change, the log output for a deployment update has a less noisy diff.

copyProbe: Refactor to group copying of defaults

Group together the assignments for the case that copyDefaultValues is true.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Miciah (Contributor Author) commented May 24, 2024

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label May 24, 2024
@openshift-ci-robot (Contributor):

@Miciah: This pull request references Jira Issue OCPBUGS-23221, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 24, 2024
openshift-ci bot (Contributor) commented May 24, 2024

@Miciah: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                        Commit   Details  Required  Rerun command
ci/prow/e2e-azure-ovn            3faf391  link     false     /test e2e-azure-ovn
ci/prow/e2e-aws-gatewayapi       3faf391  link     false     /test e2e-aws-gatewayapi
ci/prow/e2e-gcp-ovn              3faf391  link     false     /test e2e-gcp-ovn
ci/prow/e2e-aws-ovn-single-node  3faf391  link     false     /test e2e-aws-ovn-single-node

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
  • acknowledge-critical-fixes-only: Indicates if the issuer of the label is OK with the policy.
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • qe-approved: Signifies that QE has signed off on this PR.