Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release-4.15] OCPBUGS-28928: Upgrade Validation plugin for SHA1 certs #585

Open
wants to merge 12 commits into
base: release-4.15
Choose a base branch
from

Conversation

gcs278
Copy link
Contributor

@gcs278 gcs278 commented Apr 23, 2024

Manual cherry-pick of #555, #575, #588, #587 (in that order). All required to support new Upgrade Validation plugin.

This PR adds support for setting a new status condition type, UnservableInFutureVersions, when a SHA1 cert is detected on a route. It accomplishes this by adding a new plugin, the upgrade validation plugin. This plugin is simple: when a route configuration that is unsupported in the future is detected, it adds a UnservableInFutureVersions status condition to the route. When the problematic configuration is no longer detected, it clears/removes the UnservableInFutureVersions condition.

In order to support updating a new route status condition, the routeRejectionRecorder and associated methods were refactored to be more generic. The writerlease logic used to update route status had to be refactored because the old code made assumptions that only one type of status condition would ever be updated.

It also updates certificates for existing unit tests, as some certificates were SHA1 and incorrectly labeled as expired.

Depends on openshift/api#1751
Related Ingress Operator PR openshift/cluster-ingress-operator#1014

DELTA FROM #555:

HOW I CREATED THIS PR:

  1. Cherry-Pick the commits in OCPBUGS-26498: Upgrade Validation plugin for SHA1 certs #555. 7ad7ec1 is the merge commit for this PR:
    # Get all commits in this merge commit:
    $ git log --oneline  
    7ad7ec1d2390aeeb851874ec1ab7ba0dadbf1158^..7ad7ec1d2390aeeb851874ec1ab7ba0dadbf1158
    7ad7ec1d Merge pull request #555 from gcs278/OCPBUGS-26498-upgradeable-status
    c22b003f Add Upgrade Validation router plugin
    e8b2579d Remove original condition clearing from recordIngressCondition
    e70ba933 Fix recordIngressCondition changed=false issue.
    79d80c4e Bump openshift/api
    6bec8018 Bump golang 1.21
    2c881701 Regenerate certs for unit tests
    ce709bc3 Add .gitleaks.toml
    
    # Cherry-Pick the first two before `6bec8018 Bump golang 1.21`
    $ git cherry-pick ce709bc3 2c881701
    [...no conflicts...]
    
    # Create the `Bump openshift/api` commit with the 4.15 API
    $ go get github.com/openshift/api@93d6bda14341f1d18319774367829c3278c171e7
    $ go mod tidy
    $ go mod vendor
    $ git add -A
    $ git commit
    [...message...]
    
    # Finish cherry-picking commits after `79d80c4e Bump openshift/api`
    $ git cherry-pick e70ba933 e8b2579d c22b003f
    [...no conflicts...]
    
  2. Cherry-Pick the commit in OCPBUGS-26498: Add Upgrade Validation force arguments for running E2E tests #575. d2d6892 is the merge commit for this PR:
    # Get all commits in this merge commit:
    $ git log --oneline d2d6892ca0bf71574a8f95579dfe7a848a5cf359^..d2d6892ca0bf71574a8f95579dfe7a848a5cf359
    d2d6892c Merge pull request #575 from gcs278/OCPBUGS-26498-upgradeable-status-E2E-args
    0ff5da48  Add Upgrade Validation force arguments for running E2E tests
    
    # Cherry-Pick the non-merge commit
    $ git cherry-pick 0ff5da48
    [...no conflicts...]
    
  3. Cherry-Pick the commits in OCPBUGS-26498: Make ingressConditionsEqual more efficient #588. b6f7c63 is the merge commit for this PR:
    # Get all commits in this merge commit:
    $ git log --oneline b6f7c63c52e6b20fde2549a3c3e69a052ab453cc^..b6f7c63c52e6b20fde2549a3c3e69a052ab453cc
    b6f7c63c Merge pull request #588 from gcs278/efficient-contention-comparison
    44f217fe Make ingressConditionsEqual more efficient
    3f171901 Add log for attempting to update route status
    66eb2237 Add verbose logging for plugin chain execution
    efcba1b6 Log election results in writerlease
    
    # Cherry-Pick the commits
    $ git cherry-pick efcba1b6 66eb2237 3f171901 44f217fe
    [...no conflicts...]
    
  4. Cherry-Pick the commits in OCPBUGS-26498: Optimize Upgrade Validation plugin by skipping unnecessary changes #587. 5610ac8 is the merge commit for this PR:
    # Get all commits in this merge commit:
    $ git log --oneline 5610ac80b43350131694dddf42660e8cc844389d^..5610ac80b43350131694dddf42660e8cc844389d
    5610ac80 Merge pull request #587 from gcs278/efficient-route-status-updates
    13ab1dd8 Optimize Upgrade Validation plugin by skipping unnecessary changes
    
    # Cherry-Pick the 13ab1dd8 commit
    $ git cherry-pick 13ab1dd8
    [...no conflicts...]
    

@openshift-ci-robot openshift-ci-robot added the jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. label Apr 23, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 23, 2024
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 23, 2024
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-28928, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required"
  • expected dependent Jira Issue OCPBUGS-26498 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is POST instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Manual cherry-pick of #555

This PR adds support for setting a new status condition type, UnservableInFutureVersions, when a SHA1 cert is detected on a route. It accomplishes this by adding a new plugin, the upgrade validation plugin. This plugin is simple: when a route configuration that is unsupported in the future is detected, it adds a UnservableInFutureVersions status condition to the route. When the problematic configuration is no longer detected, it clears/removes the UnservableInFutureVersions condition.

In order to support updating a new route status condition, the routeRejectionRecorder and associated methods were refactored to be more generic. The writerlease logic used to update route status had to be refactored because the old code made assumptions that only one type of status condition would ever be updated.

It also updates certificates for existing unit tests, as some certificates were SHA1 and incorrectly labeled as expired.

Depends on openshift/api#1751
Related Ingress Operator PR openshift/cluster-ingress-operator#1014

WIP: Need openshift/api#1751 to merge so we can vendor && it would great if #584 merged to avoid merge conflicts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from candita and knobunc April 23, 2024 01:34
Copy link
Contributor

openshift-ci bot commented Apr 23, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from gcs278. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gcs278 gcs278 changed the title [WIP] OCPBUGS-28928: Upgrade Validation plugin for SHA1 certs [WIP] [release-4.15] OCPBUGS-28928: Upgrade Validation plugin for SHA1 certs Apr 23, 2024
@gcs278
Copy link
Contributor Author

gcs278 commented Apr 23, 2024

Odd failures that I don't see locally
/test unit

@gcs278 gcs278 force-pushed the OCPBUGS-26498-upgradeable-status-4.15 branch from b2150de to 2cf6fdd Compare May 6, 2024 02:51
@gcs278 gcs278 changed the title [WIP] [release-4.15] OCPBUGS-28928: Upgrade Validation plugin for SHA1 certs [release-4.15] OCPBUGS-28928: Upgrade Validation plugin for SHA1 certs May 6, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 6, 2024
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-28928, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required"
  • expected dependent Jira Issue OCPBUGS-26498 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is MODIFIED instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Manual cherry-pick of #555, #588, #587 (All required to support new Upgrade Validation plugin)

This PR adds support for setting a new status condition type, UnservableInFutureVersions, when a SHA1 cert is detected on a route. It accomplishes this by adding a new plugin, the upgrade validation plugin. This plugin is simple: when a route configuration that is unsupported in the future is detected, it adds a UnservableInFutureVersions status condition to the route. When the problematic configuration is no longer detected, it clears/removes the UnservableInFutureVersions condition.

In order to support updating a new route status condition, the routeRejectionRecorder and associated methods were refactored to be more generic. The writerlease logic used to update route status had to be refactored because the old code made assumptions that only one type of status condition would ever be updated.

It also updates certificates for existing unit tests, as some certificates were SHA1 and incorrectly labeled as expired.

Depends on openshift/api#1751
Related Ingress Operator PR openshift/cluster-ingress-operator#1014

DELTA FROM #555:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@gcs278
Copy link
Contributor Author

gcs278 commented May 22, 2024

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 22, 2024
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-28928, which is valid.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.z) matches configured target version for branch (4.15.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note text is set and does not match the template
  • dependent bug Jira Issue OCPBUGS-26498 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-26498 targets the "4.16.0" version, which is one of the valid target versions: 4.16.0
  • bug has dependents

Requesting review from QA contact:
/cc @ShudiLi

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Avoid scanning validation_test.go, plugin_test.go, and router_test.go
since they always contain fake/example certificates.
SHA1 certs will be unsupported in the future. Clean up unit tests to
organize clearly between SHA1 and SHA256 certs. The unit tests had SHA1
certs that weren't tested separately. Also, the unit tests claimed to
be using an expired certificate, but it was NOT expired. Document all
certs by adding openssl comands for the regenerated certs to help future
developers recreate certs if needed.
Bump openshift/api to get new route UnservableInFutureVersions
status condition type.

  go get github.com/openshift/api@93d6bda14341f1d18319774367829c3278c171e7
  go mod tidy
  go mod vendor
When an route ingress already exists, but an expected condition isn't
found, recordIngressCondition would previously not change/update
anything. This wasn't an issue for when only had Admitted because the
route ingress was updated synchronously with the Admitted condition.
However, adding another route status condition exposes the scenario in
which a route ingress will exist, but the desired condition won't.

This also adds a unit test for recordIngressCondition which explicitly
tests this case along with other edge cases.
Previously, recordIngressCondition sometimes clears the original return value
conditions depending on whether the condition was found and changed. This
doesn't seem consistent and makes it hard to unit test.

The original return value only affects logging. Logging the original conditions
is advantageous because modifying conditions can also lead to contention,
and having visibility into the modified conditions is beneficial.

Therefore, it seems reasonable to remove condition clearing as it
benefits logging and it's inconsistent.

Update unit testing to now verify original return value from
recordIngressCondition.
Add the Upgrade Validation plugin which provides a framework for adding
a status to a route to indicate future upgrade issues. For this
initial implementation, if a route has a SHA1 certificate, the upgrade
validation plugin will mark the route with a UnservableInFutureVersions
status. If the route no longer has a SHA1 certificate, remove the status
instead of setting UnservableInFutureVersions=false.

This is specific to OpenShift 4.15 and is expected to be removed
from 4.16 once backported to avoid confusion. It's specific to 4.15
because 4.15 accepts SHA1 certs, but 4.16 does not. Therefore,
validating the upgrade should be only done in 4.15.

Update the RejectionRecorder interface to be the RouteStatusRecorder to
support more generic status-related use cases as we have more than just
Admitted status now.

Add UPGRADE_VALIDATION env variable, but default to true, so that we can
disable in the future. Follow existing pattern similar to existing
EXTENDED_VALIDATION env variable.

Key the writerlease logic off of route UID AND condition type. This is
now essential because if only route UID, then the lease logic will only
update one condition as it doesn't differentiate from the work items for
the different conditions.

Adjust findMostRecentIngress to use the latest condition instead of just
Admitted as we can no longer assume that Admitted is the only condition.

Add a type for writelease WorkKey to ensure strings aren't used anymore
as the previous logic was only using route.UID. Add a type for
ContentionTracker key to ensure it is different than writelease WorkKey.

Update ingressEqual function to detect contention if any condition is
not equal. Previously, it hardcoded the "Admitted" status. Add a unit
test, TestRouterContentionUpdateCondition to test these changes.
Add --debug-upgrade-validation-force-add-condition and
--debug-upgrade-validation-force-remove-condition as arguments
to the openshift-router. These force add or force remove the
UnservableInFutureVersions condition to allow for an E2E test
to do contention testing on a condition-level. For debugging
and testing purposes only.
Add missing key to log statement and log when a worker becomes elected
to leader or demoted to a follower for better clarity in router logs.
Introduces verbose (level 10) logging to trace plugin chain execution.
A bug was identified where the plugin chain halted unexpectedly, and
logging the plugin chain steps aids in debugging.
Add verbose logging to help indicate the router is attempting to update
a route status. This is helpful for debugging racy router status logic.
We found that the router's contention tracker was slower after updates
were made to support the UnservableInFutureVersions condition. This
lag sometimes causes routers to update the route status before the
contention tracker detected an update, leading to the per-route
contention logic not activating.

In such scenarios, the maxContentions logic eventually activates, temporarily
preventing the router from making any updates to the routes. This commit
improves the efficiency of the ingressConditionsEqual function to help
mitigate this type of issues.
Both performIngressConditionUpdate and performIngressConditionRemoval functions
add tasks to the writerlease queue even if no work needed to be done. This commit
optimizes the Upgrade Validation plugin by ensuring that tasks for updating
UnservableInFutureVersions are only added to the queue when changes are required.

This reduction in unnecessary work significantly lowers the frequency of lease
extensions. Previously, excessive lease extensions could delay route status
updates under certain conditions, such as during temporary contention periods
where a router pod gets demoted to a follower. After the contention is resolved,
the pod’s subsequent retry will be delayed more than necessary.

Additionally, this update provides a clearer indication of when updates are actually
required, but have been completed by another router pod. The prior logic did not
clearly distinguish between unnecessary updates and updates that were completed by
another pod.

The selective updates are only added to the interface used by Upgrade Validation
plugin to avoid perturbing existing Admitted condition logic. This means
nominal clusters without SHA1 routes should not be impacted by the Upgrade
Validation plugin, minimizing the risk associated with its introduction.
@gcs278 gcs278 force-pushed the OCPBUGS-26498-upgradeable-status-4.15 branch from a188598 to 5afca9e Compare May 22, 2024 20:40
@gcs278
Copy link
Contributor Author

gcs278 commented May 22, 2024

I did my own testing with this PR in cluster bot with the backport of the origin test openshift/origin#28820, just as a sanity check:

$ ./openshift-tests run-test "[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router converges when multiple routers are writing conflicting upgrade validation status [Suite:openshift/conformance/parallel]"
...
  Ran 1 of 1 Specs in 73.082 seconds
  SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped

Copy link
Contributor

openshift-ci bot commented May 22, 2024

@gcs278: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-upgrade 5afca9e link true /test e2e-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@gcs278
Copy link
Contributor Author

gcs278 commented May 29, 2024

/assign @frobware

@gcs278
Copy link
Contributor Author

gcs278 commented May 29, 2024

/assign @frobware

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants