Implementing Controlled Canary Releases for Function Deployments #487

Open
Bryce-huang opened this issue Aug 29, 2023 · 8 comments

@Bryce-huang
Contributor

Proposal
The aim of this proposal is to introduce the Canary Release strategy to enhance our software release process for functions. Canary Release involves gradually deploying new features or code changes by exposing them to a small subset of users for testing and validation in a real environment. This approach helps minimize potential production issues, provides early feedback, and ensures system stability and reliability.

Motivation
Risk Mitigation: Every software release carries inherent risks. By initially deploying new features to a limited group of users, we can closely monitor their interactions and assess the impact of the changes. This empowers us to identify and address any unforeseen issues before they cascade to a wider user base, reducing the potential for widespread disruption.

Early Issue Detection: The Canary Release strategy provides an avenue for gathering real-time user feedback from an early stage. This invaluable input allows us to address any usability issues, performance bottlenecks, or bugs promptly, resulting in a smoother and more user-friendly experience once the features are fully deployed.

Enhanced Control: With Canary Releases, we gain a finer degree of control over the deployment process. By adjusting traffic distribution between old and new instances, we can precisely regulate the exposure of new functionalities to different user groups. This control enables us to tailor the user experience and assess the impact of changes under varying conditions.

Flexibility for Rollbacks: The ability to roll back swiftly when critical issues arise is a key aspect of Canary Releases. If unforeseen problems occur during the release phase, we can revert to the previous version, and only the Canary users will have been affected.

Data-Driven Decisions: The controlled Canary Release strategy provides us with a wealth of data that can inform future decisions. By analyzing performance metrics, user behavior patterns, and feedback from the Canary group, we can make informed choices about the further development and refinement of the deployed features.

Goals

Risk Mitigation: By incrementally releasing features, we can test new functionalities within a smaller user group, significantly reducing the potential impact on the entire system.
Early Issue Detection: Canary Releases provide an opportunity for early user feedback in the production environment, enabling prompt issue identification and resolution.
Rollback Capability: In case of critical issues, we can swiftly revert to the previous state by affecting only the Canary users while keeping the rest of the user base unaffected.

Example
image

Action Items

To implement the Canary Release strategy for functions, we will follow these steps:

2.1 Add Canary Release Field to Function's Struct:

Add a new field to the function's struct to indicate whether the function is in Canary Release mode. This field could be a boolean value, such as isCanaryRelease.

2.2 Set Canary Instance Traffic Weight:

To support controlled traffic distribution, we will introduce the concept of traffic weight. When the function is in Canary Release mode, users will be allowed to set the traffic weight ratios for new and old instances. This can be configured through a settings file or a management interface, e.g., canaryWeight and oldWeight.

Reference: https://gateway-api.sigs.k8s.io/guides/traffic-splitting/
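
To make the weight idea in 2.2 concrete, here is a minimal Go sketch of how a request could be assigned to the canary or the old instances based on two weights. The `trafficSplit` type and its `pick` and `route` methods are hypothetical names used only for illustration; in practice the gateway would do this routing, not the function code.

```go
package main

import (
	"fmt"
	"math/rand"
)

// trafficSplit is a hypothetical holder for the two weights named in the
// proposal (canaryWeight and oldWeight). It is not part of OpenFunction.
type trafficSplit struct {
	CanaryWeight int // share of traffic sent to the new (canary) instances
	OldWeight    int // share of traffic sent to the old instances
}

// pick maps a roll in [0, CanaryWeight+OldWeight) to a backend name.
// Rolls below CanaryWeight go to the canary, all others to the old instances.
func (s trafficSplit) pick(roll int) string {
	if roll < s.CanaryWeight {
		return "canary"
	}
	return "old"
}

// route draws a random roll and picks a backend for one request.
// It returns an error when the weights are negative or sum to zero.
func (s trafficSplit) route(r *rand.Rand) (string, error) {
	total := s.CanaryWeight + s.OldWeight
	if s.CanaryWeight < 0 || s.OldWeight < 0 || total == 0 {
		return "", fmt.Errorf("invalid weights: canary=%d old=%d", s.CanaryWeight, s.OldWeight)
	}
	return s.pick(r.Intn(total)), nil
}

func main() {
	split := trafficSplit{CanaryWeight: 20, OldWeight: 80}
	r := rand.New(rand.NewSource(1))
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		backend, err := split.route(r)
		if err != nil {
			panic(err)
		}
		counts[backend]++
	}
	fmt.Println(counts) // roughly a 20/80 split between canary and old
}
```

Only the ratio between the two weights matters here, so 20/80 and 1/4 describe the same split.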

@benjaminhuo
Member

That's a good feature to have if any gateway backend, especially Contour, implements this part of the Gateway API spec.
@Bryce-huang @wanjunlei @wrongerror @tpiperatgod does the Contour version we currently use support Gateway API traffic splitting?

@Bryce-huang
Contributor Author

Bryce-huang commented Aug 31, 2023

functions spec:

CanarySteps []CanaryStep `json:"CanarySteps,omitempty"`
type CanaryStep struct {
	Weight *int32 `json:"weight,omitempty"`
	Pause  Pause  `json:"pause,omitempty"`
}
type Pause struct {
	// Duration is the amount of time to wait before moving to the next step.
	// +optional
	Duration *int32 `json:"duration,omitempty"`
}
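
As an illustration of how these spec fields might be checked before a rollout starts, here is a minimal Go sketch. The `validateSteps` and `int32Ptr` helpers are hypothetical, and the rules they enforce (weight required, within 0..100, never decreasing between steps, non-negative pause) are assumptions, not behavior confirmed anywhere in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// Pause and CanaryStep repeat the types from the spec above so that the
// sketch compiles on its own.
type Pause struct {
	Duration *int32 `json:"duration,omitempty"`
}

type CanaryStep struct {
	Weight *int32 `json:"weight,omitempty"`
	Pause  Pause  `json:"pause,omitempty"`
}

// validateSteps is a hypothetical helper. The rules below are assumptions.
func validateSteps(steps []CanaryStep) error {
	if len(steps) == 0 {
		return errors.New("at least one canary step is required")
	}
	prev := int32(0)
	for i, s := range steps {
		if s.Weight == nil {
			return fmt.Errorf("step %d: weight is required", i)
		}
		w := *s.Weight
		if w < 0 || w > 100 {
			return fmt.Errorf("step %d: weight %d is outside 0..100", i, w)
		}
		if w < prev {
			return fmt.Errorf("step %d: weight %d is lower than the previous step (%d)", i, w, prev)
		}
		if s.Pause.Duration != nil && *s.Pause.Duration < 0 {
			return fmt.Errorf("step %d: pause duration must not be negative", i)
		}
		prev = w
	}
	return nil
}

// int32Ptr returns a pointer to v, which is convenient for optional fields.
func int32Ptr(v int32) *int32 { return &v }

func main() {
	steps := []CanaryStep{
		{Weight: int32Ptr(20), Pause: Pause{Duration: int32Ptr(60)}},
		{Weight: int32Ptr(80), Pause: Pause{Duration: int32Ptr(120)}},
	}
	fmt.Println(validateSteps(steps)) // prints <nil>
}
```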

functions status:

	StableServing     *Condition    `json:"stableServing,omitempty"`
	StableServingHash string        `json:"stableServingHash,omitempty"`
	CanaryServingHash string        `json:"canaryServingHash,omitempty"`
	CanaryStatus      *CanaryStatus `json:"canaryStatus,omitempty"`
// CanaryStatus holds status fields that pertain only to the canary release
type CanaryStatus struct {
	CurrentStepIndex int32           `json:"currentStepIndex"`
	CurrentStepState CanaryStepState `json:"currentStepState"`
	Message          string          `json:"message,omitempty"`
	LastUpdateTime   *metav1.Time    `json:"lastUpdateTime,omitempty"`
	Phase            CanaryPhase     `json:"phase"`
}
type CanaryStepState string

const (
	CanaryStepStateUpgrade   CanaryStepState = "StepUpgrade"
	CanaryStepStatePaused    CanaryStepState = "StepPaused"
	CanaryStepStateReady     CanaryStepState = "StepReady"
	CanaryStepStateCompleted CanaryStepState = "Completed"
)

// CanaryPhase is the set of phases that a canary release can be in
type CanaryPhase string

const (
	// CanaryPhaseInitial indicates a function canary release is in its initial phase
	CanaryPhaseInitial CanaryPhase = "Initial"
	// CanaryPhaseHealthy indicates a function canary release is healthy
	CanaryPhaseHealthy CanaryPhase = "Healthy"
	// CanaryPhaseProgressing indicates a function canary release is not yet healthy but still making progress towards a healthy state
	CanaryPhaseProgressing CanaryPhase = "Progressing"
	// CanaryPhaseTerminating indicates a function canary release is terminating
	CanaryPhaseTerminating CanaryPhase = "Terminating"
)
)
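
To show how the step states above could drive a rollout, here is a small Go sketch of one possible state machine. The transition order (StepUpgrade, then StepPaused, then StepReady, then either the next step or Completed) is an assumption made for illustration. The `canaryProgress` type and the `next` function are hypothetical and leave out the other CanaryStatus fields.

```go
package main

import "fmt"

type CanaryStepState string

const (
	CanaryStepStateUpgrade   CanaryStepState = "StepUpgrade"
	CanaryStepStatePaused    CanaryStepState = "StepPaused"
	CanaryStepStateReady     CanaryStepState = "StepReady"
	CanaryStepStateCompleted CanaryStepState = "Completed"
)

// canaryProgress is a trimmed-down, hypothetical stand-in for CanaryStatus.
type canaryProgress struct {
	CurrentStepIndex int32
	CurrentStepState CanaryStepState
}

// next returns the progress after one reconcile pass. pauseElapsed tells
// whether the pause of the current step is over. The transition order is an
// assumption, not confirmed behavior.
func next(p canaryProgress, stepCount int32, pauseElapsed bool) canaryProgress {
	switch p.CurrentStepState {
	case CanaryStepStateUpgrade:
		// The new weight has been applied, so start waiting.
		p.CurrentStepState = CanaryStepStatePaused
	case CanaryStepStatePaused:
		if pauseElapsed {
			p.CurrentStepState = CanaryStepStateReady
		}
	case CanaryStepStateReady:
		if p.CurrentStepIndex+1 < stepCount {
			p.CurrentStepIndex++
			p.CurrentStepState = CanaryStepStateUpgrade
		} else {
			p.CurrentStepState = CanaryStepStateCompleted
		}
	}
	return p
}

func main() {
	p := canaryProgress{CurrentStepState: CanaryStepStateUpgrade}
	for p.CurrentStepState != CanaryStepStateCompleted {
		fmt.Printf("step %d: %s\n", p.CurrentStepIndex, p.CurrentStepState)
		p = next(p, 2, true)
	}
	fmt.Printf("step %d: %s\n", p.CurrentStepIndex, p.CurrentStepState)
}
```

Keeping the transition logic in a pure function like this makes it easy to test without a cluster.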

@JasonChen86899

Hi @Bryce-huang, maybe Argo Rollouts is a better choice for canary and blue-green deployments.

@Bryce-huang
Contributor Author

Do you have any experience using Argo Rollouts on functions?

@JasonChen86899

Yeah, just a little, with Dapr, and you reminded me.

  1. Argo Rollouts is used in Dapr 1.11 and 1.12: Guidance for Blue Green deployments dapr/dapr#6855
  2. Argo Rollouts can be combined with Knative: Argo Rollouts extensions argoproj/argo-rollouts#2133

Here is my simple design:

  1. Design an OpenFunction rollout CR (if needed).
  2. Use an Argo Rollouts plugin when creating the Knative Service (I have not tried this).

Maybe this is not a good design, but we can discuss it further with the community.

@Bryce-huang
Contributor Author

Bryce-huang commented Sep 20, 2023

Thank you for the introduction. I have considered Argo Rollouts in the past, but it requires the Argo Rollouts CR, which means introducing new components and increasing the architectural complexity of OpenFunction.
Considering that a function can be scaled down to 0, I think implementing this with the Kubernetes Gateway API is the most suitable approach. It takes very little work to achieve canary releases in my PR https://github.com//pull/490
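
For readers unfamiliar with Gateway API traffic splitting: an HTTPRoute rule lists several backendRefs, each with a weight, and traffic is divided in proportion to those weights. The Go sketch below derives a stable/canary pair of backendRefs from a single canary weight. The `backendRef` struct copies only three fields of the real API type, and `splitBackends` and the service names are hypothetical; this is not the code from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// backendRef mirrors only the name, port and weight fields of a Gateway API
// HTTPRoute backendRef. It is a simplified stand-in, not the real API type.
type backendRef struct {
	Name   string `json:"name"`
	Port   int32  `json:"port"`
	Weight int32  `json:"weight"`
}

// splitBackends is a hypothetical helper that returns a stable and a canary
// backendRef whose weights sum to 100.
func splitBackends(stable, canary string, port, canaryWeight int32) ([]backendRef, error) {
	if canaryWeight < 0 || canaryWeight > 100 {
		return nil, fmt.Errorf("canary weight %d is outside 0..100", canaryWeight)
	}
	return []backendRef{
		{Name: stable, Port: port, Weight: 100 - canaryWeight},
		{Name: canary, Port: port, Weight: canaryWeight},
	}, nil
}

func main() {
	// The service names are made up for this example.
	refs, err := splitBackends("test-server-stable", "test-server-canary", 80, 20)
	if err != nil {
		panic(err)
	}
	out, err := json.MarshalIndent(refs, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With this shape, moving to the next canary step only means rewriting the two weights on the route, which fits the point above about avoiding extra components.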

@JasonChen86899

JasonChen86899 commented Sep 20, 2023

Okay, I found that Knative Serving also has traffic management and revision-based rollouts: https://knative.dev/docs/serving/traffic-management/#traffic-routing-examples.
I think we can use this feature as well. We could add it as another implementation, even using the CR you defined.

@benjaminhuo
Member

I would suggest the following changes:

apiVersion: core.openfunction.io/v1beta2
kind: Function
metadata:
  name: test-server
spec:
  rolloutStrategy:
    canary:
      steps:
      - weight: 20
        pause:
          duration: 60
      - weight: 80
        pause:
          duration: 120
  image: brycehuang/web-service-image:v1
  serving:
    template:
      containers:
      - imagePullPolicy: IfNotPresent
        name: function
  version: latest
  workloadRuntime: OCIContainer
status:
  # status.serving and status.revision should always be the stable serving and revision
  # status.rollout.canary.serving and status.rollout.canary.revision should be the canary ones
  serving:
  revision: 
  rollout:
    canary:
      status: # CanaryStatus
      serving:
      revision:
