
Reference Implementation - Infrastructure

The Azure Mission-Critical reference implementation follows a layered and modular approach. This approach achieves the following goals:

  • A cleaner, more manageable deployment design
  • The ability to swap individual services for others that provide similar capabilities, depending on requirements
  • Separation between layers, which makes implementing RBAC easier when multiple teams are responsible for different aspects of the Azure Mission-Critical application deployment and operations

The Azure Mission-Critical reference implementations are composed of three distinct layers:

  • Infrastructure
  • Configuration
  • Application

The infrastructure layer contains all infrastructure components and underlying foundational services required for the Azure Mission-Critical reference implementation. It is deployed using Terraform.

Note: Bicep (ARM DSL) was considered during the early stages as part of a proof-of-concept, but discontinued for the time being.

The configuration layer applies the initial configuration and deploys additional services on top of the infrastructure components provisioned by the infrastructure layer.

The application layer contains all components and dependencies related to the application workload itself.

Architecture

(Architecture overview diagram)

Stamp independence

Every stamp - which usually corresponds to a deployment to one Azure Region - is considered independent. Stamps are designed to work without relying on components in other regions (i.e. "share nothing").

The main shared component between stamps which requires synchronization at runtime is the database layer. For this, Azure Cosmos DB was chosen as it provides the crucial ability of multi-region writes: each stamp can write locally, with Cosmos DB handling data replication and synchronization between the stamps.

Aside from the database, a geo-replicated Azure Container Registry (ACR) is shared between the stamps. The ACR is replicated to every region which hosts a stamp to ensure fast and resilient access to the images at runtime.

Stamps can be added and removed dynamically as needed to provide more resiliency, scale and proximity to users.

A global load balancer is used to distribute and load balance incoming traffic to the stamps (see Networking and connectivity for details).

Stateless compute clusters

As much as possible, no state should be stored on the compute clusters; all state is externalized to the database. This allows users to start a user journey on one stamp and continue it on another.

Scale Units

In addition to stamp independence and stateless compute clusters, each "stamp" is considered to be a Scale Unit (SU) following the Deployment stamps pattern. All components and services within a given stamp are configured and tested to serve requests within a given load range. This includes auto-scaling capabilities for each service as well as proper minimum and maximum values and regular evaluation.

An example Scale Unit design in Azure Mission-Critical starts from the scalability requirements, i.e. the expected capacity it has to serve:

Scalability requirements

| Metric           | max  |
| ---------------- | ---- |
| Users            | 25k  |
| New records/sec. | 200  |
| Get records/sec. | 5000 |

This definition is used to evaluate the capabilities of an SU on a regular basis and is then translated into a Capacity Model. The Capacity Model in turn informs the configuration of an SU that is able to serve the expected demand:

Configuration

| Component                    | min  | max   |
| ---------------------------- | ---- | ----- |
| AKS nodes                    | 3    | 12    |
| Ingress controller replicas  | 3    | 24    |
| CatalogService replicas      | 3    | 24    |
| BackgroundProcessor replicas | 3    | 12    |
| Event Hub throughput units   | 1    | 10    |
| Cosmos DB RUs                | 4000 | 40000 |

Note: Cosmos DB RUs are scaled in all regions simultaneously.
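
For illustration, these minimum and maximum values could be surfaced as Terraform variables along the following lines. This is a hedged sketch only; the variable names and defaults are illustrative and not the actual module interface.

```hcl
# Illustrative only: how the Scale Unit limits above could be parameterized.
variable "aks_node_pool_autoscale_minimum" {
  description = "Minimum number of AKS nodes per node pool"
  type        = number
  default     = 3
}

variable "aks_node_pool_autoscale_maximum" {
  description = "Maximum number of AKS nodes per node pool"
  type        = number
  default     = 12
}

variable "cosmosdb_container_max_throughput" {
  description = "Autoscale maximum in RUs; Cosmos DB autoscale uses 10% of this value as the floor (4000 for 40000)"
  type        = number
  default     = 40000
}

variable "eventhub_maximum_throughput_units" {
  description = "Upper limit for Event Hub auto-inflate"
  type        = number
  default     = 10
}
```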

Each SU is deployed into an Azure region and therefore primarily handles traffic from that area (although it can take over traffic from other regions when needed). This geographic spread will likely result in load patterns and business hours that vary from region to region, so every SU is designed to scale in and down when idle.

Infrastructure

Available Azure Regions

The reference implementation of Azure Mission-Critical deploys a set of Azure services. These services are not available across all Azure regions. In addition, only regions which offer Availability Zones (AZs) are considered for a stamp. AZs are gradually being rolled out and are not yet available in all regions. Due to these constraints, the reference implementation cannot be deployed to every Azure region.

As of May 2022, the following regions have been successfully tested with the reference implementation of Azure Mission-Critical:

Europe/Africa

  • northeurope
  • westeurope
  • germanywestcentral
  • francecentral
  • uksouth
  • norwayeast
  • swedencentral
  • switzerlandnorth
  • southafricanorth

Americas

  • westus2
  • eastus
  • eastus2
  • centralus
  • southcentralus
  • brazilsouth
  • canadacentral

Asia Pacific

  • australiaeast
  • southeastasia
  • eastasia
  • japaneast
  • koreacentral
  • centralindia

Note: Depending on which regions you select, you might first need to request additional quota through Azure Support for some of the services (mostly for AKS VMs and Cosmos DB).

It's worth calling out that where an Azure service is not available, an equivalent service may be deployed in its place. Availability Zones are the main limiting factor as far as the Azure Mission-Critical reference implementation is concerned.

As regional availability of the services used in the reference implementation, and of AZs, ramps up, we expect this list to change and the set of Azure regions to which the reference implementation can be deployed to grow.

Note: If the target availability SLA for your application workload can be achieved without AZs, and/or your workload is not bound by data-sovereignty compliance requirements, an alternative region where all required services (and AZs) are available can be considered.

Global resources

Azure Front Door

  • Front Door is used as the only entry point for user traffic. All backend systems are locked down to only allow traffic that comes through the AFD instance.
  • Each stamp comes with a pre-provisioned Public IP address resource, whose DNS name is used as a backend for Front Door.
  • Diagnostic settings are configured to store all log and metric data for 30 days (retention policy) in Log Analytics.
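
For illustration, the diagnostic settings mentioned above might be expressed in Terraform roughly as follows. This is a sketch only; the resource references, names and log categories are assumptions, not the exact implementation.

```hcl
resource "azurerm_monitor_diagnostic_setting" "frontdoor" {
  name                       = "frontdoorladiagnostics"                  # illustrative name
  target_resource_id         = azurerm_frontdoor.main.id                 # assumed resource reference
  log_analytics_workspace_id = azurerm_log_analytics_workspace.global.id # assumed resource reference

  log {
    category = "FrontdoorAccessLog"
    enabled  = true

    retention_policy {
      enabled = true
      days    = 30
    }
  }

  metric {
    category = "AllMetrics"

    retention_policy {
      enabled = true
      days    = 30
    }
  }
}
```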

Azure Cosmos DB

  • The SQL (Core) API is used.
  • Multi-region write is enabled
  • The account is replicated to every region in which there is a stamp deployed.
  • zone_redundancy is enabled for each replicated region.
  • Request Unit autoscaling is enabled at the container level.
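
The account settings described above translate roughly into the Terraform sketch below. The names, the dynamic geo_location block and the example container are assumptions for illustration, not the actual module code.

```hcl
resource "azurerm_cosmosdb_account" "main" {
  name                            = "${var.prefix}-global-cosmos"   # hypothetical name
  resource_group_name             = azurerm_resource_group.global.name
  location                        = var.primary_location
  offer_type                      = "Standard"
  kind                            = "GlobalDocumentDB"              # SQL (Core) API
  enable_multiple_write_locations = true                            # multi-region writes

  consistency_policy {
    consistency_level = "Session"   # assumed consistency level
  }

  # One geo_location block per region that hosts a stamp, each zone redundant.
  dynamic "geo_location" {
    for_each = var.stamp_locations   # assumed list of regions
    content {
      location          = geo_location.value
      failover_priority = geo_location.key
      zone_redundant    = true
    }
  }
}

# Request Unit autoscaling is configured per container.
resource "azurerm_cosmosdb_sql_container" "catalog" {
  name                = "catalogItems"   # hypothetical container name
  resource_group_name = azurerm_resource_group.global.name
  account_name        = azurerm_cosmosdb_account.main.name
  database_name       = azurerm_cosmosdb_sql_database.main.name
  partition_key_path  = "/id"

  autoscale_settings {
    max_throughput = 40000   # autoscale maximum in RUs
  }
}
```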

Azure Container Registry

  • sku is set to Premium to allow geo-replication.
  • georeplication_locations is automatically set to reflect all regions that a regional stamp was deployed to.
  • zone_redundancy_enabled provides resiliency and high availability within a specific region.
  • admin_enabled is set to false; the admin user is not used. Access to images stored in ACR, for example from AKS, is only possible via Azure AD role assignments.
  • Diagnostic settings are configured to store all log and metric data in Log Analytics.
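
A minimal sketch of this configuration is shown below, assuming a recent azurerm provider where the older georeplication_locations list is expressed as georeplications blocks. Names and variables are illustrative only.

```hcl
resource "azurerm_container_registry" "main" {
  name                    = "${var.prefix}globalcr"   # hypothetical; ACR names allow no hyphens
  resource_group_name     = azurerm_resource_group.global.name
  location                = var.primary_location
  sku                     = "Premium"                 # required for geo-replication
  admin_enabled           = false
  zone_redundancy_enabled = true

  # One replica per additional region that hosts a stamp.
  dynamic "georeplications" {
    for_each = var.additional_stamp_locations         # assumed list of regions
    content {
      location                = georeplications.value
      zone_redundancy_enabled = true
    }
  }
}
```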

Azure Log Analytics for Global Resources

  • Used to collect diagnostic logs of the global resources
  • daily_quota_gb is set to prevent overspend, especially on environments that are used for load testing.
  • retention_in_days is set to avoid the overspend that comes from keeping data in Log Analytics longer than needed - long-term log and metric retention is supposed to happen in Azure Storage.
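
For illustration, a workspace with these cost controls could look like the following Terraform sketch (the name and resource references are assumptions):

```hcl
resource "azurerm_log_analytics_workspace" "global" {
  name                = "${var.prefix}-global-log"   # hypothetical name
  resource_group_name = azurerm_resource_group.monitoring.name
  location            = var.primary_location
  sku                 = "PerGB2018"
  retention_in_days   = 30   # keep data only as long as needed
  daily_quota_gb      = 30   # cap daily ingestion to prevent overspend
}
```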

Stamp resources

A stamp is a regional deployment and can also be considered a scale unit. For now, only one stamp is deployed per Azure region, but this can be extended to allow multiple stamps per region if required.

Networking

The current networking setup consists of a single Azure Virtual Network per stamp, with one subnet dedicated to Azure Kubernetes Service (AKS).

  • Each stamp infrastructure includes a pre-provisioned static Public IP address resource with a DNS name ([prefix]-cluster.[region].cloudapp.azure.com). This Public IP address is used for the Kubernetes Ingress controller Load Balancer and as a backend address for Azure Front Door.
  • Diagnostic settings are configured to store all log and metric data in Log Analytics.
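
The pre-provisioned Public IP address with its DNS name could be sketched in Terraform as follows, assuming azurerm 3.x syntax; the names and locals are illustrative.

```hcl
resource "azurerm_public_ip" "aks_ingress" {
  name                = "${local.prefix}-${var.location}-ingress-pip"   # hypothetical name
  resource_group_name = azurerm_resource_group.stamp.name
  location            = var.location
  allocation_method   = "Static"
  sku                 = "Standard"
  zones               = ["1", "2", "3"]             # zone-redundant (azurerm 3.x syntax)
  domain_name_label   = "${local.prefix}-cluster"   # -> [prefix]-cluster.[region].cloudapp.azure.com
}
```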

Azure Key Vault

  • Key Vault is used as the sole configuration store by the application, for both secrets and non-sensitive values.
  • sku_name is set to standard.
  • Diagnostic settings are configured to store all log and metric data in Log Analytics.
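
A minimal sketch of this resource, with illustrative names, might look like:

```hcl
resource "azurerm_key_vault" "stamp" {
  name                = "${local.prefix}-${var.location}-kv"   # hypothetical name
  resource_group_name = azurerm_resource_group.stamp.name
  location            = var.location
  tenant_id           = data.azurerm_client_config.current.tenant_id
  sku_name            = "standard"
}
```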

Azure Kubernetes Service

Azure Kubernetes Service (AKS) is used as the compute platform as it is the most versatile option and Kubernetes is the de facto standard compute platform for modern applications, both inside and outside of Azure.

This Azure Mission-Critical reference implementation uses Linux-only clusters as its sample workload is written in .NET Core and there is no requirement for any Windows-based containers.

  • role_based_access_control (RBAC) is enabled.
  • sku_tier is set to Paid (Uptime SLA) to achieve the 99.95% SLA within a single region (with availability_zones enabled).
  • http_application_routing is disabled as it is not recommended for production environments; a separate ingress controller solution is used instead.
  • Managed Identities (SystemAssigned) are used, instead of Service Principals.
  • azure_policy_enabled is set to true to enable the use of Azure Policies in Azure Kubernetes Service. The policy configured in the reference implementation is in "audit-only" mode. It is mostly integrated to demonstrate how to set this up through Terraform.
  • oms_agent is configured to enable the Container Insights addon and ship AKS monitoring data to Azure Log Analytics via an in-cluster OMS Agent (DaemonSet).
  • Diagnostic settings are configured to store all log and metric data in Log Analytics.
  • default_node_pool (used as system node pool) settings
    • availability_zones is set to 3 to leverage all three AZs in a given region.
    • enable_auto_scaling is configured to let all node pools automatically scale out if needed.
    • os_disk_type is set to Ephemeral to leverage Ephemeral OS disks for performance reasons.
    • upgrade_settings max_surge is set to 33% which is the recommended value for production workloads.
  • A separate "workload" (aka user) node pool with the same settings as the "system" node pool but different VM SKUs and auto-scale settings.
    • The user node pool is configured with a taint workload=true:NoSchedule to prevent non-workload pods from being scheduled. The node label role=workload can be used to target this node pool when deploying a workload (see charts/catalogservice for an example).

Individual stamps are considered ephemeral and stateless. Updates to the infrastructure and application follow a Zero-downtime Update Strategy and do not touch existing stamps. Updates to Kubernetes are therefore primarily rolled out by releasing new versions and replacing existing stamps. To update node images between two releases, automatic_channel_upgrade is used in combination with maintenance_window:

  • automatic_channel_upgrade is set to node-image to automatically upgrade node pools with the most recent AKS node image.
  • maintenance_window contains the allowed window in which to run automatic_channel_upgrade upgrades. It is currently set to allow upgrades on Sundays between 0:00 and 2:00 AM.
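
To tie the AKS settings above together, the following is a minimal Terraform sketch assuming azurerm 3.x syntax. The resource names, VM SKUs and variables are illustrative assumptions, not the actual module code.

```hcl
resource "azurerm_kubernetes_cluster" "stamp" {
  name                             = "${local.prefix}-${var.location}-aks"   # hypothetical name
  resource_group_name              = azurerm_resource_group.stamp.name
  location                         = var.location
  dns_prefix                       = "${local.prefix}-${var.location}"
  sku_tier                         = "Paid"         # Uptime SLA
  automatic_channel_upgrade        = "node-image"   # automatic node image upgrades
  azure_policy_enabled             = true
  http_application_routing_enabled = false

  identity {
    type = "SystemAssigned"   # managed identity instead of a service principal
  }

  default_node_pool {
    name                = "system"
    vm_size             = "Standard_D2s_v4"   # illustrative SKU
    zones               = ["1", "2", "3"]     # all three AZs
    enable_auto_scaling = true
    min_count           = 3
    max_count           = 12
    os_disk_type        = "Ephemeral"
    vnet_subnet_id      = azurerm_subnet.aks.id

    upgrade_settings {
      max_surge = "33%"
    }
  }

  maintenance_window {
    allowed {
      day   = "Sunday"
      hours = [0, 1]   # between 0:00 and 2:00 AM
    }
  }

  oms_agent {
    log_analytics_workspace_id = azurerm_log_analytics_workspace.stamp.id
  }
}

# Separate user/workload node pool; the taint keeps non-workload pods away.
resource "azurerm_kubernetes_cluster_node_pool" "workload" {
  name                  = "workload"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.stamp.id
  vm_size               = "Standard_F8s_v2"   # illustrative SKU
  zones                 = ["1", "2", "3"]
  enable_auto_scaling   = true
  min_count             = 3
  max_count             = 24
  node_taints           = ["workload=true:NoSchedule"]
  node_labels           = { role = "workload" }
}
```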

Azure Log Analytics for Stamp Resources

Each region has an individual Log Analytics workspace configured to store all log and metric data. As each stamp deployment is considered ephemeral, these workspaces are deployed as part of the global resources and do not share the lifecycle of a stamp. This ensures that when a stamp is deleted (which happens regularly), logs are still available. Log Analytics workspaces reside in a separate resource group <prefix>-monitoring-rg.

  • sku is set to PerGB2018.
  • daily_quota_gb is set to 30 GB to prevent overspend, especially on environments that are used for load testing.
  • retention_in_days is set to 30 days to avoid the overspend that comes from keeping data in Log Analytics longer than needed - long-term log and metric retention is supposed to happen in Azure Storage.
  • For the Health Model, a set of Kusto functions needs to be added to Log Analytics, using the SavedSearch sub-resource type. Because these queries can get quite bulky, they are loaded from files instead of being specified inline in Terraform. The files are stored in the monitoring/queries subdirectory of the /src/infra directory.
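
Loading such a Kusto function from a file could look like the following sketch; the function name, category and file name are hypothetical examples.

```hcl
resource "azurerm_log_analytics_saved_search" "health_status" {
  name                       = "StampHealthStatus"   # hypothetical function name
  log_analytics_workspace_id = azurerm_log_analytics_workspace.stamp.id
  category                   = "HealthModel"
  display_name               = "StampHealthStatus"
  function_alias             = "StampHealthStatus"
  # Query body loaded from a file instead of being defined inline (file name is hypothetical).
  query                      = file("${path.module}/monitoring/queries/stamphealthstatus.kql")
}
```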

Azure Application Insights

As with Log Analytics, Application Insights is also deployed per region and does not share the lifecycle of a stamp. All Application Insights resources are deployed in a separate resource group <prefix>-monitoring-rg as part of the global resources deployment.

  • Log Analytics Workspace-attached mode is being used.
  • daily_data_cap_in_gb is set to 30 GB to prevent overspend, especially on environments that are used for load testing.
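
A minimal sketch of a workspace-based Application Insights resource with this cap, using illustrative names, might look like:

```hcl
resource "azurerm_application_insights" "stamp" {
  name                 = "${local.prefix}-${var.location}-appi"   # hypothetical name
  resource_group_name  = azurerm_resource_group.monitoring.name
  location             = var.location
  application_type     = "web"
  workspace_id         = azurerm_log_analytics_workspace.stamp.id   # workspace-based mode
  daily_data_cap_in_gb = 30                                         # cap ingestion to prevent overspend
}
```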

Azure Policy

Azure Policy is used to monitor and enforce certain baselines. All policies are assigned on a per-stamp, per-resource-group level. Azure Kubernetes Service is configured with the azure_policy addon so that policies defined outside of Kubernetes are applied to the cluster.

Azure Event Hub

  • Each stamp has one Standard-tier, zone_redundant Event Hubs namespace.
  • Auto-inflate (automatic scale-up) can optionally be enabled via a Terraform variable.
  • The namespace holds one Event Hub, backendqueue-eh, with a dedicated consumer group for each consumer (currently only one).
  • Diagnostic settings are configured to store all log and metric data in Log Analytics.
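
A hedged Terraform sketch of this setup is shown below; the namespace name, variable, partition count and consumer group name are assumptions for illustration.

```hcl
resource "azurerm_eventhub_namespace" "stamp" {
  name                     = "${local.prefix}-${var.location}-evhns"   # hypothetical name
  resource_group_name      = azurerm_resource_group.stamp.name
  location                 = var.location
  sku                      = "Standard"
  zone_redundant           = true
  auto_inflate_enabled     = var.enable_eventhub_auto_inflate   # assumed variable
  maximum_throughput_units = var.enable_eventhub_auto_inflate ? 10 : null
}

resource "azurerm_eventhub" "backendqueue" {
  name                = "backendqueue-eh"
  namespace_name      = azurerm_eventhub_namespace.stamp.name
  resource_group_name = azurerm_resource_group.stamp.name
  partition_count     = 4   # illustrative
  message_retention   = 1   # days; illustrative
}

resource "azurerm_eventhub_consumer_group" "backendprocessor" {
  name                = "backendprocessor-cg"   # hypothetical consumer group name
  namespace_name      = azurerm_eventhub_namespace.stamp.name
  eventhub_name       = azurerm_eventhub.backendqueue.name
  resource_group_name = azurerm_resource_group.stamp.name
}
```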

Azure Storage Accounts

  • Two storage accounts are deployed per stamp:
    • A "public" storage account with "static web site" enabled. This is used to host the UI single-page application.
    • A "private" storage account which is used for internals such as the health service and the Event Hub checkpointing.
  • Both accounts are deployed in zone-redundant mode (ZRS).
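
For illustration, the "public" storage account with static website hosting could be sketched as follows; the name is hypothetical (Storage Account names allow no hyphens).

```hcl
resource "azurerm_storage_account" "public" {
  name                     = lower("${var.prefix}${var.location}pubst")   # hypothetical name
  resource_group_name      = azurerm_resource_group.stamp.name
  location                 = var.location
  account_tier             = "Standard"
  account_replication_type = "ZRS"   # zone-redundant storage

  static_website {
    index_document = "index.html"   # hosts the UI single-page application
  }
}
```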

Supporting services

This repository also contains a couple of supporting services for the Azure Mission-Critical project.

These supporting services are required or optional depending on how you choose to use Azure Mission-Critical.

Naming conventions

All resources used for Azure Mission-Critical follow a pre-defined and consistent naming structure to make it easier to identify them and to avoid confusion. Resource abbreviations are based on the Cloud Adoption Framework. These abbreviations are typically attached as a suffix to each resource in Azure.

A prefix is used to uniquely identify "deployments", as some names in Azure must be globally unique. Examples of these include Storage Accounts, Container Registries and Cosmos DB accounts.

Resource groups

Resource group names begin with the prefix and then indicate whether they contain per-stamp or global resources. In case of per-stamp resource groups, the name also contains the Azure region they are deployed to.

<prefix><suffix>-<global | stamp>-<region>-rg

This will, for example, result in aoprod-global-rg for global services in prod or aoprod7745-stamp-eastus2-rg for a stamp deployment in eastus2.

Resources

<prefix><suffix>-<region>-<resource> for resources that support hyphens (-) in their names, and <prefix><region><resource> for resources such as Storage Accounts, Container Registries and others that do not.

This will result in, for example, aoprod7745-eastus2-aks for an AKS cluster in eastus2.
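
These conventions could be expressed as Terraform locals along the following lines; the local names are illustrative assumptions, not the actual module code.

```hcl
locals {
  prefix           = "${var.prefix}${var.suffix}"                   # e.g. "aoprod7745"
  stamp_rg_name    = "${local.prefix}-stamp-${var.location}-rg"     # e.g. aoprod7745-stamp-eastus2-rg
  aks_cluster_name = "${local.prefix}-${var.location}-aks"          # e.g. aoprod7745-eastus2-aks
  storage_name     = lower("${local.prefix}${var.location}st")      # no hyphens for Storage Accounts
}
```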


Back to documentation root