ASVM and landing zone configuration lifecycle #501

Open
arne21a opened this issue Sep 20, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@arne21a

arne21a commented Sep 20, 2023

Hi, I'm not sure if this is the best place to ask this, but since our efforts to reach someone through our Microsoft contacts have not succeeded, I thought I'd post it here. Perhaps we can initiate a discussion.

Disclaimer: This text has been proofread by an AI tool. Though the language might bear an AI-generated feel, rest assured that the content and sentiments expressed are entirely my own. I've double-checked the corrections to ensure the original meaning remains intact. I just wanted to save you from my English skills ;)

TL;DR: We've built an Enterprise Scale/CAF platform and have been using the ASVM to generate landing zone configurations. While it has been effective so far, we're now facing lifecycle management challenges. We have some solution ideas, but we're eager to hear perspectives from the community.

Context about the project:
We are building an enterprise-scale architecture/CAF platform for a mid-sized company. We began with the January 2022 release and have followed the advancements since then. During this period, we also developed several smaller and larger add-ons and contributed to aztfmod. The platform, as described in the reference architecture, has been deployed and tailored to our requirements. For instance, we added firewall vnet-attached spokes to circumvent limitations associated with secured vWAN hubs.

Many of our modifications center around the Azure Subscription Vending Machine (ASVM). We've incorporated standardized landing zones with networking, automated DNS delegation, a consumable RBAC system, automatic Azure DevOps bootstrapping, auto-scaling VMSS DevOps agents for each landing zone, and more.

All these components are generated as tfvars through the ASVM process. The initial rollouts proceeded smoothly. However, we're now delving into concepts for Day 3 operations, which present a myriad of lifecycle management challenges. While I have some strategies in mind to tackle a few of these challenges, I'm curious to know if there are pre-established concepts from other projects addressing these concerns. Tackling this problem seems to necessitate a fair amount of custom solutions, and I'm eager to explore any insights or best practices that the community might have.

The primary challenges we've identified include:

  1. Versioning and Tagging (arguably the most straightforward):

    We plan to employ semantic versioning (semver) for the ASVM code responsible for generating our landing zone configuration. Our intended versioning scheme adheres to the standard: MAJOR.MINOR.PATCH. Here's how we classify changes:

    • PATCH: Iterative modifications that don't necessitate migration efforts.
    • MINOR: Adjustments that introduce migration requirements.
    • MAJOR: Revisions that enact breaking changes to the existing architecture.
      To enhance traceability, the ASVM version utilized will be embedded into the resulting configuration (a rough sketch follows below). Additionally, we aim to tag all associated repositories, like supermodules, to guarantee the ability to reproduce configurations at a later point in time.
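
    To make this concrete, here is a minimal, purely illustrative sketch (Python); the version value, file layout, and function names are assumptions and not part of the existing ASVM code.

    ```python
    # Illustrative sketch: embed the ASVM version into every generated tfvars file
    # and tag the associated repositories so a configuration can be reproduced later.
    import subprocess
    from pathlib import Path

    ASVM_VERSION = "1.4.2"  # hypothetical MAJOR.MINOR.PATCH of the generator

    def write_generated_tfvars(path: Path, body: str) -> None:
        """Prepend a traceability header before writing the generated configuration."""
        header = (
            "# Generated by ASVM -- do not edit by hand\n"
            f"# asvm_version = \"{ASVM_VERSION}\"\n"
        )
        path.write_text(header + body)

    def tag_repositories(repos: list[Path]) -> None:
        """Tag every associated repository (e.g. supermodules) with the generator version."""
        for repo in repos:
            subprocess.run(
                ["git", "-C", str(repo), "tag", f"asvm-v{ASVM_VERSION}"],
                check=True,
            )
    ```
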
  2. Customizations of Level 3 Code

    Up to now, regenerating the ASVM configuration overwrites all existing files of the landing zone. This approach is suitable for workloads that exclusively utilize the default landing zone configuration. However, in certain instances, modifications to the generated Level 3 code become essential due to reasons such as:

    • Specialized networking needs (e.g., specific subnets or additional vnets).
    • Level 4 solution requirements necessitating elevated privileges, like AAD apps.
    • Adjustments to the DevOps setup.

    We've pondered multiple strategies to manage this, though none seem both robust and straightforward. Some preliminary ideas include:

    Kubernetes Kustomize Approach:
    Given the parallels between our requirements and Kubernetes manifests, especially regarding the popular "config as data" approach, we're considering adopting a lightweight version of Kubernetes Kustomize tailored for CAF Terraform tfvars.

    Simple Overwrite Folders:
    Under this method, each state would have a dedicated config overwrite folder. This folder would contain configurations immune to ASVM generation. If a resource type is outlined in the overlay, its definition would replace all resources of the same category in the generated folder. Although this is the most straightforward to implement, it may not cater to all needs.

    Merging Overwrites:
    Similar to the previous approach, each state possesses a config overwrite folder immune to ASVM interference. During a pre-plan phase, both the ASVM and overlay configurations would merge into a unified file, enabling the coexistence of resources of identical types in both ASVM and the overlay. However, this model doesn't support altering ASVM resources.

    Complex Merging Overwrites:
    Building on the merging overwrites concept, we could consider notations allowing the modification or deletion of configurations initiated by the ASVM.
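
    To illustrate the merging idea, here is a rough sketch assuming tfvars are handled as plain dictionaries (e.g. JSON-formatted tfvars); the "__delete__" marker stands in for the complex-merging notation and is purely hypothetical, not an existing CAF feature.

    ```python
    # Sketch of "merging overwrites": the per-state overwrite folder is merged over
    # the ASVM output; a delete marker shows how a "complex merging" notation could
    # also remove an ASVM-generated entry.
    DELETE_MARKER = "__delete__"  # hypothetical notation

    def merge(generated: dict, overlay: dict) -> dict:
        """Recursively merge overlay into generated; the overlay wins on conflicts."""
        result = dict(generated)
        for key, value in overlay.items():
            if value == DELETE_MARKER:
                result.pop(key, None)  # drop an ASVM-generated entry
            elif isinstance(value, dict) and isinstance(result.get(key), dict):
                result[key] = merge(result[key], value)
            else:
                result[key] = value  # add or replace
        return result

    # Example: the workload adds a subnet and removes a generated peering.
    generated = {"vnets": {"spoke": {"subnets": {"default": {"cidr": "10.1.0.0/24"}}}},
                 "peerings": {"to_old_hub": {"remote": "hub1"}}}
    overlay = {"vnets": {"spoke": {"subnets": {"aks": {"cidr": "10.1.1.0/24"}}}},
               "peerings": {"to_old_hub": "__delete__"}}
    merged = merge(generated, overlay)
    ```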

    Introduce a Second Customization State to Level 3:
    This would involve adding a new state only for customizations, making sure it is untouched by ASVM configurations. One state would be used just for generated configurations and the other for customizations. A potential issue could come up from spreading the creation of Level 3 resources across different states. For instance, there is an issue in aztfmod where most modules expect the subnet to be in the same state as the vnet. So, splitting the creation of the vnet (platform) and the subnets (sized by the workload in some cases) might introduce problems in Level 4. Implementing this would also mean changes to the Level 3+4 state management and permission setup. However, this approach seems like a cleaner way for Level 3 customizations by providing a defined interface and a separate blast radius for the generated code and the customizations.

    As an initial measure, we could prioritize detection (a rough sketch follows this list). This might involve:

    • Creating a new branch for the code newly generated by the ASVM.
    • Generating the new configs.
    • Examining the git diff against a pre-existing whitelist of expected alterations.
    • Issuing an alert upon detecting discrepancies.
    • Triggering manual resolution.
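
    A minimal sketch of such a detection step might look like the following; the whitelist patterns and the comparison branch are assumptions that would have to match the actual repository layout.

    ```python
    # Sketch: after regenerating the configuration on a fresh branch, compare the
    # git diff against a whitelist of paths that are allowed to change and fail the
    # pipeline on anything unexpected so a human can resolve it.
    import fnmatch
    import subprocess
    import sys

    ALLOWED_PATTERNS = [  # hypothetical paths the ASVM is expected to touch
        "landingzones/*/networking.auto.tfvars.json",
        "landingzones/*/asvm_version.auto.tfvars.json",
    ]

    def changed_files(base: str = "main") -> list[str]:
        out = subprocess.run(["git", "diff", "--name-only", base],
                             capture_output=True, text=True, check=True)
        return [line for line in out.stdout.splitlines() if line]

    unexpected = [f for f in changed_files()
                  if not any(fnmatch.fnmatch(f, p) for p in ALLOWED_PATTERNS)]
    if unexpected:
        print("Manual resolution required for:", *unexpected, sep="\n  ")
        sys.exit(1)
    ```
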
  3. Migration Steps Outside of Terraform
    Patch-level changes should not require any migration path. Regenerating the configuration and then running a simple terraform apply should suffice as an update procedure. However, for some updates (especially minor and major changes), there are operations that require the removal of an old resource before introducing a new one.

    An example of this is the disassociation of a VNet from a hub and its subsequent peering with a new VNet. If this operation is conducted within a single apply action, there's a significant risk of encountering a race or dependency condition. Such conditions can disrupt the apply process and potentially corrupt the Terraform state.

    For these scenarios, migration scripts should be provided for every minor and major update. These can be implemented in the form of Ansible roles or a detailed playbook using tagging.

    Additionally, there should be an option to indicate the need for manual intervention after the automated update completes. This is relevant, for instance, when a policy creates a resource in response to a resource created by the automation. As of now, only our route tables are handled this way.
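
    A rough sketch of what a migration runner along these lines could look like, assuming one script per minor/major step and a flag for updates that need manual follow-up; all script names, versions, and the file layout are illustrative.

    ```python
    # Sketch: run every migration step between the deployed and the target ASVM
    # version in order, and surface the steps that require manual intervention
    # (e.g. resources created by policy, such as our route tables).
    import subprocess

    MIGRATIONS = [  # hypothetical, ordered list of migration steps
        {"version": (1, 3, 0), "script": "migrations/1.3.0-move-vnet-peering.sh", "manual": False},
        {"version": (2, 0, 0), "script": "migrations/2.0.0-split-hub.sh", "manual": True},
    ]

    def migrate(deployed: tuple, target: tuple) -> None:
        for step in MIGRATIONS:
            if deployed < step["version"] <= target:
                # e.g. disassociate the vnet from the old hub before the new peering
                # is applied, so a single apply never has to race against itself
                subprocess.run(["bash", step["script"]], check=True)
                if step["manual"]:
                    print(f"Manual intervention required after {step['version']}")

    migrate(deployed=(1, 2, 5), target=(2, 0, 0))
    ```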

I would love to hear the community and maintainers' perspectives on these topics. Have you encountered similar challenges? Are there established patterns or best practices to apply in these situations?

Another, remotely related question:
Firstly, I'd like to clarify that my intention isn't to critique your efforts. Collaborating through pull requests and chat has been genuinely enjoyable, and the overall product experience is commendable. However, we've faced challenges when trying to gather information about CAF and Enterprise Scale from customer-facing teams at Microsoft. We've engaged with several key account managers and other representatives, but it's been difficult to get answers to basic questions about CAF or Enterprise Scale. Many discussions have concluded with promises of follow-up information, but unfortunately, there hasn't been any concrete progress. Is there a specific community within Microsoft that customers can join for better insights into roadmaps, collaboration opportunities, and potentially architectural reviews?

@arne21a arne21a added the enhancement New feature or request label Sep 20, 2023
@yves-vogl

Thank you @arne21a for this comprehensive proposal 👍
Compared with AWS and the versioning of Control Tower Landing Zones, this is currently the most important feature missing in Azure. Lifecycle management is often ignored at the beginning even though it is one of the most crucial success factors in the long run.

I'm curious what @arnaudlh thinks about this.
