Architecture Decisions

Why this layout, what we skipped, and when to revisit

Why This Layout

This landing zone makes deliberate trade-offs: simplicity over completeness, speed over perfection. Every decision below is reversible. None of them will paint you into a corner.

Management Groups

What We Deploy

Tenant Root Group
└── mg-<yourcompany>
    ├── sub-<yourcompany>-prod
    └── sub-<yourcompany>-nonprod
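Creating this hierarchy is two CLI calls. A sketch using the Azure CLI, where mg-yourcompany and the subscription ID are placeholders:

```shell
# Create the single management group under the Tenant Root Group
az account management-group create \
  --name mg-yourcompany \
  --display-name "YourCompany"

# Move an existing subscription into it (repeat per subscription)
az account management-group subscription add \
  --name mg-yourcompany \
  --subscription "00000000-0000-0000-0000-000000000000"
```

The same `subscription add` call is what makes later re-parenting cheap: moving a subscription to a different management group is the same one-liner against the new group.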

Why a Single Management Group

Azure Landing Zone (ALZ) uses a deep hierarchy:

Tenant Root Group
└── mg-company
    ├── mg-platform
    │   ├── mg-management
    │   ├── mg-connectivity
    │   └── mg-identity
    ├── mg-landing-zones
    │   ├── mg-corp
    │   └── mg-online
    ├── mg-sandbox
    └── mg-decommissioned

This exists because enterprises have hundreds of subscriptions owned by different teams with different compliance requirements. You don’t. You have 2-5 subscriptions and one team making all the decisions.

A single management group gives you:

  • One place to apply policies that cover everything
  • Zero hierarchy to maintain or explain to new hires
  • Easy migration later — moving subscriptions between management groups is a 10-second operation

When to Add More

Add a second management group level when:

  • You have 5+ subscriptions and need different policies for different teams
  • Compliance requirements differ between workloads (e.g., PCI vs non-PCI)
  • You hire a dedicated platform team that needs its own governance scope

Subscription Topology

Two Subscriptions: Prod and Non-Prod

  • yourcompany-prod — Production workloads, production databases, customer-facing services. RBAC: Admins: Owner, Developers: Reader, CI/CD: Contributor
  • yourcompany-nonprod — Dev, staging, QA, CI/CD agents, experiments. RBAC: Admins: Owner, Developers: Contributor

Why Not One Subscription?

The subscription is Azure’s strongest isolation boundary. Separating prod from non-prod gives you:

  1. Cost isolation for free — No tagging gymnastics to figure out dev vs prod spend
  2. RBAC without custom roles — Developers get Contributor on non-prod, Reader on prod
  3. Blast radius containment — az group delete in dev can’t touch prod
  4. Quota isolation — Non-prod experiments won’t consume prod resource quotas

⚠️ One Workload Per Subscription

This layout assumes one primary workload per subscription. If you’re deploying a second independent workload (separate team, separate lifecycle, separate cost center), don’t put it in the same subscription — create a new one.

Why this matters:

  • Resource groups are not isolation boundaries. They don’t provide separate RBAC inheritance, cost tracking, or policy scopes the way subscriptions do.
  • Habits form early. Once your team starts treating resource groups as workload boundaries, it becomes an embedded practice that’s expensive to refactor later.
  • Subscriptions are free. Adding a subscription costs nothing — the overhead is only in CI/CD and RBAC setup, which takes an hour.

If you find yourself adding a second workload, that’s a graduation signal.

Why Not Three+ Subscriptions?

You can. Common third subscriptions:

  • Sandbox — For unrestricted experimentation (no policies, auto-delete after 30 days)
  • Data — If you have a dedicated data platform (Databricks, Synapse, data lakes)
  • Shared Services — If you have cross-cutting services (container registry, key vault)

But don’t create them until you feel the pain of not having them. Each subscription is more RBAC to manage, more policies to assign, more cost to track.

Networking

No Hub VNet

The full ALZ deploys a hub-spoke topology:

Hub VNet (Connectivity subscription)
├── Azure Firewall
├── VPN Gateway / ExpressRoute Gateway
├── Azure Bastion
└── Peered to all spoke VNets

Spoke VNets (Landing Zone subscriptions)
├── Peered to hub
└── All egress routes through hub firewall

This costs ~$1,500/month minimum (Azure Firewall alone is $900+) and adds operational complexity you don’t need.

What We Deploy Instead

Self-contained VNets per subscription with no peering:

vnet-<co>-prod (10.0.0.0/16)       vnet-<co>-nonprod (10.1.0.0/16)
├── snet-aks      /20               ├── snet-aks      /20
├── snet-app      /22               ├── snet-app      /22
├── snet-data     /22               ├── snet-data     /22
└── snet-shared   /24               └── snet-shared   /24

Each VNet is an island. Subnets are sized for growth:

  • /20 for AKS (4k IPs — enough for most clusters using Azure CNI with sane max-pods and surge settings, and plenty for Cilium overlay)
  • /22 for App Service / Container Apps VNet integration
  • /22 for Private Endpoints (databases, storage, caches)
  • /24 for shared services (CI/CD agents, jump boxes)
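The sizing above follows directly from prefix length — Azure reserves five addresses in every subnet (network, gateway, two DNS, broadcast). A quick shell check:

```shell
# Usable IPs in an Azure subnet: 2^(32 - prefix) minus the 5 reserved addresses
usable() { echo $(( (1 << (32 - $1)) - 5 )); }

usable 20   # snet-aks:    4091
usable 22   # snet-app:    1019
usable 24   # snet-shared: 251
```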

When You Need a Hub

Add a hub VNet when any of these apply:

  • VPN/ExpressRoute — You need hybrid connectivity to on-prem or another cloud
  • Centralized egress filtering — Compliance requires all outbound traffic to go through a firewall
  • DNS resolution at scale — You have 5+ Private DNS Zones that need to be shared across VNets
  • Cross-subscription communication — Workloads in prod need to talk to shared services in another subscription

Identity

Entra ID Essentials

  • Security Defaults — Enable. Free MFA for everyone, blocks legacy auth.
  • Break-glass account — Create one: a cloud-only Global Admin with a hardware key, for emergencies.
  • Named admin groups — Create sg-azure-admins. Never assign roles to individuals, always to groups.
  • Workload Identity Federation — Use for CI/CD. No secrets to rotate, OIDC-based, supported by GitHub Actions and Azure DevOps.
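Wiring up Workload Identity Federation for GitHub Actions is one command against your CI/CD app registration. A sketch — the app ID, org, repo, and credential name are placeholders:

```shell
# Trust GitHub's OIDC issuer for deploys from the main branch (IDs are placeholders)
az ad app federated-credential create \
  --id "00000000-0000-0000-0000-000000000000" \
  --parameters '{
    "name": "github-actions-main",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:yourorg/yourrepo:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"]
  }'
```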

What About Entra ID P1/P2?

  • P1 (included with M365 Business Premium): Adds Conditional Access, self-service password reset. Worth it when you have 15+ users.
  • P2: Adds PIM (just-in-time access), Access Reviews, Identity Protection. Worth it at Series B or when compliance demands it.

Don’t buy P2 to check a box. Buy it when you have an actual operational need for just-in-time admin access.

Policy Baseline

We assign a minimal set of policies at the subscription level:

  • Microsoft Cloud Security Benchmark (Audit) — Security recommendations without blocking deployments
  • Require tag: environment on resource groups (Deny) — Cost tracking and resource lifecycle management
  • Require tag: team on resource groups (Deny) — Ownership tracking and cost allocation
  • Inherit tag: environment from resource group (Modify) — Auto-propagate the environment tag to child resources
  • Inherit tag: team from resource group (Modify) — Auto-propagate the team tag to child resources
  • Allowed locations (Deny) — Prevent accidental resource deployments to wrong regions
  • Allowed locations for resource groups (Deny) — Prevent resource group creation in wrong regions
  • Deploy diagnostic settings for Activity Log (DeployIfNotExists) — Ensure all control-plane actions are logged
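An assignment at subscription scope is one CLI call. A sketch for the built-in Allowed locations policy — the definition GUID and region list are examples, so verify the ID with `az policy definition list` before using it:

```shell
# Assign the built-in "Allowed locations" policy at subscription scope
# (definition ID and region list are examples — confirm with `az policy definition list`)
az policy assignment create \
  --name "allowed-locations" \
  --scope "/subscriptions/$SUB_ID" \
  --policy "e56962a6-4747-49cd-b67b-bf8b01975c4c" \
  --params '{ "listOfAllowedLocations": { "value": ["eastus2"] } }'
```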

Why Audit Mode for Security Benchmark?

Because Deny mode on security policies will block legitimate deployments and create friction that drives engineers to find workarounds. Start with Audit to understand your posture, then selectively move specific policies to Deny as your team matures.

Policies We Intentionally Skip

  • CIS Benchmark — Overlaps heavily with MCSB, adds noise
  • NIST / ISO / PCI initiatives — Add when compliance requires it
  • Custom policies — Write them when built-in ones don’t cover a specific need

Rollback and Recovery

Terraform State Rollback

If a terraform apply goes wrong:

  1. Immediate: If the apply is still running, press Ctrl+C once — Terraform finishes in-flight operations and leaves state consistent with what was applied so far (a second Ctrl+C force-kills the process and can corrupt state)
  2. Revert code, re-apply: The safest approach — revert your .tf files to the previous commit and run terraform apply again. Terraform will converge to the desired state.
  3. State surgery (last resort): Use terraform state rm to remove a problematic resource from state, then re-import or recreate it. Never edit the state file directly.
# Revert to previous commit and re-apply
git checkout HEAD~1 -- infra/terraform/
terraform -chdir=infra/terraform plan    # Review what will change
terraform -chdir=infra/terraform apply   # Apply the rollback
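If option 3 is unavoidable, the state surgery looks like this — the resource address and resource group name are examples for an azurerm resource:

```shell
# Drop the broken resource from state, then re-adopt the real one (last resort)
terraform -chdir=infra/terraform state rm azurerm_resource_group.main
terraform -chdir=infra/terraform import azurerm_resource_group.main \
  "/subscriptions/$SUB_ID/resourceGroups/rg-yourcompany-prod"
```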

Bicep Deployment Reversal

Bicep deployments are incremental by default — they add/update but don’t delete. To roll back:

  1. Revert code, re-deploy: Check out the previous version and deploy again
  2. Manual cleanup: If new resources were created, delete them via CLI or portal
# Revert to previous deployment
git checkout HEAD~1 -- infra/bicep/
az deployment sub create \
  --location eastus2 \
  --template-file infra/bicep/main.bicep \
  --parameters infra/bicep/parameters/prod.bicepparam

Emergency Policy Override

If a Deny policy is blocking a critical deployment:

  1. Temporary exemption: Create a policy exemption (preferred — auditable and time-limited)
    az policy exemption create \
      --name "emergency-deploy-$(date +%Y%m%d)" \
      --policy-assignment "/subscriptions/$SUB_ID/providers/Microsoft.Authorization/policyAssignments/allowed-locations" \
      --exemption-category Waiver \
      --expires-on "$(date -u -d '+24 hours' '+%Y-%m-%dT%H:%M:%SZ')" \
      --description "Emergency deployment — expires in 24h"
    
  2. Switch to DoNotEnforce: Change the policy assignment’s enforcement mode from Default to DoNotEnforce — compliance is still evaluated and reported, but Deny stops blocking, which effectively converts it to Audit temporarily
  3. Never delete the policy assignment — you’ll lose compliance history and audit trail
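Option 2 is also a single command each way — a sketch assuming the assignment is named allowed-locations and your Azure CLI version supports `az policy assignment update`:

```shell
# Stop enforcing (Deny becomes report-only); compliance is still evaluated
az policy assignment update --name "allowed-locations" --enforcement-mode DoNotEnforce

# ...run the critical deployment...

# Restore enforcement as soon as the emergency is over
az policy assignment update --name "allowed-locations" --enforcement-mode Default
```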

Multi-Region and Disaster Recovery

This landing zone deploys to a single region. That’s intentional — multi-region adds significant complexity that most startups don’t need on day one.

When to Go Multi-Region

  • SLA requirements: You need 99.99%+ uptime (single-region gives you ~99.95% for most services)
  • Latency: Your users are distributed across continents and need <100ms response times
  • Compliance: Data residency requirements mandate replicas in specific regions
  • Business continuity: You can’t tolerate a full region outage (rare but possible)

How to Prepare Without Over-Engineering

Even in a single region, you can prepare for a future multi-region expansion:

  1. Use paired regions: Deploy to a region with a good pair (e.g., East US 2 + Central US). Azure prioritizes recovery for paired regions.
  2. Avoid region-locked resources: Don’t hardcode region names. This landing zone uses a location parameter for a reason.
  3. Use geo-redundant storage (GRS): For critical data, switch from LRS to GRS. Your data is asynchronously replicated to the paired region at no extra compute cost (storage cost ~2x).
  4. Database backups: Azure SQL and Cosmos DB support geo-restore by default. Verify your backup retention meets your RPO.
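Switching an existing storage account from LRS to GRS (item 3) is a one-line change — the account and resource group names here are placeholders:

```shell
# Upgrade an existing storage account from LRS to geo-redundant storage
az storage account update \
  --name styourcompanyprod \
  --resource-group rg-yourcompany-prod \
  --sku Standard_GRS
```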

Multi-Region Architecture (When You’re Ready)

Azure Front Door (global load balancer + WAF)
├── Region 1 (primary)
│   ├── Container Apps / AKS
│   ├── Azure SQL (primary replica)
│   └── Redis
└── Region 2 (secondary)
    ├── Container Apps / AKS
    ├── Azure SQL (geo-replica, read-only)
    └── Redis

Key additions when you graduate to multi-region:

  • Azure Front Door for global load balancing and automatic failover
  • Azure SQL geo-replication or Cosmos DB multi-region writes
  • Azure Traffic Manager or Front Door health probes for failover detection
  • Separate Terraform workspaces or Bicep parameter files per region

See Also