Architecture Decisions
Why this layout, what we skipped, and when to revisit
Why This Layout
This landing zone makes deliberate trade-offs: simplicity over completeness, speed over perfection. Every decision below is reversible. None of them will paint you into a corner.
Management Groups
What We Deploy
```
Tenant Root Group
└── mg-<yourcompany>
    ├── sub-<yourcompany>-prod
    └── sub-<yourcompany>-nonprod
```
Why a Single Management Group
Azure Landing Zone (ALZ) uses a deep hierarchy:
```
Tenant Root Group
└── mg-company
    ├── mg-platform
    │   ├── mg-management
    │   ├── mg-connectivity
    │   └── mg-identity
    ├── mg-landing-zones
    │   ├── mg-corp
    │   └── mg-online
    ├── mg-sandbox
    └── mg-decommissioned
```
This exists because enterprises have hundreds of subscriptions owned by different teams with different compliance requirements. You don’t. You have 2-5 subscriptions and one team making all the decisions.
A single management group gives you:
- One place to apply policies that cover everything
- Zero hierarchy to maintain or explain to new hires
- Easy migration later — moving subscriptions between management groups is a 10-second operation
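That 10-second migration can be sketched with the az CLI. A minimal example, assuming a hypothetical target management group `mg-landing-zones` and the subscription names used above:

```shell
# Create a new management group (only needed once).
az account management-group create \
  --name mg-landing-zones \
  --display-name "Landing Zones"

# Moving a subscription is a single command; no resources inside it are touched.
az account management-group subscription add \
  --name mg-landing-zones \
  --subscription "sub-yourcompany-prod"
```

Policy and RBAC assignments inherited from the new parent take effect shortly after the move.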
When to Add More
Add a second management group level when:
- You have 5+ subscriptions and need different policies for different teams
- Compliance requirements differ between workloads (e.g., PCI vs non-PCI)
- You hire a dedicated platform team that needs its own governance scope
Subscription Topology
Two Subscriptions: Prod and Non-Prod
| Subscription | Contains | RBAC |
|---|---|---|
| yourcompany-prod | Production workloads, production databases, customer-facing services | Admins: Owner, Developers: Reader, CI/CD: Contributor |
| yourcompany-nonprod | Dev, staging, QA, CI/CD agents, experiments | Admins: Owner, Developers: Contributor |
Why Not One Subscription?
The subscription is Azure’s strongest isolation boundary. Separating prod from non-prod gives you:
- Cost isolation for free — No tagging gymnastics to figure out dev vs prod spend
- RBAC without custom roles — Developers get Contributor on non-prod, Reader on prod
- Blast radius containment — `az group delete` in dev can't touch prod
- Quota isolation — Non-prod experiments won't consume prod resource quotas
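The RBAC split above is two role assignments at subscription scope. A sketch, assuming a hypothetical developer group `sg-azure-developers` and subscription ID variables you'd set yourself:

```shell
# Resolve the group's object ID (group name is hypothetical).
DEV_GROUP_ID=$(az ad group show --group sg-azure-developers --query id -o tsv)

# Developers: Contributor on non-prod.
az role assignment create \
  --assignee-object-id "$DEV_GROUP_ID" \
  --assignee-principal-type Group \
  --role Contributor \
  --scope "/subscriptions/$NONPROD_SUB_ID"

# Developers: Reader on prod.
az role assignment create \
  --assignee-object-id "$DEV_GROUP_ID" \
  --assignee-principal-type Group \
  --role Reader \
  --scope "/subscriptions/$PROD_SUB_ID"
```

Assigning to groups rather than individuals keeps this a one-time setup per subscription.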
⚠️ One Workload Per Subscription
This layout assumes one primary workload per subscription. If you’re deploying a second independent workload (separate team, separate lifecycle, separate cost center), don’t put it in the same subscription — create a new one.
Why this matters:
- Resource groups are not isolation boundaries. They don’t provide separate RBAC inheritance, cost tracking, or policy scopes the way subscriptions do.
- Habits form early. Once your team starts treating resource groups as workload boundaries, it becomes an embedded practice that’s expensive to refactor later.
- Subscriptions are free. Adding a subscription costs nothing — the overhead is only in CI/CD and RBAC setup, which takes an hour.
If you find yourself adding a second workload, that’s a graduation signal.
Why Not Three+ Subscriptions?
You can. Common third subscriptions:
- Sandbox — For unrestricted experimentation (no policies, auto-delete after 30 days)
- Data — If you have a dedicated data platform (Databricks, Synapse, data lakes)
- Shared Services — If you have cross-cutting services (container registry, key vault)
But don’t create them until you feel the pain of not having them. Each subscription is more RBAC to manage, more policies to assign, more cost to track.
Networking
No Hub VNet
The full ALZ deploys a hub-spoke topology:
```
Hub VNet (Connectivity subscription)
├── Azure Firewall
├── VPN Gateway / ExpressRoute Gateway
├── Azure Bastion
└── Peered to all spoke VNets

Spoke VNets (Landing Zone subscriptions)
├── Peered to hub
└── All egress routes through hub firewall
```
This costs ~$1,500/month minimum (Azure Firewall alone is $900+) and adds operational complexity you don’t need.
What We Deploy Instead
Self-contained VNets per subscription with no peering:
```
vnet-<co>-prod (10.0.0.0/16)        vnet-<co>-nonprod (10.1.0.0/16)
├── snet-aks    /20                 ├── snet-aks    /20
├── snet-app    /22                 ├── snet-app    /22
├── snet-data   /22                 ├── snet-data   /22
└── snet-shared /24                 └── snet-shared /24
```
Each VNet is an island. Subnets are sized for growth:
- `/20` for AKS (4k IPs — enough for most clusters using Azure CNI with sane max-pods and surge settings, and plenty for Cilium overlay)
- `/22` for App Service / Container Apps VNet integration
- `/22` for Private Endpoints (databases, storage, caches)
- `/24` for shared services (CI/CD agents, jump boxes)
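A quick sanity check on those sizes — Azure reserves 5 IPs in every subnet (network and broadcast addresses plus three Azure-internal ones), so usable capacity is slightly below the power of two:

```shell
# Usable IPs per subnet in the plan above.
# Azure reserves 5 addresses in every subnet.
for entry in "snet-aks:20" "snet-app:22" "snet-data:22" "snet-shared:24"; do
  name="${entry%%:*}"
  prefix="${entry##*:}"
  usable=$(( 2 ** (32 - prefix) - 5 ))
  echo "$name /$prefix -> $usable usable IPs"
done
# snet-aks    /20 -> 4091 usable IPs
# snet-shared /24 -> 251 usable IPs
```

Worth remembering when sizing AKS with Azure CNI, where every pod consumes a subnet IP.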
When You Need a Hub
Add a hub VNet when any of these apply:
- VPN/ExpressRoute — You need hybrid connectivity to on-prem or another cloud
- Centralized egress filtering — Compliance requires all outbound traffic to go through a firewall
- DNS resolution at scale — You have 5+ Private DNS Zones that need to be shared across VNets
- Cross-subscription communication — Workloads in prod need to talk to shared services in another subscription
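If the last case is what pushes you there, the mechanism is VNet peering, which works across subscriptions in the same tenant. A sketch with hypothetical resource group, VNet names, and ID variables:

```shell
# Peering is directional — create both sides. Names and IDs are hypothetical.
az network vnet peering create \
  --name prod-to-hub \
  --resource-group rg-prod-network \
  --vnet-name vnet-co-prod \
  --remote-vnet "$HUB_VNET_ID" \
  --allow-vnet-access

az network vnet peering create \
  --name hub-to-prod \
  --resource-group rg-hub-network \
  --vnet-name vnet-hub \
  --remote-vnet "$PROD_VNET_ID" \
  --allow-vnet-access
```

This is why non-overlapping address spaces (10.0.0.0/16 vs 10.1.0.0/16) matter from day one — peering requires it.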
Identity
Entra ID Essentials
| Component | What to Do | Why |
|---|---|---|
| Security Defaults | Enable | Free MFA for everyone, blocks legacy auth |
| Break-glass account | Create 1 | Cloud-only Global Admin with hardware key, for emergencies |
| Named admin groups | Create `sg-azure-admins` | Never assign roles to individuals, always groups |
| Workload Identity Federation | Use for CI/CD | No secrets to rotate, OIDC-based, supported by GitHub Actions and Azure DevOps |
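Setting up workload identity federation for GitHub Actions is one CLI call against your app registration. A sketch — the app object ID, repo path, and environment name are hypothetical and need to match your setup:

```shell
# Trust GitHub's OIDC issuer for a specific repo environment — no secrets stored.
az ad app federated-credential create \
  --id "$APP_OBJECT_ID" \
  --parameters '{
    "name": "github-actions-prod",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:yourcompany/infra:environment:prod",
    "audiences": ["api://AzureADTokenExchange"]
  }'
```

The `subject` claim must exactly match what your workflow presents (branch, tag, or environment), so scope it as narrowly as your pipeline allows.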
What About Entra ID P1/P2?
- P1 (included with M365 Business Premium): Adds Conditional Access, self-service password reset. Worth it when you have 15+ users.
- P2: Adds PIM (just-in-time access), Access Reviews, Identity Protection. Worth it at Series B or when compliance demands it.
Don’t buy P2 to check a box. Buy it when you have an actual operational need for just-in-time admin access.
Policy Baseline
We assign a minimal set of policies at the subscription level:
| Policy | Mode | Purpose |
|---|---|---|
| Microsoft Cloud Security Benchmark | Audit | Security recommendations without blocking deployments |
| Require tag: `environment` on resource groups | Deny | Cost tracking and resource lifecycle management |
| Require tag: `team` on resource groups | Deny | Ownership tracking and cost allocation |
| Inherit tag: `environment` from resource group | Modify | Auto-propagate environment tag to child resources |
| Inherit tag: `team` from resource group | Modify | Auto-propagate team tag to child resources |
| Allowed locations | Deny | Prevent accidental resource deployments to wrong regions |
| Allowed locations for resource groups | Deny | Prevent resource group creation in wrong regions |
| Deploy diagnostic settings for Activity Log | DeployIfNotExists | Ensure all control plane actions are logged |
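As an example of how one of these lands, here is a sketch of assigning the built-in Allowed locations policy at subscription scope — the region list is a placeholder for your own:

```shell
# Look up the built-in definition by display name rather than hardcoding its GUID.
POLICY_NAME=$(az policy definition list \
  --query "[?displayName=='Allowed locations'].name | [0]" -o tsv)

az policy assignment create \
  --name allowed-locations \
  --policy "$POLICY_NAME" \
  --scope "/subscriptions/$SUB_ID" \
  --params '{ "listOfAllowedLocations": { "value": ["eastus2", "centralus"] } }'
```

The same pattern (look up definition, assign at subscription scope, pass parameters) covers the tag policies in the table as well.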
Why Audit Mode for Security Benchmark?
Because Deny mode on security policies will block legitimate deployments and create friction that drives engineers to find workarounds. Start with Audit to understand your posture, then selectively move specific policies to Deny as your team matures.
Policies We Intentionally Skip
- CIS Benchmark — Overlaps heavily with MCSB, adds noise
- NIST / ISO / PCI initiatives — Add when compliance requires it
- Custom policies — Write them when built-in ones don’t cover a specific need
Rollback and Recovery
Terraform State Rollback
If a terraform apply goes wrong:
- Immediate: If the apply is still running, a single Ctrl+C triggers a graceful stop — Terraform finishes in-flight operations and writes state consistent with what was applied so far. Don't press it twice; a forced exit can leave the state lock held or the state partially written.
- Revert code, re-apply: The safest approach — revert your `.tf` files to the previous commit and run `terraform apply` again. Terraform will converge to the desired state.
- State surgery (last resort): Use `terraform state rm` to remove a problematic resource from state, then re-import or recreate it. Never edit the state file directly.
```shell
# Revert to previous commit and re-apply
git checkout HEAD~1 -- infra/terraform/
terraform -chdir=infra/terraform plan   # Review what will change
terraform -chdir=infra/terraform apply  # Apply the rollback
```
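For the state-surgery path, the remove/re-import pair looks like this. The resource address and Azure resource ID below are hypothetical examples, not part of this landing zone:

```shell
# Drop the problematic resource from state (the real resource is untouched)…
terraform -chdir=infra/terraform state rm 'azurerm_storage_account.logs'

# …then bring it back under management by importing its Azure resource ID.
terraform -chdir=infra/terraform import 'azurerm_storage_account.logs' \
  "/subscriptions/$SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/stlogsprod"
```

Run `terraform plan` afterwards and confirm it shows no unexpected changes before trusting the state again.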
Bicep Deployment Reversal
Bicep deployments are incremental by default — they add/update but don’t delete. To roll back:
- Revert code, re-deploy: Check out the previous version and deploy again
- Manual cleanup: If new resources were created, delete them via CLI or portal
```shell
# Revert to previous deployment
git checkout HEAD~1 -- infra/bicep/
az deployment sub create \
  --location eastus2 \
  --template-file infra/bicep/main.bicep \
  --parameters infra/bicep/parameters/prod.bicepparam
```
Emergency Policy Override
If a Deny policy is blocking a critical deployment:
- Temporary exemption: Create a policy exemption (preferred — auditable and time-limited)

```shell
az policy exemption create \
  --name "emergency-deploy-$(date +%Y%m%d)" \
  --policy-assignment "/subscriptions/$SUB_ID/providers/Microsoft.Authorization/policyAssignments/allowed-locations" \
  --exemption-category Waiver \
  --description "Emergency deployment — expires in 24h"
```

- Switch to Audit: Change the policy assignment's enforcement mode from `Default` to `DoNotEnforce` — this converts Deny to Audit temporarily without touching the assignment itself
- Never delete the policy assignment — you'll lose compliance history and audit trail
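The enforcement-mode flip can be sketched as follows, assuming an assignment named `allowed-locations` and a CLI version that supports `az policy assignment update`:

```shell
# Convert Deny to audit-only; compliance evaluation continues, enforcement stops.
az policy assignment update \
  --name allowed-locations \
  --enforcement-mode DoNotEnforce

# Re-enable enforcement once the emergency is resolved.
az policy assignment update \
  --name allowed-locations \
  --enforcement-mode Default
```

Both operations show up in the Activity Log, so there is an audit trail of who disabled enforcement and when.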
Multi-Region and Disaster Recovery
This landing zone deploys to a single region. That’s intentional — multi-region adds significant complexity that most startups don’t need on day one.
When to Go Multi-Region
- SLA requirements: You need 99.99%+ uptime (single-region gives you ~99.95% for most services)
- Latency: Your users are distributed across continents and need <100ms response times
- Compliance: Data residency requirements mandate replicas in specific regions
- Business continuity: You can’t tolerate a full region outage (rare but possible)
How to Prepare Without Over-Engineering
Even in a single region, you can prepare for a future multi-region expansion:
- Use paired regions: Deploy to a region with a good pair (e.g., East US 2 + Central US). Azure prioritizes recovery for paired regions.
- Avoid region-locked resources: Don't hardcode region names. This landing zone uses a `location` parameter for a reason.
- Use geo-redundant storage (GRS): For critical data, switch from LRS to GRS. Your data is asynchronously replicated to the paired region at no extra compute cost (storage cost ~2x).
- Database backups: Azure SQL and Cosmos DB support geo-restore by default. Verify your backup retention meets your RPO.
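The LRS-to-GRS switch is a one-line SKU change on an existing account. A sketch with hypothetical account and resource group names:

```shell
# Upgrade replication from locally-redundant to geo-redundant.
az storage account update \
  --name styourcompanyprod \
  --resource-group rg-prod-data \
  --sku Standard_GRS
```

Existing data begins replicating asynchronously to the paired region; no downtime is involved, but expect roughly double the storage cost.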
Multi-Region Architecture (When You’re Ready)
```
Azure Front Door (global load balancer + WAF)
├── Region 1 (primary)
│   ├── Container Apps / AKS
│   ├── Azure SQL (primary replica)
│   └── Redis
└── Region 2 (secondary)
    ├── Container Apps / AKS
    ├── Azure SQL (geo-replica, read-only)
    └── Redis
```
Key additions when you graduate to multi-region:
- Azure Front Door for global load balancing and automatic failover
- Azure SQL geo-replication or Cosmos DB multi-region writes
- Azure Traffic Manager or Front Door health probes for failover detection
- Separate Terraform workspaces or Bicep parameter files per region
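The per-region Terraform split from the last bullet can be sketched like this — the `regions/` var files are hypothetical and would hold region-specific values such as `location` and address spaces:

```shell
# One workspace per region, each with its own state.
terraform -chdir=infra/terraform workspace new eastus2   # creates and selects
terraform -chdir=infra/terraform plan  -var-file=regions/eastus2.tfvars
terraform -chdir=infra/terraform apply -var-file=regions/eastus2.tfvars

# Repeat for the secondary region.
terraform -chdir=infra/terraform workspace new centralus
terraform -chdir=infra/terraform apply -var-file=regions/centralus.tfvars
```

Keeping regions in separate states means a bad apply in the secondary region cannot damage the primary.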
See Also
- Graduation Guide — Full migration path to enterprise-scale
- Networking Deep Dive — When to add a hub VNet
- Security Baseline — Defender, RBAC, logging
- Cost Management — Budgets and cost optimization
- Troubleshooting — Common deployment errors