Terraform Pull Request Automation for Beginners
Getting started with Terraform? Learn how pull-request automation adds tests, policy checks and previews so beginners merge infrastructure code safely.
The Reality of Terraform at Scale
Here's what actually happens: Your team starts with local Terraform runs. Everything's fine until someone overwrites production. You add basic CI checks. Then state conflicts emerge. You implement Atlantis. It works great... until it doesn't.
The pattern is predictable because coordination overhead grows faster than headcount: every new team brings more state files, environments, and reviewers that all interact. What works for 10 engineers breaks at 50. What works at 50 becomes a nightmare at 200.
Stage 1: Manual Coordination (1-10 Engineers)
Small teams operate on trust. You've got a shared AWS account, maybe two environments, and everyone knows what everyone else is doing. Sort of.
The Setup
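# backend.tf - The classic S3 backend everyone starts with
terraform {
  backend "s3" {
    bucket = "company-terraform-state"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
    # State locking needs a DynamoDB table; the table name is a placeholder
    dynamodb_table = "terraform-locks"
  }
}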
What Actually Happens
# Developer A at 2:47 PM
$ terraform apply
Acquiring state lock...
# Developer B at 2:48 PM
$ terraform apply
Error: Error acquiring the state lock
Another process is already holding a lock on the state.
You implement basic PR checks:
# .github/workflows/terraform.yml
name: Terraform CI
on:
  pull_request:
    paths:
      - '**.tf'
jobs:
  validate:
    runs-on: ubuntu-latest
    # Job-level env: init needs the credentials for the S3 backend, not just plan
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Init
        run: terraform init
      - name: Terraform Validate
        run: terraform validate
      - name: Terraform Plan
        run: terraform plan -out=tfplan
This works until someone comments "LGTM" on a PR that creates 47 expensive EC2 instances. Time for actual automation.
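The review gap described above can be narrowed by summarizing the plan before anyone approves it. The following is a minimal sketch, assuming the plan has been exported with `terraform show -json tfplan > plan.json`; the function names, the sample data, and the threshold are hypothetical and only illustrate the idea.

```python
import json
from collections import Counter


def load_plan(path: str) -> dict:
    """Load a plan exported with `terraform show -json tfplan > plan.json`."""
    with open(path) as handle:
        return json.load(handle)


def summarize_plan(plan: dict) -> Counter:
    """Count planned actions per (resource type, action), skipping no-ops."""
    counts = Counter()
    for change in plan.get("resource_changes", []):
        for action in change["change"]["actions"]:
            if action != "no-op":
                counts[(change["type"], action)] += 1
    return counts


def exceeds_create_limit(plan: dict, resource_type: str, limit: int) -> bool:
    """Return True when the plan creates more than `limit` resources of a type."""
    return summarize_plan(plan)[(resource_type, "create")] > limit


# Hand-written sample standing in for real plan output (hypothetical data).
sample_plan = {
    "resource_changes": [
        {"type": "aws_instance", "change": {"actions": ["create"]}}
        for _ in range(47)
    ]
    + [{"type": "aws_s3_bucket", "change": {"actions": ["no-op"]}}]
}

MAX_NEW_INSTANCES = 5  # hypothetical review threshold

if __name__ == "__main__":
    for (resource_type, action), count in sorted(summarize_plan(sample_plan).items()):
        print(f"{action:>8} {count:>3} x {resource_type}")
    if exceeds_create_limit(sample_plan, "aws_instance", MAX_NEW_INSTANCES):
        raise SystemExit("Plan creates too many aws_instance resources; needs a closer look.")
```

Run as a CI step after `terraform plan`, a non-zero exit would fail the check and force a reviewer to look at the numbers rather than skim the diff.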
Stage 2: Basic Automation with Atlantis (10-50 Engineers)
Atlantis changes the game. No more local applies. Everything happens in pull requests. It feels like magic.
Setting Up Atlantis
# atlantis.yaml
version: 3
projects:
  - name: production
    dir: environments/prod
    terraform_version: v1.5.0
    autoplan:
      when_modified: ["*.tf", "*.tfvars"]
      enabled: true
    apply_requirements: ["approved", "mergeable"]
  - name: staging
    dir: environments/staging
    terraform_version: v1.5.0
    autoplan:
      when_modified: ["*.tf", "*.tfvars"]
      enabled: true
The Workflow That Actually Works (For Now)
# Developer creates PR
# Atlantis automatically comments:
# Ran Plan for 2 projects:
#
# 1. project: `production` dir: `environments/prod` workspace: `default`
# 2. project: `staging` dir: `environments/staging` workspace: `default`
#
# ### 1. project: `production`
# ```
# Terraform will perform the following actions:
#
# + aws_instance.app_server
# ami: "ami-0c55b159cbfafe1f0"
# instance_type: "t3.medium"
# ```
Teams love it. PR reviews include infrastructure changes. The audit trail exists. But then growth happens.
Stage 3: Governance Requirements Emerge (50-200 Engineers)
At 50 engineers, you've got multiple teams. The platform team wants governance. The security team wants compliance. The finance team wants to know why the AWS bill doubled.
Policy as Code Becomes Non-Negotiable
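# policies/cost_control.rego
package terraform.policies.cost_control

import future.keywords.contains
import future.keywords.if

# conftest passes the plan JSON (from `terraform show -json`) as `input`
tfplan := input

# Placeholder monthly prices in USD (illustrative, not real AWS pricing)
monthly_price := {"t3.medium": 30, "m5.xlarge": 140, "m5.24xlarge": 3400}

# Undefined for anything that is not a priced aws_instance, so those are skipped
instance_cost(r) := monthly_price[r.change.after.instance_type] if {
    r.type == "aws_instance"
}

deny contains msg if {
    r := tfplan.resource_changes[_]
    r.type == "aws_instance"
    r.change.after.instance_type == "m5.24xlarge"
    msg := sprintf(
        "Instance type %s requires VP approval. Use m5.xlarge or smaller.",
        [r.change.after.instance_type],
    )
}

deny contains msg if {
    cost := sum([c |
        r := tfplan.resource_changes[_]
        c := instance_cost(r)
    ])
    cost > 1000
    msg := sprintf(
        "Total monthly cost increase $%d exceeds $1000 limit",
        [cost],
    )
}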
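To make the logic of the two deny rules easier to follow, here is a rough Python mirror of them. It is a sketch for reasoning about the policy, not a replacement for the Rego: the price table holds placeholder values rather than real AWS pricing, and the function and constant names are hypothetical.

```python
# Placeholder monthly prices in USD; hypothetical values, not real AWS pricing.
MONTHLY_PRICE_USD = {"t3.medium": 30, "m5.xlarge": 140, "m5.24xlarge": 3400}
BLOCKED_TYPES = {"m5.24xlarge"}
COST_LIMIT_USD = 1000


def deny(plan: dict) -> list[str]:
    """Return one message per violated rule for a `terraform show -json` plan."""
    messages = []
    total = 0
    for change in plan.get("resource_changes", []):
        if change.get("type") != "aws_instance":
            continue
        # `after` is null for resources that are being deleted.
        after = change.get("change", {}).get("after") or {}
        instance_type = after.get("instance_type")
        if instance_type in BLOCKED_TYPES:
            messages.append(
                f"Instance type {instance_type} requires VP approval. "
                "Use m5.xlarge or smaller."
            )
        # Unknown instance types contribute nothing, like an undefined Rego function.
        total += MONTHLY_PRICE_USD.get(instance_type, 0)
    if total > COST_LIMIT_USD:
        messages.append(
            f"Total monthly cost increase ${total} exceeds ${COST_LIMIT_USD} limit"
        )
    return messages


def instance_change(instance_type: str) -> dict:
    """Build a minimal resource_changes entry for examples and tests."""
    return {"type": "aws_instance", "change": {"after": {"instance_type": instance_type}}}
```

For example, `deny({"resource_changes": [instance_change("m5.xlarge")] * 8})` returns a single cost message, because eight instances at the placeholder price add up to 1120. Like the Rego version, this counts every instance in the plan, including ones that are only being updated, so it overstates the increase for plans that modify existing instances.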
The Atlantis Integration That Starts to Crack
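# atlantis.yaml with custom workflows
workflows:
  policy-check:
    plan:
      steps:
        - init
        - plan
        - run: |
            # This gets messy fast: conftest needs the plan as JSON,
            # and `conftest test` (not `verify`) is what evaluates it
            terraform show -json $PLANFILE > plan.json
            conftest test --policy policies/ --all-namespaces plan.json
    apply:
      steps:
        - run: |
            # Hope the policies still pass?
            terraform show -json $PLANFILE > plan.json
            conftest test --policy policies/ --all-namespaces plan.json
        - apply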
You start hitting walls. Atlantis locks each project to one pull request at a time, and a single server runs every plan and apply, so busy repositories back up. Your 15-minute deploys become 2-hour queues. The single Atlantis server becomes a single point of failure.
Stage 4: Enterprise Scale Operations (200+ Engineers)
Large organizations need more than automation—they need a platform. Multiple cloud accounts. Regulatory compliance. Self-service for developers. Governance for security.
What Enterprise Teams Actually Need
# modules/governed-vpc/main.tf
# This module enforces company standards
variable "environment" {
  type = string
  validation {
    condition     = contains(["prod", "staging", "dev"], var.environment)
    error_message = "Environment must be prod, staging, or dev."
  }
}
variable "cost_center" {
  type = string
  validation {
    condition     = can(regex("^CC-[0-9]{4}$", var.cost_center))
    error_message = "Cost center must match pattern CC-XXXX."
  }
}
locals {
  required_tags = {
    Environment = var.environment
    CostCenter  = var.cost_center
    ManagedBy   = "Terraform"
    # This doesn't work in Atlantis without custom tooling
    Team = data.scalr_identity.current.email
  }
}
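The same two validation rules are easy to reuse outside Terraform, for example in a pre-commit hook or a service catalog form. The sketch below mirrors them in Python under that assumption; `validate_inputs` and `required_tags` are hypothetical names, and the team value is passed in explicitly because the identity lookup depends on the platform.

```python
import re

ALLOWED_ENVIRONMENTS = ("prod", "staging", "dev")
COST_CENTER_PATTERN = re.compile(r"CC-[0-9]{4}")


def validate_inputs(environment: str, cost_center: str) -> list[str]:
    """Return the validation errors the module would raise, or an empty list."""
    errors = []
    if environment not in ALLOWED_ENVIRONMENTS:
        errors.append("Environment must be prod, staging, or dev.")
    # fullmatch anchors both ends, like ^...$ in the Terraform regex.
    if not COST_CENTER_PATTERN.fullmatch(cost_center):
        errors.append("Cost center must match pattern CC-XXXX.")
    return errors


def required_tags(environment: str, cost_center: str, team: str) -> dict[str, str]:
    """Build the tag map, refusing to produce tags from invalid inputs."""
    errors = validate_inputs(environment, cost_center)
    if errors:
        raise ValueError("; ".join(errors))
    return {
        "Environment": environment,
        "CostCenter": cost_center,
        "ManagedBy": "Terraform",
        "Team": team,
    }
```

Calling `required_tags("prod", "CC-1234", "platform@example.com")` returns the four tags, while `validate_inputs("qa", "CC-12")` reports both problems at once instead of stopping at the first.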
Self-Service That Actually Scales
# scalr-module-registry.yaml
# Teams consume approved modules without knowing the details
modules:
  - name: eks-cluster
    source: terraform-aws-modules/eks/aws
    version_constraint: "~> 19.0"
    variable_overrides:
      cluster_endpoint_public_access: false # Security requirement
      enable_irsa: true # Always enabled
    policy_sets:
      - eks-security-baseline
      - cost-management
      - tagging-standards
When Atlantis Hits the Wall
Let me be direct about when Atlantis stops being the answer. It's not about Atlantis being bad—it revolutionized Terraform workflows. But architecture decisions made for simplicity become limitations at scale.
The Performance Cliff
# What actually happens in your deployment queue
# (Not actual Atlantis code, but the effect)
queue = [
    {"team": "platform", "duration": 15, "started": "14:00"},
    {"team": "backend", "duration": 20, "waiting": True},
    {"team": "frontend", "duration": 5, "waiting": True},
    {"team": "data", "duration": 45, "waiting": True},
    {"team": "platform", "duration": 10, "waiting": True},  # Yes, same team waiting on itself
]
# Total time: 95 minutes for work that could parallelize to 45
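The 95-versus-45 figure can be reproduced with a few lines of arithmetic. This sketch assumes a simplified model in which runs from different teams can proceed in parallel while runs from the same team still wait on each other; the function names are hypothetical.

```python
from collections import defaultdict

runs = [
    {"team": "platform", "duration": 15},
    {"team": "backend", "duration": 20},
    {"team": "frontend", "duration": 5},
    {"team": "data", "duration": 45},
    {"team": "platform", "duration": 10},
]


def serial_minutes(runs: list[dict]) -> int:
    """One global queue: every run waits for all runs ahead of it."""
    return sum(run["duration"] for run in runs)


def parallel_minutes(runs: list[dict]) -> int:
    """Teams run side by side; runs within a team stay sequential."""
    per_team = defaultdict(int)
    for run in runs:
        per_team[run["team"]] += run["duration"]
    return max(per_team.values(), default=0)


if __name__ == "__main__":
    print(f"serial:   {serial_minutes(runs)} minutes")
    print(f"parallel: {parallel_minutes(runs)} minutes")
```

Under this model the wall-clock time is set by the slowest team (the 45-minute data run), not by the sum of everyone's work.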
The Hidden Costs
Cost Factor | Atlantis | Enterprise Platform (e.g., Scalr) |
---|---|---|
Licensing | $0 | $500-2000/month |
Engineering (maintenance) | 40-60 hours/month | 0 hours |
Engineering (features) | 80+ hours for RBAC/policies | Built-in |
Downtime risk | High (single server) | SLA guaranteed |
Compliance features | DIY everything | SOC2, HIPAA ready |
True monthly cost | ~$10,000 | $500-2000 |
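The "true monthly cost" row is mostly engineering time converted to money. As a rough sketch of that arithmetic: the hourly rate below is a hypothetical assumption (the article does not state one), the hosting figure is the ~$500 from the summary table, and the hours are the maintenance range from the table above.

```python
HOURLY_RATE_USD = 150          # hypothetical loaded engineering rate, not from the article
HOSTING_USD_PER_MONTH = 500    # "~$500 (hosting)" from the summary table
MAINTENANCE_HOURS = (40, 60)   # maintenance range from the table above


def monthly_cost(hours: int, rate: int = HOURLY_RATE_USD,
                 hosting: int = HOSTING_USD_PER_MONTH) -> int:
    """Engineering time plus hosting, in USD per month."""
    return hours * rate + hosting


low, high = (monthly_cost(hours) for hours in MAINTENANCE_HOURS)

if __name__ == "__main__":
    print(f"${low:,} to ${high:,} per month")
```

With these assumptions the estimate comes to $6,500 to $9,500 per month, the same order of magnitude as the table's figure. A different hourly rate shifts the result proportionally, so it is worth plugging in your own numbers.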
Comparing Enterprise Solutions
When you outgrow Atlantis, the market offers several paths. Each has its sweet spot.
Terraform Cloud (now HCP Terraform)
- Pros: Native HashiCorp integration, strong brand recognition
- Cons: Unpredictable pricing (billed per resource), recent BSL licensing concerns
- Best for: Teams committed to the HashiCorp ecosystem despite the IBM acquisition
Spacelift
- Pros: Multi-IaC support, powerful policy engine
- Cons: Complexity can overwhelm smaller teams, premium pricing for features you might not need
- Best for: Organizations using multiple IaC tools who need maximum flexibility
env0
- Pros: Great UX, strong cost management features, responsive support
- Cons: Newer platform, some enterprise features still maturing
- Best for: Teams prioritizing developer experience and cost visibility
Scalr
- Pros: Purpose-built for enterprise Terraform, managed service model, hierarchical organizations
- Cons: Terraform/OpenTofu focus (not multi-IaC), requires commitment to structured workflows
- Best for: Enterprises wanting Terraform done right without operational overhead
Making the Migration Decision
Here's the framework that actually works:
Immediate Migration Triggers
migration_required:
  - deployment_delays > 30 minutes
  - availability_requirements > 99%
  - compliance_audit == "failed"
  - on_call_incidents.tool_related > 2/month
Strategic Migration Indicators
consider_migration:
  - terraform_developers > 5
  - environments > 3
  - teams.count > 2
  - monthly_maintenance_hours > 20
  - custom_rbac_needed == true
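The two checklists above are written as pseudo-configuration, but they translate directly into something a team could run against its own numbers. Here is a minimal sketch; the `TeamMetrics` fields and both function names are hypothetical, and the thresholds are copied from the lists.

```python
from dataclasses import dataclass


@dataclass
class TeamMetrics:
    deployment_delay_minutes: float
    availability_requirement_pct: float
    compliance_audit_failed: bool
    tool_incidents_per_month: int
    terraform_developers: int
    environments: int
    teams: int
    monthly_maintenance_hours: float
    custom_rbac_needed: bool


def migration_required(m: TeamMetrics) -> list[str]:
    """Immediate triggers: any single hit means migrate now."""
    checks = [
        ("deployment_delays > 30 minutes", m.deployment_delay_minutes > 30),
        ("availability_requirements > 99%", m.availability_requirement_pct > 99),
        ('compliance_audit == "failed"', m.compliance_audit_failed),
        ("on_call_incidents.tool_related > 2/month", m.tool_incidents_per_month > 2),
    ]
    return [label for label, hit in checks if hit]


def consider_migration(m: TeamMetrics) -> list[str]:
    """Strategic indicators: the more that fire, the stronger the case."""
    checks = [
        ("terraform_developers > 5", m.terraform_developers > 5),
        ("environments > 3", m.environments > 3),
        ("teams.count > 2", m.teams > 2),
        ("monthly_maintenance_hours > 20", m.monthly_maintenance_hours > 20),
        ("custom_rbac_needed == true", m.custom_rbac_needed),
    ]
    return [label for label, hit in checks if hit]
```

Returning the labels that fired, rather than a bare boolean, keeps the output usable as evidence in a migration proposal.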
A Real Migration Timeline
gantt
    title Atlantis to Scalr Migration
    dateFormat YYYY-MM-DD
    section Assessment
    Current state audit :done, 2024-01-01, 7d
    Requirements gathering :done, 2024-01-08, 7d
    section Pilot
    Dev environment migration :active, 2024-01-15, 14d
    Policy configuration :active, 2024-01-22, 7d
    Team training :2024-01-29, 7d
    section Production
    Staging migration :2024-02-05, 14d
    Production migration :2024-02-19, 14d
    Atlantis decommission :2024-03-05, 7d
Summary: Right Tool, Right Time
The evolution from manual Terraform to enterprise platforms isn't about good tools versus bad tools. It's about matching capabilities to requirements.
Stage | Team Size | Right Tool | Monthly Cost | Key Trigger for Next Stage |
---|---|---|---|---|
Manual | 1-10 | GitHub Actions + S3 | ~$0 | State conflicts, deployment inconsistency |
Basic Automation | 10-50 | Atlantis | ~$500 (hosting) | Queueing delays, governance needs |
Governance Required | 50-200 | Scalr/env0 | $500-1500 | Compliance, multi-cloud, enterprise features |
Enterprise Scale | 200+ | Scalr/Spacelift | $1500+ | Complex hierarchies, self-service platform |
The pattern is clear: start simple, adopt Atlantis when coordination becomes painful, then migrate to an enterprise platform when governance and scale demand it.
For most organizations hitting the 50+ engineer mark, Scalr represents the sweet spot—enterprise capabilities without enterprise complexity. It's purpose-built for Terraform, eliminates operational overhead, and provides the governance features that become non-negotiable as you grow.
The question isn't whether you'll need enterprise Terraform management. It's whether you'll recognize the need before it becomes a crisis.