Understanding & Detecting Infrastructure Drift - Part 1
Understand why infrastructure drift happens, how to spot it in your IaC early, and prevent security, compliance & cost surprises before they land.
Infrastructure as Code (IaC) has undeniably transformed IT operations, bringing automation, consistency, and speed to provisioning and managing complex environments. Tools like Terraform and its open-source fork, OpenTofu, empower teams to define their desired infrastructure state declaratively. However, even with the best IaC practices, a silent saboteur lurks: infrastructure drift.
This is the first post in a three-part series where we'll explore the challenge of infrastructure drift, how to detect it, and the landscape of tools available to manage it, with a particular look at how platforms like Scalr are addressing this crucial issue.
What Exactly is Infrastructure Drift?
Infrastructure drift, or configuration drift, occurs when the actual, live state of your deployed infrastructure diverges from the intended state defined in your IaC configuration files (e.g., your .tf files) and ideally reflected in your state file (e.g., terraform.tfstate). Essentially, your code no longer accurately represents what's running in your cloud environment. This loss of declarative control can lead to a cascade of problems.
Why Does Drift Happen? Common Culprits:
Drift isn't usually malicious; it often creeps in through everyday operational realities:
- Manual Interventions ("ClickOps"): The most common cause. An engineer makes a quick change directly via the cloud provider's console (AWS, Azure, GCP) to fix an urgent issue or test something, bypassing the IaC workflow.
- Overlapping Automation: Multiple tools managing the same resources without proper coordination can lead to conflicting changes. Imagine Terraform provisioning a server and Ansible later altering its network configuration independently.
- Ad-hoc Scripts: Operations teams or developers might run scripts to modify resources, again outside the purview of the primary IaC tool.
- Emergency Hotfixes: Critical incidents sometimes necessitate immediate manual changes to restore service. If these aren't backported to the IaC code, they become persistent drift.
- Lack of IaC Adherence: Team members unfamiliar with IaC principles might make direct changes, underestimating the impact.
- Dynamic Cloud Services: Auto-scaling groups, managed databases performing automated maintenance, or other cloud provider-initiated changes can alter resource configurations dynamically, causing them to differ from the last IaC-applied state.
The High Stakes of Unchecked Drift:
Ignoring drift is not an option. The risks are significant and can have direct business impacts:
- Security Gaps: Drift can undo carefully configured security settings, like altering a firewall rule or an S3 bucket policy, inadvertently opening vulnerabilities.
- Compliance Nightmares: Unauthorized changes can lead to non-compliance with regulations like PCI DSS or HIPAA, resulting in failed audits, fines, and reputational damage.
- Budget Blowouts: Unmanaged resources or unintended scaling can lead to surprise cost increases and operational overhead in tracking "ghost" infrastructure.
- Stability and Reliability Woes: When your code isn't the source of truth, troubleshooting becomes a guessing game, leading to unpredictable behavior, application errors, and downtime.
- Reduced Agility: If teams can't trust their IaC to reflect reality, they become hesitant to deploy changes, slowing down innovation and increasing friction.
Drift is an almost inevitable byproduct of dynamic cloud environments. Addressing it requires more than just tools; it needs a strategy encompassing detection, management processes, and a culture of IaC discipline.
The First Responders: Native Drift Detection in Terraform & OpenTofu
Terraform and its open-source alternative, OpenTofu, provide foundational commands to help you spot discrepancies.
Core Commands: plan and refresh
terraform plan (or tofu plan): The plan command is your primary tool for previewing changes. It reads your configuration, refers to the state file (and typically refreshes it against the live environment), and then compares this actual state with your desired state (your code). If the output of terraform plan shows proposed creations, updates, or destructions when you haven't intentionally changed your code, that's a clear indication of drift.
# For Terraform
terraform plan
# For OpenTofu
tofu plan
Both Terraform and OpenTofu automatically perform a refresh as part of plan and apply operations by default, ensuring decisions are based on current reality.
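One way to turn this check into a scriptable signal is the -detailed-exitcode flag, which makes plan exit with 0 when nothing would change, 1 on error, and 2 when changes are pending. The sketch below is a minimal wrapper under stated assumptions, not a definitive implementation: the function name check_drift and the TF_BIN variable are made up for illustration, the working directory is assumed to be initialized already, and the code on disk is assumed to match what was last applied (otherwise pending changes may simply be unapplied code rather than drift).

```shell
# Hypothetical helper: TF_BIN is an assumed variable name.
# Set TF_BIN=tofu to run the same check through OpenTofu.
TF_BIN="${TF_BIN:-terraform}"

check_drift() {
  local rc=0
  # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
  # The plan text goes to stderr so stdout carries only the verdict.
  "$TF_BIN" plan -detailed-exitcode -input=false -no-color >&2 || rc=$?
  case "$rc" in
    0) echo "in-sync" ;;
    2) echo "changes-detected" ;;
    *) echo "error"; return 1 ;;
  esac
}
```

Calling check_drift from a scheduled job prints a one-word verdict that is easy to route to an alert, while the full plan text stays available on stderr for whoever investigates.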
terraform refresh (or tofu refresh): This command updates your local state file to match the actual state of resources in your cloud environment. It queries the cloud provider APIs and, if differences are found, modifies the terraform.tfstate file. Crucially, refresh only changes the state file, not your infrastructure or your .tf code files. If drift is detected and the state is refreshed, your code might still be out of sync.
# For Terraform
terraform refresh
# For OpenTofu (Note: 'tofu refresh' is deprecated, see below)
# tofu refresh
OpenTofu has deprecated the standalone tofu refresh command due to potential risks, like misconfigured credentials leading to an incorrect state update; Terraform's documentation carries the same deprecation warning for terraform refresh. Instead, the recommended approach is tofu apply -refresh-only (or terraform apply -refresh-only). This performs the same refresh but shows the detected changes and asks for approval before committing them to the state, promoting safer operations.
# Recommended for OpenTofu (and also works for Terraform)
tofu apply -refresh-only
terraform apply -refresh-only
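For pipelines where nobody is present to answer the interactive prompt, the same review-before-commit idea can be split into two steps using a saved plan file. This is a sketch under assumptions: the function names refresh_preview and refresh_accept, the TF_BIN variable, and the refresh.tfplan file name are all hypothetical; the flags themselves (-refresh-only, -out, -input=false) are standard plan and apply options.

```shell
# Hypothetical names: TF_BIN, REFRESH_PLAN, refresh_preview, refresh_accept.
TF_BIN="${TF_BIN:-terraform}"                   # set to "tofu" for OpenTofu
REFRESH_PLAN="${REFRESH_PLAN:-refresh.tfplan}"  # assumed file name

# Step 1: record what a refresh would change, without writing to state.
refresh_preview() {
  "$TF_BIN" plan -refresh-only -input=false -out="$REFRESH_PLAN"
}

# Step 2: once someone has reviewed the preview, commit exactly that
# saved result to the state file.
refresh_accept() {
  "$TF_BIN" apply -input=false "$REFRESH_PLAN"
}
```

Separating the two steps lets a pipeline publish the preview for review and run refresh_accept only after an explicit approval.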
Strengths of Native Detection:
- Built-in: No extra tools are needed for these basic checks.
- Authoritative: Directly compares your code's intent with the (refreshed) state.
- Foundation for Automation: The plan output can be scripted or programmatically analyzed.
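As a small illustration of scripting the plan output, the filter below keeps only the per-resource header lines (for example, "# aws_instance.web will be updated in-place") and drops the attribute-level detail. Treat it as a rough sketch: summarize_plan is a hypothetical name, and the human-readable plan text is not a stable interface, so the pattern may need adjusting between versions. For anything durable, the JSON produced by terraform show -json on a saved plan file is the more reliable target.

```shell
# Hypothetical filter: reads plan text on stdin, prints one line per
# affected resource. Matches header comments such as:
#   "  # aws_instance.web will be updated in-place"
#   "  # aws_instance.web must be replaced"
#   "  # aws_instance.web has changed"
summarize_plan() {
  sed -n -E 's/^[[:space:]]*# (.+ (will be|must be|has changed|has been deleted).*)$/\1/p'
}
```

A typical invocation would be terraform plan -no-color | summarize_plan; the -no-color flag keeps terminal escape codes out of the text being matched.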
But Native Tools Have Their Limits:
While essential, plan and refresh often fall short in complex, scaled environments:
- Risk with refresh: As OpenTofu highlights, automatic state updates can be risky.
- No Code Reconciliation: refresh doesn't update your .tf files; manual effort is needed if the drifted state is accepted.
- Scalability: Running refresh or plan constantly across many workspaces can be cumbersome and lead to state contention.
- Managed Resources Only: A major blind spot! These commands only detect drift for resources they know about (defined in your configuration and state file). Resources created manually or by other tools ("unmanaged resources" or "shadow IT") go completely unnoticed.
- Verbose Output: Sifting through lengthy plan outputs to find drift can be difficult.
- State File Integrity: Accuracy depends entirely on a healthy state file.
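To make the scalability point concrete, here is roughly what a do-it-yourself scan across several workspaces looks like. It is a sketch under assumptions: scan_workspaces and TF_BIN are hypothetical names, every directory passed in is assumed to be initialized already, and -lock-timeout=60s is an arbitrary choice for waiting on a busy state lock instead of failing immediately.

```shell
# Hypothetical helper: TF_BIN is an assumed variable name ("tofu" also works).
TF_BIN="${TF_BIN:-terraform}"

# scan_workspaces DIR...: prints "<status> <dir>" for each directory and
# returns non-zero if any directory shows pending changes or an error.
scan_workspaces() {
  local dir rc flagged=0
  for dir in "$@"; do
    rc=0
    # -chdir switches directory before running the plan;
    # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
    "$TF_BIN" -chdir="$dir" plan -detailed-exitcode -input=false \
      -lock-timeout=60s -no-color >/dev/null 2>&1 || rc=$?
    case "$rc" in
      0) echo "in-sync $dir" ;;
      2) echo "changes $dir"; flagged=$((flagged + 1)) ;;
      *) echo "error $dir"; flagged=$((flagged + 1)) ;;
    esac
  done
  [ "$flagged" -eq 0 ]
}
```

Even this small loop runs serially, discards the detail needed to triage each result, and still sees only managed resources, which is the gap the list above describes.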
OpenTofu's Stance:
OpenTofu, born from Terraform, maintains functional parity for these core commands. Both tools now mark the standalone refresh command as deprecated in favor of the safer apply -refresh-only workflow, and OpenTofu's community guidance emphasizes this more cautious approach to state management.
Moving Beyond Native Detection
Native commands are the bedrock of drift detection. However, their limitations highlight the need for more advanced solutions in many real-world scenarios. Organizations often require automated, continuous monitoring, clearer reporting across numerous projects, better handling of unmanaged resources, and streamlined remediation workflows.
In the next part of this series, we'll explore how Scalr, an IaC management platform, provides a comprehensive and controlled approach to tackling infrastructure drift, building upon these foundational concepts but offering much more in terms of automation, insight, and user control.