Troubleshooting Common Terraform Atlantis Issues

Resolve Terraform Atlantis hiccups fast: step-by-step fixes for plan/apply failures, repo config, credentials, webhooks, drift detection and more.

Introduction: The Atlantis Advantage and Its Challenges

Terraform Atlantis has become a popular choice for teams looking to implement GitOps for their infrastructure. By bringing terraform plan and apply directly into pull request workflows, it fosters collaboration and standardizes IaC management. However, this power comes with operational complexity. Effectively managing the interactions between Atlantis, your Version Control System (VCS), Terraform versions, provider credentials, and cloud environments is crucial. When issues arise, they can be time-consuming to diagnose and resolve, impacting team productivity. While Atlantis offers a great degree of control, the setup and maintenance overhead for these common issues can lead teams to explore platforms like Scalr, which often provide more integrated solutions to these very challenges, abstracting away some of the underlying friction.

This post aims to equip you with the knowledge to tackle some of the most frequent problems encountered with Atlantis.

Problem 1: Credential Misconfigurations

Credential misconfigurations are a frequent source of trouble. Symptoms often include errors like AccessDenied, UnauthorizedOperation, or NoCredentialProviders from AWS during plans or applies. Azure users might see AuthenticationFailed or InvalidSubscriptionId, while GCP interactions can fail with PermissionDenied or service account key errors. When Atlantis interacts with the VCS, fatal: could not read Username... terminal prompts disabled or HTTP 401/403 errors signal authentication problems.

The common causes for these issues are varied: incorrect IAM roles or permissions for the Atlantis server or Terraform execution; missing, incorrect, or unexported environment variables (e.g., AWS_ACCESS_KEY_ID, ARM_CLIENT_ID); malformed or inaccessible credential files; or expired, revoked, or insufficiently scoped VCS tokens or GitHub App permissions. Misconfigured assume_role policies or instance profiles in AWS are also frequent culprits.

Diagnosing these problems starts with examining Atlantis server logs, ideally with debug logging enabled (--log-level=debug), for detailed error messages. It's critical to verify the credentials within the Atlantis environment itself by shelling into the container/server and using provider-specific commands (e.g., aws sts get-caller-identity, az account show, gcloud auth list) and checking relevant environment variables (e.g., env | grep AWS_, keeping in mind that this prints secret values to the terminal). Reviewing IAM policies using cloud provider tools like the AWS IAM Policy Simulator and validating VCS token scopes in GitHub/GitLab settings are also key steps.
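
As a minimal sketch of the environment-variable check, the helper below reports which of the named variables are not visible to child processes such as Terraform, without printing any values. The function name missing_env_vars is hypothetical, the variable list in the usage line is only an example, and the sketch assumes printenv is available where Atlantis runs.

```shell
# Print the names of variables that are unset, empty, or not exported.
# Values are never printed, so the output is safe to paste into a ticket.
missing_env_vars() {
  missing=""
  for name in "$@"; do
    # printenv only sees exported variables, which is what Terraform inherits.
    if [ -z "$(printenv "$name")" ]; then
      missing="$missing $name"
    fi
  done
  # Drop the leading space before printing.
  echo "${missing# }"
}

# Usage (run inside the Atlantis container or host):
#   missing_env_vars AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION
```

An empty result means every listed variable is exported and non-empty; it says nothing about whether the credentials themselves are valid, which is what the provider CLI commands are for.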

Resolution involves correcting the identified misconfigurations. This means rectifying IAM policies to grant least-privilege permissions and ensuring environment variables are correctly set and exported to be available to the Atlantis process. For instance, AWS credentials can be exported in the startup script that launches Atlantis (in a Dockerfile the equivalent is an ENV instruction, although baking secrets into an image is best avoided):

```shell
# Example: ensure AWS credentials are exported before Atlantis starts
# (startup script; placeholder values, inject the real ones from a secret store)
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_DEFAULT_REGION="your_region"
```

Preferring cloud-native IAM mechanisms like instance profiles or workload identity over static keys is a best practice. If VCS tokens/apps are faulty, regenerate them with the correct scopes. Lastly, securely manage all secrets using tools like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault.

Scalr Note: Platforms like Scalr often simplify credential management by providing centralized, environment-aware secret stores and IAM integrations, reducing the surface area for such misconfigurations.

Problem 2: Webhook Delivery Failures

Webhook delivery failures mean Atlantis isn't receiving notifications from your VCS. Symptoms are clear: Atlantis doesn't comment on PRs or respond to commands. Your VCS webhook delivery logs will show errors such as "Could not deliver webhook," "Invalid HTTP response: 403/400," or "Signature mismatch." Atlantis logs might show Ignoring unsupported event or errors related to secret validation.

These failures are commonly caused by an incorrect webhook URL or a mismatched HTTP/HTTPS protocol. A mismatched webhook secret between the VCS and Atlantis (e.g., ATLANTIS_GH_WEBHOOK_SECRET) is another frequent offender. Other causes include incorrect event subscriptions in the VCS (e.g., not listening for "Pull requests" or "Issue comments"), firewall or network issues blocking VCS IPs from reaching Atlantis, the Atlantis --repo-allowlist not including the repository, SSL/TLS misconfiguration on the Atlantis server, or a reverse proxy stripping necessary headers or malforming requests.

To diagnose webhook issues, always check the VCS webhook delivery logs first. Concurrently, inspect Atlantis server logs for errors related to webhook processing. Meticulously verify the webhook URL and secret in both the VCS and Atlantis configurations. Test network connectivity using curl from an external source to the Atlantis /events endpoint:

```shell
curl -X POST https://atlantis.yourcompany.com/events \
     -H "Content-Type: application/json" \
     -H "X-GitHub-Event: ping" \
     -H "X-Hub-Signature-256: sha256=<calculated_signature_if_secret_is_set>" \
     -d '{"zen": "Approachable is better than simple."}' -v
```

Also, validate event subscriptions in VCS settings and check the Atlantis --repo-allowlist.
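
The signature placeholder in the curl test has to be an HMAC-SHA256 of the exact request body, keyed with the webhook secret. The sketch below shows one way to compute it, assuming openssl is installed; the function name github_signature is hypothetical.

```shell
# Compute the X-Hub-Signature-256 header value for a payload and secret:
# HMAC-SHA256 over the exact request body, hex-encoded, prefixed with "sha256=".
github_signature() {
  secret=$1
  payload=$2
  # openssl prints "<label>= <hex>"; keep only the hex digest.
  digest=$(printf '%s' "$payload" | openssl dgst -sha256 -hmac "$secret" | sed 's/^.*= *//')
  echo "sha256=$digest"
}

# Usage with the ping payload from the curl test (the secret comes from your own config):
#   github_signature "$ATLANTIS_GH_WEBHOOK_SECRET" '{"zen": "Approachable is better than simple."}'
```

The payload passed to the function must match the bytes sent with -d character for character, including whitespace, or the signatures will not match.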

Resolving these failures involves correcting the webhook URL or secret in the VCS or Atlantis configuration and ensuring correct event subscriptions are configured. Adjust firewall/network rules to allow VCS IP ranges. Fix any Atlantis server misconfigurations like --repo-allowlist or SSL settings. If using a reverse proxy, ensure it passes headers like X-Hub-Signature-256 and X-GitHub-Event correctly.

Scalr Note: Managed IaC platforms often handle webhook integrations more seamlessly, with guided setup and pre-configured listeners, minimizing these common setup errors.

Problem 3: Plan/Apply Lock Contention

Lock contention occurs when Atlantis prevents concurrent operations on the same project. Symptoms include Atlantis commenting "This project is currently locked by another pull request: #123," atlantis plan or apply commands being rejected due to an existing lock, or the Atlantis UI showing active locks.

The common causes are legitimate concurrent operations on the same project/workspace, or stuck/failed plans/applies not releasing their locks. Long-running Terraform operations can hold locks for extended periods, and PRs with acquired locks that aren't merged or closed promptly also contribute. Broad project definitions in atlantis.yaml can cause unnecessary contention by locking larger sections of infrastructure than intended.

Diagnosing lock contention is usually straightforward. Check PR comments where Atlantis typically indicates the PR holding the lock. The Atlantis UI, if enabled, provides a clear overview of active locks. Reviewing open PRs for recent Atlantis activity on the locked project can also help.
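
Because Atlantis locks are taken per project (directory and workspace), it can help to see which directories a PR actually touches. The helper below is a small sketch for that; the name changed_dirs and the origin/main base branch in the usage line are assumptions.

```shell
# Read file paths on stdin and print the unique directories that contain them.
changed_dirs() {
  while IFS= read -r file; do
    dirname "$file"
  done | sort -u
}

# Usage from a checkout of the PR branch (origin/main is an assumed base branch):
#   git diff --name-only origin/main...HEAD | changed_dirs
```

Comparing that list with the dir values in atlantis.yaml gives a rough idea of which projects the PR is likely to lock.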

For resolution, if the operation is legitimate, waiting is the simplest option. For stale locks, you can manually unlock by commenting atlantis unlock on the PR holding the lock or by using the "Discard Plan and Unlock" button in the Atlantis UI. Addressing the underlying issue in stuck or failed PRs, or closing them, will release locks. Optimizing plan/apply times (see Performance Bottlenecks) helps, as does refining atlantis.yaml project granularity to split broad project definitions into more specific ones, reducing the scope of locks. For example, instead of one broad project, define multiple granular ones:

```yaml
# After: More granular
version: 3
projects:
- name: frontend_app
  dir: apps/frontend
  workspace: production
  workflow: default
  autoplan:
    when_modified: ["**/*.tf*", "../../modules/network/**/*.tf"]
    enabled: true
- name: backend_api
  dir: apps/backend
  workspace: production
  workflow: default
  autoplan:
    when_modified: ["**/*.tf*", "../../modules/database/**/*.tf"]
    enabled: true
```

Scalr Note: Scalr's run queuing and workspace-level concurrency management can provide more sophisticated handling of parallel operations, reducing the likelihood of simple lock contention becoming a major blocker.

Problem 4: Plan Inconsistencies (Local vs. Atlantis)

Plan inconsistencies arise when Atlantis plan output differs from a local terraform plan. Symptoms include Atlantis proposing unexpected resource changes or showing "No changes" when local plans indicate modifications, or vice-versa. Atlantis might also propose to destroy/recreate resources not intentionally modified.

The common causes for such discrepancies are environmental. A Terraform version mismatch between local and Atlantis environments is a key factor, as is a provider version mismatch, often due to an uncommitted or inconsistent .terraform.lock.hcl file. Differing environment variables (like TF_VAR_* or AWS_DEFAULT_REGION), backend configuration discrepancies where Atlantis targets a different state or workspace, uncommitted .tfvars files, or Atlantis's merge checkout strategy (which plans against a temporary merge commit) can all lead to differing plans.

Diagnosing these inconsistencies requires comparing environments. Verify the Terraform version in Atlantis logs, atlantis.yaml (terraform_version), or via the atlantis version PR command. Compare local terraform providers output with the .terraform.lock.hcl Atlantis uses, ensuring this lock file is committed and identical. Check environment variables, perhaps by using a custom workflow step like run: env in Atlantis, and verify the backend configuration in code and Atlantis logs.
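
A quick way to spot a version mismatch is to compare the version pinned in atlantis.yaml with the one installed locally. The following is a rough sketch: the function names are hypothetical, and the sed expression only handles a plain, unquoted terraform_version value such as v1.5.0.

```shell
# Print the first terraform_version pinned in an atlantis.yaml-style file,
# without the optional leading "v" (e.g. "v1.5.0" becomes "1.5.0").
pinned_tf_version() {
  sed -n 's/^[[:space:]]*terraform_version:[[:space:]]*v\{0,1\}\([0-9][0-9.]*\).*/\1/p' "$1" | head -n 1
}

# Compare the pinned version with a locally installed one.
# Returns 0 on a match, 1 otherwise.
check_tf_version() {
  pinned=$(pinned_tf_version "$1")
  local_version=$2
  if [ "$pinned" = "$local_version" ]; then
    echo "match: $pinned"
  else
    echo "mismatch: atlantis.yaml pins '$pinned', local is '$local_version'"
    return 1
  fi
}

# Usage (assumes terraform and jq are installed locally):
#   check_tf_version atlantis.yaml "$(terraform version -json | jq -r .terraform_version)"
```

If the file pins different versions per project, only the first one is checked, so treat the result as a first hint rather than a full audit.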

To resolve these issues, align Terraform versions using terraform_version in atlantis.yaml and required_version in your Terraform code:

```yaml
# atlantis.yaml
version: 3
projects:
- dir: .
  terraform_version: v1.5.0 # Specify version
```

```hcl
# main.tf
terraform {
  required_version = ">= 1.5.0"
  # ...
}
```

Always commit and maintain the .terraform.lock.hcl file. Standardize environment variable injection using committed .tfvars or secure server-side injection. Ensure consistent backend configuration. If Atlantis uses a merge checkout strategy, local comparison plans should also be against a similarly merged state.

Scalr Note: Scalr enforces consistent execution environments, including Terraform versions and provider handling, across all runs. Variables and credentials are also managed centrally per environment, drastically reducing these types of inconsistencies.

Problem 5: atlantis.yaml Syntax Errors or Misconfigurations

Errors in atlantis.yaml can disrupt Atlantis's custom behavior. Symptoms include autoplan failures where Atlantis doesn't trigger plans for matching file changes, incorrect workflow execution or fallback to default workflows, "Project not found" errors for atlantis plan -p <project_name>, YAML parsing errors in Atlantis server logs, or unexpected behavior in custom workflows like scripts failing or environment variables not being set.

Common causes range from simple YAML syntax errors (incorrect indentation, hyphens, colons) to an incorrect version: 3 directive. Misconfigured projects arrays (e.g., wrong dir, workspace, or missing name for disambiguation) are frequent. Flawed when_modified patterns (incorrect glob syntax, paths not relative to project dir, forgetting file types like .tfvars) can prevent autoplan. Misconfigured custom workflows (typos, script errors, tools not in PATH, env var scope issues) and server-side restrictions in repos.yaml (disallowing overrides or custom workflows) also lead to problems.

Diagnosing atlantis.yaml issues involves validating YAML syntax locally before committing. Check Atlantis server logs (at debug level) for parsing errors and project matching logic. Using atlantis plan --verbose in PR comments can provide execution logs. Systematically test when_modified patterns, remembering paths are relative to the project dir. For example, for a project in envs/dev using a shared module at modules/vpc:

```yaml
# Correct when_modified
version: 3
projects:
- name: dev_vpc
  dir: envs/dev
  autoplan:
    when_modified:
    - "**/*.tf"
    - "**/*.tfvars"
    - "../../modules/vpc/**/*.tf" # Path relative to 'envs/dev'
    enabled: true
```

Simplify custom workflows to isolate issues and verify server-side repos.yaml for allowed_overrides and allow_custom_workflows.
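
When a when_modified pattern does not seem to match, it often helps to work out where it points relative to the repository root. The helper below is an illustrative sketch of that resolution (resolve_pattern is a hypothetical name, and this is not how Atlantis implements matching); it joins the project dir and the pattern and collapses any "." and ".." segments.

```shell
# Resolve a when_modified pattern against its project dir and print the
# resulting path relative to the repository root. Glob characters are kept as-is.
resolve_pattern() {
  full_path="$1/$2"
  result=""
  old_ifs=$IFS
  IFS=/
  set -f            # keep "*" and "**" literal while splitting on "/"
  for part in $full_path; do
    case $part in
      ''|.) ;;      # skip empty segments and "."
      ..)           # step up one directory
        case $result in
          */*) result=${result%/*} ;;
          *)   result="" ;;
        esac ;;
      *)
        if [ -z "$result" ]; then result=$part; else result="$result/$part"; fi ;;
    esac
  done
  set +f
  IFS=$old_ifs
  echo "$result"
}

# Usage:
#   resolve_pattern envs/dev '../../modules/vpc/**/*.tf'
```

If the resolved path does not exist in the repository, the pattern most likely has one "../" too many or too few. Note that the sketch silently stops at the repository root when a pattern climbs above it.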

Resolution involves correcting YAML syntax and ensuring version: 3. Refine project definitions (dir, workspace, name) and improve when_modified patterns. Debug custom workflows by ensuring scripts are executable, tools are installed, and env/multienv are used correctly. Adjust server-side configuration if it's restricting atlantis.yaml functionality.

Scalr Note: Scalr's UI-driven configuration for project setup, OPA policy integration, and custom hooks can reduce the likelihood of syntax errors and misconfigurations common in YAML-heavy setups, while still offering powerful customization.

Problem 6: Performance Bottlenecks

Performance bottlenecks can slow down Atlantis operations significantly. Symptoms include excessively long plan or apply times, high CPU/memory usage on the Atlantis server leading to unresponsiveness, a slow Atlantis web UI, and increased lock contention due to long operation times.

The common causes are often large Terraform state files or complex configurations with numerous modules and resources. Insufficient Atlantis server resources (CPU, RAM, disk I/O) are a major factor, as is a suboptimal --parallel-pool-size (default 15) for the server's capacity. Slow disk I/O for the --data-dir, network latency to cloud providers, inefficient Terraform code, or a disabled/slow Terraform plugin cache can also contribute.

Diagnosing performance issues requires monitoring Atlantis server resources (top, htop, K8s metrics). Enabling the Atlantis profiling API (--enable-profiling-api) allows use of pprof for deeper analysis. Analyze plan/apply durations from PR comments/logs, check Terraform state file sizes, and reproduce slow configurations locally, timing terraform plan and setting TF_LOG=debug to see where the time goes.
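
For local comparisons, a small timing wrapper is often enough to tell whether a change made a plan faster. The sketch below uses a hypothetical helper named timed that reports wall-clock seconds on stderr and preserves the wrapped command's exit status.

```shell
# Run a command, report its wall-clock duration on stderr, and preserve
# the command's exit status.
timed() {
  start=$(date +%s)
  status=0
  "$@" || status=$?
  end=$(date +%s)
  echo "elapsed: $((end - start))s ($*)" >&2
  return "$status"
}

# Usage (assumes terraform is installed and the directory is initialized):
#   timed terraform plan
#   timed terraform plan -refresh=false
```

A large gap between the two runs suggests that refreshing state against the cloud provider dominates, which points toward splitting the state rather than adding server resources.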

Resolutions often involve optimizing Terraform code by modularizing and splitting large configurations/states. Increasing Atlantis server resources and tuning --parallel-pool-size based on server capacity are crucial. Using faster storage (SSDs) for --data-dir helps, as does enabling --skip-clone-no-changes (for GitHub/GitLab repositories that use an atlantis.yaml) so that Atlantis skips cloning when a PR touches no project files. Ensure the Terraform plugin cache (--use-tf-plugin-cache=true) is enabled and on fast storage.

Scalr Note: Scalr's architecture is designed for scalability, with options for self-hosted agents that can be sized appropriately for demanding workloads. Its state management and run execution are optimized, often mitigating some of these common Terraform performance issues at the platform level.

Problem 7: Basic Security Oversights

Neglecting basic security can expose Atlantis and your infrastructure. These symptoms are usually found via audit rather than direct errors: an Atlantis web UI accessible without authentication, webhooks over HTTP or with weak/missing secrets, overly broad IAM permissions for Atlantis/Terraform, VCS tokens with excessive permissions, hardcoded secrets in atlantis.yaml or scripts, or unrestricted custom workflows allowing arbitrary command execution.

Common causes include leaving default configurations unchanged, prioritizing convenience over security during setup, insecure secret management practices, or globally permissive custom workflow settings (allow_custom_workflows: true).

Diagnosing these oversights involves auditing. Review Atlantis server configuration flags (--web-basic-auth, --ssl-cert-file, --repo-allowlist, --allow-fork-prs). Inspect VCS webhook settings for HTTPS and strong secrets. Audit cloud provider IAM roles/permissions for least privilege. Check VCS token scopes and scan atlantis.yaml and custom scripts for hardcoded secrets.
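
As one small piece of such an audit, the sketch below greps files for strings shaped like long-term AWS access key IDs. The function name find_aws_key_ids and the paths in the usage line are hypothetical, and a single pattern is no substitute for a dedicated secret scanner.

```shell
# List lines that look like AWS access key IDs (the "AKIA" prefix followed by
# 16 uppercase letters or digits). Exit status is 0 when something is found.
find_aws_key_ids() {
  grep -rnE 'AKIA[0-9A-Z]{16}' "$@"
}

# Usage from the repository root (paths are examples):
#   find_aws_key_ids atlantis.yaml scripts/
```

Any hit should be treated as a leaked credential: rotate the key first, then remove it from the repository history.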

Hardening Atlantis involves several steps. Secure Atlantis server access by enabling UI authentication (e.g., --web-basic-auth=true --web-username=<user> --web-password=<pass>) and enforcing HTTPS (--ssl-cert-file, --ssl-key-file). Secure webhooks with HTTPS and strong, unique secrets (ATLANTIS_GH_WEBHOOK_SECRET). Implement least privilege for all credentials. Manage secrets securely using tools like HashiCorp Vault or AWS Secrets Manager. Restrict custom workflows, disabling them globally or enabling selectively via repos.yaml with allowed_overrides:

```yaml
# Server-side repos.yaml example
# When a repo matches several entries, the last matching entry wins,
# so the catch-all default goes first.
repos:
- id: /.*/ # Default for all repos
  allow_custom_workflows: false # Disable for untrusted repos
- id: /github.com/your-org/trusted-.*/ # Hypothetical regex for trusted repos
  allow_custom_workflows: true
  allowed_overrides: [workflow, plan_requirements, apply_requirements]
```

Scalr Note: Security is a core tenet of platforms like Scalr, which provide built-in role-based access control (RBAC), secure variable management with masking, and integration with OPA for policy enforcement, addressing many of these security concerns at the platform level.

Problem 8: Stale Plans and Diverged Branches

Stale plans occur when the base branch or remote state changes after a plan is generated. Symptoms include atlantis apply failing due to remote state drift, Atlantis blocking applies if the undiverged requirement is set and the PR is behind its base branch, or applies succeeding but with unintended changes due to an outdated plan. Inconsistent behavior of the undiverged check itself can also be a symptom.

The common causes are the base branch (e.g., main) being updated after plan generation, misconfiguration of the checkout strategy (--checkout-strategy, default is branch) and the undiverged requirement, which is only meaningful with the merge strategy, or delayed PR merging. Potential flakiness in Atlantis's undiverged check implementation in some versions can also contribute.

To diagnose this, compare the PR branch with its base branch using Git commands or the VCS UI. Check Atlantis PR comments for undiverged failure messages. Verify the --checkout-strategy server flag and the undiverged configuration in plan_requirements/apply_requirements.
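
The branch comparison can be done with a single Git command. The sketch below wraps it in a hypothetical helper, commits_behind, that counts the commits on the base branch that the PR branch does not yet contain.

```shell
# Print how many commits on the base branch are missing from the PR branch.
# 0 means the PR branch already contains everything on the base branch.
commits_behind() {
  base=$1
  head=$2
  git rev-list --count "$head..$base"
}

# Usage from a checkout of the PR branch (origin/main is an assumed base branch):
#   git fetch origin && commits_behind origin/main HEAD
```

A non-zero count means the existing plan was generated against an older base and should be regenerated after updating the branch.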

Resolving stale plans involves rebasing or merging the base branch into the PR branch before applying, then re-planning. A robust solution is to configure the merge checkout strategy (--checkout-strategy=merge server flag) along with the undiverged requirement in atlantis.yaml or repos.yaml:

```yaml
# repos.yaml or project in atlantis.yaml
apply_requirements: [approved, undiverged] # Add undiverged
# Optionally for earlier feedback:
# plan_requirements: [undiverged]
```

Enforcing branch protection rules in the VCS (requiring branches to be up-to-date) helps. Automating PR updates with external tools (Mergify, GitHub Actions) and promptly merging approved PRs also mitigate this issue.

Scalr Note: Scalr's run lifecycle management ensures that applies are based on the latest code and configuration. It typically re-evaluates plans against the current state before apply, and its environment locking mechanisms can prevent conflicting operations more robustly than simple branch divergence checks.

Summary of Common Issues

| Problem Area | Common Symptom(s) | Typical Resolution Category |
|---|---|---|
| Credential Misconfigurations | AccessDenied, NoCredentialProviders, VCS auth errors | Correct IAM, Env Vars, Secure Secrets |
| Webhook Delivery Failures | No PR comments, VCS delivery errors, signature mismatch | Fix URL/Secret, Network Rules, Event Subscriptions |
| Plan/Apply Lock Contention | "Project locked by PR #XYZ", UI shows active locks | Manual Unlock, Fix Stuck PRs, Refine atlantis.yaml projects |
| Plan Inconsistencies | Atlantis plan differs from local plan | Align TF/Provider Versions, .terraform.lock.hcl, Env Vars |
| atlantis.yaml Errors | Autoplan fails, wrong workflow, project not found | Fix YAML syntax, when_modified patterns, workflow logic |
| Performance Bottlenecks | Slow plans/applies, high server load | Optimize TF code/state, Increase Server Resources, Tune Parallelism |
| Basic Security Oversights | Unauthenticated UI, HTTP webhooks, broad permissions | Enable Auth/HTTPS, Least Privilege, Secure Secrets, Restrict Workflows |
| Stale Plans/Diverged Branches | Apply fails on drift, undiverged blocks apply | Rebase PR, Use merge strategy + undiverged req, VCS rules |

Conclusion: Mastering Atlantis and Considering Alternatives

Terraform Atlantis offers valuable automation for IaC workflows. However, as we've seen, navigating its operational complexities requires diligence, a good understanding of its internals, and careful configuration. For teams managing a few repositories with straightforward needs, mastering these troubleshooting steps can lead to a stable Atlantis setup.

Yet, as infrastructure scale and team size grow, the cumulative effort of managing these issues—ensuring credential hygiene across numerous projects, debugging atlantis.yaml intricacies, optimizing performance, and maintaining a robust security posture—can become substantial. This is often the point where organizations begin to evaluate more comprehensive IaC management platforms. Solutions like Scalr are designed to address many of these challenges natively, offering features such as integrated secret management, hierarchical configuration, role-based access control, sophisticated run orchestration, and policy enforcement through Open Policy Agent (OPA). By abstracting away some of the lower-level operational burdens, these platforms can empower teams to focus more on defining and evolving their infrastructure, rather than on the intricacies of the automation tooling itself.

Ultimately, the right choice depends on your team's specific needs, scale, and tolerance for operational overhead. Understanding the common pitfalls of Atlantis is the first step, whether your goal is to optimize your current setup or to identify when it's time to explore alternatives that might offer a more streamlined path to secure and efficient Infrastructure as Code.