What are Terraform Data Sources?

Terraform data sources let you fetch information about existing infrastructure or external systems and use it in your configurations, so that deployments can adapt to the environment they run in.

Terraform is great at provisioning new infrastructure, but often you need to integrate with existing resources or pull dynamic information into your deployments. This is where Terraform data sources become extremely useful. They allow your Terraform configurations to read information about infrastructure components that are managed outside of the current Terraform state, whether they were created manually, by another Terraform configuration, or by other means.

The Basics of Using Data Sources

A Terraform data source is declared using a data block. Unlike resource blocks, which manage the lifecycle of infrastructure objects (creating, updating, and deleting them), data blocks are read-only. They query an external API or system for information and expose it as attributes that can be referenced elsewhere in your configuration.

The general syntax for a data source block is:

data "<PROVIDER>_<TYPE>" "<NAME>" {
  # Configuration arguments to filter or identify the data
}

In this syntax:

  • <PROVIDER>: The name of the Terraform provider (e.g., aws, azurerm, google).
  • <TYPE>: The type of data you want to retrieve (e.g., ami, vpc, subnet).
  • <NAME>: A local name you assign to this data source, used for referencing its attributes within your configuration.

Once defined, you can access the attributes of a data source using the syntax data.<PROVIDER>_<TYPE>.<NAME>.<ATTRIBUTE>.
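As a minimal sketch of this reference syntax, assuming the AWS provider is already configured, the following reads the account ID behind the current credentials and exposes it. The local name current and the output name account_id are illustrative choices, not part of the examples in this article.

```hcl
# Look up the identity behind the credentials the AWS provider is using.
# This data source takes no arguments.
data "aws_caller_identity" "current" {}

# Reference pattern: data.<PROVIDER>_<TYPE>.<NAME>.<ATTRIBUTE>
output "account_id" {
  value = data.aws_caller_identity.current.account_id
}
```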

Examples of Using Data Sources

Here are three common scenarios where Terraform data sources are valuable:

1. Finding the Latest Amazon Machine Image (AMI)

Instead of hardcoding AMI IDs, which differ by region and are superseded whenever a new image is published, you can use a data source to look up the latest suitable AMI for your EC2 instances.

# main.tf
data "aws_ami" "latest_amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

resource "aws_instance" "web_server" {
  ami           = data.aws_ami.latest_amazon_linux.id
  instance_type = "t2.micro"
  # ... other instance configuration
}

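In this example, the aws_ami data source filters AMIs to find the most recent Amazon Linux 2 HVM image. The ami attribute of the aws_instance then references the id of the found AMI. The lookup runs on every plan, so new instances start from the latest image. Be aware that once a newer AMI is published, the next plan will propose replacing an existing instance, because changing ami forces a new resource.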
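If replacing a running instance whenever a newer AMI appears is not desired, one option is to have Terraform ignore later changes to ami. The following is a sketch of that variation of the web_server resource above, relying on the same aws_ami data block; it is one possible approach, not a recommendation for every setup.

```hcl
resource "aws_instance" "web_server" {
  ami           = data.aws_ami.latest_amazon_linux.id
  instance_type = "t2.micro"

  lifecycle {
    # Use the looked-up AMI when the instance is created,
    # but do not plan a replacement when the lookup result changes later.
    ignore_changes = [ami]
  }
}
```

With this setting, moving an existing instance to a newer image becomes a deliberate step, for example by tainting or replacing the resource explicitly.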

2. Referencing an Existing Virtual Private Cloud (VPC)

When deploying resources into an existing network infrastructure, you often need to reference an already provisioned VPC.

# main.tf
data "aws_vpc" "existing_vpc" {
  filter {
    name   = "tag:Name"
    values = ["my-production-vpc"]
  }
  # Or you could use: id = "vpc-0123456789abcdef0"
}

resource "aws_subnet" "app_subnet" {
  vpc_id            = data.aws_vpc.existing_vpc.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"
}

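Here, the aws_vpc data source retrieves information about the VPC whose Name tag is "my-production-vpc". The lookup must match exactly one VPC; Terraform reports an error if no VPC or more than one VPC matches. The aws_subnet resource then uses the id from this data source to ensure it's created within the correct VPC.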
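The data source exposes more than the VPC's id. As a sketch, assuming the VPC's primary CIDR block is a /16 such as 10.0.0.0/16 (an assumption; the example above does not state it), the subnet range could be derived from the cidr_block attribute with the built-in cidrsubnet function instead of being hardcoded. This is an alternative form of the app_subnet block above:

```hcl
resource "aws_subnet" "app_subnet" {
  vpc_id = data.aws_vpc.existing_vpc.id

  # Extend the VPC prefix by 8 bits and take network number 1.
  # For a 10.0.0.0/16 VPC this evaluates to 10.0.1.0/24.
  cidr_block        = cidrsubnet(data.aws_vpc.existing_vpc.cidr_block, 8, 1)
  availability_zone = "us-east-1a"
}
```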

3. Retrieving Secrets from a Secret Manager

For sensitive information like database passwords or API keys, it's best practice to store them in a dedicated secret management service. Data sources allow you to retrieve these secrets securely at deployment time.

# main.tf
data "aws_secretsmanager_secret" "db_password_secret" {
  name = "my-database-password"
}

data "aws_secretsmanager_secret_version" "db_password_version" {
  secret_id = data.aws_secretsmanager_secret.db_password_secret.id
}

resource "aws_db_instance" "my_database" {
  # ... other database configuration
  password = data.aws_secretsmanager_secret_version.db_password_version.secret_string
}

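This setup first retrieves the my-database-password secret, then its latest version, and finally uses the secret_string attribute to set the database password. This keeps sensitive data out of your Terraform files. Note, however, that the retrieved value is still written to the Terraform state, so the state itself must be stored and accessed securely.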
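Secrets Manager secrets are often stored as a JSON document rather than a bare string. As a sketch, assuming the secret holds JSON with a password key (a hypothetical layout that the example above does not specify), the value can be decoded before use. The local name db_credentials is illustrative, and the block relies on the db_password_version data source defined above:

```hcl
locals {
  # Parse the JSON secret into an object.
  # Assumes the secret looks like {"username": "...", "password": "..."}.
  db_credentials = jsondecode(
    data.aws_secretsmanager_secret_version.db_password_version.secret_string
  )
}

resource "aws_db_instance" "my_database" {
  # ... other database configuration
  password = local.db_credentials["password"]
}
```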

By using data sources, you make your Terraform configurations more dynamic, robust, and maintainable, enabling seamless interaction with existing infrastructure and external systems.

Data Sources vs Outputs

Data sources and outputs are sometimes confused, but they serve distinct purposes: one governs how data flows into your configuration, the other how it flows out.

A data source reads information about infrastructure or external data that already exists and is not managed by your current Terraform configuration. It acts as a query to a provider's API, fetching attributes of an existing object (such as the latest AMI ID, an existing VPC's ID, or a secret from a secrets manager) to use as inputs for the resources your configuration creates. In contrast, an output exposes values from your current configuration, typically attributes of resources it creates or manages.

Outputs act like return values from a module or a root configuration, making information such as a newly provisioned server's public IP address or a database's endpoint accessible on the command line, to other Terraform modules, or to external automation tools.

In essence, data sources bring external data into your configuration, while outputs send internal data out from your configuration.
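The two concepts meet when one configuration consumes another's outputs. As a sketch, assuming a separate network configuration keeps its state in an S3 backend, that configuration could publish a value as an output, and a second configuration could read it through the terraform_remote_state data source. The bucket, key, region, and the aws_vpc.main resource below are hypothetical placeholders.

```hcl
# --- In the network configuration (hypothetical) ---
# Expose the ID of a VPC that this configuration manages.
# Assumes a resource "aws_vpc" "main" is defined there.
output "vpc_id" {
  value = aws_vpc.main.id
}

# --- In the consuming configuration ---
# Read the root-level outputs stored in the other configuration's state.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"   # hypothetical bucket name
    key    = "network/terraform.tfstate" # hypothetical state path
    region = "us-east-1"
  }
}

resource "aws_subnet" "app_subnet" {
  # The output of one configuration becomes the input of another.
  vpc_id     = data.terraform_remote_state.network.outputs.vpc_id
  cidr_block = "10.0.1.0/24"
}
```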