What are Terraform Data Sources?
Learn what Terraform data sources are, when to use them, and how to use them.
What are Data Sources?
Think of a data source as a read-only query to your cloud provider. For instance, you can use a data source to find the ID of an Amazon Machine Image (AMI), an existing Virtual Private Cloud (VPC), or a specific security group. Terraform normally runs this query during the planning phase, before any changes are applied, so the plan is built from information that exists and is up to date. If a data source's arguments depend on values that are only known after apply, the read is deferred to the apply phase.
Data sources are invaluable for creating dynamic and flexible Terraform configurations. Common uses include:
- Referencing existing infrastructure: You can get details about a VPC, subnet, or other resources that were created either manually or by a separate Terraform configuration. This is crucial for building new infrastructure that depends on a pre-existing network setup.
- Dynamic configuration: Instead of hardcoding values, you can use a data source to get the latest version of an object. A classic example is finding the most recent AMI for an EC2 instance, so you don't have to manually update your code when a new image is released.
- Cross-workspace communication: When you have multiple Terraform configurations that manage different parts of your infrastructure, you can use a data source to read outputs from one configuration's state file and use them as inputs in another.
- Avoiding hardcoded values: Data sources help you avoid writing hardcoded IDs or names, making your code more reusable and less prone to errors.
When to Use Terraform Data Sources vs Variables
Data sources and variables are fundamental constructs for making your configurations dynamic and reusable, but they serve distinct purposes. Understanding when to use each is important for writing efficient and maintainable infrastructure as code.
Variables are used to parameterize your Terraform configurations. They allow you to define values that can be customized for different environments (e.g., development, staging, production) or deployments, without altering the core configuration logic. Think of them as inputs to your Terraform module or root configuration. You define variables using the variable block, and their values can be provided via CLI flags, environment variables, or .tfvars files. Variables are ideal for static values that rarely change or values you want to explicitly control from outside your Terraform code, such as region, instance types, or resource prefixes.
Data sources, on the other hand, are used to fetch or compute information about existing resources or external systems, and then make that information available within your Terraform configuration. They act as a read-only bridge to infrastructure that is either managed by another Terraform configuration, provisioned manually, or resides outside of Terraform's direct management. For instance, you might use a data source to retrieve the ID of the most recent AMI, query existing VPCs or subnets, or get details about a pre-existing S3 bucket. Data sources ensure your infrastructure adapts to the current state of external components and avoids hardcoding values that might change.
In summary, use variables when you need to provide input values to your configuration, allowing for flexibility and reusability across different deployments. Use data sources when you need to retrieve information about existing infrastructure or external systems to inform your current Terraform configuration, making your infrastructure deployments more dynamic and context-aware.
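To make the contrast concrete, here is a minimal sketch that uses both constructs side by side. It assumes the AWS provider is already configured; the variable name, its default, and the output name are illustrative choices, not something prescribed by Terraform.

```hcl
# Input: chosen by whoever runs Terraform (CLI flag, environment variable, or .tfvars file).
variable "instance_type" {
  type        = string
  description = "EC2 instance type to launch"
  default     = "t3.micro" # illustrative default
}

# Lookup: read from AWS when Terraform plans; nobody passes this value in.
data "aws_availability_zones" "available" {
  state = "available"
}

# Show the two kinds of values next to each other.
output "chosen_settings" {
  value = {
    instance_type = var.instance_type
    first_az      = data.aws_availability_zones.available.names[0]
  }
}
```

The variable changes only when someone supplies a different value, while the availability zone list reflects whatever the provider reports for the configured region at plan time.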
The Basics of Using Data Sources
A Terraform data source is declared using a data block. Unlike resource blocks, which manage the lifecycle of infrastructure objects (creating, updating, and deleting them), data blocks are read-only. They query an external API or system for information and expose it as attributes that can be referenced elsewhere in your configuration.
The general syntax for a data source block is:
data "<PROVIDER>_<TYPE>" "<NAME>" {
  # Configuration arguments to filter or identify the data
}
- <PROVIDER>: The name of the Terraform provider (e.g., aws, azurerm, google).
- <TYPE>: The type of data you want to retrieve (e.g., ami, vpc, subnet).
- <NAME>: A local name you assign to this data source, used for referencing its attributes within your configuration.
Once defined, you can access the attributes of a data source using the syntax data.<PROVIDER>_<TYPE>.<NAME>.<ATTRIBUTE>.
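As a small illustration of the declaration and reference syntax, the sketch below reads the identity of the credentials Terraform is running with and exposes one attribute. It assumes a configured AWS provider; the local name current and the output name are arbitrary.

```hcl
# data "<PROVIDER>_<TYPE>" "<NAME>": provider aws, type caller_identity, local name current.
# This data source needs no arguments.
data "aws_caller_identity" "current" {}

# Reference pattern: data.<PROVIDER>_<TYPE>.<NAME>.<ATTRIBUTE>
output "account_id" {
  value = data.aws_caller_identity.current.account_id
}
```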
Examples of Using Data Sources
Here are three common scenarios where Terraform data sources are valuable:
1. Finding the Latest Amazon Machine Image (AMI)
Instead of hardcoding AMI IDs, which differ between regions and are superseded whenever a new image is released, you can use a data source to select the latest suitable AMI for your EC2 instances.
# main.tf
data "aws_ami" "latest_amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

resource "aws_instance" "web_server" {
  ami           = data.aws_ami.latest_amazon_linux.id
  instance_type = "t2.micro"
  # ... other instance configuration
}
In this example, the aws_ami data source filters AMIs to find the most recent Amazon Linux 2 HVM image. The ami argument of the aws_instance resource then references the id of the matching AMI. The lookup is repeated on every plan, so when a newer image is published, Terraform will propose replacing the instance to move it onto that image.
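Because a changed AMI ID forces the instance to be replaced, you may not want every new image release to show up as a pending replacement. One possible approach, sketched here as an option rather than a recommendation from the original example, is to let the data source pick the image at creation time and then ignore later changes to the ami argument.

```hcl
data "aws_ami" "latest_amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

resource "aws_instance" "web_server" {
  ami           = data.aws_ami.latest_amazon_linux.id
  instance_type = "t2.micro"

  lifecycle {
    # Keep the AMI selected when the instance was created;
    # newer AMI releases no longer cause a planned replacement.
    ignore_changes = [ami]
  }
}
```

The trade-off is that existing instances stay on their original image until you replace them deliberately.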
2. Referencing an Existing Virtual Private Cloud (VPC)
When deploying resources into an existing network infrastructure, you often need to reference an already provisioned VPC.
# main.tf
data "aws_vpc" "existing_vpc" {
  filter {
    name   = "tag:Name"
    values = ["my-production-vpc"]
  }
  # Or you could use: id = "vpc-0123456789abcdef0"
}

resource "aws_subnet" "app_subnet" {
  vpc_id            = data.aws_vpc.existing_vpc.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"
}
Here, the aws_vpc data source retrieves information about a VPC tagged "my-production-vpc". The aws_subnet resource then uses the id from this data source to ensure it's created within the correct VPC.
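The same data source also exposes the VPC's CIDR range, so the subnet's address block does not have to be hardcoded either. The following is a sketch under two assumptions: the VPC's primary CIDR is large enough to be split by 8 more bits (for example a /16), and the resulting range does not overlap a subnet that already exists.

```hcl
data "aws_vpc" "existing_vpc" {
  filter {
    name   = "tag:Name"
    values = ["my-production-vpc"]
  }
}

resource "aws_subnet" "app_subnet" {
  vpc_id = data.aws_vpc.existing_vpc.id

  # Derive the subnet range from the VPC's own CIDR block.
  # For a VPC CIDR of 10.0.0.0/16, cidrsubnet(..., 8, 1) evaluates to 10.0.1.0/24.
  cidr_block        = cidrsubnet(data.aws_vpc.existing_vpc.cidr_block, 8, 1)
  availability_zone = "us-east-1a"
}
```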
3. Retrieving Secrets from a Secret Manager
For sensitive information like database passwords or API keys, it's best practice to store them in a dedicated secret management service. Data sources allow you to retrieve these secrets securely at deployment time.
# main.tf
data "aws_secretsmanager_secret" "db_password_secret" {
  name = "my-database-password"
}

data "aws_secretsmanager_secret_version" "db_password_version" {
  secret_id = data.aws_secretsmanager_secret.db_password_secret.id
}

resource "aws_db_instance" "my_database" {
  # ... other database configuration
  password = data.aws_secretsmanager_secret_version.db_password_version.secret_string
}
This setup first looks up the my-database-password secret, then reads its current version, and finally uses the secret_string attribute to set the database password. This keeps the password out of your Terraform files. Note that the retrieved value is still stored in the Terraform state, so the state itself must be protected.
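Secrets are often stored as a JSON document rather than a single string. The sketch below shows one way to handle that case with jsondecode. The secret name, the JSON keys username and password, and the database settings are all hypothetical values chosen for illustration.

```hcl
data "aws_secretsmanager_secret_version" "db_credentials" {
  # secret_id accepts the secret's name or ARN; this name is hypothetical.
  secret_id = "my-database-credentials"
}

locals {
  # Assumes the secret string is JSON shaped like {"username": "...", "password": "..."}.
  db_credentials = jsondecode(data.aws_secretsmanager_secret_version.db_credentials.secret_string)
}

resource "aws_db_instance" "my_database" {
  # Illustrative settings; adjust to your own database.
  engine              = "postgres"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  skip_final_snapshot = true

  username = local.db_credentials["username"]
  password = local.db_credentials["password"]
}
```

Terraform treats secret_string as sensitive, and values derived from it inherit that marking, so they are redacted in plan output.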
By using data sources, you make your Terraform configurations more dynamic, robust, and maintainable, enabling seamless interaction with existing infrastructure and external systems.
Remote State Data Source
In Terraform, "remote data sources" usually refers to the built-in terraform_remote_state data source. It lets you read the root-level outputs of another, independently managed Terraform state stored in a remote backend, such as Scalr or Terraform Cloud.
Here's a breakdown of what that means and why it's so useful:
- Sharing Outputs Between Configurations: In larger infrastructure setups, it's common to break down your infrastructure into multiple, smaller Terraform configurations. For example, you might have one configuration that sets up your core network (VPC, subnets, routing tables), and another configuration that deploys applications into that network. The application configuration needs to know the IDs of the VPC or subnets created by the network configuration.
- Decoupling Infrastructure: terraform_remote_state enables this decoupling. Instead of managing all your infrastructure in one giant Terraform state file, you can manage different parts of your infrastructure separately. This improves team collaboration, reduces the blast radius of changes, and makes your deployments more modular.
- How it Works: When you declare a terraform_remote_state data source, you specify the backend where the other configuration's state file is stored (e.g., S3, Azure Blob Storage, Scalr, Terraform Cloud) and the key/path to that state file. Terraform then reads that remote state and exposes the root module's output values through the data source's outputs attribute.
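The points above can be sketched with an S3 backend. The bucket name and state key are hypothetical, and the example assumes the network configuration declares a root-level output named vpc_id.

```hcl
# Read the state written by a separate "network" configuration.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"   # hypothetical bucket
    key    = "network/terraform.tfstate" # hypothetical path to the other state file
    region = "us-east-1"
  }
}

# Use one of the network configuration's outputs as an input here.
resource "aws_subnet" "app_subnet" {
  vpc_id            = data.terraform_remote_state.network.outputs.vpc_id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"
}
```

Only values that the other configuration explicitly declares as outputs in its root module are visible this way; resources inside its state cannot be referenced directly.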
See examples on remote state sharing here.
Data Sources vs Outputs
Data sources and outputs are sometimes confused, but they serve distinct purposes related to how data flows into and out of your configurations.
A data source is used to read information about infrastructure or external data that already exists and is not directly managed by your current Terraform configuration. It acts as a query to a provider's API, fetching attributes of an existing resource (like the latest AMI ID, an existing VPC's ID, or secrets from a vault) to be used as inputs for new resources your Terraform code is creating. In contrast, an output is used to expose specific values or attributes of resources that your current Terraform configuration has just created or modified.
Outputs act like return values from a module or a root configuration, making information such as a newly provisioned server's public IP address or a database's endpoint accessible on the command line, to other Terraform modules, or to external automation tools.
In essence, data sources bring external data into your configuration, while outputs send internal data out from your configuration.
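A short sketch can show both directions in a single configuration. It assumes a configured AWS provider and an existing VPC tagged my-production-vpc; the security group name and output name are illustrative.

```hcl
# Data in: read a VPC that this configuration does not manage.
data "aws_vpc" "existing_vpc" {
  filter {
    name   = "tag:Name"
    values = ["my-production-vpc"]
  }
}

# A resource that this configuration does manage, built on the looked-up VPC.
resource "aws_security_group" "app" {
  name   = "app-sg" # illustrative name
  vpc_id = data.aws_vpc.existing_vpc.id
}

# Data out: expose an attribute of the managed resource to callers and tooling.
output "app_security_group_id" {
  description = "ID of the security group created by this configuration"
  value       = aws_security_group.app.id
}
```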