Let me start with something most infrastructure engineers won't say out loud: Terraform solves Day-1 beautifully and then kinda leaves you hanging.
You write your HCL, run terraform apply, and everything is provisioned perfectly. The state file looks impeccable. But six months later, that same infrastructure has been poked, patched, and manually changed until it has silently drifted away from what Terraform thinks exists. Nobody realizes it until something breaks in production.
This article is about that “gap” between provisioning and actually managing infrastructure across its entire lifetime.
Day-2 Is Where Infrastructure Goes to Die (Slowly)
Picture a full stack provisioned on AWS with Terraform. State is clean, everything matches. Then time passes, a deployment fails, and someone logs into the console and changes a security group rule. The deployment now succeeds, but the change is never documented and no ticket is raised.
On the next scheduled terraform apply, Terraform sees the difference and resets the security group to its original state, breaking production. Everyone is confused, because no code changed.
The root cause is that the tooling was never designed for this usage: Terraform's core job is infrastructure provisioning, not ongoing operations.
Therefore, what are teams doing for their Day-2 operations? Most have a combination of:
- Bash scripts that contain parts nobody understands
- AWS Console changes that are made manually and never documented
- Ad-hoc Ansible runs that don't tie back to Terraform state in any way
- Lambda functions that each trigger another Lambda, forming chains nobody can trace
In the field, it is common to see organizations actively managing a single hybrid infrastructure estate with more than 30 different tools.
The Lifecycle Nobody Talks About Enough
Infrastructure has four phases, and most of the industry focuses heavily on just two of them.
The first phase, Day-0, is the Build phase. The organization designs its infrastructure and defines policies, in partnership with the platform and security teams. Nothing has been provisioned yet.
The second phase, Day-1, is the Deploy phase. terraform apply runs, infrastructure gets built, and application teams deploy their workloads. This is where Terraform really shows its capabilities.
Day-2 is the Manage phase. Patches are installed, configurations change, certificates are renewed, capacity is scaled, and compliance is checked. Day-2 can last for years, and it is where nearly all of the operational pain lives. Traditionally, Terraform has had no role in this phase.
Day-N is the Decommission phase, where everything is torn down and cleaned up.
Over the last ten years, the DevOps industry has focused on perfecting Day-1 tooling; Day-2 tooling remains sparse.
Terraform Actions — What Changed in v1.14
Terraform Actions shipped as stable functionality in Terraform v1.14, unveiled at HashiConf 2025. Providers can now expose actions that go beyond CRUD: invoking a Lambda function, stopping an EC2 instance, invalidating a CloudFront cache, or triggering an Ansible playbook.
Actions live in their own top-level action block in your HCL. Terraform can execute them automatically on a resource's lifecycle events, or you can invoke them manually from the CLI without running a complete terraform apply.
That means you can run an operational action (say, invoking a Lambda to warm a cache) without Terraform re-evaluating the entire state of your infrastructure. This is a significant shift in how Terraform gets used.
The AWS provider currently has:
- aws_lambda_invoke
- aws_ec2_stop_instance
- aws_cloudfront_create_invalidation
How Actions Actually Work — The Syntax
There are two pieces. The action block itself, and the trigger that fires it.
Defining an Action
```hcl
action "aws_lambda_invoke" "warm_cache" {
  config {
    function_name = aws_lambda_function.cache_warmer.function_name
    payload = jsonencode({
      source = "terraform_action"
    })
  }
}
```
Note the config {} wrapper. Provider-specific arguments go inside config, not directly in the action block.
Meta-arguments like count and provider exist outside config:
```hcl
action "aws_lambda_invoke" "warm_cache" {
  count    = var.invoke_on_deploy ? 1 : 0
  provider = aws.us_east_1

  config {
    function_name = aws_lambda_function.cache_warmer.function_name
    payload       = jsonencode({ source = "terraform_action" })
  }
}
```
Triggering an Action on Resource Lifecycle Events
This goes inside the resource's lifecycle block:
```hcl
resource "aws_lambda_function" "api" {
  function_name = "my-api-handler"
  # ... rest of config

  lifecycle {
    action_trigger {
      events  = [after_create, after_update]
      actions = [action.aws_lambda_invoke.warm_cache]
    }
  }
}
```
Two main things to understand:
- events uses unquoted keywords: after_create and after_update
- actions is plural and takes a list, not a single reference
You can also add a condition to guard the action:
```hcl
lifecycle {
  action_trigger {
    events    = [after_create]
    actions   = [action.ansible_playbook.patch_instance]
    condition = var.enable_auto_patching
  }
}
```
When condition is false, the action is skipped completely. This is useful when the configuration should exist but only run in certain environments, like production.
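For instance, to run an action only in production, the guard can be any boolean expression. A minimal sketch, assuming a var.environment variable that is not defined elsewhere in this article:

```hcl
variable "environment" {
  type    = string
  default = "dev"
}

resource "aws_instance" "app" {
  # ... rest of config

  lifecycle {
    action_trigger {
      events  = [after_create]
      actions = [action.ansible_playbook.patch_instance]
      # Only fires in production; skipped everywhere else.
      condition = var.environment == "production"
    }
  }
}
```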
Running Actions from the CLI
This is where it gets useful for Day-2 workflows:
```shell
# Just plan the action, don't run it
terraform plan -invoke=action.aws_lambda_invoke.warm_cache

# Actually run the action
terraform apply -invoke=action.aws_lambda_invoke.warm_cache
```
Terraform executes only that one action; nothing else in your configuration is evaluated or changed. Only one action can be invoked per command, so you cannot pass multiple -invoke flags at once.
Provisioning EC2 + Immediate Patching via Ansible Automation Platform
One of the most compelling use cases is linking EC2 provisioning to automated patching through Ansible Automation Platform (AAP).
The problem it solves is simple. An Ubuntu AMI built a few months ago usually has a pile of security patches pending. If you provision EC2 instances and then take the time to patch each one by hand, sooner or later (most likely within 30 days) you will miss one. The fix is to tie patching to the instance's Terraform lifecycle so it cannot be skipped.
The Terraform Side
```hcl
variable "instance_count" {
  type    = number
  default = 2
}

variable "ubuntu_ami" {
  type        = string
  description = "AMI ID — use a recent Ubuntu LTS, patching will handle the rest"
}

variable "aap_controller_url" {
  type      = string
  sensitive = true
}

variable "aap_oauth_token" {
  type      = string
  sensitive = true
}

variable "allow_instance_reboot" {
  type    = bool
  default = false
}

resource "aws_instance" "app_servers" {
  count         = var.instance_count
  ami           = var.ubuntu_ami
  instance_type = "t3.medium"
  subnet_id     = aws_subnet.public.id
  key_name      = aws_key_pair.deployer.key_name

  vpc_security_group_ids = [aws_security_group.allow_ssh.id]

  tags = {
    Name      = "app-server-${count.index}"
    ManagedBy = "terraform"
  }

  lifecycle {
    action_trigger {
      events  = [after_create, after_update]
      actions = [action.ansible_aap_job.patch_servers]
    }
  }
}
```
The after_update event is important: if an instance is updated or replaced (an AMI bump, an instance type change, anything that forces a new instance), the resulting instance is patched automatically, with no manual intervention required.
The Action Block
```hcl
action "ansible_aap_job" "patch_servers" {
  config {
    controller_url    = var.aap_controller_url
    oauth_token       = var.aap_oauth_token
    job_template_name = "EC2 Linux Patching"

    extra_vars = jsonencode({
      vm_hosts = [
        for instance in aws_instance.app_servers : {
          instance_id = instance.id
          public_ip   = instance.public_ip
        }
      ]
      allow_reboot = var.allow_instance_reboot
    })
  }
}
```
Credentials are stored in HCP Terraform's sensitive variable store. Instance IDs and IPs come straight from resource state at runtime, so AAP always gets current values.
Note: check your provider's documentation to confirm the exact argument names for the AAP action in the version you are using; the overall structure stays the same.
The Ansible Playbook
AAP receives vm_hosts as an extra variable, builds inventory dynamically, and patches:
```yaml
---
- name: Patch EC2 Instances
  hosts: all
  gather_facts: yes
  become: yes

  pre_tasks:
    - name: Wait for SSH connectivity
      ansible.builtin.wait_for_connection:
        timeout: 120
        delay: 10

    - name: Gather package facts
      ansible.builtin.package_facts:
        manager: apt

  tasks:
    - name: Update apt package index
      ansible.builtin.apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Apply security patches
      ansible.builtin.apt:
        upgrade: dist
        only_upgrade: yes
      register: patch_result

    - name: Check if reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required_file

    - name: Reboot if needed and allowed
      ansible.builtin.reboot:
        reboot_timeout: 300
        post_reboot_delay: 30
      when:
        - reboot_required_file.stat.exists
        - allow_reboot | default(false) | bool

  post_tasks:
    - name: Verify instance is up after patching
      ansible.builtin.ping:
```
/var/run/reboot-required is a file Ubuntu creates automatically when a package update (typically a kernel patch) needs a restart to take effect. The playbook checks for that file rather than rebooting blindly, and even then it only reboots if allow_reboot is true, which is controlled from your Terraform variables.
AAP Job Template Configuration
On the Ansible Automation Platform side:
- Project: points to the Git repo containing your playbook.
- Inventory: built dynamically from the vm_hosts value Terraform passes in at runtime.
- Credentials: your SSH private key lives in AAP's credential vault and is used to connect to the instances over SSH. This is a clean separation between what the infrastructure is (Terraform) and how to connect to it (Ansible).
What the Full Workflow Looks Like
An engineer changes instance_count from 2 to 5 and commits the change to Git.
The push lands in HCP Terraform, which detects the change and runs a plan. The plan shows that Terraform will create three new EC2 instances and submit an action request once they are created.
An engineer reviews and approves the plan, and Terraform applies it, creating three EC2 instances in AWS. The action_trigger then fires: Terraform calls AAP's API with the new instance IDs and public IP addresses to start the patching job.
AAP builds a dynamic inventory from the hosts Terraform sent and waits until all three instances are reachable over SSH. It then runs the apt dist-upgrade, checks whether a reboot is required, and reboots if allowed. As each instance comes back online and responds normally, AAP reports back to Terraform.
With those reports in, Terraform marks the run complete.
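A useful side effect: because the action is ordinary HCL, the same patch job can be re-run on demand for routine Day-2 patching, with no resource changes at all. A sketch, reusing the action defined above:

```shell
# Re-run patching across all app servers; no resources are planned or changed
terraform apply -invoke=action.ansible_aap_job.patch_servers
```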
Other Places Actions Are Immediately Useful
CloudFront invalidation after S3 deployments
```hcl
action "aws_cloudfront_create_invalidation" "bust_cache" {
  config {
    distribution_id = aws_cloudfront_distribution.website.id
    paths           = ["/*"]
  }
}

resource "aws_s3_object" "site_bundle" {
  # ... bucket, key, source, etc.

  lifecycle {
    action_trigger {
      events  = [after_update]
      actions = [action.aws_cloudfront_create_invalidation.bust_cache]
    }
  }
}
```
Lambda warm-up after deployments: cold starts on the first production request after a deployment are a common source of slow or failed requests, so the function is invoked immediately after deployment to absorb the cold start before users hit it.
```hcl
action "aws_lambda_invoke" "warm_up" {
  config {
    function_name = aws_lambda_function.api_handler.function_name
    payload       = jsonencode({ source = "warmup" })
  }
}

resource "aws_lambda_function" "api_handler" {
  # ... rest of config

  lifecycle {
    action_trigger {
      events  = [after_create, after_update]
      actions = [action.aws_lambda_invoke.warm_up]
    }
  }
}
```
Stopping dev instances on demand
```hcl
action "aws_ec2_stop_instance" "stop_dev" {
  config {
    instance_id = aws_instance.dev_server.id
  }
}
```

```shell
terraform apply -invoke=action.aws_ec2_stop_instance.stop_dev
```
Chaining multiple actions: actions is a list, order is respected, and each action completes before the next starts.
```hcl
lifecycle {
  action_trigger {
    events  = [after_create]
    actions = [
      action.ansible_aap_job.patch_servers,
      action.aws_lambda_invoke.register_in_cmdb,
      action.aws_lambda_invoke.notify_slack
    ]
  }
}
```
Things That Will Catch You Out
A failing action blocks the run from completing:
By default, Terraform waits for each triggered action to finish before it can complete the run. That gives you visibility into action status, but it also means that if AAP goes down right before a critical deployment, Terraform sits there waiting on it. Use condition guards on actions that are nice-to-have rather than essential to the deployment itself.
Idempotency is not a luxury:
The after_update event fires on every resource change, which means your playbooks and Lambda handlers will be invoked many times over the life of your infrastructure. Running apt dist-upgrade twice is fine; running a database migration twice is not. Design for re-execution from the very beginning.
Actions do not write to state:
When an action runs, nothing is written to the Terraform state file. The only records of its execution are the run history in HCP Terraform and the logs of the other systems involved (AAP job history, CloudWatch, and so on). Plan your observability around that.
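One way to compensate is to build the trail yourself: chain a lightweight notification action at the end of every trigger, so each execution leaves a record in a system you already monitor. A sketch, assuming a hypothetical slack_notifier Lambda that is not part of the examples above:

```hcl
# Hypothetical: a Lambda that posts a message to a Slack channel.
action "aws_lambda_invoke" "notify_slack" {
  config {
    function_name = aws_lambda_function.slack_notifier.function_name
    payload = jsonencode({
      event  = "patch_servers_completed"
      source = "terraform_action"
    })
  }
}

# Chained last, so it only fires after the real work has succeeded.
lifecycle {
  action_trigger {
    events  = [after_create, after_update]
    actions = [
      action.ansible_aap_job.patch_servers,
      action.aws_lambda_invoke.notify_slack
    ]
  }
}
```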
The provider support continues to grow:
As of Terraform v1.14, the AWS provider exposes only a narrow set of action types. Check the Terraform Registry and provider changelogs before assuming an operation is available as an action.
CLI invocation requires existing resources:
If your action references an attribute like instance.id and the instance doesn't exist in state, the -invoke option fails during the plan phase. CLI-invoked actions should reference resources that have already been provisioned.
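If you hit this, one workaround is to create the resource first (for example with a targeted apply) and invoke the action afterwards. A sketch, using the dev server example from earlier:

```shell
# First make sure the instance exists in state...
terraform apply -target=aws_instance.dev_server

# ...then invoke the action against it
terraform apply -invoke=action.aws_ec2_stop_instance.stop_dev
```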
The Actual Shift
Almost all infrastructure management involves Day-2 operations.
Until now, Day-2 operations lived in runbooks, in a Jenkins job only a few people understood, or in a bash script last modified years ago. They were reactive: something breaks, somebody performs an action.
With Terraform Actions, Day-2 operations can live alongside the infrastructure they manage, in the same repository, the same pull request workflow, and the same audit trail. Patch management becomes part of the definition of the infrastructure itself.
This kind of change reduces the number of incidents occurring at 2:00 AM.
Terraform Actions is stable from Terraform CLI v1.14.0. Check developer.hashicorp.com/terraform/language/invoke-actions for official documentation and your provider's registry page for supported action types.
Technical insights sourced from a community session on Terraform Day-2 operations.
For more developer content, visit vickybytes.com