Detecting Infrastructure Drift with Terraform and Python
How I built AWSDriftGuard, a CLI tool to detect discrepancies between your AWS infrastructure and Terraform state files.
The Problem
Infrastructure drift is one of those problems that sneaks up on you. Someone makes a manual change in the AWS console, a script modifies a security group, or an auto-scaling event creates resources that Terraform doesn't know about. Before you know it, your Terraform state and actual infrastructure are out of sync.
What is Infrastructure Drift?
Drift occurs when the actual state of your cloud resources differs from what your Infrastructure as Code (IaC) tool believes the state to be. This can lead to:
- Security vulnerabilities — ports opened manually that bypass your code review process
- Deployment failures — Terraform plans that fail because reality doesn't match expectations
- Cost overruns — orphaned resources that nobody knows about
Building AWSDriftGuard
I built AWSDriftGuard as a Python CLI tool that compares AWS resources against Terraform state files. The core architecture is straightforward:
    def detect_drift(terraform_state, aws_resources):
        """Compare resource IDs per type. Both inputs map a resource type to a list of dicts with an "id" key."""
        drift_report = []
        # Note: resource types that appear only in aws_resources are never visited
        for resource_type, tf_resources in terraform_state.items():
            aws_actual = aws_resources.get(resource_type, [])
            # Build the ID sets once so each membership test is a constant-time lookup
            tf_ids = {t["id"] for t in tf_resources}
            aws_ids = {a["id"] for a in aws_actual}
            # Find resources in Terraform but not in AWS (deleted)
            for tf_res in tf_resources:
                if tf_res["id"] not in aws_ids:
                    drift_report.append({
                        "type": "DELETED",
                        "resource": tf_res,
                    })
            # Find resources in AWS but not in Terraform (unmanaged)
            for aws_res in aws_actual:
                if aws_res["id"] not in tf_ids:
                    drift_report.append({
                        "type": "UNMANAGED",
                        "resource": aws_res,
                    })
        return drift_report
Supported Resources
The tool currently supports drift detection for:
- EC2 instances — including tags, security groups, and instance types
- S3 buckets — policies, versioning, and encryption settings
- RDS instances — parameter groups and backup configurations
- IAM roles — attached policies and trust relationships
- Security groups — inbound and outbound rules
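The list above implies comparing attributes, not only checking that a resource exists. Here is a minimal sketch of such a comparison, assuming each resource has already been normalized into a flat dict with matching key names. The helper name `compare_attributes`, the `MODIFIED` label, and the field names are hypothetical and not taken from AWSDriftGuard itself.

```python
def compare_attributes(tf_res, aws_res, fields):
    """Return the fields whose values differ between Terraform and AWS.

    Hypothetical helper: assumes both resources are flat dicts that share
    key names, which raw state files and API responses do not do by default.
    """
    changes = {}
    for field in fields:
        expected = tf_res.get(field)
        actual = aws_res.get(field)
        # Lists of IDs (e.g. security groups) are compared without regard to order
        if isinstance(expected, list) and isinstance(actual, list):
            differs = sorted(expected) != sorted(actual)
        else:
            differs = expected != actual
        if differs:
            changes[field] = {"expected": expected, "actual": actual}
    return changes


tf_instance = {
    "id": "i-0abc",
    "instance_type": "t3.micro",
    "security_groups": ["sg-1", "sg-2"],
    "tags": {"Name": "web"},
}
aws_instance = {
    "id": "i-0abc",
    "instance_type": "t3.large",          # resized by hand in the console
    "security_groups": ["sg-2", "sg-1"],  # same groups, different order
    "tags": {"Name": "web"},
}

changes = compare_attributes(
    tf_instance, aws_instance, ["instance_type", "security_groups", "tags"]
)
if changes:
    print({"type": "MODIFIED", "id": tf_instance["id"], "changes": changes})
```

Only `instance_type` is reported here, because the reordered security groups and identical tags do not count as drift under this comparison.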
Slack Integration
Detection is only useful if the right people know about it. I integrated the Slack API to send drift reports directly to the team's infrastructure channel:
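    # Assumes client is a slack_sdk WebClient created elsewhere with a bot token,
    # and that format_drift_as_blocks is defined elsewhere in the tool.
    def send_slack_notification(drift_report, channel):
        blocks = format_drift_as_blocks(drift_report)
        client.chat_postMessage(
            channel=channel,
            blocks=blocks,
            # Plain-text fallback for notifications and clients that cannot render blocks
            text=f"Infrastructure drift detected: {len(drift_report)} issues found",
        )
Running Modes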
AWSDriftGuard supports two modes:
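How the tool parses its arguments is not shown in this post. As a rough sketch, the two subcommands could be declared with the standard library's `argparse`. The flag names come from the example commands in this section, while `build_parser` and the choice of which flags are required are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with the two subcommands described in this post."""
    parser = argparse.ArgumentParser(prog="awsdriftguard")
    subcommands = parser.add_subparsers(dest="mode", required=True)

    detect = subcommands.add_parser("detect", help="print drift to the console")
    detect.add_argument("--state", required=True, help="path to terraform.tfstate")
    detect.add_argument("--region", help="AWS region to scan")

    report = subcommands.add_parser("report", help="send a drift report to Slack")
    report.add_argument("--state", required=True, help="path to terraform.tfstate")
    report.add_argument("--slack-channel", required=True, help="target channel")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere
args = build_parser().parse_args(
    ["report", "--state", "./terraform.tfstate", "--slack-channel", "#infrastructure"]
)
print(args.mode, args.state, args.slack_channel)
```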
Detect Mode
Outputs drift results to the console. Perfect for local development and CI pipelines.
    awsdriftguard detect --state ./terraform.tfstate --region us-east-1
Report Mode
Sends a formatted drift report to Slack, ideal for scheduled runs via cron or CloudWatch Events.
    awsdriftguard report --state ./terraform.tfstate --slack-channel "#infrastructure"
Lessons Learned
Boto3 pagination is essential
Many AWS list and describe calls return their results one page at a time. Always use boto3 paginators when listing resources: a single call returns only the first page, so anything beyond it is silently missed in large accounts.
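As an illustration, here is a minimal sketch that walks every page of EC2's `describe_instances` with a boto3 paginator. The function name `list_instance_ids` is made up for this example. The client is passed in so the function can be exercised without AWS credentials.

```python
def list_instance_ids(ec2_client):
    """Collect every EC2 instance ID by walking all pages of describe_instances."""
    paginator = ec2_client.get_paginator("describe_instances")
    instance_ids = []
    for page in paginator.paginate():
        # Each page groups instances under reservations
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                instance_ids.append(instance["InstanceId"])
    return instance_ids


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   print(len(list_instance_ids(ec2)))
```

The same pattern applies to other services: ask the client for a paginator by operation name, then iterate over `paginate()` instead of calling the operation directly.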
State file parsing requires care
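Terraform state files can be complex, especially with modules and nested resources. Python's json module handles the parsing, but the structure varies between Terraform versions: state format version 4 (written by Terraform 0.12 and later) keeps a flat top-level resources list, while older formats nest resources under a modules list. Check the file's version field before walking it.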
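Below is a minimal sketch of reading a state file, assuming state format version 4 (the format written by Terraform 0.12 and later) and ignoring everything except resource IDs. The function name `load_state_resources` and the sample state are illustrative only. The result is grouped by resource type, the same shape that `detect_drift` expects for its Terraform input.

```python
import json


def load_state_resources(state_text):
    """Group managed resource IDs by type from a version 4 state document."""
    state = json.loads(state_text)
    if state.get("version") != 4:
        raise ValueError(f"unsupported state version: {state.get('version')}")

    by_type = {}
    for resource in state.get("resources", []):
        # Data sources are read-only lookups, not infrastructure Terraform manages
        if resource.get("mode") != "managed":
            continue
        # Resources inside a module carry a "module" key such as "module.storage"
        parts = (resource.get("module"), resource["type"], resource["name"])
        address = ".".join(part for part in parts if part)
        for instance in resource.get("instances", []):
            attributes = instance.get("attributes") or {}
            by_type.setdefault(resource["type"], []).append(
                {"id": attributes.get("id"), "address": address}
            )
    return by_type


sample_state = """
{
  "version": 4,
  "resources": [
    {"mode": "managed", "type": "aws_instance", "name": "web",
     "instances": [{"attributes": {"id": "i-0abc"}}]},
    {"module": "module.storage", "mode": "managed", "type": "aws_s3_bucket",
     "name": "logs", "instances": [{"attributes": {"id": "my-logs-bucket"}}]},
    {"mode": "data", "type": "aws_ami", "name": "ubuntu",
     "instances": [{"attributes": {"id": "ami-123"}}]}
  ]
}
"""
print(load_state_resources(sample_state))
```

Rejecting unknown versions up front is a deliberate choice in this sketch: failing loudly is safer than reporting every resource as drifted because the parser walked the wrong structure.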
Rate limiting
When checking many resources, AWS API rate limits become a factor. Implementing exponential backoff and batching requests made the tool reliable for large accounts.
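The backoff idea can be sketched in a few lines. This version uses exponential backoff with full jitter. The names `call_with_backoff` and `ThrottledError` are hypothetical stand-ins, not AWSDriftGuard's code, and the `sleep` parameter exists only so the demo runs without actually waiting.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for an AWS throttling error (hypothetical, for illustration)."""


def call_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call func(), retrying on ThrottledError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # The delay ceiling doubles each attempt, capped at max_delay
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random time between 0 and the ceiling
            sleep(random.uniform(0, ceiling))


# Demo: a fake API call that is throttled twice before it succeeds
attempts = {"count": 0}


def flaky_describe():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError("Rate exceeded")
    return ["i-0abc"]


result = call_with_backoff(flaky_describe, sleep=lambda seconds: None)
print(result, "after", attempts["count"], "attempts")
```

With boto3, the `except` clause would instead catch `botocore.exceptions.ClientError` and inspect the error code for throttling. botocore also ships its own retry handling, configurable through `botocore.config.Config(retries={"max_attempts": 10, "mode": "adaptive"})`, which may be enough before writing a custom wrapper.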
What's Next
I'm working on adding support for more resource types (Lambda functions, DynamoDB tables) and implementing a web dashboard for historical drift tracking.