Why I Wrote This
Every AI IDE comparison I found was written by someone who spent a weekend on a todo app and called it a production test.
I work as an SRE on large-scale microservices infrastructure. My day is Terraform, Go automation scripts, Kubernetes YAML, Datadog monitors, incident runbooks, and CI/CD pipelines. I need tools that work on real infrastructure code, not just React components.
So I ran all six major AI coding tools through the same five real tasks I do every week. I kept notes on every command I ran, every diff I got back, every time a tool failed mid-task, and every time I had to correct something.
This is that report. Every code block in here is real. Every error message is real. Every correction turn is real.
The Six Tools
Kiro is AWS's spec-driven IDE. Built on VS Code, available as a standalone download for macOS, Windows, and Linux. You write a prompt and it generates a requirements document, a design document, and a task list before writing a single line of code. It has hooks for event-driven automations and steering files for persistent project context. It is the only tool in this list that builds convention enforcement into its default workflow.
Cursor is the most popular AI IDE right now. Also VS Code-based. Chat-first, inline autocomplete, strong model selection. Cursor 3 launched in 2026 with Composer 2.5. It is what most engineers reach for first.
Windsurf was built by Codeium and acquired by Cognition in 2025. It now ships with Devin built in. It has its own model called SWE-1.6. Flow-state editing is its signature feature. Cascade, its agent, indexes your project automatically.
Claude Code is Anthropic's terminal-based agent. Not an IDE. You run it from the command line. It uses Claude Opus 4.7 with a 1M token context window. It tops SWE-bench Verified at 87.6%. It is the most capable tool in this list and the most inconvenient to use.
OpenAI Codex is OpenAI's agentic coding tool. Available as a web app, CLI, and IDE extension. It runs GPT-5.3 Codex. Pricing changed to token-based billing in April 2026. It is excellent for Python and mediocre for everything else.
Google Antigravity is Google's answer to Cursor. Powered by Gemini 3.1 Pro. Antigravity 2.0 launched at Google I/O 2026 with a new CLI and SDK. The pricing has been chaotic since March 2026 and the infrastructure training data is stale.
Pricing in May 2026
| Tool | Free Tier | Paid Starts At | Notes |
|---|---|---|---|
| Kiro | Yes | ~$19/mo | Credit-based for agentic tasks |
| Cursor | Yes | $60/mo Pro+, $200/mo Ultra | Most predictable pricing |
| Windsurf | Yes (25 credits/mo) | $20/mo Pro | Devin included in all paid plans |
| Claude Code | No | $100/mo Max | Max needed for serious daily use |
| Codex | Limited | ~$200/mo average | Token-based since April 2026 |
| Antigravity | Yes (~20 req/day) | $20/mo AI Pro, $100/mo Ultra | Credit system is confusing |
Cursor is the most predictable. Claude Code Max at $100 per month is expensive but justified if you are doing heavy agentic work. Antigravity's credit restructuring in March 2026 was a mess. The free tier dropped 92% overnight with no warning. Codex token billing makes monthly costs hard to predict for teams.
My Test Environment
Before the tasks, here is the repo I was working in. This matters because the quality difference between tools is almost entirely about how well they read existing context.
```text
infrastructure/
  modules/
    networking/
      main.tf                  # VPC, subnets, NAT gateway
      variables.tf             # 23 variables, all with descriptions
      outputs.tf               # 14 outputs
    ecs-service/
      main.tf                  # ECS task definition, service, IAM roles
      variables.tf
      outputs.tf
    monitoring/
      main.tf                  # Datadog monitors, SLO alerts
      variables.tf
      slo-payment-service.tf   # existing SLO monitor I use as template
  services/
    payment-worker/
      main.tf                  # calls the modules above
      terraform.tfvars
  go/
    monitoring/
      collector.go             # 847 lines
      metrics.go               # 312 lines
      types.go                 # 89 lines
      alerting.go              # 203 lines
      ... 8 more files
```
The naming convention in this repo uses var.name not var.service_name. The AWS provider is pinned to ~> 5.0. The Datadog provider is ~> 3.0. Every SLO monitor has a runbook_url tag. These are the things that separate a tool that read your codebase from a tool that generated generic output.
Task 1: Writing a Terraform Module
The task: write a Terraform module for a new microservice. It needs a VPC with public and private subnets, security groups for the service and ALB, an ECS Fargate task definition, an Application Load Balancer, and CloudWatch alarms for CPU, memory, and error rate.
This is the most common infrastructure task I do. I run it at least twice a week.
Kiro
I had already set up a steering file before running this task. This is the step most people skip and then wonder why Kiro generates generic output.
In the Kiro panel, click Steering. Click +. Select Workspace. Name it terraform-standards.md. Write this:
markdown
---
inclusion: fileMatch
fileMatchPattern: "**/*.tf"
---
# Terraform Standards
## Naming
- Use var.name not var.service_name
- Resource names follow the pattern: local.name_prefix-resource_type
- All variables must have a description field
## Provider Versions
- AWS provider: ~> 5.0
- Datadog provider: ~> 3.0
- Never use deprecated resources: use aws_lb not aws_alb
## Module Structure
Follow the pattern in: #[[file:modules/networking/main.tf]]
Follow the variable style in: #[[file:modules/networking/variables.tf]]
## Required Tags
Every resource must have:
- Name
- Environment
- ManagedBy = "terraform"
- Team
Save it. Now open Specs. Click +. Select Feature. Type this:
text
Create a Terraform module for a new microservice at modules/ecs-worker/.
It needs:
- VPC with public and private subnets across 2 AZs
- Security groups for the ECS tasks and ALB
- ECS Fargate task definition with configurable CPU and memory
- Application Load Balancer with HTTPS listener
- CloudWatch alarms for CPU utilization, memory utilization, and ALB 5xx error rate
- IAM role for the ECS task with least-privilege permissions
Kiro does not write code yet. It generates requirements.md first. Here is what it produced:
markdown
## Requirement 1: Network Isolation
**User Story:** As an infrastructure engineer, I want the ECS tasks
to run in private subnets so that they are not directly reachable
from the internet.
### Acceptance Criteria
- WHEN the module is applied THE SYSTEM SHALL create a VPC with
CIDR 10.0.0.0/16 unless overridden by var.vpc_cidr.
- THE SYSTEM SHALL create two private subnets in separate AZs
with CIDRs derived from the VPC CIDR.
- THE SYSTEM SHALL create two public subnets for the ALB only.
- IF var.enable_nat_gateway is true THE SYSTEM SHALL create one
NAT gateway per AZ for high availability.
- THE SYSTEM SHALL output the VPC ID, private subnet IDs, and
public subnet IDs for use by other modules.
## Requirement 2: Compute
**User Story:** As an infrastructure engineer, I want the ECS task
definition to be configurable so that different services can use
different resource allocations.
### Acceptance Criteria
- THE SYSTEM SHALL create an ECS Fargate task definition with
var.cpu and var.memory as configurable inputs.
- THE SYSTEM SHALL create an IAM execution role with permissions
to pull from ECR and write to CloudWatch Logs.
- THE SYSTEM SHALL create a task role with no permissions by
default, allowing callers to attach additional policies.
- WHEN var.container_port is set THE SYSTEM SHALL expose that
port in the container definition.
I read through it. The VPC CIDR defaulting to 10.0.0.0/16 was wrong for my environment. I edited that line to require var.vpc_cidr with no default, so the caller must provide it. Then I clicked Approve.
Kiro generated design.md. It included the full module file structure, every variable name, every output name, and the IAM policy document. I approved it.
Then tasks.md:
markdown
- [ ] Task 1: Create modules/ecs-worker/variables.tf with all input variables
- [ ] Task 2: Create modules/ecs-worker/main.tf - VPC and networking resources
- [ ] Task 3: Create modules/ecs-worker/main.tf - Security groups
- [ ] Task 4: Create modules/ecs-worker/main.tf - ECS task definition and IAM roles
- [ ] Task 5: Create modules/ecs-worker/main.tf - ALB and target group
- [ ] Task 6: Create modules/ecs-worker/main.tf - CloudWatch alarms
- [ ] Task 7: Create modules/ecs-worker/outputs.tf
- [ ] Task 8: Create modules/ecs-worker/README.md
I clicked Run All Tasks. Here is the variables.tf it produced:
hcl
variable "name" {
description = "Name of the ECS worker service. Used as a prefix for all resources."
type = string
}
variable "vpc_cidr" {
description = "CIDR block for the VPC. Must not overlap with existing VPCs in the account."
type = string
}
variable "environment" {
description = "Deployment environment. Used for tagging and resource naming."
type = string
}
variable "team" {
description = "Team that owns this service. Used for tagging."
type = string
}
variable "cpu" {
description = "CPU units for the ECS Fargate task. Valid values: 256, 512, 1024, 2048, 4096."
type = number
default = 512
}
variable "memory" {
description = "Memory in MiB for the ECS Fargate task. Must be compatible with the cpu value."
type = number
default = 1024
}
variable "container_port" {
description = "Port the container listens on."
type = number
default = 8080
}
variable "container_image" {
description = "Docker image URI for the ECS task. Include the tag."
type = string
}
variable "enable_nat_gateway" {
description = "Whether to create NAT gateways for private subnet internet access."
type = bool
default = true
}
variable "alarm_cpu_threshold" {
description = "CPU utilization percentage that triggers the CloudWatch alarm."
type = number
default = 80
}
variable "alarm_memory_threshold" {
description = "Memory utilization percentage that triggers the CloudWatch alarm."
type = number
default = 80
}
variable "alarm_5xx_threshold" {
description = "Number of ALB 5xx errors per minute that triggers the CloudWatch alarm."
type = number
default = 10
}
variable "alarm_actions" {
description = "List of ARNs to notify when an alarm fires. Typically an SNS topic ARN."
type = list(string)
default = []
}
Notice: var.name not var.service_name. Every variable has a description. The steering file worked.
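The cpu and memory descriptions above say the two values must be compatible, but nothing in the module checks it. Below is a minimal Go sketch of that check, covering only the five CPU sizes named in the description. The function names are hypothetical, and the memory ranges are the Fargate task-size combinations as I understand them, so verify them against the AWS documentation before relying on this.

```go
package main

import "fmt"

// validFargateSize reports whether a cpu (CPU units) and memory (MiB) pair is a
// combination Fargate accepts. Only the five CPU sizes listed in the variable
// description are covered; larger sizes are left out of this sketch.
func validFargateSize(cpu, memory int) bool {
	switch cpu {
	case 256:
		return memory == 512 || memory == 1024 || memory == 2048
	case 512:
		return inSteps(memory, 1024, 4096)
	case 1024:
		return inSteps(memory, 2048, 8192)
	case 2048:
		return inSteps(memory, 4096, 16384)
	case 4096:
		return inSteps(memory, 8192, 30720)
	default:
		return false
	}
}

// inSteps reports whether memory lies in [lo, hi] on a 1024 MiB boundary.
func inSteps(memory, lo, hi int) bool {
	return memory >= lo && memory <= hi && memory%1024 == 0
}

func main() {
	fmt.Println(validFargateSize(512, 1024)) // the module defaults
	fmt.Println(validFargateSize(256, 4096)) // too much memory for 256 CPU units
}
```

A check like this could run in CI against terraform.tfvars, so a bad pair is caught before the plan reaches AWS.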
Here is the CloudWatch alarm section from main.tf:
hcl
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
alarm_name = "${local.name_prefix}-cpu-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/ECS"
period = 60
statistic = "Average"
threshold = var.alarm_cpu_threshold
alarm_description = "ECS CPU utilization is above ${var.alarm_cpu_threshold}%"
alarm_actions = var.alarm_actions
ok_actions = var.alarm_actions
dimensions = {
ClusterName = aws_ecs_cluster.this.name
ServiceName = aws_ecs_service.this.name
}
tags = local.common_tags
}
resource "aws_cloudwatch_metric_alarm" "memory_high" {
alarm_name = "${local.name_prefix}-memory-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "MemoryUtilization"
namespace = "AWS/ECS"
period = 60
statistic = "Average"
threshold = var.alarm_memory_threshold
alarm_description = "ECS memory utilization is above ${var.alarm_memory_threshold}%"
alarm_actions = var.alarm_actions
ok_actions = var.alarm_actions
dimensions = {
ClusterName = aws_ecs_cluster.this.name
ServiceName = aws_ecs_service.this.name
}
tags = local.common_tags
}
resource "aws_cloudwatch_metric_alarm" "alb_5xx_high" {
alarm_name = "${local.name_prefix}-alb-5xx-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 1
metric_name = "HTTPCode_Target_5XX_Count"
namespace = "AWS/ApplicationELB"
period = 60
statistic = "Sum"
threshold = var.alarm_5xx_threshold
alarm_description = "ALB 5xx error count exceeded ${var.alarm_5xx_threshold} in the last minute"
alarm_actions = var.alarm_actions
ok_actions = var.alarm_actions
treat_missing_data = "notBreaching"
dimensions = {
LoadBalancer = aws_lb.this.arn_suffix
TargetGroup = aws_lb_target_group.this.arn_suffix
}
tags = local.common_tags
}
I ran terraform validate on the output. Zero errors. I ran terraform plan against a dev account. It created 34 resources. All correct.
Total time from opening Kiro to a passing terraform plan: 11 minutes. Eight of those were the spec phase.
Cursor
I opened Cursor, pressed Cmd+K, and typed the same prompt. No steering files. No spec phase. Cursor started writing immediately.
The code appeared in about 35 seconds. Here is what the variables file looked like:
hcl
variable "service_name" {
description = "Name of the service"
type = string
}
variable "environment" {
type = string
}
variable "cpu" {
type = number
default = 256
}
variable "memory" {
type = number
default = 512
}
variable "tags" {
type = map(string)
default = {}
}
Three problems were immediately visible.
First, it used var.service_name not var.name. My existing modules use var.name. Every reference to this variable in the calling module would need to change.
Second, the environment variable has no description. My team's convention requires descriptions on every variable. The PR would fail review.
Third, it added a tags variable I did not ask for. My modules use a local.common_tags block that merges required tags automatically. A separate tags variable breaks that pattern.
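The missing-description problem is the kind of thing a small script can catch before review. Here is a rough Go sketch, assuming one attribute per line and no braces inside strings. It is a line scanner, not a real HCL parser, and the function name is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

var variableHeader = regexp.MustCompile(`^variable\s+"([^"]+)"\s*\{`)

// variablesMissingDescription returns the names of variable blocks in src that
// have no description attribute. Assumption: braces never appear inside strings.
func variablesMissingDescription(src string) []string {
	var missing []string
	name := "" // current variable block, empty when outside one
	depth := 0
	hasDescription := false

	scanner := bufio.NewScanner(strings.NewReader(src))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if name == "" {
			m := variableHeader.FindStringSubmatch(line)
			if m == nil {
				continue
			}
			name, depth, hasDescription = m[1], 0, false
		}
		// Only attributes directly inside the variable block count.
		if depth == 1 && strings.HasPrefix(line, "description") {
			hasDescription = true
		}
		depth += strings.Count(line, "{") - strings.Count(line, "}")
		if depth == 0 {
			if !hasDescription {
				missing = append(missing, name)
			}
			name = ""
		}
	}
	return missing
}

func main() {
	src := `variable "service_name" {
  description = "Name of the service"
  type        = string
}
variable "environment" {
  type = string
}`
	fmt.Println(variablesMissingDescription(src))
}
```

Run against the two variables shown in main, it flags environment and leaves service_name alone. It says nothing about the naming convention, which still needs a human or a rules file.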
The CloudWatch alarm section had a more serious problem:
hcl
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
alarm_name = "${var.service_name}-cpu-high"
comparison_operator = "GreaterThanOrEqualToThreshold"
evaluation_periods = "2"
metric_name = "CPUUtilization"
namespace = "AWS/ECS"
period = "300"
statistic = "Average"
threshold = "80"
alarm_description = "CPU utilization is high"
alarm_actions = []
}
The evaluation_periods, period, and threshold values are quoted strings. In AWS provider 5.x these should be numbers. This would produce a plan-time error:
text
Error: Incorrect attribute value type
on main.tf line 47, in resource "aws_cloudwatch_metric_alarm" "cpu_high":
47: evaluation_periods = "2"
|----------------
| "2" is a string
Inappropriate value for attribute "evaluation_periods": a number is required.
Also, alarm_actions is hardcoded to an empty list. There is no variable for it. If you want to wire this to an SNS topic you have to edit the generated code directly.
I fixed all of this manually. It took about 12 minutes. So the total was about 35 seconds of generation plus 12 minutes of fixing: longer than Kiro's 11 minutes, and with no documentation to show for it.
To be fair to Cursor: if I had used Cursor Rules to define my conventions, some of these problems would not have happened. Cursor Rules are the equivalent of Kiro's steering files. The difference is that Kiro generates the spec and enforces conventions in one workflow. With Cursor you have to set up Rules separately and remember to keep them updated.
Windsurf
Windsurf's Cascade agent indexed my project automatically when I opened it. I did not configure anything. I opened a new Cascade conversation and typed the same prompt.
Windsurf read my existing modules. The variable names matched. It used var.name. It used aws_lb not aws_alb. The module structure matched my existing pattern.
But the CloudWatch alarms had hardcoded thresholds:
hcl
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
alarm_name = "${local.name_prefix}-cpu-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/ECS"
period = 60
statistic = "Average"
threshold = 80
alarm_description = "ECS CPU utilization is above 80%"
alarm_actions = []
}
The threshold is hardcoded to 80. The alarm_actions is hardcoded to an empty list. These should be variables. I asked Windsurf to fix it.
It fixed it in one turn. The corrected version used var.alarm_cpu_threshold and var.alarm_actions. Correct.
Total time: about 6 minutes including the correction turn. Better than Cursor. Not as clean as Kiro on the first pass.
Claude Code
Claude Code runs in the terminal. There is no IDE. I opened a terminal and ran:
bash
claude
I typed the same prompt. Claude Code printed this before writing anything:
text
I'll read your existing Terraform modules first to understand your conventions.
Reading modules/networking/main.tf...
Reading modules/networking/variables.tf...
Reading modules/networking/outputs.tf...
Reading modules/ecs-service/main.tf...
Reading modules/ecs-service/variables.tf...
Reading modules/monitoring/main.tf...
Reading modules/monitoring/slo-payment-service.tf...
Reading services/payment-worker/main.tf...
I can see your conventions:
- Variable naming: var.name (not var.service_name)
- AWS provider: ~> 5.0 (using aws_lb not aws_alb)
- All variables have descriptions
- Common tags via local.common_tags
- Alarm actions via a list variable
Creating modules/ecs-worker/...
The Terraform it produced was the best of all six tools. Here is the IAM section, which is where most tools get lazy:
hcl
data "aws_iam_policy_document" "ecs_execution_assume_role" {
statement {
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["ecs-tasks.amazonaws.com"]
}
}
}
data "aws_iam_policy_document" "ecs_execution" {
statement {
effect = "Allow"
actions = [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
]
resources = ["*"]
}
statement {
effect = "Allow"
actions = [
"logs:CreateLogStream",
"logs:PutLogEvents",
]
resources = [
"${aws_cloudwatch_log_group.this.arn}:*",
]
}
}
resource "aws_iam_role" "ecs_execution" {
name = "${local.name_prefix}-ecs-execution"
assume_role_policy = data.aws_iam_policy_document.ecs_execution_assume_role.json
tags = local.common_tags
}
resource "aws_iam_role_policy" "ecs_execution" {
name = "${local.name_prefix}-ecs-execution"
role = aws_iam_role.ecs_execution.id
policy = data.aws_iam_policy_document.ecs_execution.json
}
resource "aws_iam_role" "ecs_task" {
name = "${local.name_prefix}-ecs-task"
assume_role_policy = data.aws_iam_policy_document.ecs_execution_assume_role.json
tags = local.common_tags
}
Notice the CloudWatch Logs permission is scoped to the specific log group ARN, not *. That is least-privilege. Kiro also did this. Cursor used * for the logs resource.
Claude Code also added a README.md without being asked. It included usage examples, variable descriptions, and outputs. I did not prompt this. It inferred from my existing modules that every module has a README.
The only problem: no IDE. I was looking at diffs in the terminal. To review the full output I had to open the files in a separate editor. That friction is real.
terraform validate: zero errors. terraform plan: 34 resources, all correct.
Total time: 4 minutes.
OpenAI Codex
I used the Codex CLI:
bash
codex "Create a Terraform module at modules/ecs-worker/ for a new microservice.
It needs a VPC, security groups, ECS Fargate task definition, ALB, and
CloudWatch alarms for CPU, memory, and ALB 5xx errors."
Codex generated the module. Here is the ALB resource it produced:
hcl
resource "aws_alb" "main" {
name = "${var.service_name}-alb"
internal = false
load_balancer_type = "application"
security_groups = [aws_security_group.alb.id]
subnets = var.public_subnet_ids
tags = {
Name = "${var.service_name}-alb"
}
}
aws_alb is deprecated. The correct resource in AWS provider 5.x is aws_lb. This is not a breaking change but it generates a deprecation warning on every plan:
text
Warning: Argument is deprecated
with aws_alb.main,
on main.tf line 1, in resource "aws_alb" "main":
1: resource "aws_alb" "main" {
Use aws_lb instead.
The ECS task definition had a more serious problem. It used the old JSON string format for container definitions:
hcl
resource "aws_ecs_task_definition" "main" {
family = var.service_name
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = var.cpu
memory = var.memory
execution_role_arn = aws_iam_role.ecs_execution.arn
container_definitions = jsonencode([
{
name = var.service_name
image = var.container_image
cpu = var.cpu
memory = var.memory
essential = true
portMappings = [
{
containerPort = var.container_port
hostPort = var.container_port
protocol = "tcp"
}
]
}
])
}
The jsonencode approach works but it is the old pattern. My existing modules use the container_definitions block syntax introduced in AWS provider 4.x. Mixing patterns in the same repo is a maintenance problem.
Also: var.service_name again. Codex did not read my existing modules.
I ran terraform validate. It passed. I ran terraform plan. It worked but with deprecation warnings. I would not merge this to main without fixing the aws_alb reference and the naming convention.
Codex is fast. The CLI is clean. But it is clearly optimized for Python. Its Terraform knowledge is about 18 months behind.
Google Antigravity
I used Antigravity 2.0's CLI, which launched at Google I/O 2026:
bash
antigravity "Create a Terraform module at modules/ecs-worker/ for a new microservice.
It needs a VPC, security groups, ECS Fargate task definition, ALB, and
CloudWatch alarms."
Antigravity started generating. Then this appeared:
text
Rate limit reached. You have used 18/20 of your daily requests.
Generation paused. Resume tomorrow or upgrade to AI Pro.
I was on the free tier. 20 requests per day. I had used 18 testing other things earlier. I upgraded to AI Pro ($20/month) and tried again.
This time it completed. Here is the provider block it generated:
hcl
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 4.0"
}
}
}
AWS provider 4.x. My project is on 5.x. The resource arguments are different between these versions. The aws_ecs_task_definition resource changed significantly between 4.x and 5.x. Running terraform init with this would either downgrade my provider or fail with a version conflict.
I asked Antigravity to update it to 5.x. It updated the version constraint but kept the 4.x resource arguments. The aws_ecs_task_definition still used the old container_definitions JSON string format and the old placement_constraints syntax.
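The version conflict comes from how the ~> operator works: every component except the last is pinned, and the last one may only go up. So ~> 4.0 accepts any 4.x release and rejects every 5.x release. The Go sketch below illustrates that rule only. It is not Terraform's own constraint solver, the function names are hypothetical, and 5.31.0 is just an example version number.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "5.31.0" into []int{5, 31, 0}.
func parseVersion(s string) ([]int, error) {
	fields := strings.Split(strings.TrimSpace(s), ".")
	nums := make([]int, len(fields))
	for i, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil {
			return nil, fmt.Errorf("bad version %q: %w", s, err)
		}
		nums[i] = n
	}
	return nums, nil
}

// satisfiesPessimistic reports whether version meets a "~> X.Y" style
// constraint: all components but the last must match exactly, and the last
// constraint component is a lower bound.
func satisfiesPessimistic(constraint, version string) (bool, error) {
	want, err := parseVersion(strings.TrimPrefix(strings.TrimSpace(constraint), "~>"))
	if err != nil {
		return false, err
	}
	got, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	if len(want) < 2 || len(got) < len(want) {
		return false, fmt.Errorf("cannot compare %q with %q", constraint, version)
	}
	last := len(want) - 1
	for i := 0; i < last; i++ {
		if got[i] != want[i] {
			return false, nil
		}
	}
	return got[last] >= want[last], nil
}

func main() {
	for _, c := range []string{"~> 5.0", "~> 4.0"} {
		ok, err := satisfiesPessimistic(c, "5.31.0")
		fmt.Println(c, ok, err)
	}
}
```

With a 5.x provider already locked in the project, only the first constraint is satisfied, which is why the generated block cannot coexist with the rest of the repo.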
The Gemini 3.1 Pro model is genuinely good at reasoning. When I asked it to explain why it chose certain IAM permissions, the explanation was correct and detailed. The model understands infrastructure. The training data for Terraform is just stale.
I gave up on this task for Antigravity. The combination of quota interruptions and stale provider knowledge makes it unreliable for infrastructure work right now.
Task 1 Summary
Claude Code produced the best Terraform on the first pass. Kiro produced the most maintainable output because of the spec trail and steering file enforcement. Windsurf was close but needed one correction. Cursor was fast but required manual fixes for naming conventions and type errors. Codex had deprecation warnings and stale patterns. Antigravity had quota problems and provider version issues.
Task 2: Refactoring a Go Monitoring Script Across 12 Files
The task: change the signature of getServiceMetrics from this:
go
func getServiceMetrics(name string) (*ServiceMetrics, error)
to this:
go
func getServiceMetrics(ctx context.Context, name string, opts MetricOptions) (*ServiceMetrics, error)
The function is defined in metrics.go and called in 11 other files. MetricOptions is a new struct that needs to be defined in types.go. Every call site needs to pass a context and an options struct.
This is a real refactor I did last month. I ran it through all six tools to see which ones could handle it without missing files.
Kiro
I used a bugfix spec for this. Click Specs. Click +. Select Bug. Type this:
text
The getServiceMetrics function in go/monitoring/metrics.go needs a new signature.
Current:
func getServiceMetrics(name string) (*ServiceMetrics, error)
New:
func getServiceMetrics(ctx context.Context, name string, opts MetricOptions) (*ServiceMetrics, error)
MetricOptions is a new struct that needs to be defined in go/monitoring/types.go.
It should have these fields:
- Timeout time.Duration (default 30s)
- IncludeHistogram bool (default false)
- Tags map[string]string (default empty)
All 11 callers need to be updated. Where no context is available, use context.Background().
Where no options are needed, use MetricOptions{} as the zero value.
Kiro generated a bugfix spec that listed every file:
markdown
## Bug Condition
The function getServiceMetrics does not accept a context or options,
making it impossible to add timeouts or pass metadata to the metrics
collection layer.
## Files Requiring Changes
1. go/monitoring/types.go - Add MetricOptions struct
2. go/monitoring/metrics.go - Update function signature
3. go/monitoring/collector.go - Update 3 call sites
4. go/monitoring/alerting.go - Update 2 call sites
5. go/monitoring/reporter.go - Update 1 call site
6. go/monitoring/aggregator.go - Update 2 call sites
7. go/monitoring/exporter.go - Update 1 call site
8. go/monitoring/health.go - Update 1 call site
9. go/monitoring/dashboard.go - Update 1 call site
10. go/monitoring/scheduler.go - Update 1 call site
11. go/monitoring/cleanup.go - Update 1 call site
12. go/monitoring/metrics_test.go - Update test call sites
It ran the tasks one by one. Here is the MetricOptions struct it added to types.go:
```go
// MetricOptions configures the behavior of getServiceMetrics.
type MetricOptions struct {
	// Timeout is the maximum duration to wait for metrics collection.
	// Defaults to 30 seconds if zero.
	Timeout time.Duration

	// IncludeHistogram controls whether histogram data is included in the response.
	// Histogram collection is expensive. Only enable when needed.
	IncludeHistogram bool

	// Tags are additional key-value pairs attached to the collected metrics.
	// These are forwarded to the metrics backend as dimensions.
	Tags map[string]string
}

// DefaultMetricOptions returns a MetricOptions with sensible defaults.
func DefaultMetricOptions() MetricOptions {
	return MetricOptions{
		Timeout:          30 * time.Second,
		IncludeHistogram: false,
		Tags:             make(map[string]string),
	}
}
```
It added a DefaultMetricOptions() constructor without being asked. That is the kind of thing a senior engineer adds automatically.
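The struct comment promises a 30 second default when Timeout is zero, and my prompt allowed callers to pass MetricOptions{}. A constructor alone does not cover that case, because a caller who skips it gets a zero timeout. One way to honor the comment is to normalize inside the callee. The sketch below is illustrative: withDefaults is a hypothetical helper, not something Kiro generated.

```go
package main

import (
	"fmt"
	"time"
)

// MetricOptions mirrors the fields of the struct described in the spec.
type MetricOptions struct {
	Timeout          time.Duration
	IncludeHistogram bool
	Tags             map[string]string
}

// withDefaults is a hypothetical helper. It returns a copy of o in which
// zero-valued fields are replaced by the documented defaults, so callers
// passing MetricOptions{} still get a 30 second timeout.
func (o MetricOptions) withDefaults() MetricOptions {
	if o.Timeout == 0 {
		o.Timeout = 30 * time.Second
	}
	if o.Tags == nil {
		o.Tags = make(map[string]string)
	}
	return o
}

func main() {
	opts := MetricOptions{IncludeHistogram: true}.withDefaults()
	fmt.Println(opts.Timeout, opts.IncludeHistogram, len(opts.Tags))
}
```

Because the receiver is a value, the caller's struct is never modified. getServiceMetrics could call opts.withDefaults() on its first line and treat the result as authoritative.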
Here is how it updated a call site in collector.go:
```go
// Before
metrics, err := getServiceMetrics(svc.Name)
if err != nil {
	log.Printf("failed to get metrics for %s: %v", svc.Name, err)
	continue
}

// After
ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()
metrics, err := getServiceMetrics(ctx, svc.Name, MetricOptions{
	Tags: map[string]string{
		"collector": "automated",
		"service":   svc.Name,
	},
})
if err != nil {
	log.Printf("failed to get metrics for %s: %v", svc.Name, err)
	continue
}
```
It did not just add context.Background() everywhere. It used the existing ctx from the function parameter where one was available. Where no context existed, it created one with a timeout. That is correct Go.
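One caveat about the After snippet: it sits inside a loop (note the continue), and a deferred cancel only runs when the enclosing function returns. Every iteration therefore leaves a timer alive until the whole loop is done. A common way around that is to move the per-service work into its own function. The following is a minimal sketch with a stub getServiceMetrics and the opts parameter omitted for brevity. collectOne, collectAll, and run are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type ServiceMetrics struct{ Name string }

// getServiceMetrics is a stub standing in for the real function. It only
// honors cancellation so the example can run on its own.
func getServiceMetrics(ctx context.Context, name string) (*ServiceMetrics, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return &ServiceMetrics{Name: name}, nil
}

// collectOne gives a single service its own deadline. The deferred cancel
// belongs to this function, so it runs as soon as the call returns instead of
// when the whole collection loop finishes.
func collectOne(ctx context.Context, name string, timeout time.Duration) (*ServiceMetrics, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	return getServiceMetrics(ctx, name)
}

// collectAll logs and skips services that fail, as the original loop does.
func collectAll(ctx context.Context, names []string) []*ServiceMetrics {
	var out []*ServiceMetrics
	for _, name := range names {
		m, err := collectOne(ctx, name, 30*time.Second)
		if err != nil {
			fmt.Printf("failed to get metrics for %s: %v\n", name, err)
			continue
		}
		out = append(out, m)
	}
	return out
}

// run collects with a background context and returns how many services succeeded.
func run(names []string) int {
	return len(collectAll(context.Background(), names))
}

func main() {
	fmt.Println(run([]string{"payment-worker", "inventory-api"}))
}
```

The behavior seen by callers is the same as in the generated code. The only difference is when each timeout's resources are released.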
I ran go build ./.... Zero errors. I ran go test ./.... All tests passed.
Total time: 6 minutes including the spec phase.
Claude Code
bash
claude "Update getServiceMetrics in go/monitoring/metrics.go to accept
ctx context.Context, name string, opts MetricOptions.
Define MetricOptions in go/monitoring/types.go with fields:
Timeout time.Duration
IncludeHistogram bool
Tags map[string]string
Add a DefaultMetricOptions() constructor.
Update all 11 callers. Use existing ctx where available, context.Background() where not.
Update tests in metrics_test.go."
Claude Code read all 12 files before making any changes. It printed a summary of what it found:
text
Found 11 call sites across 10 files:
- collector.go: 3 calls (has ctx parameter available)
- alerting.go: 2 calls (has ctx parameter available)
- reporter.go: 1 call (no ctx, will use context.Background())
- aggregator.go: 2 calls (has ctx parameter available)
- exporter.go: 1 call (no ctx, will use context.Background())
- health.go: 1 call (has ctx parameter available)
- dashboard.go: 1 call (no ctx, will use context.Background())
- scheduler.go: 1 call (has ctx parameter available)
- cleanup.go: 1 call (no ctx, will use context.Background())
- metrics_test.go: 4 test calls (will use context.Background())
Making changes...
The output was identical in quality to Kiro's. Same DefaultMetricOptions() constructor. Same context propagation logic. Same test updates.
go build ./...: zero errors. go test ./...: all passed.
Total time: 3 minutes. Faster than Kiro because there was no spec phase.
The difference is the paper trail. Kiro's bugfix spec documents what changed and why. Six months from now when someone asks why getServiceMetrics has a MetricOptions parameter, the spec is there. With Claude Code, the only record is the git commit message.
Cursor
I opened Composer with Cmd+Shift+I and typed the same prompt.
Cursor updated 9 of 12 files. It missed dashboard.go, cleanup.go, and scheduler.go. These three files are in the same directory as the others. Cursor apparently did not index them.
I pointed Cursor to the missing files explicitly:
text
You missed these files:
- go/monitoring/dashboard.go
- go/monitoring/cleanup.go
- go/monitoring/scheduler.go
Please update the getServiceMetrics call sites in these files too.
Cursor updated them. But the updates in dashboard.go used context.Background() even though dashboard.go has a ctx context.Context parameter in its main function. Cursor did not propagate the context correctly.
I fixed that manually.
go build ./...: zero errors after manual fix. Total time: 9 minutes.
Windsurf
Windsurf missed 2 files: cleanup.go and scheduler.go. Same problem as Cursor. I pointed it to the missing files and it updated them correctly, including proper context propagation.
Total time: 7 minutes.
Codex
Codex updated metrics.go with the new signature. It updated collector.go with 2 of 3 call sites. It missed the third call site in collector.go and all other files.
I asked it to find the remaining call sites. It found 4 more. I asked again. It found 2 more. After 4 rounds of prompting it had updated 8 of 11 files. I gave up and did the remaining 3 manually.
Codex does not handle large multi-file refactors well. It loses track of what it has already changed.
Antigravity
Antigravity hit a quota limit after updating 3 files. I had already used most of my daily requests. I stopped testing it on this task.
Task 2 Summary
Claude Code and Kiro both handled this perfectly. Claude Code was faster. Kiro left documentation. Cursor and Windsurf missed files and needed correction. Codex lost track of the scope. Antigravity hit quota limits.
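Since the failure mode for half the tools was missed call sites, it helps to have an independent list of them before handing the job to any agent. The compiler will catch stale calls afterwards, but a list up front can go straight into the prompt. Below is a minimal sketch using the standard go/parser and go/ast packages. staleCalls is a hypothetical helper, and it only matches direct calls by bare identifier.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// staleCalls parses one Go source file and returns the line numbers of calls
// to funcName that do not pass wantArgs arguments. Matching a bare identifier
// is enough for an unexported package-level function.
func staleCalls(filename, src, funcName string, wantArgs int) ([]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		return nil, err
	}
	var lines []int
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		id, ok := call.Fun.(*ast.Ident)
		if ok && id.Name == funcName && len(call.Args) != wantArgs {
			lines = append(lines, fset.Position(call.Pos()).Line)
		}
		return true
	})
	return lines, nil
}

func main() {
	// An illustrative file with one old-style and one new-style call.
	src := `package monitoring

func report() {
	_, _ = getServiceMetrics("payments")
	_, _ = getServiceMetrics(ctx, "payments", MetricOptions{})
}
`
	lines, err := staleCalls("reporter.go", src, "getServiceMetrics", 3)
	fmt.Println(lines, err)
}
```

The parser does not type-check, so the undefined ctx in the sample source is not a problem. Looping this over every .go file in go/monitoring/ would give a per-file count to compare against what each tool claims to have updated.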
Task 3: Generating a Datadog SLO Monitor
The task: generate a Datadog monitor for a new SLO. The requirements are 99.9% availability, 30-day rolling window, alert at 99.5%, warn at 99.7%. The monitor must include a runbook_url tag. It must follow the same structure as the existing slo-payment-service.tf in my monitoring module.
This task is specifically about whether a tool can follow your team's conventions without you repeating them every time.
Kiro
I already had the terraform-standards.md steering file from Task 1. I added one more steering file specifically for monitoring:
markdown
---
inclusion: fileMatch
fileMatchPattern: "modules/monitoring/**/*.tf"
---
# Monitoring Module Standards
## SLO Monitor Structure
All SLO monitors use the datadog_service_level_objective resource type.
Follow the exact structure in: #[[file:modules/monitoring/slo-payment-service.tf]]
## Required Fields
Every SLO monitor must have:
- name following the pattern: "SLO - [Service Name] - [Metric]"
- description explaining what the SLO measures
- thresholds block with timeframe = "30d"
- warning threshold at 99.7
- critical threshold at 99.5
- tags including "runbook_url:[url]" and "team:[team_name]"
## Naming Convention
The Terraform resource name follows: slo_[service_name]_[metric]
Example: slo_payment_service_availability
I asked Kiro to generate the monitor for a new service called inventory-api.
Here is what it produced:
hcl
resource "datadog_service_level_objective" "slo_inventory_api_availability" {
  name        = "SLO - Inventory API - Availability"
  type        = "metric"
  description = "Measures the availability of the Inventory API service. Availability is defined as the percentage of requests that return a non-5xx response."
  query {
    numerator   = "sum:trace.web.request.hits{service:inventory-api,!http.status_class:5xx}.as_count()"
    denominator = "sum:trace.web.request.hits{service:inventory-api}.as_count()"
  }
  thresholds {
    timeframe = "30d"
    target    = 99.9
    warning   = 99.7
  }
  tags = [
    "service:inventory-api",
    "team:${var.team}",
    "environment:${var.environment}",
    "runbook_url:https://runbooks.internal/inventory-api/availability",
    "managed_by:terraform",
  ]
}
resource "datadog_monitor" "slo_inventory_api_availability_alert" {
  name    = "SLO Alert - Inventory API - Availability"
  type    = "slo alert"
  message = <<-EOT
    {{#is_alert}}
    Inventory API availability SLO is below 99.5% over the last 30 days.
    Current burn rate: {{value}}
    Runbook: https://runbooks.internal/inventory-api/availability
    @pagerduty-inventory-api
    {{/is_alert}}
    {{#is_warning}}
    Inventory API availability SLO is below 99.7% over the last 30 days.
    @slack-inventory-api-alerts
    {{/is_warning}}
  EOT
  query   = "error_budget(\"${datadog_service_level_objective.slo_inventory_api_availability.id}\").over(\"30d\") > 1"
  thresholds = {
    critical = 1
    warning  = 0.5
  }
  tags = [
    "service:inventory-api",
    "team:${var.team}",
    "environment:${var.environment}",
    "managed_by:terraform",
  ]
}
This is exactly what I would have written manually. The runbook_url tag is there. The thresholds match. The naming convention matches. The message template matches my existing monitors.
I did not tell Kiro any of this in the prompt. The steering file told it.
Claude Code
bash
claude "Generate a Datadog SLO monitor for inventory-api.
99.9% availability target, 30-day rolling window.
Alert at 99.5%, warn at 99.7%.
Follow the same structure as modules/monitoring/slo-payment-service.tf."
Claude Code read slo-payment-service.tf and generated a correct monitor. The structure matched. The thresholds were correct.
But it did not include the runbook_url tag. That requirement is in my steering file, which Claude Code does not have access to. It does not have persistent project memory between sessions.
I told it to add the runbook_url tag. It added it. One correction turn.
The output after correction was identical to Kiro's output. But I had to remember to ask for the runbook_url. With Kiro, I never have to remember. The steering file remembers for me.
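The thing I had to remember can also be expressed as a check. Below is a minimal sketch of a tag lint that reports which required tag keys are missing from a monitor's tag list. The required keys come from my monitoring standards. The function name and the idea of running it as a separate step are assumptions for illustration, not a feature of any of these tools.

```python
REQUIRED_TAG_KEYS = ("runbook_url", "team")


def missing_tag_keys(tags, required=REQUIRED_TAG_KEYS):
    """Return the required tag keys that do not appear in a list of "key:value" tags."""
    # Split on the first colon only, so URLs in the value stay intact.
    present = {tag.split(":", 1)[0] for tag in tags if ":" in tag}
    return [key for key in required if key not in present]


first_attempt = ["service:inventory-api", "team:${var.team}", "managed_by:terraform"]
after_correction = first_attempt + [
    "runbook_url:https://runbooks.internal/inventory-api/availability"
]

print(missing_tag_keys(first_attempt))
print(missing_tag_keys(after_correction))
```

The first list stands in for an output that forgot the runbook link. The second stands in for the same output after one correction turn.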
Cursor
Cursor generated a generic Datadog SLO monitor. It did not read my existing slo-payment-service.tf. The structure was different. The naming convention was different. No runbook_url tag. The thresholds were correct because I specified them in the prompt.
I spent about 8 minutes correcting it to match my team's conventions.
Windsurf
Windsurf read my existing monitoring files and generated a monitor that was close to correct. The structure matched. The naming convention matched. But it missed the runbook_url tag and used a slightly different message template format.
Two correction turns to get it right.
Codex and Antigravity
Codex generated a monitor using the Datadog Terraform provider 2.x syntax. My project uses 3.x. The datadog_service_level_objective resource arguments changed between these versions. The output would not apply without errors.
Antigravity generated a monitor that was structurally correct but used hardcoded values instead of variables. It also missed the runbook_url tag.
Task 3 Summary
Kiro wins this task by a significant margin. Steering files are the killer feature for teams with established conventions. You write the convention once. Every future generation follows it automatically. No other tool in this list has an equivalent that works this well.
Claude Code is second. It reads your existing files and matches the pattern. But it does not have persistent memory, so you have to remind it of requirements that are not visible in the files.
Task 4: Debugging a Silent Kubernetes Failure
The task: a deployment was running but not serving traffic. No errors in the logs. The pods were in Running state. The ALB health checks were passing. But requests to the service were returning 503.
This is the worst kind of production issue. Everything looks fine. Nothing is fine.
Here is the deployment YAML:
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inventory-api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inventory-api
      version: v2
  template:
    metadata:
      labels:
        app: inventory-api
        version: v2
    spec:
      containers:
        - name: inventory-api
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/inventory-api:v2.1.4
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
Here is the Service YAML:
yaml
apiVersion: v1
kind: Service
metadata:
  name: inventory-api
  namespace: production
spec:
  selector:
    app: inventory-api
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  type: ClusterIP
Here is the Ingress YAML:
yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inventory-api
  namespace: production
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  rules:
    - host: inventory-api.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: inventory-api
                port:
                  number: 80
I pasted all three files into each tool and asked: why is this not routing traffic?
Claude Code
Claude Code spotted the first issue in about 8 seconds.
The selector in the Service is app: inventory-api. The matchLabels in the Deployment are app: inventory-api and version: v2, so the pods carry both labels. A Service selector matches any pod whose labels include every pair in the selector, so this one should work.
Wait. Claude Code caught something I missed. It said:
text
The Service selector looks correct at first glance, but look at the
Deployment's selector.matchLabels:
selector:
matchLabels:
app: inventory-api
version: v2
The Deployment requires pods to have BOTH labels. The Service selector
only requires app: inventory-api. This is fine for routing.
But here is the actual problem: the Ingress uses target-type: ip, which
means the ALB routes directly to pod IPs. The ALB health check target
is port 80 (the Service port). But the pods listen on port 8080.
The ALB is health-checking port 80 on the pod IPs. Port 80 is not open
on the pods. The health checks are failing silently because the ALB
is configured to pass health checks even when the target is unreachable
(the default behavior with some ALB controller versions).
The fix: add this annotation to the Ingress:
alb.ingress.kubernetes.io/healthcheck-port: "8080"
Or change the Service targetPort to match the ALB health check port.
That was the actual bug. The ALB was health-checking port 80 on the pod IPs. The pods listen on 8080. The health checks were failing but the ALB was still routing traffic because the health check failure threshold had not been reached yet. The 503s were intermittent.
Claude Code found this without me giving it any additional context. It read all three files together and traced the port mismatch through the entire routing chain.
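The port chain is easy to lose track of across three files, so here is a small sketch that walks it. The values are copied by hand from the manifests above. The helper names are hypothetical, and whether the ALB really probes port 80 depends on the controller version and annotations, so treat the second print as a model of the explanation quoted above rather than proof of ALB behavior.

```python
def resolve_backend_port(ingress_backend_port, service_ports):
    """Return the targetPort the Service maps the Ingress backend port to, or None."""
    for entry in service_ports:
        if entry["port"] == ingress_backend_port:
            return entry["targetPort"]
    return None


def has_listener(port, container_ports):
    """True if the pod declares a container port with this number."""
    return port in container_ports


# Values transcribed from the Ingress, Service, and Deployment manifests.
INGRESS_BACKEND_PORT = 80
SERVICE_PORTS = [{"port": 80, "targetPort": 8080}]
CONTAINER_PORTS = [8080]

traffic_port = resolve_backend_port(INGRESS_BACKEND_PORT, SERVICE_PORTS)
print("traffic port:", traffic_port, "listener:", has_listener(traffic_port, CONTAINER_PORTS))
print("probe on 80 finds a listener:", has_listener(80, CONTAINER_PORTS))
```

The request path resolves to 8080, where the container listens. A probe aimed at 80 on the pod IP has nothing to talk to, which is the mismatch the healthcheck-port annotation addresses.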
Kiro
Kiro found the port mismatch. It took two prompts. The first prompt identified the Service selector as potentially problematic (it was not). The second prompt, after I told it the selector was fine, found the ALB health check port issue.
Windsurf
Windsurf found the port mismatch in one pass. Its Cascade agent read all three files together and traced the routing chain correctly. Comparable to Claude Code.
Cursor
Cursor found the Service selector issue (which was not actually a problem) and stopped there. It did not trace the ALB health check port mismatch. I had to give it more context.
Codex and Antigravity
Both identified the Service selector as the problem. Neither found the ALB health check port issue. The selector was not actually the problem.
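For anyone unsure why the selector was a red herring: an equality-based selector matches a pod when every key/value pair in the selector is present in the pod's labels. Extra labels on the pod do not matter. A minimal sketch of that rule, with a hypothetical function name:

```python
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """True if every selector pair appears in the pod's labels (subset match)."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


pod_labels = {"app": "inventory-api", "version": "v2"}

# The Service selector from the manifest: a subset of the pod labels.
print(selector_matches({"app": "inventory-api"}, pod_labels))

# A selector that pins a different version would not match these pods.
print(selector_matches({"app": "inventory-api", "version": "v1"}, pod_labels))
```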
Task 4 Summary
Claude Code and Windsurf tied. Both traced the full routing chain and found the actual bug without additional prompting. Kiro found it in two prompts. Cursor, Codex, and Antigravity identified a non-issue and stopped.
The difference here is context window and reasoning quality. Claude Code and Windsurf read all three files together and reasoned about the full routing path. The other tools read the files but did not connect the dots across all three.
Task 5: Writing an Incident Runbook
The task: generate a structured runbook from a postmortem summary.
Here is the postmortem I gave each tool:
text
Incident: INS-2847
Date: 2026-04-14 02:17 UTC
Duration: 47 minutes
Severity: P1
Service: payment-worker
Summary:
Redis connection pool exhaustion caused payment processing to fail.
The payment-worker service uses Redis for distributed locking during
payment processing. At 02:17 UTC, Redis connection pool hit the
configured maximum of 100 connections. New payment requests could not
acquire locks and failed with a 503 error.
Root cause:
A deployment at 01:55 UTC increased the payment-worker replica count
from 5 to 15 without updating the Redis connection pool size. Each
replica holds up to 10 connections. 15 replicas * 10 connections = 150
connections, exceeding the pool maximum of 100.
Resolution:
1. Scaled payment-worker back to 5 replicas at 02:31 UTC
2. Updated Redis connection pool max to 200 at 02:41 UTC
3. Scaled payment-worker back to 15 replicas at 02:44 UTC
4. Confirmed payment processing resumed at 02:44 UTC
Action items:
- Add pre-deployment check for Redis connection pool capacity
- Add CloudWatch alarm for Redis connection count > 80% of max
- Update deployment runbook to include Redis capacity check
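The root cause is plain arithmetic, and the first two action items follow from it. Here is a rough sketch of what a pre-deployment capacity check could compute, using the numbers from the postmortem. The function name and the return shape are illustrative assumptions, not the check my team actually shipped.

```python
def capacity_check(replicas, connections_per_replica, pool_max, alarm_percent=80):
    """Compare worst-case connection demand against the pool maximum."""
    needed = replicas * connections_per_replica
    return {
        "needed": needed,
        "fits": needed <= pool_max,
        # Connection count at which the "> 80% of max" alarm would fire.
        "alarm_threshold": pool_max * alarm_percent // 100,
    }


print(capacity_check(replicas=5, connections_per_replica=10, pool_max=100))
print(capacity_check(replicas=15, connections_per_replica=10, pool_max=100))
print(capacity_check(replicas=15, connections_per_replica=10, pool_max=200))
```

The three calls correspond to the state before the deployment, the state that caused the incident, and the state after the pool maximum was raised to 200.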
Kiro
I had a steering file for runbook format. The file lives at .kiro/steering/runbook-template.md with inclusion: manual so it only loads when I explicitly reference it. Here is the full file:
markdown
---
inclusion: manual
---
# Runbook Template
All runbooks follow this structure:
## [Service Name]: [Issue Type]
### Detection
How to detect this issue. Include specific metrics, log queries,
and alert names.
### Triage
Step-by-step triage checklist. Each step should be a single action
with an expected outcome.
### Resolution
Numbered resolution steps. Each step includes:
- The exact command to run
- The expected output
- What to do if the output is unexpected
### Rollback
How to undo the resolution if it makes things worse.
### Related Runbooks
Links to related runbooks.
### Post-Incident
What to do after the incident is resolved.
The structure it enforces as a diagram:
flowchart TD
A[Runbook] --> B["Detection\nMetrics, log queries, alert names"]
A --> C["Triage\nStep-by-step checklist with expected outcomes"]
A --> D["Resolution\nExact commands + expected output + fallback"]
A --> E["Rollback\nHow to undo if resolution makes things worse"]
A --> F["Related Runbooks\nLinks to connected runbooks"]
A --> G["Post-Incident\nTickets, alarms, schedule postmortem"]
I asked Kiro to generate the runbook using this template. Here is the full output it produced.
Payment Worker: Redis Connection Pool Exhaustion
Detection
This issue is detected by one or more of the following.
Alert: payment-worker-redis-connections-high fires when Redis connection count exceeds 80% of the configured maximum.
Log pattern: search CloudWatch Logs Insights for:
sql
fields @timestamp, @message
| filter @logStream like /payment-worker/
| filter @message like /connection pool exhausted/
| sort @timestamp desc
| limit 50
Metric: redis.clients.connected in Datadog. Navigate to Metrics, search redis.clients.connected, filter by service:payment-worker.
Triage
- Check current Redis connection count:
bash
redis-cli -h $REDIS_HOST -p $REDIS_PORT info clients | grep connected_clients
Expected: a number. If it equals or exceeds maxclients, the pool is exhausted.
- Check current payment-worker replica count:
bash
kubectl get deployment payment-worker -n production -o jsonpath='{.spec.replicas}'
- Calculate expected connection usage. Check REDIS_MAX_CONNECTIONS in the ConfigMap:
bash
kubectl get configmap payment-worker-config -n production -o yaml | grep REDIS_MAX_CONNECTIONS
Multiply replicas by REDIS_MAX_CONNECTIONS. If the result exceeds maxclients, that is the problem.
- Check Redis maxclients setting:
bash
redis-cli -h $REDIS_HOST -p $REDIS_PORT config get maxclients
- If replicas * connections_per_replica > maxclients, proceed to Resolution.
Resolution
Option A: Scale down replicas (immediate, temporary)
- Scale payment-worker to a safe replica count:
bash
kubectl scale deployment payment-worker -n production --replicas=5
Expected: deployment.apps/payment-worker scaled
- Verify pods are terminating:
bash
kubectl get pods -n production -l app=payment-worker -w
Wait until only 5 pods are in Running state.
- Verify payment processing has resumed:
bash
kubectl logs -n production -l app=payment-worker --tail=20 | grep "payment processed"
Option B: Increase Redis maxclients (permanent fix)
- Update the Redis maxclients setting:
bash
redis-cli -h $REDIS_HOST -p $REDIS_PORT config set maxclients 200
Expected: OK
- Verify the change:
bash
redis-cli -h $REDIS_HOST -p $REDIS_PORT config get maxclients
Expected output: maxclients on one line, 200 on the next.
- Scale payment-worker back to the desired replica count:
bash
kubectl scale deployment payment-worker -n production --replicas=15
- Monitor Redis connection count for 5 minutes:
bash
watch -n 5 'redis-cli -h $REDIS_HOST -p $REDIS_PORT info clients | grep connected_clients'
Rollback
If Option B makes things worse:
- Revert maxclients to the previous value:
bash
redis-cli -h $REDIS_HOST -p $REDIS_PORT config set maxclients 100
- Scale payment-worker back to 5 replicas:
bash
kubectl scale deployment payment-worker -n production --replicas=5
- Escalate to the Redis infrastructure team.
Related Runbooks
- Redis Memory Pressure
- Payment Worker High Error Rate
- Kubernetes Deployment Rollback
Post-Incident
- File a ticket to add pre-deployment Redis capacity check
- Add CloudWatch alarm for Redis connection count above 80% of max
- Update the deployment runbook to include Redis capacity check
- Schedule a postmortem review within 48 hours
This is production-ready. I could hand this to an on-call engineer at 2am and they could follow it without asking me anything.
Claude Code
Claude Code generated an excellent runbook. The commands were correct. The structure was logical. But it invented its own format. The section names were different. The order was different. The log query syntax was different from what my team uses.
I asked it to reformat to match my template. It did so correctly. Two turns instead of one.
The content quality was identical to Kiro's output. The difference is that Kiro followed my template automatically because of the steering file.
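"Different section names, different order" is something you can check mechanically. A minimal sketch, assuming the generated runbook's headings have already been extracted into a list. The section names come from my template. The function and the sample headings are hypothetical.

```python
TEMPLATE_SECTIONS = [
    "Detection",
    "Triage",
    "Resolution",
    "Rollback",
    "Related Runbooks",
    "Post-Incident",
]


def check_sections(headings, template=TEMPLATE_SECTIONS):
    """Report template sections that are missing and whether the rest are in order."""
    missing = [section for section in template if section not in headings]
    present = [heading for heading in headings if heading in template]
    expected_order = [section for section in template if section in headings]
    return {"missing": missing, "in_order": present == expected_order}


# HYPOTHETICAL headings from a runbook that invented its own format.
freeform = ["Overview", "Resolution", "Detection", "Rollback"]

print(check_sections(TEMPLATE_SECTIONS))
print(check_sections(freeform))
```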
Cursor
Cursor generated a basic runbook. It had the right sections but the commands were incomplete. The kubectl commands were missing the namespace flag. The Redis commands were missing the host and port flags. The log query was a generic CloudWatch Logs query, not the specific query format my team uses.
I spent about 10 minutes editing it.
Windsurf, Codex, Antigravity
Windsurf generated a runbook that was better than Cursor but still needed editing. The commands were mostly correct but the structure did not match my template.
Codex generated a runbook that was mostly Python-flavored. It suggested using boto3 to query CloudWatch Logs instead of the CloudWatch Logs Insights query language. That is not how my team works.
Antigravity generated a reasonable runbook but hit quota limits before completing the post-incident section.
Task 5 Summary
Kiro wins again because of steering files. When you have a template, Kiro follows it. Claude Code generates excellent content but needs a correction turn to match your format. Cursor, Windsurf, Codex, and Antigravity all require significant editing.
What I Actually Use and Why
I am going to be direct.
Kiro is the right tool for production SRE work on a team.
SRE work is not solo work. You are writing Terraform that three other engineers will review. You are writing runbooks that an on-call engineer will read at 2am. You are generating monitors that need to match the conventions your team agreed on six months ago.
Kiro is the only tool that enforces those conventions automatically. Steering files mean you write the rule once and every future generation follows it. The spec workflow means every change has a paper trail. When someone asks why a module was written a certain way, you have an answer.
The spec workflow feels slow the first week. After that, you stop noticing it. What you do notice is that you stop having conversations about why the code looks different from everything else.
Claude Code is the right tool for complex autonomous tasks.
When I need to refactor a massive codebase, debug a subtle issue across multiple files, or write a complex automation script, Claude Code on Opus 4.7 is the most capable tool available. The 1M token context window is not a marketing number. It genuinely changes what is possible. It can read your entire infrastructure repo and write code that looks like it belongs there.
The terminal-only interface is a real limitation. I use it alongside Cursor for the IDE experience.
Cursor is the right tool for daily inline editing.
Cursor is the most polished IDE experience. The autocomplete is fast and accurate. The chat is responsive. It is the right tool if you want AI assistance without changing how you work. I use it for quick fixes, small changes, and anything where I want to stay in flow.
Windsurf is the right tool if you want Cursor quality at a lower price.
Windsurf 2.0 with Devin integration is genuinely impressive. The SWE-1.6 model is strong. The pricing is more predictable than Antigravity's. If your team is budget-conscious and does not need Kiro's spec workflow, Windsurf is a solid choice.
Codex is not the right tool for infrastructure work.
Codex is excellent for Python automation and data pipelines. It is not optimized for Terraform, Go, or YAML-heavy infrastructure repos. The token-based pricing since April 2026 makes costs unpredictable. Use it for what it is good at.
Antigravity is not ready for production infrastructure work.
The Gemini 3.1 Pro model is capable. Antigravity 2.0 launched at Google I/O 2026 with real improvements. But the quota interruptions, the March 2026 pricing chaos, and the stale infrastructure training data make it unreliable for production SRE work right now. Check back in six months.
My Personal Stack
I use three tools, not one.
Kiro for new features and anything that needs to follow team conventions.
Claude Code for large refactors, debugging complex issues, and anything that requires reading the whole codebase.
Cursor for daily editing, autocomplete, quick fixes, and small changes.
This is not a failure of any single tool. It is the reality of 2026. The tools are specialized. The engineers who pick one and stick with it are leaving performance on the table.
Quick Reference
flowchart LR
A{What are you doing?} --> B["New Terraform module\non a team project"]
A --> C["New Terraform module\nsolo"]
A --> D[Multi-file Go refactor]
A --> E["Kubernetes YAML\ndebugging"]
A --> F["Incident runbook\ngeneration"]
A --> G["Datadog monitor\nwith team conventions"]
A --> H[Quick inline fix]
A --> I["Python automation\nscript"]
A --> J["Large codebase\nexploration"]
A --> K["Budget-conscious\nteam"]
B --> L[Kiro]
C --> M[Claude Code]
D --> M
E --> N[Claude Code or Windsurf]
F --> L
G --> L
H --> O[Cursor]
I --> P[Codex or Claude Code]
J --> M
K --> Q[Windsurf]
One Last Thing
The question is not which AI IDE is best.
The question is which AI IDE is best for this specific task.
Kiro wins for structured, team-based, convention-heavy work. Claude Code wins for raw capability. Cursor wins for daily ergonomics.
Pick based on your actual workflow. Not based on benchmarks. Not based on what is trending on social media this week.
The tools that will make you faster are the ones that fit how you already work and then push you slightly beyond it.