The Silent Killer in Your AWS IAM Policies: Escalating Privileges via PassRole

Rantideb Howlader · 20 min read

The "It Works" Trap

We've all been there. You're building a new CI/CD pipeline, and the deployment fails. Permission Denied. You check CloudTrail, see the missing action, and slap it into the IAM policy. It works. You move on.

Three years later, you're sitting in a security audit, and the auditor asks a simple question: "Who can launch an EC2 instance with the AdministratorAccess role?"

You pause. "Well, only the Senior DevOps team," you answer confidently.

"Are you sure?"

That was the moment my stomach dropped. Because deep down, I knew I wasn't sure. I knew that somewhere in our thousands of lines of Terraform, we had been generous with iam:PassRole. Too generous.

This is the story of how we discovered a silent privilege escalation path that existed in our production environment for months, and the comprehensive, architectural overhaul we executed to fix it without breaking a single deployment.

Part 1: The Anatomy of the Vulnerability

To understand why this fix is so crucial, you have to understand the mechanism. iam:PassRole is not an API action you call directly. You don't "pass a role" to a resource like you pass a salt shaker. It's a permission that allows a user (or service) to assign an IAM role to a resource upon creation.

If I can create an EC2 instance, and I can PassRole the AdminRole to it, I am effectively an Administrator. I just launch the box, SSH into it, and boom—I inherit the Admin privileges.
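The entire "exploit" is a single API call. Here is a minimal sketch of what that attacker-side request looks like (the profile name `AdminRole` and the AMI ID are placeholders, not anything from a real environment):

```python
def build_escalation_request(profile_name, ami="ami-0abcdef1234567890"):
    """RunInstances parameters an attacker would use. The only 'special'
    field is IamInstanceProfile: if the caller holds iam:PassRole on
    Resource "*", AWS attaches any profile you name, no questions asked."""
    return {
        "ImageId": ami,
        "InstanceType": "t3.micro",
        "MinCount": 1,
        "MaxCount": 1,
        "IamInstanceProfile": {"Name": profile_name},
    }

if __name__ == "__main__":
    import boto3  # only needed for the live call
    ec2 = boto3.client("ec2")
    ec2.run_instances(**build_escalation_request("AdminRole"))
```

No CVE, no malware, no zero-day. Just one over-broad grant and one ordinary API call.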

The "Wildcard" Mistake

The most common implementation we see in the wild (and yes, we were guilty of this) looks like this in Terraform:

resource "aws_iam_policy" "jenkins_deploy_policy" {
  name        = "JenkinsDeployPolicy"
  description = "Allows Jenkins to deploy infra"
 
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = [
          "ec2:RunInstances",
          "lambda:CreateFunction",
          "rds:CreateDBInstance"
        ]
        Resource = "*"
      },
      {
        Effect   = "Allow"
        Action   = "iam:PassRole"
        Resource = "*"  # <--- THE SILENT KILLER
      }
    ]
  })
}

Do you see it?

We gave our CI/CD Build Agents the ability to pass any role to the resources they created.

The Attack Vector:

  1. A developer commits a malicious generic build script (or a compromised dependency injects one).
  2. The script runs on the Build Agent (which has the jenkins_deploy_policy).
  3. The script calls aws ec2 run-instances --iam-instance-profile Name=OrganizationAccountAccessRole ... (or any high-privilege role).
  4. The Build Agent itself stays within its own limited permissions, but the new EC2 instance now has full Admin access.
  5. The script curls a reverse shell to the attacker from that new EC2 instance.

We effectively flattened our entire RBAC model. If you could trigger a build, you were Root.

Why This is So Hard to Spot

The reason PassRole vulnerabilities persist is that they are synthetic. They require a combination of permissions to be exploitable. You need:

  1. iam:PassRole on a target high-privilege role.
  2. A "Compute Creation" permission (like ec2:RunInstances, lambda:CreateFunction, sagemaker:CreateNotebookInstance, glue:CreateDevEndpoint, etc.).

If you look at just the ec2:RunInstances permission, it looks fine. "Oh, Jenkins needs to create servers." If you look at just the iam:PassRole, it looks vague but necessary. "Oh, Jenkins needs to assign roles to those servers."

It's only when you combine them with Resource: * that the hole opens up.
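Because the danger is the combination, you can detect it mechanically by scanning each policy document for both halves at once. A rough sketch (the compute-creation action list is illustrative, not exhaustive):

```python
COMPUTE_CREATION_ACTIONS = {
    "ec2:RunInstances", "lambda:CreateFunction",
    "sagemaker:CreateNotebookInstance", "glue:CreateDevEndpoint",
}

def _as_list(x):
    # IAM allows both a single string and a list for Action/Resource.
    return x if isinstance(x, list) else [x]

def has_passrole_escalation(policy_doc):
    """True if one policy document grants both halves of the combo:
    unconditioned iam:PassRole on '*' AND a compute-creation action."""
    open_passrole = False
    can_create_compute = False
    for stmt in _as_list(policy_doc.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        if "iam:PassRole" in actions and "*" in resources and "Condition" not in stmt:
            open_passrole = True
        if COMPUTE_CREATION_ACTIONS & set(actions):
            can_create_compute = True
    return open_passrole and can_create_compute
```

Run this over every policy document in the account and you have a poor man's PMapper for this one specific escalation path.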

Part 2: The Discovery & Forensic Analysis

We didn't just stumble on this. We started running advanced security tooling because we were preparing for SOC2.

Tools of the Trade: How we found it

We used three primary open-source tools to map our risk surface. If you aren't running these, you are flying blind.

1. PMapper (Principal Mapper)

This tool is the gold standard for IAM graph analysis. It doesn't just look at policies; it simulates the "effective" permissions.

We ran:

pmapper graph create
pmapper visualize --file graph.html

The output was a visual graph where JenkinsRole had a direct edge to AdministratorAccess. It explicitly identified the path: JenkinsRole -> can_pass_role -> NewEc2 -> assumes -> AdministratorAccess.

Seeing it visually was terrifying.

2. CloudSplaining

CloudSplaining by Salesforce is excellent for identifying "Over-privileged" roles. It specifically flags PassRole with *. It generated a remediation HTML report sorted by "Risk Priority". Our build agents were at the top of the list.

3. AWS Access Analyzer

We enabled IAM Access Analyzer in the AWS console. It flagged several roles as "Allowing access to unused services," but more importantly, it flagged "PassRole on sensitive roles."

The 3 AM Wakeup Call (A Case Study)

I remember the night vividly. A junior dev had opened a PR to update a Lambda function. The PR inadvertently included a change to the execution role, granting it s3:* on all buckets for debugging. Because of our loose PassRole controls, the CI/CD pipeline happily updated the function with the new, overpowered role.

Two hours later, we got an alert. An automated script running inside that Lambda was deleting "old" backups. Except "old" was defined as anything older than 24 hours, and it was running against our primary production bucket, not the test bucket.

We stopped the bleeding quickly, but the root cause wasn't the script. The root cause was that the CI/CD pipeline was allowed to attach such a powerful role to a development Lambda in the first place. That was the moment iam:PassRole went from a "theoretical risk" to a "business-critical vulnerability."

If we had restricted PassRole, the pipeline would have failed to attach the s3:* role because that role wouldn't have been tagged with the correct access-project (or we would have prevented the creation of such a role in the first place).

Part 3: Deep Dive into IAM Policy Evaluation Logic

Before we get to the fix, we need to geek out on exactly how AWS evaluates IAM policies. This is where most DevOps engineers get tripped up.

AWS IAM evaluation follows a specific flow:

  1. Deny Evaluation: Is there an explicit Deny? If yes, game over. Deny always wins.
  2. Organization SCPs: Does the SCP allow this action? (Note: SCPs are filters, they don't grant permissions, they only allow/deny).
  3. Resource-Based Policies: Does the S3 bucket or KMS key say "Yes"?
  4. Identity-Based Policies: Does the User/Role have an "Allow"?
  5. Permissions Boundaries: Is the action within the boundary?
  6. Session Policies: Is the assumed session restricting it?

Our vulnerability existed comfortably in layer 4 (Identity-Based Policies). Because we had Effect: Allow, Action: iam:PassRole, Resource: *, the logic engine simply said "YES" to everything.

Our Fix (ABAC) works by injecting a Condition into Layer 4. The IAM engine asks: "Is the resource tagged?" If the answer is "No", the StringEquals condition fails. Since the condition fails, the Allow statement is ignored. Since there is no other Allow, the default Implicit Deny kicks in.

This reliance on "Implicit Deny" is safe, but it means you must be very careful not to introduce a conflicting Allow statement elsewhere in another policy attached to the same role. If you have two policies, one with ABAC and one with Resource: *, the permissive one wins (because of the OR logic between Allow statements).

Crucial Lesson: You must audit ALL policies attached to a role. You cannot just attach a "Safe PassRole" policy and expect it to override an existing "Unsafe PassRole" policy. You must remove the unsafe one.

Part 4: The Strategy (ABAC vs. RBAC)

We considered two approaches to restrict PassRole.

Approach A: Explicit Resource Arns (The RBAC Way)

Listing every single allowable role in the Resource block.

{
  "Effect": "Allow",
  "Action": "iam:PassRole",
  "Resource": [
      "arn:aws:iam::123456789012:role/app-payment-v1",
      "arn:aws:iam::123456789012:role/app-user-service",
      "..." 
      // 500 lines later
  ]
}

Why we rejected this: It forces a "God Object" policy. Every time a team makes a new microservice, they need a new role, which means they need to update the CI/CD policy. This creates a bottleneck on the Platform team. "Docs or it didn't happen" becomes "Ticket or you don't deploy."

This approach also hits the IAM Policy Size Limit (6,144 characters) very quickly. You end up splitting policies into PassRole1, PassRole2, PassRole3. It becomes unmanageable spaghetti.

Approach B: Attribute-Based Access Control (The ABAC Way)

This was our winner. We decided to use IAM Tags to control delegation.

The rule we wanted to enforce:

"A builder (user/role) can only PassRole if the Role they are passing is tagged with the same Project Team as the builder."

If I am on the Payments team, I can pass Payments roles. I cannot pass Admin roles or Marketing roles.

ABAC scales infinitely. You don't update the central policy when you add a new role. You just tag the new role correctly.

Part 5: The Implementation

This required a three-pronged attack:

  1. Tagging Strategy: Enforce strictly managed tags on all IAM Ops.
  2. The Sentinel Policy: Use SCPs (Service Control Policies) to prevent tag tampering.
  3. The IAM Condition: Rewrite the Permission Boundaries and Policies.

1. The Tagging Taxonomy

We defined a standard tag: access-project.

  • Role: Jenkins-Payments-Worker -> Tag: access-project = payments
  • Role: App-Payment-Service -> Tag: access-project = payments
  • Role: AdminAccess -> Tag: access-project = platform-admin

We also added a security-tier tag (Tier1, Tier2, Tier3) for extra granularity, but let's focus on the project tag for this article.

2. The Golden Policy (Terraform)

We replaced the wildcard Resource: * with this beautiful condition logic. Here is the exact Terraform module we promoted to our modules registry. I'm going to break down every single line because getting this wrong breaks your cloud.

# The "Safe" PassRole Policy Module
resource "aws_iam_policy" "delegated_pass_role" {
  name        = "SafePassRole-${var.team_name}"
  description = "Allows passing roles only within the ${var.team_name} boundary"
 
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowPassRoleToSameTeam"
        Effect = "Allow"
        Action = "iam:PassRole"
        Resource = "*"
        Condition = {
          StringEquals = {
            "iam:ResourceTag/access-project" = "${var.team_name}"
          }
        }
      }
    ]
  })
}

Breakdown:

  • Action = "iam:PassRole": This is the permission we are scoping.
  • Resource = "*": Yes, looking at "star" here is scary, but wait for the condition. The condition is the firewall.
  • Condition: This is the magic.
    • iam:ResourceTag/access-project: AWS looks at the tags on the target resource (the Role being passed).
    • ${var.team_name}: It compares that tag value to the variable we inject.

So if var.team_name is "payments", this policy effectively says: "You can pass ANY role, as long as that role has a tag access-project equal to payments."
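You can dry-run that decision logic locally before touching AWS. A toy evaluator for just this one check (this models only the single StringEquals condition described above, not the full IAM evaluation flow from Part 3):

```python
def evaluate_passrole(statement, role_tags):
    """Toy evaluator for one Allow statement with an optional
    StringEquals condition on iam:ResourceTag/* keys. A missing tag
    makes the condition fail, the Allow is ignored, and implicit
    deny wins, exactly as described in Part 3."""
    if statement.get("Effect") != "Allow":
        return False
    cond = statement.get("Condition", {}).get("StringEquals", {})
    for key, expected in cond.items():
        prefix = "iam:ResourceTag/"
        if key.startswith(prefix):
            tag_name = key[len(prefix):]
            if role_tags.get(tag_name) != expected:
                return False
    return True
```

Untagged role, wrong-team role: both fall through to implicit deny. Correctly tagged role: allowed.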

A note on timing: AWS checks the tags on the target resource (the Role being passed). Since iam:PassRole always targets an existing role, the role must already exist and already carry its tags before anyone can pass it. There is no window where an untagged role slips through.

But we also needed to ensure that our Jenkins agents themselves couldn't just create a new role, tag it platform-admin, and then assume it.

So we added a second statement: Restrictions on Role Creation.

      {
        Sid    = "RestrictRoleCreationTags"
        Effect = "Allow"
        Action = [
          "iam:CreateRole",
          "iam:TagRole"
        ]
        Resource = "*"
        Condition = {
          StringEquals = {
            "aws:RequestTag/access-project" = "${var.team_name}"
          }
          "ForAllValues:StringEquals" = {
            "aws:TagKeys" = ["access-project", "CostCenter", "Environment"]
          }
        }
      }

This second block is just as critical. It governs iam:CreateRole and iam:TagRole.

  • aws:RequestTag/access-project: This checks the tags in the API request itself.
  • If the Jenkins agent tries to call CreateRole without providing the access-project tag set to "payments", the call fails.
  • ForAllValues:StringEquals: This prevents them from adding extra unapproved tags.

This creates a closed loop. The builder can only create roles tagged "payments". The builder can only pass roles tagged "payments". There is no escape hatch.
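The two conditions on that second statement can also be checked locally. A sketch of the validation they perform on a CreateRole request (the approved tag-key list mirrors the policy above):

```python
ALLOWED_TAG_KEYS = {"access-project", "CostCenter", "Environment"}

def create_role_request_allowed(request_tags, team_name):
    """Mimic the two conditions on CreateRole/TagRole:
    1. aws:RequestTag/access-project must equal the caller's team.
    2. ForAllValues:StringEquals on aws:TagKeys: every tag key in the
       request must be on the approved list."""
    tags = {t["Key"]: t["Value"] for t in request_tags}
    if tags.get("access-project") != team_name:
        return False
    return set(tags) <= ALLOWED_TAG_KEYS
```

A request tagged for the wrong team fails check 1; a request smuggling in an unapproved tag key fails check 2. Either way, the role never gets created.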

3. The "God Mode" Prevention (SCP)

We applied this Service Control Policy (SCP) at the Organizational Unit (OU) level in AWS Organizations. This acts as a global firewall that no local IAM policy can override.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PreventTagTampering",
      "Effect": "Deny",
      "Action": [
        "iam:UntagRole",
        "iam:UntagUser"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/Protected": "true"
        }
      }
    },
    {
      "Sid": "EnforceProjectTagOnRoleCreation",
      "Effect": "Deny",
      "Action": "iam:CreateRole",
      "Resource": "*",
      "Condition": {
        "Null": {
          "aws:RequestTag/access-project": "true"
        }
      }
    }
  ]
}

This SCP forces every new role in the account to have an access-project tag. If you try to run aws iam create-role without that tag, AWS rejects it hard.

Part 6: The Rollout (aka "Don't Break Prod")

You can't just apply this overnight. Half our roles weren't tagged. The chaos would be legendary. Services would fail to scale. Deployments would break.

We wrote a Python script using boto3 to audit and "backfill" the tags.

The Backfill Script (Condensed)

import boto3
import csv
import logging
 
# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('IAMAudit')
 
iam = boto3.client('iam')
 
def audit_roles():
    paginator = iam.get_paginator('list_roles')
    for page in paginator.paginate():
        for role in page['Roles']:
            role_name = role['RoleName']
            
            # Skip AWS service roles (they are managed by AWS)
            # Modifying these can break internal AWS services
            if '/aws-service-role/' in role['Path']:
                continue
                
            try:
                tags = iam.list_role_tags(RoleName=role_name)['Tags']
            except Exception as e:
                logger.error(f"Could not list tags for {role_name}: {e}")
                continue
            
            project_tag = next((t for t in tags if t['Key'] == 'access-project'), None)
            
            if not project_tag:
                logger.warning(f"⚠️  UNTAGGED ROLE: {role_name}")
                # Heuristic: Check name prefix
                if role_name.startswith('payment-') or 'payment' in role_name:
                    logger.info(f" -> Auto-tagging {role_name} as 'payments'")
                    # DRY RUN: Uncomment to apply
                    # iam.tag_role(
                    #     RoleName=role_name,
                    #     Tags=[{'Key': 'access-project', 'Value': 'payments'}]
                    # )
                elif role_name.startswith('user-') or 'frontend' in role_name:
                    logger.info(f" -> Auto-tagging {role_name} as 'user-service'")
                    # DRY RUN: Uncomment to apply
                    # iam.tag_role(
                    #     RoleName=role_name,
                    #     Tags=[{'Key': 'access-project', 'Value': 'user-service'}]
                    # )
                else:
                    logger.error(f" -> ❌ Manual intervention required for {role_name}")
 
if __name__ == '__main__':
    audit_roles()

We ran this in "Dry Run" mode first. We found 450 roles. 300 were legacy junk we just deleted (instant security win). 100 were easy to attribute. 50 required hunting down the owners.

The "Manual Intervention" List: The 50 roles that required manual intervention were the trickiest. These were "shared" roles.

  • Shared-Jenkins-Slave
  • Common-Database-Access
  • Legacy-Monolith-Role

These roles were the technical debt of a thousand rapid deployments. Multiple teams were using Shared-Jenkins-Slave to deploy completely different applications. If we tagged it access-project = payments, the logistics team deployments would fail. If we didn't tag it, the SCP would block it.

The Fix for Shared Roles: We had to duplicate these roles.

  1. Clone Shared-Jenkins-Slave to Payments-Jenkins-Slave (tag: payments).
  2. Clone Shared-Jenkins-Slave to Logistics-Jenkins-Slave (tag: logistics).
  3. Update the Jenkinsfiles for each project to use their new specific runner/role.
  4. Deprecate and delete the original shared role.

This was painful. It took 2 weeks of coordinating with Product Owners to get "infrastructure maintenance" time. But it was the only way to achieve true isolation.
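The cloning itself is scriptable. A condensed boto3 sketch of steps 1 and 2 (real migrations also need inline policies and instance profiles copied, plus a deprecation window; this only handles the trust policy and managed policy attachments):

```python
import json

def team_role_name(shared_name, team):
    """Shared-Jenkins-Slave + 'payments' -> Payments-Jenkins-Slave."""
    return f"{team.capitalize()}-{shared_name.removeprefix('Shared-')}"

def clone_role_for_team(iam, shared_name, team):
    """Clone a shared role into a per-team copy, tagged for ABAC."""
    src = iam.get_role(RoleName=shared_name)["Role"]
    new_name = team_role_name(shared_name, team)
    iam.create_role(
        RoleName=new_name,
        AssumeRolePolicyDocument=json.dumps(src["AssumeRolePolicyDocument"]),
        Tags=[{"Key": "access-project", "Value": team}],  # the ABAC anchor
    )
    # Copy managed policy attachments from the shared role.
    for ap in iam.list_attached_role_policies(RoleName=shared_name)["AttachedPolicies"]:
        iam.attach_role_policy(RoleName=new_name, PolicyArn=ap["PolicyArn"])
    return new_name

if __name__ == "__main__":
    import boto3
    clone_role_for_team(boto3.client("iam"), "Shared-Jenkins-Slave", "payments")
```

Note the clone is created with the access-project tag in the same CreateRole call, which is exactly what the SCP from the previous section demands.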

Once the roles were tagged, we deployed the new Policy updates to the Dev environment first.

Result in Dev:

  • 3 Deployments failed.
  • Reason: One team was using a shared "Common-Infra" role for their specific app deployment.
  • Fix: As described above, we forced them to split the role. This was actually a huge architectural improvement—decoupling dependencies.

Part 7: Continuous Verification (The "Never Again" Guard)

Fixing it once isn't enough. Entropy exists. A new junior dev will join. A specialized consultant will ask for exceptions. A catastrophic incident will lead to a "temporary" open policy that never gets closed.

We moved security left. We added a check in our CI/CD pipeline (using Checkov) to scan Terraform plans for messy PassRole permissions.

The Custom Checkov Policy (passrole.yaml):

metadata:
  name: "Ensure IAM PassRole is not open to world"
  id: "CKV_AWS_CUSTOM_001"
  category: "IAM"
definition:
  cond_type: "attribute"
  resource_types:
    - "aws_iam_policy"
    - "aws_iam_role_policy"
  attribute: "policy"
  operator: "json_path_ne"
  value: "$..Statement[?(@.Action=='iam:PassRole' && @.Resource=='*')]"

If a developer tries to commit Resource: "*" for PassRole, the build fails before it even applies.

Infrastructure Drift Detection

We also use Driftctl. Driftctl warns us if someone manually changed an IAM policy in the AWS Console, bypassing Terraform.

driftctl scan --filter "Type=='aws_iam_policy'"

We run this nightly. If a change is detected that isn't in Git, an alert fires to the Security channel. "Who touched Prod IAM at 2 PM?" is a question we can now answer in minutes, not months.

Forensic Analysis: Have I already been breached?

If you are reading this and sweating, wondering if someone has ALREADY used this against you, here is how to check. You need to query AWS CloudTrail.

You are looking for AssumeRole events where the calling identity is one of your Build Agents (or a role that has PassRole permission).

CloudTrail Athena Query:

SELECT
 eventTime,
 eventName,
 userIdentity.sessionContext.sessionIssuer.userName AS caller_role,
 requestParameters
FROM
 cloudtrail_logs
WHERE
 eventName = 'RunInstances'
 AND requestParameters LIKE '%iamInstanceProfile%'
ORDER BY
 eventTime DESC;

Look for instances where your JenkinsRole created an EC2 instance with a suspicious profile (like AdministratorAccess or OrganizationAccountAccessRole). If you see your Jenkins role creating resources with Admin profiles, and that wasn't an authorized action, you have a breach.
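If you don't have CloudTrail wired into Athena, the same hunt works through the CloudTrail LookupEvents API. A boto3 sketch (the suspicious-profile list is an example; substitute your own "crown jewel" roles):

```python
import json

SUSPICIOUS_PROFILES = {"AdministratorAccess", "OrganizationAccountAccessRole"}

def is_suspicious_run_instances(cloudtrail_event_json):
    """True if a RunInstances event attached a high-privilege
    instance profile (matched by name or ARN substring)."""
    detail = json.loads(cloudtrail_event_json)
    profile = (detail.get("requestParameters") or {}).get("iamInstanceProfile") or {}
    name = profile.get("name") or profile.get("arn") or ""
    return any(p in name for p in SUSPICIOUS_PROFILES)

if __name__ == "__main__":
    import boto3
    ct = boto3.client("cloudtrail")
    pages = ct.get_paginator("lookup_events").paginate(
        LookupAttributes=[{"AttributeKey": "EventName",
                           "AttributeValue": "RunInstances"}])
    for page in pages:
        for ev in page["Events"]:
            if is_suspicious_run_instances(ev["CloudTrailEvent"]):
                print("INVESTIGATE:", ev["EventId"], ev["EventTime"])
```

Keep in mind LookupEvents only reaches back 90 days; anything older needs the Athena route above.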

Part 8: The "Break Glass" Scenario

One question I always get when I present this solution: "What if everything breaks and we need Admin access NOW?"

Locking down PassRole is strict. If the tagging logic is broken, or a tag is accidentally deleted, you might lock yourself out of deployments.

We instituted a "Break Glass" Procedure.

  1. We have a specific IAM Role: OrganizationAccountAccessRole.
  2. This role is exempt from the SCPs (via a NotPrincipal condition in the SCP).
  3. Assuming this role triggers a PagerDuty alert to the entire SRE team immediately.
  4. This role has full * access.

We wired this to our incident response platform. The moment OrganizationAccountAccessRole is assumed, a P0 Critical incident is created in PagerDuty and the on-call engineer's phone rings. This ensures that the "God Role" is never used silently. It forces a conversation: "Why are you using this? Is the automation broken?" We use this role only to fix the tags if the automation system breaks. It has been used exactly once in 2 years.
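One way to wire the "role assumed, phone rings" behavior is an EventBridge rule matching AssumeRole calls for that specific role, delivered via CloudTrail. A sketch, assuming a hypothetical SNS topic as the alert target (your PagerDuty integration may hang off that topic):

```python
import json

BREAK_GLASS_ROLE_ARN = "arn:aws:iam::123456789012:role/OrganizationAccountAccessRole"
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:security-pager"  # hypothetical

def break_glass_event_pattern(role_arn):
    """EventBridge pattern matching AssumeRole calls for one role,
    as recorded by CloudTrail."""
    return {
        "source": ["aws.sts"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["sts.amazonaws.com"],
            "eventName": ["AssumeRole"],
            "requestParameters": {"roleArn": [role_arn]},
        },
    }

if __name__ == "__main__":
    import boto3
    events = boto3.client("events")
    events.put_rule(
        Name="break-glass-assumed",
        EventPattern=json.dumps(break_glass_event_pattern(BREAK_GLASS_ROLE_ARN)))
    events.put_targets(Rule="break-glass-assumed",
                       Targets=[{"Id": "pager", "Arn": ALERT_TOPIC_ARN}])
```

The point is not the plumbing; it is that the God Role cannot be used silently.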

Part 9: Advanced ABAC Patterns

If you want to take this even further, you can introduce Temporal Restrictions.

For highly sensitive roles (like Database Migrations), we don't just require the access-project tag. We require a session-tag indicating an approved Change Request (CR).

In the Condition block:

"Condition": {
    "StringLike": {
         "aws:PrincipalTag/ChangeRequest": "CR-*"
    }
}

This ensures that the automated system can only pass these sensitive roles if the pipeline run itself is tagged with a valid Change Request ID. This links your ITSM (Jira/ServiceNow) directly to your IAM authorization. That is the holy grail of compliance.
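On the caller's side, the pipeline supplies that ChangeRequest value as a session tag when it assumes the sensitive role (the role ARN below is hypothetical). STS session tags become aws:PrincipalTag values for the life of the session, which is what the StringLike condition above inspects:

```python
import fnmatch

def cr_tag_satisfies_condition(cr_value, pattern="CR-*"):
    """Mimic the StringLike match on aws:PrincipalTag/ChangeRequest.
    IAM's '*' wildcard behaves like shell-style globbing here."""
    return fnmatch.fnmatchcase(cr_value, pattern)

if __name__ == "__main__":
    import boto3
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/db-migration",  # hypothetical
        RoleSessionName="pipeline-run-42",
        Tags=[{"Key": "ChangeRequest", "Value": "CR-2024-0117"}],
    )["Credentials"]
```

A pipeline run without a valid CR-prefixed tag simply cannot pass the sensitive role, no matter what its identity policy says.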

Ephemeral Roles

Another pattern we explored is Ephemeral Roles. Instead of having persistent roles, we use HashiCorp Vault to vend AWS credentials that only exist for 15 minutes. This reduces the attack surface significantly. However, PassRole is still needed for the infrastructure these credentials build. The ABAC pattern remains relevant even with ephemeral creds.

Part 10: The Cultural Shift

The hardest part wasn't the code. It was the culture.

We had to shift our team's mindset from "Convenience First" to "Least Privilege First."

  • Developers complained: "I can't just spin up a quick test Lambda anymore!"
  • Management worried: "Is this going to slow down our feature release velocity?"

My answer was simple: "Security is a quality gate, just like unit tests." We didn't slow down. In fact, we sped up. Because we had better isolation, teams stopped stepping on each other's toes. The Payments API couldn't accidentally break the Logistics database helper because it literally couldn't touch the role.

We gamified the migration. Teams that tagged their roles first got "Gold Star" status on the dashboard. It sounds silly, but it worked.

Part 11: The Ultimate IAM Security Checklist

If you are a DevOps Lead, here is your Monday morning checklist. Do not leave the office until you have verified these.

  1. [ ] Audit PassRole: Run CloudSplaining. Identify every role with iam:PassRole and Resource: *.
  2. [ ] Tagging Standard: Define a standard project tag (e.g., access-project) in your organization.
  3. [ ] SCP Enforcement: Apply an SCP to prevent regular users from nuking tags.
  4. [ ] Terraform Module: Create a "Safe" PassRole module that abstracts the ABAC logic.
  5. [ ] CI/CD Policy: Ensure your CI/CD runner is blocked from creating untagged roles.
  6. [ ] Drift Detection: Enable a tool like Driftctl to catch manual console clicks.
  7. [ ] CloudTrail Alert: Set up a CloudWatch Alarm for RunInstances calls that use Admin profiles.
  8. [ ] Break Glass: Test your "Break Glass" procedure. Does it work? Does it alert everyone?

Glossary of Terms

For the uninitiated, here is a quick reference to the terms used in this "War Story".

  • IAM (Identity and Access Management): The service that manages access to AWS resources. "Who can do what."
  • PassRole: Specifically, the permission iam:PassRole. It allows a principal to assign a role to a service (like EC2 or Lambda).
  • Principal: A user, role, or application that can make a request for an action or operation on an AWS resource.
  • ABAC (Attribute-Based Access Control): Using attributes (tags) to define permissions, rather than specific identities.
  • RBAC (Role-Based Access Control): Using roles to define permissions. "Only Admins can do X."
  • SCP (Service Control Policy): A policy type in AWS Organizations that manages permissions in your organization. It acts as a guardrail.
  • CloudTrail: A service that records AWS API calls for your account. The flight recorder of AWS.
  • Checkov: A static code analysis tool for infrastructure as code (IaC).
  • PMapper: Principal Mapper. An open-source tool for identifying risks in the configuration of AWS IAM.
  • Privilege Escalation: The act of exploiting a bug, design flaw, or configuration oversight to gain elevated access to resources that are normally protected.

If iam:PassRole kept you up at night, there are a few other "Silent Killers" you should look for in your audits.

iam:CreateLoginProfile

If an attacker can CreateLoginProfile on an existing user, they can potentially reset the password (even for an Admin user) and log in to the console. This is often overlooked in "User Management" delegation policies.

iam:UpdateAssumeRolePolicy

This is the "Backdoor Creator". If a user can update the "Trust Policy" of a role, they can edit the AdministratorAccess role to say "Trust Me". They can then assume the role and become Admin.

iam:AttachUserPolicy

Limit this permission strictly. A user who can attach a policy can attach AdministratorAccess to themselves. Always use Permissions Boundaries (iam:PermissionsBoundary) when delegating this permission.
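All four of these "Silent Killers" can go on the same watchlist. A sketch of a scanner that flags unconditioned grants of any of them, expanding wildcards like iam:* against the list:

```python
import fnmatch

DANGEROUS_IAM_ACTIONS = {
    "iam:PassRole",
    "iam:CreateLoginProfile",
    "iam:UpdateAssumeRolePolicy",
    "iam:AttachUserPolicy",
}

def dangerous_grants(policy_doc):
    """Return the set of escalation-capable IAM actions a policy
    grants without any Condition attached."""
    found = set()
    stmts = policy_doc.get("Statement", [])
    stmts = stmts if isinstance(stmts, list) else [stmts]
    for s in stmts:
        if s.get("Effect") != "Allow" or "Condition" in s:
            continue
        actions = s.get("Action", [])
        actions = actions if isinstance(actions, list) else [actions]
        for pattern in actions:
            found |= {a for a in DANGEROUS_IAM_ACTIONS
                      if fnmatch.fnmatchcase(a.lower(), pattern.lower())}
    return found
```

Anything this returns for a non-admin role deserves the same tag-condition treatment we gave PassRole.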

Conclusion

The journey from Resource: * to a fully tagged, ABAC-driven IAM architecture was long and filled with uncomfortable meetings. We had to tell fast-moving product teams to "slow down and tag your resources." We had to audit thousands of legacy lines of code.

But the result is peace of mind.

We moved from a "Soft Shell" security model (hard on outside, soft on inside) to actual Zero Trust Principles.

Now, if a Build Agent is compromised:

  1. It can only create resources for its specific project.
  2. It can only pass roles that belong to its specific project.
  3. It cannot create a new Admin role to escalate privileges.

The blast radius is contained. The "Silent Killer" has been disarmed.

IAM is rightly criticized for being complex. But in that complexity lies granular power. If you master Condition keys, you master the cloud. Don't let the default settings lull you into a false sense of safety.

Go check your policies now. I'll wait.

Found this useful? This is part of my series on "Real World DevOps." I dissect the actual incidents that kept me up at 2 AM so you can sleep soundly.

