---
title: "Agent Looping and Systems Engineering: Building Reliable AI"
author: "Rantideb Howlader"
date: "2026-06-09T00:00:00.000Z"
canonical_url: "https://www.ranti.dev/blog/what-is-agent-looping"
license: "CC-BY-4.0"
---


## 1. The Trap of Prompt Engineering

A few months ago, I was staring at my terminal at 2 AM. I was building an automated research pipeline. I was tweaking a single AI system prompt for the fiftieth time. Sometimes the output was brilliant. Other times, it was an absolute disaster. I was dealing with what engineers now call "AI Slop."

AI Slop is the direct result of probabilistic generation. Large Language Models (LLMs) are autoregressive token predictors. They do not think. They simply calculate the probability of the next token. When you ask an LLM to generate code or a JSON payload, it guesses the structure. Often, it hallucinates a key value or breaks the syntax. It fails silently. If you are generating blog content, the LLM defaults to the average of its training data. This results in hollow, robotic writing.

Like most developers, I thought the problem was my instructions. I tried few-shot prompting. I lowered the temperature setting to 0.1 to make the responses less random. I stuffed the context window with documents pulled in through Retrieval-Augmented Generation (RAG). Nothing completely fixed the issue. The slop kept slipping through.

Then I realized my fundamental mistake. I am a software engineer. I would never push standard code to production without a CI/CD pipeline. I write unit tests in Jest. I write integration tests. If a test fails, the build breaks. Yet, I was taking raw, unverified API responses from an LLM and pushing them directly to the end user. I was treating the LLM like an oracle instead of untrusted code. I did not have a prompt problem. I had a systems engineering problem.

## 2. The Missing Layer: The Eval Loop

The moment I stopped treating AI like magic, the architecture became clear. The secret is not a better prompt. The secret is an Eval Loop. An Eval Loop is a repeatable, automated testing pipeline. It scores your AI's output against a ruthless standard before it executes any final actions.

Here is the exact technical pipeline I build for my agents today:

```mermaid
graph TD
    A[1. Generate <br/> Primary LLM executes task] -->|Produces JSON/Text| B(2. Score <br/> Fast Judge Model evaluates)
    B -->|Grades against Rubric| C{3. Catch <br/> Threshold Check}
    C -->|< 0.7| D[4. Fix <br/> Error logs attached, sent back]
    D -->|Rework| A
    C -->|>= 0.7| E[5. Re-score and Ship <br/> Merged to production]
```

**1. Generate:** The primary LLM (like Claude 3.5 Sonnet or GPT-4o) executes the heavy lifting. It generates the initial output.
**2. Score:** A separate, faster judge model (like Claude 3 Haiku or GPT-4o-mini) intercepts the output. It runs a validation script. It checks the output against a strict grading rubric.
**3. Catch:** The system calculates a final score. If the score falls below my hard threshold, the output is blocked from shipping.
**4. Fix:** The system takes the failed output. It attaches the judge model's critique and any JSON parsing errors. It sends the payload back to the primary LLM for a rewrite.
**5. Re-score and Ship:** The primary LLM tries again. The loop continues until the validation passes. Only then does the code execute.
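
The five steps above can be sketched as a small Python function. This is a minimal sketch, not my production code: `generate` and `judge` stand in for the primary and judge model calls, the `Verdict` type and the stub models are hypothetical, and `MAX_ATTEMPTS` is an assumed retry cap so a stubborn failure cannot loop forever.

```python
from dataclasses import dataclass
from typing import Callable

THRESHOLD = 0.7     # passing score used throughout this post
MAX_ATTEMPTS = 3    # assumption: retry cap so the loop always terminates


@dataclass
class Verdict:
    score: float    # 0.0 to 1.0, as returned by the judge
    critique: str   # feedback attached to the next attempt


def eval_loop(
    task: str,
    generate: Callable[[str, str], str],  # (task, feedback) -> draft
    judge: Callable[[str], Verdict],      # draft -> Verdict
) -> str:
    feedback = ""
    for _ in range(MAX_ATTEMPTS):
        draft = generate(task, feedback)   # 1. Generate
        verdict = judge(draft)             # 2. Score
        if verdict.score >= THRESHOLD:     # 3. Catch
            return draft                   # 5. Ship
        feedback = verdict.critique        # 4. Fix
    raise RuntimeError(f"no draft reached {THRESHOLD} after {MAX_ATTEMPTS} attempts")


# Stub models so the sketch runs without any API calls.
def fake_generate(task: str, feedback: str) -> str:
    return f"{task} (revised)" if feedback else task


def fake_judge(draft: str) -> Verdict:
    ok = draft.endswith("(revised)")
    return Verdict(1.0 if ok else 0.4, "" if ok else "too vague, revise")
```

Calling `eval_loop("summarize report", fake_generate, fake_judge)` fails the first draft, feeds the critique back, and returns the revised draft on the second attempt.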

### Building Mathematical Benchmarks

To make an Eval Loop work, you need a repeatable benchmark. You cannot ask the judge model to "check if this is good." You need a structured rubric. When I build a benchmark, I construct three strict components.

First, I build the **Test Cases**. This is my gold standard data. I search my PostgreSQL database for 50 examples of my absolute best work. I pair the initial input data with the perfect final output. I pass this array to the judge model as the baseline truth.

Second, I define the **Metrics**. I turn subjective quality into a mathematical array. I use a library like Zod or Pydantic to enforce strict JSON schemas. If the output is content, the judge model scores specific criteria. Does it use passive voice? Does it hallucinate URLs? Each criterion gets a boolean value or a float between 0 and 1.

Third, I set the **Threshold**. This is the line of death. I typically set my minimum passing score at 0.7. If an agent's output scores a 0.69, the pipeline kills the process. It throws an exception. I make zero exceptions. If you do not have a hard threshold, you are not testing your AI.
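
Here is a minimal sketch of the metrics and threshold pieces. The criterion names are hypothetical, and the unweighted mean is an assumption on my part; a real rubric might weight criteria or fail on any single hard violation.

```python
THRESHOLD = 0.7  # minimum passing score


class BelowThreshold(Exception):
    """Raised when an output scores under the hard threshold."""


def aggregate(scores: dict[str, float]) -> float:
    """Average per-criterion scores; booleans count as 1.0 or 0.0."""
    if not scores:
        raise ValueError("rubric has no criteria")
    values = [float(v) for v in scores.values()]
    for name, value in zip(scores, values):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return sum(values) / len(values)  # assumption: unweighted mean


def gate(scores: dict[str, float]) -> float:
    """Return the final score, or raise if it falls below the threshold."""
    total = aggregate(scores)
    if total < THRESHOLD:
        raise BelowThreshold(f"score {total:.2f} is below {THRESHOLD}")
    return total
```

A score of 0.69 raises `BelowThreshold`. There is no rounding up.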

### Wiring It Into Hermes: The 6 Moves

You cannot just have a benchmark on a whiteboard. You must wire it into your daily tools. I use a system called Hermes. I wire it up using six specific technical moves.

**1. Install Telegram or Slack.** The eval gate must be able to interrupt you. It needs a fast channel.
**2. Load Persistent Memory.** I take those 50 best test cases. I load them into a vector database. This is the gold standard memory.
**3. Turn the Rubric into a Skill.** The rubric is not a spreadsheet. I write a Python script. The script takes the output and scores it 0 to 1 for every single criterion.
**4. Make the Test Suite an Executable.** The test suite is an active code package. It runs automatically in my CI/CD pipeline.
**5. Gate Every Change.** I build a Slack approval button. If a change fails the test, a webhook sends a button to Slack. I must click "Approve" or "Reject".
**6. Set a CRON Job.** I deploy an AWS EventBridge CRON schedule. It runs every hour. It scores live production samples. It pings me the moment the quality line dips.

Hermes does not ship with a magic "Evals" button. You must assemble these primitives yourself. You build it once, and you own the quality gate forever.

## 3. The Compounding Loop of Perfection

The most powerful aspect of this architecture is the compounding effect. The system hardens itself over time. In the early days, an agent will still occasionally ship a bad piece of work. The judge model might miss a subtle hallucination. When this happens, I do not rewrite the agent's code. I update the test suite.

I wire my agents directly into my workspace using webhooks. When an agent completes a task, it sends a notification payload to my private Slack channel. If I see a bad output in Slack, I click a custom thumbs-down emoji reaction.

```mermaid
graph TD
    A[Bad output ships] --> B[Spotted in Slack channel]
    B --> C[Tap thumbs-down reaction]
    C --> D[Webhook triggers serverless function]
    D --> E[System writes new Jest test case]
    E --> F[Every future run must pass this check]
    F -.->|The system hardens automatically| A
```

This emoji reaction triggers an AWS API Gateway webhook. The webhook fires a serverless Lambda function. The function takes the bad output and writes a brand new regression test. It commits this test to the Eval suite repository. From that day forward, every run must pass this new edge case. The same hallucination should not ship twice. While I sleep, the system encounters errors, adds tests for them, and raises its own quality floor.
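
The capture step can be sketched in a few lines. This version swaps the Jest test file for a plain JSONL file so it runs with the Python standard library alone. The function names and the record fields are hypothetical, and the exact-match check is the simplest possible rule; a real suite would likely match on the failure pattern instead of the literal string.

```python
import json
from pathlib import Path


def record_regression(suite: Path, task_input: str, bad_output: str, reason: str) -> int:
    """Append a flagged output to the JSONL suite and return the suite size."""
    case = {"input": task_input, "must_not_match": bad_output, "reason": reason}
    with suite.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(case) + "\n")
    return len(suite.read_text(encoding="utf-8").splitlines())


def passes_regressions(suite: Path, task_input: str, candidate: str) -> bool:
    """Reject a candidate that repeats a previously flagged output."""
    if not suite.exists():
        return True
    for line in suite.read_text(encoding="utf-8").splitlines():
        case = json.loads(line)
        if case["input"] == task_input and case["must_not_match"] == candidate:
            return False
    return True
```

The webhook handler would call `record_regression`, and the eval gate would call `passes_regressions` before anything ships.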

## 4. Moving from Chat to Agent Looping

Once you secure the output layer, you must change how you trigger the AI. Writing individual prompts in a chat window is painfully slow. It requires constant human babysitting. True automation requires a state machine. You must move to Agent Looping.

I no longer prompt coding agents. I design loops that prompt my agents. My job is to write the factory, not to operate the machine. For years, we held the agent's hand. We typed a prompt, read the output, and typed the next prompt. That era is over.

Instead of writing a single prompt, you write a `while` loop in Node.js or Python. The loop runs continuously until a separate judge model verifies the exact definition of "done." You build a system that finds the work, delegates it, checks it, and decides the next move.

### The Single-Agent Loop

When I automate a standalone script, I deploy a Single-Agent Loop. It runs autonomously through five distinct computational stages.

**1. Discovery:** The agent boots up. It executes API calls to gather required context. It might query a vector database or scrape a target URL using Puppeteer.
**2. Planning:** The agent pauses. It writes a structured JSON array detailing its execution steps. It breaks the main goal into a dependency graph.
**3. Execution:** The agent traverses the dependency graph. It executes function calls. It writes the required files or triggers the necessary external APIs.
**4. Verification:** The agent invokes the local judge model. It runs its own internal Eval check. It validates its own JSON outputs against the schema.
**5. Iteration:** If the validation fails, the `while` loop catches the error. The agent ingests the stack trace, adjusts its parameters, and tries again.
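
A rough sketch of those five stages as one `while` loop. The four callables are placeholders for whatever your agent actually does at each stage, and `MAX_ITERATIONS` is an assumed safety cap that the stages above do not mention.

```python
from typing import Callable

MAX_ITERATIONS = 5  # assumption: hard cap so a failing run cannot loop forever


def single_agent_loop(
    goal: str,
    discover: Callable[[str], dict],           # gathers context
    plan: Callable[[str, dict, list], list],   # returns ordered steps
    execute: Callable[[list], dict],           # runs the steps
    verify: Callable[[dict], list],            # returns a list of errors
) -> dict:
    context = discover(goal)                   # 1. Discovery
    errors: list = []
    iteration = 0
    while iteration < MAX_ITERATIONS:
        iteration += 1
        steps = plan(goal, context, errors)    # 2. Planning
        result = execute(steps)                # 3. Execution
        errors = verify(result)                # 4. Verification
        if not errors:
            return {"result": result, "iterations": iteration}
        # 5. Iteration: the errors feed the next planning pass
    raise RuntimeError(f"gave up after {MAX_ITERATIONS} iterations: {errors}")
```

The key design choice is that `plan` receives the previous errors. Without that, the loop just retries the same mistake.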

### The Fleet Loop

For massive engineering projects, a single agent hits context window limits. It forgets early instructions. That is when I deploy a Fleet Loop using an orchestration framework.

I build an **Orchestrator Agent**. This is the master node. The Orchestrator owns the primary state object. It does not execute raw tasks. It parses the main objective and delegates sub-tasks to **Specialist Agents**.

Those Specialist Agents operate on specific domains. They have access to narrow tools. They might delegate even further to specialized **Subagents**. Every single node in this tree runs its own `while` loop. The subagent verifies its own code. The specialist verifies the subagent's code. The orchestrator verifies the final merge. An Eval Gate sits at the very end of the pipeline. It is a recursive fractal of quality checks.
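
The nesting can be sketched as a tree where every node retries until its own check passes. This is an illustrative sketch only: the `Node` class is hypothetical and does not come from any orchestration framework, and joining child outputs with newlines is an assumption about how results merge.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:
    name: str
    check: Callable[[str], bool]                  # this node's own verification
    work: Optional[Callable[[str], str]] = None   # leaf task; None for delegators
    children: list["Node"] = field(default_factory=list)
    max_attempts: int = 3                         # assumption: per-node retry cap

    def run(self, task: str) -> str:
        for _ in range(self.max_attempts):
            if self.children:
                # Delegate, then merge whatever the children verified.
                output = "\n".join(child.run(task) for child in self.children)
            elif self.work is not None:
                output = self.work(task)
            else:
                raise ValueError(f"{self.name} has neither work nor children")
            if self.check(output):
                return output
        raise RuntimeError(f"{self.name} failed verification")
```

Note that retries multiply down the tree in this sketch. A parent that retries will rerun its children, so the caps need to stay small.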

## 5. The 5 Primitives of Autonomous Engineering

If you want to stop babysitting your AI, you need a strict mechanical harness. I build every loop using five exact primitives, plus a persistent state.

**1. Automations (The Heartbeat).** A loop needs a pulse. I do not trigger my agents manually. I use CRON jobs. They wake up on a schedule. They read yesterday's CI failures. They triage open issues. They find the work autonomously.
**2. Worktrees (The Isolation).** When you run multiple agents, they will collide. They will overwrite the same file. I isolate them using Git worktrees. Every agent gets a fresh, isolated branch and directory. They never step on each other's toes.
**3. Skills (The Intent).** Without skills, an agent starts every session cold. It guesses your project conventions. I write my intent down in `SKILL.md` files. The agent reads this context every single run. It compounds knowledge instead of starting from zero.
**4. Connectors (The Hands).** An agent trapped in a terminal is useless. I build Model Context Protocol (MCP) connectors. The agent queries the staging database. It reads the Linear ticket. It opens the GitHub PR. It acts in the real environment.
**5. Sub-agents (The Maker/Checker Split).** You must separate the one who writes from the one who grades. A model is too nice when grading its own homework. I spawn a cheap, fast model to explore. I spawn a heavy model to write. I spawn a ruthless, isolated model to verify.

Then, there is the sixth primitive: **State Memory**. The model forgets everything between runs. The repository does not. The memory must live on disk as a Markdown file or a live database. It tracks what was tried, what failed, and what happens tomorrow.
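
A minimal sketch of state memory on disk. It uses a JSON file instead of Markdown to keep the parsing trivial. The three keys mirror "what was tried, what failed, and what happens tomorrow"; the file layout and function names are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read the previous run's state, or start empty on the first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"tried": [], "failed": [], "next": []}


def save_run(path: Path, tried: list, failed: list, next_steps: list) -> dict:
    """Merge this run into the state file and write it back to disk."""
    state = load_state(path)
    state["tried"].extend(tried)     # history accumulates across runs
    state["failed"].extend(failed)
    state["next"] = next_steps       # the plan for tomorrow is replaced, not appended
    state["updated_at"] = datetime.now(timezone.utc).isoformat()
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state
```

The agent calls `load_state` at boot and `save_run` before it exits. Committing the file to the repository gives you the history for free.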

## 6. My Playbook: Open vs. Closed Agents

When other engineers ask me how to build agents, I explain the difference in tool surfaces. I strictly divide my architecture into Open Agents and Closed Agents.

```mermaid
graph LR
    subgraph Open Explorer Agent
        CLI[Terminal CLI] --> OA[Open Agent]
        OA -->|Access to massive tool surface| Tools[SerpAPI, Puppeteer, SQL, Vector DB]
        Tools --> Output[Markdown / JSON Digest]
    end

    subgraph Closed Factory Agent
        CRON[AWS EventBridge CRON] --> CA[Closed Niche Agent]
        CA -->|Hardcoded API targets| SpecificTools[X API + Content Embeddings]
        SpecificTools --> Report[PostgreSQL Database]
        Report -->|Loop closes here| CB[Next Pipeline Stage]
    end
```

### Open Looping (The Explorer)

I deploy Open Agents for exploratory data analysis. I give these agents a massive tool surface. I bind them to LangChain toolkits containing SerpAPI for web search, Puppeteer for DOM parsing, and raw SQL adapters for database querying.

I execute them via a CLI command. They fan out across the internet. They pull massive amounts of unstructured data, synthesize it, and output a detailed Markdown report.

However, Open Agents burn an insane amount of tokens. Because their context window fills rapidly with scraped HTML, they cost dollars per run. They are prone to infinite loops if a website blocks their scraper. They are powerful, but they require human oversight.
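
One cheap guard against runaway cost and infinite loops is a hard budget that the agent must charge on every step. This is a sketch under my own assumptions: the `RunBudget` class is hypothetical, and the token counts would come from whatever usage numbers your model API reports.

```python
class BudgetExceeded(Exception):
    """Raised when an agent run goes over its step or token budget."""


class RunBudget:
    def __init__(self, max_steps: int, max_tokens: int) -> None:
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.steps = 0
        self.tokens = 0

    def charge(self, tokens: int) -> None:
        """Record one tool call or model call; raise once a limit is crossed."""
        self.steps += 1
        self.tokens += tokens
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")
```

A scraper stuck retrying a blocked site burns steps, hits the limit, and dies loudly instead of quietly draining the account.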

### Closed Looping (The Factory)

Closed Agents are the backbone of a reliable system. A Closed Agent does one extremely narrow job, but it runs perfectly thousands of times.

I restrict its environment. For example, my competitive intelligence agent cannot access Google. It is hardcoded to ping exactly 50 specific Twitter IDs using an undocumented API endpoint. It runs entirely on an AWS EventBridge CRON expression (`cron(0 10 ? * FRI *)`).

Every Friday at 10 AM UTC, the Lambda function spins up. It scrapes the specific JSON feeds. It feeds the raw text into an LLM to extract entity relationships. It saves the structured data directly to my PostgreSQL database. Because the parameters are entirely locked down, it almost never fails. It uses minimal tokens. Closed loops provide the stability that Open Agents cannot.
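
The shape of a Closed Agent is a fixed pipeline over fixed targets. Here is a minimal sketch. The IDs are placeholders, and `fetch`, `extract`, and `store` are hypothetical stand-ins for the feed request, the LLM extraction step, and the database write.

```python
from typing import Callable

# Placeholder targets; the agent described above uses 50 hardcoded IDs.
TARGET_IDS = ("1001", "1002", "1003")


def run_closed_agent(
    fetch: Callable[[str], str],         # target id -> raw text
    extract: Callable[[str], dict],      # raw text -> structured record
    store: Callable[[str, dict], None],  # persists one record
) -> dict:
    saved, failed = 0, []
    for target_id in TARGET_IDS:         # no search, no tool choice: only these IDs
        try:
            store(target_id, extract(fetch(target_id)))
            saved += 1
        except Exception as exc:         # one bad feed should not sink the whole run
            failed.append((target_id, str(exc)))
    return {"saved": saved, "failed": failed}
```

The agent never decides where to look. That is the whole point: fewer decisions, fewer ways to fail.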

## 7. The Secret Sauce: Building the GBRAIN

A standard LLM has amnesia. It wakes up completely blank on every single API call. If you want your agents to perform like senior engineers, you must give them persistent memory. You must build a **GBRAIN**.

```mermaid
graph TD
    L1[1. Source-of-Truth Docs] --> GBRAIN
    L2[2. Workflows & SOPs] --> GBRAIN
    L3[3. Examples of Good] --> GBRAIN
    L4[4. Decision Logs] --> GBRAIN
    L5[5. Ownership Map] --> GBRAIN
    L6[6. Customer & Market] --> GBRAIN
    L7[7. Permissions] --> GBRAIN
    L8[8. Feedback Loops] --> GBRAIN

    GBRAIN((GBRAIN<br/>Vector Database)) --> Agents[Agent Memory Retrieval]
```

My GBRAIN is a centralized vector database running on Pinecone or pgvector. I take all my company documents, chunk them into small text blocks, and convert them into numerical embeddings. When an agent boots up, it runs a cosine similarity search against the vector database to retrieve context before it makes a single decision.
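
For readers who have not seen it written out, cosine similarity search is small enough to sketch in pure Python. A real vector database does this with approximate indexes at scale; the toy in-memory `index` dictionary and the `top_k` helper here are illustrative assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query embedding."""
    ranked = sorted(
        index,
        key=lambda chunk_id: cosine_similarity(query, index[chunk_id]),
        reverse=True,
    )
    return ranked[:k]
```

With a toy index of `{"brand": [1, 0], "sop": [0, 1], "mix": [1, 1]}` and a query of `[1, 0]`, the two best matches are `brand` and then `mix`.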

I feed the GBRAIN eight specific data layers:

**1. Source-of-Truth Docs:** I embed my brand guidelines, markdown documentation, and API specifications.
**2. Workflows & SOPs:** I embed my CI/CD deployment checklists and QA procedures. The agent retrieves these steps before executing code.
**3. Examples of Good:** I embed large JSON arrays of successful past outputs. Taste is difficult to prompt, but retrieving similar good examples into the context steers the model toward the right style.
**4. Decision Logs:** I document why we chose Next.js over a plain React single-page app, or PostgreSQL over MongoDB. When the agent plans an architecture, it reads these logs to avoid suggesting stacks we already rejected.
**5. Ownership Map:** I map out the exact IAM roles and Slack IDs of my team. The agent knows exactly who to ping for database approval.
**6. Customer & Market Data:** I embed transcripts from customer interviews. The agent uses this semantic context to align its tone.
**7. Permissions & Boundaries:** I embed strict security rules. The agent reads the system limits before executing any external HTTP requests.
**8. Feedback Loops:** I embed the error logs and stack traces from previous failures. The agent checks this index to avoid repeating historical crashes.

## 8. Adding an Agent to a Vertical: The 9 Pillars

When you add an agent to a specific vertical in your business, you must treat it like a human. You map it exactly like a human role.

Every single vertical agent needs exactly 9 things to succeed. If you miss one, the agent fails.

**1. Context:** The agent must know what the vertical needs to know to operate.
**2. Data:** The agent needs access to SQL databases, raw examples, and past work.
**3. Standards:** The agent must know what good looks like. It must have the reasons attached.
**4. Tools:** The agent needs scripts to read, write, scrape, send, update, and coordinate.
**5. Boundaries:** The agent must know what it must never touch. It must know what needs sign-off first.
**6. Delegation:** The agent must know which other specialist agents or humans it can route work to.
**7. Evals:** The agent must know how it checks quality before the work moves on.
**8. Human Review:** The agent must know where human judgment, taste, or final approval is required.
**9. Memory:** The agent must know what gets saved after each run. This is how the system gets sharper.

These are the exact same things a human owner needs. Context, authority, feedback, and a clear escalation path.
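
One way to enforce the nine pillars is to make them fields on a spec and refuse to deploy when any is empty. This is a sketch of that idea: the `VerticalAgentSpec` class is hypothetical, and free-text fields are an assumption made for brevity.

```python
from dataclasses import dataclass, fields


@dataclass
class VerticalAgentSpec:
    context: str = ""
    data: str = ""
    standards: str = ""
    tools: str = ""
    boundaries: str = ""
    delegation: str = ""
    evals: str = ""
    human_review: str = ""
    memory: str = ""


def missing_pillars(spec: VerticalAgentSpec) -> list[str]:
    """Return the names of any pillars left blank."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]
```

A deploy script can call `missing_pillars(spec)` and abort if the list is not empty.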

### Case Study: The PR Newswire Flow

Let us look at a real-world example: a PR Newswire flow, mapped end to end.

```mermaid
graph TD
    A[New campaign request] --> B[PR Vertical Owner Agent]
    GBRAIN[Agency GBRAIN] -.->|Context & Past Wins| B
    B -->|Delegates writing| C[Writer Agent]
    C -->|Drafts article| D{Eval Loop}
    D -->|Fails check| C
    D -->|Passes check| E[Human Review Gate]
    E -->|Approved| F[Ship & Run Ops]
```

A marketing engineer submits a new campaign request. The request hits the **PR Vertical Owner Agent**.

This boss agent is wired deeply into the system. It reads from the agency GBRAIN. It looks at past wins. It accesses the PR database. It checks media partners. It checks communication channels. It reviews pricing offers. It plans the entire campaign.

But it does not write the article. It delegates the writing. It sends the brief to a **Writer Agent** running on Claude Opus. The Writer Agent drafts the article. It writes the pitch. It formats the media.

Then, the draft hits the **Eval Loop**. The system asks ruthless questions. Does the angle match the client? Does the publication fit make sense? Is it aligned with past winning campaigns? Are the claims supported by facts? Is it safe to send to the partner?

If it fails any check, the system triggers a redraft. The Writer Agent tries again. If it passes, it moves to the **Human Review Gate**.

This is the final check. A real human looks at the draft. The human provides judgment and taste. The human gives final approval. Once approved, the system moves to **Ship and Run Ops**. The system talks to the marketing team. It coordinates the media houses. It runs the heavy operations in the background.

In this vertical, agents own 70% of the work. Humans keep the judgment, the relationships, and the final call. This is how you build leverage safely.

## 9. Scaling as an Agency: The Client Pod Model

If you run an agency, you cannot pool client data into a single vector database. If you do, a similarity search for Client B can retrieve chunks that belong to Client A. The LLM will then inject Client A's proprietary data into Client B's output. It is a massive security risk.

To solve this, I use the **Client Pod** architecture. I leverage Docker containerization.

I build my perfect Orchestrator and Specialist Agents internally. I test the Python scripts and Node.js servers until they are flawless. When I sign a new client, I spin up a brand new, isolated Docker container. I provision a dedicated, isolated vector database for that specific client. I pass unique API keys via a `.env` file.

This new Client Pod inherits the core logic of my master system, meaning it operates at high efficiency immediately. But because it runs in an isolated network environment, it only reads from its own siloed GBRAIN. The data stays isolated per client.
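
The isolation rule can be shown in miniature: one store per client, and no query path that spans two of them. This sketch models the idea in a single process with plain substring search. It is not a substitute for container or network isolation, and the `ClientPod` and `PodRegistry` names are hypothetical.

```python
class ClientPod:
    """One client's private document store."""

    def __init__(self, client_id: str) -> None:
        self.client_id = client_id
        self._docs: list[str] = []

    def add(self, text: str) -> None:
        self._docs.append(text)

    def search(self, term: str) -> list[str]:
        # Searches only this pod's documents; there is no cross-pod query.
        return [doc for doc in self._docs if term.lower() in doc.lower()]


class PodRegistry:
    """Hands out exactly one pod per client id."""

    def __init__(self) -> None:
        self._pods: dict[str, ClientPod] = {}

    def pod_for(self, client_id: str) -> ClientPod:
        if client_id not in self._pods:
            self._pods[client_id] = ClientPod(client_id)
        return self._pods[client_id]
```

Because the registry never exposes a "search everything" method, a query for Client B has no way to reach Client A's documents.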

## 10. My 9-Step Protocol for Deploying Specialists

You cannot build a reliable agent by simply typing a long prompt. Building a specialist agent requires strict software engineering discipline. I follow a ruthless 9-step deployment protocol.

```mermaid
flowchart TD
    subgraph Phase 1: Discovery
        D1[1. Manual API Testing] --> D2[2. Define Open or Closed]
    end

    subgraph Phase 2: Design
        D2 --> D3[3. Write System Prompts]
        D3 --> D4[4. Bind API Tools]
        D4 --> D5[5. Define Zod Schemas]
        D5 --> D6[6. Enforce IAM Roles]
    end

    subgraph Phase 3: Deploy
        D6 --> D7[7. Deploy Docker Container]
        D7 --> D8[8. Monitor Edge Cases]
        D8 --> D9[9. Set EventBridge CRON]
        D9 -.->|Patch exceptions| D8
    end
```

### Phase 1: Discovery

**1. Prototype it manually.** I never automate a workflow I have not executed myself. I open Postman or my terminal. I run the exact API calls and data transformations manually. I document the exact JSON structures required.
**2. Decide Open or Closed.** I analyze the workflow. If it requires dynamic web scraping, it becomes an Open Agent. If it targets a static API endpoint, it becomes a Closed Agent.

### Phase 2: Design

**3. Write the Soul.md.** I create the master system prompt. I define the agent's strict operational parameters. I use negative prompting to explicitly list prohibited behaviors.
**4. Pick the Skill Bundle.** I bind exact functions to the agent. If it needs to query a database, I write an atomic, sanitized SQL execution function. I do not give it raw access.
**5. Define Inputs and Zod Schemas.** I map the exact JSON schema required for the final output using a validation library like Zod. If the agent cannot satisfy the schema, the execution fails instantly.
**6. Scope the Boundaries.** I apply the principle of least privilege. I generate a restricted API key that can only execute `GET` requests. I explicitly block it from executing `POST` or `DELETE` requests on production databases.
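
Steps 5 and 6 can be sketched with the standard library alone. This stands in for a Zod or Pydantic schema so the example has no dependencies. The field names in `REQUIRED_FIELDS` are hypothetical, and in practice the `GET`-only rule should also be enforced by the API key itself, not just by client code.

```python
import json

# Hypothetical output schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"title": str, "score": (int, float), "sources": list}
ALLOWED_METHODS = frozenset({"GET"})  # least privilege: read-only


def parse_output(raw: str) -> dict:
    """Parse and validate agent output; raise ValueError on any mismatch."""
    data = json.loads(raw)  # broken JSON raises JSONDecodeError, a ValueError
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"wrong type for field: {name}")
    return data


def guard_request(method: str) -> str:
    """Allow only whitelisted HTTP methods before any request is sent."""
    method = method.upper()
    if method not in ALLOWED_METHODS:
        raise PermissionError(f"{method} is not permitted for this agent")
    return method
```

Both functions fail loudly. A missing field or a `DELETE` attempt stops the run instead of slipping through.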

### Phase 3: Deploy

**7. Spin up an Isolated Environment.** I wrap the agent in a Docker container. I deploy it to a staging environment on AWS ECS or a private VPS.
**8. Run on Real Work and Iterate.** I pipe 10 real tasks through the staging container. I monitor the console logs. I watch for API rate limits, context window overflows, and schema validation errors. I patch the code until the pipeline achieves a 100% success rate.
**9. Promote to Production.** Once verified, I push the container to production. I link it to Datadog or Sentry for runtime monitoring. I attach the CRON trigger. Then, I let the system run autonomously.

## 11. The Danger of Cognitive Surrender

The loop changes your work. It does not delete you from it. In fact, as your loop gets faster, three core problems become highly dangerous.

First is verification. A loop running unattended is a loop making mistakes unattended. "Done" is a claim, not a mathematical proof. You must review the final output. Your job is to ship code you confirmed works.

Second is comprehension debt. The faster the loop ships code you did not write, the larger the gap between the codebase and your brain. If you do not read the commits, your understanding rots.

Finally, there is cognitive surrender. When a loop runs perfectly, it is tempting to stop thinking. You accept whatever it hands you. Two engineers can build the exact same loop and get completely opposite results. One uses it to move faster on architecture they deeply understand. The other uses it to avoid understanding the system at all. The loop does not know the difference. You do.

The leverage point has moved. Prompt engineering is dead. Loop engineering is the new baseline.

Build the loop. But build it like someone who intends to stay the engineer.


---

<!-- METADATA_START -->
## Metadata & Citations



```json
{
  "@context": "https://schema.org",
  "@type": "TechArticle",
  "headline": "Agent Looping and Systems Engineering: Building Reliable AI",
  "author": {
    "@type": "Person",
    "name": "Rantideb Howlader"
  },
  "datePublished": "2026-06-09T00:00:00.000Z",
  "url": "https://www.ranti.dev/blog/what-is-agent-looping",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "isAccessibleForFree": true
}
```

### BibTeX
```bibtex
@article{what-is-agent-looping_2026,
  author = {Rantideb Howlader},
  title = {Agent Looping and Systems Engineering: Building Reliable AI},
  journal = {Rantideb Howlader Portfolio},
  year = {2026},
  url = {https://www.ranti.dev/blog/what-is-agent-looping},
  note = {Accessed: 2026-06-09}
}
```

### IEEE
Rantideb Howlader, "Agent Looping and Systems Engineering: Building Reliable AI," Rantideb Howlader Portfolio, 2026. [Online]. Available: https://www.ranti.dev/blog/what-is-agent-looping. [Accessed: 2026-06-09].

### APA
Rantideb Howlader. (2026). Agent Looping and Systems Engineering: Building Reliable AI. Rantideb Howlader. Retrieved from https://www.ranti.dev/blog/what-is-agent-looping

--- 
*This content is provided in research-grade Markdown format. Required Attribution: Cite as Rantideb Howlader (2026).*
<!-- METADATA_END -->