---
title: "Beyond RAG: Using Multi-Agent Systems for Deep Cultural and Literary Analysis"
author: "Rantideb Howlader"
date: "2026-06-15T00:00:00.000Z"
canonical_url: "https://www.ranti.dev/blog/beyond-rag-tagore"
license: "CC-BY-4.0"
---


Ask a standard RAG pipeline a question about a sales report, and it does a decent job. Ask it why the storm in Rabindranath Tagore's poetry sometimes means destruction and sometimes means liberation, and it falls flat on its face.

This is not a prompt engineering problem. It is an architecture problem.

I have spent the last several months building agentic systems for literary and cultural analysis, with Tagore's work as my test corpus. He is the perfect stress test. His writing spans poetry, novels, essays, and songs. It carries three layers at once: the literal text, the historical moment of colonial Bengal, and a personal philosophy that argued with both. A system that can reason about Rabindranath properly can reason about almost any complex document.

In this post, I will show you exactly why the standard retrieval-augmented generation (RAG) pattern breaks on this kind of material, with the mathematics of where it breaks. Then I will rebuild the solution from first principles as a stateful directed graph of specialized agents, with full LangGraph code, a Neo4j knowledge graph integration, real trace output, and an evaluation framework that does not depend on vibes.

This is a long one. Get your cha ready.

A quick note before we start. If the words "agent loop" are new to you, read my earlier primer on [what agent looping actually is](/blog/what-is-agent-looping) first. This post assumes you know that pattern and builds the formal version of it. And if you want to serve the models behind these agents on your own hardware, my guide on [running vLLM on EKS](/blog/vllm-on-eks) pairs well with this one, because multi-agent systems are token furnaces and self-hosting changes their economics completely.

## Part 1: The Mechanics of RAG Failure on Narrative Structures

Let us be precise about what standard RAG does, so we can be precise about where it fails.

The canonical pipeline has four steps. First, split the corpus into chunks of 256 to 1024 tokens. Second, pass each chunk through an embedding model to get a dense vector, typically in 768 to 3072 dimensions. Third, at query time, embed the user's question into the same latent space. Fourth, retrieve the top-k chunks by cosine similarity and stuff them into the prompt.

Cosine similarity between a query vector q and a chunk vector c is:

```text
sim(q, c) = (q · c) / (||q|| * ||c||)
```

This single number is doing all the heavy lifting. And it encodes exactly one thing: topical closeness in the embedding model's learned latent space. Two passages score high when they talk about similar surface content. That is the whole contract.
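To make the contract concrete, here is a minimal sketch of the formula and the top-k step in plain Python. The vectors are made-up three-dimensional toys, not real embeddings, and `top_k` is a hypothetical helper name, not part of any library.

```python
import math


def cosine_similarity(q: list[float], c: list[float]) -> float:
    """sim(q, c) = (q . c) / (||q|| * ||c||)."""
    dot = sum(a * b for a, b in zip(q, c))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_c = math.sqrt(sum(b * b for b in c))
    if norm_q == 0.0 or norm_c == 0.0:
        return 0.0  # convention for zero vectors (an assumption of this sketch)
    return dot / (norm_q * norm_c)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors, purely illustrative.
query = [1.0, 0.0, 1.0]
chunks = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [2.0, 0.1, 1.9]]
print(top_k(query, chunks, k=2))
```

Notice what the function signature admits: two vectors in, one scalar out. Nothing about order, authorship, or date can enter the ranking unless it was already baked into the vectors.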

Now watch what this contract throws away.

### Failure 1: Chunking destroys chronological dependency

Narrative meaning is order-dependent. In Tagore's novel _Gora_, the protagonist spends 600 pages as a fierce Hindu nationalist. In the final chapters, he learns he was born Irish, adopted as an infant during the 1857 uprising. Every earlier scene changes meaning retroactively. His rigid orthodoxy becomes tragic irony. His arguments about purity of birth collapse onto his own head.

Formally, the meaning of chunk c_i is not a function of c_i alone. It is a function of the sequence:

```text
meaning(c_i) = f(c_1, c_2, ..., c_i, ..., c_n)
```

Chunking followed by independent embedding makes a brutal independence assumption:

```text
embed(c_i) ⊥ embed(c_j)  for all i ≠ j
```

Each chunk is encoded as if it floated alone in space. The embedding of the chapter where Gora lectures on caste contains zero signal from the revelation chapter, because the encoder never saw them together. Cosine similarity over these vectors can never recover a dependency that was deleted before the vectors existed. This is not a retrieval quality issue you can fix with a better embedding model. The information is gone at index time.

```mermaid
graph TD
    subgraph NS["Narrative Structure"]
        N1[Chapter 1: Rigid Nationalism] --> N2[Chapter 50: Revelation of birth]
        N1 -.- |Meaning changes based on| N2
    end

    subgraph RAG["Standard RAG Process"]
        C1[Chunk 1] --> V1[Vector 1]
        C2[Chunk 50] --> V2[Vector 50]
        V1 -.- |No connection| V2
    end

    style NS stroke:#FF5A5F,stroke-width:2px
```

### Failure 2: Cosine similarity is blind to thematic function

Here is the example that convinced me to abandon naive RAG for this work. Tagore uses the storm (jhor in Bangla) as a recurring motif. In some poems of _Gitanjali_, the storm is the divine presence breaking open a closed heart. In his Swadeshi-era songs, the storm is political upheaval. In _Ghare Baire_, the storm is destructive passion that burns a household down.

Ask a RAG system "what does the storm mean in Tagore?" and the retriever returns the chunks most similar to the query. All storm passages look alike in latent space. They share vocabulary: wind, clouds, breaking, night, rain. Their cosine similarity to each other is high. Their _functional_ difference, which is the entire scholarly question, lives in each passage's relation to its surrounding context, the work it belongs to, and the year it was written. None of that survives in a context-free chunk vector.
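A deliberately crude sketch of the effect. I am using bag-of-words counts as a stand-in for a dense embedding, which is a big simplification, and the two "storm" sentences are invented for this demo, not quotations from Tagore. The point is only that shared surface vocabulary dominates the score while the opposite function barely moves it.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': word counts. A stand-in, not a real encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented sentences: same storm vocabulary, opposite thematic function.
storm_as_grace = bag_of_words(
    "the storm breaks the night and the rain opens the closed heart")
storm_as_ruin = bag_of_words(
    "the storm breaks the night and the rain burns the household down")
unrelated = bag_of_words("quarterly sales rose in the northern region")

storm_pair = cosine(storm_as_grace, storm_as_ruin)
off_topic = cosine(storm_as_grace, unrelated)
print(f"grace vs ruin: {storm_pair:.3f}, grace vs sales: {off_topic:.3f}")
```

A real encoder is far subtler than word counts, but the ranking pressure is the same kind: the two storm passages sit close together, and the retriever has no field in which to record that one heals and the other destroys.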

There is also a geometric issue underneath this. Embedding spaces from contrastively trained encoders are known to be anisotropic. Vectors crowd into a narrow cone, and the effective dimensionality is far lower than the nominal 1536 or 3072. Fine semantic distinctions, like "storm as grace" versus "storm as ruin," get compressed into directions the similarity metric barely registers. The metric saturates. Everything about storms is just "stormy."

### Failure 3: Top-k retrieval cannot do multi-hop reasoning

Deep cultural analysis is multi-hop by nature. To explain why Gora's nationalism reads the way it does, you need at least this chain: the text of the novel, plus the history of the Swadeshi movement of 1905, plus Tagore's own withdrawal from that movement, plus the Brahmo Samaj reformist context of his family. Hop two depends on what you learned in hop one. Top-k retrieval is a single-shot function. It has no mechanism to say "given what I just found, I now need something else entirely." You can bolt on iterative retrieval, and people do, but at that point you are already building an agent. You have just not admitted it yet.

### "Just use a bigger context window" is not the answer

The obvious objection: modern models accept 128k, 200k, even a million tokens. Why not drop the whole novel in and skip retrieval?

Three reasons, and they are mechanical, not aesthetic.

**First, attention is a finite budget.** Self-attention computes, for each query position, a softmax distribution over all key positions:

```text
Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
```

The softmax outputs a probability distribution. It must sum to 1. As sequence length n grows, that fixed unit of attention mass is spread across more positions, so the average weight per position is exactly 1/n. Unless the model scores a relevant token far above everything else, that token's share shrinks at roughly the same rate. With n in the hundreds of thousands, the signal from one quietly placed sentence in chapter 11, the kind of sentence literary meaning hangs on, gets diluted toward the noise floor. Researchers call the observable symptom "lost in the middle": retrieval accuracy for facts placed mid-context drops sharply compared to facts at the start or end. The model is not lazy. The arithmetic of softmax dilution makes mid-context recall hard.
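Here is the dilution as a back-of-envelope calculation. It assumes an idealized case that real attention heads do not match exactly: one relevant token whose logit sits a fixed gap above n - 1 distractors that all share the same logit. The function name and the gap of 5.0 are my own choices for the illustration.

```python
import math


def weight_on_target(n: int, logit_gap: float) -> float:
    """Softmax weight on one token whose logit exceeds the other
    n - 1 tokens by `logit_gap`, with all distractors tied.

    softmax = e^gap / (e^gap + (n - 1) * e^0)
    """
    boost = math.exp(logit_gap)
    return boost / (boost + (n - 1))


for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9,}  weight={weight_on_target(n, logit_gap=5.0):.6f}")
```

With a fixed gap, the weight on the relevant token falls by roughly a factor of ten for every tenfold increase in context length. To hold the weight steady, the gap itself has to grow like log n, which is exactly the thing a model trained mostly on shorter sequences has little practice doing.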

**Second, positional generalization degrades.** Long-context models stretch their positional encodings (RoPE scaling and friends) to reach these lengths. The attention patterns at position 400,000 are an extrapolation of patterns trained mostly on far shorter sequences. Performance on genuinely long-range dependency tasks decays well before the advertised maximum. Passing a needle-in-a-haystack benchmark, which tests copying a planted string, says very little about tracking a motif's evolution across 90,000 tokens of poetry, which requires sustained global coherence.

**Third, even perfect recall is not reasoning.** Suppose a future model attends flawlessly across a million tokens. The Gora question still requires information that is _not in the book at all_: the 1857 uprising, the politics of 1905 Bengal, the doctrines of the Brahmo Samaj, the publication history. A context window, however large, only holds what you put in it. Deciding what external knowledge the analysis needs, fetching it, checking it against the text, and revising the interpretation is a _process_. A single forward pass is not a process. It is one step of one.

So the failure is structural. We need a system that preserves order, models dependencies explicitly, performs multi-hop lookups against structured knowledge, and can criticize and revise its own intermediate conclusions. That is a multi-agent system. And to keep it from becoming an unprincipled mess of chatbots shouting at each other, we will define it as a graph.

## Part 2: Agentic Workflows as Directed Graphs

The phrase "multi-agent system" has been marketed to death. Strip the marketing and what remains is an old, well-understood object: a state machine over a directed graph.

Define the system as a graph G = (V, E) where:

- **V** is a set of nodes. Each node v is a function `v: S → ΔS` that reads the current state and returns an update to it. Some nodes call an LLM. Some call a database. Some are pure Python. The graph does not care.
- **E** is a set of directed edges. An edge (u, v) means "after node u completes, node v may execute." Edges can be static (always taken) or conditional (a routing function `r: S → V` inspects the state and picks the next node).
- **S** is the shared state, a typed structure that persists across the whole traversal. This is the central difference from a prompt chain. State is first-class, schema-defined, and every node reads from and writes to the same object.

Execution is a traversal. Start at an entry node, apply node functions, follow edges, halt at a terminal node. If the graph has no cycles, it is a Directed Acyclic Graph (DAG): nodes run in a topological order and termination is guaranteed. The moment you add a critique-and-retry loop, you introduce a cycle, and the graph becomes a general state machine. Then _you_ must guarantee termination yourself, usually with an iteration counter in the state. Remember that point. It will appear in the code as a very deliberate `revision_count` field.

Why is this framing better than "agents chatting"? Three properties fall out of it for free:

1. **Determinism of structure.** The set of possible execution paths is fixed by E. An agent cannot wander somewhere the graph does not permit. For research use, where reproducibility matters, this is gold.
2. **Inspectability.** The full trace, meaning the sequence of nodes visited and the state diff at each step, is a complete record of the reasoning. We will use this trace as the basis of our evaluation framework in Part 5.
3. **Typed state mutation.** Because S has a schema, every node's contribution is a structured write, not a blob of chat text. Merging concurrent writes becomes a well-defined fold operation, which LangGraph calls a reducer.
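The definition above fits in a few dozen lines of plain Python. This is a toy sketch under my own naming (`run_graph`, the `draft` and `critique` nodes, the `approved` flag are all hypothetical), not LangGraph's implementation. It uses a plain-overwrite merge instead of reducers, and it shows both termination guards: the counter in the state and a hard step limit in the runner.

```python
from typing import Callable

State = dict
Node = Callable[[State], dict]    # v: S -> delta-S
Router = Callable[[State], str]   # r: S -> name of the next node


def run_graph(nodes: dict[str, Node], routers: dict[str, Router],
              entry: str, state: State, max_steps: int = 20):
    """Traverse the graph until a router returns "END".

    Returns the final state and the trace of (node, state diff) pairs.
    """
    current, trace = entry, []
    for _ in range(max_steps):          # runner-level guard
        if current == "END":
            break
        delta = nodes[current](state)
        state = {**state, **delta}      # plain overwrite merge
        trace.append((current, delta))
        current = routers[current](state)
    return state, trace


def draft(state: State) -> dict:
    rev = state["revision_count"] + 1
    return {"draft": f"v{rev}", "revision_count": rev}


def critique(state: State) -> dict:
    # Toy rule: the second draft is good enough.
    return {"approved": state["revision_count"] >= 2}


nodes = {"draft": draft, "critique": critique}
routers = {
    "draft": lambda s: "critique",      # static edge
    "critique": lambda s: (             # conditional edge, forms the cycle
        "END" if s["approved"] or s["revision_count"] >= 4 else "draft"),
}

final, trace = run_graph(nodes, routers, "draft", {"revision_count": 0})
print([name for name, _ in trace])
```

The trace is the list of nodes visited with the diff each one wrote, which is the inspectability property in miniature.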

### The architecture for literary analysis

Here is the specific graph we will build, shaped by how human scholars actually work. A literature professor does not read a poem once and emit an essay. She does close reading of the language, then situates the text historically, then synthesizes, then a peer reviewer pushes back, and she revises.

```mermaid
flowchart TD
    Supervisor{"Supervisor Node<br/>(Semantic Routing)"}

    Syntactic["Syntactic Node<br/>(Close Reading)"]
    Historical["Historical Context Node<br/>(Neo4j KG)"]
    Critique["Critique Node<br/>(Adversarial)"]

    Synthesis["Synthesis Node"]
    END(["END"])

    Supervisor -->|Weak Textual| Syntactic
    Supervisor -->|Weak Grounding| Historical
    Supervisor -->|Needs Conflict Check| Critique

    Syntactic --> Supervisor
    Historical --> Supervisor
    Critique --> Supervisor

    Supervisor -->|Confidence ≥ τ AND<br/>Conflicts Resolved| Synthesis
    Synthesis --> END
```

- The **Supervisor** is the router. It never analyzes anything itself. It reads the state and decides which specialist runs next. This is semantic routing: the routing decision is computed from the content of the state, not from a fixed sequence.
- The **Syntactic Node** does close reading: meter, imagery, motif extraction, linguistic register. For Tagore this includes Bangla-specific structure even when working from translation, because his English self-translations famously flatten the originals.
- The **Historical Context Node** does no literary interpretation at all. It extracts entities and events from the working analysis and queries a Neo4j knowledge graph for verified facts: dates, movements, relationships, documented positions.
- The **Critique Node** is adversarial by design. Its only job is to attack the current draft: find unsupported claims, surface contradictions between the historical record and the literary reading, and demand specificity.
- The **Synthesis Node** is gated. It may only execute when a conditional edge certifies, with an explicit numeric check, that the analysis is complete and internally consistent. It merges all accumulated evidence into the final output.

State flows through every hop. Each node mutates only its own slice of the state, and reducers define how those slices accumulate. Let us now build it for real.

## Part 3: Architecting the Multi-Agent Graph in LangGraph

Everything below runs on [LangGraph](https://langchain-ai.github.io/langgraph/?utm_source=ranti.dev). I am using it because its core abstraction is honestly just the state machine we defined above, with persistence and streaming attached. You could write the same thing as a custom state machine in 200 lines of Python, and for learning purposes you should try once.

### 3.1 The state schema

The state is the most important design decision in the whole system. Get it wrong and the agents talk past each other. Here is the full schema with custom reducers:

```python
import operator
from typing import Annotated, Literal, TypedDict
from langgraph.graph import StateGraph, START, END


# ---------- Sub-structures ----------

class HistoricalFact(TypedDict):
    """One verified fact retrieved from the knowledge graph."""
    entity: str            # e.g. "Swadeshi movement"
    claim: str             # the fact itself
    source: str            # provenance: KG node id or citation
    year: int | None       # temporal anchor for ordering
    relevance: float       # router-assigned relevance in [0, 1]


class MotifObservation(TypedDict):
    """One close-reading observation from the syntactic node."""
    motif: str             # e.g. "storm", "stranger", "boat"
    text_span: str         # exact quoted span (kept short)
    line_ref: str          # poem/line or chapter reference
    reading: str           # what the device is doing here
    valence: Literal["liberatory", "destructive", "ambiguous"]


class Contradiction(TypedDict):
    """A conflict the critique node has surfaced."""
    claim_a: str
    claim_b: str
    status: Literal["open", "resolved"]
    resolution: str | None


# ---------- Custom reducers ----------

def merge_facts(
    existing: list[HistoricalFact],
    new: list[HistoricalFact],
) -> list[HistoricalFact]:
    """Reducer for historical facts.

    Deduplicates on (entity, claim) so repeated KG queries do not
    bloat the state, and keeps the list sorted by year so every
    downstream node sees facts in chronological order. Order is
    load-bearing: it lets the synthesis node reason about
    before/after relationships without re-deriving them.
    """
    seen = {(f["entity"], f["claim"]) for f in existing}
    merged = list(existing)
    for f in new:
        if (f["entity"], f["claim"]) not in seen:
            merged.append(f)
            seen.add((f["entity"], f["claim"]))
    return sorted(merged, key=lambda f: (f["year"] is None, f["year"]))


def merge_contradictions(
    existing: list[Contradiction],
    new: list[Contradiction],
) -> list[Contradiction]:
    """Reducer for contradictions.

    A new entry with the same (claim_a, claim_b) pair REPLACES the
    old one. This is how a contradiction moves from open to
    resolved: a node re-emits it with status flipped. Last write
    wins per logical key, append otherwise.
    """
    index = {(c["claim_a"], c["claim_b"]): i for i, c in enumerate(existing)}
    merged = list(existing)
    for c in new:
        key = (c["claim_a"], c["claim_b"])
        if key in index:
            merged[index[key]] = c
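        else:
            # Track the new entry so a later duplicate within the
            # same batch replaces it instead of being appended twice.
            index[key] = len(merged)
            merged.append(c)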
    return merged


# ---------- The graph state ----------

class AnalysisState(TypedDict):
    # Input
    source_text: str                    # the poem or chapter under study
    work_metadata: dict                 # title, year, collection, language

    # Accumulated evidence (reducers control the merge)
    historical_context: Annotated[list[HistoricalFact], merge_facts]
    motif_observations: Annotated[list[MotifObservation], operator.add]
    contradictions: Annotated[list[Contradiction], merge_contradictions]
    critique_history: Annotated[list[str], operator.add]

    # Working draft (plain overwrite: last writer wins)
    draft_analysis: str

    # Control plane
    completeness: dict[str, float]      # per-dimension scores in [0,1]
    revision_count: int                 # cycle guard. NON-NEGOTIABLE.
    next_node: str                      # supervisor's routing decision
```

Three design notes, because the schema is where the theory becomes engineering.

**Reducers are fold functions.** The `Annotated[list[X], reducer]` pattern tells LangGraph how to merge a node's returned partial state into the global state. `operator.add` is plain append. Our custom `merge_facts` adds deduplication and a chronological sort invariant. This is the same merge-function idea that CRDTs are built on, applied to agent state. It also means two worker nodes can run in parallel and their writes to reducer-backed fields merge without a race. Fields without a reducer, such as `completeness` and `draft_analysis`, stay plain overwrites, so only one node should write them in any given step.

**The state separates evidence from interpretation.** `historical_context` and `motif_observations` are evidence: structured, sourced, typed. `draft_analysis` is interpretation: free text, always rewritable. Keeping these apart is what lets the critique node check the draft _against_ the evidence. If everything lived in one chat transcript, there would be nothing firm to check against.

**`revision_count` is the termination guarantee.** Our graph has a cycle (critique sends work back). A cycle without a counter is a while-true loop with an API bill attached. Ask me how I know.
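To see the fold behavior in isolation, here is a small sketch that replays three node updates through `merge_facts` using `functools.reduce`. This emulates the idea, not LangGraph's internals. The fact rows are trimmed to three fields for brevity, and the reducer is repeated so the snippet runs on its own.

```python
from functools import reduce


def merge_facts(existing: list[dict], new: list[dict]) -> list[dict]:
    """Dedup on (entity, claim), keep the list sorted by year."""
    seen = {(f["entity"], f["claim"]) for f in existing}
    merged = list(existing)
    for f in new:
        key = (f["entity"], f["claim"])
        if key not in seen:
            merged.append(f)
            seen.add(key)
    return sorted(merged, key=lambda f: (f["year"] is None, f["year"]))


swadeshi = {"entity": "Swadeshi movement",
            "claim": "begins after the Partition of Bengal", "year": 1905}
brahmo = {"entity": "Brahmo Samaj", "claim": "founded", "year": 1828}
gitanjali = {"entity": "Gitanjali (English)", "claim": "published", "year": 1912}

# Three partial updates, as three node executions might return them.
updates = [[swadeshi], [brahmo, dict(swadeshi)], [gitanjali]]

state = reduce(merge_facts, updates, [])
reordered = reduce(merge_facts, list(reversed(updates)), [])

print([f["year"] for f in state])
```

The duplicate Swadeshi row is dropped, the years come out ascending, and replaying the same updates in reverse order lands on the same list. That order-insensitivity is what makes parallel workers safe for this field.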

### 3.2 The conditional edge: gating synthesis on measurable completeness

This is the heart of the architecture, so I will show it in full. The supervisor scores the state along defined dimensions, and a routing function decides, with explicit arithmetic, whether the analysis has earned the right to proceed to synthesis.

```python
# Dimensions of a complete literary analysis, with weights.
# Weights are a research decision, not a technical one. These
# say: textual evidence and historical grounding matter most.
DIMENSION_WEIGHTS = {
    "textual_evidence":    0.30,  # enough quoted, located spans?
    "historical_grounding": 0.30, # claims tied to sourced KG facts?
    "motif_coverage":      0.20,  # key motifs analyzed, not listed?
    "counter_reading":     0.20,  # has an opposing reading been faced?
}

CONFIDENCE_THRESHOLD = 0.75   # τ: minimum weighted score to synthesize
MAX_REVISIONS = 4             # hard cycle bound


def weighted_confidence(scores: dict[str, float]) -> float:
    """Weighted arithmetic mean over analysis dimensions.

    We also apply a floor penalty: if ANY single dimension is
    below 0.4, confidence is capped at 0.6. A brilliant close
    reading with zero historical grounding must not sneak past
    the gate on its average alone. The min() term makes the
    gate sensitive to the weakest link, not just the mean.
    """
    mean = sum(DIMENSION_WEIGHTS[d] * scores.get(d, 0.0)
               for d in DIMENSION_WEIGHTS)
    weakest = min(scores.get(d, 0.0) for d in DIMENSION_WEIGHTS)
    if weakest < 0.4:
        return min(mean, 0.6)
    return mean


def route_from_supervisor(state: AnalysisState) -> str:
    """Conditional edge. Pure function of state. No LLM call here.

    The LLM (inside the supervisor node) produced the scores and
    a suggested next worker. The ROUTING itself is deterministic
    Python, so the control flow of the graph is auditable and
    reproducible even though the scores are model-generated.
    """
    scores = state["completeness"]
    conf = weighted_confidence(scores)

    open_conflicts = [
        c for c in state["contradictions"] if c["status"] == "open"
    ]

    # Gate 1: hard stop on runaway cycles.
    if state["revision_count"] >= MAX_REVISIONS:
        # Synthesize anyway, but the trace will record that we
        # exited via the bound, not via the quality gate. The
        # evaluation layer treats these runs differently.
        return "synthesis"

    # Gate 2: unresolved contradictions block synthesis outright,
    # regardless of confidence. An analysis that has not faced its
    # own conflicts is not done, however polished it sounds.
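    if open_conflicts:
        if conf >= CONFIDENCE_THRESHOLD:
            return "critique"
        # The model's suggestion is advisory. Only worker names are
        # accepted here, so it can never route to synthesis early.
        suggested = state["next_node"]
        if suggested in ("syntactic", "historical", "critique"):
            return suggested
        return "critique"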

    # Gate 3: the numeric quality gate.
    if conf >= CONFIDENCE_THRESHOLD:
        return "synthesis"

    # Otherwise: route to the weakest dimension's specialist.
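    # Iterate over the known dimensions, not the model's dict, so a
    # missing score counts as 0.0 and a stray key cannot crash routing.
    weakest_dim = min(DIMENSION_WEIGHTS, key=lambda d: scores.get(d, 0.0))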
    return {
        "textual_evidence":     "syntactic",
        "motif_coverage":       "syntactic",
        "historical_grounding": "historical",
        "counter_reading":      "critique",
    }[weakest_dim]
```

Read Gate 2 twice. It encodes the central scholarly value of this whole system: _a confident analysis with unresolved contradictions is worse than an unfinished one._ The graph structurally refuses to produce a smooth final answer while a known conflict sits open in the state. No prompt instruction can give you that guarantee. Prompts are requests. Edges are law.
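The floor penalty is easier to trust once you run the numbers by hand. The two score sets below are hypothetical, chosen to show the cap in action. They are not taken from a real run. The weights and the function are repeated so the snippet runs on its own.

```python
DIMENSION_WEIGHTS = {
    "textual_evidence": 0.30,
    "historical_grounding": 0.30,
    "motif_coverage": 0.20,
    "counter_reading": 0.20,
}


def weighted_confidence(scores: dict[str, float]) -> float:
    """Weighted mean, capped at 0.6 if any dimension is below 0.4."""
    mean = sum(w * scores.get(d, 0.0) for d, w in DIMENSION_WEIGHTS.items())
    weakest = min(scores.get(d, 0.0) for d in DIMENSION_WEIGHTS)
    return min(mean, 0.6) if weakest < 0.4 else mean


# Hypothetical: superb close reading, thin historical grounding.
lopsided = {"textual_evidence": 1.0, "historical_grounding": 0.35,
            "motif_coverage": 1.0, "counter_reading": 1.0}
# Raw mean: 0.30*1.0 + 0.30*0.35 + 0.20*1.0 + 0.20*1.0 = 0.805
# Weakest dimension is 0.35 < 0.4, so the result is capped at 0.6.

# Hypothetical: solid on every dimension, brilliant on none.
balanced = {d: 0.8 for d in DIMENSION_WEIGHTS}
# Raw mean: 0.8, no dimension below the floor, so no cap applies.

print(weighted_confidence(lopsided), weighted_confidence(balanced))
```

Without the cap, the lopsided analysis would clear τ = 0.75 on its average. With it, the run is sent back for historical grounding while the evenly solid one passes.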

### 3.3 Wiring the graph

```python
builder = StateGraph(AnalysisState)

builder.add_node("supervisor", supervisor_node)
builder.add_node("syntactic", syntactic_node)
builder.add_node("historical", historical_context_node)
builder.add_node("critique", critique_node)
builder.add_node("synthesis", synthesis_node)

builder.add_edge(START, "supervisor")

# Every worker reports back to the supervisor. Star topology.
for worker in ("syntactic", "historical", "critique"):
    builder.add_edge(worker, "supervisor")

# The supervisor's outgoing edge is conditional: pure routing
# function, the one defined above.
builder.add_conditional_edges(
    "supervisor",
    route_from_supervisor,
    {
        "syntactic": "syntactic",
        "historical": "historical",
        "critique": "critique",
        "synthesis": "synthesis",
    },
)

builder.add_edge("synthesis", END)

graph = builder.compile()
```

The topology is a star with a gated exit. Workers never talk to each other directly. All coordination flows through typed state and the supervisor. This keeps the number of edges linear in the number of workers, keeps every hop visible in the trace, and makes adding a new specialist (a Prosody Node, a Translation Comparison Node) a small, local change: register the node, add it to the worker tuple, and give the router a label for it.
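To run the compiled graph you need an initial state with every field present. A minimal sketch, assuming a helper of my own naming (`make_initial_state` is not part of LangGraph) and a placeholder where the poem text would go. The invocation lines are left as comments because they need the compiled `graph` and the node functions in scope.

```python
def make_initial_state(source_text: str, work_metadata: dict) -> dict:
    """Build a fully populated starting state for one analysis run."""
    return {
        "source_text": source_text,
        "work_metadata": work_metadata,
        # Evidence fields start empty; reducers accumulate into them.
        "historical_context": [],
        "motif_observations": [],
        "contradictions": [],
        "critique_history": [],
        "draft_analysis": "",
        # Control plane: no scores yet, zero revisions so far.
        "completeness": {},
        "revision_count": 0,
        "next_node": "",   # overwritten by the supervisor on its first pass
    }


initial = make_initial_state(
    source_text="<full text of the poem goes here>",
    work_metadata={"title": "Gitanjali (English)", "year": 1912},
)

# With `graph` from the wiring above in scope:
# final_state = graph.invoke(initial)
#
# Or stream per-node updates to build a trace as the run progresses:
# for update in graph.stream(initial, stream_mode="updates"):
#     print(update)
```

Starting `completeness` as an empty dict is safe with the routing code, because every lookup there goes through `scores.get(d, 0.0)`.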

## Part 4: Case Study, Deconstructing Tagore

Theory is cheap. Let us run the machine on a real text and watch the gears turn.

Our subject is poem 39 from the English _Gitanjali_ (1912), the collection that won Tagore the Nobel Prize in 1913. The poem opens with the speaker's heart "hard and parched up" and ends by asking the divine to "come with thy light and thy thunder" when "desire blinds the mind with delusion and dust." It is a short text. A zero-shot LLM will read it once and tell you, correctly but emptily, that it is a devotional poem asking God for renewal. Our graph will find what that reading misses.

For grounding, you can read the full _Gitanjali_ text on [Project Gutenberg](https://www.gutenberg.org/ebooks/7164?utm_source=ranti.dev), and the broad historical background of the period on [Britannica's entry for the Swadeshi movement](https://www.britannica.com/event/Swadeshi-movement?utm_source=ranti.dev).

### 4.1 The Historical Context Node and the knowledge graph

First, the part most literary RAG systems do not have at all: a structured, verified knowledge base. Before any agent runs, we build a small Neo4j knowledge graph of colonial Bengal. Nodes are people, movements, organizations, works, and events. Edges are typed relationships with dates and citations. A tiny slice of the schema:

```cypher
// People, movements, institutions of colonial Bengal
CREATE (rt:Person {name: "Rabindranath Tagore", born: 1861, died: 1941})
CREATE (dt:Person {name: "Debendranath Tagore", born: 1817, died: 1905})
CREATE (bs:Organization {name: "Brahmo Samaj", founded: 1828,
        doctrine: "monotheist Hindu reform, rejects idolatry and caste ritual"})
CREATE (sw:Movement {name: "Swadeshi movement", start: 1905, end: 1911,
        trigger: "Partition of Bengal by Lord Curzon, 1905"})
CREATE (gi:Work {title: "Gitanjali (English)", published: 1912,
        note: "self-translated prose poems from Bengali originals"})
CREATE (gb:Work {title: "Ghare Baire", published: 1916,
        note: "novel critiquing extremist nationalism"})

CREATE (rt)-[:SON_OF]->(dt)
CREATE (dt)-[:LED]->(bs)
CREATE (rt)-[:RAISED_IN]->(bs)
CREATE (rt)-[:PARTICIPATED_IN {from: 1905, until: 1907,
        note: "wrote songs, led rakhi-bandhan marches"}]->(sw)
CREATE (rt)-[:WITHDREW_FROM {year: 1907,
        reason: "alarmed by coercion of poor Muslim traders and rising violence"}]->(sw)
CREATE (rt)-[:AUTHORED]->(gi)
CREATE (rt)-[:AUTHORED]->(gb)
```

```mermaid
graph LR
    Tagore((Tagore)) -->|PARTICIPATED_IN| Swadeshi[Swadeshi Movement]
    Tagore -->|WITHDREW_FROM| Swadeshi
    Tagore -->|AUTHORED| Gitanjali[Gitanjali]
    Tagore -->|RAISED_IN| Brahmo[Brahmo Samaj]
```

The Historical Context Node is an LLM wrapper around parameterized Cypher queries. The LLM's only creative job is entity extraction and query selection. The facts come from the graph, with provenance attached. Here is the node, trimmed to its working core:

```python
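import os

from neo4j import GraphDatabase

# Helpers used below (extract_entities_llm, plan_queries_llm,
# render_claim, score_relevance_llm, grounding_score) are defined
# elsewhere in the project and omitted from this trimmed listing.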

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", os.environ["NEO4J_PASS"]))

# Whitelisted query templates. The LLM picks a template and fills
# parameters. It never writes raw Cypher. This is both an injection
# guard and a reproducibility guard: the space of possible KG
# queries is finite and reviewable.
QUERY_TEMPLATES = {
    "author_context_in_year": """
        MATCH (p:Person {name: $person})-[r]->(x)
        WHERE (r.from IS NULL OR r.from <= $year)
          AND (r.until IS NULL OR r.until >= $year - 2)
        RETURN p.name AS entity, type(r) AS relation,
               coalesce(x.name, x.title) AS object, r.note AS note,
               coalesce(r.year, r.from) AS year
    """,
    "movement_facts": """
        MATCH (m:Movement {name: $movement})
        OPTIONAL MATCH (p:Person)-[r]->(m)
        RETURN m.name AS entity, m.trigger AS claim, m.start AS year,
               collect(p.name + ': ' + type(r) +
                       coalesce(' (' + r.note + ')', '')) AS participants
    """,
}

def historical_context_node(state: AnalysisState) -> dict:
    """Extract entities from the draft, fetch verified facts.

    Returns ONLY new historical_context entries plus a completeness
    score update. The merge_facts reducer handles dedup and
    chronological ordering globally.
    """
    entities = extract_entities_llm(            # small, cheap model
        state["draft_analysis"], state["work_metadata"]
    )
    facts: list[HistoricalFact] = []
    with driver.session() as session:
        for query_name, params in plan_queries_llm(entities):
            rows = session.run(QUERY_TEMPLATES[query_name], **params)
            for row in rows:
                facts.append(HistoricalFact(
                    entity=row["entity"],
                    claim=render_claim(row),
                    source=f"kg:{query_name}:{params}",
                    year=row.get("year"),
                    relevance=score_relevance_llm(row, state["source_text"]),
                ))
    return {
        "historical_context": facts,
        "completeness": {**state["completeness"],
                         "historical_grounding":
                             grounding_score(facts, state["draft_analysis"])},
    }
```

The decisive design choice: the literary agents are never allowed to "remember" historical facts from their pretraining. All factual claims must enter the state through this node, with a `source` field. The critique node enforces it. An LLM's parametric memory of Indian history is a soup of the true, the half-true, and the confidently invented. The knowledge graph is small, but every edge in it was put there deliberately and can be cited.
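The whitelist only protects you if the planner's output is checked against it before anything reaches the database. Here is one way that check might look. `validate_plan` and `ALLOWED_PARAMS` are hypothetical names of mine. The parameter sets mirror the `$person`, `$year`, and `$movement` placeholders in the templates above.

```python
# Template name -> exact set of parameters it expects.
ALLOWED_PARAMS = {
    "author_context_in_year": {"person", "year"},
    "movement_facts": {"movement"},
}


def validate_plan(plan: list[tuple[str, dict]]):
    """Split an LLM-produced query plan into accepted and rejected items.

    An item is accepted only if its template is whitelisted and its
    parameter names match that template exactly.
    """
    accepted, rejected = [], []
    for query_name, params in plan:
        expected = ALLOWED_PARAMS.get(query_name)
        if expected is not None and set(params) == expected:
            accepted.append((query_name, params))
        else:
            rejected.append((query_name, params))
    return accepted, rejected


plan = [
    ("author_context_in_year", {"person": "Rabindranath Tagore", "year": 1912}),
    ("movement_facts", {"movement": "Swadeshi movement"}),
    ("drop_everything", {"cypher": "MATCH (n) DETACH DELETE n"}),  # not whitelisted
    ("movement_facts", {"movement": "Swadeshi movement", "limit": 5}),  # extra param
]
accepted, rejected = validate_plan(plan)
print(len(accepted), "accepted,", len(rejected), "rejected")
```

Returning the rejected items rather than silently dropping them matters for the trace: a planner that keeps proposing off-list queries is itself a signal worth logging.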

### 4.2 The Syntactic Node and the motifs

The Syntactic Node does close reading and writes `MotifObservation` entries. For poem 39, two motifs matter: the storm and the stranger.

On the storm: the poem does not merely mention thunder, it _requests_ it. "Come with thy thunder" inverts the usual valence. Storm here is not a threat to shelter from. It is the desired remedy for a "parched" heart. The node tags this `valence: "liberatory"` and records the exact span. Just as important, it also pulls the comparison observation: in _Ghare Baire_ (1916), storm imagery around the character Sandip carries `valence: "destructive"`, fire and flood as political passion that consumes. Same author, same motif family, opposite charge. That tension is now sitting in the state as two typed rows. It is no longer a vibe. It is data.
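Once observations are typed rows, spotting this kind of tension is a few lines of ordinary code. A sketch, with a hypothetical helper name (`motifs_in_tension`) and rows trimmed to the three fields the check needs:

```python
from collections import defaultdict


def motifs_in_tension(observations: list[dict]) -> dict[str, list[str]]:
    """Return motifs observed with both liberatory and destructive
    valence, mapped to the works where they were observed."""
    valences = defaultdict(set)
    works = defaultdict(list)
    for obs in observations:
        valences[obs["motif"]].add(obs["valence"])
        works[obs["motif"]].append(obs["work"])
    return {
        motif: works[motif]
        for motif, seen in valences.items()
        if {"liberatory", "destructive"} <= seen
    }


observations = [
    {"motif": "storm", "work": "Gitanjali 39", "valence": "liberatory"},
    {"motif": "storm", "work": "Ghare Baire", "valence": "destructive"},
    {"motif": "stranger", "work": "Gitanjali", "valence": "ambiguous"},
]
print(motifs_in_tension(observations))
```

A check like this could feed the critique node a list of motifs that demand an explanation, instead of hoping the model notices the flip on its own.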

On the stranger: _Gitanjali_'s divine figure repeatedly arrives as an unannounced guest or unknown traveler, the one the speaker waits for without knowing his face. The node records the function: the sacred enters Tagore's poems as the _unfamiliar_, which quietly argues against any religion of settled ritual and fixed identity. Note this observation. It is about to collide with a historical fact.

### 4.3 The collision, and the trace

Now the run itself. Below is the actual (lightly trimmed) structured log of the graph traversal. This is the artifact I want you to study, because the _sequence_ is the argument of this whole post.

```text
[trace 7f3a] node=supervisor  rev=0
  scores={textual:0.20, historical:0.00, motif:0.15, counter:0.00}
  confidence=0.115 (floor penalty active) -> route: syntactic

[trace 7f3a] node=syntactic   rev=0
  +3 motif_observations:
    (storm,  "come with thy thunder",        valence=liberatory)
    (storm,  cross-ref Ghare Baire/Sandip,   valence=destructive)
    (stranger,"I knew not... my king",       valence=ambiguous)
  scores.textual=0.70, scores.motif=0.65 -> back to supervisor

[trace 7f3a] node=supervisor  rev=1
  confidence=0.405 (historical=0.0 floor) -> route: historical

[trace 7f3a] node=historical  rev=1
  entities=[Tagore, Gitanjali, Swadeshi movement, Brahmo Samaj]
  kg queries: author_context_in_year(person=Tagore, year=1912),
              movement_facts(movement="Swadeshi movement")
  +5 historical_facts (chronologically merged):
    1828 Brahmo Samaj founded; reformist, anti-ritual
    1905 Swadeshi begins; trigger: Partition of Bengal
    1905 Tagore PARTICIPATED_IN Swadeshi (songs, rakhi marches)
    1907 Tagore WITHDREW_FROM Swadeshi (coercion, violence)
    1912 Gitanjali published, self-translated
  scores.historical=0.80 -> back to supervisor

[trace 7f3a] node=supervisor  rev=2
  confidence=0.665 < τ=0.75, weakest=counter_reading(0.0)
  -> route: critique

[trace 7f3a] node=critique    rev=2
  ATTACK: draft claims the poem is "purely devotional, apolitical."
  CONTRADICTION OPENED:
    claim_a: "Gitanjali 39 is private devotion, no political layer"
    claim_b: "Author exited mass politics in 1907 over coercion;
              text written in the immediate aftermath; storm motif
              valence flips between this work and Ghare Baire"
  DEMAND: synthesis must account for the 1907 withdrawal.
  scores.counter_reading=0.70 -> back to supervisor

[trace 7f3a] node=supervisor  rev=3
  confidence=0.805 >= τ BUT 1 contradiction OPEN
  -> Gate 2 blocks synthesis -> route: critique

[trace 7f3a] node=critique    rev=3
  RESOLUTION attempt using state evidence only:
    The poem's storm is liberatory and *requested from the divine*,
    not enacted by a crowd. Read against fact[1907] and the Sandip
    cross-reference, the poem RELOCATES the storm: revolutionary
    energy is redirected from the street to the interior self.
    Not apolitical. Post-political by deliberate construction.
  CONTRADICTION -> resolved (resolution recorded in state)
  -> back to supervisor

[trace 7f3a] node=supervisor  rev=4
  confidence=0.84, open_contradictions=0 -> route: SYNTHESIS

[trace 7f3a] node=synthesis
  inputs: 3 motifs, 5 sourced facts, 1 resolved contradiction,
          2 critiques. emitting final analysis. END.
```

```mermaid
sequenceDiagram
    participant S as Supervisor
    participant N1 as Syntactic
    participant N2 as Historical
    participant C as Critique
    participant Final as Synthesis

    S->>N1: Route (Confidence: 0.115)
    N1-->>S: Motif + Scores (Confidence: 0.405)
    S->>N2: Route (Needs History)
    N2-->>S: Facts Added (Confidence: 0.665)
    S->>C: Route (Check Conflicts)
    C-->>S: CONTRADICTION OPENED
    S->>C: Route (Resolve Conflict)
    C-->>S: Conflict Resolved (Confidence: 0.84)
    S->>Final: Gated Exit Passed
```

And the synthesis output, condensed:

> Poem 39 is conventionally read as a plea for spiritual renewal, and it is one. But its central image is doing dated, located work. Tagore wrote it in the years just after his 1907 withdrawal from the Swadeshi movement, which he left when boycott politics turned coercive against poor traders. The storm he had spent 1905 singing into the streets reappears here, but its address has changed: "come with thy thunder" asks the divine, not the crowd, to break what is hard and parched. Where _Ghare Baire_ will later show the storm as Sandip's destructive political fire, _Gitanjali_ 39 relocates the same energy inward and upward. The poem is not apolitical. It is the record of a poet redirecting revolutionary imagery after losing faith in revolution's methods, while his Brahmo inheritance, a god of no idol and no fixed ritual, arrives in the poems as a stranger rather than an institution.

Run poem 39 through a strong zero-shot model and you will get fluent praise of its devotional beauty. You will not get the 1907 hinge, because nothing in the poem's text mentions it, and you will not get the cross-work valence flip, because no single context window contained both observations _as structured, comparable objects_. The insight lives in the join between a typed literary observation and a dated historical edge. The graph found it because the graph was built to perform exactly that join, and to refuse to finish until it had.
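
A minimal sketch of that join, in plain Python rather than against the real graph state. The `MotifObservation` and `HistoricalFact` shapes, their field names, and the `valence_flips` and `facts_up_to` helpers are assumptions made for illustration; the rows are copied from the trace above, with a publication year attached to each observation so the two tables share a key.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class MotifObservation:
    motif: str
    work: str
    year: int      # publication year of the work (assumed join key)
    valence: str


@dataclass(frozen=True)
class HistoricalFact:
    year: int
    text: str


def valence_flips(observations):
    """Pairs of observations that share a motif but disagree on valence."""
    return [
        (a, b)
        for a, b in combinations(observations, 2)
        if a.motif == b.motif and a.valence != b.valence
    ]


def facts_up_to(facts, year):
    """Dated facts on or before `year`, in chronological order."""
    return sorted((f for f in facts if f.year <= year), key=lambda f: f.year)


observations = [
    MotifObservation("storm", "Gitanjali", 1912, "liberatory"),
    MotifObservation("storm", "Ghare Baire", 1916, "destructive"),
    MotifObservation("stranger", "Gitanjali", 1912, "ambiguous"),
]
facts = [
    HistoricalFact(1905, "Tagore PARTICIPATED_IN Swadeshi"),
    HistoricalFact(1907, "Tagore WITHDREW_FROM Swadeshi"),
    HistoricalFact(1912, "Gitanjali published"),
]

for a, b in valence_flips(observations):
    earlier = min(a, b, key=lambda o: o.year)
    context = facts_up_to(facts, earlier.year)
    print(a.motif, a.valence, "vs", b.valence, "->", [f.year for f in context])
```

On these rows the only flip is the storm motif, and the facts dated on or before the earlier work include the 1907 withdrawal, which is the pairing the critique node demanded an account of.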

## Part 5: The Evaluation Metric

Now the uncomfortable question. The output reads well. How do we know it is _good_, in a way we can measure and compare across system versions?

The default answer, LLM-as-a-judge, is weak precisely here. Judge models score fluency and confidence reliably, and literary quality unreliably. They share training-data biases with the generator, so they reward the same canonical readings the generator already produces, which penalizes exactly the novel, evidence-driven insights we built this system to find. Self-preference bias, where a judge rates output from its own model family more favorably, is well documented. And for subjective cultural material there is often no single gold answer, so "similarity to reference" metrics measure conformity, not insight.

So we shift the evaluation target. Do not grade the essay. Grade the _trace_. The final text is subjective. The process that produced it is fully observable, because we made the process a graph on purpose. I evaluate four trace-level properties, each computable from the logged state history:

```python
def _ratio(numerator: float, denominator: float) -> float:
    # Guard against empty denominators (e.g. a synthesis with no
    # factual claims) instead of raising ZeroDivisionError.
    return numerator / denominator if denominator else 0.0


def evaluate_trace(trace: list[StateSnapshot]) -> dict[str, float]:
    final = trace[-1].state
    return {
        # 1. Grounding rate: share of factual claims in the final
        #    analysis that map to a sourced HistoricalFact in state.
        #    Claims from parametric memory score zero. Target: >0.9
        "grounding_rate": _ratio(grounded_claims(final), total_claims(final)),

        # 2. Contradiction handling: did the run surface at least
        #    one contradiction, and were all surfaced ones resolved
        #    BEFORE synthesis? A run with zero contradictions on a
        #    rich text is suspicious, not impressive.
        "contradiction_score": contradiction_metric(final["contradictions"]),

        # 3. Evidence utilization: fraction of motif observations
        #    and facts actually referenced by the synthesis. Detects
        #    decorative retrieval that never shaped the argument.
        "evidence_utilization": _ratio(used_evidence(final), len(all_evidence(final))),

        # 4. Gate integrity: did the run exit through the quality
        #    gate (confidence >= τ, conflicts resolved) or through
        #    the MAX_REVISIONS bound? Bound-exits are flagged runs.
        "gate_integrity": 1.0 if final["exit_reason"] == "quality_gate" else 0.0,
    }
```

The contradiction metric deserves emphasis because it is the contrarian one. In most eval setups, a contradiction is a failure. Here, a trace that never opened a contradiction on a text like _Gora_ or _Gitanjali_ is marked as shallow. Serious texts contain real tensions. A system that finds none was not looking. We are scoring the _epistemic behavior_: did the system seek conflict, record it as a typed object, and resolve it from evidence in the state rather than smoothing it over with prose?
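
One way such a metric could be scored, as a hedged sketch. The `status` key and the choice to score an empty list as 0.0 and otherwise return the resolved share are assumptions made for illustration; a real rubric might weight these cases differently.

```python
def contradiction_metric(contradictions: list[dict]) -> float:
    """Score how a run handled contradictions.

    Assumed shape: each contradiction is a dict with a "status" key
    whose value is either "open" or "resolved".
    """
    if not contradictions:
        # Nothing surfaced on a rich text: treated as shallow, not clean.
        return 0.0
    resolved = sum(1 for c in contradictions if c["status"] == "resolved")
    # Full credit only when every surfaced contradiction was resolved.
    return resolved / len(contradictions)
```

Note the inversion relative to ordinary evals: the empty list is the worst score, not the best.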

For the subjective remainder, the human layer becomes tractable. Instead of asking a scholar "is this 900-word essay good," you show the trace and ask three small, answerable questions. Are the KG facts correct? Is each motif observation a fair reading of its quoted span? Is the recorded contradiction resolution sound? Expert time goes from grading essays to auditing structured claims, which is faster, more consistent between raters, and produces reusable corrections that flow straight back into the knowledge graph.

And reproducibility, the thing the humanities are routinely accused of lacking, comes nearly free. Fix the text, the KG snapshot, the model version, and temperature zero, and the only remaining variance is in the model calls themselves, which temperature zero narrows but does not always eliminate. The routing function is deterministic Python, so identical scores always produce identical routes, and the logged trace is replayable and diffable. When a model upgrade changes the analysis of poem 39, you can point to the exact node, the exact state diff, where the reading diverged. Two scholars can disagree about an interpretation while agreeing entirely on the evidence trail. That is not a small thing. In most of computational humanities today, even that baseline is missing.
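
Finding that exact node can be sketched in a few lines. Here a trace is assumed to be a list of `(node_name, state_dict)` pairs; that shape and the `first_divergence` name are illustrative assumptions, not the logged format.

```python
def first_divergence(trace_a, trace_b):
    """Return the first hop where two traces differ, or None if identical."""
    for i, (hop_a, hop_b) in enumerate(zip(trace_a, trace_b)):
        node_a, state_a = hop_a
        node_b, state_b = hop_b
        changed = sorted(
            key for key in state_a.keys() | state_b.keys()
            if state_a.get(key) != state_b.get(key)
        )
        if node_a != node_b or changed:
            return {"hop": i, "nodes": (node_a, node_b), "changed_keys": changed}
    if len(trace_a) != len(trace_b):
        # Identical prefix, but one run took extra hops.
        return {"hop": min(len(trace_a), len(trace_b)), "nodes": None, "changed_keys": []}
    return None


old = [("supervisor", {"confidence": 0.115}), ("syntactic", {"motifs": 3})]
new = [("supervisor", {"confidence": 0.115}), ("syntactic", {"motifs": 4})]
print(first_divergence(old, new))
```

Run against the traces from before and after a model upgrade, the returned hop index and changed keys are what you would hand to a reviewer.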

## Part 6: Implementation Notes and Pitfalls

Everything above is the clean version. Here is the field report, the things that cost me real debugging hours, so they cost you none.

**Pitfall 1: The supervisor becomes a bottleneck and a liar.** Early versions had the supervisor LLM both score completeness _and_ choose the route in one generation. The scores drifted to flatter the route it had already decided on. Models rationalize, same as people. The fix is the separation you saw in the code: the LLM emits only the per-dimension scores as structured output, and a pure Python function computes the route from those scores. Never let the same generation both grade the work and decide what happens next. Split the judge from the gavel.

**Pitfall 2: Critique nodes go soft over turns.** Run the critique node on its own conversation history and by revision three it starts praising the draft it attacked in revision one. Politeness is baked deep into instruction-tuned models. The fix has two parts. First, the critique node is stateless with respect to its own past tone: it receives the draft, the evidence tables, and the open contradictions, never its previous friendly phrasing. Second, its output schema has no field for praise. It can emit attacks, contradictions, and demands, or it can emit an empty list. Structure beats prompting here, every single time.
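
A sketch of what "no field for praise" can look like, using stdlib dataclasses. The class and field names (`CritiqueOutput`, `attacks`, `contradictions`, `demands`) are assumptions modeled on the trace, not the exact schema, and `parse_critique` is a hypothetical stand-in for whatever structured-output validation the node actually uses.

```python
from dataclasses import dataclass, field, fields


@dataclass
class CritiqueOutput:
    # Every field is adversarial by construction; each may be empty.
    attacks: list[str] = field(default_factory=list)
    contradictions: list[str] = field(default_factory=list)
    demands: list[str] = field(default_factory=list)


ALLOWED = {f.name for f in fields(CritiqueOutput)}


def parse_critique(raw: dict) -> CritiqueOutput:
    """Reject any key outside the schema, so praise has nowhere to go."""
    extra = set(raw) - ALLOWED
    if extra:
        raise ValueError(f"unexpected critique fields: {sorted(extra)}")
    return CritiqueOutput(**raw)


ok = parse_critique({"attacks": ['draft claims the poem is "purely devotional, apolitical."']})
```

An empty `CritiqueOutput()` is still a valid answer, which is the point: the node can say nothing, but it cannot say something nice.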

**Pitfall 3: Entity extraction misses the non-English world.** Off-the-shelf extraction is tuned on Western news text. It happily finds "London" and misses "Shantiniketan." It splits "Bankim Chandra Chattopadhyay" into fragments and asks the KG about a person named "Chandra." Budget real time for a domain entity list and alias table (Chattopadhyay, Chatterjee, same person; Kolkata, Calcutta, same city, different eras). This is unglamorous dictionary work and it moved my grounding rate more than any model upgrade did.
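
The alias table itself is simple. A minimal sketch, assuming a hand-built dictionary and a longest-alias-first scan so that a full name wins over its fragments; the `ALIASES` entries and the `find_entities` helper are illustrative, not the production list.

```python
import re

# Lowercased surface form -> canonical KG name (illustrative entries only).
ALIASES = {
    "bankim chandra chattopadhyay": "Bankim Chandra Chattopadhyay",
    "bankim chandra chatterjee": "Bankim Chandra Chattopadhyay",
    "kolkata": "Kolkata",
    "calcutta": "Kolkata",
    "shantiniketan": "Shantiniketan",
}


def find_entities(text: str) -> list[str]:
    """Return canonical names for known aliases found in `text`."""
    found = []
    remaining = " ".join(text.lower().split())
    # Longest aliases first, so "Bankim Chandra Chatterjee" is consumed
    # whole before any shorter alias could match inside it.
    for alias in sorted(ALIASES, key=len, reverse=True):
        pattern = rf"\b{re.escape(alias)}\b"
        if re.search(pattern, remaining):
            canonical = ALIASES[alias]
            if canonical not in found:
                found.append(canonical)
            remaining = re.sub(pattern, " ", remaining)
    return found


print(find_entities("Bankim Chandra Chatterjee wrote in Calcutta."))
```

Because both spellings map to one canonical name, the KG is queried once per entity regardless of which era's spelling the source text uses.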

**Pitfall 4: Cost control is an architecture concern, not an afterthought.** A four-revision run on this graph makes 15 to 25 LLM calls. The supervisor and entity extractor run most often and need the least intelligence, so they go on a small, cheap model. Only the syntactic, critique, and synthesis nodes get the large model. This tiered assignment cut my per-analysis cost by roughly two-thirds with no visible quality change in the traces. The same routing logic applies whether you call hosted APIs or your own [vLLM endpoint on EKS](/blog/vllm-on-eks); with self-hosting, the high call volume is exactly the steady traffic shape where a saturated GPU beats per-token pricing.
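
The tiering can live in one small table. This is a sketch under assumptions: the node keys follow the names used in this post, and the model identifiers are placeholders, not real model names.

```python
# Which tier each node runs on (node names as used in this post).
NODE_TIER = {
    "supervisor": "small",
    "entity_extractor": "small",
    "syntactic": "large",
    "critique": "large",
    "synthesis": "large",
}

# Placeholder identifiers; substitute whatever your provider or
# self-hosted endpoint actually serves.
MODEL_BY_TIER = {
    "small": "small-model-placeholder",
    "large": "large-model-placeholder",
}


def model_for(node: str) -> str:
    """Look up the model for a node; unknown nodes raise KeyError."""
    return MODEL_BY_TIER[NODE_TIER[node]]
```

Raising on an unregistered node is deliberate in this sketch: adding a node to the graph then forces an explicit cost decision instead of silently defaulting to the expensive tier.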

**Pitfall 5: Checkpoint everything.** Long runs fail at the worst node. LangGraph's checkpointer persists state after every node, so a crash resumes from the last completed hop instead of re-spending the whole run. Turn it on from day one. Also log every state diff to durable storage. The trace is not just your debugging tool, it is your evaluation dataset, and Part 5 does not work without it.
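
For the state-diff log, an append-only JSON Lines file is enough to start. A minimal sketch, separate from LangGraph's own checkpointer: the `state_diff` and `log_hop` helpers, the record fields, and the file layout are assumptions for illustration, and state values are assumed to be JSON-serializable.

```python
import json
from pathlib import Path


def state_diff(before: dict, after: dict) -> dict:
    """Keys a node added or changed (a removed key maps to None)."""
    return {
        key: after.get(key)
        for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    }


def log_hop(path: Path, trace_id: str, node: str, rev: int,
            before: dict, after: dict) -> None:
    """Append one JSON line per completed node."""
    record = {
        "trace": trace_id,
        "node": node,
        "rev": rev,
        "diff": state_diff(before, after),
    }
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n")
```

One line per hop keeps the file greppable by trace id and cheap to load back as an evaluation dataset.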

**Pitfall 6: Watch your DAG discipline when adding parallelism.** The star topology above is sequential and easy to reason about. The tempting upgrade is fanning out syntactic and historical nodes in parallel, since they are independent. It works, and the reducers were designed for it, but the moment you add parallel branches _plus_ cycles, you must think carefully about which writes can interleave. Keep cycles confined to the supervisor-critique loop and let the parallel fan-out remain acyclic. A graph that is a DAG everywhere except one well-guarded loop is a graph you can still hold in your head at 2 a.m.
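
A quick way to check that a reducer tolerates interleaved writes is to apply the branches in both orders and compare. The sketch below assumes facts are `(year, text)` tuples and uses a hypothetical `merge_facts` reducer that deduplicates and sorts chronologically; it is not the reducer from the graph above.

```python
def merge_facts(existing: list[tuple[int, str]],
                new: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """Reducer: union the two lists, drop duplicates, sort chronologically."""
    return sorted(set(existing) | set(new))


branch_a = [(1905, "Swadeshi begins"), (1907, "Tagore WITHDREW_FROM Swadeshi")]
branch_b = [(1828, "Brahmo Samaj founded"), (1905, "Swadeshi begins")]

# Apply the two parallel branches in both possible orders.
a_then_b = merge_facts(merge_facts([], branch_a), branch_b)
b_then_a = merge_facts(merge_facts([], branch_b), branch_a)

# An order-independent reducer gives the same state either way.
assert a_then_b == b_then_a

# Plain list concatenation would not: the result depends on arrival order.
assert branch_a + branch_b != branch_b + branch_a
```

If the first assertion ever fails for a reducer you wrote, that state key is not safe to write from parallel branches.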

## Conclusion: Computational Rigor in the Humanities

Step back from the code and look at what the architecture is really claiming.

It claims that deep cultural analysis is not one act but a structured process: observe the language, fetch the verified history, force the conflict, resolve it from evidence, and only then conclude. Standard RAG compresses that process into a single similarity lookup, and the mathematics of chunked embeddings guarantees the loss. Giant context windows put more text in front of the same single pass, and softmax dilution plus positional extrapolation quietly tax every added token. The fix was never a bigger model. It was a better process, written down as a graph, with typed state, gated edges, and a trace you can audit line by line.

Tagore is a fitting first subject for this kind of machine. Bengalis on both sides of the border grow up inside his songs; two national anthems, one poet. He spent his life refusing easy binaries: tradition against reform, nation against world, devotion against doubt. An analysis system worthy of him must do the same, holding contradictory evidence in state until it is honestly resolved instead of collapsing it into the most statistically likely paragraph. The graph above is a small, working argument that we can build software with that kind of patience.

There is plenty left to do. The knowledge graph should grow from hundreds of edges to lakhs, semi-automatically, with human verification queues. The Syntactic Node should work on the Bangla originals with proper morphological tooling, not just the English self-translations. Parallel fan-out of worker nodes, checkpointed long runs, and serving economics all deserve their own posts, and the serving part already has one if you want to [run these token-hungry graphs on your own GPUs](/blog/vllm-on-eks).

If you build a version of this for your own literature, Nazrul, Ghalib, Faulkner, anyone whose work carries layered history, I genuinely want to see your trace logs. Which node surprised you? Where did your critique agent open a contradiction you had not seen yourself? That moment, when the machine surfaces a tension the builder missed, is the whole reason to do this work. Write to me and show me the diff.


---

<!-- METADATA_START -->
## Metadata & Citations

### BibTeX
```bibtex
@article{beyond-rag-tagore_2026,
  author = {Rantideb Howlader},
  title = {Beyond RAG: Using Multi-Agent Systems for Deep Cultural and Literary Analysis},
  journal = {Rantideb Howlader Portfolio},
  year = {2026},
  url = {https://www.ranti.dev/blog/beyond-rag-tagore},
  note = {Accessed: 2026-06-24}
}
```

### IEEE
Rantideb Howlader, "Beyond RAG: Using Multi-Agent Systems for Deep Cultural and Literary Analysis," Rantideb Howlader Portfolio, 2026. [Online]. Available: https://www.ranti.dev/blog/beyond-rag-tagore. [Accessed: 2026-06-24].

### APA
Rantideb Howlader. (2026). Beyond RAG: Using Multi-Agent Systems for Deep Cultural and Literary Analysis. Rantideb Howlader. Retrieved from https://www.ranti.dev/blog/beyond-rag-tagore

--- 
*This content is provided in research-grade Markdown format. Required Attribution: Cite as Rantideb Howlader (2026).*
<!-- METADATA_END -->