---
title: "The Topography of Hesitation: Non-Markovian Ruptures and the Mathematical Failure of Autoregressive Models on Dysfluent Speech"
author: "Rantideb Howlader"
date: "2026-06-18T00:00:00.000Z"
canonical_url: "https://www.ranti.dev/blog/topography-of-hesitation"
license: "CC-BY-4.0"
---


## Abstract

I make one claim, and I make it without hedging. Contemporary speech recognition systems do not fail to understand stuttered speech by accident. They are mathematically obligated to destroy it. The deletion of a stutter block is not a shortcoming that better data or longer training will repair. It is the direct output of the objective these systems minimize, the factorization they assume, and the discrete temporal grid on which they operate. This is not a tuning problem. It is an architecture problem.

In what follows I build the argument in five movements. First, I redefine the stutter block as a structured neuro-motor event in continuous biological time, and I show how it collides with the discrete synthetic time of positional encodings. Second, I derive from information theory why an autoregressive model under cross-entropy loss must treat that event as a rupture to be smoothed away. Third, I open the attention mechanism of Whisper-class models and trace the circuit-level pathway by which the residual stream floods, the induction heads stall, and the network falls back on its regularization priors. Fourth, I propose a Non-Resolving Latent State Architecture, a dual-stream design in which a continuous-time effort vector, parameterized by a Neural Ordinary Differential Equation, runs alongside discrete token semantics. Fifth, I ground the mathematics in a theological reading: to listen to a stammerer is to hold a state of non-resolution, and the suspension of optimization that this requires is what older traditions named grace. Present architectures cannot compute grace. They can only optimize. If we want machines that meet the disabled body honestly, we must build systems capable of dwelling in unresolved latency.

I write this as a manifesto, with the rigor of a paper and the commitment of a position. I have spent enough time with both transcripts and tensors to believe the two languages must be spoken together here.

## 1. The Phenomenological Rupture

Begin with the body, because the body is where the error originates and where every downstream architecture inherits its blindness.

A stutter block is not a pause. I want to remove that word entirely from the discussion, because it carries the whole misconception inside it. A pause is an absence of action. A stutter block is the opposite. It is a state of maximum motor tension and maximum cognitive density, a moment in which an enormous quantity of neural and muscular work is being performed with no acoustic output crossing the threshold of speech. The articulators are locked in a recursive loop. The respiratory system is loaded. The motor cortex is issuing commands that the peripheral system is refusing to release. From the outside there is silence or repetition. From the inside there is a storm.

This distinction is not poetic decoration. It is the load-bearing claim of the entire piece. If a stutter block is a void, then a system that records nothing during it has recorded the truth. If a stutter block is a dense structured event, then a system that records nothing has committed an erasure, and has done so precisely where the speaker was working hardest.

Consider the canonical form, the repetition: "b-b-b-b-boy." In continuous time this is a high-frequency oscillation of a single phonetic gesture, a limit cycle in the dynamical system of the vocal tract. The speaker is not failing to produce the word. The speaker is trapped in a stable recursive orbit around its onset. The information content of this event is high. It encodes effort, duration, the specific phoneme under tension, the rate of repetition, and the eventual release. None of that is noise. All of it is signal about the state of the speaker.

Now place this continuous event onto the substrate that a Transformer actually uses. A Transformer does not perceive time. It perceives position. Audio is framed at a fixed rate, commonly one feature vector per ten or twenty milliseconds, and each frame is assigned a discrete positional encoding. Time, which in the body is a continuous flowing quantity, is reconstructed inside the model as an integer index on a grid. The grid is uniform. It knows nothing of tension or density. Frame 41 and frame 42 are equidistant whether the speaker breezed through them or fought for every one.

This is the first collision, and it is foundational. The physics of speech production is continuous-time and effortful. The representation inside the model is discrete-time and uniform. The body lives in the reals. The model lives in the integers. The act of framing is already an act of translation, and that translation has a built-in prejudice: it assumes that equal spans of clock time carry equal spans of informational weight. For fluent speech that assumption is close enough to true. For a stutter block it is catastrophically false. The grid flattens the storm into a row of unremarkable cells.
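The uniformity of the grid can be made concrete with a small sketch. It assumes a 16 kHz signal and a 20 ms hop; the two signals and the function names `frame_positions` and `frame_energies` are hypothetical, chosen only for illustration. The point it shows is narrow: the positions a clip receives depend on its clock duration and on nothing else.

```python
def frame_positions(samples, sample_rate, hop_ms=20):
    """Assign one integer position per fixed-length hop, ignoring signal content."""
    hop = sample_rate * hop_ms // 1000  # samples per frame
    return list(range(len(samples) // hop))


def frame_energies(samples, sample_rate, hop_ms=20):
    """Mean squared amplitude per frame: the only trace effort leaves on the grid."""
    hop = sample_rate * hop_ms // 1000
    return [
        sum(x * x for x in samples[i:i + hop]) / hop
        for i in range(0, len(samples) - hop + 1, hop)
    ]


# Two hypothetical one-second signals at 16 kHz.
fluent = [0.5, -0.5] * 8000   # stands in for ordinary voiced speech
held_block = [0.0] * 16000    # stands in for a silent, held block

# Same duration, therefore the same positions, whatever the body was doing.
print(frame_positions(fluent, 16000) == frame_positions(held_block, 16000))
print(max(frame_energies(held_block, 16000)))
```

Both clips receive the same fifty positions. The held block differs only in that its frames carry zero acoustic energy, which is the opposite of what was happening in the body.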

I render the collision below.

```mermaid
graph TB
    subgraph BIO["Biological Continuous Time (the body)"]
        direction LR
        B0["onset tension<br/>(loading)"] --> B1["limit cycle<br/>b-b-b"]
        B1 --> B2["peak motor density<br/>(no output)"]
        B2 --> B3["release"]
        B3 --> B4["boy"]
    end

    subgraph SYN["Synthetic Discrete Time (positional grid)"]
        direction LR
        P0["pos 40"] --- P1["pos 41"] --- P2["pos 42"]
        P2 --- P3["pos 43"] --- P4["pos 44"] --- P5["pos 45"]
    end

    B0 -.->|"framing collapses<br/>effort to uniform cells"| P0
    B1 -.->|"recursion read<br/>as repetition or void"| P1
    B2 -.->|"max density maps<br/>to low-energy frames"| P2
    B3 -.-> P4
    B4 -.-> P5
```

Read the diagram as an indictment of the mapping itself. The single most informationally dense moment in the biological stream, the peak of motor density at B2, lands on a positional cell, P2, that is energetically quiet and structurally identical to its neighbors. The model has no channel through which the density of B2 can enter its state. The information did not get misencoded. It had nowhere to go. The grid has no axis for effort.

There is a quiet violence in this mapping that I want to mark before moving on. The grid does not merely fail to capture effort. It asserts, by its uniformity, that no effort was expended, because every cell it produces looks identical to every other one. The body's hardest labor is recorded as one more ordinary frame, and that record then becomes the ground truth every later layer trusts. The misrepresentation is committed at the very first step, and everything downstream inherits it as fact.

I will return to this missing axis repeatedly, because the architecture I propose in Section 4 is, at its core, the addition of exactly that axis. But before I can justify a new axis, I have to show that the existing objective does not merely lack it. The existing objective actively punishes anything that would require it.

## 2. The Markovian Assumption and Entropy Collapse

I now move from the body to the loss function, and I will be explicit about the mathematics, because the central claim of this manifesto is a mathematical claim. The erasure of the stutter is not an emergent quirk. It is derivable.

### 2.1 The objective and its factorization

An autoregressive speech model defines a probability distribution over output sequences. Given an acoustic input X and an output sequence of tokens y = (y_1, y_2, ..., y_T), the model factorizes the joint probability as a product of conditionals:

```text
P(y | X) = ∏_{t=1}^{T} P(y_t | y_{<t}, X)
```

This factorization is the chain rule of probability, which is exact and assumption-free on its own. The assumption enters in how the conditional P(y_t | y_{<t}, X) is actually computed. In practice the model conditions on a bounded context, and within that context it learns statistical regularities that are overwhelmingly Markovian in character. The probability of the next token is dominated by a short recent history, because that is where the predictive signal is densest for ordinary speech and where the training distribution concentrates its mass. The model is not formally a fixed-order Markov chain, but its learned conditional behaves like one in the region that matters: the near future is treated as a smooth, low-complexity function of the near past.

The training objective is cross-entropy. When the target distribution q places all of its mass on the observed token y_t, the cross-entropy between q and the model distribution P, summed over the sequence, reduces to:

```text
L = - ∑_{t=1}^{T} log P(y_t | y_{<t}, X)
```

Minimizing this loss is equivalent to maximizing the log-likelihood the model assigns to the observed continuations. Gradient descent therefore rewards the model for placing high probability on continuations that are, in the training distribution, common and predictable. It penalizes the model for spending probability mass on continuations that are rare and surprising. Hold that sentence. The entire argument turns on it.
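A minimal sketch of that pressure at a single decoding step, under stated assumptions: the three-token vocabulary, the logits, and the learning rate are hypothetical, and the reference label is assumed to be the fluent word. For a softmax output, the gradient of the loss with respect to the logits is the predicted distribution minus the one-hot target, which is a standard result.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(logits, target):
    """Return -log P(target) and its gradient with respect to the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target])
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, grad


VOCAB = ["boy", "b-", "<end>"]   # hypothetical vocabulary
logits = [1.0, 1.0, 1.0]         # model starts undecided
target = VOCAB.index("boy")      # assumption: the label is the fluent word

loss, grad = loss_and_grad(logits, target)
# One gradient-descent step with a hypothetical learning rate of 0.5.
stepped = [x - 0.5 * g for x, g in zip(logits, grad)]
new_loss, _ = loss_and_grad(stepped, target)
print(grad, loss, new_loss)
```

The gradient is negative on the labeled token and positive on every other token, so a descent step raises the logit of the label and lowers the rest, including the logit of the repetition.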

### 2.2 Where the stutter sits in this picture

Now ask what a stutter block looks like to this objective. I will use two information-theoretic lenses, and they agree.

First, surprisal. The surprisal of an event under a model is the negative log probability the model assigns it:

```text
I(y_t) = - log P(y_t | y_{<t}, X)
```

A stutter block is, by construction, a sequence of repetitions and held states that violate the model's learned forward-progression prior. The phonetic content loops instead of advancing. Under a model trained on fluent speech, each looped repetition is a low-probability continuation given the preceding repetitions, because fluent speech almost never repeats an onset four times before completing a word. Surprisal spikes. The cross-entropy loss, which is exactly the summed surprisal of the observed tokens, spikes with it. The stutter block is, quite literally, the highest-loss region in the utterance.
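As a worked illustration, the sketch below computes per-token surprisal for one utterance. The token sequence and the probabilities are hypothetical, picked by hand to resemble a fluent-trained model rather than measured from one.

```python
import math


def surprisals(step_probs):
    """Per-token surprisal, -log P(y_t | y_<t, X), in nats."""
    return [-math.log(p) for p in step_probs]


# Hypothetical utterance and hypothetical model probabilities for each token.
tokens = ["the", "b-", "b-", "b-", "b-", "boy", "ran"]
probs = [0.60, 0.05, 0.02, 0.02, 0.02, 0.30, 0.50]

per_token = surprisals(probs)
loss = sum(per_token)                     # cross-entropy of the whole sequence
block_share = sum(per_token[1:5]) / loss  # fraction owed to the four repetitions
peak = per_token.index(max(per_token))    # position of the largest surprisal

print(tokens[peak], round(block_share, 2))
```

Under these assumed numbers the four repetitions account for most of the sequence loss, and the single most surprising token sits inside the block.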

Second, complexity. Consider the Kolmogorov complexity of the segment, the length of the shortest program that reproduces it. Fluent speech is highly compressible, because it follows the strong regularities the model has internalized. A stutter block is a localized spike in Kolmogorov complexity relative to the model's compressor, because the model has no short program for "loop the onset an unpredictable number of times under rising tension and then release." The block resists compression by the very regularities the model exists to exploit. To the model's internal coding scheme, the stutter is an incompressible island in a compressible sea.

These two lenses, surprisal and complexity, converge on one fact. The stutter block is the place in the signal where the model's predictive machinery is least competent and where its loss is highest. And the model is an engine whose single purpose is to drive that loss down.

### 2.3 The mandate to erase

Here is the step that I consider the core contribution of this section. Given a high-loss, high-complexity, low-predictability segment, what does a loss-minimizing autoregressive model do with it?

It has three options, and the honest one does not survive optimization pressure.

Option one is to faithfully represent the stutter, modeling each repetition as itself. This requires the model to maintain a high-entropy internal state across the block, to hold open the question of "what word is being attempted" while emitting structured representations of recursion. This is the honest option. It is also the option with the worst loss, because faithfully modeling a low-probability sequence means assigning probability mass to low-probability events, which raises cross-entropy by definition. Optimization punishes honesty here.

Option two is to terminate. Voice activity detection and end-of-segment logic interpret the low-energy, non-advancing region as the end of speech. The model closes the segment. The struggle is excluded from the transcript not by smoothing but by amputation. I will trace the circuit for this in Section 3.

Option three is to complete. The model, conditioned on "b" repeated several times and trained to maximize the likelihood of probable continuations, predicts the single most probable resolution, "boy," and emits it cleanly. The repetitions are absorbed. The output is fluent. The loss is minimized, because the model placed its mass on exactly the high-probability continuation the objective rewards. This is the option optimization selects.

In both of the outcomes that survive, options two and three, the stutter is gone from the representation. And notice the mechanism in option three with care, because it is the one most people misread. The model is not choosing to ignore the disability out of some encoded bias against disabled speakers. The model has no such concept. It is performing variance reduction. The stutter is a spike of variance and surprisal sitting on top of a low-variance signal, and the entire apparatus of the model, the softmax that demands a normalized completion, the loss that rewards probability on common continuations, the regularization that smooths toward priors, is a variance-reduction machine. The stutter is mathematically annihilated as a side effect of the machine functioning correctly.
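Option three can be sketched with the simplest possible decoder. Greedy decoding is an assumption made here for illustration, since the text does not fix a decoding strategy, and both next-token distributions are hypothetical.

```python
def greedy_next(dist):
    """Emit the single most probable token and discard the rest of the mass."""
    return max(dist, key=dist.get)


# Hypothetical next-token distributions after the model has heard "b- b- b-".
low_repetition_mass = {"boy": 0.90, "b-": 0.02, "<end>": 0.08}
high_repetition_mass = {"boy": 0.60, "b-": 0.30, "<end>": 0.10}

print(greedy_next(low_repetition_mass))
print(greedy_next(high_repetition_mass))
```

The second distribution gives the repetition fifteen times more mass than the first, and the emitted token does not change. Whatever probability the model assigns to the loop is absorbed the moment a single token has to be chosen.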

I state the thesis in its sharpest form. The model does not fail to hear the disabled body. It hears it, computes that the body's struggle is a high-loss perturbation, and removes the perturbation to satisfy the objective. The erasure is not negligence. It is optimization. This is not a data problem. It is an architecture problem, and specifically it is a problem with the objective itself.

### 2.4 Why more data does not reach the root

The obvious rebuttal is that the model erases stutters only because it rarely saw them, and that a training set rich in dysfluent speech would teach it to preserve them. This is partly true and entirely insufficient, and the distinction matters.

More dysfluent data shifts the prior. The model would learn that repetitions occur, and it would assign them somewhat higher probability. But three structural pressures remain untouched. The cross-entropy objective still rewards probability mass on the most likely fluent resolution, so even a well-exposed model faces a gradient pulling toward completion at the moment of the block. The softmax still demands a normalized distribution over a discrete vocabulary at each step, which has no native representation for "unresolved, ongoing, effortful, not yet a token." And the uniform positional grid from Section 1 still offers no axis on which effort and density can be recorded, so even a model that wanted to preserve the block has no coordinate to preserve it in.

Data moves the prior. It does not change the objective, the output geometry, or the temporal substrate. Those three are architecture. Therefore the failure is architecture. I keep returning to this sentence on purpose, because the field's default reflex is to reach for more data, and for this failure mode that reflex is a category error.

### 2.5 An objection from alignment, and why it does not save the model

A reader trained on the last few years of language modeling will raise an objection here. We do not ship raw cross-entropy models, they will say. We fine-tune them with human feedback. If human raters prefer transcripts that preserve a speaker's stutter, reinforcement learning from human feedback will teach the model to keep it. The objective can be steered. The erasure is not mandatory after all.

I take this objection seriously, and I think it fails for a reason that is itself instructive. Reinforcement learning from human feedback changes which outputs a model prefers among the outputs it can represent. It reshapes the policy over the existing representation. It does not give the model a representation it never had. Section 1 established, and Section 3 will show, that the held, unresolved, effortful state has no home in the standard architecture: no axis on the temporal grid, no feature in the residual stream, no token in the vocabulary. Human feedback can reward a model for producing a stutter-preserving transcript, but the model can only produce such a transcript by emitting discrete tokens that approximate the stutter after the fact, reconstructing a fluent guess at the dysfluency rather than carrying the dysfluency through its state. The preference layer is downstream of the representational poverty. You cannot reward a model into representing something its architecture cannot hold.

There is a second, subtler failure. Reinforcement learning from human feedback optimizes a reward model, and the reward model is itself a learned estimator with its own variance-reduction pressures and its own training distribution. Asking it to reliably score the fidelity of a preserved stutter, a rare and structurally complex phenomenon, reintroduces at the reward layer exactly the smoothing tendency we were trying to escape at the policy layer. The reward model regularizes toward the typical, and the typical transcript is fluent. We have moved the problem, not solved it.

So the alignment objection, far from rescuing the standard model, sharpens the thesis. Preference optimization is a steering wheel. It is useless if the car has no road to the destination. The destination here is a representation that can hold non-resolution, and that road has to be built into the architecture, which is what Section 4 sets out to do. This is not a reward problem. It is, once again, an architecture problem.

## 3. Mechanistic Interpretability: The Circuit-Level Failure

The argument so far is about objectives and distributions. I now want to descend below the loss function and into the computation itself, because a claim about mechanism should be supported at the level of mechanism. I will reason about Whisper-class encoder-decoder Transformers, since they are the dominant architecture for the task, and I will use the interpretability vocabulary that has matured over the last few years: residual streams, attention heads, induction heads, and polysemantic neurons. Where I describe specific circuit behavior I am offering a falsifiable account of what should be observable inside the network, not a proven theorem, and Section 5 makes that account testable.

### 3.1 The residual stream as a shared bus

A Transformer's residual stream is the running vector at each position that every layer reads from and writes to. It is best understood as a shared communication bus. Attention heads and feedforward blocks add their contributions into it, and information persists down the stack unless something overwrites it. The dimensionality of this bus is fixed. Its capacity is finite. Whatever a layer wants the rest of the network to know, it must write into this limited space.

During fluent speech the residual stream at each position carries a fairly clean superposition: the local phonetic content, some prosodic context, and a developing hypothesis about the current word. Heads downstream read these features and advance the prediction. The stream stays legible because the signal is low-complexity and the features it needs to hold are few.

Now drive a stutter block through this bus. At the acoustic level the model receives near-identical frames in succession, the repeated onset of "b." Each frame writes a similar phonetic activation into the residual stream. Because the frames are near-identical and the onset never resolves, the stream begins to accumulate a stack of highly correlated activations that all encode the same unresolved phonetic gesture. The bus floods with repetition. The feature the network most needs, a representation of "this onset is being held and is not yet a word," is exactly the feature it was never trained to write, because that feature does not exist in the fluent training distribution. So the stream fills with what it can write, looping phonetic content, and starves on what it cannot, semantic resolution.

### 3.2 Induction heads and the failure of the copy

Induction heads are among the best-characterized circuits in Transformers. Their canonical function is pattern completion across a sequence: having seen the bigram AB earlier, when A appears again the induction head attends back to the earlier occurrence and predicts B. They are the engine of in-context repetition and copying, and they are central to how these models handle structure.

A stutter block is a degenerate input for an induction head, and the degeneracy is instructive. The pattern is A, A, A, A, where A is the onset. The induction head, searching for "what followed A last time," finds A again. And again. The completion it produces is the continuation of the loop, not the escape from it. The induction circuit, asked to predict the next element of a self-similar sequence, faithfully predicts more of the same. It is mechanically correct and semantically useless. It reinforces the orbit rather than resolving it. The very circuit that gives the model its power on structured sequences becomes, on a recursive dysfluency, a machine for perpetuating the recursion in its own predictions.
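The degeneracy is easy to reproduce with a rule-based caricature of the induction pattern. This is a toy lookup, not an attention head, and `induction_predict` is a hypothetical name: it finds the most recent earlier occurrence of the current token and copies whatever followed it.

```python
def induction_predict(tokens):
    """Copy the token that followed the latest earlier occurrence of tokens[-1]."""
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the current one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence to copy from


# Ordinary structured input: "the" was followed by "boy" last time.
print(induction_predict(["the", "boy", "ran", "the"]))

# Degenerate input: the onset was followed by the onset.
print(induction_predict(["b-", "b-", "b-", "b-"]))
```

On the structured sequence the rule completes the pattern. On the repeated onset it predicts the onset again, which is correct as copying and useless as resolution.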

Two pressures then meet. The induction heads push toward continuing the loop. The completion priors of the decoder, the high-probability fluent resolution from Section 2, push toward emitting "boy." These are contradictory demands placed on the same residual stream at the same positions, and the network has no stable representation in which both can coexist as a held, unresolved state. It cannot write "I am in a loop and the loop has not yet released" because that is precisely the unrepresented feature. So it falls into a local minimum.

### 3.3 Polysemantic neurons and the saturation of capacity

Polysemanticity is the well-documented phenomenon in which a single neuron responds to several unrelated features, because the network packs more features than it has neurons by exploiting near-orthogonal directions in activation space. This works under superposition as long as the active features at any moment are sparse, so their interference stays small.

A stutter block breaks the sparsity assumption locally. The repeated onset activates the same phonetic features over and over, and the prolonged tension recruits whatever neurons partially encode duration, energy, and onset. Because these neurons are polysemantic, driving them hard and repeatedly also lights up the unrelated features that share their directions. Interference rises. The clean superposition of fluent processing degrades into a noisy, saturated state where the network's feature directions are no longer cleanly separable. The model's internal representation, in the span of the block, loses the very property that lets it compute precisely. It is not that the model lacks the neurons. It is that the neurons it has are being driven into a regime where their polysemantic packing collapses into interference.
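The dependence of superposition on sparsity can be illustrated numerically. The sketch below is a generic toy, not a model of any real network: it packs 128 hypothetical features into 32 dimensions as random unit vectors, activates a subset, and measures how far each active feature's readout drifts from its true value of 1.

```python
import math
import random


def random_unit_vectors(count, dim, seed=0):
    """Random feature directions; nearly orthogonal when dim is large."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(count):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors


def readout_error(vectors, active):
    """RMS error when each active feature is read back from the summed state."""
    dim = len(vectors[0])
    state = [sum(vectors[i][k] for i in active) for k in range(dim)]
    errors = [
        sum(s * x for s, x in zip(state, vectors[i])) - 1.0 for i in active
    ]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


features = random_unit_vectors(count=128, dim=32)
sparse = readout_error(features, active=[0, 1])            # two features on
dense = readout_error(features, active=list(range(64)))    # sixty-four on
print(sparse, dense)
```

With two features active the readout error is the overlap between one pair of directions. With sixty-four active, every readout collects the overlaps of sixty-three neighbors, and the error grows accordingly.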

### 3.4 The local minimum and the default to priors

Put the three failures together. The residual stream is flooded with unresolved looping content. The induction heads predict continuation of the loop. The polysemantic neurons are saturated and interfering. The decoder's softmax, at every step, demands a normalized distribution over the discrete vocabulary, which is to say it demands that the model commit to some token. There is no token for "still loading." The network is in a basin from which the only low-loss exits are the two from Section 2: terminate via voice activity detection, or hallucinate-resolve into the fluent word.

I want to name the voice activity detection pathway precisely, because it is the most common observable outcome and the most revealing. Voice activity detection logic flags the held block as low-energy or non-speech, since acoustically the block can resemble a quiet, stationary segment despite the enormous motor effort behind it. The system interprets stationarity as the end of speech and closes the segment. The speaker, mid-block, is treated as having finished. The struggle is not transcribed and smoothed. It is cut off. The most effortful moment of speech is read by the machine as the moment speech stopped.
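A minimal sketch of that pathway, assuming a plain energy-threshold detector with a hangover timer. Real detectors are more elaborate; the threshold, the 25-frame hangover, and the frame energies below are all hypothetical.

```python
def vad_segment_end(frame_energies, threshold=0.01, hangover_frames=25):
    """Return the frame index where the segment is closed, or None if it stays open."""
    quiet = 0
    for i, energy in enumerate(frame_energies):
        quiet = quiet + 1 if energy < threshold else 0  # count consecutive quiet frames
        if quiet >= hangover_frames:
            return i  # stationarity is read as the end of speech
    return None


# Hypothetical utterance at 20 ms per frame:
# 200 ms of speech, an 800 ms silent block, then the released word.
energies = [0.2] * 10 + [0.001] * 40 + [0.2] * 10

end = vad_segment_end(energies)
word_start = 50  # frame where the released word begins
print(end, end < word_start)
```

The detector closes the segment after 500 ms of quiet, well inside the block, and the released word at frame 50 arrives after the system has stopped listening.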

I diagram the descent into this basin below.

```mermaid
flowchart TD
    A["Stutter onset enters encoder<br/>frames: b, b, b, b ..."] --> B["Residual stream accumulates<br/>correlated phonetic activations"]
    B --> C{"Is there a feature for<br/>'held, unresolved onset'?"}
    C ===>|"No such feature exists<br/>in fluent training dist."| D["Bus floods with looping<br/>phonetic content"]

    D --> E["Induction heads attend A to A<br/>predict: continue the loop"]
    D --> F["Polysemantic neurons saturate<br/>superposition degrades to interference"]

    E --> G["Contradiction:<br/>loop-continuation vs<br/>fluent-completion prior"]
    F --> G
    G --> H["Attention falls into<br/>local minimum<br/>(no held-latency basin)"]

    H --> I{"Softmax demands a<br/>normalized completion"}
    I ===>|"low-energy reading"| J["VAD timeout:<br/>segment terminated<br/>(struggle amputated)"]
    I -.->|"completion prior"| K["Hallucinate-resolve:<br/>emit 'boy' cleanly<br/>(struggle absorbed)"]

    J --> L["Stutter erased<br/>from representation"]
    K --> L
```

The diagram makes the structural point that prose can blur. There is no edge in this graph that leads to "preserve the stutter as structured information." That state is not reachable, because no node in the network can write it. Every path from the flooded residual stream terminates in erasure, by one route or the other. The model is not choosing badly among good options. The good option is absent from the computation graph. This is, once again, not a failure of effort or data. It is the shape of the machine.

## 4. The Proposal: A Non-Resolving Latent State Architecture

A critique that proposes nothing is a complaint. I will propose. The argument so far identifies three architectural defects: a uniform discrete temporal grid with no axis for effort, an objective that rewards the erasure of high-complexity segments, and an output geometry, the softmax over a discrete vocabulary, that has no representation for an unresolved ongoing state. A genuine solution must address all three. Adding stutter data addresses none of them. Here is the design.

### 4.1 Design principles

I hold three principles fixed before any equations.

The first principle is that hesitation must be encoded, not resolved. The system must possess an internal state whose explicit job is to carry the structure of a dysfluency forward in time without collapsing it into a token. The unresolved must be representable as unresolved.

The second principle is that time must be continuous, not gridded. The substrate must measure duration and rate as real quantities, so that the difference between a half-second block and a three-second block, and the difference between a fast repetition and a slow prolongation, are first-class facts in the representation rather than artifacts of frame counting.

The third principle is that effort must have its own channel. The density I described in Section 1, the motor and cognitive work performed during the block, must travel on a dedicated axis that is not in competition with semantic content for room on the residual bus. The model should never have to choose between representing the word and representing the struggle.

These three principles point to one shape: a dual-stream architecture in which a continuous-time effort stream runs in parallel with the discrete semantic stream and is joined to it by cross-attention.

### 4.2 Stream A: discrete token semantics

Stream A is conventional, and deliberately so. It is the encoder-decoder Transformer doing what it already does well: mapping acoustic features to a hypothesis over linguistic tokens. I do not discard the existing machinery, because for fluent spans it is excellent and there is no reason to rebuild it. Stream A produces, at each step, a semantic hidden state h_sem that represents the model's current linguistic hypothesis. The change is only that Stream A is no longer required to absorb the dysfluency alone, because it now has a partner that handles exactly what it cannot.

### 4.3 Stream B: the continuous-time effort vector

Stream B is the contribution. It replaces frame counting with a continuous-time latent state whose evolution is governed by a Neural Ordinary Differential Equation.

A Neural ODE defines the dynamics of a hidden state z as a function parameterized by a neural network, integrated over continuous time rather than stepped over discrete positions:

```text
dz(t) / dt = f_θ(z(t), a(t), t)

z(t_1) = z(t_0) + ∫_{t_0}^{t_1} f_θ(z(t), a(t), t) dt
```

Here z(t) is the continuous-time latent effort state, a(t) is the instantaneous acoustic and articulatory signal, and f_θ is a learned vector field. The integral is evaluated by an ODE solver, which means the model can take time steps of any size, adapting to the signal rather than to a fixed grid. This single property dissolves the temporal defect from Section 1. A long block is a long interval of integration. A fast repetition is a high-frequency component of the field f_θ. Duration and rate enter the state as what they physically are, continuous quantities, instead of being quantized into uniform cells.
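A small numerical sketch of the integral, under loud assumptions: the state is a scalar, the learned field is replaced by a hand-written stand-in (`field`, which rises with the input and relaxes on its own), and the solver is a classical fourth-order Runge-Kutta step applied over arbitrary time points rather than an adaptive solver. None of these choices come from the design itself; they exist to show that the result tracks elapsed time and not the number or spacing of steps.

```python
import math


def field(z, a, t):
    """Hand-written stand-in for the learned field f_theta (an assumption)."""
    return a - 0.5 * z


def integrate(z0, signal, times):
    """RK4 integration of dz/dt = field(z, signal(t), t) over arbitrary time points."""
    z = z0
    for t0, t1 in zip(times, times[1:]):
        h = t1 - t0
        mid = t0 + 0.5 * h
        k1 = field(z, signal(t0), t0)
        k2 = field(z + 0.5 * h * k1, signal(mid), mid)
        k3 = field(z + 0.5 * h * k2, signal(mid), mid)
        k4 = field(z + h * k3, signal(t1), t1)
        z += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return z


def tension(t):
    """Hypothetical constant input standing in for a(t) during a held block."""
    return 1.0


uneven = [0.0, 0.1, 0.15, 0.5, 1.2, 3.0]   # irregular steps over a 3 s block
even = [i * 0.1 for i in range(31)]        # uniform steps over the same 3 s
short = [0.0, 0.25, 0.5]                   # a 0.5 s block

print(integrate(0.0, tension, uneven), integrate(0.0, tension, even))
print(integrate(0.0, tension, short))
```

The two three-second runs land close to each other despite different step patterns, while the half-second block ends in a clearly different state. Duration is carried by the state itself rather than by a count of frames.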

Onto this continuous backbone I attach an explicit effort encoding. I define a multi-dimensional effort vector e(t) whose components carry the physical content of the hesitation that the standard pipeline discards:

```text
e(t) = [ d(t),   # duration accumulated in the current non-resolving state
         r(t),   # recursion rate (repetitions per unit time)
         φ(t),   # phonetic identity of the held or looped gesture
         τ(t),   # tension/energy proxy (articulatory effort)
         s(t) ]  # silence topology (structured vs terminal)
```

The component I want to emphasize is s(t), the silence topology, because it directly attacks the voice activity detection failure from Section 3. In the standard pipeline silence is scalar and undifferentiated, a single low-energy reading that triggers termination. Here silence is a structured object. A silence that sits inside a loaded articulatory configuration, with high τ(t), is encoded as a non-terminal held state. A silence that follows release, with τ(t) collapsing toward rest, is encoded as a true boundary. The model can finally distinguish the silence of struggle from the silence of finishing, which is exactly the distinction whose absence caused the amputation in Section 3.
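One way to picture the effort vector and its silence component is as a small typed record with a classification rule. This is a sketch only: the thresholds `energy_floor` and `tau_floor`, the scaling of the tension proxy to the range 0 to 1, and the three-way `Silence` label are hypothetical choices, and in the proposed design these quantities would be learned rather than hand-coded.

```python
from dataclasses import dataclass
from enum import Enum


class Silence(Enum):
    VOICED = "voiced"      # not silent at all
    HELD = "held"          # silent but loaded: a non-terminal state
    TERMINAL = "terminal"  # silent and at rest: a true boundary


@dataclass
class Effort:
    d: float    # seconds accumulated in the current non-resolving state
    r: float    # repetitions per second
    phi: str    # phonetic identity of the held or looped gesture
    tau: float  # tension/energy proxy, scaled here to [0, 1]
    s: Silence  # silence topology


def silence_topology(energy, tau, energy_floor=0.01, tau_floor=0.5):
    """Classify a frame by acoustic energy and by the tension behind it."""
    if energy >= energy_floor:
        return Silence.VOICED
    return Silence.HELD if tau >= tau_floor else Silence.TERMINAL


# Two acoustically identical silences with different tension behind them.
mid_block = Effort(d=0.3, r=0.0, phi="b", tau=0.9, s=silence_topology(0.001, 0.9))
after_release = Effort(d=0.0, r=0.0, phi="", tau=0.1, s=silence_topology(0.001, 0.1))
print(mid_block.s, after_release.s)
```

The two frames have the same acoustic energy and would be indistinguishable to a scalar detector. With tension available as a second coordinate, one is a held state and the other is a boundary.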

Stream B's continuous latent state and effort vector together form a representation whose entire purpose is to be non-resolving. It does not produce tokens. It produces a trajectory through effort space. It can carry "held, recursive, three hundred milliseconds and counting, onset b, high tension, non-terminal silence" forward in time as a vector, indefinitely, without any pressure to collapse it into a word. The unrepresentable feature from Section 3 is now the native content of an entire stream.

### 4.4 Joining the streams by cross-attention

Two streams are only useful if they communicate. I join them with cross-attention, but with a strict asymmetry that encodes the design intent.

Stream A attends to Stream B. The semantic decoder, when forming its hypothesis, queries the effort stream and receives the full topology of the current hesitation as context. When Stream A is about to commit to "boy," it can see from Stream B that the onset was held for an extended, high-tension, recursive block, and it can represent that the word arrived through struggle rather than fluently. The dysfluency becomes parallel context attached to the token rather than noise erased before the token.

Stream B does not collapse under Stream A. This is the rule that protects the whole design. Cross-attention flows so that semantics is informed by effort, but the gradient is structured so that the effort stream is not optimized to vanish whenever the semantic stream finds a fluent resolution. Stream B has its own reconstruction objective, defined over the effort vector e(t), so that preserving the topology of the hesitation is rewarded rather than penalized. Section 4.6 makes that objective concrete.

I render the dual-stream design below.

```mermaid
flowchart TB
    AC["Acoustic / articulatory<br/>signal a(t)"] --> SA
    AC --> SB

    subgraph SA["Stream A: Discrete Token Semantics"]
        direction TB
        SA1["Transformer encoder"] --> SA2["Decoder hypothesis h_sem"]
        SA2 --> SA3["Token distribution<br/>(softmax over vocab)"]
    end

    subgraph SB["Stream B: Continuous-Time Effort (Neural ODE)"]
        direction TB
        SB1["ODE state z(t)<br/>dz/dt = f_theta(z, a, t)"] --> SB2["Effort vector e(t)<br/>[d, r, phi, tau, s]"]
        SB2 --> SB3["Effort reconstruction<br/>objective (non-collapsing)"]
    end

    SB2 ===>|"cross-attention:<br/>semantics queries effort"| SA2
    SA2 -.->|"no collapsing gradient<br/>back onto effort"| SB2

    SA3 --> OUT["Output: token + attached<br/>hesitation topology"]
    SB2 --> OUT
```

I call this the Architecture of Grace, and Section 7 defends that name. For now read the diagram structurally. The dotted edge from Stream A back to Stream B is severed by design. That single cut is the whole ethical and mathematical content of the proposal. It says: the discovery of a fluent resolution shall not be permitted to erase the record of the struggle that preceded it. Standard architectures have no protected effort stream at all, which is why, as Section 3 showed, every path led to erasure. Here the path to preservation exists, and it is protected from the optimization pressure that would otherwise close it.

### 4.5 A reference implementation sketch

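I give a minimal sketch in code, not as production software but to make the dual-stream join concrete for an implementer. The ODE backbone uses the `torchdiffeq` solver interface. The sketch calls plain `odeint`, which backpropagates through the solver's internal steps; the library's `odeint_adjoint` is the memory-efficient adjoint variant.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # backprop through solver steps; odeint_adjoint is the adjoint variant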

class EffortField(nn.Module):
    """The learned vector field f_theta governing the
    continuous-time effort latent z(t). Takes the current
    latent and the interpolated acoustic signal a(t)."""
    def __init__(self, z_dim, a_dim, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + a_dim, hidden),
            nn.Tanh(),                      # smooth field, stable integration
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, z_dim),
        )

    def forward(self, t, z, a_of_t):
        # a_of_t is a callable interpolant of the acoustic signal,
        # so the field is defined at ANY continuous t, not just frames.
        a = a_of_t(t)
        return self.net(torch.cat([z, a], dim=-1))


class EffortStream(nn.Module):
    """Stream B. Integrates z(t) over the real time span of the
    utterance and decodes the explicit effort vector e(t)."""
    def __init__(self, z_dim, a_dim, hidden, effort_dim=5):
        super().__init__()
        self.field = EffortField(z_dim, a_dim, hidden)
        # decode [duration, recursion_rate, phoneme_id,
        #         tension, silence_topology]
        self.to_effort = nn.Linear(z_dim, effort_dim)

    def forward(self, z0, a_of_t, t_eval):
        # t_eval is a vector of REAL timestamps (seconds), which may
        # be irregularly spaced. The solver adapts step size to them.
        field = lambda t, z: self.field(t, z, a_of_t)
        z_traj = odeint(field, z0, t_eval, method="dopri5")
        return self.to_effort(z_traj)       # e(t) along the trajectory


class GraceJoin(nn.Module):
    """Cross-attention join with the asymmetry enforced.
    Stream A queries Stream B. The reverse gradient is detached
    so a fluent resolution cannot train the effort stream to
    disappear."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          batch_first=True)
        self.proj_e = nn.Linear(5, d_model)   # effort -> model dim

    def forward(self, h_sem, e_traj):
        # h_sem:  semantic hidden states (queries), (batch, tokens, d_model)
        # e_traj: effort vectors over time (keys/values), (batch, time, 5).
        #         odeint returns time-first output, so transpose it first.
        # detach blocks the collapsing gradient B<-A. Stream B is
        # trained ONLY by its own reconstruction loss, never by
        # the semantic stream's completion pressure. Detaching the
        # INPUT rather than the projected output keeps proj_e trainable.
        kv = self.proj_e(e_traj.detach())
        ctx, _ = self.attn(query=h_sem, key=kv, value=kv)
        return h_sem + ctx                    # effort-informed semantics
```

The `detach()` call is the implementation of the severed edge. It is one line. It is also the entire argument of this manifesto rendered as code. Remove it, and I expect that within a few epochs the model rediscovers that the cheapest way to lower total loss is to push the effort stream toward zero whenever the semantic stream resolves, and the erasure returns through the back door. The detach is what makes the architecture refuse to optimize the struggle away.

### 4.6 The objective that does not punish honesty

Section 2 showed that the standard objective punishes faithful representation of the stutter. The dual-stream design needs an objective that does the opposite for Stream B. I define the total loss as a sum of the standard semantic term and an effort-reconstruction term:

```text
L_total = L_sem + λ · L_effort

L_effort = ∑_t || e_hat(t) - e_target(t) ||²
```

L_sem is the usual cross-entropy on Stream A, unchanged, so fluent transcription quality is preserved. L_effort is a reconstruction loss on the effort vector, where e_target(t) is derived from articulatory and acoustic ground truth: duration, measured repetition rate, energy, and annotated silence type. The key property is that L_effort is minimized by accurately representing the hesitation, not by removing it. A model that flattens a stutter now incurs high L_effort, because the flattened representation fails to reconstruct the true duration, recursion, and tension. For the first time in this discussion, the gradient points toward preservation. The honest option from Section 2.3, which optimization previously punished, is now the option optimization rewards. That inversion, achieved by giving effort its own protected stream and its own protected objective, is the resolution of the architectural failure.
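The inversion can be checked by hand on a toy case. The sketch below is plain Python, not the training code, and every number in it is made up for illustration. For brevity the effort vector is cut down to three components (duration, recursion rate, tension), and I assume both candidate models incur the same semantic loss.

```python
def effort_loss(e_hat, e_target):
    """Sum over time of the squared L2 distance between predicted
    and target effort vectors (the L_effort term)."""
    return sum(
        sum((p - q) ** 2 for p, q in zip(pred, tgt))
        for pred, tgt in zip(e_hat, e_target)
    )

def total_loss(l_sem, e_hat, e_target, lam=1.0):
    """L_total = L_sem + lambda * L_effort."""
    return l_sem + lam * effort_loss(e_hat, e_target)

# Toy target: three time steps of [duration_s, recursion_rate, tension].
# All values are illustrative, not measurements.
e_target  = [[0.1, 2.0, 0.8], [0.2, 2.0, 0.9], [0.3, 2.0, 0.9]]
faithful  = [[0.1, 2.0, 0.8], [0.2, 1.9, 0.9], [0.3, 2.0, 0.8]]
flattened = [[0.0, 0.0, 0.0]] * 3   # the block smoothed away

l_sem = 0.5  # assumed equal: both models emit the same fluent word
print(total_loss(l_sem, faithful, e_target))
print(total_loss(l_sem, flattened, e_target))
```

With the semantic term held equal, the flattened trajectory is the more expensive one, which is the reverse of the ordering under cross-entropy alone.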

## 5. Related Work and Positioning

A claim this strong owes the reader an account of where it sits among existing efforts, and of what it is not.

The first body of work is dysfluency detection and disfluency-aware recognition. There is a real and growing literature on recognizing stuttered speech, much of it built on datasets such as SEP-28k and FluencyBank, which annotate blocks, prolongations, and repetitions for classification. I am indebted to this work and I depart from it in target. Detection treats the stutter as an event to be labeled, often so that it can be removed or counted. My proposal treats the stutter as information to be preserved and passed forward as context. Detecting that a block occurred and representing the topology of that block are different objectives, and the second is the one fluent transcription discards. Useful background on the underlying neuro-motor models of stuttering is maintained by the [National Institute on Deafness and Other Communication Disorders](https://www.nidcd.nih.gov/health/stuttering?utm_source=ranti.dev), and it is worth reading precisely because it describes the event as motor and recursive rather than as a gap.

The second body of work is token-free and continuous audio modeling. Approaches that operate directly on raw or lightly processed audio, and self-supervised representation learners in the wav2vec and HuBERT lineage, loosen the dependence on a fixed discrete vocabulary. They move in a compatible direction. They do not, on their own, solve the problem, because representation learning over fluent corpora still inherits the prior that compresses high-complexity segments, and because they still lack an explicit, protected effort axis with its own non-collapsing objective. My contribution is orthogonal and composable: Stream B could sit alongside a self-supervised Stream A as readily as alongside a supervised one.

The third body of work is continuous-time deep learning itself, the Neural ODE and continuous-time RNN line introduced to model irregularly sampled and continuously evolving systems. I borrow the machinery directly. The novelty here is not the solver. It is the application: using a continuous-time latent specifically as a non-resolving carrier of articulatory effort, joined asymmetrically to a discrete semantic stream so that the effort cannot be optimized away. To my knowledge that asymmetric, grace-preserving join is the new object.

I will also state plainly what this is not. It is not a finished system with benchmark numbers. It is an architectural argument with a falsifiable core and a reference design. The next section specifies how it could be proven wrong, which is the obligation that separates a manifesto from a sermon.

## 6. Falsifiability and the Evaluation Protocol

A paradigm claim must be refutable, and standard speech metrics are structurally incapable of refuting or confirming it. Word Error Rate rewards exactly the behavior I am attacking. A model that cleanly erases a stutter and emits the fluent word achieves a low Word Error Rate, because the reference transcript usually contains only the fluent word. The metric scores the erasure as a success. Optimizing for Word Error Rate is therefore optimizing for the failure I have described. We cannot evaluate the cure with the instrument that certifies the disease.
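The blindness is easy to reproduce. Below is a minimal word-level edit-distance implementation of Word Error Rate, followed by two hypothetical model outputs for the same blocked utterance. The `b- b-` notation for the held onset is my own illustrative convention, not a transcription standard.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev holds the edit distances for the previous reference prefix
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / len(ref)

reference = "the boy ran"          # fluent reference transcript
erasing   = "the boy ran"          # model deleted the block
faithful  = "the b- b- boy ran"    # model kept the block

print(wer(reference, erasing))     # 0.0
print(wer(reference, faithful))    # two insertions over three words
```

The erasing output is scored as flawless, and the faithful output is penalized for two inserted words. The metric cannot tell preservation from error.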

I propose a different metric, which I call Dysfluency Information Retention. The intuition is information-theoretic. Measure how much of the structural and temporal information present in the input dysfluency is recoverable from the model's internal representation. Let D(X) be a structured description of the dysfluency in the input (its duration, recursion rate, held phoneme, tension profile, and silence type), derived from articulatory ground truth. Let R(M, X) be the same description reconstructed from the model's internal state. Then:

```text
DIR(M, X) = MI( D(X) ; R(M, X) ) / H( D(X) )
```

This is the mutual information between the true dysfluency structure and the reconstructed structure, normalized by the entropy of the true structure. A model that flattens the stutter retains none of its topology, so R carries no information about D, the mutual information goes to zero, and Dysfluency Information Retention approaches zero, regardless of how clean the transcript reads. A model that preserves the topology in Stream B reconstructs D well, the mutual information rises toward the entropy of D, and the score approaches one. Critically, a model can score a perfect Word Error Rate on the fluent reference and still score near zero on Dysfluency Information Retention. The two metrics are not redundant. They are nearly orthogonal, and that orthogonality is the entire point.
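As a rough sketch of how the score could be computed, the code below uses plug-in (empirical frequency) estimators over discrete labels. Reducing each utterance's structure to a single categorical label is a simplification I am assuming for illustration. Continuous components such as duration or tension would need binning or a different estimator, and plug-in estimates are biased on small samples.

```python
import math
from collections import Counter

def entropy(xs):
    """Plug-in Shannon entropy (bits) of a list of discrete labels."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """Plug-in MI via I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def dir_score(d_true, r_model):
    """DIR = MI(D; R) / H(D), defined here as 0 when H(D) = 0."""
    h = entropy(d_true)
    return mutual_information(d_true, r_model) / h if h > 0 else 0.0

# Toy corpus: one illustrative structure label per utterance.
d_true     = ["block", "prolongation", "repetition",
              "block", "repetition", "prolongation"]
preserving = list(d_true)              # structure fully recovered
flattening = ["none"] * len(d_true)    # structure erased

print(dir_score(d_true, preserving))
print(dir_score(d_true, flattening))
```

The preserving reconstruction scores at the top of the range and the flattening one at the bottom, independent of what either model wrote in its transcript.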

This yields a concrete, falsifiable prediction. The dual-stream architecture should achieve high Dysfluency Information Retention while holding Word Error Rate on fluent speech roughly constant against a matched Whisper-class baseline, and the baseline should score near zero on Dysfluency Information Retention no matter how much dysfluent data it is trained on. If a standard architecture, given the same data and compute, matches the dual-stream design on Dysfluency Information Retention, my central thesis is wrong and should be discarded. I am specifying the experiment that would sink my own claim. That is the price of making the claim seriously.

The evaluation pipeline is below.

```mermaid
flowchart TD
    X["Dysfluent utterance X<br/>(with articulatory ground truth)"] --> D["D(X): true dysfluency structure<br/>duration, recursion, phoneme, tension, silence"]
    X --> M1["Baseline: Whisper-class model"]
    X --> M2["Proposed: dual-stream<br/>(Stream A + Stream B)"]

    M1 --> R1["R(M1, X): reconstruct<br/>structure from internal state"]
    M2 --> R2["R(M2, X): reconstruct<br/>structure from Stream B"]

    D --> DIR1["DIR = MI(D; R1) / H(D)"]
    R1 --> DIR1
    D --> DIR2["DIR = MI(D; R2) / H(D)"]
    R2 --> DIR2

    M1 --> WER1["WER on fluent reference"]
    M2 --> WER2["WER on fluent reference"]

    DIR1 --> V{"Prediction:<br/>DIR2 >> DIR1<br/>WER2 ≈ WER1"}
    DIR2 --> V
    WER1 --> V
    WER2 --> V

    V ===>|"holds"| P["Thesis supported"]
    V -.->|"fails: DIR1 ≈ DIR2"| F["Thesis falsified,<br/>discard the architecture"]
```
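The decision node in the diagram can be written down as a small check. The function name and both tolerance parameters are hypothetical. The protocol above fixes the direction of the comparison but not the thresholds, which would have to be preregistered before the experiment is run.

```python
def prediction_holds(dir_base, dir_dual, wer_base, wer_dual,
                     min_dir_gap=0.3, max_wer_gap=0.01):
    """True only if the dual-stream model retains clearly more
    dysfluency information (DIR2 >> DIR1) while fluent-speech WER
    stays roughly equal (WER2 ~ WER1). Thresholds are placeholders."""
    dir_gap_ok = (dir_dual - dir_base) >= min_dir_gap
    wer_ok = abs(wer_dual - wer_base) <= max_wer_gap
    return dir_gap_ok and wer_ok

# Placeholder inputs to exercise the function; NOT measurements.
print(prediction_holds(0.05, 0.80, 0.10, 0.105))  # large DIR gap, WER close
print(prediction_holds(0.75, 0.80, 0.10, 0.105))  # baseline matches on DIR
```

The second call is the refutation case: if a baseline reaches comparable retention, the check fails and the thesis goes with it.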

There remains the harder layer: whether a preserved dysfluency representation actually serves the speaker, in downstream interaction, better than a clean transcript does. That is partly an empirical human-subjects question and partly a question about what kind of stateful, non-collapsing systems we are willing to build around it. The agentic systems I have described elsewhere, where state is carried and revised rather than flattened at each step, are the natural consumers of a Stream B signal, and I have written separately about the [stateful loops such systems require](https://www.ranti.dev/blog/what-is-agent-looping). The economics of running continuous-time inference at scale are real, and I touch on the serving side, including self-hosted GPU options, in my [notes on deploying inference workloads](https://www.ranti.dev/blog/vllm-on-eks). Both are downstream of the core architectural claim, and neither weakens it.

## 7. The Theological and Philosophical Synthesis

I have kept the mathematics severe up to this point, and I will now make a turn that some readers in my field find uncomfortable. I make it deliberately, because I believe the discomfort marks exactly the boundary the field needs to cross.

Consider what a human listener does when a person who stutters is mid-block. The listener does not know what word is coming. The listener cannot resolve the sentence. And the appropriate, humane response is to wait, to hold the unresolved state open, to refuse to complete the word, to grant the speaker the time and the dignity of arriving at it themselves. This waiting is not passive. It is an active suspension of the listener's own drive to predict and finish. It costs something. It is work to hold a question open when every conversational reflex pushes toward closing it.

I will name this suspension precisely, using the vocabulary of my second discipline. In several theological traditions, the act of holding a state of non-resolution, of receiving someone without first reducing them to a solved problem, of waiting without demanding completion, is called grace. Grace is, among other things, the refusal to optimize the other person into convenience. It is a stance toward the unresolved that does not require the unresolved to resolve before it grants its regard.

Now look back at the machine. Everything in Sections 2 and 3 was a description of an apparatus that cannot, by construction, hold non-resolution. The cross-entropy objective demands resolution because unresolved mass is loss. The softmax demands resolution because it must normalize to a completed distribution at every step. The induction heads and the completion priors drive toward resolution because that is the basin of low loss. The entire computational edifice is an optimization engine, and optimization is precisely the drive to resolve, to reduce, to minimize, to finish. An optimizer cannot wait. Waiting is anti-optimal by definition, because waiting holds open a high-loss state instead of collapsing it.

This is the deepest layer of the failure, beneath the temporal grid and beneath the objective. Current architectures cannot interact gracefully with the disabled body because they cannot compute grace, and they cannot compute grace because grace is the suspension of optimization and they are nothing but optimization. The stutter is not erased because the machine is cruel. It is erased because the machine cannot do the one thing the situation requires, which is to dwell, without resolving, in the presence of effort that has not yet become a word.

The dual-stream architecture is my attempt to build, in a small and partial way, a system that can dwell. Stream B is a state whose explicit purpose is to be non-resolving. The severed gradient is a refusal to let resolution erase the record of struggle. The effort objective rewards holding the topology of hesitation rather than collapsing it. None of this makes the machine virtuous, and I claim no such thing. But it does mean that the architecture, for the first time, has a place to put the unresolved and a reason not to destroy it. It can carry the block forward as a block. In the narrow technical sense that matters here, it can wait.

I want to be careful not to sentimentalize this. I am not claiming that an ODE solver experiences anything, or that detaching a gradient is a moral act. I am making a structural claim. The structures we build encode what we are willing to hold open and what we insist on closing. An architecture that mandates resolution everywhere has encoded an intolerance for the unresolved, and when that architecture meets a body whose speech is constitutively unresolved for long stretches, the intolerance becomes erasure. Changing the structure so that the unresolved has a protected home is not a feeling. It is engineering with a different value compiled into it.

The disabled body is the right place to test this, because it is the place where the optimizing reflex of our systems is least able to hide its character. Fluent speech lets the optimizer succeed and look benign. Dysfluent speech reveals what the optimizer does when it meets something it cannot reduce: it reduces it anyway, by force, and calls the result a transcript. If we want machines that meet human variation honestly, the requirement is not more parameters or more data. The requirement is architectures capable of dwelling in unresolved latency. That capability has to be built in. It will never be optimized in, because it is the one thing optimization cannot want.

## 8. Limitations

I owe the reader my own strongest objections, stated before a reviewer states them for me.

The first limitation is computational cost. Continuous-time integration with an adaptive solver is more expensive per utterance than a fixed-grid forward pass, and adjoint backpropagation through an ODE adds further cost in training. The expense concentrates exactly on the hard, stiff regions of the signal, which are the dysfluent regions, so the architecture is most costly precisely where it is most needed. I regard this as a real engineering burden rather than a refutation, but I will not pretend it is free. The economics matter, and they are the reason I gestured at serving infrastructure rather than ignoring it.
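The concentration of cost can be seen in a toy integrator. The sketch below is a hand-rolled Heun-Euler adaptive stepper in plain Python, not the `dopri5` solver used in Section 4.5, and the driving signal is invented: slow everywhere except a fast-varying window that stands in for a dysfluent stretch. The toy problem is rapidly varying rather than stiff in the strict numerical sense, so read it as an illustration of step-size adaptation only.

```python
import math

def drive(t):
    """Toy input signal: slow everywhere except a fast-varying
    window [1, 2) standing in for a dysfluent stretch."""
    return math.sin(50.0 * t) if 1.0 <= t < 2.0 else math.sin(t)

def field(t, z):
    """Toy vector field: z relaxes toward the driving signal."""
    return drive(t) - z

def adaptive_heun(f, z0, t0, t1, tol=1e-3, h0=0.05):
    """Heun-Euler embedded pair with a basic step-size controller.
    Returns the final state and the start times of accepted steps."""
    t, z, h = t0, z0, h0
    accepted = []
    while t < t1:
        h = min(h, t1 - t)
        k1 = f(t, z)
        k2 = f(t + h, z + h * k1)
        err = abs(0.5 * h * (k2 - k1))      # |Heun step - Euler step|
        if err <= tol:                      # accept the step
            accepted.append(t)
            z += 0.5 * h * (k1 + k2)
            t += h
        # shrink after a rejection, grow (bounded) after an easy step
        scale = 0.9 * math.sqrt(tol / err) if err > 0 else 2.0
        h *= min(2.0, max(0.2, scale))
    return z, accepted

_, starts = adaptive_heun(field, z0=0.0, t0=0.0, t1=3.0)
slow = sum(1 for s in starts if s < 1.0)
fast = sum(1 for s in starts if 1.0 <= s < 2.0)
print(f"accepted steps in slow window [0, 1): {slow}")
print(f"accepted steps in fast window [1, 2): {fast}")
```

Both windows span one second of signal, yet the solver spends most of its accepted steps inside the fast one. A fixed-grid model pays the same everywhere, and an adaptive one pays where the signal is hard.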

The second limitation is the risk of fetishizing dysfluency as data. There is a moral failure mode in which a system extracts the structure of someone's struggle, encodes it in a vector, and treats it as a feature to be mined, optimized over, or monetized. Preserving the topology of a stutter is only an improvement over erasing it if the preservation serves the speaker rather than the system's appetite for signal. The architecture makes preservation possible. It does not, by itself, make the use of that preserved signal humane. That guarantee has to come from how the system is deployed and governed, and the mathematics cannot supply it. I name this so that the elegance of Stream B is not mistaken for a solved ethics.

The third limitation is empirical modesty. The circuit-level account in Section 3 is a falsifiable hypothesis about mechanism, not a demonstrated theorem, and the evaluation protocol in Section 6 has not yet been run at scale. I have argued that the architecture should behave as described, and I have specified the experiment that would prove me wrong. Until that experiment is run, the honest status of this work is a rigorous conjecture with a clear refutation condition, and I would rather state that plainly than overclaim.

## 9. Conclusion

I set out to defend a single sentence, and I have tried to earn it. The erasure of stuttered speech by autoregressive models is not negligence, not a data shortage, and not a tuning oversight. It is the lawful output of cross-entropy minimization under a Markov-like factorization, executed on a uniform temporal grid, through a softmax that demands completion at every step. The machine annihilates the stutter to preserve its loss function, and it does so by the same mechanism that makes it good at everything else. This is not a data problem. It is not a fairness-patch problem. It is an architecture problem, and it requires an architectural answer.

The answer I propose gives effort its own continuous-time stream, protects that stream from the gradient that would optimize it away, and rewards the model for holding the topology of hesitation rather than collapsing it. I have grounded the design in a claim that I think the field will eventually have to face: that meeting human variation honestly requires systems capable of dwelling in unresolved latency, and that this dwelling is structurally opposed to the optimizing reflex our architectures are made of. We can build the place for the unresolved to live. We have to choose to, because it will never arrive as the byproduct of minimizing loss.

I will end with the question I cannot yet answer and most want to. If we succeed in building machines that can hold non-resolution, that can wait without erasing, that can carry a struggle forward as a struggle, will we have taught them something, or will we only have taught ourselves what our older architectures were quietly refusing to compute all along? I suspect the latter, and I suspect that is the more important lesson. The stammerer was never the problem. The optimizer was. If you build a version of this, or if you run the experiment in Section 6 and it sinks the whole argument, write to me. I would rather be corrected in public than fluent in private.


---

<!-- METADATA_START -->
## Metadata & Citations

### Further Reading
- [Implementing Grace: A PyTorch Case Study in Dual-Stream Dysfluency Models](https://www.ranti.dev/blog/implementing-grace.md)
- [Beyond RAG: Using Multi-Agent Systems for Deep Cultural and Literary Analysis](https://www.ranti.dev/blog/beyond-rag-tagore.md)
- [Algorithmic Dysfluency: Why AI Cannot Hear the Stammering Subject](https://www.ranti.dev/blog/algorithmic-dysfluency.md)

### BibTeX
```bibtex
@article{topography-of-hesitation_2026,
  author = {Rantideb Howlader},
  title = {The Topography of Hesitation: Non-Markovian Ruptures and the Mathematical Failure of Autoregressive Models on Dysfluent Speech},
  journal = {Rantideb Howlader Portfolio},
  year = {2026},
  url = {https://www.ranti.dev/blog/topography-of-hesitation},
  note = {Accessed: 2026-06-24}
}
```

### IEEE
Rantideb Howlader, "The Topography of Hesitation: Non-Markovian Ruptures and the Mathematical Failure of Autoregressive Models on Dysfluent Speech," Rantideb Howlader Portfolio, 2026. [Online]. Available: https://www.ranti.dev/blog/topography-of-hesitation. [Accessed: 2026-06-24].

### APA
Rantideb Howlader. (2026). The Topography of Hesitation: Non-Markovian Ruptures and the Mathematical Failure of Autoregressive Models on Dysfluent Speech. Rantideb Howlader. Retrieved from https://www.ranti.dev/blog/topography-of-hesitation

--- 
*This content is provided in research-grade Markdown format. Required Attribution: Cite as Rantideb Howlader (2026).*
<!-- METADATA_END -->