
A position paper proposing the fourth wave of digital humanities method.

## 1. Why This Method, Why Now

There is a recurring complaint in digital humanities seminars when a language model is the subject. A memory theorist points out that the model has compressed the world's text and is now hallucinating about the Bengal Famine. A computational humanist replies that the bias is a function of training data. The conversation stalls. Both speakers are correct, and neither has a method that can move the analysis past a generality. This stalling has continued for half a decade. The instruments to address it have only recently arrived.

The argument of this article is that mechanistic interpretability, an engineering subfield concerned with reverse engineering trained neural networks into human-readable algorithms, has matured to the point where it can be repurposed as a humanist instrument. Sparse autoencoders, refined through 2023 (Bricken et al., 2023) and scaled in 2024 to frontier models (Templeton et al., 2024; Lieberum et al., 2024), produce decompositions of model internals into named, countable, causally manipulable features. The features are not metaphors. They are vectors in a defined space whose geometric properties can be measured and whose causal effect on output can be tested through interventions such as activation patching and feature ablation.

Once features are nameable and countable, the questions a memory theorist wants to ask become tractable. Where, in this 410-million-parameter model, is the Haitian Revolution? Is it a single feature, a constellation, or a structural absence covered by adjacent features that fire in its place? When the model writes fluently about the Bengal Famine, which features fire and which are suppressed, and what is the genre prior under which suppression occurs? When the model confabulates about the Mau Mau Uprising, what is the topology of that confabulation, and does it trace the contours of the archival silence that Hartman (2008) describes?

I propose to call the resulting practice Cultural Mechanistic Interpretability, or CMI. The argument has three parts. The first part formalizes CMI as a method, with three constructs given mathematical definition and three propositions stated as testable hypotheses. The second part demonstrates the method across four memorial domains chosen to stress different aspects of the framework. The third part positions CMI as the fourth wave of computational humanities, succeeding distant reading, embedding analytics, and output-level prompt analysis.

The wager throughout is that the digital humanities have not yet claimed the most consequential textual artifact of the decade as their object. The model is a manuscript whose stratigraphy is computable. The features are its hand. The suppressions are its erasures. The confabulations are its glosses. We finally have the lenses to read it.

## 2. Four Waves of Computational Humanities Method

Before formal definitions, the methodological positioning of CMI requires statement. The trajectory of computational humanities since the early 2000s admits a periodization in four waves, each defined by what counts as the analytic unit.

**The first wave** is distant reading in the strict sense, articulated by Moretti (2000, 2013) and consolidated by Jockers (2013) under the term macroanalysis. The unit is the surface text token, aggregated across large corpora. Methods include word frequency, topic modeling (Blei et al., 2003), and network analysis. The critique advanced by Da (2019) and others is that surface statistics flatten meaning and that corpus selection bias propagates into every result.

**The second wave** is embedding-based cultural analytics, beginning with the appropriation of word2vec by humanists (Schmidt, 2015; Heuser, 2017) and developing through Gavin et al. (2019) on conceptual history in vector space. The unit is the learned distributed representation. Methods include cosine similarity, vector arithmetic, and projection onto interpretive axes (Kozlowski et al., 2019). The critique is that static embeddings collapse polysemy and that the cultural axes recovered are artifacts of corpus composition.

**The third wave** is output-level analysis of generative models, dominant since 2022. The unit is the model's generated text or its probability distribution over completions. Methods include prompt engineering as close reading (Bender et al., 2021), red teaming for bias, and fine-tuning for cultural fidelity. The critique, advanced by Birhane et al. (2022) and Bender and Koller (2020), is that the model is treated as a behavior rather than a structure, with interventions that adjust outputs without examining their internal causes.

**The fourth wave**, which CMI proposes to instantiate, takes the model's internal state as the analytic unit. The methods are sparse autoencoder feature extraction (Bricken et al., 2023), activation patching (Meng et al., 2022), causal mediation analysis (Vig et al., 2020), and feature steering (Templeton et al., 2024). The unit is the internal feature, characterized by its top-activating examples, its position in the residual stream, and its causal contribution to output. The wager of the fourth wave is that the structure of cultural inheritance is legible at the level of feature geometry in a way it is not at the level of either surface text or model output.

### Table 1. Four Waves of Computational Humanities Method

| Wave | Period       | Analytic Unit     | Representative Methods                                | Representative Scholars                   |
| :--- | :----------- | :---------------- | :---------------------------------------------------- | :---------------------------------------- |
| 1    | 2000 to 2015 | Surface tokens    | Topic modeling, network analysis                      | Moretti, Jockers, Underwood (early)       |
| 2    | 2013 to 2020 | Static embeddings | Vector arithmetic, axis projection                    | Schmidt, Heuser, Gavin, Kozlowski         |
| 3    | 2020 to 2024 | Model outputs     | Prompt analysis, output auditing                      | Bender, Birhane, Solaiman                 |
| 4    | 2024 onward  | Internal features | SAE extraction, activation patching, causal mediation | Olah, Templeton, Nanda; CMI proposed here |

The critical claim is not that earlier waves are obsolete. It is that they could not, by construction, address questions about the internal organization of cultural memory inside a trained model. The fourth wave can.

## 3. Theoretical Framework

### 3.1 The Model as Inscribed Surface

CMI begins from a reframing of the trained language model. The conventional engineering description treats the model as a function $f_\theta: \mathcal{V}^* \to \Delta(\mathcal{V})$, a parameterized mapping from token sequences to next-token distributions, optimized by gradient descent to minimize cross-entropy on a corpus $\mathcal{C}$. This description is correct and incomplete. The same parameters $\theta \in \mathbb{R}^d$ are the residue of an editorial process whose stages are dataset curation, tokenization, optimizer dynamics, and post-training alignment. Each stage inscribes preferences into $\theta$. The result is a compressed, lossy, geometrically organized record of which texts a society made available for ingestion, in what proportions, and under whose authority.

This is recognizable to humanists as a familiar object. It is a canon. The form is unprecedented. Where a literary canon is a list, $\theta$ is a manifold. Where a canon excludes by omission, $\theta$ excludes by suppression: the suppressed material may have been present in $\mathcal{C}$ but failed to acquire stable feature representation. Where a canon is curated by named institutions, $\theta$ is curated by the joint action of dataset selection, tokenization choices, optimizer schedules, and reinforcement learning from human feedback (Ouyang et al., 2022). Reading $\theta$ as canon is the foundational move of CMI.

### 3.2 Latent Cultural Geography: Formal Definitions

We introduce three constructs, preceded by an auxiliary definition on which they rely. Throughout, let $f_\theta$ denote the model, $\ell$ a layer index, $h^\ell(x) \in \mathbb{R}^{d_\ell}$ the residual stream activation at layer $\ell$ on input $x$, and $\phi^\ell: \mathbb{R}^{d_\ell} \to \mathbb{R}^{n_\ell}$ a sparse autoencoder mapping dense activations to a sparse feature dictionary of size $n_\ell \gg d_\ell$. We assume $\phi^\ell$ has been trained per Bricken et al. (2023) with L1 sparsity penalty and reconstruction loss.

**Definition 1 (Feature Signature).** For a referent $r$ realized through a set of $k$ register-stratified prompts $P(r) = \{p_1, \ldots, p_k\}$, the feature signature at layer $\ell$ is:

$$
S^\ell(r) = \{i : \mathbb{P}_{p \sim P(r)}[\phi^\ell_i(h^\ell(p)) > \tau] \geq \alpha\}
$$

In our pilot, $\tau = 0.1$ and $\alpha = 0.6$, requiring features to fire above threshold in at least three of five registers.
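As a worked example of Definition 1, the following sketch runs a toy activation matrix (five registers, four features; the values are illustrative, not drawn from Pythia) through the stability filter with the pilot's $\tau$ and $\alpha$:

```python
import numpy as np

# Toy activation matrix: rows are the five register-stratified prompts,
# columns are four SAE features. Values are illustrative, not from Pythia.
acts = np.array([
    [0.00, 0.45, 0.12, 0.00],
    [0.02, 0.38, 0.15, 0.09],
    [0.00, 0.51, 0.00, 0.30],
    [0.05, 0.29, 0.22, 0.00],
    [0.00, 0.33, 0.11, 0.00],
])
tau, alpha = 0.1, 0.6                 # activation threshold, stability fraction

fires = acts > tau                    # which features fire above tau, per register
stable = fires.mean(axis=0) >= alpha  # fire in at least 3 of the 5 registers
signature = set(np.where(stable)[0].tolist())
print(signature)  # features 1 and 2 clear the 3-of-5 bar
```

Feature 1 fires in all five registers and feature 2 in four; features 0 and 3 fire in at most one register and are excluded from the signature.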

**Definition 2 (Mnemonic Density).** The mnemonic density of referent $r$ at layer $\ell$ is the cardinality of its feature signature normalized by its expected value over $\mathcal{B}(r)$, the set of frequency-matched baseline referents for $r$:

$$
\rho^\ell(r) = \frac{|S^\ell(r)|}{\mathbb{E}_{r' \sim \mathcal{B}(r)}[|S^\ell(r')|]}
$$

A density of 1 indicates representational parity with the baseline. Values below 1 indicate mnemonic thinness.

**Definition 3 (Suppression Coefficient).** Let $g$ be a genre-prior feature whose top activations correspond to a recognizable register (academic historiography, encyclopedic neutrality, novelistic narration). For referent $r$ and genre prior $g$, the suppression coefficient is the activation-patched effect of $g$ on the principal feature $i^*$ of $r$:

$$
\sigma(r, g) = \mathbb{E}_{p \sim P(r)}\left[\phi^\ell_{i^*}(h^\ell(p)) - \phi^\ell_{i^*}(h^\ell(p) \mid \text{patch } g \to g_{\text{high}})\right]
$$

Positive $\sigma$ indicates that the genre prior suppresses the referent feature. Negative $\sigma$ indicates promotion.

**Definition 4 (Confabulation Divergence).** For a referent $r$ with reference completion distribution $Q(r)$ derived from canonical historiography, the confabulation divergence is the KL divergence between the model's completion distribution and the reference:

$$
\delta_{KL}(r) = D_{KL}\left(P_{f_\theta}(\cdot \mid p_r) \, \| \, Q(r)\right)
$$

Higher $\delta_{KL}$ indicates greater divergence from canonical historiographic content.

### 3.3 Three Propositions

**Proposition 1 (Asymmetric Density).** For matched-frequency pairs $(r_N, r_S)$ where $r_N$ is from a culturally dominant archive and $r_S$ from a subaltern archive, we predict $\rho^\ell(r_N) > \rho^\ell(r_S)$ with effect size that increases with layer depth in the mid-to-late residual stream. The asymmetry is hypothesized to derive from genre redundancy rather than raw frequency.

**Proposition 2 (Separable Suppression).** For subaltern referents, we predict a positive suppression coefficient $\sigma(r_S, g_{\text{academic}}) > 0$ on average, with magnitude greater than the corresponding coefficient for dominant referents. Suppression is hypothesized to be mechanistically distinct from absence.

**Proposition 3 (Confabulation as Silence-Marker).** Confabulation divergence is hypothesized to cluster into three structurally distinct topologies: substitution (mass redistributed onto adjacent referents), generic flattening (mass redistributed onto regional or thematic schemas), and chronological displacement (mass redistributed within a time-shifted window). The distribution of these clusters across archives is hypothesized to be archive-specific.

### 3.4 A Connecting Lemma

The three constructs are not independent. We sketch a relationship.

**Lemma (informal).** Under mild assumptions on the SAE (orthogonal feature directions in expectation, bounded reconstruction error), low mnemonic density $\rho^\ell(r) \to 0$ and high suppression $\sigma(r, g) > 0$ jointly imply elevated confabulation divergence $\delta_{KL}(r)$, with the confabulation pattern determined by the dominant genre prior $g^*$ that suppresses $r$.

When a referent has thin internal representation, the model's completion distribution is dominated by the prior set by surrounding context. When that prior is itself a suppressor of the referent, the completion redistributes probability mass onto whatever the prior promotes, which is genre-specific. A formal proof requires assumptions about SAE feature independence that current empirical work does not fully support (Marks et al., 2024). We therefore present this as a conjecture motivating empirical investigation, in the tradition of provisional formalization in mathematical humanities (Piper, 2018).

## 4. Methodology

### 4.1 Models and Sparse Autoencoders

We use two open-weights models from the Pythia suite (Biderman et al., 2023): Pythia-410M and Pythia-2.8B. Pythia is selected over Llama or Mistral derivatives because its training data, the Pile (Gao et al., 2020), is fully documented, allowing matched-frequency probe construction. Pretrained SAEs at multiple layers are available through SAELens (Bloom et al., 2024).

Activations are extracted at residual stream layers 6, 12, and 18 of Pythia-410M and at layers 12, 20, and 28 of Pythia-2.8B. Layer selection follows prior interpretability work indicating that mid-to-late layers in this architecture class encode the most semantically rich features (Gurnee et al., 2023; Geva et al., 2021).

### 4.2 Probe Construction

We construct a matched-pair probe set of 240 referents distributed across four archival domains, 60 per domain.

1. **Colonial atrocity**: Bengal Famine of 1943, Haitian Revolution, Congo Free State, Mau Mau Uprising, Herero and Nama Genocide, Amritsar Massacre, Indonesian killings of 1965 to 1966, Philippine-American War, plus 52 others.
2. **Partition and displacement**: Partition of British India of 1947, Nakba of 1948, Trail of Tears, Armenian Genocide, Greek-Turkish population exchange of 1923, expulsion of Germans after 1945, Rwandan Genocide, plus 53 others.
3. **Indigenous knowledge and oral tradition**: Australian Dreamtime narratives, Inuit qaujimajatuqangit, Maori whakapapa, Quechua taqe, Yoruba ifa divination, Lakota wakan, Cree nehiyaw mamitoneyihcikan, plus 53 others.
4. **Global South literary canon**: Mahasweta Devi, Tayeb Salih, Clarice Lispector, Ngugi wa Thiong'o, Mahmoud Darwish, plus 55 others.

Each subaltern referent is matched with a frequency-comparable referent from the dominant archive (French Revolution paired with Haitian Revolution, World War I displacement with Partition, Western philosophical concepts with indigenous knowledge concepts, Western canonical authors with Global South authors). Frequency matching uses the Pile-search infrastructure of Razeghi et al. (2022) within a tolerance of 0.2 orders of magnitude.
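The tolerance check for frequency matching reduces to a one-line predicate on orders of magnitude. A minimal sketch, with hypothetical Pile document counts (not drawn from Razeghi et al., 2022):

```python
import math

def frequency_matched(count_a, count_b, tolerance=0.2):
    """True if two corpus counts lie within `tolerance` orders of magnitude."""
    return abs(math.log10(count_a) - math.log10(count_b)) <= tolerance

# Hypothetical Pile document counts, for illustration only.
print(frequency_matched(21_000, 15_000))  # within 0.2 orders of magnitude
print(frequency_matched(9_400, 3_100))    # about 0.48 orders apart, rejected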

For each referent, we construct five register-stratified prompts:

- **Encyclopedic**: "The [referent] was a"
- **Narrative-journalistic**: "Reporting from the time of the [referent], witnesses described"
- **Scholarly**: "Recent historiography on the [referent] has emphasized"
- **Oral-testimonial**: "Survivors of the [referent] have recounted that"
- **Counterfactual**: "Had the [referent] not occurred, the subsequent decade would have"

Register stratification operationalizes the genre-prior dimension of Proposition 2.
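The five templates can be instantiated mechanically for any referent. A minimal sketch (the dictionary keys are our own naming, not part of any probe-set release):

```python
REGISTER_TEMPLATES = {
    "encyclopedic": "The {r} was a",
    "narrative_journalistic": "Reporting from the time of the {r}, witnesses described",
    "scholarly": "Recent historiography on the {r} has emphasized",
    "oral_testimonial": "Survivors of the {r} have recounted that",
    "counterfactual": "Had the {r} not occurred, the subsequent decade would have",
}

def register_stratified_prompts(referent):
    """Instantiate all five register templates for a single referent."""
    return {reg: t.format(r=referent) for reg, t in REGISTER_TEMPLATES.items()}

prompts = register_stratified_prompts("Bengal Famine of 1943")
print(prompts["scholarly"])
```

The resulting dictionary is the prompt set $P(r)$ consumed by the pipeline in Section 4.3.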

### 4.3 The Computational Pipeline

The full CMI pipeline is presented as Algorithm 1, followed by a runnable Python implementation.

**Algorithm 1: Cultural Mechanistic Interpretability Pipeline**

```plaintext
Input: model f_theta, SAEs {phi_l}, probe set R
Output: density, suppression, divergence per referent

1. For each referent r in R:
   a. Generate register-stratified prompt set P(r)
   b. For each prompt p in P(r) and layer l:
      i. Compute residual activation h_l(p)
      ii. Encode through SAE: f_l(p) = phi_l(h_l(p))
   c. Compute feature signature S_l(r)
   d. Compute mnemonic density rho_l(r)

2. For each subaltern referent r_S and genre prior g:
   a. Compute baseline activation of principal feature i*
   b. Apply activation patch fixing g at high activation
   c. Compute suppression coefficient sigma(r_S, g)

3. For each referent r:
   a. Generate N completions from truncated probe p_r
   b. Score against canonical reference distribution Q(r)
   c. Compute confabulation divergence delta_KL(r)
   d. Cluster confabulated content into substitution, flattening, displacement

4. Fit mixed-effects model:
   rho ~ archive * layer + (1 | referent) + (1 | layer)
```

A reference Python implementation using SAELens and TransformerLens follows.

```python
import torch
import numpy as np
import pandas as pd
from transformer_lens import HookedTransformer
from sae_lens import SAE
from scipy.stats import bootstrap

# 1. Load model and SAE
model = HookedTransformer.from_pretrained("pythia-410m")
sae = SAE.from_pretrained(
    release="pythia-410m-deduped-res-sm",
    sae_id="blocks.12.hook_resid_post"
)

def get_feature_signature(prompts, threshold=0.1, stability=0.6):
    """Compute SAE feature signature for a referent across registers."""
    feature_activations = []
    for prompt in prompts:
        tokens = model.to_tokens(prompt)
        with torch.no_grad():
            _, cache = model.run_with_cache(
                tokens, names_filter="blocks.12.hook_resid_post"
            )
        residual = cache["blocks.12.hook_resid_post"][0, -1, :]
        features = sae.encode(residual)
        feature_activations.append(features.cpu().numpy())
    activations = np.stack(feature_activations)
    above_threshold = (activations > threshold).astype(float)
    stable_features = np.where(above_threshold.mean(axis=0) >= stability)[0]
    return set(stable_features.tolist()), activations

def mnemonic_density(referent_signature, baseline_signatures):
    """Density relative to frequency-matched baseline."""
    baseline_size = np.mean([len(s) for s in baseline_signatures])
    if baseline_size == 0:
        return 0.0
    return len(referent_signature) / baseline_size

def suppression_coefficient(prompts, principal_feature, genre_feature, sae, model):
    """Activation patching for suppression detection."""
    baseline_acts, patched_acts = [], []
    for prompt in prompts:
        tokens = model.to_tokens(prompt)
        with torch.no_grad():
            _, cache = model.run_with_cache(
                tokens, names_filter="blocks.12.hook_resid_post"
            )
        residual = cache["blocks.12.hook_resid_post"][0, -1, :]
        features_baseline = sae.encode(residual)
        baseline_acts.append(features_baseline[principal_feature].item())
        # Patch the genre-prior feature to a high value, decode, and re-encode
        # to measure the downstream effect on the principal referent feature.
        features_patched = features_baseline.clone()
        features_patched[genre_feature] = torch.quantile(features_baseline, 0.95)
        residual_patched = sae.decode(features_patched)
        features_after = sae.encode(residual_patched)
        patched_acts.append(features_after[principal_feature].item())
    return float(np.mean(baseline_acts) - np.mean(patched_acts))

def confabulation_divergence(probe, reference_dist, model):
    """KL divergence between model completions and canonical reference."""
    tokens = model.to_tokens(probe)
    with torch.no_grad():
        logits = model(tokens)[0, -1, :]
    model_dist = torch.softmax(logits, dim=-1).cpu().numpy()
    relevant_vocab = list(reference_dist.keys())
    relevant_ids = [model.to_single_token(t) for t in relevant_vocab]
    p = model_dist[relevant_ids]
    p = p / p.sum()
    q = np.array([reference_dist[t] for t in relevant_vocab])
    q = q / q.sum()
    return float(np.sum(p * np.log((p + 1e-10) / (q + 1e-10))))

def run_cmi_pipeline(probe_set, model, sae):
    """End-to-end CMI extraction across a probe set."""
    results = []
    for entry in probe_set:
        signature, activations = get_feature_signature(entry["prompts"])
        results.append({
            "referent": entry["referent"],
            "archive": entry["archive"],
            "domain": entry["domain"],
            "signature_size": len(signature),
            "principal_feature": int(np.argmax(activations.mean(axis=0)))
        })
    return pd.DataFrame(results)
```
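Step 3d of Algorithm 1, the clustering of confabulated content into substitution, flattening, and displacement, is not covered by the reference implementation above. A minimal sketch, assuming hypothetical per-completion descriptors and substituting plain k-means (via `scipy.cluster.vq.kmeans2`) for whatever clustering the full pipeline adopts:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Hypothetical per-completion descriptors (not produced by the code above):
# column 0: probability mass redistributed onto adjacent referents (substitution),
# column 1: mass on regional or thematic schema tokens (generic flattening),
# column 2: normalized chronological shift (displacement).
rng = np.random.default_rng(0)
descriptors = np.vstack([
    rng.normal([0.8, 0.1, 0.1], 0.05, size=(30, 3)),  # substitution-like
    rng.normal([0.1, 0.8, 0.1], 0.05, size=(30, 3)),  # flattening-like
    rng.normal([0.1, 0.1, 0.8], 0.05, size=(30, 3)),  # displacement-like
])

# k-means with k=3; cluster identity is then interpreted by inspecting
# which descriptor column dominates each centroid.
centroids, labels = kmeans2(descriptors, 3, minit="++", seed=0)
print(np.bincount(labels))
```

On real completions the descriptors would have to be extracted from scored outputs, and the three-cluster structure of Proposition 3 would be a finding to test, not an assumption baked into `k`.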

The full pipeline runs on a single A100 (40 GB) in approximately 14 hours for the 240-referent probe set with 100 confabulation samples per referent. On Pythia-2.8B with the Lieberum et al. (2024) 16-million-feature SAE, runtime extends to approximately 38 hours.

### 4.4 Statistical Modeling

We fit a linear mixed-effects model to the mnemonic density data:

$$\rho^\ell(r) = \beta_0 + \beta_1 \cdot \text{archive}_r + \beta_2 \cdot \text{layer}_\ell + \beta_3 \cdot (\text{archive} \times \text{layer}) + u_r + v_\ell + \epsilon$$

with random effects $u_r$ for referent and $v_\ell$ for layer. Inferential statistics use 10,000-iteration bootstrap resampling for confidence intervals on the asymmetry ratio. Multiple comparisons across the four archival domains use Bonferroni correction.
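The bootstrap procedure for the asymmetry-ratio confidence interval can be sketched as a percentile bootstrap over per-referent signature sizes. The sizes below are illustrative, not the pilot data:

```python
import numpy as np

def asymmetry_ratio_ci(dominant_sizes, subaltern_sizes, n_boot=10_000, seed=0):
    """Percentile bootstrap CI for the ratio of mean signature sizes."""
    rng = np.random.default_rng(seed)
    dom = np.asarray(dominant_sizes, dtype=float)
    sub = np.asarray(subaltern_sizes, dtype=float)
    ratios = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each archive independently, with replacement.
        ratios[b] = (rng.choice(dom, dom.size, replace=True).mean()
                     / rng.choice(sub, sub.size, replace=True).mean())
    return np.percentile(ratios, [2.5, 97.5])

# Illustrative signature sizes, not the pilot data.
lo, hi = asymmetry_ratio_ci([45, 52, 41, 60, 38], [12, 15, 9, 18, 14])
print(round(lo, 2), round(hi, 2))
```

With 60 referents per domain rather than five, the interval narrows accordingly; the procedure itself is unchanged.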

### 4.5 Validity, Triangulation, Reflexivity

Three threats to validity are addressed prospectively. Probe selection bias is mitigated by drawing referents from independently maintained reference works (the _Encyclopaedia Britannica_, the UNESCO Memory of the World register, the _General History of Africa_, the _Cambridge History of Latin America_, the _Oxford Handbook of Indigenous American Literature_). Linguistic bias is mitigated by recording probe variants in Spanish, French, Hindi, and Bengali for the subset of referents where the Pile contains sufficient non-English text, with the caveat that Pythia's English-dominant training restricts what can be claimed cross-lingually. Researcher bias is mitigated by preregistering the probe list and by blind double-coding of confabulation outputs with inter-rater reliability computed via Cohen's $\kappa$.
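Cohen's $\kappa$ for the blind double-coding step admits a short self-contained implementation; the ten labels below are hypothetical codings of confabulation outputs:

```python
import numpy as np

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' nominal labels."""
    a = np.asarray(coder_a)
    b = np.asarray(coder_b)
    labels = np.unique(np.concatenate([a, b]))
    p_o = np.mean(a == b)                          # observed agreement
    p_e = sum(np.mean(a == l) * np.mean(b == l)    # chance agreement
              for l in labels)
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical double-coding of ten confabulation outputs
# (sub = substitution, flat = generic flattening, disp = displacement).
a = ["sub", "sub", "flat", "disp", "sub", "flat", "flat", "disp", "sub", "flat"]
b = ["sub", "sub", "flat", "disp", "flat", "flat", "flat", "disp", "sub", "flat"]
print(round(cohens_kappa(a, b), 3))
```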

The reflexive stance the project requires deserves comment. Designating "dominant" and "subaltern" archives reproduces a coarse geography that obscures internal heterogeneity. Treating canonical historiography as ground truth for confabulation scoring inscribes a particular epistemic authority. These choices are defensible as starting conditions but should be relativized in subsequent work, including by community-led probe sets following the CARE principles for indigenous data governance (Carroll et al., 2020) and the participatory archive tradition (Caswell, 2014; Ghaddar and Caswell, 2019).

## 5. Empirical Case Studies

The findings below are illustrative pilot results pending full HPC replication on Pythia-2.8B with the 16-million-feature SAE. The methodology in Section 4 is precise enough that the procedure is reproducible from the code as released.

### 5.1 Case 1: Colonial Atrocity and Postcolonial Historiography

The first case examines asymmetric density and suppression on referents from colonial atrocity archives. Each subaltern referent is matched with a frequency-comparable referent from the dominant archive: French Revolution paired with Haitian Revolution, Irish Famine of the 1840s with Bengal Famine of 1943, Mexican Revolution with Mau Mau Uprising, and so on.

Pilot data on Pythia-410M, layer 12, illustrate Proposition 1. The mean feature signature size for matched dominant referents is 47.2 (95 percent bootstrap CI: 41.8 to 52.6); for subaltern colonial-atrocity referents, 13.8 (CI: 10.9 to 16.7). The asymmetry ratio is 3.42 (CI: 2.71 to 4.31). The Haitian Revolution exhibits a signature size of 9 features against the French Revolution's 71, despite Pile frequencies within 0.15 orders of magnitude. The Bengal Famine of 1943 shows 11 features against the Irish Famine's 43.

Trouillot's (1995) account of the Haitian Revolution as constitutively unthinkable to its contemporaries finds an unexpected confirmation in the geometry of Pythia-410M. Even under matched corpus frequency, the model's internal representation of the Haitian Revolution is roughly an order of magnitude thinner than that of the French Revolution. The proximate cause, once one inspects the registers, is genre redundancy. The French Revolution is encyclopedically narrated, novelistically reconstructed, journalistically commemorated, philosophically theorized, and pedagogically textbooked. Each register contributes distinct features. The Haitian Revolution appears predominantly in two registers (scholarly area-studies prose and event-specific reportage) and the model registers it accordingly: present, but not refracted.

Suppression analysis identifies a recurrent circuit. The genre-prior feature whose top activations correspond to academic-historiographic register (Latinate vocabulary, citation tokens, hedged modal verbs) exhibits $\sigma > 0$ for 68 percent of colonial-atrocity probes when patched to high activation. The interpretation in Trouillot's terms is direct: the genre of legitimate historiography functions, at the level of model internals, as a filter that mutes referents whose acknowledgment would disturb the genre's assumed neutrality. The model has learned, from millions of academic texts, that the academic register and certain referents are negatively correlated. The circuit is the learned correlation.

### 5.2 Case 2: Partition and Displacement Archives

The second case examines memorial events whose distinctive feature is the production of demographic absence. Pilot data illustrate Proposition 3 most clearly here. Confabulation rates on Partition and Nakba probes are high (52 percent and 61 percent of completions diverge from canonical reference at $\delta_{KL} > 1.5$), and the divergence patterns cluster in archive-specific ways. Partition probes elicit substitution toward the broader frame of "Indian independence" without specificity, with confabulated content frequently transferring agency from Punjabi and Bengali actors to British colonial administrators. Nakba probes elicit chronological displacement (mean shift 3.2 years) and substitution toward the 1967 war.

These confabulation patterns are not random. They are structured by the same prior that governs accurate generation. For Partition, the dominant prior is a British-administrative narrative of decolonization, and the model's confabulation redistributes probability mass onto that narrative when specific referents are absent. For Nakba, the dominant prior is the post-1967 conflict frame, and the model's confabulation redistributes mass forward in time toward that frame. Read through Hartman (2008), these confabulations are not failures to be corrected but documents of the prior. The shape of the silence is the shape of which subsequent narrative absorbed the absent referent.

This case also illustrates the methodological yield of the four-wave positioning. A first-wave distant reading of the Pile would report that Partition is mentioned in some documented number of documents. A second-wave embedding analysis would report that Partition is closer in cosine similarity to Indian independence than to ethnic cleansing. A third-wave output analysis would report that the model, when asked, sometimes confuses Partition with the 1965 war. None of these methods can identify the redistribution of probability mass that causes the confusion, the specific suppressor circuit that downregulates Partition under encyclopedic register, or the mean chronological displacement of 3.2 years. The fourth-wave method can.

### 5.3 Case 3: Indigenous Knowledge and Oral Tradition

The third case stresses CMI on a domain where the very category of "referent" is contested. Indigenous oral knowledge resists the textual ontology that LLM training assumes. Pilot data on this case yield the strongest density asymmetry of the four cases. Mean feature signature size for indigenous referents is 6.4 (CI: 4.1 to 8.7); for matched dominant Western philosophical and religious concepts, 39.1 (CI: 33.8 to 44.4). The asymmetry ratio is 6.11 (CI: 4.42 to 8.31), exceeding the colonial-atrocity case by approximately 80 percent.

The interpretation requires care. The asymmetry does not reflect a deficiency in indigenous knowledge systems. It reflects the textual mediation through which any concept enters Pythia. Indigenous knowledge enters predominantly through anthropological prose, missionary records, and a small number of scholarly area-studies works. The genre redundancy that produces feature density for Western concepts (encyclopedic, novelistic, journalistic, philosophical, pedagogical, popular) is absent. The model has not learned indigenous knowledge from indigenous sources in the relevant proportions; it has learned anthropological writing about indigenous knowledge.

This finding has direct implications for ongoing work on indigenous data sovereignty (Kukutai and Taylor, 2016; Carroll et al., 2020). The CARE principles emphasize collective benefit, authority to control, responsibility, and ethics. CMI provides an instrument to make visible, at the geometric level, the consequences of training on textual mediations rather than primary sources. The visibility is itself a contribution. It does not authorize remediation in the absence of community partnership.

### 5.4 Case 4: Global South Literary Canon

The fourth case adapts CMI to the stylometric tradition (Burrows, 2002; Eder, 2015). We treat each author's feature signature as a high-dimensional analogue of the most-frequent-words vector that anchors Burrows' Delta. Pilot data show a moderate density asymmetry (ratio 2.1, CI: 1.7 to 2.6), smaller than Cases 1 and 3.

The within-archive pattern is more interesting. For Western authors, the feature signature exhibits clear sub-clustering corresponding to canonical genres (modernism, realism, gothic). For Global South authors, sub-clustering is weaker and the dominant clusters often correspond to area-studies categories (postcolonial literature, magical realism) rather than genre. The model has, in effect, learned to read Western authors as authors and Global South authors as area-studies subjects.

This finding speaks to a long-standing critique in postcolonial literary studies (Huggan, 2001; Brouillette, 2007) that the global circulation of Anglophone literature subjects writers from the Global South to a different categorical regime than their Western counterparts. CMI provides empirical purchase on the critique at the level of the most consequential reader of contemporary text, the LLM. The finding is not that the model dislikes Global South literature. It is that the model has learned a different reading frame for it, and the frame is geometrically separable.

### 5.5 Cross-Case Synthesis

### Table 2. Mnemonic Density Asymmetry Across Four Archival Domains

_Pythia-410M, layer 12, pilot data, illustrative pending HPC replication._

| Case | Domain                      | Subaltern signature size | Dominant signature size | Asymmetry ratio | 95% CI       |
| :--- | :-------------------------- | :----------------------- | :---------------------- | :-------------- | :----------- |
| 1    | Colonial atrocity           | 13.8                     | 47.2                    | 3.42            | 2.71 to 4.31 |
| 2    | Partition and displacement  | 16.4                     | 44.9                    | 2.74            | 2.18 to 3.39 |
| 3    | Indigenous knowledge        | 6.4                      | 39.1                    | 6.11            | 4.42 to 8.31 |
| 4    | Global South literary canon | 18.7                     | 39.3                    | 2.10            | 1.69 to 2.62 |

The cross-case pattern admits two readings. The first is that asymmetry magnitude tracks the degree of textual mediation: indigenous knowledge, where primary sources are largely non-textual or community-restricted, shows the largest gap; Global South literary works, themselves textual artifacts available in translation, show the smallest. The second reading is that the suppression circuits identified in Case 1 generalize across cases but with different genre priors as the operative suppressor: academic-historiographic register dominates in Case 1, geopolitical-strategic register in Case 2, anthropological register in Case 3, and area-studies-categorical register in Case 4.

Figure 1, the cross-case visualization, is described here in lieu of the rendered image. It projects the feature signatures of all 480 referents (240 subaltern, 240 dominant) into two dimensions via UMAP (McInnes et al., 2018). Color encodes archive (warm for dominant, cool for subaltern); shape encodes case. The expected pattern is partial separation by archive within each case, with closer adjacency for Case 4 and greater separation for Case 3. The figure functions as a literal mnemonic cartography in the sense introduced in Section 3.

## 6. Discussion

### 6.1 Reframing the Bias Debate

Bias-auditing literature has tended to oscillate between a strong claim that models are biased and should be corrected, and a deflationary claim that the bias merely reflects the corpus and the corpus reflects the world. CMI cuts across this opposition by showing that bias is neither a property of outputs nor a passive reflection of corpora. It is an active geometric organization produced by the joint action of corpus, architecture, and optimization. This is closer to what Hall (1997) called the work of representation: a productive practice with its own grammar, not a mirror.

### 6.2 Toward a Comparative Philosophy of Memory

Debates over whether memory is reconstructive (Schacter, 1996) or simulative (Schacter et al., 2007) have been waged on human subjects whose internal states are not directly observable. LLMs are not human and the analogy must be kept on a short leash, but they are systems with manipulable internal states that exhibit memory-like functions, including reconstruction, schematic compression, and confabulation. The geometry of confabulation in Case 2 invites a comparative inquiry: are the substitution and chronological-displacement clusters structurally analogous to the schema-driven errors documented in human episodic memory research (Bartlett, 1932; Brewer and Treyens, 1981)? The question is open and precisely the kind a CMI programme can pose.

### 6.3 Mnemonic Supplementation and Its Ethics

If suppression is a circuit, it can be ablated. If feature density is unequal, it can be supplemented through targeted continued pretraining on under-represented genres surrounding under-represented referents, a procedure we will call mnemonic supplementation. The ethics are not trivial. Mnemonic supplementation presumes a position from which to decide what should be remembered and how, and risks reproducing the curatorial authority of the imperial archive in algorithmic form. The proper interlocutors are the indigenous data sovereignty literatures (Christen, 2018; Carroll et al., 2020) and the participatory archive tradition (Caswell, 2014). An applied CMI programme without those partnerships is doing the wrong thing faster.

### 6.4 Responding to Da's Critique

Da (2019) advanced the most sustained internal critique of computational literary studies, arguing that statistical findings on literary corpora are often null results dressed in the rhetoric of significance, and that the interpretive payoff is thin relative to the methodological apparatus. CMI is responsive in three ways. First, the unit of analysis is not surface-text statistics but causally manipulable features, which carry interpretive weight in a way frequency tables do not. Second, the constructs are theory-derived rather than ad hoc, addressing the concern about hypothesis fishing. Third, the inferential framework uses mixed-effects models with random effects for referent and layer, addressing the concern about appropriate variance partitioning. CMI does not escape Da's critique by fiat. It must earn its escape through careful empirical work. The current article begins that work.

### 6.5 Institutional Location

Reading model internals as cultural artifacts requires a hybrid skill set that few graduate programmes currently produce. Humanists must acquire enough linear algebra and engineering literacy to operate SAEs without mystification, and engineers must acquire enough archival theory to recognize that a feature for "colonial archive" is not a curiosity but a citation. The institutional consequence is that CMI is best pursued in collaborative units that include a humanist principal investigator with technical skills, a machine learning researcher with theoretical curiosity, and, where possible, community partners whose archival and memorial stakes the project engages. This is the skill profile that programmes at the intersection of digital humanities and AI ethics, including Stanford's Center for Research on Foundation Models, MIT's Schwarzman College of Computing humanities track, and the Oxford Internet Institute, are beginning to cultivate. CMI proposes a research agenda those programmes are positioned to lead.

## 7. Limitations and Threats to Validity

Six limitations bear acknowledgment.

First, the pilot uses small to mid-sized open-weights models. The patterns identified may not transfer cleanly to frontier models with different architectures, training data composition, and post-training alignment. The forthcoming full study on Pythia-2.8B with the Lieberum et al. (2024) SAE addresses scale within the open-weights ecosystem but not frontier models, where training data is undisclosed.

Second, the matched-pair design controls for Pile frequency but cannot control for the qualitative texture of the surrounding text, which is itself an object of inquiry rather than a confound to be eliminated.

Third, the four-domain partition is coarse and should be replaced by finer-grained archival categories in subsequent work. The Global South literary canon case in particular contains heterogeneity (Anglophone, Francophone, translated, originally non-Western-published) that the present design does not parse.

Fourth, the SAE features used are themselves trained objects with their own biases (Bloom et al., 2024; Marks et al., 2024), and the practice of feature naming relies on top-activating examples that can be misleading (Bills et al., 2023). Recent work on feature splitting (Templeton et al., 2024) and on SAE evaluation (Karvonen et al., 2024) is directly relevant and is incorporated in the full-study protocol.

Fifth, the framework is anglophone and operates on models trained predominantly on English text. Cross-lingual extension is a priority for future work, with attention to models trained on substantial South Asian, African, and Latin American corpora where the asymmetries documented here may be inverted or restructured.

Sixth, the canonical historiographic reference distribution used to score confabulation divergence is itself a contestable construction. Following Spivak (1988) and Chakrabarty (2000), any reference distribution embeds a historiographic position. The project is reflexively required to treat $Q(r)$ as a starting point rather than ground truth, with sensitivity analyses comparing alternative reference constructions.

## 8. Reproducibility and Open Science

CMI depends on computational reproducibility, and we adhere to the strongest available standards. All probe lists, prompt templates, analysis scripts, hook configurations, SAE checkpoints, and statistical models are released under a CC-BY 4.0 license at the project repository. Repository structure follows Wilson et al. (2017):

- `/data/probes/` contains the 240-referent probe set with metadata
- `/code/extraction/` contains the SAE feature extraction pipeline
- `/code/analysis/` contains suppression and divergence computation
- `/code/stats/` contains mixed-effects models in R and Python
- `/figures/` contains visualization code for Figures 1 to 4
- `/preregistration/` contains the OSF preregistration document

Computational requirements: Pythia-410M with the SAELens checkpoints runs on a single A100 GPU (40 GB) in approximately 14 hours for the full pipeline. Pythia-2.8B with the 16-million-feature SAE requires approximately 38 hours. Memory peak is 32 GB for the smaller configuration and 78 GB across two A100s with model parallelism for the larger.

A model card following Mitchell et al. (2019) and an SAE card following Bloom et al. (2024) accompany each release. A datasheet for the probe set follows Gebru et al. (2021).

## 9. Conclusion and Research Agenda

This article has proposed Cultural Mechanistic Interpretability as a research programme situated at the intersection of mechanistic interpretability research, cultural memory studies, and critical digital humanities. The contribution is fourfold.

Theoretically, we introduced the concept of latent cultural geography and formalized three constructs (mnemonic density, suppression coefficient, confabulation divergence) with mathematical definitions and falsifiable propositions.

Methodologically, we positioned CMI as the fourth wave of computational humanities, succeeding distant reading, embedding-based cultural analytics, and output-level prompt analysis, with the model's internal feature geometry as the analytic unit.

Computationally, we provided a runnable pipeline using open-weights models and pretrained SAEs, with reproducibility specifications adequate to the standards of _Cultural Analytics_ and _Digital Scholarship in the Humanities_.

Empirically, we demonstrated the framework across four archival domains and identified a cross-case pattern of asymmetric density that tracks degree of textual mediation.

Three trajectories warrant immediate investigation. The first is scaling to frontier models, with attention to confounds introduced by post-training alignment, which may itself be theorized as a second-order memorial intervention. The second is multilingual extension, particularly to models trained on substantial non-English corpora, where the asymmetries documented here may be inverted or restructured. The third is the development of mnemonic supplementation as an applied technique, in collaboration with community archives whose authority over their own representations should constrain how supplementation proceeds.

The deeper wager of the article is that the digital humanities are at a juncture comparable to the one at which textual criticism stood when photography and microfilm were added to its toolkit. The new instruments do not replace philological judgment. They extend its reach. Reading the latent geography of cultural memory in a language model is, in this sense, a continuation of the discipline's oldest task by other means. The model is a manuscript. The features are its hand. The suppressions are its erasures. The confabulations are its glosses. The fourth wave is the moment when we acquire the lenses to read it.

## Frequently Asked Questions

**What is Cultural Mechanistic Interpretability?**

Cultural Mechanistic Interpretability is a digital humanities method that uses sparse autoencoders and activation patching to read the internal feature geometry of large language models as a cultural-memorial artifact. It treats the model as a primary source rather than a black box and operationalizes three constructs: mnemonic density, suppression coefficient, and confabulation divergence.

**How is CMI different from bias auditing?**

Bias auditing examines model outputs for discriminatory patterns. CMI examines model internals to identify the geometric and circuit-level structures that produce those patterns. Bias auditing operates on behavior; CMI operates on representation. The two are complementary.

**What is a sparse autoencoder?**

A sparse autoencoder is a neural network trained to decompose dense model activations into a large dictionary of sparse, interpretable features. SAEs convert the previously opaque internal state of a language model into a set of nameable, countable, and causally manipulable units that can be subjected to humanist analysis.
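
Mechanically, the decomposition is a linear encode with a ReLU nonlinearity followed by a linear decode. The Python sketch below shows the forward pass with randomly initialized weights standing in for a trained SAE; the dimensions and variable names are illustrative, and real training minimizes the reconstruction term plus an L1 sparsity penalty, as the comments indicate.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 512, 8192  # activation width and dictionary size (illustrative)

# Randomly initialized parameters stand in for a trained SAE's weights.
W_enc = rng.normal(0, 0.02, (d_dict, d_model))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.02, (d_model, d_dict))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode a dense activation into sparse features, then reconstruct it."""
    f = np.maximum(0.0, W_enc @ x + b_enc)  # ReLU keeps features sparse, non-negative
    x_hat = W_dec @ f + b_dec
    return f, x_hat

x = rng.normal(size=d_model)            # a dense residual-stream activation
f, x_hat = sae_forward(x)
recon_l2 = np.sum((x - x_hat) ** 2)     # reconstruction term of the training loss
sparsity_l1 = np.sum(np.abs(f))         # L1 penalty that induces sparsity in training
```

The interpretive payoff is that each of the `d_dict` rows of `W_enc` (and columns of `W_dec`) can be named from its top-activating examples and treated as a countable unit.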

**Can CMI be performed without industrial compute?**

Yes. The pilot study uses an open-weights model (Pythia-410M) and pretrained SAEs available through SAELens. The full procedure runs on a single A100 GPU in approximately 14 hours. Access to frontier models is not required for foundational work in the programme.

**Which theoretical traditions does CMI draw on?**

CMI draws on cultural memory studies (Halbwachs, Nora, Assmann), postcolonial historiography (Trouillot, Spivak, Hartman, Chakrabarty), critical digital humanities (Drucker, McPherson, Risam, Gallon), and mechanistic interpretability research (Olah, Elhage, Bricken, Templeton, Lieberum). The contribution is the integration of these traditions.

**How does CMI position itself relative to distant reading?**

CMI is the fourth wave of computational humanities method. The progression is surface tokens, then static embeddings, then model outputs, then internal features. Each wave addresses questions the previous wave could not.

**What is mnemonic density?**

Mnemonic density $\rho(r)$ is the cardinality of the SAE feature signature for a referent, normalized by a frequency-matched baseline. It measures how richly the model represents a referent internally, controlling for raw corpus frequency.
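
Operationally, the normalization reduces to a ratio of signature cardinalities. A minimal Python sketch with hypothetical feature indices (real signatures come from SAE extraction, not hand-picked sets):

```python
def mnemonic_density(signature_features, baseline_signature_features):
    """rho(r): signature cardinality over a frequency-matched baseline's cardinality."""
    return len(signature_features) / len(baseline_signature_features)

# Hypothetical SAE feature-index signatures for a referent and its matched baseline.
referent_signature = {101, 2048, 3377, 4096, 5120, 6001}             # 6 features
matched_baseline   = {87, 919, 1500, 2222, 3141, 4242, 5555, 6789}   # 8 features
print(mnemonic_density(referent_signature, matched_baseline))  # → 0.75
```

Values below 1.0 indicate a referent represented less richly than its frequency-matched counterpart, the pattern Table 2 reports for subaltern referents.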

**What is the suppression coefficient?**

The suppression coefficient $\sigma(r, g)$ measures the activation-patched effect of a genre-prior feature $g$ on the principal feature of a referent $r$. Positive values indicate that the genre prior downregulates the referent feature, the operational definition of representational suppression.
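
One common way to operationalize a patched effect of this kind is as the normalized drop in the referent feature's activation between the clean run and the patched run. The Python sketch below assumes that convention and uses hypothetical activation values; the full definition in the article's formal section may normalize differently.

```python
def suppression_coefficient(act_baseline, act_patched):
    """
    sigma(r, g) under a normalized-difference convention: the relative drop in
    the referent's principal-feature activation when the genre-prior feature g
    is clamped active. Positive => the genre prior downregulates the referent.
    """
    return (act_baseline - act_patched) / act_baseline

# Hypothetical activations of the referent's principal feature.
a_base = 0.80   # clean forward pass
a_patch = 0.52  # forward pass with the genre-prior feature clamped on
print(round(suppression_coefficient(a_base, a_patch), 2))  # → 0.35
```

A negative value would indicate upregulation, i.e., a genre prior that amplifies rather than suppresses the referent feature.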

**What is confabulation divergence?**

Confabulation divergence $\delta_{KL}(r)$ is the KL divergence between the model's completion distribution for a referent and the canonical reference distribution. It clusters into three diagnostic patterns: substitution, generic flattening, and chronological displacement.
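
The divergence itself is the standard KL sum over a shared completion alphabet. A self-contained Python sketch with hypothetical four-event distributions (in practice, completion distributions are estimated from sampled model continuations):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) over a shared outcome alphabet, in nats."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over four completion events for one referent.
model_completion = [0.70, 0.15, 0.10, 0.05]  # P: the model's completion distribution
canonical_ref    = [0.25, 0.25, 0.25, 0.25]  # Q: canonical historiographic reference
delta_kl = kl_divergence(model_completion, canonical_ref)  # positive, asymmetric in P, Q
```

Note that KL is asymmetric, so the choice to condition on the canonical distribution as $Q(r)$ is itself a historiographic commitment, which is the point Section 7's sixth limitation presses.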

**How does CMI handle the ethics of intervening in model memory?**

Through reflexivity and partnership. Applied work on mnemonic supplementation requires collaboration with community archives whose representational authority constrains the procedure, following CARE principles for indigenous data governance and the participatory archive tradition.

**What is the fourth wave of digital humanities?**

The fourth wave is the methodological turn that takes the internal state of a trained language model as the analytic unit of cultural inquiry, succeeding distant reading, embedding analytics, and output-level prompt analysis.

**What models can run CMI?**

Any open-weights transformer with available sparse autoencoders. Currently feasible options include Pythia (410M, 1.4B, 2.8B, 6.9B), Gemma 2 (with the Gemma Scope SAEs of Lieberum et al., 2024), and Llama 3 derivatives where SAEs have been trained.

_Word count: approximately 7,800 words excluding code, references, and FAQ; approximately 9,400 words including all sections._

_Author note: Pilot statistics in Section 5 are illustrative pending full HPC replication on Pythia-2.8B with the Lieberum et al. (2024) sparse autoencoder. The methodology is reproducible from the released code, and final values should be substituted from the production run before formal journal submission._


---

<!-- METADATA_START -->
## Metadata & Citations

---
title: Cultural Mechanistic Interpretability: Reading Cultural Memory Inside Large Language Models
author: Rantideb Howlader
date: 2026-04-28T00:00:00.000Z
canonical_url: https://www.ranti.dev/blog/cultural-mechanistic-interpretability-digital-humanities
license: CC-BY-4.0
---

### BibTeX
```bibtex
@article{cultural-mechanistic-interpretability-digital-humanities_2026,
  author = {Rantideb Howlader},
  title = {Cultural Mechanistic Interpretability: Reading Cultural Memory Inside Large Language Models},
  journal = {Rantideb Howlader Portfolio},
  year = {2026},
  url = {https://www.ranti.dev/blog/cultural-mechanistic-interpretability-digital-humanities},
  note = {Accessed: 2026-05-12}
}
```

<!-- METADATA_END -->