What This Template Gives You
Starting a digital humanities research project from scratch is slow. You spend weeks setting up environments, figuring out directory structures, debugging dependency conflicts, and writing boilerplate code before you even begin your actual research. This template eliminates that startup cost.
The DH Notebook for Cultural Analysis is a ready-to-use project scaffold that gives you:
- A tested Python environment with all common DH libraries pre-configured
- Four sequential notebooks that walk you through a complete cultural analysis pipeline
- A data documentation framework that meets open science standards
- Automated reproducibility testing through GitHub Actions
- A publication-ready structure that reviewers and collaborators can navigate immediately
You download it, fill in your research question and data, and start analyzing. The infrastructure is handled.
Who This Template Is For
This template serves three audiences:
Graduate students starting their first computational humanities project. You know your research question and your cultural tradition, but you have not built a reproducible computational pipeline before. The template gives you a proven structure to follow.
Established DH researchers who want to adopt reproducibility best practices without rebuilding their workflow from scratch. You have been doing computational work for years, but your projects live in scattered scripts and undocumented notebooks. The template shows you how to organize existing work into a reproducible format.
Interdisciplinary researchers from computer science, linguistics, or social science who want to apply computational methods to cultural questions. You know how to code, but you are not sure how to structure a humanities research project or how to document interpretive choices alongside computational ones.
Template Structure
graph TD
A[Project Root] --> B[notebooks/]
A --> C[data/]
A --> D[src/]
A --> E[outputs/]
A --> F[docs/]
A --> G[Configuration Files]
B --> B1[01-data-collection.ipynb]
B --> B2[02-preprocessing.ipynb]
B --> B3[03-analysis.ipynb]
B --> B4[04-visualization.ipynb]
C --> C1[raw/]
C --> C2[processed/]
C --> C3[README.md]
D --> D1[utils.py]
D --> D2[metrics.py]
D --> D3[visualization.py]
G --> G1[environment.yml]
G --> G2[Makefile]
G --> G3[.github/workflows/ci.yml]
G --> G4[CITATION.cff]
Each component has a specific purpose. Let me walk through them.
Notebook 1: Data Collection
The first notebook handles acquiring and documenting your primary sources. It includes:
Source Acquisition
python
"""
Notebook 01: Data Collection
=============================
Purpose: Acquire and document primary sources for cultural analysis.
Output: Raw data files in data/raw/ with full provenance documentation.
"""
# Configuration
PROJECT_NAME = "your-project-name"
CULTURAL_TRADITION = "your-tradition"
TIME_PERIOD = "your-period"
# Data sources (modify for your project)
SOURCES = {
    "corpus_a": {
        "description": "Primary literary corpus",
        "url": "https://example.com/corpus",
        "license": "CC-BY-4.0",
        "language": "en",
        "size": "500 documents",
    },
    "corpus_b": {
        "description": "Comparison corpus",
        "url": "https://example.com/comparison",
        "license": "Public Domain",
        "language": "en",
        "size": "500 documents",
    },
}
Provenance Documentation
The notebook automatically generates a data provenance record:
python
import json
from datetime import datetime
provenance = {
    "project": PROJECT_NAME,
    "collection_date": datetime.now().isoformat(),
    "sources": SOURCES,
    "collector": "Your Name",
    "notes": "Describe any filtering or selection criteria here",
}
with open("data/raw/PROVENANCE.json", "w") as f:
    json.dump(provenance, f, indent=2)
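A provenance record is easier to trust when it also says exactly which bytes were collected. The sketch below is an assumption rather than part of the template: the helper names `sha256_of` and `checksum_manifest` are hypothetical, and it relies only on the standard library to compute a SHA-256 digest for every file under the raw data directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(raw_dir, skip=("PROVENANCE.json",)):
    """Map each file under raw_dir (relative POSIX path) to its SHA-256 digest."""
    raw_dir = Path(raw_dir)
    return {
        p.relative_to(raw_dir).as_posix(): sha256_of(p)
        for p in sorted(raw_dir.rglob("*"))
        if p.is_file() and p.name not in skip
    }
```

One way to use it would be to set `provenance["checksums"] = checksum_manifest("data/raw")` before writing the JSON file, so a later run can detect whether any raw file has changed.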
Ethical Checklist
The notebook includes an ethical review checklist that you fill out before proceeding:
python
ethical_checklist = {
    "data_consent": "N/A - public domain texts",
    "cultural_sensitivity": "Reviewed with [community/advisor]",
    "copyright_status": "All texts verified public domain or CC-licensed",
    "indigenous_data_sovereignty": "N/A or describe CARE compliance",
    "potential_harms": "Describe any risks of misuse",
}
This is not bureaucratic overhead. It is a record that demonstrates you thought about the ethical dimensions of your data before you started computing on it.
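If you want the notebook to stop when the checklist still holds template text, a small guard can do that. This is a hedged sketch, not something the template is known to ship: `unresolved_items` is a hypothetical helper, and the placeholder markers (a square bracket or the word "describe") are a heuristic based on the example values above.

```python
def unresolved_items(checklist, markers=("[", "describe")):
    """Return checklist keys whose values are empty or still contain template text."""
    pending = []
    for key, value in checklist.items():
        text = str(value).strip().lower()
        if not text or any(marker in text for marker in markers):
            pending.append(key)
    return pending


# The unedited template values, used here only to demonstrate the check
template_checklist = {
    "data_consent": "N/A - public domain texts",
    "cultural_sensitivity": "Reviewed with [community/advisor]",
    "copyright_status": "All texts verified public domain or CC-licensed",
    "indigenous_data_sovereignty": "N/A or describe CARE compliance",
    "potential_harms": "Describe any risks of misuse",
}
pending = unresolved_items(template_checklist)
print(f"Checklist entries still to fill in: {pending}")
```

Because the check is a plain substring match, a genuine answer that happens to contain the word "describe" would also be flagged. Adjust the markers to suit your own wording.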
Notebook 2: Preprocessing
The second notebook transforms raw data into analysis-ready formats. It handles:
Text Cleaning
python
"""
Notebook 02: Preprocessing
============================
Purpose: Transform raw data into analysis-ready format.
Input: data/raw/
Output: data/processed/
"""
import pandas as pd
from src.utils import clean_text, tokenize, normalize
# Load raw data
raw_texts = pd.read_csv("data/raw/corpus.csv")
print(f"Loaded {len(raw_texts)} documents")
# Cleaning pipeline
def preprocess(text):
    """Standard preprocessing pipeline. Modify for your needs."""
    text = clean_text(text)    # Remove markup, fix encoding
    text = normalize(text)     # Lowercase, normalize whitespace
    tokens = tokenize(text)    # Language-appropriate tokenization
    return tokens
raw_texts["tokens"] = raw_texts["text"].apply(preprocess)
raw_texts["token_count"] = raw_texts["tokens"].apply(len)
# Summary statistics
print(f"Mean document length: {raw_texts['token_count'].mean():.0f} tokens")
print(f"Total tokens: {raw_texts['token_count'].sum():,}")
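The notebook imports `clean_text`, `normalize`, and `tokenize` from `src/utils.py`, which is not shown here. As a rough idea of what such helpers could look like, here is a standard-library sketch. It is an assumption about their behavior, and the regex tokenizer is a simple stand-in for a language-aware tokenizer such as spaCy.

```python
import html
import re
import unicodedata

_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")
_TOKEN = re.compile(r"\w+(?:['’-]\w+)*")


def clean_text(text):
    """Strip HTML tags, decode entities, and apply Unicode NFC normalization."""
    text = _TAG.sub(" ", text)
    text = html.unescape(text)
    return unicodedata.normalize("NFC", text)


def normalize(text):
    """Lowercase and collapse runs of whitespace to single spaces."""
    return _WHITESPACE.sub(" ", text.lower()).strip()


def tokenize(text):
    """Split into word tokens with a Unicode-aware regex (no language model)."""
    return _TOKEN.findall(text)


sample = "<p>The  Caf&eacute; isn't open</p>"
sample_tokens = tokenize(normalize(clean_text(sample)))
print(sample_tokens)
```

A regex tokenizer like this breaks down for scripts that do not separate words with spaces, so treat it as a placeholder until you have chosen a tokenizer suited to your language.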
Quality Checks
python
# Data quality checks
assert raw_texts["text"].notna().all(), "Found null texts"
assert (raw_texts["token_count"] >= 10).all(), "Found documents shorter than 10 tokens"
# Check for duplicates
duplicates = raw_texts.duplicated(subset=["text"])
print(f"Duplicates found: {duplicates.sum()}")
if duplicates.sum() > 0:
    raw_texts = raw_texts[~duplicates]
    print(f"Removed duplicates. Remaining: {len(raw_texts)}")
Save Processed Data
python
# Save processed data with metadata
raw_texts.to_parquet("data/processed/corpus_processed.parquet")
# Document preprocessing decisions
preprocessing_log = {
    "steps": [
        "Removed HTML markup",
        "Fixed UTF-8 encoding issues",
        "Lowercased all text",
        "Tokenized using spaCy (language model: xx_ent_wiki_sm)",
        "Removed documents shorter than 10 tokens",
        "Removed exact duplicates",
    ],
    # int(): plain Python integers keep the log JSON-serializable
    "input_count": int(len(raw_texts) + duplicates.sum()),
    "output_count": len(raw_texts),
    "removed": int(duplicates.sum()),
}
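The log above only exists in memory until it is written somewhere. A minimal sketch of persisting it next to the processed data follows. The function name `save_preprocessing_log` and the file name in the usage note are assumptions, not part of the template.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_preprocessing_log(log, path):
    """Write the log dict to `path` as JSON, adding a UTC timestamp."""
    record = dict(log, written_at=datetime.now(timezone.utc).isoformat())
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        # default=int converts NumPy integer scalars, which json cannot serialize
        json.dump(record, f, indent=2, default=int)
    return path
```

In the notebook this could be called as `save_preprocessing_log(preprocessing_log, "data/processed/PREPROCESSING_LOG.json")`, mirroring the provenance file written in Notebook 01.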
Notebook 3: Analysis
The third notebook is where your research happens. The template provides three analysis tracks that you can use individually or combine:
Track A: Activation Analysis (for Cultural Interpretability)
python
"""
Notebook 03: Analysis
======================
Purpose: Apply cultural interpretability methods to processed data.
Input: data/processed/
Output: outputs/tables/, outputs/intermediate/
"""
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sae_lens import SAE
from src.metrics import compute_density, feature_overlap, completion_divergence
# Load model (modify for your chosen model)
MODEL_NAME = "EleutherAI/pythia-410m"
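# NOTE: SAE_ID is a placeholder. sae-lens locates pretrained SAEs by a release name
# and an SAE id, so check its documentation for the arguments your version expects.
SAE_ID = "pythia-410m-layer6-32k"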
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
sae = SAE.from_pretrained(SAE_ID)
# Define your cultural referents
# (Replace with your own tradition)
referents_a = [...] # Your primary cultural tradition
referents_b = [...] # Comparison group
# Compute metrics
density_a = compute_density(referents_a, model, tokenizer, sae)
density_b = compute_density(referents_b, model, tokenizer, sae)
overlap = feature_overlap(referents_a, referents_b, model, tokenizer, sae)
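The metric functions live in `src/metrics.py`, which is not shown. To make the idea concrete, here is a sketch of what "density" and "overlap" could mean once you already have a vector of SAE feature activations per referent. This is an assumption: the function names, the threshold parameter, and the use of Jaccard similarity are illustrative choices, and the template's own functions take the model, tokenizer, and SAE as arguments instead.

```python
def active_features(activations, threshold=0.0):
    """Indices of features whose activation exceeds the threshold."""
    return {i for i, value in enumerate(activations) if value > threshold}


def density(activations, threshold=0.0):
    """Number of active features for one referent."""
    return len(active_features(activations, threshold))


def jaccard_overlap(activations_a, activations_b, threshold=0.0):
    """Jaccard similarity between two referents' active-feature sets."""
    a = active_features(activations_a, threshold)
    b = active_features(activations_b, threshold)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy activation vectors over four features (made-up numbers for illustration)
toy_a = [0.0, 1.2, 0.0, 0.4]
toy_b = [0.3, 0.9, 0.0, 0.0]
print(density(toy_a), density(toy_b), jaccard_overlap(toy_a, toy_b))
```

Whatever definition you adopt, record the threshold alongside your results, since the count of "active" features depends directly on it.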
Track B: Corpus Analysis (for Topic Modeling / Stylometry)
python
# Alternative track: corpus-based analysis
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
# Topic modeling
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
doc_term_matrix = vectorizer.fit_transform(raw_texts["text"])
lda = LatentDirichletAllocation(
    n_components=20,
    random_state=42,  # Fixed seed for reproducibility
    max_iter=50,
)
topics = lda.fit_transform(doc_term_matrix)
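A fitted topic model is hard to interpret until you look at the highest-weighted words in each topic. The helper below is a hypothetical addition (the name `top_words` is not from the template) that works on any topic-by-term weight matrix, shown here with a toy matrix so it runs without scikit-learn.

```python
def top_words(components, vocabulary, n=10):
    """For each topic row, return the n highest-weighted vocabulary terms."""
    topics = []
    for weights in components:
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([vocabulary[i] for i in ranked[:n]])
    return topics


# Toy example: two topics over a three-word vocabulary (made-up weights)
toy_components = [[0.1, 2.0, 0.5], [3.0, 0.2, 0.1]]
toy_vocabulary = ["river", "song", "king"]
toy_topics = top_words(toy_components, toy_vocabulary, n=2)
print(toy_topics)
```

With the model fitted above, the call would be `top_words(lda.components_, vectorizer.get_feature_names_out(), n=10)`.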
Track C: Network Analysis
python
# Alternative track: network analysis
import networkx as nx
# Build network from your relational data
G = nx.Graph()
# Add nodes and edges based on your research design
# G.add_edge(author_a, author_b, weight=similarity_score)
# Compute centrality metrics
centrality = nx.betweenness_centrality(G)
communities = nx.community.louvain_communities(G, seed=42)  # Fixed seed for reproducibility
Statistical Testing
Regardless of which track you use, the template includes a statistical testing section:
python
import numpy as np
from scipy import stats
# Compare groups
statistic, p_value = stats.mannwhitneyu(
    density_a, density_b, alternative="two-sided"
)
# Effect size (Cohen's d)
def cohens_d(group1, group2):
    """Cohen's d with a pooled standard deviation (sample variances, ddof=1)."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (np.mean(group1) - np.mean(group2)) / np.sqrt(pooled_var)
effect = cohens_d(density_a, density_b)
print(f"Mann-Whitney U: {statistic:.1f}, p = {p_value:.6f}")
print(f"Effect size (Cohen's d): {effect:.2f}")
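Cohen's d is built on means and variances, while the Mann-Whitney test is rank-based. If your densities are skewed, a rank-based effect size may be a more natural companion. The sketch below is an optional addition, not part of the template: `rank_biserial` is a hypothetical helper that counts, over all cross-group pairs, how often a value from the first group exceeds one from the second (ties count as half), which is the same pair count the U statistic is built on.

```python
def rank_biserial(group1, group2):
    """Rank-biserial correlation from pairwise comparisons (ties count as half)."""
    n1, n2 = len(group1), len(group2)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    wins = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in group1
        for y in group2
    )
    # Rescale the share of "wins" from [0, 1] to [-1, 1]
    return 2.0 * wins / (n1 * n2) - 1.0
```

The result ranges from -1 (every value in the first group is smaller) to 1 (every value is larger), with 0 meaning neither group tends to exceed the other. The pairwise loop is quadratic, which is fine for a few hundred referents per group.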
Notebook 4: Visualization
The fourth notebook generates publication-ready figures:
python
"""
Notebook 04: Visualization
============================
Purpose: Generate publication-ready figures and tables.
Input: outputs/intermediate/
Output: outputs/figures/, outputs/tables/
"""
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Set publication style
plt.style.use("seaborn-v0_8-paper")
plt.rcParams.update({
    "font.size": 11,
    "axes.labelsize": 12,
    "axes.titlesize": 13,
    "figure.dpi": 300,
    "savefig.dpi": 300,
    "savefig.bbox": "tight",
})
# Example: density comparison plot
# (density_a and density_b come from Notebook 03; load them from outputs/intermediate/ first)
fig, ax = plt.subplots(figsize=(8, 5))
data = pd.DataFrame({
    "Density": list(density_a) + list(density_b),
    "Group": ["Tradition A"] * len(density_a) + ["Tradition B"] * len(density_b),
})
sns.boxplot(data=data, x="Group", y="Density", ax=ax)
ax.set_title("Representational Density Comparison")
ax.set_ylabel("Active SAE Features")
plt.savefig("outputs/figures/density_comparison.png")
plt.savefig("outputs/figures/density_comparison.pdf") # Vector for publication
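The notebook header names `outputs/intermediate/` as its input, but the plotting code refers to `density_a` and `density_b` directly, so the values need to travel between notebooks somehow. One simple option is a JSON file. The helper names and the file name below are assumptions for illustration only.

```python
import json
from pathlib import Path


def save_densities(path, **groups):
    """Write named groups of numeric values to a JSON file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {name: [float(v) for v in values] for name, values in groups.items()}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def load_densities(path):
    """Read the groups back as a dict mapping group name to a list of floats."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

At the end of Notebook 03 you might call `save_densities("outputs/intermediate/densities.json", density_a=density_a, density_b=density_b)`, then start Notebook 04 with `densities = load_densities("outputs/intermediate/densities.json")`. JSON keeps the intermediate file human-readable and diffable, at the cost of being slower than a binary format for very large arrays.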
The Environment File
The environment.yml file pins the version of every direct dependency:
yaml
name: dh-cultural-analysis
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.11.8
  - jupyter=1.0.0
  - pandas=2.2.1
  - numpy=1.26.4
  - matplotlib=3.8.3
  - seaborn=0.13.2
  - scikit-learn=1.4.1
  - scipy=1.12.0
  - networkx=3.2.1
  - nltk=3.8.1
  - spacy=3.7.4
  - pip
  - pip:
      - transformers==4.38.2
      - torch==2.2.1
      - sae-lens==3.2.0
      - datasets==2.18.0
      - pyarrow==15.0.0
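Pins only help if the environment you are actually running matches them. The following sketch compares the pins in an environment file against what is installed. It is not part of the template: `parse_pins` and `installed_mismatches` are hypothetical helpers, the parser is a line-based regex rather than a YAML parser, and it assumes the conda package name matches the installed distribution name.

```python
import re
from importlib import metadata

_PIN = re.compile(r"^\s*-\s*([A-Za-z0-9_.-]+?)={1,2}([0-9][^\s#]*)\s*$")


def parse_pins(lines):
    """Extract {package: version} from '- name=1.2.3' or '- name==1.2.3' lines."""
    pins = {}
    for line in lines:
        match = _PIN.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def installed_mismatches(pins, skip=("python",)):
    """Return {package: (pinned, installed_or_None)} where versions differ."""
    problems = {}
    for name, pinned in pins.items():
        if name in skip:
            continue
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Accept exact matches and suffixed builds such as 1.4.1.post1
        matches = installed is not None and (
            installed == pinned or installed.startswith(pinned + ".")
        )
        if not matches:
            problems[name] = (pinned, installed)
    return problems
```

A typical use would be `installed_mismatches(parse_pins(open("environment.yml")))` in the first cell of a notebook, printing a warning when the result is not empty.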
The Makefile
The Makefile automates the full pipeline:
makefile
.PHONY: all clean data preprocess analyze visualize test
all: data preprocess analyze visualize
data:
	jupyter nbconvert --execute --to notebook \
		--output-dir=outputs/executed \
		notebooks/01-data-collection.ipynb
preprocess: data
	jupyter nbconvert --execute --to notebook \
		--output-dir=outputs/executed \
		notebooks/02-preprocessing.ipynb
analyze: preprocess
	jupyter nbconvert --execute --to notebook \
		--output-dir=outputs/executed \
		notebooks/03-analysis.ipynb
visualize: analyze
	jupyter nbconvert --execute --to notebook \
		--output-dir=outputs/executed \
		notebooks/04-visualization.ipynb
test:
	pytest tests/ -v
# clean also recreates the empty output directories the notebooks write into
clean:
	rm -rf outputs/executed/
	rm -rf outputs/figures/
	rm -rf outputs/tables/
	mkdir -p outputs/figures outputs/tables
Running make all executes the entire pipeline from data collection to visualization. Running make clean && make all deletes the generated outputs and re-runs every notebook, which checks that the results can be rebuilt from scratch.
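The test target runs pytest against a tests/ directory whose contents are not shown here. As one possibility, a test could confirm that the pipeline produced the files this post mentions. The sketch below is an assumption about what such a check might look like, and `missing_outputs` is a hypothetical helper.

```python
from pathlib import Path

# Output files named elsewhere in this walkthrough
EXPECTED_OUTPUTS = [
    "outputs/figures/density_comparison.png",
    "outputs/figures/density_comparison.pdf",
    "outputs/tables/results_summary.csv",
]


def missing_outputs(root=".", expected=EXPECTED_OUTPUTS):
    """Return expected output paths that are absent or empty under root."""
    root = Path(root)
    return [
        rel
        for rel in expected
        if not (root / rel).is_file() or (root / rel).stat().st_size == 0
    ]


# In a hypothetical tests/test_outputs.py, the pytest check could be:
#     def test_pipeline_outputs_exist():
#         assert missing_outputs() == []
```

Checking for empty files as well as missing ones catches the case where a notebook created the file but failed before writing any content.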
GitHub Actions CI
The template includes a CI workflow that tests reproducibility on every push:
yaml
name: Reproducibility Check
on: [push, pull_request]
jobs:
  reproduce:
    runs-on: ubuntu-latest
    defaults:
      run:
        # Login shell so the conda environment is activated in every step
        shell: bash -el {0}
    steps:
      - uses: actions/checkout@v4
      - uses: conda-incubator/setup-miniconda@v3
        with:
          environment-file: environment.yml
          activate-environment: dh-cultural-analysis
      - name: Run full pipeline
        run: make all
      - name: Check outputs exist
        run: |
          test -f outputs/figures/density_comparison.png
          test -f outputs/tables/results_summary.csv
This means every time you push changes to your repository, the CI system verifies that your notebooks still run from top to bottom without errors. If a dependency update breaks something, you find out immediately.
How to Get Started
Step 1: Download the Template
bash
# Option A: Clone from GitHub
git clone https://github.com/Rantideb/dh-cultural-analysis-template.git
cd dh-cultural-analysis-template
# Option B: Download as ZIP from the releases page
# https://github.com/Rantideb/dh-cultural-analysis-template/releases
Step 2: Set Up Your Environment
bash
# Install Anaconda or Miniconda if you have not already
# Then create the environment:
conda env create -f environment.yml
conda activate dh-cultural-analysis
Step 3: Customize for Your Project
Open each notebook and replace the placeholder content with your own:
- In Notebook 01: Define your data sources and cultural context
- In Notebook 02: Adjust preprocessing for your language and text type
- In Notebook 03: Choose your analysis track and define your referents
- In Notebook 04: Customize visualizations for your findings
Step 4: Run and Verify
bash
# Run the full pipeline
make all
# Verify everything worked
ls outputs/figures/
ls outputs/tables/
Step 5: Publish
When your analysis is complete:
- Push to GitHub
- Create a release
- Link to Zenodo for a DOI
- Reference the DOI in your manuscript
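The project root includes a CITATION.cff file, and a release is a natural moment to update it. The sketch below renders a minimal file using the field names from the Citation File Format 1.2.0 schema. The function `citation_cff` and every value passed to it are placeholders, not the template's actual metadata.

```python
import json


def _quote(value):
    """Render a string as a double-quoted scalar (JSON strings are valid YAML)."""
    return json.dumps(str(value), ensure_ascii=False)


def citation_cff(title, authors, version=None, doi=None, date_released=None):
    """Render a minimal CITATION.cff document as a string.

    authors is a list of (given_names, family_names) tuples;
    date_released, if given, should be an ISO date such as 2024-01-31.
    """
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this work, please cite it as below."',
        f"title: {_quote(title)}",
        "authors:",
    ]
    for given, family in authors:
        lines.append(f"  - given-names: {_quote(given)}")
        lines.append(f"    family-names: {_quote(family)}")
    if version is not None:
        lines.append(f"version: {_quote(version)}")
    if doi is not None:
        lines.append(f"doi: {_quote(doi)}")
    if date_released is not None:
        lines.append(f"date-released: {date_released}")
    return "\n".join(lines) + "\n"
```

Once Zenodo has minted a DOI for your release, you could regenerate the file with that DOI and commit it, so the citation in the repository and the one in your manuscript agree.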
Adapting the Template for Different Research Questions
The template is designed for cultural interpretability but adapts to other digital humanities methods with minimal changes:
For topic modeling: Remove the model loading code from Notebook 03 and use Track B. Add MALLET or gensim to the environment file.
For network analysis: Remove the model loading code and use Track C. Add graph visualization libraries (pyvis, nxviz) to the environment file.
For stylometry: Replace the analysis notebook with stylometric feature extraction. Add a Python stylometry library to the environment file, or call R's stylo package from Python through rpy2.
For mixed methods: Use multiple tracks in Notebook 03, each in its own section with clear narrative transitions explaining why you are combining methods.
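To give a sense of what "stylometric feature extraction" involves, here is a minimal sketch that turns a text into relative frequencies of common function words, a classic input for authorship and style comparison. It is an illustration under stated assumptions: the helper name is hypothetical, and the short English word list is a placeholder you would replace with a list suited to your language and corpus.

```python
import re
from collections import Counter

# Placeholder list of English function words; extend or replace for your corpus
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "is", "was")


def function_word_profile(text, words=FUNCTION_WORDS):
    """Relative frequency of each function word among all word tokens."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total if total else 0.0 for w in words]


profile = function_word_profile("The cat and the dog.", words=("the", "and", "of"))
print(profile)
```

Each document becomes a fixed-length vector, so the profiles can be stacked into a matrix and passed to the same clustering or distance measures you would use in the other tracks.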
Support and Collaboration
If you run into issues setting up the template, have questions about adapting it for your specific cultural tradition, or want more hands-on support for your research project, I offer digital humanities consultancy and research collaboration services.
Common requests include:
- Help choosing the right analysis track for a specific research question
- Custom modifications to the template for non-Latin script languages
- Guidance on publishing reproducible notebooks alongside journal articles
- Collaboration on cultural interpretability studies for underrepresented traditions
You can request research collaboration through the contact page. I work with researchers at every career stage, from graduate students and postdocs to established scholars.
The template is free and open source. Use it, modify it, share it. The only thing I ask is that you cite it if it contributes to your published work, and that you share your own notebooks so the field keeps growing.