---
title: "Building a Multi-Writer Serverless SQLite Engine on Amazon S3"
author: "Rantideb Howlader"
date: "2026-06-05T00:00:00.000Z"
canonical_url: "https://www.ranti.dev/blog/s3-db-wal-hands-on-guide"
license: "CC-BY-4.0"
---


The problem is deceptively simple to state. I needed a transactional, queryable, relational data store for a stateless system running on AWS Lambda - zero persistent infrastructure, zero idle cost, full SQL expressiveness. RDS and Aurora require always-on compute. DynamoDB demands that you know your access patterns at schema design time. With the release of Amazon S3 Files, S3 can now be mounted directly as a filesystem, solving the object-store mismatch. However, SQLite's WAL mode explicitly does not work over any network filesystem out of the box.

That last sentence is the crux of the entire design. SQLite's own documentation states, without qualification: **'WAL does not work over a network filesystem. This is because WAL requires all processes to share a small amount of memory and processes on separate host machines obviously cannot share memory with each other.'** That constraint breaks a naive S3 Files mount approach before a line of code is written.

The architectural hypothesis of this system is that we can use S3 Files perfectly by explicitly bypassing the shared memory requirement. By running SQLite in EXCLUSIVE locking mode, we force the engine to maintain the WAL index in local heap memory rather than a shared `.db-shm` file. This allows WAL mode to work perfectly over the S3 Files mount.

Here's how the whole thing works:

**SQLite runs directly against the Amazon S3 Files mount (e.g., `/mnt/s3`).** The database is opened directly from the mount. Because we use `PRAGMA locking_mode = EXCLUSIVE`, SQLite does not attempt to create or share a `.db-shm` file, bypassing the network filesystem shared-memory limitation.

**S3 Files acts as the definitive filesystem.** Because S3 Files provides strong read-after-write consistency and POSIX file semantics, reads and writes occur directly against the mounted files. There is no need to download the database to `/mnt/s3` on cold starts or upload it after a write.

**Distributed write serialization uses S3's conditional write API as a compare-and-swap primitive.** Since August 2024, S3 supports `If-None-Match: *` on PutObject, which succeeds only if the key does not exist - providing a distributed put-if-absent. The S3 conditional write mechanism provides a distributed put-if-absent, which forms the entire distributed coordination substrate. There is no DynamoDB lock table, no Redis sentinel, no Zookeeper, no external coordinator.

**The word 'multi-writer' requires precision.** Multiple Lambda execution environments can each acquire write leadership, execute a WAL transaction, checkpoint the WAL into the database, and commit the new generation to S3. They do this in sequence, not simultaneously. True concurrent multi-writer operation - where two Lambda functions commit write transactions to the same database at the same instant - is not possible with this architecture. S3's conditional write makes the first committer win and returns 412 Precondition Failed or 409 ConditionalRequestConflict to all concurrent challengers. That is distributed optimistic locking, not MVCC. I designed the system to exploit that property honestly, not to obscure it behind marketing language.

## Why I Chose This Stack

Here's a look at the other options I considered, why they didn't work for this specific use case, and the trade-offs I decided to accept.

### Why SQLite?

SQLite is not the obvious choice for a distributed serverless system. It was designed for local, single-host use. I chose it for five specific reasons:

**Zero schema migration infrastructure.** SQLite databases are self-describing single files. There is no schema migration daemon, no DDL lock, no migration server. A schema migration is `ALTER TABLE` or `CREATE TABLE IF NOT EXISTS` inside a transaction. The database file carries the schema. This matters enormously in a serverless environment where there is no 'migration runner' process.

**Read performance on local storage is exceptional.** When SQLite runs against `/tmp` on Lambda, reads hit the kernel page cache on the second and subsequent queries. SQLite's B-tree layout means a point lookup on an indexed table is deterministic in O(log N) page reads. On a 10,000-row table with a proper index, that is typically 3-4 page reads. All from local memory after the first access.

**The query language is complete.** Window functions, CTEs, full-text search (FTS5), JSON functions, UPSERT, RETURNING clauses. SQLite's SQL dialect is substantially complete for OLTP workloads. I can write queries in the application without writing a query language.

**The file format is stable and documented.** The SQLite file format is documented to the byte level at <a href="https://sqlite.org/fileformat2.html?utm_source=ranti.dev" target="_blank" rel="noopener noreferrer">sqlite.org</a>. It has been backward-compatible since version 3.0.0 in 2004. A database written by one SQLite version is readable by any later version. This matters when Lambda runtimes update.

**The operational surface is a single file.** Backup is a file copy. Restore is a file copy. Inspect the schema with `sqlite3 file.db .schema`. No daemon, no port, no credential rotation, no connection pool. For small-to-medium datasets, this operational simplicity is a genuine engineering advantage, not a compromise.

The trade-off:\* single-writer serialization means write throughput is bounded by the round-trip time of the S3 lease acquisition plus the S3 database upload. I'll break down the exact benchmarks later on.

### Why Not Amazon RDS PostgreSQL?

RDS PostgreSQL is the correct answer for workloads that need full MVCC multi-writer concurrency, mature replication, streaming CDC, connection pooling via PgBouncer, and decades of operational tooling. For this workload, RDS fails on three specific constraints.

**Always-on cost.** The smallest RDS PostgreSQL instance (`db.t4g.micro`) costs approximately $0.016/hour , which is roughly $140/year for a database that may serve fewer than 10,000 requests/month. At low scale, that cost is not justified.

**VPC requirement.** RDS lives inside a VPC. Lambda functions connecting to RDS must also be VPC-attached. VPC-attached Lambda functions lose access to most AWS services that use public endpoints unless you add NAT Gateways or VPC Endpoints - each of which adds cost and operational complexity. VPC attachment also adds cold-start latency to Lambda initialization.

**RDS Proxy overhead.** Lambda's ephemeral connection model is hostile to database connection pooling. The standard answer is RDS Proxy, which maintains a warm connection pool and multiplexes Lambda connections. RDS Proxy adds another ~$0.015/endpoint/hour plus additional IAM complexity.

\***\The trade-off:\*\*** I sacrifice true multi-writer concurrency and accept serialized writes through the lease protocol.

### Why Not Amazon Aurora Serverless?

Aurora Serverless v2 can scale down to 0.5 ACUs at minimum - it does not scale to zero. At 0.5 ACUs, you pay continuously even when your application serves no traffic. Aurora Serverless v1 did scale to zero but was deprecated and its replacement does not. The 2026 Aurora pricing model means a perpetually idle workload still incurs a meaningful monthly floor cost. My workload is bursty and idle for extended periods. Aurora Serverless does not match that cost profile.

Additionally, Aurora requires the same VPC and connection-management overhead as RDS. The scaling up story is excellent; the scaling to zero story no longer exists.

### Why Not Amazon DynamoDB?

DynamoDB genuinely scales to zero, has single-digit millisecond latency at any scale, and requires no infrastructure management. For specific access patterns it is the correct answer.

The constraint that eliminates it for this workload is the query model. DynamoDB is a key-value store with limited secondary index support. It requires you to pre-define access patterns at table design time, de-normalize data, and express queries as key conditions against known partition and sort key combinations. Ad-hoc `GROUP BY`, multi-table `JOIN`, window functions, and recursive CTEs require either extremely careful table design or moving computation into the application layer. I needed the ability to write expressive SQL queries against normalized data without pre-knowing all access patterns. DynamoDB cannot provide this.

PartiQL support in DynamoDB does not change this analysis. PartiQL on DynamoDB executes queries as DynamoDB operations under the hood, with the same key-access constraints.

\***\The trade-off:\*\*** I sacrifice DynamoDB's extreme horizontal write scalability and its true multi-master global table model.

### Why Not LiteFS?

LiteFS is a FUSE-based passthrough filesystem that intercepts SQLite write transactions, captures per-transaction page sets into LTX-formatted change files, and replicates them across nodes. It is an elegant design. I evaluated it seriously. I rejected it for three compounding reasons.

**FUSE is not available in Lambda execution environments.** Lambda does not expose the FUSE kernel module to execution environments. Without FUSE, LiteFS cannot function in its primary mode.

**The documented write ceiling is approximately 100 transactions per second.** LiteFS documentation explicitly states this FUSE-imposed throughput limit. For write-heavy workloads, this is a hard architectural constraint.

**LiteFS Cloud was sunset in October 2024.** The managed backup tier of LiteFS is no longer available. Fly.io has deprioritized active development. Pre-1.0 software with a deprioritized maintainer is a long-term risk for production infrastructure.

**Residual risk of my rejection:** LiteFS's `LiteVFS` - a pure SQLite VFS extension requiring no FUSE - is available and works where FUSE does not. I chose not to use it because LiteFS's distributed coordination relies on Consul leases or object-store conditional writes in a way that duplicates what I am building with raw S3 conditional writes. Adding LiteFS as a dependency would add complexity without removing the fundamental lease-protocol engineering I need to do anyway.

### Why Not Turso / libSQL?

libSQL is a fork of SQLite maintained by Turso that adds embedded replication, edge database routing, and an experimental `BEGIN CONCURRENT` multi-writer mode backed by MVCC-style page-level conflict detection. It maintains SQLite's wire format and API compatibility. If your team can accept a fork dependency, libSQL is the most architecturally complete answer to distributed SQLite.

I chose not to use it for one reason: I wanted to understand whether the problem could be solved using upstream SQLite without any fork. This is an engineering exploration as much as a product decision. If I were building a commercial system with strict availability and scale requirements, libSQL would be on my shortlist and S3-DB-WAL would be a prototype.

\***\The trade-off:\*\*** I lose `BEGIN CONCURRENT` and true MVCC multi-writer, which libSQL provides experimentally. My write model is serialized through the lease.

**One technical note about `BEGIN CONCURRENT`:** even in libSQL, `BEGIN CONCURRENT` aborts when two concurrent transactions touch pages on the same B-tree node - including different rows on the same B-tree page. The granularity is page-level, not row-level. This is a documented limitation of the page-conflict detection approach. It is not the same as PostgreSQL's tuple-level MVCC.

### Why Not rqlite?

rqlite is a correct, well-tested distributed relational database built on Raft consensus over SQLite. It provides linearizable reads at the `strong` consistency level, serializable writes through the Raft log, and automatic failover. For persistent cluster deployments, it is the most production-hardened distributed SQLite option available.

The constraint that eliminates it for this workload is the deployment model. rqlite requires a persistent cluster of servers - minimum 3 for quorum - with stable network addresses, persistent storage, and continuous operation. Running a persistent 3-node cluster on Lambda is not possible; Lambda is stateless by design. Running a 3-node cluster on EC2 or ECS would work, but that reintroduces the 'always-on compute cost' problem that drove me to explore this architecture in the first place.

### Why Not EFS + SQLite?

This is the most tempting alternative because EFS genuinely satisfies more of SQLite's locking requirements than any other AWS storage service. EFS supports NFSv4.1 advisory byte-range locking and whole-file locking. Lambda can mount EFS via Access Points with a VPC-attached function. This eliminates the S3 coordination complexity entirely.

I tested EFS + SQLite WAL mode. The fundamental problem is shared memory.

SQLite's WAL mode coordinator - the wal-index - is implemented as a memory-mapped file (`.db-shm`). Multiple concurrent readers and a single writer coordinate access to the WAL through this shared memory region. The critical point: **each host that mmaps the `.db-shm` file gets its own process-local memory mapping.** Changes written by one process are visible to other processes on the same host through the kernel's unified buffer cache. They are NOT visible to processes on different hosts even if both are reading the same file over NFS - because NFS mmap coherence across hosts requires a network-level distributed lock manager, which EFS's NFSv4.1 advisory locking does not provide in the way SQLite's WAL protocol requires.

The practical result: two Lambda execution environments opening the same SQLite WAL-mode database over EFS will maintain inconsistent wal-index state. This is not a race condition that careful locking can fix. It is a design assumption violation.

Single-writer EFS + SQLite in EXCLUSIVE locking mode (which eliminates the `-shm` requirement) is technically feasible. But now you have serialized access over a network filesystem with 5-10× higher per-operation latency than local storage. You have also added VPC attachment to your Lambda function, adding cold-start latency and NAT Gateway cost. Compared to S3-DB-WAL, EFS + SQLite EXCLUSIVE mode is strictly worse on latency, cost, and architectural simplicity for this workload.

\*\*

## How The Architecture Works

The system has three logical planes: a Control Plane managing write leadership and generation tracking, a Data Plane executing SQLite operations, and a Replication Plane propagating committed database generations across execution environments.

### Control Plane

The control plane is implemented entirely in S3. There are no dedicated servers, no coordination daemons, no external services.

**Write Lock Object:** `s3://{bucket}/db/{name}/write-lock`

This S3 object serves as the distributed mutex. Its content is a JSON document:

```json
{
  "holder_id": "{lambda_request_id}-{environment_id}",
  "acquired_at_epoch_ms": 1748980000000,
  "ttl_seconds": 45,
  "generation_etag": "{etag_of_current_db_file}"
}
```

**Lease Acquisition & Lifecycle:**

```mermaid
stateDiagram-v2
    [*] --> CheckLock: PutObject (If-None-Match: *)
    CheckLock --> Acquired: 200 OK
    CheckLock --> Failed: 412 / 409 Conflict

    Acquired --> Verify: Pre-Commit Check
    Verify --> Commit: holder_id matches
    Verify --> Abort: holder_id changed (TTL expired)

    Commit --> Released: DeleteObject
    Abort --> [*]
    Released --> [*]
```

**Lease Acquisition:** `PutObject` with `If-None-Match: *`. If the object does not exist, the PUT succeeds and the caller holds the lease. If it exists, S3 returns 412 Precondition Failed. If two callers race, S3 returns 409 ConditionalRequestConflict to all but the first to complete. The ETag of the successfully created lock object becomes the lock token.

**Lease Verification Before Commit:** Before committing the updated database transaction, the writer re-reads the lock object and confirms `holder_id` matches its own. If a previous lock expired and was acquired by a different caller during the write, the writer detects this and aborts the commit. It then re-attempts from the beginning.

**Lease Release:** `DeleteObject` on the lock key. S3's strong consistency guarantees the next `PutObject` with `If-None-Match: *` sees the absence.

**Expired Lease Recovery:** The TTL in the lock object is advisory. If the holding Lambda times out or crashes, the lock is not automatically released by S3 - S3 object deletion is not TTL-based unless S3 Lifecycle rules are configured. The recovery mechanism is: any writer that reads a lock object with `acquired_at_epoch_ms + ttl_seconds * 1000 < current_time_ms` may attempt to delete it and re-acquire. This is a TOCTOU race that I address later in this post (Failure Analysis). The mitigation is a Lambda-level heartbeat that re-extends the lock's TTL mid-transaction for long write operations.

**Database Object:** `s3://{bucket}/db/{name}/current.db`

This is the checkpointed SQLite database file, committed as a single S3 object (for databases up to the Lambda `/tmp` size limit). The ETag of this object is the generation identifier. S3 computes the ETag as the MD5 hash of the object content for single-part uploads. For multipart uploads, the ETag format changes and cannot be relied upon as a content hash - but it remains a reliable opaque generation token because S3 guarantees it changes on every write.

### Data Plane

The data plane executes directly against the S3 Files mount point (e.g., `/mnt/s3`). It consists of the SQLite database file and the WAL file (`.db-wal`). Because I use EXCLUSIVE mode for all writers, the WAL index (`.db-shm`) is never created on the network mount, eliminating the cross-process coherence problem.

**Writer Data Plane:**

1. Acquire write lock (control plane via S3 conditional write).
2. Open SQLite connection directly on the mount: `sqlite3_open_v2('/mnt/s3/db.sqlite', ...)`
3. Execute: `PRAGMA locking_mode = EXCLUSIVE; PRAGMA journal_mode = WAL;`
4. Execute write transactions.
5. Execute `PRAGMA wal_checkpoint(TRUNCATE)` to flush frames and truncate the WAL file.
6. Release S3 write lock.

**Reader Data Plane:**

1. Open SQLite connection directly on the mount in read-only mode: `sqlite3_open_v2('file:///mnt/s3/db.sqlite?mode=ro', ...)`
2. Execute queries.
3. Close connection. (Do not hold connections open across Lambda invocations; Lambda execution environments may be frozen mid-execution, causing connection state inconsistency.)

### Read Path

```mermaid
flowchart TD
    Req[Client Request] --> API[API Gateway]
    API --> L[Lambda Reader Environment]
    L --> Open["sqlite3_open_v2('file:///mnt/s3/db.sqlite?mode=ro')"]
    Open --> Exec["sqlite3_prepare_v2 / step / finalize"]
    Exec --> Res[Return JSON Result]
```

The key performance insight on the read path: S3 Files acts as the local filesystem. Thanks to SQLite's pager cache (which we size aggressively in Lambda memory), the first read of a page fetches from S3 Files, but subsequent reads for that page in a warm Lambda environment hit local memory immediately.

### Write Path

```mermaid
flowchart TD
    Req[Client Request] --> API[API Gateway]
    API --> L[Lambda Writer Environment]
    L --> Lock{"S3 PutObject<br/>'write-lock'<br/>(If-None-Match: *)"}
    Lock -- "200: Acquired" --> Open["sqlite3_open_v2('/mnt/s3/db.sqlite')"]
    Lock -- "412/409: Failed" --> Retry[Abort or Retry]

    Open --> Pragma["PRAGMA locking_mode = EXCLUSIVE<br/>PRAGMA journal_mode = WAL"]
    Pragma --> Tx["BEGIN;<br/>[write operations];<br/>COMMIT;"]
    Tx --> Checkpoint["PRAGMA wal_checkpoint(TRUNCATE)"]
    Checkpoint --> Unlock["S3 DeleteObject<br/>'write-lock'"]
```

### Checkpoint Path

WAL checkpointing is the mechanism by which WAL frames are transferred back into the main database file. I checkpoint explicitly after every write transaction batch, using `PRAGMA wal_checkpoint(TRUNCATE)`. The TRUNCATE mode both checkpoints all frames and truncates the WAL file to zero bytes. This ensures readers always hit the authoritative main file and don't need to reconstruct state across the NFS mount.

**A critical detail:** SQLite's WAL mode in EXCLUSIVE locking mode still writes to a `.db-wal` file. The difference is that no `.db-shm` is created or needed because no other process is allowed to read the WAL concurrently. After `PRAGMA wal_checkpoint(TRUNCATE)`, the `.db-wal` file exists but has zero WAL frames. The main database file is the authoritative, complete state of the data.

### Recovery Path

**Scenario: Lambda function crashes mid-write.**

The write lock remains in S3. The partially modified WAL file on the S3 Files mount is left behind, but because the transaction was not checkpointed and committed, the database remains in a consistent state. The next writer detects an expired lease, attempts `DeleteObject` on the lock, re-acquires with `If-None-Match: *`, opens the database, and proceeds normally. SQLite's standard crash recovery handles any uncommitted WAL frames.

### Backup Path

S3 Versioning on the database bucket provides automatic point-in-time backup. Every `PutObject` to `current.db` creates a new version. S3 stores all prior versions. I configure S3 Lifecycle rules to transition versions older than 7 days to S3 Glacier Instant Retrieval, and delete versions older than 90 days. This gives 90 days of point-in-time recovery at minimal cost.

Operational backup (distinct from S3 versioning) is a periodic Lambda-executed `VACUUM INTO '/tmp/backup.db'` followed by upload to a separate `backups/` prefix in S3. VACUUM INTO produces a defragmented, single-file database with no WAL and no `-shm`.

## The Linux VFS Journey of a Transaction

I will trace a single `INSERT` statement through the complete software stack, from the SQLite API boundary down to the S3 HTTP request, with every intermediate stage examined.

### The Call Site

```python
conn = sqlite3.connect('/mnt/s3/db.sqlite')
conn.execute('INSERT INTO events (id, payload) VALUES (?, ?)', (event_id, payload))
conn.commit()
```

At the Python level, `sqlite3.connect()` calls `sqlite3_open_v2()` in the SQLite C library. `conn.execute()` calls `sqlite3_prepare_v2()` followed by `sqlite3_step()`. `conn.commit()` triggers the WAL frame flush for the deferred transaction.

### sqlite3_step() → BTree Layer → Pager

`sqlite3_step()` drives the VDBE (Virtual Database Engine) - SQLite's bytecode interpreter. For an INSERT, the VDBE executes bytecode instructions including `OpenWrite` (opens a cursor on the target table's B-tree), `NewRowid` (allocates a new row ID), `MakeRecord` (serializes the row), `Insert` (inserts the record into the B-tree), and `Halt` (closes the cursor).

The `Insert` bytecode calls into the B-tree layer (`btree.c`), which locates the correct leaf page in the table's B-tree, inserts the serialized record, and handles page splits if the page is full. B-tree operations are entirely logical at this stage - they work against the pager's in-memory buffer of database pages.

The pager (`pager.c`) is SQLite's I/O abstraction layer. It maintains a page cache of 4096-byte pages (the default page size; configurable via `PRAGMA page_size`). When the BTree layer directs the pager to write a page, the pager does not immediately call the VFS. It marks the page as dirty and holds it in its cache. The actual write to storage does not occur until `COMMIT` causes the pager to write WAL frames.

### COMMIT → WAL Frame Construction

At `COMMIT`, in WAL mode, the pager writes the modified pages as WAL frames into the `.db-wal` file rather than updating the main database file in place. This is the core WAL mode distinction.

Each WAL frame consists of:

- A 24-byte frame header (page number, salt values, checksum)
- 4096 bytes of page content (one complete page)

The WAL also has a 32-byte file header (magic number, file format version, page size, checkpoint sequence number, salt pair, checksum).

The pager calls `sqlite3OsWrite()` - the VFS write abstraction - with the WAL frame data. In EXCLUSIVE locking mode, no `xShmLock` is called because there is no concurrent reader that needs to observe the WAL index update through shared memory.

### sqlite3OsWrite() → unix VFS → write() System Call

SQLite's unix VFS (`os_unix.c`) implements `xWrite()` by calling `pwrite64()` - a POSIX function for positioned writes. `pwrite64()` is equivalent to `lseek()` + `write()` but atomic with respect to the file offset - a critical property when multiple VFS calls might be interleaved in a concurrent system (though in our EXCLUSIVE mode, this is not a concern).

`pwrite64()` is a glibc wrapper around the `pwrite64` system call (syscall number 18 on x86-64).

### Crossing the Syscall Boundary

`pwrite64()` executes a `SYSCALL` instruction (formerly `INT 0x80` on i386; `syscall` on x86-64 since Linux 2.6). This transitions the CPU from ring 3 (userspace) to ring 0 (kernel space). The kernel's system call dispatch table routes the call to `ksys_pwrite64()` → `vfs_write()`.

I confirmed this path using `strace` on a SQLite WAL write workload on an EC2 Amazon Linux 2023 instance:

```
pwrite64(4, '\377\377\377\377...', 4096, 32) = 4096
```

The file descriptor 4 is the open `.db-wal` file. The offset 32 skips the WAL header. The count is exactly 4096 bytes (one WAL frame - 24-byte header + 4096-byte page, which I observe as two separate pwrite64 calls in practice: one for the 24-byte header and one for the 4096-byte body, or combined depending on SQLite's buffering).

### VFS Layer: inode, dentry, page cache

Inside the kernel, `vfs_write()` dispatches through the virtual filesystem switch to the concrete filesystem implementation - on the S3 Files mount, this is the filesystem underlying the ephemeral storage. AWS does not document this as `tmpfs` officially. On systems where `/tmp` is `tmpfs`, page cache and filesystem storage are unified - writes go directly to page-cache-backed memory pages associated with the inode. On ext4, writes go to the page cache with writeback scheduled by `pdflush`/`kswapd`.

The key structures involved:

**inode:** The inode for the `.db-wal` file is looked up via the dentry cache. The inode stores file metadata (size, modification time, block map) and a pointer to the address space (`i_mapping`) which manages the page cache for this file.

**dentry:** The directory entry cache maps `(/mnt/s3/db.sqlite-wal, parent_dentry)` to the inode. This is the first lookup during `open()`. On the `db.sqlite-wal` open at WAL-mode initialization, the dentry lookup finds or creates an entry. Subsequent writes hit the inode directly through the file descriptor's inode reference.

**page cache:** The kernel's page cache holds 4KB pages of file data in memory. The `write()` path (simplified): the kernel locates the page cache entry for the target file offset, marks the page dirty, copies user data into the page, and returns to userspace. The dirty page will be written back to the underlying storage asynchronously by the writeback mechanism. **This is the critical durability gap:** data in the page cache is not durable until written to the storage medium.

### fsync() and the Durability Guarantee

SQLite's WAL mode calls `xSync()` on the WAL file after writing WAL frames, and on the database file after a checkpoint. `xSync()` calls `fdatasync()` (or `fsync()` if `FULL` sync mode is configured). `fdatasync()` flushes dirty pages for the file to the storage device, excluding metadata updates that are not necessary for reading the file (e.g., atime).

On Lambda's `/mnt/s3` mount, the durability guarantee of `fdatasync()` depends on the underlying storage. Durability in S3-DB-WAL is not provided by local storage - it is provided by the S3 `PutObject` call that uploads the checkpointed database. Local ephemeral storage is scratch space by design.

### The S3 Write at Commit Time

After `PRAGMA wal_checkpoint(TRUNCATE)`, the WAL frames have been applied to the main database file. The pager has called `fdatasync()` on the main database file to ensure the checkpoint is durable to the (ephemeral) local storage. The application layer then executes:

```python
s3.put_object(
    Bucket=bucket,
    Key='db/name/current.db',

    IfMatch=previous_etag,
    ChecksumAlgorithm='SHA256'  # available since AWS SDK v1.24+
)
```

This S3 PUT call uses HTTPS (TLS 1.2 or 1.3) over TCP to the S3 regional endpoint. The request carries the database file as the request body. On the S3 side, the object is written atomically - readers will see either the previous version or the new version, never a partial write. The S3 service computes the ETag of the new object and returns it in the response. That ETag is the next generation identifier.

**Latency sources in the write path:**

1. B-tree page operations: microseconds (RAM)
2. WAL frame construction: microseconds to low milliseconds (RAM)
3. pwrite64() to the mount: network latency
4. fdatasync() on the mount: S3 commit latency
5. Checkpointing WAL to main DB: proportional to WAL frame count; typically milliseconds
6. S3 Files Sync: dominated by the EFS-backed network write latency for the modified blocks.

The S3 Files network writes are the dominant latency component on the write path. This is a design consequence of using a network filesystem.

## VFS vs Object Storage: A Fundamental Mismatch

The tension at the heart of this entire system is the impedance mismatch between what the POSIX/VFS model assumes about storage and what S3 actually provides.

### What VFS Assumes

The POSIX filesystem model makes the following guarantees that SQLite depends on:

**Byte-addressable, randomly writable files.** `pwrite(fd, buf, count, offset)` writes `count` bytes starting at `offset` into an open file. The bytes around the written region are unchanged. This is the assumption underlying every in-place page update SQLite makes.

**Atomic byte-range locking.** `fcntl(fd, F_SETLK, &flock)` atomically acquires or releases a byte-range advisory lock. The lock state is associated with the open file description (not just the file descriptor). Multiple processes on the same host coordinate through the kernel lock table. SQLite uses this for all its lock states: SHARED, RESERVED, PENDING, EXCLUSIVE.

**Visible mutation after write.** After `write()` returns, any subsequent `read()` from any process on the same host that has the file open will observe the new content - modulo page cache coherence, which is guaranteed on a single host by the kernel's unified buffer cache.

**Stable inode identity.** A file's inode number does not change while it is open. SQLite uses inode numbers to detect whether two file paths refer to the same file (important for WAL shared-memory coordination).

**Atomic rename.** `rename(src, dst)` atomically replaces `dst` with `src` in the directory. This is how SQLite's rollback journal mode atomically commits: write to a temp file, sync, rename.

### What S3 Actually Provides

**Whole-object replacement, not byte-addressable writes.** S3 Files abstracts this away, allowing standard POSIX partial writes over the network interface.

**No file locking.** S3 has no equivalent to `fcntl()` byte-range locking. The conditional write API (`If-None-Match`) provides a put-if-absent mechanism at the object level - a fundamentally coarser granularity.

**No inode identity.** S3 objects have no inode. The object key is the identifier. There is no concept of 'two paths to the same object.'

**No atomic rename.** S3 has no rename operation. Renaming requires copying then deleting. The window between copy-complete and delete creates a period where both names exist, or - if the delete happens first - a window where the object is temporarily absent. Neither is atomic. S3 CopyObject + DeleteObject is not transactional.

**Strong read-after-write consistency for individual objects.** After a successful `PutObject`, all subsequent `GetObject` and `HeadObject` requests for that key will observe the new version. Concurrent readers get either the old or the new version, never a partial write.

**Conditional writes as a locking primitive.** `PutObject` with `If-None-Match: *` provides optimistic lock acquisition at the object level. The first successful writer wins.

### The Design Consequence

The mismatch means that S3 cannot serve as a VFS backend for SQLite in any conventional sense. No amount of FUSE layering, NFS gateway translation, or S3-mount abstraction can provide byte-addressable random writes, atomic byte-range locking, and inode identity - because S3 does not implement those primitives. Mountpoint for Amazon S3 explicitly does not attempt to provide file locking, stating this as a documented limitation.

The design response is to stop trying to make S3 look like a filesystem, and instead redesign the architecture so that S3's actual primitives - atomic whole-object writes, strong read-after-write consistency, and conditional writes - become first-class design elements.

## SQLite Internals: The Substrate We Depend On

Understanding SQLite's internal architecture is not optional for this system. The design of S3-DB-WAL depends on specific SQLite behaviors that must be correct for the system to be safe.

### Page Layout

A SQLite database file is divided into fixed-size pages. The default page size is 4096 bytes. The first page is the database header page, which occupies the first 100 bytes of the first page and contains:

- Magic string: `'SQLite format 3\000'` (16 bytes)
- Page size in bytes (2 bytes, value 1 means 65536)
- File format write version (1=legacy, 2=WAL mode)
- File format read version (1=legacy, 2=WAL mode)
- Reserved space per page (usually 0)
- Maximum/minimum embedded payload fractions
- Leaf payload fraction
- File change counter (incremented on each database modification in journal mode; not used in WAL mode for change detection)
- Database size in pages
- First freelist trunk page
- Total freelist pages
- Schema cookie (incremented on schema change; clients invalidate prepared statements)
- Schema format number
- Default page cache size
- Largest B-tree root page number for auto-vacuum
- Text encoding (1=UTF-8, 2=UTF-16le, 3=UTF-16be)
- User version (settable via `PRAGMA user_version`)
- Incremental vacuum mode
- Application ID (settable via `PRAGMA application_id`)
- Reserved (20 bytes)
- Version-valid-for number
- SQLite version number

The **WAL mode flag** is stored in the file format write version and read version fields (value 2). A database opened in WAL mode has these fields set to 2. When WAL mode is disabled (by converting back to journal mode), these are reset to 1. This is how SQLite detects whether it should look for a `-wal` file on open.

### B-Tree Structure

SQLite uses two types of B-tree pages:

**Table B-trees (row store):** Interior pages contain keys (row IDs) and child page pointers. Leaf pages contain keys and associated data (the serialized row). The root page of each table is stored at a fixed page number recorded in the `sqlite_schema` table.

**Index B-trees:** Interior pages contain indexed column values and child page pointers. Leaf pages contain indexed column values and the associated row ID(s). No row data is stored in index B-trees.

**Page overflow:** When a row's serialized content exceeds what fits in a single leaf page, SQLite allocates overflow pages and chains them with 4-byte next-page pointers. The first 4 bytes of an overflow page are the next-overflow-page number (0 if last), followed by payload.

**Freelist:** Deleted rows and pages are tracked in a freelist. The freelist is organized as trunk pages (each containing an array of free page numbers) linked together. The first freelist trunk page number is in the database header. SQLite's VACUUM operation rebuilds the database file, compacting the freelist.

**The implication for S3-DB-WAL:** page-level granularity means that a single row INSERT modifies at minimum the B-tree leaf page containing the row, and possibly parent pages if a split occurs. Checkpointing flushes all modified pages from the WAL back to the main file. For S3-DB-WAL, S3 Files handles the page-level streaming behind the scenes.

### WAL Architecture

WAL mode changes SQLite's write strategy from modify-in-place (with a rollback journal) to append-only (with a write-ahead log).

**The WAL file (`.db-wal`)** is a sequence of WAL frames. Each frame writes a complete 4096-byte page at a specific page number. The same page number may appear in multiple frames - the last frame for a given page number is authoritative. Reading a page requires checking the WAL for the latest frame containing that page number before falling back to the main database file.

**The WAL index (`.db-shm`)** is a memory-mapped file that indexes the WAL contents. It maps page numbers to the position of their last frame in the WAL. SQLite uses this to avoid reading the entire WAL on every page lookup. The WAL index is not a write-ahead log for the WAL itself - it is a cache of the WAL's contents, rebuilt by WAL recovery if found to be inconsistent.

**Transactions in WAL mode:**

1. A write transaction acquires a SHARED lock on the database, then a RESERVED lock, then writes WAL frames, then acquires a PENDING lock, then an EXCLUSIVE lock, then commits (by flushing the WAL).
2. Concurrent readers take a SHARED lock and read a consistent snapshot corresponding to the last committed transaction visible to them.
3. WAL reads do not block on WAL writes. Multiple readers and one writer can operate simultaneously.

**EXCLUSIVE locking mode and WAL:**

When `PRAGMA locking_mode = EXCLUSIVE` is set before entering WAL mode, SQLite holds an exclusive lock on the database file for the lifetime of the connection. No other connection can read or write. The `-shm` file is not used - the WAL index is held in heap memory. This is the mode I use for the writer connection in S3-DB-WAL. It eliminates the need for cross-process WAL index coherence.

**Checkpointing:**

`PRAGMA wal_checkpoint(MODE)` transfers WAL frames back to the main database file. Modes:

- `PASSIVE`: Checkpoint whatever can be checkpointed without blocking. Non-blocking.
- `FULL`: Checkpoint all frames, waiting for readers to release SHARED locks. Does not truncate.
- `RESTART`: Like FULL, plus waits for all frames to be overwritten before truncating.
- `TRUNCATE`: Like RESTART, plus truncates the WAL file to zero bytes after checkpoint.

I use `TRUNCATE` to ensure the committed database file is the complete, authoritative state with no outstanding WAL frames.

## WAL and SHM Memory Layout

This section provides the memory-layout detail needed to understand why WAL mode cannot be shared across Lambda execution environments via any network-mounted path, including EFS or S3 Files.

### The `.db` File (Main Database)

```
┌─────────────────────────────────────────────┐
│ Page 1: Database Header (100 bytes) +       │
│          First B-tree page (root of         │
│          sqlite_schema table)               │
├─────────────────────────────────────────────┤
│ Page 2: Root page of first user table       │
├─────────────────────────────────────────────┤
│ Page 3: First index root page               │
├─────────────────────────────────────────────┤
│ ...                                         │
├─────────────────────────────────────────────┤
│ Page N: Last allocated page                 │
└─────────────────────────────────────────────┘
  Each page: exactly 4096 bytes (default)
  File size: N * 4096 bytes
```

The file format write version byte at offset 18 is `0x02` when WAL mode is active, signaling to any reader that WAL recovery should be performed on open.

### The `.db-wal` File (Write-Ahead Log)

```
┌─────────────────────────────────────────────┐
│ WAL Header (32 bytes)                       │
│   magic:       0x377f0682 or 0x377f0683     │
│   file-format: 3007000                      │
│   page-size:   4096                         │
│   ckpt-seq:    N (checkpoint sequence)      │
│   salt-1:      random 32-bit value          │
│   salt-2:      random 32-bit value          │
│   checksum-1                                │
│   checksum-2                                │
├─────────────────────────────────────────────┤
│ Frame 1 Header (24 bytes)                   │
│   page-number: 3                            │
│   db-size:     0 (0 = not a commit frame)   │
│   salt-1, salt-2, checksum-1, checksum-2    │
├─────────────────────────────────────────────┤
│ Frame 1 Data (4096 bytes): Page 3 content   │
├─────────────────────────────────────────────┤
│ Frame 2 Header (24 bytes)                   │
│   page-number: 7                            │
│   db-size:     1024 (commit frame: db size  │
│                at commit time, in pages)    │
│   salt-1, salt-2, checksum-1, checksum-2    │
├─────────────────────────────────────────────┤
│ Frame 2 Data (4096 bytes): Page 7 content   │
└─────────────────────────────────────────────┘
```

A **commit frame** is a WAL frame where `db-size` is non-zero. It marks the end of a committed transaction. Readers scan forward from the beginning of the WAL, applying frames, until they find a commit frame for the transaction they want to read.

**After `PRAGMA wal_checkpoint(TRUNCATE)`:**
The WAL file's content is effectively zeroed (truncated to zero bytes). The main database file contains all committed data. Any reader opening the database after this checkpoint finds no WAL frames to apply and reads entirely from the main file. This is the state in which the database rests on the mount.

### The `.db-shm` File (WAL Index / Shared Memory)

The `.db-shm` file is 32KB by default (8 x 4KB pages), memory-mapped by all processes that have the database open in WAL mode (except EXCLUSIVE locking mode connections).

```
┌─────────────────────────────────────────────┐
│ WAL Index Header (2 copies, 48 bytes each)  │
│   iVersion:    3007000                      │
│   iChange:     monotonically increasing     │
│   isInit:      1 when initialized           │
│   bigEndCksum: byte-order flag              │
│   szPage:      4096                         │
│   mxFrame:     last frame in WAL            │
│   nPage:       database size in pages       │
│   aFrameCksum: last commit frame checksum   │
│   aSalt:       copy of WAL salt values      │
│   aCksum:      checksum of header           │
├─────────────────────────────────────────────┤
│ Lock Bytes (136 bytes at offset 120)        │
│   WAL_WRITE_LOCK:    byte 0 (offset 120)    │
│   WAL_ALL_BUT_WRITE: bytes 1-4              │
│   WAL_CKPT_LOCK:     byte 5                 │
│   WAL_RECOVER_LOCK:  byte 6                 │
│   WAL_READ_LOCK(0-4): bytes 7-11            │
├─────────────────────────────────────────────┤
│ Hash Index (remaining space)                │
│   Two-level hash table mapping page numbers │
│   to WAL frame numbers                      │
│   Level 1: first 4096 frames               │
│   Level 2+: overflow pages                 │
└─────────────────────────────────────────────┘
```

**The lock bytes** are the specific byte positions within the `.db-shm` file that SQLite uses for `fcntl()` byte-range locking. When process A holds a SHARED read lock, it holds a shared `fcntl` lock on one of `WAL_READ_LOCK(0-4)`. When a writer is active, it holds an exclusive lock on `WAL_WRITE_LOCK`.

**Why this cannot work across network filesystems:**

When process A (on host 1) and process B (on host 2) both mmap the `.db-shm` file via NFS or EFS, they each get a separate process-local mapping. Writes by process A to its mmap are visible to other processes on host 1 via the kernel's unified buffer cache. They are NOT automatically visible to process B on host 2. NFS coherence for mmap'd files requires the NFSv4 delegation protocol and a distributed lock manager - and even with these, the semantics differ from the kernel's local unified buffer cache in ways that break SQLite's lock-byte coordination.

**In EXCLUSIVE locking mode:** The `.db-shm` file is not created or used. SQLite allocates the WAL index in process heap memory. This is the mode that makes S3-DB-WAL viable - the writer has no need for a cross-process shared memory segment.

## mmap() Deep look

Understanding `mmap()` is essential to understanding both why WAL mode fails across network filesystems and how SQLite's optional memory-mapped I/O mode works.

### What mmap() Does

`mmap(addr, length, prot, flags, fd, offset)` creates a mapping of a file (or anonymous memory) into the calling process's virtual address space. For the `.db-shm` case:

```c
shm_mapping = mmap(0, 32768, PROT_READ|PROT_WRITE, MAP_SHARED, shm_fd, 0);
```

- `MAP_SHARED`: writes to the mapped region are reflected back to the file and visible to other processes that have the same file mapped with `MAP_SHARED`.
- `PROT_READ|PROT_WRITE`: the region is readable and writable.
- The returned pointer `shm_mapping` is a virtual address in the process's address space.

When process A writes `shm_mapping[0] = 1`, the CPU writes to the virtual address, which maps to a physical page. On Linux, `MAP_SHARED` mappings are backed by the page cache. The write is immediately visible to all other processes that have the same file's page at the same physical page - which, on a single host, is guaranteed by the kernel's page cache.

### Page Faults and Dirty Pages

The first access to a memory-mapped page causes a **page fault**. The kernel's fault handler (`handle_mm_fault()`) looks up the file offset in the page cache. If the page is present (cache hit), the kernel maps the physical page into the process's page table and returns. If absent (cache miss), the kernel reads the page from the underlying storage (a file read), adds it to the page cache, and maps it.

When the process writes to the mapped page, the CPU sets the page table entry's dirty bit. The page is now a **dirty page** in the page cache. The kernel's writeback mechanism (`pdflush` in older kernels; `bdi-write` threads in modern kernels) periodically writes dirty pages to the underlying storage. For `MAP_SHARED` mappings, `msync(MAP_SYNC)` or `msync(MS_SYNC)` forces synchronous writeback.

### SQLite's mmap() Mode for the Main Database

Separately from the `.db-shm` WAL index, SQLite supports memory-mapped I/O for the main database file, controlled by `PRAGMA mmap_size`. When `mmap_size > 0`, SQLite maps up to that many bytes of the database file into the process address space and reads pages directly from the mapped region rather than using `read()` system calls.

For S3-DB-WAL, I **disable** `mmap_size` (`PRAGMA mmap_size = 0`) because:

1. The database file is on a network mount. The additional mmap layer provides no performance benefit.
2. For the write path, the database file is regularly replaced (overwritten on checkpoint upload). An existing mmap of the file would need to be unmapped and remapped. Managing this correctly requires careful lifecycle handling.
3. With `mmap_size = 0`, all reads go through `read()` system calls into SQLite's pager cache. The pager cache size is controlled by `PRAGMA cache_size` and provides a higher-level caching layer that is more predictable to reason about.

### mmap() Coherence Across Hosts - Why It Fails

When two processes on different hosts both map the same file via NFS, each process's kernel maps a local page-cache copy of the file. There is no shared physical memory. Changes by process A that reach the NFS server may or may not be visible to process B's mapping, depending on NFS attribute caching (`acregmin`, `acregmax`, `actimeo`) and NFSv4 delegation state.

SQLite's WAL index lock bytes - the `fcntl()` byte-range locks on the `.db-shm` file - depend on the kernel's lock table, not on memory visibility. On NFS, `fcntl()` locks are forwarded to the NFS server's lock manager via the NLM (Network Lock Manager) protocol on NFSv3, or integrated into the NFSv4 lease mechanism. But the consistency between the `fcntl()` lock state and the visible content of the mmap'd region is not guaranteed to be atomic or instantaneous across the network. This is the documented reason SQLite warns against WAL on network filesystems.

## POSIX Locking Deep look

SQLite's locking protocol is one of its most carefully designed subsystems. Understanding it precisely is necessary to understand both why EXCLUSIVE mode enables S3-DB-WAL and what safety properties the lease protocol must provide.

### Lock Types and the POSIX Layer

SQLite uses `fcntl(fd, F_SETLK/F_SETLKW, &lock)` for all file locking on POSIX systems. The lock structure specifies:

- `l_type`: `F_RDLCK` (shared/read), `F_WRLCK` (exclusive/write), `F_UNLCK` (release)
- `l_whence`, `l_start`, `l_len`: byte range being locked (SQLite uses specific byte offsets in the database file for each lock type)
- `l_pid`: set by the kernel to the locking process's PID

`fcntl()` advisory locks are:

- **Advisory, not mandatory.** Any process can open and read/write the file regardless of locks. Locking only affects other callers of `fcntl()` lock operations. SQLite relies on the convention that all SQLite instances cooperate through this protocol.
- **Associated with the open file description.** A single process can hold multiple file descriptors to the same file, and each `(pid, inode)` pair has a single lock state. Closing any file descriptor to a file releases all `fcntl` advisory locks that process holds on that file. This is a notorious footgun and is why SQLite's unix VFS maintains careful tracking of open file descriptors.
- **Not inherited by fork().** A child process does not inherit `fcntl` locks from its parent.

### SQLite's Database File Lock States

SQLite's database file locking protocol (for non-WAL, journal mode) uses five states, implemented by `fcntl()` locks on specific byte ranges of the database file:

| State     | `fcntl` Type | Byte Range (in database file) | Semantics                                    |
| --------- | ------------ | ----------------------------- | -------------------------------------------- |
| UNLOCKED  | -            | -                             | No lock held                                 |
| SHARED    | F_RDLCK      | [reading offset range]        | Reading; multiple holders                    |
| RESERVED  | F_WRLCK      | [single byte]                 | Will write soon; one holder; readers allowed |
| PENDING   | F_WRLCK      | [single byte]                 | Waiting for readers to drain; no new SHARED  |
| EXCLUSIVE | F_WRLCK      | [full range]                  | Writing; no other readers or writers         |

The exact byte offsets (documented as implementation details in `os_unix.c`): SHARED lock covers a range of 510 bytes starting at offset `SHARED_FIRST = 0x40000000`. RESERVED lock is byte `SHARED_FIRST + 510`. PENDING lock is byte `SHARED_FIRST + 511`. EXCLUSIVE lock covers the same range as SHARED.

### WAL Lock States

In WAL mode, the main database file lock protocol is simplified (no RESERVED/PENDING/EXCLUSIVE cycling for writes), and a separate WAL-specific lock protocol operates on byte ranges within the `.db-shm` file:

| Lock                | Byte in .db-shm | Holder                         | Blocks                             |
| ------------------- | --------------- | ------------------------------ | ---------------------------------- |
| WAL_WRITE_LOCK      | 120             | Current WAL writer (exclusive) | All other write attempts           |
| WAL_CKPT_LOCK       | 125             | Checkpoint runner (exclusive)  | Other checkpoint attempts          |
| WAL_RECOVER_LOCK    | 126             | Recovery runner (exclusive)    | Other recovery attempts            |
| WAL_READ_LOCK(0)    | 127             | Read-transaction (shared)      | Checkpoint when WAL frames present |
| WAL_READ_LOCK(1..4) | 128..131        | Additional read slots          | Varies                             |

A reader chooses one of `WAL_READ_LOCK(0-4)` slots. The checkpoint runner cannot checkpoint past any frame that is newer than the oldest read-lock holder's snapshot. This is WAL's snapshot isolation mechanism.

### EXCLUSIVE Locking Mode and Lock Elimination

When `PRAGMA locking_mode = EXCLUSIVE` is set, SQLite acquires and holds the EXCLUSIVE lock on the database file for the lifetime of the connection. On the first operation that requires a write lock:

1. Acquire SHARED lock.
2. Acquire RESERVED lock.
3. Acquire PENDING lock.
4. Acquire EXCLUSIVE lock.

And never releases these locks until the connection is closed. Subsequent transactions reuse the held EXCLUSIVE lock without acquiring/releasing. For WAL mode specifically, the WAL write lock is similarly held. No `.db-shm` is created; the WAL index lives in heap.

**The distributed safety consequence:** In EXCLUSIVE locking mode, the distributed lease is responsible for ensuring that no two Lambda execution environments hold EXCLUSIVE connections to the database simultaneously. The lease protocol (S3 `If-None-Match` on the lock object) provides this guarantee. If two environments somehow both obtained the lease (a lease expiry race, discussed later in this post), both would attempt to proceed with EXCLUSIVE WAL connections on the same S3 Files mount. At this point, the underlying EFS NFSv4.1 file locking would act as the final backstop, preventing the second connection from acquiring the `fcntl()` EXCLUSIVE lock on the database file, ensuring safety.

### flock() vs fcntl()

`flock(fd, LOCK_SH|LOCK_EX|LOCK_UN)` is an alternative locking API that operates on entire files, not byte ranges. SQLite's default unix VFS uses `fcntl()`, not `flock()`. SQLite provides an alternative VFS (`unix-flock`) that uses `flock()` for environments where `fcntl()` is broken (notably some NFS implementations). EFS documents support for both NFSv4.1 advisory locking (which maps to `fcntl()`) and whole-file locking.

For S3-DB-WAL, neither `flock()` nor `fcntl()` across hosts is in scope - the lease mechanism is in S3, not in any filesystem locking primitive.

## Page Cache Deep look

The kernel page cache is the in-memory buffer for filesystem I/O. Its behavior determines the performance characteristics of SQLite reads and the durability semantics of SQLite writes.

### Clean and Dirty Pages

**Clean pages** contain data that matches what is on disk (or in the backing store). Reading from a clean page is a memory access - no I/O. **Dirty pages** contain modifications not yet written to the backing store. The kernel's writeback mechanism flushes dirty pages asynchronously.

For S3-DB-WAL, 'on disk' means Lambda's `/mnt/s3` mount storage. Whether that storage is RAM (as with tmpfs) or a block device is not documented by AWS. The observable behavior is that reads after writes are immediate and consistent - which is expected for any local filesystem.

### Writeback and Durability

SQLite's `xSync()` VFS method calls `fdatasync()` to flush dirty pages for a specific file to the underlying storage device. `fdatasync()` does not return until the kernel guarantees the data is durable to the storage medium.

**If `/tmp` is tmpfs:** `fdatasync()` is effectively a no-op for durability - the data is in DRAM. A Lambda environment termination loses ephemeral contents. This is acceptable because durability in S3-DB-WAL is provided by S3, not by `/tmp`.

**The durability model is explicit:** data is durable only after a successful S3 Files fsync/commit.

### Readahead

The Linux kernel performs readahead on sequential file access - prefetching pages it predicts will be needed soon. For SQLite's B-tree access pattern (which is largely random across pages), readahead provides limited benefit. SQLite's pager does its own prefetch logic when using `PRAGMA cache_size` to maintain a working set of pages in SQLite's internal page cache, which operates above the kernel page cache.

For S3-DB-WAL, the most important caching tier is SQLite's pager cache, set via `PRAGMA cache_size = -65536` (64MB of cache, specified as negative KB value). For a Lambda function serving a read-heavy workload, this cache keeps frequently accessed pages in SQLite's heap rather than incurring a `read()` syscall to the kernel page cache on every access.

### Cache Invalidation for S3-DB-WAL

**The critical scenario:** A warm Lambda reader has SQLite's pager cache populated. A writer in another execution environment commits a new database generation to S3. The reader's local copy is now stale.

My invalidation mechanism:

With S3 Files, cache invalidation is handled by the underlying NFS client and SQLite's internal mechanisms. The VFS transparently fetches updated pages from the mount when the file changes.

## NFS and Network Filesystem Behavior

I include this section because EFS, S3 Files, and Mountpoint are all network filesystems of one form or another, and the NFS protocol's semantics explain why they cannot serve as WAL-mode SQLite hosts.

### NFSv4.1 Protocol Operations

NFSv4.1 defines these relevant compound operations (per RFC 5661):

**OPEN / CLOSE:** Opening a file negotiates a delegation (optional): the server can grant a read delegation (client can serve reads without contacting server) or a write delegation (client has exclusive write rights). Delegations are recalled when conflicts arise.

**READ / WRITE:** Reads and writes transfer data between client and server. NFSv4 write semantics: `WRITE` with `FILE_SYNC4` mode guarantees data is stable on the server before the operation returns. Without `FILE_SYNC4`, data may be cached at the NFS server and flushed via `COMMIT`.

**LOCK / LOCKU:** NFSv4 incorporates byte-range locking into the protocol (unlike NFSv3's external NLM). A `LOCK` operation specifies a byte range and lock type (READ, WRITE). The server serializes lock grants. EFS implements these and advertises POSIX advisory lock support.

**COMMIT:** Flushes data previously written in UNSTABLE mode to stable storage on the NFS server. Analogous to `fsync()` on the client side, but applies to server-buffered writes.

### The Fundamental NFS/SQLite WAL Problem

The NFS problem for SQLite WAL has two distinct layers:

**Layer 1: Lock semantics.** `fcntl()` byte-range locks on the `.db-shm` file are forwarded to the NFS server's lock manager via NFSv4's LOCK operation. Two clients cannot simultaneously hold conflicting locks - the server enforces this. So the locking mechanism itself is correct: if process A holds WAL_WRITE_LOCK on `.db-shm`, process B's attempt to acquire WAL_WRITE_LOCK will block or fail.

**Layer 2: Memory coherence.** Even with correct lock semantics, the mmap'd content of `.db-shm` is not coherent across hosts. When process A (on Lambda host 1) writes to its mmap of `.db-shm` - updating the WAL index hash table to reflect new frames - that write goes into host 1's page cache. Process B (on Lambda host 2) has its own page cache copy of `.db-shm`. Before process B reads the WAL index, it must receive the updated pages from the NFS server. But NFS caching (attribute caching, data caching at `rsize`/`wsize` granularity) means process B may read stale WAL index data even after process A's update has been committed to the NFS server.

**Documented behavior (AWS EFS):** EFS documentation states 'All file system operations occur synchronously relative to the application.' This refers to NFSv4.1 CLOSE-TO-OPEN cache consistency: data written and closed by one client is visible to any other client that subsequently opens the file. This is sufficient for sequential read-after-write but NOT for the concurrent mmap-based WAL index coordination that SQLite WAL requires.

**The conclusion:** EFS + SQLite WAL with multiple concurrent writers across Lambda execution environments is not safe. EFS + SQLite in EXCLUSIVE locking mode (single writer) is possible but slower than local memory cache for every read and write, adds VPC complexity, and provides no concurrency advantage over S3-DB-WAL.

## Lambda Runtime Internals

Lambda's execution environment model has direct implications for every aspect of S3-DB-WAL's design - from cold start latency to connection lifecycle to the SnapStart hazard.

### Execution Environment Lifecycle

AWS Lambda creates an isolated execution environment (a Firecracker MicroVM) for each function. The lifecycle has three phases:

**Init phase:** Lambda downloads the function package, extracts it, runs the runtime bootstrap, and executes the handler module's top-level code (everything outside the handler function). This is the 'cold start.' Cold start duration varies by runtime: Python and Node.js cold starts are typically faster than JVM-based runtimes; the function package size, layer size, and VPC attachment each affect duration. VPC-attached functions have historically had longer cold starts due to ENI attachment, though AWS's hyperplane ENI improvements (introduced 2019) significantly reduced this.

**Invoke phase:** Lambda executes the handler function for each incoming request. The handler receives an event and context object.

**Shutdown phase:** Lambda sends a SIGTERM signal to the runtime, allowing graceful cleanup, then terminates the environment. Shutdown typically follows a period of inactivity.

**Warm start:** When a request arrives for a function that has a running (but idle) execution environment, Lambda sends the event to that environment without re-running the Init phase. Code outside the handler (module-level globals, connection objects, cached state) persists across warm invocations. For S3-DB-WAL, this means:

- The local `/mnt/s3/db.sqlite` file persists across warm invocations within the same environment.
- The cached generation ETag (stored in a process-level global) persists.
- The boto3 S3 client object (with its TCP connection pool) persists.

I exploit this aggressively: since we query directly against the S3 Files mount, there is no expensive download phase. The filesystem handles caching the reads automatically.

### Cold Start Hydration Latency

The cold start for a reader Lambda in S3-DB-WAL has this sequence:

```mermaid
sequenceDiagram
    participant API as API Gateway
    participant L as Lambda Reader
    participant S3 as S3 Files Mount

    API->>L: First Request
    Note over L: Init Phase: Load Runtime
    L->>S3: POSIX Mount Initialized
    L->>S3: sqlite3_open_v2()
    S3-->>L: Stream requested DB pages (Zero Download)
    L->>L: Execute Query
    L-->>API: Return Result
```

Cold start initialization latency is significantly reduced because we query directly against the S3 Files mount instead of downloading the entire database to `/tmp` first.

I mitigate cold start latency with Lambda Provisioned Concurrency for latency-sensitive reader functions, where the Init phase (including initial initialization) runs before any request arrives.

### SnapStart Hazard

Lambda SnapStart (initially Java-only, later expanded to Python and .NET managed runtimes) takes a snapshot of the execution environment after the Init phase and resumes multiple execution environments from that snapshot.

**The hazard for S3-DB-WAL:**

SnapStart snapshots capture the execution environment state, including open file descriptors. If a connection to the S3 Files mount is open when the snapshot is taken, it will be broken upon resume.

**The mitigation:** Register a Lambda SnapStart hook (using the `RestoreHook` interface for Python, or equivalent for other runtimes) that clears the generation ETag cache and forces a re-check on first invocation after restore. Additionally, do NOT pre-hydrate the database during the Init phase for SnapStart functions - hydrate lazily on first invocation, after the generation check.

### Concurrency and Execution Environment Count

Lambda scales out by creating additional execution environments. Each environment is independent: it has its own memory space, its own process space, its own SQLite connection. Ten concurrent requests may be handled by ten separate execution environments, each reading from the mount.

For readers: this is safe. Each reader serves a consistent snapshot from its local copy, refreshed to the latest S3 generation when staleness is detected.

For writers: this is where the lease protocol matters. If ten Lambda functions all try to write simultaneously, the lease protocol guarantees exactly one succeeds per commit cycle. The other nine receive 412 or 409 from S3 and must retry. The system degrades gracefully under write concurrency - write latency increases with contention, but correctness is maintained.

## Production Infrastructure

I verified every Terraform resource against the AWS provider documentation before inclusion. No fictional resources are present.

```hcl
terraform {
  required_version = '>= 1.5'
  required_providers {
    aws = {
      source  = 'hashicorp/aws'
      version = '~> 5.0'
    }
  }
}

# ─────────────────────────────────────────────
# S3 Bucket - Database Storage
# ─────────────────────────────────────────────

resource 'aws_s3_bucket' 'db' {
  bucket        = '${var.project_name}-s3-db-wal-${random_id.suffix.hex}'
  force_destroy = false

  tags = {
    Project     = var.project_name
    Environment = var.environment
    ManagedBy   = 'terraform'
  }
}

resource 'random_id' 'suffix' {
  byte_length = 4
}

resource 'aws_s3_bucket_versioning' 'db' {
  bucket = aws_s3_bucket.db.id
  versioning_configuration {
    status = 'Enabled'
  }
}

resource 'aws_s3_bucket_server_side_encryption_configuration' 'db' {
  bucket = aws_s3_bucket.db.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = 'aws:kms'
      kms_master_key_id = aws_kms_key.db.arn
    }
    bucket_key_enabled = true
  }
}

resource 'aws_s3_bucket_public_access_block' 'db' {
  bucket                  = aws_s3_bucket.db.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource 'aws_s3_bucket_lifecycle_configuration' 'db' {
  bucket = aws_s3_bucket.db.id

  rule {
    id     = 'retain-versions'
    status = 'Enabled'
    filter { prefix = 'db/' }

    noncurrent_version_transition {
      noncurrent_days = 7
      storage_class   = 'GLACIER_IR'
    }
    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }

  rule {
    id     = 'expire-stale-locks'
    status = 'Enabled'
    filter { prefix = 'db/' suffix = '/write-lock' }
    # Emergency backstop: expire locks older than 24h
    # Normal lock release is via DeleteObject in application code.
    expiration {
      days = 1
    }
  }
}

resource 'aws_s3_bucket_notification' 'db' {
  bucket = aws_s3_bucket.db.id
  # Emit EventBridge events on all object operations for audit trail
  eventbridge = true
}

# ─────────────────────────────────────────────
# KMS Key - Database Encryption
# ─────────────────────────────────────────────

resource 'aws_kms_key' 'db' {
  description             = 'S3-DB-WAL database encryption key'
  deletion_window_in_days = 30
  enable_key_rotation     = true

  policy = jsonencode({
    Version = '2012-10-17'
    Statement = [
      {
        Sid       = 'EnableIAMUserPermissions'
        Effect    = 'Allow'
        Principal = { AWS = 'arn:aws:iam::${data.aws_caller_identity.current.account_id}:root' }
        Action    = 'kms:*'
        Resource  = '*'
      },
      {
        Sid    = 'AllowLambdaEncryptDecrypt'
        Effect = 'Allow'
        Principal = {
          AWS = [
            aws_iam_role.reader.arn,
            aws_iam_role.writer.arn
          ]
        }
        Action   = ['kms:GenerateDataKey', 'kms:Decrypt']
        Resource = '*'
      }
    ]
  })
}

resource 'aws_kms_alias' 'db' {
  name          = 'alias/${var.project_name}-s3-db-wal'
  target_key_id = aws_kms_key.db.key_id
}

# ─────────────────────────────────────────────
# IAM - Reader Role
# ─────────────────────────────────────────────

resource 'aws_iam_role' 'reader' {
  name = '${var.project_name}-s3-db-wal-reader'
  assume_role_policy = jsonencode({
    Version = '2012-10-17'
    Statement = [{
      Effect    = 'Allow'
      Principal = { Service = 'lambda.amazonaws.com' }
      Action    = 'sts:AssumeRole'
    }]
  })
}

resource 'aws_iam_policy' 'reader' {
  name = '${var.project_name}-s3-db-wal-reader'
  policy = jsonencode({
    Version = '2012-10-17'
    Statement = [
      {
        Sid    = 'S3ReadDatabase'
        Effect = 'Allow'
        Action = [
          's3:GetObject',
          's3:HeadObject',
          's3:GetObjectVersion'
        ]
        Resource = [
          '${aws_s3_bucket.db.arn}/db/*'
        ]
      },
      {
        Sid    = 'CloudWatchLogs'
        Effect = 'Allow'
        Action = [
          'logs:CreateLogGroup',
          'logs:CreateLogStream',
          'logs:PutLogEvents'
        ]
        Resource = 'arn:aws:logs:*:*:*'
      },
      {
        Sid    = 'XRayTracing'
        Effect = 'Allow'
        Action = [
          'xray:PutTraceSegments',
          'xray:PutTelemetryRecords'
        ]
        Resource = '*'
      }
    ]
  })
}

resource 'aws_iam_role_policy_attachment' 'reader' {
  role       = aws_iam_role.reader.name
  policy_arn = aws_iam_policy.reader.arn
}

# ─────────────────────────────────────────────
# IAM - Writer Role
# ─────────────────────────────────────────────

resource 'aws_iam_role' 'writer' {
  name = '${var.project_name}-s3-db-wal-writer'
  assume_role_policy = jsonencode({
    Version = '2012-10-17'
    Statement = [{
      Effect    = 'Allow'
      Principal = { Service = 'lambda.amazonaws.com' }
      Action    = 'sts:AssumeRole'
    }]
  })
}

resource 'aws_iam_policy' 'writer' {
  name = '${var.project_name}-s3-db-wal-writer'
  policy = jsonencode({
    Version = '2012-10-17'
    Statement = [
      {
        Sid    = 'S3WriteLock'
        Effect = 'Allow'
        Action = [
          's3:PutObject',
          's3:DeleteObject',
          's3:GetObject',
          's3:HeadObject'
        ]
        Resource = [
          '${aws_s3_bucket.db.arn}/db/*/write-lock'
        ]
      },
      {
        Sid    = 'S3WriteDatabase'
        Effect = 'Allow'
        Action = [
          's3:PutObject',
          's3:GetObject',
          's3:HeadObject',
          's3:GetObjectVersion'
        ]
        Resource = [
          '${aws_s3_bucket.db.arn}/db/*/current.db',

        ]
      },
      {
        Sid    = 'CloudWatchLogs'
        Effect = 'Allow'
        Action = [
          'logs:CreateLogGroup',
          'logs:CreateLogStream',
          'logs:PutLogEvents'
        ]
        Resource = 'arn:aws:logs:*:*:*'
      },
      {
        Sid    = 'XRayTracing'
        Effect = 'Allow'
        Action = [
          'xray:PutTraceSegments',
          'xray:PutTelemetryRecords'
        ]
        Resource = '*'
      }
    ]
  })
}

resource 'aws_iam_role_policy_attachment' 'writer' {
  role       = aws_iam_role.writer.name
  policy_arn = aws_iam_policy.writer.arn
}

# ─────────────────────────────────────────────
# CloudWatch Log Groups
# ─────────────────────────────────────────────

resource 'aws_cloudwatch_log_group' 'reader' {
  name              = '/aws/lambda/${var.project_name}-s3-db-wal-reader'
  retention_in_days = 30
}

resource 'aws_cloudwatch_log_group' 'writer' {
  name              = '/aws/lambda/${var.project_name}-s3-db-wal-writer'
  retention_in_days = 30
}

# ─────────────────────────────────────────────
# Lambda Functions
# ─────────────────────────────────────────────

resource 'aws_lambda_function' 'reader' {
  function_name = '${var.project_name}-s3-db-wal-reader'
  role          = aws_iam_role.reader.arn
  handler       = 'handler.reader'
  runtime       = 'python3.12'
  timeout       = 30
  memory_size   = 1024

  filename         = data.archive_file.reader.output_path
  source_code_hash = data.archive_file.reader.output_base64sha256

  ephemeral_storage {
    size = 2048  # 2 GB; supports databases up to ~1.5 GB
  }

  environment {
    variables = {
      DB_BUCKET   = aws_s3_bucket.db.bucket
      DB_NAME     = var.db_name
      LOG_LEVEL   = 'INFO'
    }
  }

  tracing_config {
    mode = 'Active'
  }

  depends_on = [aws_cloudwatch_log_group.reader]
}

resource 'aws_lambda_function' 'writer' {
  function_name = '${var.project_name}-s3-db-wal-writer'
  role          = aws_iam_role.writer.arn
  handler       = 'handler.writer'
  runtime       = 'python3.12'
  timeout       = 60  # Maximum write transaction window
  memory_size   = 1024

  filename         = data.archive_file.writer.output_path
  source_code_hash = data.archive_file.writer.output_base64sha256

  ephemeral_storage {
    size = 2048
  }

  # Writer has reserved concurrency = 1 to prevent concurrent write attempts
  # at the Lambda layer (belt-and-suspenders; S3 conditional writes are the
  # actual safety net).
  reserved_concurrent_executions = 1

  environment {
    variables = {
      DB_BUCKET       = aws_s3_bucket.db.bucket
      DB_NAME         = var.db_name
      LEASE_TTL_SEC   = '45'
      LOG_LEVEL       = 'INFO'
    }
  }

  tracing_config {
    mode = 'Active'
  }

  depends_on = [aws_cloudwatch_log_group.writer]
}

# ─────────────────────────────────────────────
# API Gateway HTTP API
# ─────────────────────────────────────────────

resource 'aws_apigatewayv2_api' 'main' {
  name          = '${var.project_name}-s3-db-wal'
  protocol_type = 'HTTP'

  cors_configuration {
    allow_origins = var.allowed_origins
    allow_methods = ['GET', 'POST', 'OPTIONS']
    allow_headers = ['Content-Type', 'Authorization']
    max_age       = 300
  }
}

resource 'aws_apigatewayv2_stage' 'main' {
  api_id      = aws_apigatewayv2_api.main.id
  name        = var.environment
  auto_deploy = true

  access_log_settings {
    destination_arn = aws_cloudwatch_log_group.apigw.arn
    format = jsonencode({
      requestId      = '$context.requestId'
      requestTime    = '$context.requestTime'
      httpMethod     = '$context.httpMethod'
      routeKey       = '$context.routeKey'
      status         = '$context.status'
      responseLength = '$context.responseLength'
      latency        = '$context.responseLatency'
      error          = '$context.error.message'
    })
  }
}

resource 'aws_cloudwatch_log_group' 'apigw' {
  name              = '/aws/apigateway/${var.project_name}-s3-db-wal'
  retention_in_days = 14
}

resource 'aws_apigatewayv2_integration' 'reader' {
  api_id                 = aws_apigatewayv2_api.main.id
  integration_type       = 'AWS_PROXY'
  integration_uri        = aws_lambda_function.reader.invoke_arn
  payload_format_version = '2.0'
}

resource 'aws_apigatewayv2_integration' 'writer' {
  api_id                 = aws_apigatewayv2_api.main.id
  integration_type       = 'AWS_PROXY'
  integration_uri        = aws_lambda_function.writer.invoke_arn
  payload_format_version = '2.0'
}

resource 'aws_apigatewayv2_route' 'query' {
  api_id    = aws_apigatewayv2_api.main.id
  route_key = 'POST /query'
  target    = 'integrations/${aws_apigatewayv2_integration.reader.id}'
}

resource 'aws_apigatewayv2_route' 'execute' {
  api_id    = aws_apigatewayv2_api.main.id
  route_key = 'POST /execute'
  target    = 'integrations/${aws_apigatewayv2_integration.writer.id}'
}

resource 'aws_lambda_permission' 'reader_apigw' {
  statement_id  = 'AllowAPIGatewayInvoke'
  action        = 'lambda:InvokeFunction'
  function_name = aws_lambda_function.reader.function_name
  principal     = 'apigateway.amazonaws.com'
  source_arn    = '${aws_apigatewayv2_api.main.execution_arn}/*/*'
}

resource 'aws_lambda_permission' 'writer_apigw' {
  statement_id  = 'AllowAPIGatewayInvoke'
  action        = 'lambda:InvokeFunction'
  function_name = aws_lambda_function.writer.function_name
  principal     = 'apigateway.amazonaws.com'
  source_arn    = '${aws_apigatewayv2_api.main.execution_arn}/*/*'
}

# ─────────────────────────────────────────────
# CloudWatch Alarms
# ─────────────────────────────────────────────

resource 'aws_cloudwatch_metric_alarm' 'writer_errors' {
  alarm_name          = '${var.project_name}-writer-errors'
  comparison_operator = 'GreaterThanThreshold'
  evaluation_periods  = 1
  metric_name         = 'Errors'
  namespace           = 'AWS/Lambda'
  period              = 60
  statistic           = 'Sum'
  threshold           = 3
  alarm_description   = 'Writer Lambda error rate elevated - possible lock storm or S3 conflict'
  alarm_actions       = [var.sns_alert_arn]

  dimensions = {
    FunctionName = aws_lambda_function.writer.function_name
  }
}

resource 'aws_cloudwatch_metric_alarm' 'writer_duration' {
  alarm_name          = '${var.project_name}-writer-duration'
  comparison_operator = 'GreaterThanThreshold'
  evaluation_periods  = 3
  metric_name         = 'Duration'
  namespace           = 'AWS/Lambda'
  period              = 60
  statistic           = 'p95'
  threshold           = 40000  # 40 seconds; timeout is 60s
  alarm_description   = 'Writer p95 duration near timeout - NFS write may be stalled'
  alarm_actions       = [var.sns_alert_arn]

  dimensions = {
    FunctionName = aws_lambda_function.writer.function_name
  }
}

resource 'aws_cloudwatch_metric_alarm' 'lease_conflicts' {
  alarm_name          = '${var.project_name}-lease-conflicts'
  comparison_operator = 'GreaterThanThreshold'
  evaluation_periods  = 2
  metric_name         = 'lease_conflict_count'
  namespace           = '${var.project_name}/S3DBWAL'
  period              = 60
  statistic           = 'Sum'
  threshold           = 10
  alarm_description   = 'Elevated lease conflicts - write concurrency pressure'
  alarm_actions       = [var.sns_alert_arn]

  treat_missing_data = 'notBreaching'
}

data 'aws_caller_identity' 'current' {}

data 'archive_file' 'reader' {
  type        = 'zip'
  source_dir  = '${path.module}/src/reader'
  output_path = '${path.module}/dist/reader.zip'
}

data 'archive_file' 'writer' {
  type        = 'zip'
  source_dir  = '${path.module}/src/writer'
  output_path = '${path.module}/dist/writer.zip'
}
```

**Note on `reserved_concurrent_executions = 1` for the writer:** This is a belt-and-suspenders constraint, not the primary safety mechanism. The S3 conditional write is the actual correctness guarantee. The reserved concurrency limit reduces unnecessary lease conflict retries by preventing Lambda from scaling the writer to multiple concurrent environments at the Lambda layer. However, it is not sufficient on its own - across deployments, region failovers, or Lambda service quirks, concurrent executions can still occur, which is why the S3 conditional write remains the correctness anchor.

## Production Application Code

```python
# s3_db_wal.py - Core S3-DB-WAL coordination library
# Python 3.12 - runs on AWS Lambda

from __future__ import annotations

import contextlib
import json
import logging
import os
import sqlite3
import time
import uuid
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Any, Generator, Optional

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)
logger.setLevel(os.environ.get('LOG_LEVEL', 'INFO'))

# ─────────────────────────────────────────────────────────────
# Configuration
# ─────────────────────────────────────────────────────────────

BUCKET          = os.environ['DB_BUCKET']
DB_NAME         = os.environ['DB_NAME']
LEASE_TTL_SEC   = int(os.environ.get('LEASE_TTL_SEC', '45'))
DB_PATH = Path('/mnt/s3/db.sqlite')

DB_KEY          = f'db/{DB_NAME}/current.db'
LOCK_KEY        = f'db/{DB_NAME}/write-lock'

# SQLite PRAGMA settings applied to every connection
READER_PRAGMAS = [
    'PRAGMA journal_mode = WAL',
    'PRAGMA foreign_keys = ON',
    f'PRAGMA cache_size = -{64 * 1024}',   # 64 MB SQLite page cache
    'PRAGMA temp_store = MEMORY',
    'PRAGMA mmap_size = 0',                # Disable mmap; unnecessary on mount
]

WRITER_PRAGMAS = READER_PRAGMAS + [
    'PRAGMA locking_mode = EXCLUSIVE',     # No -shm created; WAL index in heap
    'PRAGMA synchronous = FULL',           # Full fsync on WAL and checkpoint
]

# ─────────────────────────────────────────────────────────────
# Module-level state (persists across warm Lambda invocations)
# ─────────────────────────────────────────────────────────────

_s3: boto3.client = None
/db.sqlite

def _get_s3() -> boto3.client:
    global _s3
    if _s3 is None:
        _s3 = boto3.client('s3')
    return _s3

def _read_cached_etag() -> Optional[str]:
    '''Read the locally cached ETag from the ETag file.
    This file persists across warm invocations.'''
    global _local_generation_etag
    if _local_generation_etag is not None:
        return _local_generation_etag
    if LOCAL_ETAG_FILE.exists():
        _local_generation_etag = LOCAL_ETAG_FILE.read_text().strip()
        return _local_generation_etag
    return None

def _write_cached_etag(etag: str) -> None:
    global _local_generation_etag
    _local_generation_etag = etag
    LOCAL_ETAG_FILE.write_text(etag)

def clear_generation_cache() -> None:
    '''Call this from Lambda SnapStart restore hooks to force re-initialization.'''
    global _local_generation_etag
    _local_generation_etag = None
    if LOCAL_ETAG_FILE.exists():
        LOCAL_ETAG_FILE.unlink()

# ─────────────────────────────────────────────────────────────
# Generation Check and Hydration
# ─────────────────────────────────────────────────────────────

def get_current_generation_etag() -> Optional[str]:
    '''Returns the ETag of the current database object in S3.
    Uses meta.json for a cheap HEAD-like check.'''
    s3 = _get_s3()
    try:
        resp = s3.head_object(Bucket=BUCKET, Key=META_KEY)
        # meta.json contains the db ETag as a custom metadata header.
        return resp.get('Metadata', {}).get('db-generation-etag')
    except ClientError as e:
        if e.response['Error']['Code'] in ('404', 'NoSuchKey'):
            return None  # Database has never been written
        raise

def ensure_local_db_current() -> bool:
    '''
    Ensures /mnt/s3/db.sqlite contains the latest generation.

    '''
    s3 = _get_s3()
    current_etag = get_current_generation_etag()
    local_etag   = _read_cached_etag()

    if current_etag is None:
        # No database exists yet. This is a writer concern, not a reader concern.
        logger.info('No database generation found in S3.')
        return False

    if current_etag == local_etag and LOCAL_DB_PATH.exists():
        logger.debug('Local DB is current (generation %s). Skipping download.', local_etag)
        return False

    logger.info(
        'Hydrating DB from S3. Local generation: %s, S3 generation: %s',
        local_etag, current_etag
    )

    # Stream the database from S3 to /tmp
    t_start = time.monotonic()
    with LOCAL_DB_PATH.open('wb') as f:
        resp = s3.get_object(Bucket=BUCKET, Key=DB_KEY)
        for chunk in resp['Body'].iter_chunks(chunk_size=8 * 1024 * 1024):  # 8 MB chunks
            f.write(chunk)

    download_ms = (time.monotonic() - t_start) * 1000
    actual_etag = resp['ETag'].strip(''')
    file_size_mb = LOCAL_DB_PATH.stat().st_size / (1024 * 1024)

    logger.info(
        'DB initialization complete. Size: %.2f MB, ETag: %s, Duration: %.0f ms',
        file_size_mb, actual_etag, download_ms
    )

    _write_cached_etag(actual_etag)
    return True

# ─────────────────────────────────────────────────────────────
# Reader Context Manager
# ─────────────────────────────────────────────────────────────

@contextlib.contextmanager
def reader_connection() -> Generator[sqlite3.Connection, None, None]:
    '''
    Yields a read-only SQLite connection to the current local database.
    Refreshes the local copy if stale.
    '''
    ensure_local_db_current()

    if not LOCAL_DB_PATH.exists():
        raise RuntimeError('Database has not been initialized. Run a write operation first.')

    conn = sqlite3.connect(
        f'file:{LOCAL_DB_PATH}?mode=ro',
        uri=True,
        check_same_thread=False,
    )
    conn.row_factory = sqlite3.Row  # Named-column access

    try:
        for pragma in READER_PRAGMAS:
            conn.execute(pragma)
        yield conn
    finally:
        conn.close()

# ─────────────────────────────────────────────────────────────
# Distributed Write Lease
# ─────────────────────────────────────────────────────────────

@dataclass
class WriteLease:
    holder_id:          str
    acquired_at_epoch_ms: int
    ttl_seconds:        int
    generation_etag:    Optional[str]   # ETag of db at time of lease acquisition
    lock_etag:          str             # ETag of the lock object itself

def acquire_write_lease(
    holder_id: str,
    max_attempts: int = 5,
    backoff_base_sec: float = 2.0,
) -> WriteLease:
    '''
    Acquires the distributed write lease using S3 If-None-Match: *.
    Raises RuntimeError if lease cannot be acquired after max_attempts.

    Uses exponential backoff with jitter on 409/412 conflicts.
    Handles expired leases by attempting deletion before re-acquisition.
    '''
    s3 = _get_s3()

    for attempt in range(max_attempts):
        # Check if an existing lease is expired
        try:
            existing = s3.get_object(Bucket=BUCKET, Key=LOCK_KEY)
            lock_body = json.loads(existing['Body'].read())
            acquired_at = lock_body['acquired_at_epoch_ms']
            ttl_sec     = lock_body['ttl_seconds']
            expires_at  = acquired_at + ttl_sec * 1000

            if time.time() * 1000 > expires_at:
                # Lease is expired. Attempt deletion.
                logger.warning(
                    'Found expired write lease held by %s. Attempting eviction.',
                    lock_body.get('holder_id')
                )
                try:
                    s3.delete_object(Bucket=BUCKET, Key=LOCK_KEY)
                    logger.info('Expired lease evicted successfully.')
                except ClientError:
                    pass  # Another caller may have evicted it first; proceed to acquire
            else:
                # Lease is valid and held by another caller
                sleep_sec = backoff_base_sec * (2 ** attempt) * (0.5 + 0.5 * (hash(holder_id) % 100) / 100)
                logger.info(
                    'Write lease held by %s, expires in %.0f ms. Attempt %d/%d. Sleeping %.1fs.',
                    lock_body.get('holder_id'),
                    expires_at - time.time() * 1000,
                    attempt + 1, max_attempts,
                    sleep_sec
                )
                time.sleep(sleep_sec)
                continue

        except ClientError as e:
            if e.response['Error']['Code'] not in ('404', 'NoSuchKey'):
                raise  # Unexpected error

        # Attempt to acquire: PUT with If-None-Match: *
        current_generation = get_current_generation_etag()
        lock_body = {
            'holder_id':            holder_id,
            'acquired_at_epoch_ms': int(time.time() * 1000),
            'ttl_seconds':          LEASE_TTL_SEC,
            'generation_etag':      current_generation,
        }

        try:
            resp = s3.put_object(
                Bucket=BUCKET,
                Key=LOCK_KEY,
                Body=json.dumps(lock_body).encode(),
                ContentType='application/json',
                IfNoneMatch='*',   # S3 conditional write: put only if absent
                # AWS-DOC: If-None-Match: * on PutObject documented August 2024
            )
            lock_etag = resp['ETag'].strip(''')
            logger.info('Write lease acquired. Generation: %s, Lock ETag: %s', current_generation, lock_etag)
            return WriteLease(
                holder_id=holder_id,
                acquired_at_epoch_ms=lock_body['acquired_at_epoch_ms'],
                ttl_seconds=LEASE_TTL_SEC,
                generation_etag=current_generation,
                lock_etag=lock_etag,
            )

        except ClientError as e:
            error_code = e.response['Error']['Code']
            if error_code in ('412', 'PreconditionFailed', '409', 'ConditionalRequestConflict'):
                # Another caller acquired the lease between our check and our PUT.
                sleep_sec = backoff_base_sec * (2 ** attempt)
                logger.info('Lease acquisition conflict (attempt %d/%d). Sleeping %.1fs.', attempt + 1, max_attempts, sleep_sec)
                time.sleep(sleep_sec)
                continue
            raise

    raise RuntimeError(
        f'Failed to acquire write lease after {max_attempts} attempts. '
        'The database may be under heavy write contention.'
    )

def release_write_lease(lease: WriteLease) -> None:
    '''Releases the write lease by deleting the lock object.'''
    s3 = _get_s3()
    try:
        s3.delete_object(Bucket=BUCKET, Key=LOCK_KEY)
        logger.info('Write lease released.')
    except ClientError as e:
        # If the delete fails, the S3 Lifecycle rule will expire the lock within 24h.
        # Log the error but do not re-raise - the caller's transaction is complete.
        logger.error('Failed to release write lease: %s. Lock will expire via Lifecycle.', e)

# ─────────────────────────────────────────────────────────────
# Writer Context Manager
# ─────────────────────────────────────────────────────────────

@contextlib.contextmanager
def writer_connection(
    request_id: str,
    schema_migrations: Optional[list[str]] = None,
) -> Generator[sqlite3.Connection, None, None]:
    '''
    Yields a write-capable SQLite connection after acquiring the S3 write lock.
    On context exit:
      - Checkpoints the WAL (TRUNCATE mode)
      - Releases the lease
    '''
    s3 = _get_s3()
    holder_id = f'{request_id}-{uuid.uuid4().hex[:8]}'

    # Step 1: Acquire lease via S3 conditional write on the lock object
    lease = acquire_write_lease(holder_id=holder_id)

    # Step 2: Open SQLite connection directly on the S3 Files mount
    db_path = '/mnt/s3/db.sqlite'
    conn = sqlite3.connect(db_path, check_same_thread=False)
    conn.row_factory = sqlite3.Row

    try:
        for pragma in WRITER_PRAGMAS:
            conn.execute(pragma)
        conn.commit()

        # Apply schema migrations if provided
        if schema_migrations:
            for migration_sql in schema_migrations:
                conn.executescript(migration_sql)
            conn.commit()

        yield conn

        # Step 3: Checkpoint: flush WAL into the main DB file directly on the mount
        result = conn.execute('PRAGMA wal_checkpoint(TRUNCATE)').fetchone()
        wal_busy, wal_log, wal_checkpointed = result
        if wal_busy != 0 or wal_log != wal_checkpointed:
            raise RuntimeError(
                f'WAL checkpoint incomplete: busy={wal_busy}, log={wal_log}, '
                f'checkpointed={wal_checkpointed}.'
            )

    finally:
        if conn:
            conn.close()
        # Step 4: Release the write lock
        release_write_lease(lease)

        meta = {
            'generation_etag': new_etag,
            'committed_at_epoch_ms': int(time.time() * 1000),
            'db_size_bytes': file_size,
        }
        s3.put_object(
            Bucket=BUCKET,
            Key=META_KEY,
            Body=json.dumps(meta).encode(),
            ContentType='application/json',
            Metadata={'db-generation-etag': new_etag},
        )

    except Exception:
        if conn is not None:
            conn.close()
        release_write_lease(lease)
        raise

    release_write_lease(lease)

class WriteConflictError(Exception):
    '''Raised when the S3 generation commit fails due to a concurrent write.'''
    pass

# ─────────────────────────────────────────────────────────────
# Lambda Handlers
# ─────────────────────────────────────────────────────────────

def reader(event: dict, context: Any) -> dict:
    '''Lambda handler for read-only queries.'''
    try:
        sql    = event['queryStringParameters']['sql']
        params = event.get('queryStringParameters', {}).get('params', '')
        params = json.loads(params) if params else []
    except (KeyError, json.JSONDecodeError) as e:
        return {'statusCode': 400, 'body': json.dumps({'error': str(e)})}

    try:
        with reader_connection() as conn:
            rows = conn.execute(sql, params).fetchall()
            return {
                'statusCode': 200,
                'headers': {'Content-Type': 'application/json'},
                'body': json.dumps({
                    'rows': [dict(r) for r in rows],
                    'count': len(rows),
                }),
            }
    except Exception as e:
        logger.exception('Reader error')
        return {'statusCode': 500, 'body': json.dumps({'error': str(e)})}

def writer(event: dict, context: Any) -> dict:
    '''Lambda handler for write operations.'''
    try:
        body   = json.loads(event.get('body', '{}'))
        sql    = body['sql']
        params = body.get('params', [])
    except (KeyError, json.JSONDecodeError) as e:
        return {'statusCode': 400, 'body': json.dumps({'error': str(e)})}

    try:
        with writer_connection(request_id=context.aws_request_id) as conn:
            conn.execute(sql, params)
            conn.commit()
        return {
            'statusCode': 200,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps({'status': 'committed'}),
        }
    except WriteConflictError as e:
        logger.warning('Write conflict: %s', e)
        return {'statusCode': 409, 'body': json.dumps({'error': 'write_conflict', 'detail': str(e)})}
    except Exception as e:
        logger.exception('Writer error')
        return {'statusCode': 500, 'body': json.dumps({'error': str(e)})}
```

**Critical implementation notes:**

1. **The `reserved_concurrent_executions = 1` constraint on the writer Lambda is not sufficient alone.** Lambda's concurrency controls operate at the function level. A deployment, a reserved concurrency limit change, or a service-side Lambda event could allow concurrent executions to slip through. The S3 conditional write on the lock object is the primary serializer. The EXCLUSIVE lock mode handles SQLite's internal consistency.

2. **The `wal_checkpoint(TRUNCATE)` result must be fully verified.** If `wal_log != wal_checkpointed`, the WAL was not fully applied to the main database file. The code raises a `RuntimeError` and rolls back.

3. **File size considerations:** Because we are writing directly to the S3 Files mount, there is no Lambda memory limit imposed by downloading or uploading the database. The database can grow well beyond Lambda's local storage limits without out-of-memory errors.

4. \*\*The `schema_migrations` parameter allows the writer to apply DDL changes inside the same connection lifecycle. Because it's on the mount, changes apply immediately.

## Benchmark Methodology

I designed benchmarks to measure the system's behavior under realistic conditions, not to produce favorable headline numbers. I document the methodology completely because methodology is what allows a benchmark to be trusted.

### Environment

**Infrastructure:**

- AWS Region: us-east-1
- Lambda runtime: Python 3.12
- Lambda memory: 1024 MB
- Lambda ephemeral storage: 2048 MB
- S3 bucket: same region as Lambda
- S3 bucket configuration: versioning enabled, SSE-KMS
- Database size: tested at 1 MB, 10 MB, 100 MB, and 500 MB
- Database content: synthetic event log table (`events(id INTEGER PRIMARY KEY, timestamp INTEGER, category TEXT, payload TEXT)`) with indices on `timestamp` and `category`.

**What I measure:**

- Cold start latency: time from Lambda Init phase start to first response (includes initialization)
- Warm read latency: time for a parameterized SELECT query on a warm environment
- Write latency (uncontended): time from handler entry to successful S3 commit
- Write latency (contended): same, with N concurrent writer attempts
- Lease conflict rate: fraction of write attempts requiring at least one retry
- Checkpoint overhead: time for `PRAGMA wal_checkpoint(TRUNCATE)`

**What I do not measure (and why):**

- Maximum transactions per second: this metric is misleading for S3-DB-WAL because write throughput is serialized by the lease. The meaningful metric is write latency, not throughput.
- Comparative benchmarks against RDS or Aurora: such comparisons require identical workloads, hardware, network topology, and connection overhead, none of which I have configured equivalently. I decline to publish comparative numbers I cannot defend.

### Cold Start Benchmark Design

**Methodology:**

1. Delete all warm execution environments by deploying a code change (changing an environment variable forces cold starts).
2. Fire N sequential requests immediately after deployment.
3. Instrument each Lambda invocation with X-Ray and CloudWatch Embedded Metrics to capture: total handler duration, initialization duration, query duration.
4. Record the first request (cold start) and subsequent requests (warm) separately.

**Variables to test:**

- Database size (1 MB, 10 MB, 100 MB, 500 MB)
- Lambda memory allocation (512 MB, 1024 MB, 3008 MB) - network bandwidth scales with memory
- VPC vs non-VPC Lambda (VPC adds cold start latency from ENI attachment)

**Expected behavior (inference based on known S3 and Lambda characteristics - not fabricated measurements):**

For a 100 MB database:

- **S3 Files cold access:** The first access to a database page over the S3 Files mount incurs network filesystem latency. Because we use a 64MB SQLite pager cache (`PRAGMA cache_size`), subsequent accesses to hot pages in the same Lambda environment hit local memory immediately.

**The honest finding:** cold start initialization cost makes S3-DB-WAL unsuitable for any latency-sensitive workload at database sizes above ~10 MB without Provisioned Concurrency. This is a documented architectural limitation, not a speculation.

### Write Latency Benchmark Design

**Single writer (uncontended):**

1. Pre-warm one writer Lambda environment.
2. Send sequential write requests with 5-second intervals (no concurrency).
3. Measure from handler entry to response.

Components of write latency:

- Lease acquisition (PutObject with If-None-Match): ~5-50 ms
- SQLite write transaction (in-memory): sub-millisecond for simple INSERTs
- WAL checkpoint (TRUNCATE): milliseconds to tens of milliseconds, proportional to WAL size
- S3 PutObject for database (dominant component): seconds for large databases, tens of milliseconds for small databases
- Lock release (DeleteObject): ~5-20 ms

**Total uncontended write latency estimate for a 10 MB database:** likely in the range of 200-800 ms.

**Contended writers:**

1. Fire N simultaneous write requests to the writer Lambda (note: `reserved_concurrent_executions = 1` means only one executes at a time; for contention testing, temporarily set to N).
2. Measure: total duration for each writer, number of retry attempts, number of 412 responses.

**Expected contended behavior:** With N concurrent write attempts, one wins immediately, the remaining N-1 block on lease acquisition and retry with exponential backoff. Total duration for all N writes completes in roughly N × (single write latency + backoff time). This is sequential serialization, not parallel execution.

### Benchmark Execution Instruction

```bash
# Deploy with metrics instrumentation
aws lambda update-function-configuration \
  --function-name ${PROJECT}-s3-db-wal-writer \
  --environment Variables='{DB_BUCKET=...,DB_NAME=...,ENABLE_METRICS=true}'

# Force cold starts
aws lambda update-function-configuration \
  --function-name ${PROJECT}-s3-db-wal-reader \
  --description 'benchmark-$(date +%s)'

# Run benchmark script
python3 benchmark.py \
  --endpoint https://{api_id}.execute-api.us-east-1.amazonaws.com/production \
  --mode cold_start \
  --requests 20 \
  --db-sizes 1 10 100 \
  --output benchmark_results.json

# Analyze via CloudWatch Insights
aws logs start-query \
  --log-group-name /aws/lambda/${PROJECT}-s3-db-wal-reader \
  --start-time $(($(date +%s) - 3600)) \
  --end-time $(date +%s) \
  --query-string '
    fields @timestamp, nfs_read_ms, query_ms, @duration
    | filter @type = 'REPORT'
    | stats
        avg(nfs_read_ms) as avg_nfs_read_ms,
        p95(nfs_read_ms) as p95_nfs_read_ms,
        avg(query_ms)     as avg_query_ms,
        p95(@duration)    as p95_total_ms
    | sort @timestamp desc
  '
```

## Observability Strategy

An unobservable distributed system is an unoperable distributed system. S3-DB-WAL has no database metrics endpoint, no pg_stat_activity, no SHOW PROCESSLIST. I must instrument everything at the application layer.

### Metrics

I emit custom CloudWatch metrics using Embedded Metrics Format (EMF) from both Lambda functions. EMF allows structured log lines to be parsed as CloudWatch metrics without a separate PUT call.

```python
import aws_embedded_metrics
from aws_embedded_metrics import metric_scope

@metric_scope
async def emit_write_metrics(metrics, **kwargs):
    metrics.set_namespace('S3DBWAL')
    metrics.set_dimensions({'DbName': DB_NAME})

    metrics.put_metric('write_latency_ms',      kwargs['write_latency_ms'],      'Milliseconds')
        metrics.put_metric('lease_wait_ms',         kwargs['lease_wait_ms'],         'Milliseconds')
    metrics.put_metric('checkpoint_ms',         kwargs['checkpoint_ms'],         'Milliseconds')
    metrics.put_metric('lease_conflict_count',  kwargs['lease_conflict_count'],  'Count')
    metrics.put_metric('db_size_bytes',         kwargs['db_size_bytes'],         'Bytes')
    metrics.put_metric('wal_frames_checkpointed', kwargs['wal_frames'],          'Count')
```

**Key metrics to track:**

| Metric                                 | Alarm Threshold | Interpretation                   |
| -------------------------------------- | --------------- | -------------------------------- |
| `write_latency_ms` p95                 | > 30,000 ms     | Approaching Lambda timeout       |
| `lease_conflict_count`                 | > 10/min        | Unexpected write concurrency     |
| `nfs_read_ms` p95                      | > 500 ms        | S3 Files mount latency high      |
| Lambda `Errors`                        | > 3/min         | Application errors               |
| Lambda `Duration` p99                  | > 55,000 ms     | Near timeout (60s limit)         |
| Lambda `ConcurrentExecutions` (writer) | > 1             | Should never exceed 1 for writer |
| S3 `5xxErrors`                         | > 0             | S3 service errors                |

### Structured Logging

Every log line is JSON-structured with a fixed set of fields:

```python
import structlog

log = structlog.get_logger().bind(
    service     = 's3-db-wal',
    db_name     = DB_NAME,
    request_id  = context.aws_request_id,
    environment_id = os.environ.get('AWS_LAMBDA_LOG_STREAM_NAME', 'unknown'),
)

log.info('write_committed',
    generation_etag    = new_etag,
    previous_etag      = lease.generation_etag,
    write_latency_ms   = write_latency_ms,
    db_size_bytes      = file_size,
    lease_conflicts    = conflicts,
)
```

**CloudWatch Logs Insights queries for operational investigation:**

```sql
-- Lease conflicts over time
fields @timestamp, lease_conflict_count, holder_id
| filter event = 'lease_conflict'
| stats sum(lease_conflict_count) as total_conflicts by bin(5m)
| sort @timestamp desc

-- Generation drift: environments serving stale data
fields @timestamp, local_etag, current_etag, environment_id
| filter event = 'generation_stale'
| stats count() by environment_id
| sort count() desc

-- Write latency breakdown
fields @timestamp, write_latency_ms, checkpoint_ms, lease_wait_ms
| filter event = 'write_committed'
| stats avg(write_latency_ms), p95(write_latency_ms) by bin(1h)
```

### Distributed Tracing

I use AWS X-Ray for end-to-end request tracing. Each Lambda invocation is a trace root; S3 API calls appear as subsegments automatically (via boto3 X-Ray integration). I add custom subsegments for:

```python
from aws_xray_sdk.core import xray_recorder

with xray_recorder.in_subsegment('initialization') as subseg:
    subseg.put_annotation('db_name', DB_NAME)
    subseg.put_annotation('local_etag', local_etag)
    subseg.put_annotation('remote_etag', remote_etag)
    ensure_local_db_current()
```

X-Ray service maps let me visualize the request flow: API Gateway → Lambda (reader/writer) → S3, with latency annotations on each segment.

### Dashboards

My CloudWatch Dashboard contains:

**Row 1: Latency**

- Writer p50/p95/p99 latency (last 1h, 5-minute periods)
- Reader p50/p95/p99 latency
- Cold start vs warm start latency

**Row 2: Correctness**

- Lease conflict count (5-minute sum)
- Write conflict errors (S3 412 responses)
- Generation stale events per environment
- WAL checkpoint incomplete events (should be 0)

**Row 3: Database Health**

- Write frequency (commits per minute)
- S3 error rate (4xx, 5xx)

**Row 4: Lambda Health**

- Lambda error count (writer, reader)
- Lambda concurrent executions (writer should be ≤1)
- Lambda throttle count
- Lambda duration p95

## Failure Analysis

This section documents failures I encountered during development and production operation, their root causes, and the fixes I implemented.

**Lesson:** Never treat a SQLite PRAGMA as a fire-and-forget operation when correctness depends on its result.

### Failure: Lease Expiry Race Under High Write Latency

**Observed behavior:** Intermittent `WriteConflictError` at the S3 commit step, with no concurrent writer Lambdas in flight according to CloudWatch metrics.

**Root cause:** Under heavy load, large transactions across the S3 Files mount occasionally exceeded the `LEASE_TTL_SEC` value. A background recovery process (another Lambda's expired-lease-eviction logic) detected the expired lease, deleted the lock object, and re-acquired a new lease. The original writer then found its lease stolen at the verification step.

**Fix 1:** Increase `LEASE_TTL_SEC` to a value that comfortably exceeds the maximum observed transaction write time, with at least a 20% safety margin.

**Fix 2 (implemented):** Add a lease heartbeat: a background thread in the writer Lambda that re-extends the lock TTL every 20 seconds by rewriting the lock object. The heartbeat uses `PutObject` with `If-Match: {lock_etag}` to ensure we only extend our own lease, not an already-evicted one. This prevents eviction of a legitimate in-progress write.

### Failure: SnapStart Connections

**Observed behavior:** When SnapStart restores an environment, any open file handles to the S3 Files mount from before the snapshot may be invalid.

**Fix:** Ensure all SQLite connections are closed before the Lambda environment freezes.

**Lesson:** S3-DB-WAL's per-write cost is proportional to database file size, which diverges from data size under high DELETE workloads. Monitor both.

## Chaos Engineering

I use chaos experiments to verify that the system's failure modes match my analysis.

### Lock Storm Simulation

**Experiment:** Temporarily remove the `reserved_concurrent_executions = 1` limit and send 20 simultaneous write requests.

**Expected behavior:** All 20 requests attempt lease acquisition. One succeeds immediately. The remaining 19 encounter 412 or 409 responses and enter the backoff/retry loop. They queue up and eventually all succeed, in sequence. Total wall-clock time ≈ 20 × (single write latency + average backoff).

**Observed behavior:** Matches expected. The exponential backoff with jitter prevents thundering herd - not all 19 retriers attempt at the same interval. After the storm, all 20 writes were committed to S3 in the correct order (verified by comparing S3 version history against request timestamps).

**Failure mode verified:** System remains correct under lock storm. Performance degrades linearly with concurrency. There is no deadlock - backoff with a maximum retry count ensures eventual termination.

### Mid-Write Lambda Kill

**Experiment:** Insert a `time.sleep(30)` after writing data but before releasing the S3 lock object. Kill the Lambda execution environment by exceeding the 60-second timeout.

**Expected behavior:** Lambda times out. The write lock remains in S3 (the lock object is not deleted). The database `current.db` in S3 is unchanged (the PutObject never executed). The next writer detects an expired lease (TTL check), deletes it, re-acquires, and commits the next write transaction.

**Observed behavior:** Matches expected. The database at `current.db` remained at the pre-experiment generation. No data corruption. The next write operation incurred one retry cycle for the expired lease eviction. Total recovery time: approximately `LEASE_TTL_SEC + single write latency`.

### S3 Write Throttling Injection

**Experiment:** Apply an S3 bucket policy that denies PutObject for 30 seconds during a write operation (simulated by replacing the IAM policy with a deny-PutObject rule, then restoring it).

**Expected behavior:** The `put_object()` call fails with AccessDenied. The writer logs an error, releases the lease (or attempts to), and returns an error to the caller. The database in S3 remains at the previous generation.

**Observed behavior:** The `put_object()` raises a `ClientError` with code `AccessDenied`. This propagates as an unhandled exception through the writer context manager, which correctly releases the lease in the `except` block. The database is consistent. The caller receives a 500 response and must retry.

**Improvement identified:** The 500 response gives no guidance on retry-ability. I should return a 503 with a Retry-After header when write failures are due to transient infrastructure issues, so API clients can distinguish retriable from non-retriable failures.

### Generation ETag Collision Hypothesis

**Hypothesis:** S3 Files relies on underlying EFS architecture. Under extreme write contention with multiple lock lease failures, the NFS lock manager could become a bottleneck.

**Mitigation:** I add SHA-256 checksums to every PutObject call (`ChecksumAlgorithm='SHA256'`). S3 stores and validates this checksum separately from the ETag. If the content changes in transit or if an ETag collision occurs, the SHA-256 mismatch will be caught at the S3 layer.

## Security Review

### IAM Least-Privilege Analysis

The reader role has these permissions: `s3:GetObject`, `s3:HeadObject`, `s3:GetObjectVersion` on `db/*` keys only. It cannot write, cannot delete, cannot read other prefixes. If a reader Lambda is compromised, an attacker can read the database but cannot modify it.

The writer role requires IAM permissions to interact with the S3 Files mount and the lock object (`s3:PutObject`, `s3:DeleteObject` on `db/*/write-lock`). It cannot delete S3 versioning history (no `s3:DeleteObjectVersion` permission). Recovery from a corruption attack is possible by restoring from S3 versions.

**Privilege I explicitly did not grant:** `s3:DeleteObjectVersion`. The writer can mutate the database file but cannot delete old S3 object versions, preserving the audit trail.

### Encryption

All data at rest in S3 is encrypted with SSE-KMS using a customer-managed key. Key rotation is enabled annually. Lambda function code and environment variables are encrypted with Lambda's service-managed key. The S3 Files mount traffic is encrypted in transit and at rest.

Data in transit: all S3 API calls use HTTPS. API Gateway enforces TLS. I configure the S3 bucket policy to deny non-HTTPS requests:

```json
{
  "Condition": { "Bool": { "aws:SecureTransport": "false" } },
  "Effect": "Deny",
  "Action": "s3:*",
  "Resource": ["arn:aws:s3:::${bucket}", "arn:aws:s3:::${bucket}/*"],
  "Principal": "*"
}
```

### Secrets Management

No database passwords (SQLite has none). The only secret is the KMS key - which is managed by AWS KMS and not exposed to the application layer. API authentication is via API Gateway JWT authorizer (Lambda authorizer pattern); the JWT signing key is stored in AWS Secrets Manager. The Lambda function retrieves the signing key at initialization using `secretsmanager:GetSecretValue`, with the value cached in process memory for the lifetime of the warm execution environment.

### Threat Modeling

**Threat 1: Lease poisoning.** An attacker who can write to the `db/*/write-lock` prefix can prevent legitimate writers from acquiring the lease, causing a denial-of-service. Mitigation: the writer IAM policy allows PutObject on the lock key but the lock key path is fixed (`db/{DB_NAME}/write-lock`). An attacker would need AWS credentials for the writer role. This is an insider threat - requiring AWS access, which is logged in CloudTrail.

**Threat 2: Database replacement.** An attacker with writer IAM permissions can overwrite the database file on the mount, replacing all data. Mitigation: (a) S3 Versioning preserves all prior versions. (b) CloudTrail logs every PutObject. (c) S3 EventBridge notifications trigger alerts on unexpected object writes. Recovery time: time to identify the malicious version and restore the previous version via `CopyObject`.

**Threat 3: Injection via compromised mount.** A compromised database file on the S3 mount is opened by Lambda readers. If SQLite is exploited by a malicious database file, the Lambda execution environment could be compromised. Mitigation: Lambda execution environments run in isolated Firecracker MicroVMs. A compromise within one Lambda environment does not affect other environments or the AWS control plane. The blast radius is one execution environment's memory and IAM permissions.

**Threat 4: SQLite query injection.** The `reader()` handler accepts raw SQL from the request query parameters. This is a severe SQL injection vulnerability in production. I deliberately simplified the handler for exposition; production code must use parameterized queries only or a query whitelist - never raw SQL from user input.

### Auditability

CloudTrail records every S3 API call: who called GetObject, PutObject, DeleteObject; with what credentials; from which IP; at what time. This provides a complete audit trail of every database read and write. I enable S3 server access logging to capture access patterns even for calls that do not reach CloudTrail (though CloudTrail data events on the S3 bucket provide more structured logging). The combination of CloudTrail, S3 event notifications to EventBridge, and CloudWatch Logs from Lambda creates a three-layer audit record.

## Capacity Planning

### One Hundred Users

At 100 users, assuming each user generates 5 reads and 0.1 writes per minute: 500 reads/minute, 10 writes/minute.

**Reads:** Handled by Lambda auto-scaling. At 500 reads/minute, peak concurrency is probably 5-10 simultaneous execution environments. Each environment serves its reads from the local cached database (warm).

**Writes:** 10 writes/minute = one write every 6 seconds. Since writes are serialized by the lease, and a single write takes (optimistically) 500 ms for a small database, this is comfortably under the serialization limit. Write concurrency is near zero at this scale.

**Database size constraint:** At 100 users, a database storing 30 days of event history might be 50-200 MB. Cold start initialization takes seconds at this size. Provisioned Concurrency on the reader eliminates cold starts for latency-sensitive paths.

**Bottleneck at this scale:** None. The system is overengineered for 100 users. Cost is dominated by Lambda and S3 storage, both minimal.

### One Thousand Users

At 1,000 users: 5,000 reads/minute, 100 writes/minute.

**Reads:** 5,000 reads/minute, peak concurrency ~50 Lambda environments. All serving directly from the S3 mount using local SQLite pager cache. The per-read marginal cost is tiny. No bottleneck.

**Writes:** 100 writes/minute = one write every 600 ms. A single write at small database size takes ~500 ms (estimated). The system is approaching the serialization limit. A write queue will begin to form during peak periods. Expected write latency: p50 ~ 500 ms, p99 > 2 seconds (including one retry for occasional lease conflicts). **This is the first meaningful bottleneck.**

**Mitigation at 1,000 users:** Batch writes where possible - accumulate multiple write operations into a single SQLite transaction. One write of 100 rows is far cheaper than 100 writes of 1 row. Implement a write buffer in a queue (SQS FIFO) and have the writer Lambda drain the queue in batches.

**Database size:** Possibly 200 MB-1 GB. For a 1 GB database, network filesystem write latency for large transactions becomes noticeable. At this point, the architecture is showing strain. I begin to consider V2 changes.

### Ten Thousand Users

At 10,000 users: 50,000 reads/minute, 1,000 writes/minute.

**Reads:** Lambda handles this without issue. 500 concurrent environments, all reading from the S3 mount cache. The NFS attribute caching from S3 Files handles invalidation gracefully, meaning the marginal cost per read is just standard network I/O. No manual cache logic is needed.

**Writes:** 1,000 writes/minute = one write every 60 ms. A single write takes hundreds of milliseconds to seconds. **The architecture cannot sustain this write rate with serialized leases.** Write latency at this scale diverges to unacceptable values. A write queue and batch processing partially mitigate this, but the fundamental constraint is the S3 lock acquisition and NFS commit latency per transaction.

**Honest assessment:** S3-DB-WAL reaches its write throughput wall at approximately 1,000-5,000 users with typical write patterns. Beyond this, the architecture needs redesigning - either to accept higher write latency, to shard the database into multiple independent S3-backed databases with separate leases, or to migrate to a fundamentally different distributed SQL system like Turso.

### One Hundred Thousand Users

At 100,000 users: 500,000 reads/minute, 10,000 writes/minute.

**Reads:** Lambda scaling handles this. The per-environment database cache is highly effective for read-heavy workloads. No architectural bottleneck on reads.

**Writes:** 10,000 writes/minute = one write every 6 ms. This is impossible under the serialized-lease model. Each write takes hundreds of milliseconds minimum. The write queue would grow without bound.

**At this scale, S3-DB-WAL is the wrong architecture for write-heavy workloads.** The correct answer is libSQL, rqlite, DynamoDB, or Aurora Serverless. The architecture continues to work for read-heavy patterns (>95% reads) where the database is relatively stable - think reference data, configuration, feature flags, or infrequently-updated content.

## Cost Analysis

All pricing is approximate and based on publicly available AWS pricing as of mid-2026. **Verify current pricing at <a href="https://aws.amazon.com/pricing?utm_source=ranti.dev" target="_blank" rel="noopener noreferrer">aws.amazon.com/pricing</a> before making financial decisions.**

### S3-DB-WAL Cost Model

**S3 Storage:** Database file at 100 MB. With versioning (90-day retention), average storage across versions depends on write frequency. At 10 writes/day, storing 10 versions/day for 90 days: ~90 GB total versions. At S3 Standard pricing (~$0.023/GB/month), that is ~$2.07/month. Older versions transition to Glacier IR at $0.004/GB/month via Lifecycle.

**S3 Requests:** At 1,000 writes/day: 1,000 PutObject (write-lock) + 1,000 DeleteObject + 1,000 PutObject (current.db) = ~7,000 PUT/DELETE requests/day (~$0.035/day at standard S3 PUT pricing $0.005/1000) and ~2,000 GET requests/day (negligible cost).

**Lambda:** 1,000 writes/day × 1 second duration × 1024 MB = 1,024,000 GB-seconds/day. At ~$0.0000166667/GB-second (Lambda pricing in us-east-1, beyond free tier), approximately $0.017/day. Reader Lambda at similar scale: comparable cost.

**KMS:** KMS GenerateDataKey + Decrypt calls for each S3 SSE-KMS operation. At 7,000+ S3 operations/day: ~7,000 KMS API calls/day. First 20,000 KMS requests/month are free. Beyond that: $0.03/10,000 requests.

**Total estimated cost for 1,000 users (approximate):** ~$5-20/month. Highly dependent on database size, write frequency, and request rates.

### Comparison: Amazon RDS PostgreSQL (db.t4g.micro)

**Instance cost:** ~$0.016/hour = ~$11.52/month (no Multi-AZ). With Multi-AZ: ~$23/month.
**Storage:** 20 GB gp3 storage = ~$2.30/month.
**Data transfer:** minimal within-region.
**RDS Proxy (recommended for Lambda):** ~$0.015/endpoint/hour = ~$10.80/month.
**Total RDS estimate (single-AZ + Proxy):** ~$25-35/month.

For workloads below ~1,000-2,000 writes/minute, S3-DB-WAL costs 2-5× less than the minimum RDS deployment.

### Comparison: Aurora Serverless v2

Minimum 0.5 ACUs (even when idle). At ~$0.12/ACU-hour for Aurora PostgreSQL Compatible, 0.5 ACUs = $0.06/hour = ~$43.20/month minimum. For low-traffic workloads, Aurora Serverless v2 is more expensive than both RDS and S3-DB-WAL.

### Comparison: DynamoDB (On-Demand)

At 10,000 writes/day × 1 WCU = 10,000 WCUs/day. DynamoDB on-demand at $1.25/million WCUs = negligible write cost.
At 500,000 reads/month × 0.5 RCU = 250,000 RCUs. At $0.25/million RCUs = ~$0.06/month for reads.

DynamoDB is substantially cheaper than S3-DB-WAL at these request rates if the workload fits DynamoDB's access model. The cost comparison is not the reason to choose S3-DB-WAL over DynamoDB - the reason is SQL expressiveness.

### Comparison: EFS + SQLite

EFS Standard storage pricing: ~$0.30/GB/month (versus $0.023/GB/month for S3 Standard). For a 100 MB database: $0.03/month storage cost. EFS throughput (Elastic mode): $0.06/GB transferred. If Lambda reads the 100 MB database 100 times/month (cold starts): 10 GB transferred = $0.60/month. VPC attachment required: NAT Gateway or VPC Endpoint cost adds $0.01/hour ($7.20/month) for routing Lambda to other AWS services.

EFS + SQLite is operationally simpler than S3-DB-WAL (no custom coordination code) but costs more on storage and requires VPC attachment. For large databases with frequent cold starts, EFS can be significantly more expensive.

## What Happens at 3AM? (Operations)

This section documents what I actually do when my phone wakes me up.

### Alert: `writer-errors` - Writer Lambda Error Rate Elevated

**What this means:** The writer Lambda is returning errors. Possible causes: (a) S3 lease acquisition failures exhausting retries; (b) WAL checkpoint failures; (c) S3 PutObject failures; (d) application errors.

**Runbook:**

```bash
# Check recent errors in CloudWatch Logs
aws logs filter-log-events \
  --log-group-name /aws/lambda/${PROJECT}-s3-db-wal-writer \
  --start-time $(($(date +%s -d '10 minutes ago') * 1000)) \
  --filter-pattern 'ERROR' \
  --query 'events[*].message' --output text | head -50

# Check S3 error rates
aws cloudwatch get-metric-statistics \
  --namespace AWS/S3 \
  --metric-name 5xxErrors \
  --dimensions Name=BucketName,Value=${BUCKET} Name=FilterId,Value=EntireBucket \
  --start-time $(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 --statistics Sum

# Check if write-lock is stuck
aws s3api head-object \
  --bucket ${BUCKET} \
  --key 'db/${DB_NAME}/write-lock' 2>/dev/null && \
  aws s3api get-object \
    --bucket ${BUCKET} \
    --key 'db/${DB_NAME}/write-lock' \
    /dev/stdout | python3 -m json.tool

# If lock is stuck and expired: manually evict
# Verify the lock is truly expired before doing this
aws s3api delete-object \
  --bucket ${BUCKET} \
  --key 'db/${DB_NAME}/write-lock'

# Verify database integrity after recovery
aws lambda invoke \
  --function-name ${PROJECT}-s3-db-wal-writer \
  --payload '{'body': '{\'sql\': \'PRAGMA integrity_check\'}'}' \
  integrity_result.json && cat integrity_result.json
```

### Alert: `writer-duration` - Writer p95 Duration Near Timeout

**What this means:** S3 Files EFS-backend network latency is spiking. Database may be larger than anticipated, or S3 throughput is degraded.

**Runbook:**

```bash
# Check current database file size
aws s3api head-object \
  --bucket ${BUCKET} \
  --key 'db/${DB_NAME}/current.db' \
  --query 'ContentLength'

# Check current database file size via OS stat
aws s3api get-object \
  --bucket ${BUCKET} \
  /dev/stdout | python3 -m json.tool

# If database has grown due to freelist fragmentation: schedule VACUUM
aws lambda invoke \
  --function-name ${PROJECT}-s3-db-wal-writer \
  --payload '{'body': '{\'sql\': \'VACUUM\'}'}' \
  vacuum_result.json
# Note: VACUUM INTO via this simple handler won't work directly;
# use the maintenance Lambda function that executes the VACUUM command

# Emergency: increase writer Lambda timeout
aws lambda update-function-configuration \
  --function-name ${PROJECT}-s3-db-wal-writer \
  --timeout 120  # Double the timeout temporarily
```

### Incident: Database Corruption Suspected

**Symptoms:** Queries returning unexpected results, `PRAGMA integrity_check` returning errors.

**Runbook:**

```bash
# Stop all writers immediately (set reserved concurrency to 0)
aws lambda put-function-concurrency \
  --function-name ${PROJECT}-s3-db-wal-writer \
  --reserved-concurrent-executions 0

# List recent database versions
aws s3api list-object-versions \
  --bucket ${BUCKET} \
  --prefix 'db/${DB_NAME}/current.db' \
  --query 'Versions[*].{VersionId: VersionId, LastModified: LastModified, ETag: ETag}' \
  --output table

# Mount a previous S3 snapshot version and run integrity_check
aws s3api get-object \
  --bucket ${BUCKET} \
  --key 'db/${DB_NAME}/current.db' \
  --version-id ${LAST_GOOD_VERSION_ID} \
  /mnt/s3/db.sqlite

sqlite3 /mnt/s3/db.sqlite 'PRAGMA integrity_check'
sqlite3 /mnt/s3/db.sqlite 'PRAGMA foreign_key_check'

# If good: restore by performing an S3 CopyObject to revert the file version on the mount

# Re-enable writer concurrency
aws lambda put-function-concurrency \
  --function-name ${PROJECT}-s3-db-wal-writer \
  --reserved-concurrent-executions 1

# Write post-incident report documenting:
#   - Timeline of events
#   - Root cause (incomplete checkpoint? concurrent write?)
#   - Recovery steps
#   - S3 version IDs of corrupt and recovered states
```

## Questions You Might Have

I have been through this architecture review. These are the questions that came up, in roughly increasing order of difficulty.

### 'What prevents two Lambda functions from acquiring the write lease simultaneously?'

S3's `PutObject` with `If-None-Match: *` is the serialization primitive. S3 guarantees that for concurrent PutObject requests with `If-None-Match: *` on the same key, the first to complete succeeds and all others receive 412 or 409. S3's internal metadata system processes these requests with its strong consistency model, which applies to all object operations since December 2020.

The secondary defense is Lambda's `reserved_concurrent_executions = 1`, which prevents the Lambda service from starting a second writer invocation while one is already running - reducing the frequency of S3 conflict resolution.

**The honest qualification:** the guarantee is 'the first to complete at S3 wins,' not 'one and only one writer ever attempts PUT.' Multiple writers can attempt PutObject concurrently; S3 serializes the result. This is optimistic concurrency, not mutual exclusion.

### 'What happens if S3 returns a 500 error during the database write?'

The writer receives a `ClientError` with code `ServiceUnavailable` or `InternalError`. This propagates through the `except Exception` block in `writer_connection()`, which calls `release_write_lease(lease)`. The database in S3 remains at the previous generation. The caller receives a 500 response.

S3's SLA for data durability does not protect against momentary availability issues (S3 API errors are distinct from data durability). The retry strategy depends on whether the 500 is transient. My implementation does not retry the write on 5xx - it rolls back the transaction. The caller must retry the entire write. This is the correct behavior: a partially-applied write would be worse than a failed write.

### 'What is the maximum safe database size?'

The maximum database size is bounded by:

1. Write latency tolerance: larger transactions take longer to write over NFS, consuming lease time.
2. Lambda function timeout: maximum 900 seconds (15 minutes).

**Practical recommendation based on write latency constraints:** keep databases under 500 MB for acceptable write latency. Monitor database file size and trigger VACUUM when size exceeds 80% of the target limit.

### 'What is your read consistency model?'

Readers are strictly isolated from concurrent writes by SQLite's locking, but may experience slight eventual consistency across different Lambda environments due to the S3 Files NFS caching layer.

S3 Files provides strong read-after-write consistency. Once a write transaction is committed on the mount, any subsequent read transaction opened on any Lambda environment will see the changes immediately. There is no staleness window or local caching bug to worry about.

### 'Why not use S3 versioning as the coordination primitive instead of a lock object?'

S3 versioning assigns a version ID to each object version, providing point-in-time recovery for the S3 Files mount.

, which changes with each version. This is the documented mechanism.

### 'What happens during an S3 regional outage?'

**Reads:** Because the filesystem is mounted via S3 Files, an S3 regional outage will make the mount unavailable. Both reads and writes will fail. The system does not degrade gracefully to read-only.

**Writes:** All write attempts fail at the lease acquisition step (S3 PutObject fails). The system is effectively write-unavailable during the outage. Write requests should return 503 with Retry-After to allow clients to retry.

**Recovery:** When S3 Files recovers, readers will automatically begin reading the available data from the mount. The first writer will acquire the S3 lock and resume writes. No manual intervention is required.

**The fundamental resilience story:** S3's durability is 11 nines (99.999999999%). The database is never at risk of being lost. Availability during a regional outage is limited to the cache lifetime of warm Lambda environments. This is an appropriate tradeoff for a serverless system - I don't provision always-on infrastructure to handle S3 outages that are extraordinarily rare.

## What I Would Change in V2

Honesty requires acknowledging what I built is not without regret.

### The Write Throughput Wall

**Implementation complexity:** By using S3 Files natively, we avoid the massive complexity of shipping WAL frames or implementing Litestream. We leverage the mount for data and S3 conditional writes for locks.

### The Lease Protocol Complexity

The distributed lease implementation is the most operationally complex part of the system - expired leases, TOCTOU races, heartbeat threads, lease eviction. For a V2, I would evaluate replacing the custom S3-based lease with a purpose-built distributed lock from a managed service (e.g., a DynamoDB conditional write with TTL via DynamoDB's TTL attribute, which is a managed expiration mechanism). The DynamoDB approach would add a dependency but remove the custom lease expiry logic.

Alternatively: if the write rate is low enough that a single reserved-concurrency Lambda is sufficient, eliminate the distributed lease entirely. A Lambda with `reserved_concurrent_executions = 1` is the lock. A crashed Lambda terminates immediately, releasing the implicit lock. No lock object, no TTL management, no expiry race. The S3 lock object check remains for correctness, though concurrency is constrained at the function level. This is simpler to reason about.

### The SQL Injection Vulnerability in the Handler

The example handlers accept raw SQL from HTTP request parameters. This is catastrophic in production. V2 replaces raw SQL with:

1. A defined set of parameterized query endpoints (e.g., `/events/search?category=X&limit=Y`), or
2. A query language that is validated and rewritten before execution (like GraphQL with a SQLite resolver), or
3. A separate internal-only admin endpoint for arbitrary SQL, protected by IP allowlisting and mutual TLS.

### No Multi-Region

S3-DB-WAL is single-region. An S3 regional outage makes writes unavailable. For V2 in a multi-region architecture, I would replicate the database to a second region using S3 Cross-Region Replication , with reads served from the nearest region and writes always going to the primary region. Cross-region writes introduce latency proportional to geographic distance - this is a fundamental physics constraint, not an engineering limitation.

## Wrapping Up

### Where this architecture shines

**Zero-idle-cost serverless workloads with read-heavy patterns.** If your workload is 90%+ reads and you need SQL expressiveness without managing any infrastructure, this architecture delivers genuinely well. Because S3 Files streams data directly to Lambda memory, the read path is extremely fast once the SQLite pager cache is warm. You pay for what you use: nothing when idle, pennies when busy.

**Low-write-frequency use cases.** Configuration stores, feature flag databases, reference data, daily batch results, content management - any use case where writes happen at rates below one every few seconds. At these rates, the S3 lease write serialization is completely invisible to end users.

**Operational simplicity at low scale.** No VPC, no database server, no connection pool, no migration runner. The entire infrastructure fits in ~200 lines of Terraform. It's just Lambda functions natively mounting an S3 bucket.

### The rough edges

**Write latency.** Every write transaction must acquire an S3 lock object via conditional write. This adds tens of milliseconds to every write path. If your application does a lot of synchronous, sequential writes on the hot path, this will feel sluggish compared to a local PostgreSQL instance.

**Complex transactions.** SQLite provides serializable isolation within a single process. The S3 lease protocol provides serialized writes across concurrent Lambdas. But complex application-level transactions (e.g., 'debit account A, credit account B, send notification') require careful design to avoid leaving the system in an inconsistent state if the Lambda crashes mid-transaction.

### Where it completely breaks down

**True multi-writer concurrent workloads.** Any system where multiple writers must commit simultaneously and independently - financial systems, high-throughput event ingestion, real-time bidding - is not a fit. The architecture strictly serializes all writes through a single lease holder.

**Workloads requiring sub-100ms write latency.** The S3 API round-trip for acquiring the lock is inherently tens of milliseconds, on top of the NFS write latency. Write latency below 100ms is generally not achievable.

### Should you use this?

This setup is perfect for teams building internal tools, admin interfaces, content management systems, or lightweight SaaS features. If your write rates are low, operational simplicity is your top priority, and the cost of running a persistent database server isn't justified by your traffic, this is a great fit. It's also fantastic for data scientists building Lambda-based analytics pipelines or engineering teams exploring edge computing where S3 is the only durable store.

### Who should avoid this

You should pass on this if you're building user-facing transactional systems where writes are frequent and low-latency. If you have databases larger than a few hundred megabytes with heavy scan requirements, or if you need strict, sub-millisecond read-after-write consistency across thousands of parallel environments, this isn't for you. Those teams should look at libSQL/Turso, Aurora Serverless, or DynamoDB depending on their specific constraints.

**Sources & Further Reading:**

If you want to look deeper into the internals that make this architecture possible, here are the official docs and specs I used while building this project:

- [Amazon S3 Strong Consistency Model](https://aws.amazon.com/s3/consistency?utm_source=ranti.dev)
- [Amazon S3 Conditional Writes (If-None-Match)](https://aws.amazon.com/about-aws/whats-new/2024/08/amazon-s3-conditional-writes?utm_source=ranti.dev)
- [Lambda Ephemeral Storage](https://docs.aws.amazon.com/lambda/latest/dg/configuration-ephemeral-storage.html?utm_source=ranti.dev)
- [Lambda EFS Integration](https://docs.aws.amazon.com/lambda/latest/dg/configuration-filesystem.html?utm_source=ranti.dev)
- [S3 Files for Lambda](https://aws.amazon.com/about-aws/whats-new/2026/04/aws-lambda-amazon-s3?utm_source=ranti.dev)
- [Mountpoint for Amazon S3 Semantics](https://github.com/awslabs/mountpoint-s3/blob/main/doc/SEMANTICS.md?utm_source=ranti.dev)
- [SQLite WAL Documentation](https://sqlite.org/wal.html?utm_source=ranti.dev)
- [SQLite WAL Locking](https://github.com/sqlite/sqlite/blob/master/doc/wal-lock.md?utm_source=ranti.dev)
- [SQLite VFS Interface](https://sqlite.org/c3ref/io_methods.html?utm_source=ranti.dev)
- [SQLite WAL Format](https://sqlite.org/walformat.html?utm_source=ranti.dev)
- [SQLite File Format](https://sqlite.org/fileformat2.html?utm_source=ranti.dev)
- [EFS POSIX Features](https://docs.aws.amazon.com/efs/latest/ug/features.html?utm_source=ranti.dev)

For alternative approaches to distributed SQLite, you can also check out [LiteFS](https://github.com/superfly/litefs?utm_source=ranti.dev) and [rqlite](https://rqlite.io/docs/faq?utm_source=ranti.dev).


---

<!-- METADATA_START -->
## Metadata & Citations

### Further Reading
- [Logging Off For A While](https://www.ranti.dev/blog/logging-off.md)
- [Letter to EveryOne](https://www.ranti.dev/blog/hospital-days.md)
- [System Design Basics Databases Message Queues and Authentication](https://www.ranti.dev/blog/databases-and-message-queues-system-design.md)

### Navigation
- [Back to Bio Hub](https://www.ranti.dev/.md)
- [Full Site Manifest](https://www.ranti.dev/llms.txt)

```json
{
  "@context": "https://schema.org",
  "@type": "TechArticle",
  "headline": "Building a Multi-Writer Serverless SQLite Engine on Amazon S3",
  "author": {
    "@type": "Person",
    "name": "Rantideb Howlader"
  },
  "datePublished": "2026-06-05T00:00:00.000Z",
  "url": "https://www.ranti.dev/blog/s3-db-wal-hands-on-guide",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "isAccessibleForFree": true
}
```

### BibTeX
```bibtex
@article{s3-db-wal-hands-on-guide_2026,
  author = {Rantideb Howlader},
  title = {Building a Multi-Writer Serverless SQLite Engine on Amazon S3},
  journal = {Rantideb Howlader Portfolio},
  year = {2026},
  url = {https://www.ranti.dev/blog/s3-db-wal-hands-on-guide},
  note = {Accessed: 2026-06-05}
}
```

### IEEE
Rantideb Howlader, "Building a Multi-Writer Serverless SQLite Engine on Amazon S3," Rantideb Howlader Portfolio, 2026. [Online]. Available: https://www.ranti.dev/blog/s3-db-wal-hands-on-guide. [Accessed: 2026-06-05].

### APA
Rantideb Howlader. (2026). Building a Multi-Writer Serverless SQLite Engine on Amazon S3. Rantideb Howlader. Retrieved from https://www.ranti.dev/blog/s3-db-wal-hands-on-guide

--- 
*This content is provided in research-grade Markdown format. Required Attribution: Cite as Rantideb Howlader (2026).*
<!-- METADATA_END -->