---
title: "Running Local LLM Agents in Kubernetes: A Practitioner's Guide to vLLM on EKS"
author: "Rantideb Howlader"
date: "2026-06-12T00:00:00.000Z"
canonical_url: "https://www.ranti.dev/blog/vllm-on-eks"
license: "CC-BY-4.0"
---


For the last two years, "adding AI" to a product meant one thing. You grabbed an API key, called OpenAI or Anthropic, and shipped. That worked. It still works. But the ground is shifting under our feet, and if you run infrastructure for a living, you can feel it.

Teams are now pulling inference inside their own walls. They are running open-source models like Llama 3 and Mistral on their own clusters. Sometimes the reason is privacy. Legal will not let customer data leave the VPC, full stop. Sometimes it is cost. If your agents make thousands of tool calls per hour, per-token API pricing turns into a line item your CFO circles in red. And sometimes it is control. Agentic workflows hammer a model with long, repetitive, structured prompts. When you own the serving stack, you can tune for exactly that traffic shape instead of paying retail for a general-purpose endpoint.

Here is the good news. If you already run Kubernetes, you already know 80 percent of what you need. A model server is just another Deployment. A GPU is just another resource type. The remaining 20 percent is where people get burned, and that 20 percent is what this post covers.

We are going to deploy [vLLM](https://docs.vllm.ai/?utm_source=ranti.dev) on Amazon EKS, end to end. Real manifests, real gotchas, real bills.

## Why vLLM and Not Just `transformers` in a Flask App

You can serve a model with a 30-line Python script. People do. Then they put it behind ten concurrent users and watch it fall over.

The problem is memory, not compute. When an LLM generates text, it keeps a cache of attention keys and values for every token in every active request. This is the KV cache. Naive servers allocate one giant contiguous chunk of GPU memory per request, sized for the worst case. Most of that memory sits empty. Research from the vLLM team found that fragmentation and over-reservation waste 60 to 80 percent of KV cache memory in conventional serving systems, which caps how many requests you can serve at once.

vLLM fixed this with **PagedAttention**. The idea is borrowed straight from operating systems. Instead of one contiguous allocation per request, the KV cache is split into small fixed-size blocks, like memory pages. Blocks are allocated on demand and freed the moment a request finishes. Fragmentation drops to nearly zero. The practical result is that the same GPU can hold far more concurrent requests, and vLLM batches them together continuously instead of waiting for a full batch to form.

```mermaid
graph LR
    subgraph Naive Serving
        N1[Request 1] --> |Reserves 2GB| M1[Contiguous Block]
        N2[Request 2] --> |Reserves 2GB| M2[Contiguous Block]
        M1 -.- |Wasted Space| W1[Empty VRAM]
    end
    subgraph vLLM PagedAttention
        P1[Request 1] --> |Maps to| B1[Block 1]
        P1 --> |Maps to| B2[Block 2]
        P2[Request 2] --> |Maps to| B3[Block 3]
        B1 & B2 & B3 -.- |No Wasted Space| DynamicPool[Dynamic KV Cache Pool]
    end
```
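To put rough numbers on the diagram, here is a toy model of the two allocation styles. It is a sketch, not vLLM's allocator: the request lengths are made up, and the 16-token block size is an assumption you should check against your vLLM version.

```python
import math

BLOCK_TOKENS = 16  # tokens per KV block (assumed for illustration)


def contiguous_slots(max_len: int) -> int:
    """Naive serving: reserve room for the longest possible sequence up front."""
    return max_len


def paged_slots(actual_len: int, block_tokens: int = BLOCK_TOKENS) -> int:
    """Paged serving: reserve only the whole blocks the sequence has touched."""
    return math.ceil(actual_len / block_tokens) * block_tokens


def waste_fraction(reserved: int, used: int) -> float:
    """Share of reserved token slots that hold no real tokens."""
    return (reserved - used) / reserved


# Hypothetical traffic: three requests of very different lengths,
# served with an 8192-token maximum context.
lengths = [300, 1200, 50]
max_len = 8192

used = sum(lengths)
naive_reserved = sum(contiguous_slots(max_len) for _ in lengths)
paged_reserved = sum(paged_slots(n) for n in lengths)

print(f"naive: {waste_fraction(naive_reserved, used):.1%} of reserved slots unused")
print(f"paged: {waste_fraction(paged_reserved, used):.1%} of reserved slots unused")
```

With paging, the only waste is the partly filled last block of each request. That is the whole trick.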

vLLM also gives you something operationally priceless: an **OpenAI-compatible API**. Your application code does not change. You point the OpenAI SDK at your own Service URL, swap the model name, and everything keeps working. Migration becomes a config change, not a rewrite.

There are other serving engines. TGI from Hugging Face is solid. TensorRT-LLM squeezes out more raw speed on NVIDIA hardware but demands more from you at build time. For most teams, vLLM hits the sweet spot of throughput, model support, and ease of operation. That is why it has become the default answer.

## Architecture Overview

Before we touch a terminal, here is the full picture of what we are building.

```mermaid
flowchart TD
    Client["Client / Agent (OpenAI SDK)"]

    subgraph EKS["EKS Cluster"]
        ALB["ALB Ingress"]
        SVC["Service (ClusterIP)"]

        subgraph NodeGroup["GPU Managed Node Group (g5)"]
            VLLM["vLLM Deployment (Port 8000)"]
            Plugin["NVIDIA Device Plugin"]
            DCGM["DCGM Exporter"]
        end
    end

    Weights[("Model Weights (HF Hub / S3)")]

    Client -->|HTTPS| ALB
    ALB -->|Port 80| SVC
    SVC -->|Port 8000| VLLM
    VLLM -.->|Downloads| Weights
    Plugin -.->|Registers GPU| VLLM
```

The pieces, top to bottom:

1. **An EKS cluster.** Nothing exotic. Your existing cluster works. The networking layer matters more than you might think once model weights and large payloads start moving around. If you want a refresher on how pods get IPs and how the VPC CNI behaves under pressure, I wrote about it in [EKS Networking and the VPC CNI](/blog/eks-networking-vpc-cni).
2. **A GPU managed node group.** We will use `g5` instances, which carry NVIDIA A10G GPUs with 24 GB of VRAM each. They are the workhorse for 7B to 13B parameter models. The bigger `p4d` instances pack eight A100s and cost more per hour than some people's rent. Reach for those when you serve 70B-class models or need tensor parallelism. Start with `g5`.
3. **The NVIDIA device plugin.** Kubernetes has no native idea what a GPU is. This DaemonSet teaches it.
4. **The vLLM container.** One pod, one GPU, one model. It exposes an OpenAI-compatible HTTP API on port 8000.
5. **A Service and an Ingress.** Standard Kubernetes traffic plumbing, with a few LLM-specific timeout tweaks.
6. **Storage for weights.** Llama 3 8B is roughly 16 GB of weights. You do not want to download that from the internet on every pod restart. We will fix that in the Day 2 section.

One design principle before we start. Treat the model server like a stateless web service with a very expensive warm-up. Everything you know about Deployments, probes, and rollouts still applies. The differences are all about startup time, memory, and money.

## Step 1: Provisioning the GPU Node Group

First, the unglamorous truth. Your fresh AWS account almost certainly cannot launch a GPU instance. AWS gates G and P instances behind service quotas, and the default quota in many accounts is zero. Request an increase for "Running On-Demand G and VT instances" in the Service Quotas console before you do anything else. Approval can take a few hours or a few days. File it now and read the rest of this post while you wait.

Second, the money talk. A single `g5.xlarge` (one A10G, 4 vCPUs, 16 GB RAM) costs about a dollar an hour on demand in us-east-1. That is roughly 740 dollars a month if you leave it running. A `g5.2xlarge` gives you the same GPU with more CPU and RAM for around 1.2 dollars an hour. A `p4d.24xlarge` is over 32 dollars an hour. Check the current numbers on the [EC2 pricing page](https://aws.amazon.com/ec2/pricing/on-demand/?utm_source=ranti.dev) because they shift, but the shape of the problem does not. GPU nodes are the most expensive thing in your cluster by a wide margin. Scale-to-zero is not a nice-to-have here. It is a survival skill, and we will cover it in the autoscaling section.
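The monthly figures are simple arithmetic, and it is worth running them for your own schedule. The sketch below uses the rounded hourly rates quoted above as placeholders and assumes 730 hours in an average month. Swap in current prices before you show it to anyone in finance.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)

# Rounded on-demand rates from the paragraph above (USD per hour, us-east-1).
# Placeholders only: check the EC2 pricing page for current numbers.
HOURLY_RATE = {
    "g5.xlarge": 1.0,
    "g5.2xlarge": 1.2,
    "p4d.24xlarge": 32.0,
}


def monthly_cost(instance_type: str, hours_per_day: float = 24.0) -> float:
    """Cost of one node that runs `hours_per_day` hours every day of the month."""
    return HOURLY_RATE[instance_type] * HOURS_PER_MONTH * (hours_per_day / 24.0)


for name in HOURLY_RATE:
    always_on = monthly_cost(name)
    ten_hours = monthly_cost(name, hours_per_day=10)
    print(f"{name:14s} always-on ${always_on:>9,.0f}   10h/day ${ten_hours:>9,.0f}")
```

The gap between the two columns is the case for scale-to-zero.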

Now the actual provisioning. The fastest path is `eksctl`:

```bash
# Create a GPU node group on an existing cluster.
# eksctl auto-detects the GPU instance type and selects
# an EKS-optimized AMI with NVIDIA drivers preinstalled.
eksctl create nodegroup \
  --cluster my-cluster \
  --region us-east-1 \
  --name vllm-gpu-nodes \
  --node-type g5.2xlarge \
  --nodes 1 \
  --nodes-min 0 \
  --nodes-max 3 \
  --node-volume-size 200 \
  --node-labels "workload=llm-inference" \
  --node-taints "nvidia.com/gpu=present:NoSchedule"
```

Three flags deserve a closer look.

**`--node-volume-size 200`.** Default root volumes are small. A managed node group created without an explicit disk size gets 20 GB, and eksctl's own default of 80 GB is not generous either. The vLLM container image alone is over 8 GB. Add a 16 GB model, container layers, and logs, and a default disk fills up fast. Disk pressure evictions on a GPU node are a special kind of pain. Give yourself room.

**The taint.** GPU nodes should only run GPU workloads. Without a taint, your cluster scheduler will happily place a log shipper or a cron job on your dollar-an-hour node, and then the autoscaler cannot scale it down. The taint keeps freeloaders out. Your vLLM pods will carry a matching toleration.

**The AMI.** eksctl picks the EKS-optimized accelerated AMI automatically for GPU instance types. It ships with NVIDIA drivers and the container toolkit baked in. This saves you from the driver-version-mismatch hellscape that anyone who has hand-built GPU machines will remember with a shudder.

If your shop runs Terraform, here is the same thing using the AWS provider with the popular EKS module:

```hcl
# GPU node group for LLM inference workloads.
eks_managed_node_groups = {
  vllm_gpu = {
    name           = "vllm-gpu-nodes"
    instance_types = ["g5.2xlarge"]

    # AL2023 accelerated AMI: NVIDIA drivers preinstalled
    ami_type = "AL2023_x86_64_NVIDIA"

    min_size     = 0
    desired_size = 1
    max_size     = 3

    # Room for the container image and model weights
    block_device_mappings = {
      root = {
        device_name = "/dev/xvda"
        ebs = {
          volume_size = 200
          volume_type = "gp3"
        }
      }
    }

    labels = {
      workload = "llm-inference"
    }

    taints = {
      gpu = {
        key    = "nvidia.com/gpu"
        value  = "present"
        effect = "NO_SCHEDULE"
      }
    }
  }
}
```

Apply, wait a few minutes, and confirm the node joined:

```bash
kubectl get nodes -l workload=llm-inference
```

You should see your `g5.2xlarge` in `Ready` state. The GPU inside it, however, is still invisible to Kubernetes. That brings us to step two.

## Step 2: The NVIDIA Device Plugin

```mermaid
sequenceDiagram
    participant User
    participant EKS
    participant AWS as AWS EC2
    participant Plugin as Device Plugin

    User->>EKS: Apply NodeGroup (g5.2xlarge)
    EKS->>AWS: Request instance with Accelerated AMI
    AWS-->>EKS: Node joins cluster
    User->>EKS: Apply NVIDIA Plugin DaemonSet
    EKS->>Plugin: Start pod on GPU Node
    Plugin->>EKS: Register nvidia.com/gpu capacity
    EKS-->>User: Node ready for LLM workloads
```

Run `kubectl describe node` on your shiny new GPU node and look at the `Capacity` section. You will see CPU and memory. You will not see a GPU. As far as the Kubernetes scheduler is concerned, that A10G does not exist.

This is by design. Kubernetes only understands CPU, memory, and ephemeral storage natively. Everything else, GPUs included, goes through the **device plugin** framework. A device plugin is a DaemonSet that runs on each node, talks to the kubelet over a local socket, and says "this node has 1 unit of `nvidia.com/gpu`, here is how to wire it into a container." Once that registration happens, `nvidia.com/gpu` becomes a schedulable resource you can request in a pod spec, just like CPU.

Install the [NVIDIA device plugin](https://github.com/NVIDIA/k8s-device-plugin?utm_source=ranti.dev) with Helm:

```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# Install the device plugin. The tolerations let it land
# on our tainted GPU nodes, where it actually needs to run.
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set-json 'tolerations=[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}]'
```

A note on the toleration. The plugin's DaemonSet must run on the GPU nodes themselves, and we tainted those nodes in step one. Without the toleration, the plugin gets locked out of the only nodes it exists to serve. This is one of the most common "why does my GPU not show up" mistakes, and it produces zero error messages. The DaemonSet just quietly schedules nowhere useful.
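If the matching rules feel fuzzy, this simplified model of them may help. It is a sketch of how a toleration is compared against a taint, not the scheduler's actual code, and it only looks at the `NoSchedule` and `NoExecute` effects.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default
    key = toleration.get("key")
    if operator == "Exists":
        # No key means "tolerate everything"; otherwise only the key must match.
        return key is None or key == taint["key"]
    # Equal: key and value must both match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations: list, taints: list) -> bool:
    """A pod can land on a node only if every blocking taint is tolerated."""
    blocking = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in blocking)


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
plugin_toleration = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}

print("with toleration:   ", schedulable([plugin_toleration], [gpu_taint]))
print("without toleration:", schedulable([], [gpu_taint]))
```

Note that `operator: Exists` ignores the taint's value, which is why the same toleration keeps working if you later change `present` to something else.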

Verify the GPU is now visible:

```bash
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
# Expected output on the GPU node: "1"
```

If you see `"1"`, Kubernetes now knows about your GPU and can schedule against it.
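The `jq` one-liner prints `null` for every non-GPU node, which gets noisy on a big cluster. If you prefer a per-node summary, a few lines of Python over the same JSON will do. The sample below is a trimmed, made-up stand-in for real `kubectl get nodes -o json` output, and `gpu_capacity` is a hypothetical helper name.

```python
import json


def gpu_capacity(node_list: dict) -> dict:
    """Map node name -> GPU count from `kubectl get nodes -o json` output."""
    result = {}
    for node in node_list.get("items", []):
        capacity = node.get("status", {}).get("capacity", {})
        # Capacity values are strings in the API; a missing key means
        # no device plugin has registered a GPU on that node.
        result[node["metadata"]["name"]] = int(capacity.get("nvidia.com/gpu", "0"))
    return result


# Made-up sample showing only the fields this helper reads.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-node-a"},
   "status": {"capacity": {"cpu": "8", "nvidia.com/gpu": "1"}}},
  {"metadata": {"name": "cpu-node-b"},
   "status": {"capacity": {"cpu": "4"}}}
]}
""")

for name, gpus in gpu_capacity(sample).items():
    print(f"{name}: {gpus} GPU(s)")
```

To run it against a live cluster, replace `sample` with `json.load(sys.stdin)` and pipe the `kubectl` output in.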

Two things worth knowing before we move on.

First, GPUs are requested in whole numbers. You cannot ask for `0.5` of a GPU the way you ask for `500m` of CPU. One pod, one GPU is the default model. NVIDIA offers time-slicing and MIG to split GPUs between pods, but for LLM serving you almost never want that. vLLM is designed to consume an entire GPU and squeeze value out of it through batching. Sharing the card just adds contention.

Second, if your platform grows, look at the [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html?utm_source=ranti.dev) instead of the standalone plugin. It manages drivers, the container toolkit, the device plugin, and monitoring as one bundle. On EKS with the accelerated AMI, the standalone plugin is enough to start, because the AMI already handles drivers for you.

## Step 3: Deploying vLLM

Now the main event. We will deploy `meta-llama/Meta-Llama-3-8B-Instruct`, an 8 billion parameter model that fits comfortably on a single A10G and punches well above its weight for agent-style tasks.

One prerequisite: Llama 3 is a gated model. You need to accept Meta's license on the [Hugging Face model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct?utm_source=ranti.dev) and create a Hugging Face access token. Store the token as a Kubernetes Secret:

```bash
kubectl create secret generic hf-token \
  --from-literal=token=hf_your_token_here
```

Here is the Deployment. Read the comments. Every line in here earned its place by breaking for someone first.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b
  labels:
    app: vllm-llama3-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b
  # Model servers take minutes to start. Surge, never kill first.
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    metadata:
      labels:
        app: vllm-llama3-8b
    spec:
      # Match the taint we put on the GPU node group
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      # Make sure we land on the GPU nodes, not just tolerate them
      nodeSelector:
        workload: llm-inference
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.5
          args:
            - "--model"
            - "meta-llama/Meta-Llama-3-8B-Instruct"
            # Cap context length. Llama 3 supports 8192, and
            # capping it keeps KV cache memory predictable.
            - "--max-model-len"
            - "8192"
            # Fraction of GPU VRAM vLLM may claim for weights
            # plus KV cache. 0.90 leaves headroom for CUDA overhead.
            - "--gpu-memory-utilization"
            - "0.90"
            # One GPU, so no tensor parallelism needed
            - "--tensor-parallel-size"
            - "1"
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            requests:
              # The GPU request is what gets this pod scheduled
              # onto a GPU node. Requests and limits must match
              # for extended resources like GPUs.
              nvidia.com/gpu: "1"
              cpu: "4"
              memory: "16Gi"
            limits:
              nvidia.com/gpu: "1"
              memory: "16Gi"
          # vLLM loads 16GB of weights into VRAM at startup.
          # The startup probe gives it up to 15 minutes before
          # the liveness probe is allowed to judge it.
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            failureThreshold: 90
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
          volumeMounts:
            # Shared memory for tensor operations. The default
            # 64Mi /dev/shm will crash vLLM under load.
            - name: shm
              mountPath: /dev/shm
            # Cache downloaded weights on the node disk so a
            # pod restart does not re-download 16GB.
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "8Gi"
        - name: model-cache
          hostPath:
            path: /var/lib/vllm-cache
            type: DirectoryOrCreate
```

And the Service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b
spec:
  type: ClusterIP
  selector:
    app: vllm-llama3-8b
  ports:
    - name: http
      port: 80
      targetPort: 8000
```

Apply both, then settle in:

```bash
kubectl apply -f vllm-deployment.yaml -f vllm-service.yaml
kubectl logs -f deploy/vllm-llama3-8b
```

The first startup is slow, and you should know why so you do not panic. Three things happen in sequence. The node pulls the vLLM image, which is over 8 GB. Then vLLM downloads roughly 16 GB of model weights from Hugging Face. Then it loads those weights into VRAM, builds its CUDA graphs, and pre-allocates the KV cache. On a fresh node this whole dance can take 10 to 15 minutes. Go get coffee. This is the modern version of "compiling," and yes, there is an xkcd for it.

```mermaid
stateDiagram-v2
    [*] --> ContainerCreating: Pod Scheduled

    state ContainerCreating {
        [*] --> PullImage: Pull 8GB Image
        PullImage --> StartContainer
    }

    StartContainer --> Init_vLLM

    state Init_vLLM {
        [*] --> DownloadWeights: Fetch 16GB Model
        DownloadWeights --> LoadVRAM: Copy to GPU
        LoadVRAM --> BuildCUDAGraphs: Optimize Execution
        BuildCUDAGraphs --> PreAllocateKV: Reserve 90% VRAM
    }

    Init_vLLM --> Ready: Startup Complete
    Ready --> [*]
```

The startup probe in the manifest exists for exactly this reason. Without it, the liveness probe would start failing during weight loading, Kubernetes would kill the pod, and you would enter an infinite restart loop where the model never finishes loading. If you take one probe lesson from this post, take that one.
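The 15-minute figure is just `failureThreshold` times `periodSeconds`. Here is that arithmetic as a small sketch, plus the inverse for sizing the threshold from a startup time you have measured. The 1.5x safety factor is my own assumption, not a Kubernetes default.

```python
import math


def probe_budget_seconds(failure_threshold: int, period_seconds: int,
                         initial_delay_seconds: int = 0) -> int:
    """Worst-case time a startup probe waits before the container is restarted."""
    return initial_delay_seconds + failure_threshold * period_seconds


def required_failure_threshold(expected_startup_seconds: float,
                               period_seconds: int,
                               safety_factor: float = 1.5) -> int:
    """Smallest failureThreshold that covers the expected startup with some margin."""
    return math.ceil(expected_startup_seconds * safety_factor / period_seconds)


# The manifest above: 90 failures allowed, probed every 10 seconds.
budget = probe_budget_seconds(failure_threshold=90, period_seconds=10)
print(f"startup budget: {budget} s ({budget / 60:.0f} min)")

# Hypothetical measurement: cold starts take about 10 minutes.
print("threshold for a 600 s cold start:", required_failure_threshold(600, 10))
```

Size the budget for a cold start on a fresh node, not for the fast restart you see on a warm one.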

A few manifest choices worth defending:

**The `/dev/shm` mount.** vLLM and PyTorch use shared memory for inter-process communication. Kubernetes gives containers a 64 MB `/dev/shm` by default. vLLM will crash into it almost immediately. The `emptyDir` with `medium: Memory` fixes this. This single line has saved more on-call hours than I can count.

**`--gpu-memory-utilization 0.90`.** vLLM pre-allocates VRAM for the KV cache up front. On a 24 GB A10G, the model weights take about 16 GB, and the rest of the budget becomes cache for concurrent requests. Push this to 0.98 and you flirt with CUDA out-of-memory crashes from driver overhead. Drop it to 0.70 and you starve your own throughput. 0.90 is the boring, correct default.

**Requests equal limits for the GPU.** Extended resources like `nvidia.com/gpu` do not support overcommit. The API server rejects manifests where they differ. Memory should also match here, because an OOM-killed model server mid-generation is a bad day.
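Back to `--gpu-memory-utilization` for a second, because the trade-off is easier to see with numbers. This is a rough sketch using the figures above: a 24 GB card and 16 GB of weights. It ignores activation memory and CUDA graphs, which also come out of vLLM's budget, so treat the results as upper bounds.

```python
def kv_cache_budget_gb(vram_gb: float, utilization: float, weights_gb: float) -> float:
    """VRAM left for the KV cache after weights, inside vLLM's allowed share."""
    return vram_gb * utilization - weights_gb


def headroom_gb(vram_gb: float, utilization: float) -> float:
    """VRAM that vLLM leaves untouched for the driver and everything else."""
    return vram_gb * (1 - utilization)


VRAM_GB = 24      # A10G
WEIGHTS_GB = 16   # Llama 3 8B in 16-bit precision

for util in (0.70, 0.90, 0.98):
    cache = kv_cache_budget_gb(VRAM_GB, util, WEIGHTS_GB)
    spare = headroom_gb(VRAM_GB, util)
    print(f"utilization {util:.2f}: ~{cache:.1f} GB for KV cache, ~{spare:.1f} GB headroom")
```

At 0.70 the cache is nearly starved, and at 0.98 the headroom is under half a gigabyte.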

Once the logs show `Application startup complete`, smoke-test it from inside the cluster:

```bash
kubectl run curl-test --rm -it --image=curlimages/curl --restart=Never -- \
  curl -s http://vllm-llama3-8b/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
      "messages": [{"role": "user", "content": "Say hello in five words."}],
      "max_tokens": 32
    }'
```

If JSON with a chat completion comes back, you are serving a frontier-adjacent language model from your own cluster. Take a moment. Then keep reading, because we have not exposed it to anything yet.

## Step 4: Exposing the OpenAI-Compatible API

This is the part that makes the whole exercise pay off. vLLM speaks the OpenAI API dialect natively. The endpoints are `/v1/chat/completions`, `/v1/completions`, `/v1/models`, and `/v1/embeddings`. Any client built for OpenAI, including the official SDKs, LangChain, LlamaIndex, and most agent frameworks, works against vLLM with two config changes: the base URL and the model name.

### Option A: Internal LoadBalancer (the quick path)

```mermaid
graph TD
    subgraph Option A: Internal VPC
        A1[Internal Microservice] --> A2[Internal NLB]
        A2 --> A3[vLLM Service]
    end

    subgraph Option B: External Clients
        B1[Public Agent / User] --> B2[ALB Ingress]
        B2 --> B3[vLLM Service]
        style B2 stroke:#FF5A5F,stroke-width:2px
    end
```

If your only clients are inside your VPC, an internal Network Load Balancer is the simplest route:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b-lb
  annotations:
    # AWS Load Balancer Controller annotations
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    # Internal: do not expose your GPU to the public internet
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
spec:
  type: LoadBalancer
  selector:
    app: vllm-llama3-8b
  ports:
    - port: 80
      targetPort: 8000
```

### Option B: ALB Ingress (the proper path)

For TLS, path routing, and a stable hostname, use the [AWS Load Balancer Controller](https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/?utm_source=ranti.dev) with an Ingress:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/your-cert
    # LLM responses can take a while. The ALB default idle
    # timeout of 60s will cut off long generations.
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=300
spec:
  ingressClassName: alb
  rules:
    - host: llm.internal.ranti.dev
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-llama3-8b
                port:
                  number: 80
```

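That idle timeout annotation is not decoration. The ALB closes a connection when no bytes have crossed it for the idle timeout, 60 seconds by default. A non-streaming completion, or any request waiting in the queue on a busy GPU, can easily stay silent for longer than that. The ALB then severs the connection mid-request, and your client will report a vague network error that sends you debugging the wrong layer for an afternoon. Raise it.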

Two more notes on exposure. First, vLLM ships with no authentication by default. Anyone who can reach the endpoint can run inference on your dollar-an-hour GPU. Pass `--api-key` to the vLLM container, or put auth at the gateway layer. Second, keep the load balancer internal unless you have a very good reason. A public LLM endpoint without auth gets discovered by scanners shockingly fast.

### Pointing your application at it

Here is the payoff in code. Any service in the VPC can now do this:

```python
from openai import OpenAI

# Same SDK you already use. Different base_url. That is the
# entire migration.
client = OpenAI(
    base_url="https://llm.internal.ranti.dev/v1",
    api_key="your-vllm-api-key",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Summarize this incident report in 3 bullets."}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

This compatibility matters most for agentic systems. An agent loop fires dozens or hundreds of model calls per task: plan, call a tool, read the result, plan again. I broke down that pattern in [What Is Agent Looping?](/blog/what-is-agent-looping) if you want the full picture. The short version is that agents are token furnaces. They are exactly the workload where per-token API pricing hurts and where a self-hosted, continuously batched vLLM endpoint shines. vLLM 0.8+ also supports OpenAI-style function calling for supported models via `--enable-auto-tool-choice`, paired with a `--tool-call-parser` that matches the model's tool-call format, so tool-using agents can work against it directly.

The same logic applies to AI coding tools. Several of the editors I compared in [Kiro vs Cursor vs Windsurf](/blog/kiro-vs-cursor-vs-windsurf) can point at any OpenAI-compatible endpoint, which means your team's coding assistant can run against models that never leave your network.

## Day 2 Operations and Gotchas

Getting vLLM running is an afternoon. Keeping it running well is the actual job. This section is the stuff I wish someone had told me before my first GPU-shaped AWS bill.

### Gotcha 1: Everything about this workload is huge

The vLLM container image is 8+ GB. The model weights are 16 GB for an 8B model and over 140 GB for a 70B model. Every operation that "just works" with a 50 MB Go service becomes a logistics problem at this scale.

The first symptom you will notice is pod startup time on fresh nodes. Image pull alone can take several minutes. Then the weight download starts. There are four strategies, in rough order of effort:

**Strategy 1: Node-local caching (what we did above).** The `hostPath` volume keeps downloaded weights on the node disk. Pod restarts on the same node become fast. New nodes still pay full price. This is fine for a steady single-replica setup and costs you nothing extra.

```mermaid
flowchart LR
    subgraph ColdStart["Cold Start Pain"]
        HF[Hugging Face Hub] -- 10+ mins --> Pod[vLLM Pod]
    end

    subgraph HostPathCache["Strategy 1: HostPath"]
        Disk[(Node Disk)] -- 10 seconds --> Pod2[vLLM Pod]
    end

    subgraph S3Cache["Strategy 3: S3 CSI"]
        S3[Amazon S3 Bucket] -- 1-2 mins --> Pod3[vLLM Pod]
    end

    style ColdStart stroke:#ff0000,stroke-width:2px
```

**Strategy 2: Mirror the image to ECR.** Pull the vLLM image once, push it to a private ECR repo in your region. In-region ECR pulls are dramatically faster and more reliable than pulling from Docker Hub, and you stop being subject to Docker Hub rate limits at the worst possible moment.

```bash
docker pull vllm/vllm-openai:v0.8.5
docker tag vllm/vllm-openai:v0.8.5 \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/vllm-openai:v0.8.5
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/vllm-openai:v0.8.5
```

**Strategy 3: Weights on S3 with the Mountpoint CSI driver.** Upload the model files to an S3 bucket once. Mount the bucket into pods as a read-only volume using the [Mountpoint for Amazon S3 CSI driver](https://github.com/awslabs/mountpoint-s3-csi-driver?utm_source=ranti.dev), then point vLLM at the mount path instead of a Hugging Face model ID. Downloads from in-region S3 saturate the instance network and beat Hugging Face Hub by a wide margin. This also removes your startup-path dependency on an external service. When Hugging Face has a bad day, your pods should not care.

**Strategy 4: Bake weights into a custom AMI or EBS snapshot.** Maximum speed, maximum maintenance. New model version means a new image build. I would only go here if pod cold-start time is a hard SLO.

My recommendation: ECR mirror plus S3 weights. It is the best effort-to-payoff ratio, and it turns a 12-minute cold start into roughly 3 to 4 minutes.

One more disk note. If you cache multiple models per node, that 200 GB root volume fills up. Watch `node_filesystem_avail_bytes` in your existing node monitoring. The kubelet's disk-pressure eviction does not know that one of those "evictable" pods took ten minutes to warm up.

### Gotcha 2: Autoscaling GPUs is a different sport

Horizontal Pod Autoscaler logic you know from web services breaks down here in three ways.

First, CPU utilization is a useless signal. A vLLM pod under heavy load might show 30 percent CPU while the GPU is pinned at 100. You need to scale on GPU-native or queue-native metrics: `vllm:num_requests_waiting` (requests queued, exposed by vLLM's own `/metrics` endpoint) is the single best scaling signal. KEDA or HPA with a Prometheus adapter can consume it.

Second, scale-up is slow. Adding a replica means provisioning a GPU node (2 to 4 minutes), pulling images, and loading weights. End to end, expect 5 to 10 minutes even with the caching strategies above. Reactive autoscaling cannot absorb a traffic spike that peaks in 60 seconds. Plan capacity for your p95 load and let autoscaling handle the slow daily curve, not the bursts.

```mermaid
sequenceDiagram
    participant Clients
    participant vLLM
    participant KEDA as Metrics Adapter
    participant Karpenter

    Clients->>vLLM: Burst of 500 requests
    vLLM->>vLLM: Queue fills up
    KEDA->>vLLM: Scrape vllm:num_requests_waiting
    KEDA->>Kubernetes: Scale up HPA
    Kubernetes->>Karpenter: Pending GPU Pod detected
    Karpenter->>AWS: Provision Spot Instance
    AWS-->>Kubernetes: New Node Ready
    Kubernetes->>vLLM: Schedule new replica
```
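The replica math behind that scale-up step is the standard HPA formula: desired replicas equal the current count times the ratio of observed metric to target, rounded up. Here is a sketch of it applied to queue depth. The target of 5 waiting requests per pod and the replica bounds are hypothetical, and the 10 percent tolerance mirrors the HPA default.

```python
import math


def desired_replicas(current_replicas: int, waiting_per_pod: float,
                     target_waiting_per_pod: float,
                     min_replicas: int = 1, max_replicas: int = 3,
                     tolerance: float = 0.1) -> int:
    """HPA-style calculation driven by average queued requests per pod."""
    ratio = waiting_per_pod / target_waiting_per_pod
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# One pod, 20 requests queued, target of 5 per pod: wants 4, capped at 3.
print(desired_replicas(current_replicas=1, waiting_per_pod=20, target_waiting_per_pod=5))

# Three pods, empty queues: shrink back to the floor.
print(desired_replicas(current_replicas=3, waiting_per_pod=0, target_waiting_per_pod=5))
```

The formula answers in milliseconds. The GPU node behind it still takes minutes, which is why the cap and the floor matter more than the target.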

Third, the node-level autoscaler choice matters more than usual. The two options:

**Cluster Autoscaler** works with the managed node group we built. It scales predefined groups up and down. It is fine, but it is rigid. You declared `g5.2xlarge` in the node group, so `g5.2xlarge` is what you get, even when a different instance type is cheaper or more available in another AZ.

**[Karpenter](https://karpenter.sh/?utm_source=ranti.dev)** provisions nodes directly from pod requirements, no node groups needed. For GPU workloads it has three concrete advantages. It picks from a list of instance types you allow, so when `g5.2xlarge` capacity is tight in your AZ (and GPU capacity is tight more often than you would like), it can grab a `g5.4xlarge` or a `g6` instead of leaving your pod Pending. It handles Spot interruption gracefully, which matters because Spot pricing on G instances often runs far below on-demand. And its consolidation logic is aggressive about deleting idle expensive nodes, which on GPU fleets is where the real money is.

A Karpenter NodePool for this workload looks like:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        workload: llm-inference
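    spec:
      # Required in karpenter.sh/v1. "default" is a placeholder name:
      # point this at your own EC2NodeClass.
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default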
      taints:
        - key: nvidia.com/gpu
          value: present
          effect: NoSchedule
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"] # prefer spot, fall back
      expireAfter: 720h
  # Tear down idle GPU nodes fast. Every idle hour is real money.
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
  limits:
    nvidia.com/gpu: 4 # hard ceiling on spend
```

If you take the Spot route, accept the trade honestly. Spot GPU nodes get reclaimed with two minutes of notice, and a replacement pod needs several minutes to warm up. Run at least one on-demand replica as a floor, and let Spot replicas handle the elastic top. For dev and batch workloads, pure Spot is a great deal.

My honest recommendation: if you are starting fresh, use Karpenter. If you already run Cluster Autoscaler everywhere, it will work, just pad your `max` sizes and expect more Pending pods during capacity crunches.

### Gotcha 3: You cannot see GPU problems without GPU metrics

Standard node metrics tell you nothing about the GPU. You can have a node at 10 percent CPU while the GPU is thermally throttling, out of VRAM, or sitting idle while you pay for it. The fix is NVIDIA's [DCGM Exporter](https://github.com/NVIDIA/dcgm-exporter?utm_source=ranti.dev), a DaemonSet that exposes GPU telemetry in Prometheus format:

```bash
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm upgrade --install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring --create-namespace \
  --set-json 'tolerations=[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}]'
```

The four metrics that matter on day one:

- `DCGM_FI_DEV_GPU_UTIL`: GPU compute utilization. If this sits low while requests queue, your bottleneck is elsewhere (often CPU-bound tokenization or network).
- `DCGM_FI_DEV_FB_USED`: VRAM in use. Pair it with `DCGM_FI_DEV_FB_FREE` to see how close you fly to the OOM sun.
- `DCGM_FI_DEV_GPU_TEMP`: temperature. Sustained high temps mean throttling and slower tokens.
- `DCGM_FI_DEV_POWER_USAGE`: a surprisingly good "is it actually doing work" sanity check.

Scrape vLLM's own `/metrics` endpoint too. `vllm:time_to_first_token_seconds` and `vllm:time_per_output_token_seconds` are your user-facing latency truth, and `vllm:gpu_cache_usage_perc` tells you when the KV cache is the constraint. A Grafana dashboard with DCGM on one row and vLLM metrics on the next answers 90 percent of "why is it slow" questions before anyone opens a terminal.
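To make the "is the KV cache the constraint" check concrete, here is a minimal sketch that reads two gauges out of Prometheus text output. The sample payload, its HELP text, and the `llama-3-8b` label are made up for illustration. `vllm:num_requests_waiting` is a vLLM metric not discussed above, and the 0.90 threshold is an assumption to tune against your own traffic.

```python
from typing import Dict, List

# Hypothetical scrape output; in practice this is the body of GET /metrics.
SAMPLE = '''\
# HELP vllm:gpu_cache_usage_perc KV cache usage, 1 means fully used.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="llama-3-8b"} 0.93
vllm:num_requests_waiting{model_name="llama-3-8b"} 7.0
'''


def read_gauges(text: str, names: List[str]) -> Dict[str, float]:
    """Return the first sample value found for each requested metric name."""
    found: Dict[str, float] = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        # A series looks like: name{labels} value
        # Assumes no trailing timestamp on the line.
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name in names and name not in found:
            found[name] = float(value)
    return found


def kv_cache_is_constraint(gauges: Dict[str, float], threshold: float = 0.90) -> bool:
    """True when the cache is nearly full AND requests are queueing behind it."""
    cache_full = gauges.get("vllm:gpu_cache_usage_perc", 0.0) >= threshold
    queueing = gauges.get("vllm:num_requests_waiting", 0.0) > 0
    return cache_full and queueing
```

A full cache with an empty queue is healthy saturation. A full cache with a growing queue is the signal to shorten `--max-model-len`, quantize, or add a replica.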

A cheap, boring, high-value alert to set this week: GPU utilization under 5 percent for 2+ hours during business hours. That is not an outage. That is a roughly 870-dollar-a-month machine doing nothing, and finance will find it eventually. Better that you find it first.
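The logic behind that alert is simple enough to sketch in a few lines. This is an illustration of the condition, not a replacement for a real alerting rule: in Prometheus you would more likely express it with something like `max_over_time(DCGM_FI_DEV_GPU_UTIL[2h]) < 5`. The function names are hypothetical, and the business-hours filter is left out for brevity.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix_seconds, gpu_util_percent)


def longest_idle_seconds(samples: List[Sample], threshold: float = 5.0) -> float:
    """Length of the longest unbroken run of samples below the threshold."""
    longest = 0.0
    run_start: Optional[float] = None
    for ts, util in sorted(samples):
        if util < threshold:
            if run_start is None:
                run_start = ts
            longest = max(longest, ts - run_start)
        else:
            # Any busy sample ends the current idle run.
            run_start = None
    return longest


def should_alert(samples: List[Sample],
                 threshold: float = 5.0,
                 min_idle_seconds: float = 2 * 3600) -> bool:
    """Fire when the GPU has been idle for at least min_idle_seconds."""
    return longest_idle_seconds(samples, threshold) >= min_idle_seconds
```

A single busy sample resets the clock, which is the behavior you want: a GPU that serves one request an hour is underused, but it is not abandoned.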

### Gotcha 4: The KV cache OOM, a rite of passage

At some point your pod will die with a CUDA out-of-memory error, probably during a demo. The usual cause is a mismatch between `--max-model-len`, concurrency, and `--gpu-memory-utilization`. Longer contexts mean bigger per-request KV cache footprints, which means fewer concurrent requests fit. If your agents only need 8k context, do not run the server at 32k "just in case." You would be paying VRAM rent on empty rooms. Set `--max-model-len` to what you actually need, watch `vllm:gpu_cache_usage_perc`, and raise limits with data instead of vibes.
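The "VRAM rent" is easy to estimate. The sketch below computes the KV cache footprint per token for Llama 3 8B from its published architecture (32 layers, 8 KV heads under grouped-query attention, head dimension 128) with a 16-bit cache. The 5 GiB cache budget in the usage note is a hypothetical figure for illustration, and the result is a worst case: it assumes every request fills its whole context window, which vLLM's on-demand block allocation usually avoids.

```python
# Llama 3 8B architecture values from its published model config.
NUM_LAYERS = 32
NUM_KV_HEADS = 8       # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # FP16 or BF16 cache entries

GIB = 1024 ** 3


def kv_bytes_per_token() -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per head per layer.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_gib_per_request(context_tokens: int) -> float:
    """Worst-case cache footprint of a request that fills its context."""
    return kv_bytes_per_token() * context_tokens / GIB


def max_full_length_requests(cache_budget_gib: float, max_model_len: int) -> int:
    """How many maximum-length requests fit in the cache budget at once."""
    per_request = kv_bytes_per_token() * max_model_len
    return int(cache_budget_gib * GIB // per_request)
```

Each token costs 128 KiB of cache, so a full 8k context is 1 GiB and a full 32k context is 4 GiB. With a hypothetical 5 GiB cache budget, `max_full_length_requests(5.0, 8192)` returns 5 while `max_full_length_requests(5.0, 32768)` returns 1.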

If you need more headroom on the same hardware, quantization is the next lever. An FP8 variant of the same model roughly halves the weight footprint relative to FP16, and a 4-bit AWQ variant cuts it to roughly a quarter, freeing VRAM for cache and concurrency, usually with minor quality loss. vLLM supports the popular formats out of the box. Test quality on your own evals before and after. Trust, but verify.
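A rough way to see how much VRAM quantization frees up is to multiply parameter count by bits per weight. This is a back-of-envelope sketch under stated assumptions: about 8.03 billion parameters for Llama 3 8B, a nominal 24 GiB A10G (the usable figure is somewhat lower), and a placeholder 1.5 GiB for activations and CUDA context. It also ignores quantization scales and any layers left unquantized, so real AWQ checkpoints come out a little larger.

```python
PARAMS = 8.03e9  # approximate parameter count for Llama 3 8B
GIB = 1024 ** 3

BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "awq-int4": 4}


def weight_gib(fmt: str) -> float:
    """Approximate weight footprint in GiB for a given storage format."""
    return PARAMS * BITS_PER_WEIGHT[fmt] / 8 / GIB


def cache_headroom_gib(fmt: str,
                       vram_gib: float = 24.0,               # nominal A10G, an assumption
                       gpu_memory_utilization: float = 0.90,
                       overhead_gib: float = 1.5) -> float:  # placeholder estimate
    """VRAM left for KV cache after weights and a fixed overhead."""
    budget = vram_gib * gpu_memory_utilization
    return budget - weight_gib(fmt) - overhead_gib


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:>9}: weights {weight_gib(fmt):5.1f} GiB, "
              f"cache headroom {cache_headroom_gib(fmt):5.1f} GiB")
```

Under these assumptions FP16 weights take about 15 GiB and leave roughly 5 GiB for cache, while FP8 leaves more than twice that. Treat the numbers as a planning aid and confirm them against what vLLM reports at startup.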

## What This Actually Costs

Let me put rough numbers on the table, because "is this cheaper than the API" is the question your manager will ask first. These are us-east-1 ballpark figures. Verify current prices before you quote them in a planning doc, since both AWS and API vendors adjust pricing regularly.

A single on-demand `g5.2xlarge` runs around 870 dollars a month if it never scales down. The same instance on Spot has historically run at a steep discount, often well under half the on-demand rate, though Spot prices float with demand. Add the small fixed costs: the EKS control plane at 0.10 dollars an hour, an internal ALB, EBS volumes, and some CloudWatch. Call the all-in floor for a serious single-GPU setup roughly 400 to 1,000 dollars a month depending on your Spot luck and scale-to-zero discipline.

Now the other side of the ledger. An A10G serving Llama 3 8B with vLLM can sustain on the order of a couple thousand output tokens per second under continuous batching with realistic concurrent traffic. Your exact number depends on context lengths and request mix, so benchmark it yourself with vLLM's bundled `benchmark_serving` script rather than trusting anyone's blog, including this one. But even with conservative assumptions, a saturated GPU produces a very large monthly token volume at a flat price.

The honest summary has three parts:

1. **At low, spiky volume, the API wins.** If you generate a few million tokens a month, pay per token and move on. The engineering time alone outweighs any savings.
2. **At high, steady volume, self-hosting wins, often by a lot.** Agent fleets, batch document processing, and internal tools with constant traffic are the sweet spot. A saturated GPU is one of the cheapest token sources you can buy.
3. **Utilization is the whole game.** A GPU at 10 percent utilization costs the same as one at 90. Continuous batching, queue-based autoscaling, and ruthless scale-down are not optimizations. They are the business case.
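The three points above reduce to one formula: monthly cost divided by tokens actually produced. Here is a minimal sketch of that arithmetic. The 870 dollar monthly cost and the 2,000 tokens per second come from the ballpark figures in this section; the API price in the usage note is a hypothetical placeholder, so substitute your vendor's real rate.

```python
HOURS_PER_MONTH = 730  # average month


def cost_per_million_tokens(monthly_cost_usd: float,
                            tokens_per_second: float,
                            utilization: float) -> float:
    """Effective dollars per million output tokens at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_month = tokens_per_second * 3600 * HOURS_PER_MONTH * utilization
    return monthly_cost_usd / (tokens_per_month / 1e6)


def breakeven_utilization(monthly_cost_usd: float,
                          tokens_per_second: float,
                          api_price_per_million: float) -> float:
    """Utilization above which self-hosting beats the API on token cost alone."""
    at_full_load = cost_per_million_tokens(monthly_cost_usd, tokens_per_second, 1.0)
    return at_full_load / api_price_per_million


if __name__ == "__main__":
    for u in (0.1, 0.5, 0.9):
        price = cost_per_million_tokens(870, 2000, u)
        print(f"utilization {u:.0%}: ${price:.2f} per million tokens")
```

With these inputs a fully saturated GPU lands near 0.17 dollars per million tokens, and the same GPU at 10 percent utilization costs ten times that. Against a hypothetical API price of 0.60 dollars per million, `breakeven_utilization(870, 2000, 0.60)` comes out around 0.28. This counts only the instance, not engineering time or the fixed costs listed earlier.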

There is also a cost the spreadsheet misses: the option value. Once the serving path exists, trying a new open model is a one-line change to the Deployment. That iteration speed compounds.

## Wrapping Up: Your Cluster Is the Platform

Step back and look at what we built. A model server with an industry-standard API, running on hardware you control, scaled by tooling you already operate, monitored by the Prometheus stack you already run. No new platform. No new vendor dashboard. The skills that made you good at Kubernetes made you good at this.

That is the real story of this industry shift. AI infrastructure is converging on regular infrastructure. The models get the headlines, but the winners of the next few years will be the teams who can serve them reliably, cheaply, and inside their own security boundary. GPUs will keep getting more available, open models will keep closing the quality gap, and tools like vLLM will keep abstracting the serving layer. The Kubernetes operators of the world are quietly inheriting the AI platform job, whether the org chart says so yet or not.

Where to go from here:

1. **Run the numbers.** Take one month of your OpenAI or Anthropic usage and price it against a Spot `g5` fleet. The answer is not always "self-host," and that is fine. Now you decide with data.
2. **Start with one internal workload.** A summarization endpoint, a log-triage agent, an internal coding assistant. Low blast radius, real learning.
3. **Treat the model like a dependency.** Pin versions, eval before upgrading, keep a rollback path. Everything you learned from years of "the new library version broke prod" applies directly.

I am planning follow-ups on multi-GPU tensor parallelism for 70B models and on benchmarking vLLM against TGI and TensorRT-LLM on identical hardware. If one of those would help you more than the other, tell me.

And a question for you: what is keeping your team on closed APIs today? Is it model quality, ops capacity, or just inertia? Reply, or find me on the socials. The best gotchas in my next post will come from your war stories.


---

<!-- METADATA_START -->
## Metadata & Citations

### BibTeX
```bibtex
@article{vllm-on-eks_2026,
  author = {Rantideb Howlader},
  title = {Running Local LLM Agents in Kubernetes: A Practitioner's Guide to vLLM on EKS},
  journal = {Rantideb Howlader Portfolio},
  year = {2026},
  url = {https://www.ranti.dev/blog/vllm-on-eks},
  note = {Accessed: 2026-06-24}
}
```

### IEEE
Rantideb Howlader, "Running Local LLM Agents in Kubernetes: A Practitioner's Guide to vLLM on EKS," Rantideb Howlader Portfolio, 2026. [Online]. Available: https://www.ranti.dev/blog/vllm-on-eks. [Accessed: 2026-06-24].

### APA
Rantideb Howlader. (2026). Running Local LLM Agents in Kubernetes: A Practitioner's Guide to vLLM on EKS. Rantideb Howlader. Retrieved from https://www.ranti.dev/blog/vllm-on-eks

--- 
*This content is provided in research-grade Markdown format. Required Attribution: Cite as Rantideb Howlader (2026).*
<!-- METADATA_END -->