
## Introduction: The Murder Mystery

Debugging a Monolith is easy.
You look at the `server.log`. You see the Stack Trace. You fix the bug.

Debugging Microservices is a Murder Mystery.
User Alice clicks "Buy."
Service A calls Service B.
Service B calls Service C.
Service C calls the Database.
The Database times out.
Service C returns 500.
Service B retries... and fails.
Service A shows "Error: Unknown."

You look at Service A's logs. It says "Error calling B".
You look at Service B's logs. It says "Error calling C".
You look at Service C's logs. It says "DB Timeout".

But here is the catch:
There are 1,000 requests per second.
Which log line belongs to Alice?
Service A logged at 12:00:01.
Service C logged at 12:00:02.
Are they related? Or is it a coincidence?

Without **Observability**, you are guessing.
You are frantically grepping logs while the CTO breathes down your neck.

Let's fix this.
We will move beyond "Monitoring" (Is it up?) to "Observability" (Why is it weird?).
We will master the **Three Pillars**: Metrics, Logs, and Traces.
And we will learn about **OpenTelemetry**, the standard that binds them all.

---

## Monitoring vs Observability

- **Monitoring**: "The CPU is at 90%."
  - It answers "Known Unknowns". (I know CPU can get high, so I watch it).
- **Observability**: "The CPU is at 90% because user 123 sent a malformed JSON payload that triggered an infinite regex loop in the payment library."
  - It answers "Unknown Unknowns". (I didn't know that could happen).

If your dashboard is just red/green lights, you have Monitoring.
If your dashboard lets you click into a spike and find the specific user who caused it, you have Observability.

---

## The Three Pillars

### 1. Metrics (The Dashboard)

Metrics are numbers. Aggregations.

- `http_requests_total = 500`
- `cpu_usage = 80%`
- **Pros**: Cheap. You can store 10 years of metrics.
- **Cons**: No context. "Error rate is 5%". Okay... which 5%? Is it iPhone users? Is it the /admin page? Metrics don't tell you.

### 2. Logs (The Story)

Logs are text.

- `2023-10-01 ERROR: NullPointerException in User.java:50`
- **Pros**: Infinite detail.
- **Cons**: Expensive. Logging every request at scale costs a fortune. Hard to search (needle in a haystack).

### 3. Traces (The Map)

Traces are the glue.
A Trace follows a single request as it jumps between services.

- `TraceID: abc-123`
  - `Span 1`: Service A (took 50ms)
  - `Span 2`: Service B (took 200ms)
    - `Span 3`: Database Query (took 190ms - **Here is the problem!**)
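
A trace is just a tree of timed spans. Here is a minimal sketch of the trace above, assuming a made-up `Span` class (not any real tracing library's type). It finds the culprit by "self time": a span's duration minus the time spent in its children.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One unit of work in a trace (hypothetical minimal model)."""
    name: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)


def walk(spans):
    """Yield every span in the tree, parents before children."""
    for span in spans:
        yield span
        yield from walk(span.children)


def self_time_ms(span: Span) -> int:
    """Time spent in the span itself, excluding its direct children."""
    return span.duration_ms - sum(child.duration_ms for child in span.children)


# The trace from the list above: the database query is nested under Service B.
trace = [
    Span("Service A", 50),
    Span("Service B", 200, [Span("Database Query", 190)]),
]

culprit = max(walk(trace), key=self_time_ms)
print(culprit.name, self_time_ms(culprit))  # Database Query 190
```

Service B looks slow at 200ms, but only 10ms of that is its own work. The rest is the query.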

---

## Metrics Deep Dive (Prometheus)

Prometheus is the king of metrics.
It uses a "Pull" model: it scrapes your app's `/metrics` endpoint on a fixed interval (15 seconds is a common setting).

**Key Concept: Labels (Dimensions)**
Old way: `metric_name: cpu_usage`
Prometheus way: `cpu_usage{host="server-1", env="prod", app="payment"}`

**The Cardinality Explosion (The Trap)**:
Labels are great. But use them wisely.
If you add a label `user_id`...
And you have 1 million users...
Prometheus creates 1 million separate time series.
Memory usage explodes. Prometheus crashes.
**Rule**: Never put high-cardinality data (IDs, Emails, UUIDs) in Metrics. Put them in Logs or Traces.
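
The math behind the explosion is simple multiplication. A rough sketch, assuming invented label counts (20 hosts, 3 environments, 10 apps): the worst-case number of time series for one metric is the product of the distinct values of each label.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical counts for the labels used above.
safe = {"host": 20, "env": 3, "app": 10}
risky = {**safe, "user_id": 1_000_000}

print(series_count(safe))   # 600
print(series_count(risky))  # 600000000
```

This is an upper bound, since not every combination shows up in practice. But one unbounded label is enough to dominate the total.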

---

## Structured Logging (JSON)

Stop logging text.
`logger.info("User " + user + " logged in")` -> This is garbage. You cannot query it.

Start logging JSON.

```json
{
  "level": "info",
  "msg": "User logged in",
  "user_id": "123",
  "ip": "10.0.0.1",
  "duration_ms": 45
}
```

Now you can run queries like this one (CloudWatch Logs Insights syntax; Splunk and other tools have equivalents):
`filter duration_ms > 500 and ip = "10.0.0.1"`

**Context Propagation**:
Every log line must have a `trace_id`.
This is how you link Logs to Traces.
When you see a slow trace in Jaeger, you copy the ID, paste it into your logs, and see exactly what happened.
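
A minimal sketch of this using only Python's standard `logging` module. The `JsonFormatter` class and the hard-coded `trace_id` are assumptions for illustration; in a real service the ID would come from your tracing library's current context.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Extra fields we copy into the JSON if the caller supplied them.
    FIELDS = ("trace_id", "user_id", "ip", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in self.FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "User logged in",
    extra={"trace_id": "abc-123", "user_id": "123", "ip": "10.0.0.1", "duration_ms": 45},
)
```

Every line now carries the same `trace_id` the tracing backend shows, so the copy-paste lookup works.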

---

## Distributed Tracing (The "Ah-Ha" Moment)

You install an agent (OpenTelemetry or X-Ray).
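The agent automatically injects a trace header into your HTTP calls.

- `traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01` (the W3C Trace Context header that OpenTelemetry uses by default; X-Ray uses `X-Amzn-Trace-Id` instead)

Service B sees this header, reuses the same trace ID, and passes it to Service C.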
The backend (Tempo/Jaeger/X-Ray) stitches them together visually.
You see a Waterfall graph.
You instantly see the long bar.
"Oh, the Redis call took 2 seconds."
Case closed.
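
What does the agent actually do with that header? A simplified sketch, assuming the W3C Trace Context `traceparent` format (`version-traceid-spanid-flags`) that OpenTelemetry propagates by default. The function names are made up, and the spec's extra validity checks (such as rejecting all-zero IDs) are left out.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: random trace ID, random span ID, 'sampled' flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Continue an incoming trace: keep the trace ID and flags, mint a new span ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Unparseable header: start a fresh trace instead of failing the request.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Service A starts the trace; Service B forwards it to Service C.
from_a = new_traceparent()
from_b = child_traceparent(from_a)
print(from_a)
print(from_b)
```

The trace ID stays the same on every hop. Only the span ID changes. That shared ID is what lets the backend stitch the waterfall together.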

---

## OpenTelemetry (The Standard)

In the past, you used the Datadog Agent, or the New Relic Agent. You were locked in.
If you wanted to switch to Prometheus, you had to rewrite code.

**OpenTelemetry (OTel)** is an open standard (CNCF).

1.  Use the OTel SDK in your code.
2.  It sends data to the **OTel Collector** (a proxy).
3.  The Collector sends Metrics to Prometheus, Logs to Loki, and Traces to Jaeger.

If you want to switch vendors? Just change the config in the Collector. No code changes.
**This is the future.** Implement OTel today.

---

## The USE Method (Brendan Gregg)

How do you start? What dashboard do you build first?
Use the **USE Method** for every resource (CPU, Disk, Memory):

1.  **Utilization**: How busy is it? (e.g., CPU 90%).
2.  **Saturation**: Is work queuing up? (e.g., Load Average, Disk Queue Length).
3.  **Errors**: Are there hardware/software errors?

If Utilization is high but Saturation is low -> You are fine.
If Saturation is high -> You have a bottleneck. Performance will degrade non-linearly.

---

## The RED Method (Tom Wilkie)

For Microservices (HTTP APIs), use **RED**:

1.  **Rate**: Requests per second (Traffic).
2.  **Errors**: Failed requests per second.
3.  **Duration**: Latency (p50, p90, p99).

**Why p99?**:
"Average Latency" is a lie.
If 98 requests take 10ms and 2 requests take 10 seconds (timeout):
Average = ~210ms. Looks okay.
But those 2 users are angry. And one of them might be your biggest customer.
The p99 is 10 seconds. It shows what the average hides.
Optimize for the 99th percentile (p99).
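
You can check the claim yourself. A small sketch, assuming a sample of 100 requests where 98 take 10ms and 2 hit a 10-second timeout, and using the nearest-rank definition of a percentile (monitoring systems often estimate percentiles from histogram buckets instead).

```python
import math
from statistics import mean


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 98 fast requests, 2 timeouts.
latencies_ms = [10.0] * 98 + [10_000.0] * 2

print(mean(latencies_ms))            # 209.8
print(percentile(latencies_ms, 50))  # 10.0
print(percentile(latencies_ms, 99))  # 10000.0
```

The average and the median both look healthy. Only the p99 shows the timeouts.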

---

## Observability Glossary for SREs

- **Cardinality**: The number of unique values in a set. (Low: Status codes. High: User IDs).
- **Sampling**: Tracing 100% of requests is usually too much data. You either keep a fixed fraction such as 1%, decided up front (Head Sampling), or keep only the "interesting" traces, decided after they finish (Tail Sampling).
- **Span**: A single unit of work in a trace.
- **Exemplar**: Linking a specific Trace ID to a specific Metric bucket. ("Show me a trace that represents this p99 latency spike").
- **SLO (Service Level Objective)**: The target reliability (e.g., "99.9% success").
- **SLA (Service Level Agreement)**: The contract (If we miss 99.9%, we owe you money).

## OpenTelemetry Config Deep Dive

The **OTel Collector** is the "Swiss Army Knife" of observability.
It has 3 parts:

1.  **Receivers**: "Listen on Port 4317 for incoming data."
2.  **Processors**: "Clean up the data."
3.  **Exporters**: "Send it to Grafana Cloud."

**Example Config (`config.yaml`)**:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

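processors:
  batch:
  # The Cool Part: Redacting Passwords
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # [^']* stops at the closing quote; a greedy .* could swallow the rest of the statement
          - replace_pattern(attributes["db.statement"], "password='[^']*'", "password='***'")

exporters:
  # Traces need a trace-capable exporter; the prometheus exporter only handles metrics.
  # "tempo:4317" is a placeholder endpoint for an OTLP backend such as Tempo or Jaeger.
  otlp:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Redact first, then batch for export.
      processors: [transform, batch]
      exporters: [otlp]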
```

This processor layer is powerful. You can drop expensive traces, redact PII (GDPR compliance), or enrich data with K8s metadata before it leaves your network.
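
One thing worth testing before you trust a redaction rule: how greedy is the regex? The sketch below uses Python's `re` module as a stand-in for the Collector's regex engine, which handles these two patterns the same way. The SQL statement is made up. It compares `password='.*'` with a bounded `password='[^']*'`.

```python
import re

# Hypothetical span attribute value with a second quoted field after the password.
statement = "SELECT * FROM users WHERE password='hunter2' AND active='1'"

# Greedy: .* runs to the LAST quote on the line.
greedy = re.sub(r"password='.*'", "password='***'", statement)

# Bounded: [^']* stops at the password's own closing quote.
bounded = re.sub(r"password='[^']*'", "password='***'", statement)

print(greedy)   # SELECT * FROM users WHERE password='***'
print(bounded)  # SELECT * FROM users WHERE password='***' AND active='1'
```

Both hide the password. But the greedy version also eats the rest of the statement, which is exactly the context you wanted to keep for debugging.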

---

## Logs (Loki vs Elasticsearch)

**Elasticsearch (ELK)**: "Index Everything."

- Pros: Fast search.
- Cons: Massive storage cost. Indexes are huge.

**Loki (Grafana)**: "Index Metadata Only."
Loki doesn't index the log text. It only indexes the labels (`{app="frontend"}`).
To find a word, it "greps" the logs in real-time.

- Pros: 90% cheaper storage (S3).
- Cons: "Slow" queries if you don't use labels properly.

**LogQL (Query Language)**:
`{app="frontend"} |= "error" | json | latency > 500ms`
This pipeline tells Loki:

1.  Find logs for frontend.
2.  Grep for error.
3.  Parse the JSON line.
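4.  Filter where the `latency` field is greater than 500ms. (The field must hold a duration such as `750ms`. For a plain number of milliseconds, write `latency > 500`.)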
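
If the pipe syntax is new to you, it helps to see the same four stages as plain code. This is only an analogy in Python, not how Loki is implemented. The log lines are invented, and `latency` is assumed to be a number of milliseconds.

```python
import json

# Hypothetical log streams, keyed by their `app` label.
streams = {
    "frontend": [
        '{"level": "error", "msg": "checkout failed", "latency": 750}',
        '{"level": "info", "msg": "checkout ok", "latency": 900}',
        '{"level": "error", "msg": "cart failed", "latency": 120}',
        "plain text error line that is not JSON",
    ],
    "backend": ['{"level": "error", "msg": "db timeout", "latency": 5000}'],
}


def query(lines):
    """Mimic: |= "error" | json | latency > 500"""
    for line in lines:
        if "error" not in line:        # stage 2: grep the raw text
            continue
        try:
            fields = json.loads(line)  # stage 3: parse the JSON line
        except json.JSONDecodeError:
            continue                   # simplification: skip unparseable lines
        if fields.get("latency", 0) > 500:  # stage 4: filter on a parsed field
            yield fields


# Stage 1: the label selector picks the stream before any text is scanned.
for hit in query(streams["frontend"]):
    print(hit["msg"], hit["latency"])  # checkout failed 750
```

The order matters. The label selector and the cheap text grep shrink the data before the expensive JSON parsing runs.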

---

## The Math of SLOs (Service Level Objectives)

SLAs are contracts. SLOs are internal goals.
How do you calculate them?

**The Error Budget**:
If your SLO is 99.9% availability, you have a `100% - 99.9% = 0.1%` Error Budget.
In a 30-day month (43,200 minutes), that allows about **43 minutes** of downtime.
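
The arithmetic, as a tiny sketch. It assumes a 30-day window (43,200 minutes); the function name is made up.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(error_budget_minutes(0.9999), 2))  # 4.32
```

Each extra nine cuts the budget by ten. At 99.99% you get about four minutes a month.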

**Burn Rate Alerts**:
Don't page me if I have 1 error.
Page me if I am "Burning" my budget too fast.
"At this rate, we will exhaust our monthly budget in 4 hours." -> **CRITICAL ALERT**.
"At this rate, we will exhaust it in 3 days." -> **TICKET (Work on it tomorrow)**.

This prevents "Alert Fatigue."
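
Burn rate is the observed error rate divided by the error budget. A burn rate of 1 means you spend the budget exactly over the window. A hedged sketch, assuming a 30-day window and using the two examples above as cutoffs (4 hours pages, 3 days opens a ticket). The function names and thresholds are illustrative, not a standard.

```python
import math


def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_rate / (1 - slo)


def hours_to_exhaustion(error_rate: float, slo: float, window_days: int = 30) -> float:
    """Hours until a full window's budget is gone if the current error rate holds."""
    rate = burn_rate(error_rate, slo)
    return math.inf if rate == 0 else window_days * 24 / rate


def severity(error_rate: float, slo: float) -> str:
    """Map time-to-exhaustion onto the two alert levels described above."""
    hours = hours_to_exhaustion(error_rate, slo)
    if hours <= 4:
        return "page"
    if hours <= 72:
        return "ticket"
    return "ok"


SLO = 0.999
print(severity(0.20, SLO))    # page   (budget gone in about 3.6 hours)
print(severity(0.02, SLO))    # ticket (budget gone in about 36 hours)
print(severity(0.0005, SLO))  # ok     (burning at half the allowed rate)
```

A single error never pages anyone here. Only a sustained rate that threatens the budget does.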

---

## High Cardinality (The Danger Zone)

What happens if you accidentally add `user_id` to a metric label?
`http_requests_total{user_id="123"}`

If you have 1 million users...
Prometheus tries to create 1 million time series.
Each series takes RAM.
Prometheus runs out of RAM (OOM Kill).
Your monitoring goes down.

**The Fix**:

1.  **Drop the label**: In the OTel Collector's `attributes` processor, use `action: delete` with `key: user_id`.
2.  **Use Logs/Traces**: It's okay to have high cardinality in Loki or Jaeger. Just not in Prometheus.

---

## SRE Toolbox Glossary

- **Instrumentation**: The code you add to your app to emit telemetry. (Manual vs Auto).
- **Head Sampling**: Deciding to keep a trace at the start of the request (Random 1%).
- **Tail Sampling**: Deciding to keep a trace at the end (Keep only if Error). This is better but expensive (needs to buffer all traces in memory).
- **Span Context**: The trace ID, span ID, and flags passed between services in request headers.
- **Baggage**: Data passed alongside the trace (e.g., `CustomerId=123`) that every service can read.
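
The two sampling styles differ in when the decision happens. A rough sketch with made-up function names. The head sampler hashes the trace ID, a deterministic variant of "random 1%" that lets every service reach the same verdict. The tail sampler looks at the finished trace. The 1000ms "slow" cutoff is an assumption.

```python
import hashlib


def head_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Decide at request start, from the trace ID alone, so all services agree."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate


def tail_sample(spans: list[dict], slow_ms: int = 1000) -> bool:
    """Decide after the trace is complete: keep it if any span errored or was slow."""
    return any(
        bool(span.get("error")) or span.get("duration_ms", 0) >= slow_ms
        for span in spans
    )


kept = sum(head_sample(f"trace-{i}") for i in range(100_000))
print(kept)  # roughly 1% of 100,000

boring = [{"duration_ms": 50}, {"duration_ms": 20}]
failed = [{"duration_ms": 50}, {"duration_ms": 20, "error": True}]
print(tail_sample(boring), tail_sample(failed))  # False True
```

Head sampling is cheap but blind: it may throw away the one trace that failed. Tail sampling keeps it, at the cost of holding every span until the trace finishes.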

## Conclusion: Driving with Eyes Open

Running a distributed system without Observability is like driving a car with the windshield painted black.
You might be moving, but you are going to crash.

You don't need fancy tools. Start simple:

1.  Structured Logs (JSON).
2.  Standard Metrics (RED Method).
3.  Basic Tracing.

Once you have visibility, debugging becomes fun again. You stop guessing and start solving.

### Further Reading

- [Google SRE Book (Monitoring Chapter)](https://sre.google/sre-book/monitoring-distributed-systems/)
- [OpenTelemetry Documentation](https://opentelemetry.io/)
- [The USE Method (Brendan Gregg)](https://www.brendangregg.com/usemethod.html)


---

<!-- METADATA_START -->
## Metadata & Citations

---
title: Observability 101: Why Logs Are Not Enough
author: Rantideb Howlader
date: 2026-01-17T00:00:00.000Z
canonical_url: https://www.ranti.dev/blog/observability-metrics-tracing
license: CC-BY-4.0
---
*This content is provided in research-grade Markdown format. Required Attribution: Cite as Rantideb Howlader (2026).*
<!-- METADATA_END -->