Observability 101: Why Logs Are Not Enough

Rantideb Howlader · 9 min read

Introduction: The Murder Mystery

Debugging a Monolith is easy. You look at the server.log. You see the Stack Trace. You fix the bug.

Debugging Microservices is a Murder Mystery. User Alice clicks "Buy." Service A calls Service B. Service B calls Service C. Service C calls the Database. The Database times out. Service C returns 500. Service B retries... and fails. Service A shows "Error: Unknown."

You look at Service A's logs. It says "Error calling B". You look at Service B's logs. It says "Error calling C". You look at Service C's logs. It says "DB Timeout".

But here is the catch: There are 1,000 requests per second. Which log line belongs to Alice? Service A logged at 12:00:01. Service C logged at 12:00:02. Are they related? Or is it a coincidence?

Without Observability, you are guessing. You are frantically grepping logs while the CTO breathes down your neck.

In this guide, we are going to fix this. We will move beyond "Monitoring" (Is it up?) to "Observability" (Why is it weird?). We will master the Three Pillars: Metrics, Logs, and Traces. And we will learn about OpenTelemetry, the standard that binds them all.


Part 1: Monitoring vs Observability

  • Monitoring: "The CPU is at 90%."
    • It answers "Known Unknowns". (I know CPU can get high, so I watch it).
  • Observability: "The CPU is at 90% because user 123 sent a malformed JSON payload that triggered an infinite regex loop in the payment library."
    • It answers "Unknown Unknowns". (I didn't know that could happen).

If your dashboard is just red/green lights, you have Monitoring. If your dashboard lets you click into a spike and find the specific user who caused it, you have Observability.


Part 2: The Three Pillars

1. Metrics (The Dashboard)

Metrics are numbers. Aggregations.

  • http_requests_total = 500
  • cpu_usage = 80%
  • Pros: Cheap. You can store 10 years of metrics.
  • Cons: No context. "Error rate is 5%". Okay... which 5%? Is it iPhone users? Is it the /admin page? Metrics don't tell you.

2. Logs (The Story)

Logs are text.

  • 2023-10-01 ERROR: NullPointerException in User.java:50
  • Pros: Infinite detail.
  • Cons: Expensive. Logging every request at scale costs a fortune. Hard to search (needle in a haystack).

3. Traces (The Map)

Traces are the glue. A Trace follows a single request as it jumps between services.

  • TraceID: abc-123
    • Span 1: Service A (took 50ms)
    • Span 2: Service B (took 200ms)
      • Span 3: Database Query (took 190ms - Here is the problem!)

Part 3: Metrics Deep Dive (Prometheus)

Prometheus is the king of metrics. It uses a "Pull" model: it scrapes your app's /metrics endpoint (every 15 seconds by default).

Key Concept: Labels (Dimensions)

  • Old way: cpu_usage
  • Prometheus way: cpu_usage{host="server-1", env="prod", app="payment"}

The Cardinality Explosion (The Trap): Labels are great. But use them wisely. If you add a label user_id... And you have 1 million users... Prometheus creates 1 million separate time series. Memory usage explodes. Prometheus crashes. Rule: Never put high-cardinality data (IDs, Emails, UUIDs) in Metrics. Put them in Logs or Traces.
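
Here is what well-behaved labels look like in code — a minimal sketch using the Python prometheus_client library (the metric and label names are illustrative):

from prometheus_client import Counter, start_http_server

# Low-cardinality labels only: method and status code. Never user IDs.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "status"],
)

start_http_server(8000)  # exposes /metrics for Prometheus to scrape

def handle_checkout():
    # ... business logic ...
    REQUESTS.labels(method="POST", status="200").inc()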


Part 4: Structured Logging (JSON)

Stop logging text.

logger.info("User " + user + " logged in")

This is garbage. You cannot query it.

Start logging JSON.

{
  "level": "info",
  "msg": "User logged in",
  "user_id": "123",
  "ip": "10.0.0.1",
  "duration_ms": 45
}

Now you can run queries in CloudWatch/Splunk: filter duration_ms > 500 and ip = "10.0.0.1"

Context Propagation: Every log line must have a trace_id. This is how you link Logs to Traces. When you see a slow trace in Jaeger, you copy the ID, paste it into your logs, and see exactly what happened.
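
A minimal sketch of this in Python, using only the standard library (the field names mirror the JSON above; wiring in a real trace_id is your tracing SDK's job):

import json
import logging

class JsonFormatter(logging.Formatter):
    FIELDS = ("user_id", "ip", "duration_ms", "trace_id")

    def format(self, record):
        payload = {"level": record.levelname.lower(), "msg": record.getMessage()}
        # Copy any structured fields passed via `extra=...`
        payload.update({k: v for k, v in record.__dict__.items() if k in self.FIELDS})
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "User logged in",
    extra={"user_id": "123", "ip": "10.0.0.1", "duration_ms": 45, "trace_id": "abc-123"},
)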


Part 5: Distributed Tracing (The "Ah-Ha" Moment)

You install an agent (OpenTelemetry or X-Ray). The agent automatically injects headers into your HTTP calls.

  • X-Trace-Id: abc-123 (in practice, OpenTelemetry uses the W3C traceparent header; X-Ray uses X-Amzn-Trace-Id)

Service B sees this header, uses the same ID, and passes it to Service C. The backend (Tempo/Jaeger/X-Ray) stitches them together visually. You see a Waterfall graph. You instantly see the long bar. "Oh, the Redis call took 2 seconds." Case closed.
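
In code, the propagation step looks roughly like this — a sketch using the opentelemetry-api package and requests (the service name and URL are made up):

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("service-a")

def call_service_b():
    with tracer.start_as_current_span("call-service-b"):
        headers = {}
        inject(headers)  # adds the trace context header for the current span
        return requests.get("http://service-b:8080/charge", headers=headers)

Service B extracts that context on its side, so its spans land under the same TraceID.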


Part 6: OpenTelemetry (The Standard)

In the past, you used the Datadog Agent, or the New Relic Agent. You were locked in. If you wanted to switch to Prometheus, you had to rewrite code.

OpenTelemetry (OTel) is an open standard (CNCF).

  1. Use the OTel SDK in your code.
  2. It sends data to the OTel Collector (a proxy).
  3. The Collector sends Metrics to Prometheus, Logs to Loki, and Traces to Jaeger.

If you want to switch vendors? Just change the config in the Collector. No code changes. This is the future. Implement OTel today.
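
A minimal Python setup might look like this, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages and a Collector listening on the default OTLP gRPC port:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# 1. The SDK: create spans in your code.
provider = TracerProvider(resource=Resource.create({"service.name": "payment"}))

# 2. Ship them to the OTel Collector over OTLP/gRPC (port 4317).
exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# 3. The Collector decides where the data goes -- switching vendors is a config change.
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("charge-card"):
    pass  # business logic here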


Part 7: The USE Method (Brendan Gregg)

How do you start? What dashboard do you build first? Use the USE Method for every resource (CPU, Disk, Memory):

  1. Utilization: How busy is it? (e.g., CPU 90%).
  2. Saturation: Is work queuing up? (e.g., Load Average, Disk Queue Length).
  3. Errors: Are there hardware/software errors?

If Utilization is high but Saturation is low -> You are fine. If Saturation is high -> You have a bottleneck. Performance will degrade non-linearly.
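
As a rough sketch, a USE snapshot for the CPU can be pulled straight from the OS (Unix-only; assumes the psutil package is installed):

import os
import psutil

utilization = psutil.cpu_percent(interval=1)   # Utilization: how busy the CPU is (%)
load_1m, _, _ = os.getloadavg()                # Saturation proxy: run-queue length
cores = psutil.cpu_count()

print(f"CPU utilization: {utilization:.0f}%")
print(f"Load average (1m): {load_1m:.2f} across {cores} cores")
if load_1m > cores:
    print("Saturated: work is queuing up")
# Errors (the third letter of USE) come from elsewhere, e.g. kernel logs or hardware counters.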


Part 8: The RED Method (Tom Wilkie)

For Microservices (HTTP APIs), use RED:

  1. Rate: Requests per second (Traffic).
  2. Errors: Failed requests per second.
  3. Duration: Latency (p50, p90, p99).

Why p99?: "Average Latency" is a lie. If 99 users get 10ms and 1 user gets 10 seconds (a timeout), the average is ~110ms. Looks okay. But that 1 user is angry. And that 1 user might be your biggest customer. Optimize for the 99th percentile (p99).
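
You can check the arithmetic with a few lines of Python (standard library only):

import statistics

# 99 users get 10 ms; 1 user hits a 10-second timeout.
latencies_ms = [10] * 99 + [10_000]

average = statistics.mean(latencies_ms)
p99 = statistics.quantiles(latencies_ms, n=100)[-1]  # the 99th percentile cut point

print(f"average = {average:.0f} ms")  # ~110 ms: the dashboard looks fine
print(f"p99     = {p99:.0f} ms")      # ~9900 ms: the angry user is visible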


Part 9: Glossary for the SRE

  • Cardinality: The number of unique values in a set. (Low: Status codes. High: User IDs).
  • Sampling: You can't trace 100% of requests (too much data). You sample 1% at random (Head Sampling) or keep only the "interesting" ones (Tail Sampling).
  • Span: A single unit of work in a trace.
  • Exemplar: Linking a specific Trace ID to a specific Metric bucket. ("Show me a trace that represents this p99 latency spike").
  • SLO (Service Level Objective): The target reliability (e.g., "99.9% success").
  • SLA (Service Level Agreement): The contract (If we miss 99.9%, we owe you money).

Part 10: OpenTelemetry Config Deep Dive

The OTel Collector is the "Swiss Army Knife" of observability. It has 3 parts:

  1. Receivers: "Listen on Port 4317 for incoming data."
  2. Processors: "Clean up the data."
  3. Exporters: "Send it to Grafana Cloud."

Example Config (config.yaml):

receivers:
  otlp:
    protocols:
      grpc:
 
processors:
  batch:
  # The Cool Part: Redacting Passwords
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - replace_pattern(attributes["db.statement"], "password='.*'", "password='***'")
 
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # Prometheus scrapes the Collector here (metrics only)
  otlp:
    endpoint: "tempo:4317"     # illustrative tracing backend address (Tempo/Jaeger)
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, transform]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

This processor layer is powerful. You can drop expensive traces, redact PII (GDPR compliance), or enrich data with K8s metadata before it leaves your network.


Part 11: Logs (Loki vs Elasticsearch)

Elasticsearch (ELK): "Index Everything."

  • Pros: Fast search.
  • Cons: Massive storage cost. Indexes are huge.

Loki (Grafana): "Index Metadata Only." Loki doesn't index the log text. It only indexes the labels ({app="frontend"}). To find a word, it "greps" the logs in real-time.

  • Pros: 90% cheaper storage (S3).
  • Cons: "Slow" queries if you don't use labels properly.

LogQL (Query Language):

{app="frontend"} |= "error" | json | latency > 500ms

This pipeline tells Loki:

  1. Find logs for frontend.
  2. Grep for "error".
  3. Parse the JSON line.
  4. Filter where latency field is > 500.
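
If you want to run that query programmatically, Loki exposes it over HTTP — a sketch assuming a Loki instance at localhost:3100:

import time
import requests

query = '{app="frontend"} |= "error" | json | latency > 500ms'
now_ns = int(time.time() * 1e9)

resp = requests.get(
    "http://localhost:3100/loki/api/v1/query_range",
    params={
        "query": query,
        "start": now_ns - 3600 * 10**9,  # last hour, nanosecond timestamps
        "end": now_ns,
        "limit": 100,
    },
)
for stream in resp.json()["data"]["result"]:
    for _timestamp, line in stream["values"]:
        print(line)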

Part 12: The Math of SLOs (Service Level Objectives)

SLAs are contracts. SLOs are internal goals. How do you calculate them?

The Error Budget: If your SLO is 99.9% Availability, your Error Budget is 100% - 99.9% = 0.1%. In a 30-day month (43,200 minutes), you are allowed about 43 minutes of downtime.

Burn Rate Alerts: Don't page me if I have 1 error. Page me if I am "Burning" my budget too fast. "At this rate, we will exhaust our monthly budget in 4 hours." -> CRITICAL ALERT. "At this rate, we will exhaust it in 3 days." -> TICKET (Work on it tomorrow).

This prevents "Alert Fatigue."
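
To make the arithmetic concrete, here is a small sketch (the observed error rate is a made-up value you would pull from your metrics):

# 99.9% availability SLO over a 30-day month.
slo = 0.999
minutes_in_month = 30 * 24 * 60                     # 43,200 minutes
budget_minutes = (1 - slo) * minutes_in_month
print(f"Error budget: {budget_minutes:.1f} minutes/month")    # ~43.2 minutes

# Burn rate: how fast the last hour is eating that budget.
observed_error_rate = 0.01                          # hypothetical: 1% of requests failing
burn_rate = observed_error_rate / (1 - slo)         # 10x
hours_to_exhaust = (minutes_in_month / 60) / burn_rate
print(f"Burn rate {burn_rate:.0f}x -> budget gone in ~{hours_to_exhaust:.0f} hours")  # ~72h: open a ticket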


Part 13: High Cardinality (The Danger Zone)

What happens if you accidentally add user_id to a metric label? http_requests_total{user_id="123"}

If you have 1 million users... Prometheus tries to create 1 million separate time series. Each series takes RAM. Prometheus runs out of RAM (OOM Kill). Your monitoring goes down.
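
A back-of-the-envelope sketch of the damage (the per-series memory figure is a rough assumption, not a Prometheus guarantee):

users = 1_000_000
endpoints = 20
status_codes = 5
bytes_per_series = 4 * 1024        # assume a few KB of RAM per active time series

series = users * endpoints * status_codes           # 100,000,000 time series
ram_gb = series * bytes_per_series / 1024**3
print(f"{series:,} series -> roughly {ram_gb:,.0f} GB of RAM")   # ~380 GB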

The Fix:

  1. Drop the label: In the OTel Collector's attributes processor, add key: user_id with action: delete.
  2. Use Logs/Traces: It's okay to have high cardinality in Loki or Jaeger. Just not in Prometheus.

Part 14: Glossary for the SRE (Continued)

  • Instrumentation: The code you add to your app to emit telemetry. (Manual vs Auto).
  • Head Sampling: Deciding to keep a trace at the start of the request (Random 1%).
  • Tail Sampling: Deciding to keep a trace at the end (Keep only if Error). This is better but expensive (needs to buffer all traces in memory).
  • Span Context: The hidden IDs (trace ID and span ID) passed between services in request headers.
  • Baggage: Data passed alongside the trace (e.g., CustomerId=123) that every service can read.

Conclusion: Driving with Eyes Open

Running a distributed system without Observability is like driving a car with the windshield painted black. You might be moving, but you are going to crash.

You don't need fancy tools. Start simple:

  1. Structured Logs (JSON).
  2. Standard Metrics (RED Method).
  3. Basic Tracing.

Once you have visibility, debugging becomes fun again. You stop guessing and start solving.
