Observability Strategy for Modern Applications: Logs, Metrics, Traces, and Business Signals

Most teams conflate monitoring with observability and build systems that can only answer questions they thought to ask in advance. That gap gets expensive when production incidents occur and the only available information is a dashboard that shows something is red.
Monitoring tells you when a known condition is true. Observability tells you what is happening inside a system you have never seen fail before. The distinction matters because modern distributed applications fail in novel ways, and the tools that work well for a single server fall apart for a fleet of microservices.
Why Most Observability Efforts Fail
The most common failure pattern looks like this:
- Engineers add logging because something broke and there was no record
- A monitoring tool gets connected to alert on CPU and memory
- A distributed tracing library gets added when latency becomes undebuggable
- A business intelligence tool measures revenue and conversion separately
- None of the four systems share a common identifier or data model
The result: an incident starts, three separate dashboards have to be correlated manually, and it takes four engineers two hours to find a single slow database query that cascaded into a customer-facing outage.
Observability strategy is not about adding more tools. It is about connecting four distinct signal types into a coherent picture so that a single engineer can navigate from a customer complaint to a root cause in minutes.
The Four Signal Types
Modern observability rests on four types of signals, each answering a different question:
| Signal | Question Answered | Primary Consumer |
|---|---|---|
| Logs | What happened? | Engineers debugging incidents |
| Metrics | How is the system performing? | Engineers, SREs, platform teams |
| Traces | Where did the time go? | Engineers debugging latency |
| Business Signals | What did it cost the business? | Engineering leaders, product, finance |
Most strategies invest heavily in logs and metrics while underinvesting in traces and business signals. That leaves the two hardest questions unanswered.
Logs: Structured, Purposeful, Queryable
Logs are the oldest observability signal and the most misused. The common failure is treating logs as print statements: unstructured text dumped to a file, inconsistently formatted, and impossible to query at scale.
Structure First
Every log line should be machine-readable JSON. Free-text logs are expensive to parse and impossible to aggregate reliably.
    {
      "timestamp": "2025-05-20T14:32:01.234Z",
      "level": "error",
      "service": "payments-api",
      "trace_id": "4bf92f3577b34da6",
      "span_id": "00f067aa0ba902b7",
      "user_id": "usr_8f2k9x",
      "event": "payment_processing_failed",
      "error_code": "CARD_DECLINED",
      "amount_cents": 4999,
      "duration_ms": 312,
      "message": "Payment declined by processor"
    }
Two fields are non-negotiable: trace_id and span_id. These are how logs connect to traces, which is what enables cross-signal correlation at incident time.
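One way to get log lines of this shape is a small formatter on top of Python's standard `logging` module. The sketch below is a minimal illustration, not a reference implementation: the `JsonFormatter` class, the `build_logger` helper, and the `CONTEXT_FIELDS` list are hypothetical names, and the context fields are assumed to be passed through `extra=`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical set of context fields expected via `extra=`; adapt to your schema.
    CONTEXT_FIELDS = ("trace_id", "span_id", "user_id", "event",
                      "error_code", "amount_cents", "duration_ms")

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Copy only the context fields that were actually supplied.
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes one JSON line per event to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("payments-api")
logger.error(
    "Payment declined by processor",
    extra={"trace_id": "4bf92f3577b34da6", "span_id": "00f067aa0ba902b7",
           "event": "payment_processing_failed", "error_code": "CARD_DECLINED"},
)
```

In a real service the trace and span IDs would normally come from the active tracing context rather than being passed by hand on every call.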
Log Level Discipline
| Level | Use Case | Example |
|---|---|---|
| ERROR | Requires immediate attention; something failed that should not have | Payment processor unreachable |
| WARN | Degraded but recoverable; needs investigation | Cache miss rate above threshold |
| INFO | Normal operational events worth recording | User authenticated, order placed |
| DEBUG | Detailed diagnostic information; disabled in production by default | SQL query executed, cache key checked |
The most common log hygiene failure is logging everything at INFO, which buries real signals in noise. If your logs require grep to find the actual errors, the level discipline has broken down.
What Not to Log
- Passwords, tokens, API keys, or PII in plaintext
- Every function entry and exit at INFO level
- Repetitive high-frequency events without sampling (health check pings, polling loops)
- Raw request/response bodies in production without redaction
Uncontrolled log volume is expensive. At scale, noisy logging costs more in storage and ingestion than the signal is worth.
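The redaction and sampling rules above can be enforced in code rather than by convention. The following is a rough sketch under stated assumptions: the `SENSITIVE_KEYS` set, the `redact` helper, and the `should_log` helper are illustrative names, and the redaction only walks nested dictionaries (not lists or free-text values).

```python
import random

# Hypothetical list of field names that must never reach the log pipeline.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "card_number"}


def redact(payload: dict) -> dict:
    """Return a copy of payload with sensitive values masked, recursing into nested dicts."""
    clean = {}
    for key, value in payload.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


def should_log(event: str, sample_rates: dict, rng=random.random) -> bool:
    """Keep every event by default; keep only a fraction of listed high-frequency events."""
    return rng() < sample_rates.get(event, 1.0)


# Example: keep roughly 1% of health check pings, keep everything else.
SAMPLE_RATES = {"health_check": 0.01}
if should_log("order_placed", SAMPLE_RATES):
    safe_body = redact({"user_id": "usr_8f2k9x", "card_number": "4111111111111111"})
```

Key-based redaction is only a first line of defense; secrets embedded in free-text messages need separate handling.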
Metrics: Instrument What Drives Decisions
Metrics are aggregated numerical measurements over time. They are the fastest signal to query and the cheapest to store, which makes them ideal for alerting and dashboards.
Three Instrumentation Frameworks
Three complementary frameworks help decide what to instrument:
The RED Method (for services handling requests):
| Metric | What to Measure |
|---|---|
| Rate | Requests per second |
| Errors | Error rate as a percentage of requests |
| Duration | Latency distribution (p50, p95, p99) |
The USE Method (for infrastructure resources):
| Metric | What to Measure |
|---|---|
| Utilization | Percentage of resource capacity in use |
| Saturation | Queue depth or work waiting to be processed |
| Errors | Error count for the resource |
The Four Golden Signals (for user-facing services, from Google SRE):
- Latency - How long requests take (successful and failed separately)
- Traffic - How much demand is on the system
- Errors - Rate of requests that fail
- Saturation - How full the system is
Apply RED to every service endpoint. Apply USE to every infrastructure component. Apply the Four Golden Signals to every user-facing surface. That coverage catches the majority of production problems before customers notice.
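To make the RED method concrete, here is a minimal sketch that derives rate, errors, and duration from a window of request records. The `Request` type and the `red_summary` function are hypothetical, errors are assumed to mean HTTP 5xx responses, and percentiles use the simple nearest-rank method. A production system would normally compute these in the metrics backend rather than in application code.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status_code: int


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests, window_seconds: float) -> dict:
    """Summarize one time window of requests as Rate, Errors, Duration."""
    if not requests:
        raise ValueError("no requests in window")
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status_code >= 500)  # assumption: 5xx = error
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_pct": 100 * errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```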
Histogram vs Gauge vs Counter
| Type | Use For | Example |
|---|---|---|
| Counter | Monotonically increasing totals | Total requests, total errors |
| Gauge | Point-in-time values that go up and down | Active connections, queue depth |
| Histogram | Distribution of values over time | Request latency, payload size |
Use histograms for latency, never averages. An average latency of 120ms can coexist with a p99 of 4,200ms. The users experiencing the 4,200ms requests are not visible in the average.
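A few lines of arithmetic show how an average hides the tail. The latency values below are invented for illustration, and `percentile` is a simple nearest-rank helper rather than any particular library's implementation.

```python
import math
from statistics import mean


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 100 requests, two of which are very slow.
latencies_ms = [80] * 90 + [150] * 8 + [4200] * 2

average_ms = mean(latencies_ms)         # pulled up slightly by the two outliers
p50_ms = percentile(latencies_ms, 50)   # what a typical user sees
p99_ms = percentile(latencies_ms, 99)   # what the unluckiest users see
```

Here the average works out to 168ms while the p99 is 4,200ms, so a dashboard showing only the average would look healthy.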
Traces: Follow the Request
Distributed tracing is the observability signal that modern architectures need most and implement least. In a microservices system, a single user request may touch ten services, three databases, and a message queue. Without tracing, a latency problem in that chain is nearly impossible to diagnose.
A trace represents the full lifecycle of one request. It is composed of spans, where each span represents one unit of work in one service.

The trace immediately surfaces two slow operations: a Postgres query in cart-service and an external call to Stripe in payments-api. Without the trace, the only visible signal would be high checkout latency with no indication of which service is responsible.
What Each Span Should Capture
| Attribute | Description | Example |
|---|---|---|
| trace_id | Unique identifier shared across all spans in a trace | 4bf92f3577b34da6 |
| span_id | Unique identifier for this span | 00f067aa0ba902b7 |
| parent_span_id | Links to the parent span | a2fb4a1d1a96d312 |
| service.name | The service that generated the span | cart-service |
| operation.name | What the span represents | postgres.query |
| duration_ms | How long the operation took | 89 |
| status | OK, ERROR, or UNSET | OK |
| http.status_code | For HTTP spans | 200 |
| db.statement | For database spans (sanitized) | SELECT * FROM carts WHERE user_id = ? |
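The span attributes in the table are enough to answer "where did the time go?" programmatically. The sketch below is illustrative only: the `Span` dataclass, the `self_times` and `slowest` helpers, and every duration in the sample trace are assumptions, and the self-time calculation assumes child spans run sequentially rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    service_name: str
    operation_name: str
    duration_ms: float
    status: str = "UNSET"


def self_times(spans) -> dict:
    """Map span_id to its duration minus the time spent in direct children."""
    child_total = {}
    for span in spans:
        if span.parent_span_id is not None:
            child_total[span.parent_span_id] = (
                child_total.get(span.parent_span_id, 0.0) + span.duration_ms
            )
    return {
        s.span_id: max(0.0, s.duration_ms - child_total.get(s.span_id, 0.0))
        for s in spans
    }


def slowest(spans, top: int = 2):
    """Return the spans that account for the most time on their own."""
    own = self_times(spans)
    return sorted(spans, key=lambda s: own[s.span_id], reverse=True)[:top]


# Hypothetical checkout trace; every duration here is invented for illustration.
TRACE = [
    Span("t1", "a", None, "api-gateway", "POST /checkout", 500, "OK"),
    Span("t1", "b", "a", "cart-service", "get_cart", 120, "OK"),
    Span("t1", "c", "b", "cart-service", "postgres.query", 89, "OK"),
    Span("t1", "d", "a", "payments-api", "charge", 340, "OK"),
    Span("t1", "e", "d", "payments-api", "stripe.charge", 310, "OK"),
]
```

Ranking by self time rather than total duration keeps the root span, which always looks slowest, from dominating the result.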
Sampling Strategy
Tracing every request in a high-throughput system is expensive. Three sampling strategies balance coverage against cost:
| Strategy | How It Works | Best For |
|---|---|---|
| Head sampling | Decision made at trace start; percentage of requests sampled | Low-cost baseline coverage |
| Tail sampling | Decision made after trace completes; keep slow or errored traces | Capturing all anomalies |
| Adaptive sampling | Rate adjusts based on traffic and error conditions | Production systems with variable load |
Tail sampling is the most valuable: it guarantees that every slow request and every error is captured while keeping storage costs manageable.
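The difference between head and tail sampling is easiest to see as two decision functions. This is a simplified sketch, not how any specific collector implements sampling: `head_sample` and `tail_sample` are hypothetical names, spans are plain dictionaries, and the longest span in a trace is assumed to be the root.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at trace start; hashing makes the decision consistent for a given trace_id."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < rate * 2**64


def tail_sample(spans, latency_threshold_ms: float, baseline_rate: float = 0.01) -> bool:
    """Decide after the trace completes: always keep errors and slow traces."""
    has_error = any(s["status"] == "ERROR" for s in spans)
    total_ms = max(s["duration_ms"] for s in spans)  # assumption: root span is the longest
    if has_error or total_ms > latency_threshold_ms:
        return True
    # Otherwise keep a small uniform sample of healthy traffic for baselines.
    return head_sample(spans[0]["trace_id"], baseline_rate)
```

The trade-off is that tail sampling has to buffer every span of a trace until the trace completes, which is why it usually runs in a collector tier rather than inside each service.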
OpenTelemetry as the Foundation
OpenTelemetry has become the de facto standard instrumentation framework for logs, metrics, and traces. It is vendor-neutral, covers most major languages and frameworks, and exports data that most modern observability backends can ingest.
Use OpenTelemetry for instrumentation. Choose your storage and visualization backend separately. This decouples the work of instrumenting your application from the work of choosing a vendor.
Business Signals: The Missing Fourth Pillar
Business signals are the observability layer that most engineering teams skip. They connect technical system behavior to business outcomes, answering questions that logs, metrics, and traces cannot answer on their own.
What Business Signals Look Like
| Technical Signal | Business Signal |
|---|---|
| Checkout API error rate: 2.3% | Failed checkouts: 47 orders per hour, $8,200 in blocked revenue |
| p99 search latency: 3.4s | Search abandonment rate: 18% above baseline |
| Payment processor timeout rate: 0.8% | Payment failures: 12 per hour, 4.1% potential churn risk |
| Cart service pod restarts: 3 in 1 hour | Cart loss events: 89 users affected |
Business signals require two things: an understanding of the unit economics of your application, and instrumentation that tracks business events alongside technical events.
How to Define Business Signals
For each critical user journey, answer:
- What is the unit of value? (completed order, activated user, submitted document)
- What is the unit worth? (average order value, LTV, contract value)
- What technical failures degrade or block this unit? (payment errors, timeout, failed validation)
- What is the degradation rate? (errors per hour as a fraction of total volume)
With those four inputs, every technical alert can carry a business impact estimate. A p99 latency alert becomes "this latency profile historically correlates with a 12% abandonment increase, estimated $4,400 per hour."
That number changes how fast incidents get escalated, how much engineering time gets spent on reliability improvements, and whether the board understands why platform investment matters.
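The four inputs above reduce to a short calculation. The sketch below is illustrative: the `Journey` type, the `blocked_revenue_per_hour` function, and all figures in the example are hypothetical, and it assumes every failed attempt is lost revenue, which overstates the impact when users retry successfully.

```python
from dataclasses import dataclass


@dataclass
class Journey:
    name: str
    unit_value_usd: float   # what one completed unit is worth (e.g. average order value)
    units_per_hour: float   # normal attempt volume for this journey


def blocked_revenue_per_hour(journey: Journey, failure_rate: float) -> float:
    """Estimate revenue blocked per hour at a given failure rate (0.0 to 1.0)."""
    failed_units = journey.units_per_hour * failure_rate
    return failed_units * journey.unit_value_usd


def annotate_alert(journey: Journey, failure_rate: float) -> str:
    """Attach a business impact estimate to a technical alert message."""
    impact = blocked_revenue_per_hour(journey, failure_rate)
    return (f"{journey.name}: {failure_rate:.1%} failure rate, "
            f"estimated ${impact:,.0f} per hour in blocked revenue")


# Hypothetical journey: 1,000 checkout attempts per hour at an $80 average order value.
checkout = Journey("checkout", unit_value_usd=80.0, units_per_hour=1000)
message = annotate_alert(checkout, failure_rate=0.05)
```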
The Correlation Layer
The four signals only become a coherent observability system when they share a common correlation identifier. Without correlation, each signal is a separate investigation.

This navigation flow, from business signal to metrics to traces to logs, is only possible when every signal carries the same trace_id. Without it, the correlation is manual and the four-minute resolution becomes forty minutes.
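When logs and spans share a trace_id, the drill-down is a simple join. The following sketch uses in-memory lists of dictionaries to stand in for whatever log and trace stores are in use. The `index_by_trace` and `drill_down` helpers and the record field names are assumptions for illustration.

```python
from collections import defaultdict


def index_by_trace(records) -> dict:
    """Group records by trace_id, skipping any record that lacks one."""
    index = defaultdict(list)
    for record in records:
        trace_id = record.get("trace_id")
        if trace_id is not None:
            index[trace_id].append(record)
    return index


def drill_down(trace_id: str, spans, logs) -> dict:
    """Collect everything known about one trace: spans by duration, plus its error logs."""
    trace_spans = index_by_trace(spans)[trace_id]
    trace_logs = index_by_trace(logs)[trace_id]
    return {
        "spans": sorted(trace_spans, key=lambda s: s["duration_ms"], reverse=True),
        "error_logs": [entry for entry in trace_logs if entry["level"] == "error"],
    }
```

Records without a trace_id drop out of the index entirely, which is exactly the manual-correlation gap described above.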
The Observability Maturity Model
Use this model to assess where your platform stands and prioritize the next investment:
| Level | Capability | Typical State |
|---|---|---|
| 1 - Reactive | Unstructured logs, basic uptime monitoring, alerts on crashes only | Most incidents discovered by users |
| 2 - Aware | Structured logs, RED metrics on key services, basic dashboards | Engineers can confirm an incident is happening |
| 3 - Informed | Distributed tracing on critical paths, correlated log and trace IDs | Engineers can identify which service is responsible |
| 4 - Proactive | Business signals connected to technical metrics, SLOs defined and measured | Engineering can quantify business impact and catch degradation before users notice |
| 5 - Optimized | Anomaly detection, automated correlation, incident response integrated with observability data | MTTR under 15 minutes for most incidents, reliability investment justified by data |
Most organizations sit at Level 2. The move from Level 2 to Level 3 requires distributed tracing. The move from Level 3 to Level 4 requires business signal instrumentation. Both are high-ROI investments.
Tooling Overview
The observability tooling landscape is large. Choosing based on features before choosing based on architectural fit leads to underutilized platforms and redundant tooling.
| Category | Open Source Options | Commercial Options |
|---|---|---|
| Instrumentation | OpenTelemetry | OpenTelemetry (vendor-neutral) |
| Log aggregation | Loki, OpenSearch, Fluentd | Datadog, Splunk, Elastic Cloud |
| Metrics storage | Prometheus, VictoriaMetrics | Datadog, Grafana Cloud, New Relic |
| Trace storage | Jaeger, Tempo | Datadog, Honeycomb, Lightstep |
| Visualization | Grafana | Grafana Cloud, Datadog, Dynatrace |
| Alerting | Alertmanager | PagerDuty, Opsgenie, Datadog |
Three architectural patterns dominate:
- Open source stack: OpenTelemetry + Prometheus + Loki + Tempo + Grafana. Maximum control, highest operational overhead. Best for large platform teams.
- Managed OSS: Same stack hosted on Grafana Cloud or similar. Reduced operational burden, OSS economics.
- Commercial platform: Datadog, New Relic, or Dynatrace. Fastest time to value, highest per-seat cost, best cross-signal correlation out of the box.
For organizations without a dedicated platform team, starting with a commercial platform and migrating toward open source as the team grows is a lower-risk approach than building an open source stack without the operational capacity to maintain it.
Where to Start
A common mistake is attempting to instrument everything at once. The first 20% of observability investment delivers 80% of the incident response improvement. Prioritize in this order:
Phase 1: Structured Logs and Core Metrics (Weeks 1-4)
- Add structured JSON logging to all services that currently use free-text logs
- Instrument every HTTP service with RED metrics (rate, errors, duration)
- Add USE metrics to database and cache layers
- Connect an alerting layer to p99 latency and error rate
This phase alone eliminates the most common "flying blind" incident scenarios.
Phase 2: Distributed Tracing (Weeks 5-10)
- Instrument the five to ten most critical user journeys with OpenTelemetry
- Propagate trace context across all service calls in those journeys
- Connect trace IDs to log lines
- Configure tail sampling to capture all errors and all requests over p95 latency
After this phase, a single engineer can navigate from an alert to a root cause without escalating to a senior engineer.
Phase 3: Business Signals and SLOs (Weeks 11-16)
- Define the three to five most critical business events (checkout, activation, document submission)
- Instrument those events with both technical and business attributes
- Define Service Level Objectives (SLOs) tied to business thresholds
- Connect business signal dashboards to engineering incident response
After this phase, reliability decisions are driven by quantified business impact rather than engineering instinct.
Common Mistakes
Mistake 1: Alerting on Everything
More alerts do not produce better incident response. Teams with hundreds of alerts experience alert fatigue and start ignoring notifications. Every alert should require a documented response action.
Better approach: Define alerts that require human action. Everything else is a dashboard metric.
Mistake 2: Ignoring Cardinality
High-cardinality labels in metrics (user IDs, request IDs, session tokens as metric labels) cause metrics databases to explode in size and degrade query performance.
Better approach: High-cardinality data belongs in traces and logs, not metrics. Metrics labels should be low-cardinality: service name, environment, endpoint pattern, status code.
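The reason cardinality matters is multiplicative: in the worst case, a metric produces one time series per combination of label values. The sketch below estimates that upper bound. The `max_series` function name and the label counts are hypothetical.

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical low-cardinality label set for a request counter.
safe_labels = {"service": 20, "environment": 3, "endpoint": 50, "status_code": 5}

# The same metric with a user ID label added.
risky_labels = {**safe_labels, "user_id": 100_000}

safe_count = max_series(safe_labels)
risky_count = max_series(risky_labels)
```

With these numbers the safe label set tops out at 15,000 series, while adding user_id raises the ceiling to 1.5 billion. Real series counts are usually lower than the bound because not every combination occurs, but the growth pattern is the same.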
Mistake 3: No SLOs, Only Alerts
Alerting without Service Level Objectives produces reactive operations. Engineers respond to the loudest alert, not the most impactful problem.
Better approach: Define SLOs before defining alerts. Alerts should fire when error budget consumption is accelerating, not on every threshold breach.
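Alerting on error budget consumption usually means computing a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below is a simplified illustration with hypothetical function names. The default threshold of 14.4 is the burn rate that would spend 2% of a 30-day budget in one hour (0.02 × 720 hours), and requiring both a long and a short window to exceed it is one common way to avoid paging on brief spikes.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return (errors / total) / budget


def should_page(long_window, short_window, slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows, given as (errors, total) pairs, burn faster than threshold."""
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )
```

A burn rate of 1.0 means the budget will be exactly used up by the end of the SLO period; anything sustained above that is a reliability problem, even if no single threshold alert has fired.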
Mistake 4: Traces Without Sampling Strategy
Tracing every request without a sampling strategy in a high-throughput system can cost more in storage than the entire rest of the observability stack.
Better approach: Start with 1-5% head sampling for baseline coverage. Layer in tail sampling for errors and high-latency requests. Tune the sampling rate as you understand your traffic profile.
Mistake 5: No Business Signal Ownership
Business signals require both engineering instrumentation and product or finance input on what to measure. When ownership is unclear, they never get built.
Better approach: Assign joint ownership between the engineering team and a product or operations stakeholder for each critical user journey.
Key Takeaways
- Observability is not monitoring: Monitoring confirms known failure modes. Observability enables discovery of unknown failure modes.
- All four signals are required: Logs, metrics, and traces without business signals leave the most important question unanswered.
- Correlation is the value: Isolated signals are four separate debugging tools. Connected signals with shared trace IDs are a navigable system.
- Start with structure: Structured JSON logs with trace IDs are the highest-leverage first investment.
- Instrument the critical path first: Full coverage is a multi-year journey. Instrument the journeys that generate the most business value first.
- Build SLOs before dashboards: SLOs give dashboards meaning. Dashboards without SLOs produce information without decisions.
- Tail sampling beats head sampling for quality: Capturing every error and every slow trace is more valuable than a uniform sample of all traffic.
Modern distributed systems generate more signals than any team can manually review. The goal of an observability strategy is not to see everything. It is to be able to find anything when it matters.