Observability Strategy for Modern Applications: Logs, Metrics, Traces, and Business Signals

Marcus Webb
Most teams conflate monitoring with observability and build systems that can only answer questions they thought to ask in advance. That gap gets expensive when production incidents occur and the only available information is a dashboard that shows something is red.

Monitoring tells you when a known condition is true. Observability tells you what is happening inside a system you have never seen fail before. The distinction matters because modern distributed applications fail in novel ways, and the tools that work well for a single server fall apart for a fleet of microservices.

Why Most Observability Efforts Fail

The most common failure pattern looks like this:

  1. Engineers add logging because something broke and there was no record
  2. A monitoring tool gets connected to alert on CPU and memory
  3. A distributed tracing library gets added when latency becomes undebuggable
  4. A business intelligence tool measures revenue and conversion separately
  5. None of the four systems share a common identifier or data model

The result: an incident starts, three separate dashboards have to be correlated manually, and it takes four engineers two hours to find a single slow database query that cascaded into a customer-facing outage.

Observability strategy is not about adding more tools. It is about connecting four distinct signal types into a coherent picture so that a single engineer can navigate from a customer complaint to a root cause in minutes.

The Four Signal Types

Modern observability rests on four types of signals, each answering a different question:

Signal | Question Answered | Primary Consumer
Logs | What happened? | Engineers debugging incidents
Metrics | How is the system performing? | Engineers, SREs, platform teams
Traces | Where did the time go? | Engineers debugging latency
Business Signals | What did it cost the business? | Engineering leaders, product, finance

Most strategies invest heavily in logs and metrics while underinvesting in traces and business signals. That leaves the two hardest questions unanswered.

Logs: Structured, Purposeful, Queryable

Logs are the oldest observability signal and the most misused. The common failure is treating logs as print statements: unstructured text dumped to a file, inconsistently formatted, and impossible to query at scale.

Structure First

Every log line should be machine-readable JSON. Free-text logs are expensive to parse and impossible to aggregate reliably.

{
  "timestamp": "2025-05-20T14:32:01.234Z",
  "level": "error",
  "service": "payments-api",
  "trace_id": "4bf92f3577b34da6",
  "span_id": "00f067aa0ba902b7",
  "user_id": "usr_8f2k9x",
  "event": "payment_processing_failed",
  "error_code": "CARD_DECLINED",
  "amount_cents": 4999,
  "duration_ms": 312,
  "message": "Payment declined by processor"
}

Two fields are non-negotiable: trace_id and span_id. These are how logs connect to traces, which is what enables cross-signal correlation at incident time.
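One way to produce records of this shape is a small formatter on top of Python's standard logging module. This is a minimal sketch, not a definitive implementation: the JsonFormatter class, the build_logger helper, and the CONTEXT_FIELDS list are hypothetical names chosen for illustration, and the context fields are assumed to arrive through the logger's `extra` argument.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical context fields, mirroring the example record above.
    CONTEXT_FIELDS = ("trace_id", "span_id", "user_id", "event", "error_code")

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for name in self.CONTEXT_FIELDS:
            value = getattr(record, name, None)
            if value is not None:
                payload[name] = value
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("payments-api").error("Payment declined by processor", extra={"trace_id": "4bf92f3577b34da6", "span_id": "00f067aa0ba902b7"})` would then emit one JSON line carrying both identifiers.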

Log Level Discipline

Level | Use Case | Example
ERROR | Requires immediate attention; something failed that should not have | Payment processor unreachable
WARN | Degraded but recoverable; needs investigation | Cache miss rate above threshold
INFO | Normal operational events worth recording | User authenticated, order placed
DEBUG | Detailed diagnostic information; disabled in production by default | SQL query executed, cache key checked

The most common log hygiene failure is logging everything at INFO, which buries real signals in noise. If your logs require grep to find the actual errors, the level discipline has broken down.

What Not to Log

  • Passwords, tokens, API keys, or PII in plaintext
  • Every function entry and exit at INFO level
  • Repetitive high-frequency events without sampling (health check pings, polling loops)
  • Raw request/response bodies in production without redaction

Uncontrolled log volume is expensive. At scale, noisy logging costs more in storage and ingestion than the signal is worth.
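The redaction rule in the list above can be sketched as a small helper applied to a payload before it is logged. The key names in SENSITIVE_KEYS and the email pattern are assumptions for illustration; a real deployment would need a list that matches its own payloads and a broader PII policy.

```python
import re
from typing import Any

# Hypothetical key list; extend it to match your own payloads.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "card_number"}

# Deliberately simple email matcher, used only to illustrate PII scrubbing.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(value: Any) -> Any:
    """Return a copy of value with sensitive keys masked and emails scrubbed."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return EMAIL_PATTERN.sub("[EMAIL]", value)
    return value
```

Because the function returns a copy, the original request object stays untouched for the code path that actually needs the secret.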

Metrics: Instrument What Drives Decisions

Metrics are aggregated numerical measurements over time. They are the fastest signal to query and the cheapest to store, which makes them ideal for alerting and dashboards.

Three Instrumentation Frameworks

Three complementary frameworks help decide what to instrument:

The RED Method (for services handling requests):

Metric | What to Measure
Rate | Requests per second
Errors | Error rate as a percentage of requests
Duration | Latency distribution (p50, p95, p99)

The USE Method (for infrastructure resources):

Metric | What to Measure
Utilization | Percentage of resource capacity in use
Saturation | Queue depth or work waiting to be processed
Errors | Error count for the resource

The Four Golden Signals (for user-facing services, from Google SRE):

  1. Latency - How long requests take (successful and failed separately)
  2. Traffic - How much demand is on the system
  3. Errors - Rate of requests that fail
  4. Saturation - How full the system is

Apply RED to every service endpoint. Apply USE to every infrastructure component. Apply the Four Golden Signals to every user-facing surface. That coverage catches the majority of production problems before customers notice.
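To make the RED method concrete, here is a minimal in-memory sketch that records rate, errors, and duration per endpoint. It is not a replacement for a metrics library: RedRecorder, EndpointStats, and the choice to count only 5xx responses as errors are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EndpointStats:
    requests: int = 0
    errors: int = 0
    durations_ms: List[float] = field(default_factory=list)


def percentile(sorted_values: List[float], percent: int) -> Optional[float]:
    """Nearest-rank percentile; percent is an integer from 1 to 100."""
    if not sorted_values:
        return None
    rank = max(1, (percent * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


class RedRecorder:
    """Collect RED metrics per endpoint for one observation window."""

    def __init__(self) -> None:
        self.stats = defaultdict(EndpointStats)

    def observe(self, endpoint: str, status_code: int, duration_ms: float) -> None:
        entry = self.stats[endpoint]
        entry.requests += 1
        if status_code >= 500:  # assumption: only server errors count
            entry.errors += 1
        entry.durations_ms.append(duration_ms)

    def summary(self, endpoint: str, window_seconds: float) -> dict:
        entry = self.stats[endpoint]
        ordered = sorted(entry.durations_ms)
        error_pct = 100.0 * entry.errors / entry.requests if entry.requests else 0.0
        return {
            "rate_per_s": entry.requests / window_seconds,
            "error_pct": error_pct,
            "p50_ms": percentile(ordered, 50),
            "p95_ms": percentile(ordered, 95),
        }
```

Keeping raw durations in a list only works for a sketch; a production system would use bucketed histograms so memory stays bounded.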

Histogram vs Gauge vs Counter

Type | Use For | Example
Counter | Monotonically increasing totals | Total requests, total errors
Gauge | Point-in-time values that go up and down | Active connections, queue depth
Histogram | Distribution of values over time | Request latency, payload size

Use histograms for latency, never averages. A mean latency of around 120ms can coexist with a p99 of 4,200ms. The users experiencing the 4,200ms requests are not visible in the average.
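A short worked example shows how an average hides the tail. The sample below is synthetic (98 fast requests and 2 slow ones, values chosen for illustration), and the percentile function uses the nearest-rank definition, which is one of several common conventions.

```python
from typing import List


def percentile(values: List[float], percent: int) -> float:
    """Nearest-rank percentile; percent is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (percent * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


# Synthetic sample: 98 requests at 40 ms and 2 requests at 4,200 ms.
latencies_ms = [40] * 98 + [4200] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f}ms p50={p50_ms}ms p99={p99_ms}ms")
```

The mean lands near 123ms and the median at 40ms, while the p99 sits at 4,200ms, so an alert on the mean alone would stay quiet while two in every hundred users wait more than four seconds.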

Traces: Follow the Request

Distributed tracing is the observability signal that modern architectures need most and implement least. In a microservices system, a single user request may touch ten services, three databases, and a message queue. Without tracing, a latency problem in that chain is nearly impossible to diagnose.

A trace represents the full lifecycle of one request. It is composed of spans, where each span represents one unit of work in one service.

[Figure: Distributed trace waterfall showing checkout-request spans across api-gateway, cart-service, inventory-service, and payments-api, with the slow postgres query and stripe-api spans highlighted in red]

The trace immediately surfaces two slow operations: a Postgres query in cart-service and an external call to Stripe in payments-api. Without the trace, the only visible signal would be high checkout latency with no indication of which service is responsible.

What Each Span Should Capture

Attribute | Description | Example
trace_id | Unique identifier shared across all spans in a trace | 4bf92f3577b34da6
span_id | Unique identifier for this span | 00f067aa0ba902b7
parent_span_id | Links to the parent span | a2fb4a1d1a96d312
service.name | The service that generated the span | cart-service
operation.name | What the span represents | postgres.query
duration_ms | How long the operation took | 89
status | OK, ERROR, or UNSET | OK
http.status_code | For HTTP spans | 200
db.statement | For database spans (sanitized) | SELECT * FROM carts WHERE user_id = ?
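The span attributes above map naturally onto a small data structure, and the parent links are what let a tool answer "where did the time go?". The sketch below is illustrative only: the Span class, the span IDs, and the durations are made up, and self time is computed under the assumption that child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None for the root span
    service_name: str
    operation_name: str
    duration_ms: float
    status: str = "UNSET"  # OK, ERROR, or UNSET


def self_time_ms(span: Span, spans: List[Span]) -> float:
    """Time spent in the span itself, excluding its direct children.

    Assumes children run sequentially; overlapping children would need
    interval arithmetic instead of a plain sum.
    """
    child_total = sum(s.duration_ms for s in spans if s.parent_span_id == span.span_id)
    return max(0.0, span.duration_ms - child_total)


def slowest_spans(spans: List[Span], top: int = 2) -> List[Span]:
    """Rank by self time so parents that only wait on children drop out."""
    return sorted(spans, key=lambda s: self_time_ms(s, spans), reverse=True)[:top]


# Hypothetical checkout trace; IDs and durations are invented.
TRACE = "4bf92f3577b34da6"
EXAMPLE_TRACE = [
    Span(TRACE, "a1", None, "api-gateway", "checkout-request", 1000),
    Span(TRACE, "b1", "a1", "cart-service", "get-cart", 400),
    Span(TRACE, "b2", "b1", "cart-service", "postgres.query", 350),
    Span(TRACE, "c1", "a1", "inventory-service", "reserve-stock", 50),
    Span(TRACE, "d1", "a1", "payments-api", "charge", 500),
    Span(TRACE, "d2", "d1", "payments-api", "stripe-api", 450),
]
```

Calling `slowest_spans(EXAMPLE_TRACE)` on this data returns the stripe-api and postgres.query spans, the same two operations a waterfall view would highlight.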

Sampling Strategy

Tracing every request in a high-throughput system is expensive. Three sampling strategies balance coverage against cost:

Strategy | How It Works | Best For
Head sampling | Decision made at trace start; percentage of requests sampled | Low-cost baseline coverage
Tail sampling | Decision made after trace completes; keep slow or errored traces | Capturing all anomalies
Adaptive sampling | Rate adjusts based on traffic and error conditions | Production systems with variable load

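Tail sampling is usually the most valuable: when its policies are set to keep errored and slow traces, every such trace is captured while storage costs stay manageable. The trade-off is that spans must be buffered until the trace completes, which adds memory and operational overhead wherever the sampling decision runs.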
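The difference between head and tail decisions can be sketched in a few lines. This is a simplified illustration under stated assumptions: spans are plain dictionaries with `status` and `duration_ms` keys, the function names are hypothetical, and in practice the tail decision is made by a collector that buffers spans until the trace is complete.

```python
import hashlib
from typing import List


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at trace start, deterministically from the trace ID.

    Hashing the ID means every service reaches the same decision for the
    same trace without coordinating.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # uniform value in [0, 1)
    return bucket < rate


def tail_keep(spans: List[dict], latency_threshold_ms: float) -> bool:
    """Decide after the trace completes: keep errored or slow traces."""
    if not spans:
        return False
    has_error = any(span["status"] == "ERROR" for span in spans)
    longest_ms = max(span["duration_ms"] for span in spans)
    return has_error or longest_ms >= latency_threshold_ms


def keep_trace(trace_id: str, spans: List[dict], head_rate: float,
               latency_threshold_ms: float) -> bool:
    """Keep every anomaly, plus a small uniform baseline of normal traffic."""
    return tail_keep(spans, latency_threshold_ms) or head_sample(trace_id, head_rate)
```

Combining the two keeps anomalies for debugging while the baseline sample preserves a picture of what normal traffic looks like.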

OpenTelemetry as the Foundation

OpenTelemetry has become the de facto standard instrumentation framework for logs, metrics, and traces. It is vendor-neutral, covers most languages and frameworks, and produces data that most modern observability backends can ingest.

Use OpenTelemetry for instrumentation. Choose your storage and visualization backend separately. This decouples the work of instrumenting your application from the work of choosing a vendor.

Business Signals: The Missing Fourth Pillar

Business signals are the observability layer that most engineering teams skip. They connect technical system behavior to business outcomes, answering questions that logs, metrics, and traces cannot answer on their own.

What Business Signals Look Like

Technical Signal | Business Signal
Checkout API error rate: 2.3% | Failed checkouts: 47 orders per hour, $8,200 in blocked revenue
p99 search latency: 3.4s | Search abandonment rate: 18% above baseline
Payment processor timeout rate: 0.8% | Payment failures: 12 per hour, 4.1% potential churn risk
Cart service pod restarts: 3 in 1 hour | Cart loss events: 89 users affected

Business signals require two things: an understanding of the unit economics of your application, and instrumentation that tracks business events alongside technical events.

How to Define Business Signals

For each critical user journey, answer:

  1. What is the unit of value? (completed order, activated user, submitted document)
  2. What is the unit worth? (average order value, LTV, contract value)
  3. What technical failures degrade or block this unit? (payment errors, timeout, failed validation)
  4. What is the degradation rate? (errors per hour as a fraction of total volume)

With those four inputs, every technical alert can carry a business impact estimate. A p99 latency alert becomes "this latency profile historically correlates with a 12% abandonment increase, estimated $4,400 per hour."

That number changes how fast incidents get escalated, how much engineering time gets spent on reliability improvements, and whether the board understands why platform investment matters.
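The four inputs translate into simple arithmetic. The sketch below is one possible formulation, with a hypothetical Journey type and illustrative figures (a $175 average order value and 2,000 checkout attempts per hour) that are assumptions rather than numbers from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Journey:
    name: str
    unit_value_usd: float      # what one completed unit is worth
    attempts_per_hour: float   # normal hourly volume


def failed_units_per_hour(journey: Journey, failure_fraction: float) -> float:
    """Units blocked per hour at the given failure fraction (0.0 to 1.0)."""
    if not 0.0 <= failure_fraction <= 1.0:
        raise ValueError("failure_fraction must be between 0 and 1")
    return journey.attempts_per_hour * failure_fraction


def blocked_revenue_per_hour(journey: Journey, failure_fraction: float) -> float:
    """Estimated revenue blocked per hour, in USD."""
    return failed_units_per_hour(journey, failure_fraction) * journey.unit_value_usd


# Illustrative numbers only.
checkout = Journey("checkout", unit_value_usd=175.0, attempts_per_hour=2000)
failed = failed_units_per_hour(checkout, 0.023)      # about 46 orders per hour
blocked = blocked_revenue_per_hour(checkout, 0.023)  # about $8,050 per hour
```

Attaching an estimate like this to an alert payload is a modeling choice; it treats every failed attempt as lost, which overstates impact when customers retry successfully.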

The Correlation Layer

The four signals only become a coherent observability system when they share a common correlation identifier. Without correlation, each signal is a separate investigation.

[Figure: Four-step correlation flow from Business Signal to Metrics to Traces to Logs, with the root cause identified in 4 minutes]

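This navigation flow, from business signal to metrics to traces to logs, is only possible when logs, spans, and business events carry the same trace_id, and when metrics link to example traces (for instance through exemplars) rather than carrying trace_id as a label. Without that shared identifier, the correlation is manual and the four-minute resolution becomes forty minutes.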
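The value of a shared identifier is easiest to see as a join. The sketch below groups log records and spans by trace_id and pulls the error context for one trace; the function names and the dictionary shapes are assumptions for illustration, and an observability backend would normally perform this join for you.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_trace(logs: List[dict], spans: List[dict]) -> Dict[str, dict]:
    """Index log records and spans under their shared trace_id."""
    grouped = defaultdict(lambda: {"logs": [], "spans": []})
    for record in logs:
        if record.get("trace_id"):
            grouped[record["trace_id"]]["logs"].append(record)
    for span in spans:
        if span.get("trace_id"):
            grouped[span["trace_id"]]["spans"].append(span)
    return dict(grouped)


def error_context(trace_id: str, grouped: Dict[str, dict]) -> dict:
    """Return the failing spans and error-level logs for one trace."""
    entry = grouped.get(trace_id, {"logs": [], "spans": []})
    return {
        "error_spans": [s for s in entry["spans"] if s.get("status") == "ERROR"],
        "error_logs": [r for r in entry["logs"] if r.get("level") == "error"],
    }
```

Records without a trace_id are silently skipped here, which is exactly the data that falls out of the navigation flow in a real incident.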

The Observability Maturity Model

Use this model to assess where your platform stands and prioritize the next investment:

Level | Capability | Typical State
1 - Reactive | Unstructured logs, basic uptime monitoring, alerts on crashes only | Most incidents discovered by users
2 - Aware | Structured logs, RED metrics on key services, basic dashboards | Engineers can confirm an incident is happening
3 - Informed | Distributed tracing on critical paths, correlated log and trace IDs | Engineers can identify which service is responsible
4 - Proactive | Business signals connected to technical metrics, SLOs defined and measured | Engineering can quantify business impact and catch degradation before users notice
5 - Optimized | Anomaly detection, automated correlation, incident response integrated with observability data | MTTR under 15 minutes for most incidents, reliability investment justified by data

Most organizations sit at Level 2. The move from Level 2 to Level 3 requires distributed tracing. The move from Level 3 to Level 4 requires business signal instrumentation. Both are high-ROI investments.

Tooling Overview

The observability tooling landscape is large. Choosing based on features before choosing based on architectural fit leads to underutilized platforms and redundant tooling.

Category | Open Source Options | Commercial Options
Instrumentation | OpenTelemetry | OpenTelemetry (vendor-neutral)
Log aggregation | Loki, OpenSearch, Fluentd | Datadog, Splunk, Elastic Cloud
Metrics storage | Prometheus, VictoriaMetrics | Datadog, Grafana Cloud, New Relic
Trace storage | Jaeger, Tempo | Datadog, Honeycomb, Lightstep
Visualization | Grafana | Grafana Cloud, Datadog, Dynatrace
Alerting | Alertmanager | PagerDuty, OpsGenie, Datadog

Three architectural patterns dominate:

  1. Open source stack: OpenTelemetry + Prometheus + Loki + Tempo + Grafana. Maximum control, highest operational overhead. Best for large platform teams.

  2. Managed OSS: Same stack hosted on Grafana Cloud or similar. Reduced operational burden, OSS economics.

  3. Commercial platform: Datadog, New Relic, or Dynatrace. Fastest time to value, typically the highest cost at scale, best cross-signal correlation out of the box.

For organizations without a dedicated platform team, starting with a commercial platform and migrating toward open source as the team grows is a lower-risk approach than building an open source stack without the operational capacity to maintain it.

Where to Start

A common mistake is attempting to instrument everything at once. As a rule of thumb, the first 20% of observability investment delivers roughly 80% of the incident response improvement. Prioritize in this order:

Phase 1: Structured Logs and Core Metrics (Weeks 1-4)

  • Add structured JSON logging to all services that currently use free-text logs
  • Instrument every HTTP service with RED metrics (rate, errors, duration)
  • Add USE metrics to database and cache layers
  • Connect an alerting layer to p99 latency and error rate

This phase alone eliminates the most common "flying blind" incident scenarios.

Phase 2: Distributed Tracing (Weeks 5-10)

  • Instrument the five to ten most critical user journeys with OpenTelemetry
  • Propagate trace context across all service calls in those journeys
  • Connect trace IDs to log lines
  • Configure tail sampling to capture all errors and all requests over p95 latency

After this phase, a single engineer can navigate from an alert to a root cause without escalating to a senior engineer.
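Propagating trace context means each outgoing call carries the trace ID it received. The W3C Trace Context `traceparent` header does this with the form `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`. An OpenTelemetry SDK normally handles this automatically; the hand-rolled sketch below only illustrates the mechanism, and the three function names are hypothetical.

```python
import re
import secrets
from typing import Optional

# Version 00 of the W3C traceparent header: lowercase hex fields only.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge of the system."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the trace context, or None if the header is malformed."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace_id, fresh span ID."""
    context = parse_traceparent(incoming)
    if context is None:
        return new_traceparent()
    flags = "01" if context["sampled"] else "00"
    return f"00-{context['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

The same parsed trace_id is what gets written into each log line, which is how the "connect trace IDs to log lines" step in this phase is satisfied.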

Phase 3: Business Signals and SLOs (Weeks 11-16)

  • Define the three to five most critical business events (checkout, activation, document submission)
  • Instrument those events with both technical and business attributes
  • Define Service Level Objectives (SLOs) tied to business thresholds
  • Connect business signal dashboards to engineering incident response

After this phase, reliability decisions are driven by quantified business impact rather than engineering instinct.

Common Mistakes

Mistake 1: Alerting on Everything

More alerts do not produce better incident response. Teams with hundreds of alerts experience alert fatigue and start ignoring notifications. Every alert should require a documented response action.

Better approach: Define alerts that require human action. Everything else is a dashboard metric.

Mistake 2: Ignoring Cardinality

High-cardinality labels in metrics (user IDs, request IDs, session tokens as metric labels) cause metrics databases to explode in size and degrade query performance.

Better approach: High-cardinality data belongs in traces and logs, not metrics. Metrics labels should be low-cardinality: service name, environment, endpoint pattern, status code.
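The effect of a single high-cardinality label is easy to estimate: the upper bound on time series for one metric is the product of the distinct values of each label. The label counts below are illustrative assumptions, and real series counts are usually lower because not every combination occurs.

```python
from math import prod
from typing import Dict


def max_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


# Illustrative label counts for a single request-duration metric.
low_cardinality = {"service": 20, "environment": 3, "endpoint": 50, "status_code": 8}
with_user_id = {**low_cardinality, "user_id": 100_000}

print(f"{max_series(low_cardinality):,} series")  # 24,000 series
print(f"{max_series(with_user_id):,} series")     # 2,400,000,000 series
```

One extra label turns a bounded set of series into billions of potential series, which is why per-user detail belongs in traces and logs.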

Mistake 3: No SLOs, Only Alerts

Alerting without Service Level Objectives produces reactive operations. Engineers respond to the loudest alert, not the most impactful problem.

Better approach: Define SLOs before defining alerts. Alerts should fire when error budget consumption is accelerating, not on every threshold breach.
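Error budget consumption is commonly expressed as a burn rate: the observed error ratio divided by the ratio the SLO allows. The sketch below is one possible formulation, not a prescribed one. The two-window check and the 14.4 default follow a convention described in Google's SRE Workbook for a 30-day SLO (2% of the budget spent in one hour); the function names and window tuples are assumptions.

```python
from typing import Tuple


def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent.

    A value of 1.0 spends the budget exactly over the SLO period.
    """
    if requests == 0:
        return 0.0
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / budget


def should_page(long_window: Tuple[int, int], short_window: Tuple[int, int],
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    Each window is (errors, requests). The short window confirms the problem
    is still happening, so a spike that has already ended does not page.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)
```

For a 99.9% SLO, 200 errors in 10,000 requests is a burn rate of about 20 and would page if the short window agrees, while 5 errors in 10,000 requests is a burn rate of about 0.5 and would not.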

Mistake 4: Traces Without Sampling Strategy

Tracing every request without a sampling strategy in a high-throughput system can cost more in storage than the entire rest of the observability stack.

Better approach: Start with 1-5% head sampling for baseline coverage. Layer in tail sampling for errors and high-latency requests. Tune the sampling rate as you understand your traffic profile.

Mistake 5: No Business Signal Ownership

Business signals require both engineering instrumentation and product or finance input on what to measure. When ownership is unclear, they never get built.

Better approach: Assign joint ownership between the engineering team and a product or operations stakeholder for each critical user journey.

Key Takeaways

  1. Observability is not monitoring: Monitoring confirms known failure modes. Observability enables discovery of unknown failure modes.
  2. All four signals are required: Logs, metrics, and traces without business signals leave the most important question unanswered.
  3. Correlation is the value: Isolated signals are four separate debugging tools. Connected signals with shared trace IDs are a navigable system.
  4. Start with structure: Structured JSON logs with trace IDs are the highest-leverage first investment.
  5. Instrument the critical path first: Full coverage is a multi-year journey. Instrument the journeys that generate the most business value first.
  6. Build SLOs before dashboards: SLOs give dashboards meaning. Dashboards without SLOs produce information without decisions.
  7. Tail sampling beats head sampling for quality: Capturing every error and every slow trace is more valuable than a uniform sample of all traffic.

Modern distributed systems generate more signals than any team can manually review. The goal of an observability strategy is not to see everything. It is to be able to find anything when it matters.

