Observability: Logs, Metrics, and Traces — The Three Pillars in Practice
Concept
Monitoring is a statement about the past: you define thresholds, you alert when they're breached, you react. Observability is a property of the system itself: the ability to ask arbitrary questions about internal state by examining external outputs, without deploying new instrumentation code.
The distinction matters because in a distributed system, the failure modes you will encounter in production are not the ones you anticipated during design. Monitoring covers known unknowns. Observability is how you handle unknown unknowns.
The three signals — logs, metrics, and traces — are not interchangeable. Each answers a different class of question:
Logs answer: "What happened at this moment?" They are discrete, high-cardinality events with arbitrary context. Structured logs (JSON over plaintext) are queryable. Unstructured logs are archaeology.
Metrics answer: "What is the system's behavior over time?" They are numeric, low-cardinality, time-series data. They are cheap to store at scale: a raw sample is a 64-bit value plus a timestamp, whereas a single log line routinely runs to 200 bytes or more. They're the signal you alert on.
Traces answer: "Why did this specific request behave this way?" A distributed trace follows a single request across every service it touches, recording timing, errors, and contextual metadata at each hop. Without traces, debugging latency in a service mesh is like diagnosing a patient by interviewing individual cells.
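To make the "structured logs are queryable" point concrete, here is a minimal sketch using only System.Text.Json. The field names and values (order_id, decline_code, and so on) are invented for illustration; a real service would emit these through its logging framework rather than by hand.

```csharp
using System;
using System.Text.Json;

// A structured log event: every piece of context is a named field.
// All field names and values here are hypothetical.
string line = JsonSerializer.Serialize(new
{
    timestamp = "2024-05-01T12:00:00Z",
    level = "Warning",
    message = "Payment declined",
    order_id = "ord-1042",
    decline_code = "insufficient_funds",
});
Console.WriteLine(line);

// Because the event is structured, a consumer can select on a field
// instead of pattern-matching free text.
using JsonDocument doc = JsonDocument.Parse(line);
string? code = doc.RootElement.GetProperty("decline_code").GetString();
Console.WriteLine($"decline_code = {code}");
```

A log aggregator performs the same kind of field lookup at scale, which is what makes a query such as "all declined payments for this order" cheap to express.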
OpenTelemetry as the Unifying Standard
OpenTelemetry (OTel) is the CNCF project that standardizes how instrumentation is written and exported, independent of the backend (Jaeger, Tempo, Honeycomb, Datadog, etc.). It is arguably the most important observability development of the last decade, because it removes vendor lock-in at the instrumentation layer: you instrument once and choose or change the backend by configuring an exporter.
In .NET, OpenTelemetry integrates through System.Diagnostics.ActivitySource (traces), System.Diagnostics.Metrics (metrics), and Microsoft.Extensions.Logging (logs), all wired through the OTel SDK.
W3C TraceContext and traceparent Propagation
The W3C TraceContext specification defines the traceparent HTTP header that carries a trace ID and span ID across service boundaries. The format is:
traceparent: 00-{trace-id}-{parent-span-id}-{trace-flags}
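Where the leading 00 is the format version, trace-id is a 128-bit identifier (32 lowercase hex characters) shared by every span in the trace, parent-span-id is the 64-bit identifier (16 hex characters) of the calling span, and trace-flags is an 8-bit field whose lowest bit records whether the caller sampled the trace. Every service that receives a request and propagates this header participates in the same distributed trace, even if it's implemented in a different language or framework.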
Without propagation, you have a collection of disconnected per-service traces. With propagation, you have a causal graph of the entire request's journey.
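The propagation step can be sketched with the BCL's ActivityContext type. The header value below is the example from the W3C specification. Building the outgoing header by hand is for illustration only, since HttpClient and the OTel instrumentation libraries normally inject it for you.

```csharp
using System;
using System.Diagnostics;

// Incoming header: the example value from the W3C Trace Context specification.
const string incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

if (!ActivityContext.TryParse(incoming, null, out ActivityContext parent))
{
    throw new FormatException("Malformed traceparent header");
}

Console.WriteLine($"trace-id : {parent.TraceId}");
Console.WriteLine($"parent-id: {parent.SpanId}");
Console.WriteLine($"sampled  : {parent.TraceFlags.HasFlag(ActivityTraceFlags.Recorded)}");

// For a downstream call, keep the trace-id and flags but substitute this
// service's own span-id, so the callee records us as its parent.
ActivitySpanId ownSpanId = ActivitySpanId.CreateRandom();
string flags = ((int)parent.TraceFlags).ToString("x2");
string outgoing = $"00-{parent.TraceId}-{ownSpanId}-{flags}";
Console.WriteLine($"outgoing : {outgoing}");
```

The trace-id is identical in both headers; only the span-id changes at each hop. That shared identifier is what lets a backend stitch the per-service spans into one causal graph.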
Constraints
Cardinality vs. Cost
Metrics have a cardinality ceiling set by your time-series backend. Prometheus memory use grows with the number of active series, and a single server typically degrades somewhere in the millions of series (around 10M is a common rule of thumb, depending on hardware). If you create a metric with a label per user ID, per order ID, or per session ID, you'll DoS your monitoring infrastructure before you finish the sprint. High-cardinality data belongs in traces (indexed per request) and logs (searchable by field), not in metrics.
The engineering discipline: metrics labels must be bounded cardinality. Status codes (200, 404, 500 — finite), service names (finite), HTTP methods (finite). Never request IDs, never user IDs.
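A back-of-the-envelope check makes the rule tangible: the worst-case number of time series for one metric is the product of the distinct values of each label. The label names and counts below are hypothetical, chosen only to show the effect of one unbounded label.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Upper bound on time series for one metric: product of distinct values per label.
static long MaxSeries(IEnumerable<KeyValuePair<string, long>> distinctValuesPerLabel) =>
    distinctValuesPerLabel.Aggregate(1L, (product, label) => product * label.Value);

// Hypothetical label sets for a request counter.
var bounded = new Dictionary<string, long>
{
    ["http.method"] = 5,
    ["status_class"] = 5,
    ["route"] = 40,
};
var unbounded = new Dictionary<string, long>(bounded) { ["user_id"] = 100_000 };

Console.WriteLine(MaxSeries(bounded));    // prints 1000
Console.WriteLine(MaxSeries(unbounded));  // prints 100000000
```

Not every combination occurs in practice, so this is an upper bound, but a single per-user label is enough to move one metric from a thousand series to a hundred million.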
Sampling Pressure
Recording 100% of traces in a high-throughput system is rarely economical. A service processing 10,000 RPS generates at least 10,000 spans per second, one per request. At 1KB per span, that's 10MB/s, or 864GB/day, before considering the extra spans produced by distributed fan-out.
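The volume arithmetic is easy to parameterize. The sketch below reproduces the 864GB/day figure (decimal units, 1KB = 1,000 bytes) and then varies two inputs that are assumptions for illustration: a fan-out of 8 spans per request and a 10% sampling ratio.

```csharp
using System;
using System.Globalization;

// Daily span volume in GB (decimal units).
static double GigabytesPerDay(
    double requestsPerSecond, double spansPerRequest, double bytesPerSpan, double sampleRatio) =>
    requestsPerSecond * spansPerRequest * bytesPerSpan * 86_400 * sampleRatio / 1e9;

static string Format(double gb) => gb.ToString("F1", CultureInfo.InvariantCulture);

// One span per request, everything recorded: the figure from the text.
Console.WriteLine(Format(GigabytesPerDay(10_000, 1, 1_000, 1.0)));   // prints 864.0

// Hypothetical fan-out of 8 spans per request, everything recorded.
Console.WriteLine(Format(GigabytesPerDay(10_000, 8, 1_000, 1.0)));   // prints 6912.0

// Same fan-out with a hypothetical 10% sampling ratio.
Console.WriteLine(Format(GigabytesPerDay(10_000, 8, 1_000, 0.10)));  // prints 691.2
```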
Sampling strategies exist on a spectrum:
- Head-based sampling: The decision is made at the trace origin. Simple to implement, cheap. Downside: you sample out rare errors before you know they're errors.
- Tail-based sampling: The decision is made after the trace completes, based on outcome (was there an error? was it slow?). Captures all interesting traces. Requires a trace buffer (OpenTelemetry Collector tail-sampling processor). More complex.
- Adaptive sampling: Rate-limiting per service, per operation, per error type. The most nuanced approach; used in production at scale.
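Head-based sampling is usually made deterministic by deriving the decision from the trace ID, so every service that sees the same trace reaches the same verdict without coordination. The following is a minimal sketch of that idea. It is not the OpenTelemetry SDK's sampler, and the choice of which eight bytes to read is an assumption of this example.

```csharp
using System;
using System.Buffers.Binary;
using System.Diagnostics;

// Keep a trace when the low 64 bits of its ID fall below ratio * 2^64.
// Illustrative only: not the algorithm of any particular SDK.
static bool ShouldSample(ActivityTraceId traceId, double ratio)
{
    if (ratio <= 0) return false;
    if (ratio >= 1) return true;

    Span<byte> bytes = stackalloc byte[16];
    traceId.CopyTo(bytes);
    ulong low = BinaryPrimitives.ReadUInt64BigEndian(bytes.Slice(8));
    return low < (ulong)(ratio * ulong.MaxValue);
}

// Trace ID from the W3C specification's example header.
ActivityTraceId id = ActivityTraceId.CreateFromString("4bf92f3577b34da6a3ce929d0e0e4736");

Console.WriteLine(ShouldSample(id, 0.50));  // prints False
Console.WriteLine(ShouldSample(id, 0.75));  // prints True
```

The low 64 bits of this ID are 0xa3ce929d0e0e4736, roughly 64% of the way through the range, so the trace is dropped at a 50% ratio and kept at 75%. The decision never looks at the outcome of the request, which is exactly why head-based sampling can discard a rare error.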
Log Volume and Verbosity Calibration
Log levels exist for a reason. Debug in production is an operational liability: it floods your log aggregator, drives storage costs, and buries signal in noise. The correct discipline: Information for business-relevant events (order placed, payment processed), Warning for degraded-but-recoverable conditions, Error for failures requiring attention, and Critical for service-threatening conditions.
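As a configuration sketch, the level floor can be set in code through the standard Microsoft.Extensions.Logging builder. This fragment assumes an ASP.NET Core project with implicit usings. The category prefixes are the framework's own, while the order ID is made up. Filter rules from appsettings.json, if present, can override what is shown here.

```csharp
// Program.cs fragment: assumes an ASP.NET Core project with implicit usings enabled.
var builder = WebApplication.CreateBuilder(args);

// Floor for application code: Information and above.
builder.Logging.SetMinimumLevel(LogLevel.Information);

// Framework categories are chattier; raise their floor to Warning.
builder.Logging.AddFilter("Microsoft.AspNetCore", LogLevel.Warning);
builder.Logging.AddFilter("Microsoft.EntityFrameworkCore", LogLevel.Warning);

var app = builder.Build();

// With the settings above and no overriding configuration, the first call
// is emitted and the second is dropped before it reaches any provider.
app.Logger.LogInformation("Order {OrderId} placed", "ord-1042");
app.Logger.LogDebug("Cart item count: {ItemCount}", 3);

app.Run();
```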
Trade-offs
RED vs. USE Metrics Frameworks
Two complementary frameworks define which metrics matter:
RED (Rate, Errors, Duration) — coined by Tom Wilkie at Weaveworks, tailored to request-driven services:
- Rate: Requests per second
- Errors: Failed requests per second (or error %)
- Duration: Distribution of request latencies (p50, p95, p99)
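To show what the three RED numbers look like when computed, here is a small sketch over an in-memory sample. The request data and the 10-second window are invented. The percentile uses the nearest-rank method, which is one of several common definitions. Only 5xx responses are counted as errors, which is a policy choice rather than a rule.

```csharp
using System;
using System.Linq;

// Hypothetical requests observed in a 10-second window: (duration in ms, HTTP status).
var window = TimeSpan.FromSeconds(10);
(double DurationMs, int Status)[] requests =
{
    (12.0, 200), (15.0, 200), (18.0, 200), (22.0, 200), (25.0, 200),
    (31.0, 200), (40.0, 404), (55.0, 200), (120.0, 500), (900.0, 503),
};

// Nearest-rank percentile: the smallest sample with at least p% of values at or below it.
static double Percentile(double[] sortedAscending, double p)
{
    int rank = (int)Math.Ceiling(p / 100.0 * sortedAscending.Length);
    return sortedAscending[Math.Max(rank, 1) - 1];
}

double rate = requests.Length / window.TotalSeconds;                               // Rate
double errorPct = 100.0 * requests.Count(r => r.Status >= 500) / requests.Length;  // Errors
double[] durations = requests.Select(r => r.DurationMs).OrderBy(d => d).ToArray(); // Duration
double p50 = Percentile(durations, 50);
double p95 = Percentile(durations, 95);

Console.WriteLine($"rate={rate} rps, errors={errorPct}%, p50={p50} ms, p95={p95} ms");
```

For this sample the rate is 1 request per second, the error percentage is 20, p50 is 25 ms, and p95 is 900 ms. The gap between p50 and p95 is the reason Duration is reported as a distribution rather than an average.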
USE (Utilization, Saturation, Errors) — coined by Brendan Gregg, tailored to infrastructure and resource-constrained components:
- Utilization: How busy is the resource? (CPU %, memory %)
- Saturation: How much work is queued/waiting? (thread pool queue depth, connection pool exhaustion)
- Errors: Hardware or OS-level errors
For .NET microservices, the correct application is:
- Apply RED to every HTTP endpoint, gRPC method, and message consumer
- Apply USE to the thread pool, connection pools (DB, HTTP, Redis), and GC pressure
Neither framework alone is complete. A service can have excellent RED metrics (fast, low-error) while its thread pool is saturated and one noisy-neighbor request will tip it into cascading failure.
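On the USE side, the .NET thread pool exposes its own state through System.Threading.ThreadPool. The sketch below publishes two readings as observable gauges and reads them once with a MeterListener. The meter and instrument names are hypothetical. A service using the OTel runtime instrumentation would normally get comparable metrics without hand-written gauges.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Threading;

// Hypothetical meter and instrument names, chosen for this example only.
using var meter = new Meter("Example.Use");

// Context for utilization: how many pool threads currently exist.
meter.CreateObservableGauge<int>("threadpool.thread.count", () => ThreadPool.ThreadCount);
// Saturation: work items queued and waiting for a thread.
meter.CreateObservableGauge<long>("threadpool.queue.length", () => ThreadPool.PendingWorkItemCount);

// Read the gauges once; an exporter would do this on every collection cycle.
var readings = new Dictionary<string, long>();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "Example.Use") l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<int>((inst, value, tags, state) => readings[inst.Name] = value);
listener.SetMeasurementEventCallback<long>((inst, value, tags, state) => readings[inst.Name] = value);
listener.Start();
listener.RecordObservableInstruments();

foreach (var (name, value) in readings)
{
    Console.WriteLine($"{name} = {value}");
}
```

A queue length that stays above zero while request latency still looks healthy is the early-warning case described above: RED is green, but the resource has no headroom left.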
Push vs. Pull Metrics Collection
Prometheus uses a pull model: it scrapes /metrics endpoints on a configured interval. This is simple and works well for services with stable network topology. It is a poor fit for short-lived workloads (Lambda functions, batch jobs) that may exit before the next scrape, and for services behind strict network boundaries that the scraper cannot reach.
OpenTelemetry supports both pull (Prometheus exporter) and push (OTLP to a collector). For Kubernetes-native services: Prometheus pull. For serverless or cross-VPC services: OTLP push. The collector can translate between them, giving you flexibility without re-instrumenting.
Code
The following wires up a complete OpenTelemetry instrumentation stack in a .NET 8 service — traces, metrics, and logs unified under a single configuration:
// Program.cs: OpenTelemetry full-stack setup for a .NET 8 minimal API service.
// Packages: OpenTelemetry.Extensions.Hosting, OpenTelemetry.Exporter.OpenTelemetryProtocol,
// OpenTelemetry.Exporter.Prometheus.AspNetCore, OpenTelemetry.Instrumentation.AspNetCore,
// OpenTelemetry.Instrumentation.EntityFrameworkCore, OpenTelemetry.Instrumentation.Runtime,
// OpenTelemetry.Instrumentation.Process.
using OpenTelemetry;
using OpenTelemetry.Logs;      // AddOtlpExporter for the logging pipeline
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
using System.Diagnostics;
using System.Diagnostics.Metrics;

var builder = WebApplication.CreateBuilder(args);

// Define the service resource; it is attached to all telemetry signals
var serviceResource = ResourceBuilder.CreateDefault()
    .AddService(
        serviceName: "order-service",
        serviceVersion: "2.4.1",
        serviceInstanceId: Environment.MachineName);

// Custom ActivitySource for manual instrumentation
var activitySource = new ActivitySource("OrderService.Activities", "2.4.1");

// Custom Meter for business metrics
var meter = new Meter("OrderService.Metrics", "2.4.1");
var ordersProcessedCounter = meter.CreateCounter<long>(
    "orders.processed.total",
    description: "Total number of orders processed");
var orderProcessingDuration = meter.CreateHistogram<double>(
    "orders.processing.duration.ms",
    unit: "ms",
    description: "End-to-end order processing duration");

builder.Services.AddSingleton(activitySource);
builder.Services.AddSingleton(meter);

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetResourceBuilder(serviceResource)
        .AddAspNetCoreInstrumentation(opts =>
        {
            // Enrich spans with HTTP request metadata
            opts.EnrichWithHttpRequest = (activity, request) =>
            {
                activity.SetTag("http.client_ip",
                    request.HttpContext.Connection.RemoteIpAddress?.ToString());
                activity.SetTag("app.tenant_id",
                    request.Headers["X-Tenant-Id"].FirstOrDefault());
            };
        })
        .AddEntityFrameworkCoreInstrumentation(opts =>
        {
            opts.SetDbStatementForText = true; // Include SQL in spans (dev/staging only!)
        })
        .AddSource(activitySource.Name)
        .AddOtlpExporter(otlp =>
        {
            otlp.Endpoint = new Uri(builder.Configuration["Otel:Endpoint"]!);
        }))
    .WithMetrics(metrics => metrics
        .SetResourceBuilder(serviceResource)
        .AddAspNetCoreInstrumentation()
        .AddRuntimeInstrumentation() // GC, thread pool, memory
        .AddProcessInstrumentation() // CPU, handles
        .AddMeter(meter.Name)
        .AddPrometheusExporter()); // Scrape endpoint at /metrics

// Logging: exported over OTLP; records written inside a span carry its trace and span IDs
builder.Logging.AddOpenTelemetry(logging =>
{
    logging.SetResourceBuilder(serviceResource);
    logging.AddOtlpExporter(otlp =>
    {
        otlp.Endpoint = new Uri(builder.Configuration["Otel:Endpoint"]!);
    });
    logging.IncludeScopes = true;
    logging.IncludeFormattedMessage = true;
});

var app = builder.Build();
app.MapPrometheusScrapingEndpoint(); // GET /metrics for Prometheus
app.Run();
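The second example shows manual span creation with rich semantic attributes. Spans started from the ActivitySource become children of the current activity (here, the HTTP server span created by the ASP.NET Core instrumentation), so they inherit the trace ID that arrived in the traceparent header without any explicit propagation code: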
// OrderProcessingService.cs: manual instrumentation with spans and RED metrics
using System.Diagnostics;
using System.Diagnostics.Metrics;
using OpenTelemetry.Trace; // RecordException extension method

public class OrderProcessingService
{
    private readonly ActivitySource _activitySource;
    private readonly Counter<long> _ordersProcessed;
    private readonly Histogram<double> _processingDuration;
    private readonly IOrderRepository _orderRepository;
    private readonly IPaymentGateway _paymentGateway;
    private readonly ILogger<OrderProcessingService> _logger;

    public OrderProcessingService(
        ActivitySource activitySource,
        Meter meter,
        IOrderRepository orderRepository,
        IPaymentGateway paymentGateway,
        ILogger<OrderProcessingService> logger)
    {
        _activitySource = activitySource;
        // Same name, unit, and description as in Program.cs, so both registrations
        // describe one instrument instead of two conflicting ones.
        _ordersProcessed = meter.CreateCounter<long>(
            "orders.processed.total",
            description: "Total number of orders processed");
        _processingDuration = meter.CreateHistogram<double>(
            "orders.processing.duration.ms",
            unit: "ms",
            description: "End-to-end order processing duration");
        _orderRepository = orderRepository;
        _paymentGateway = paymentGateway;
        _logger = logger;
    }

    public async Task<OrderResult> ProcessOrderAsync(
        PlaceOrderCommand command,
        CancellationToken cancellationToken = default)
    {
        var stopwatch = Stopwatch.StartNew();

        // Create a child span; its parent is the HTTP span from ASP.NET instrumentation
        using var activity = _activitySource.StartActivity(
            "ProcessOrder",
            ActivityKind.Internal);

        // Semantic attributes, queryable in Jaeger/Tempo/Honeycomb.
        // High-cardinality IDs are fine on spans; they must not become metric labels.
        activity?.SetTag("order.id", command.OrderId.ToString());
        activity?.SetTag("order.customer_id", command.CustomerId.ToString());
        activity?.SetTag("order.item_count", command.Items.Count);
        activity?.SetTag("order.total_amount", command.TotalAmount);

        try
        {
            using var paymentSpan = _activitySource.StartActivity(
                "AuthorizePayment", ActivityKind.Client);
            paymentSpan?.SetTag("payment.gateway", "stripe");
            paymentSpan?.SetTag("payment.currency", command.Currency);

            var paymentResult = await _paymentGateway.AuthorizeAsync(
                command.CustomerId,
                command.TotalAmount,
                command.Currency,
                cancellationToken);

            // End the payment span here so it does not also time the repository save.
            // The later Dispose is a no-op once the activity has stopped.
            paymentSpan?.Stop();

            if (!paymentResult.IsAuthorized)
            {
                activity?.SetStatus(ActivityStatusCode.Error, "Payment authorization failed");
                activity?.SetTag("order.failure_reason", paymentResult.DeclineCode);
                _ordersProcessed.Add(1,
                    new KeyValuePair<string, object?>("status", "payment_declined"),
                    new KeyValuePair<string, object?>("currency", command.Currency));
                _logger.LogWarning(
                    "Order {OrderId} payment declined, code: {DeclineCode}",
                    command.OrderId, paymentResult.DeclineCode);
                return OrderResult.PaymentDeclined(paymentResult.DeclineCode);
            }

            var order = Order.Create(command);
            await _orderRepository.SaveAsync(order, cancellationToken);

            activity?.SetStatus(ActivityStatusCode.Ok);
            _ordersProcessed.Add(1,
                new KeyValuePair<string, object?>("status", "success"),
                new KeyValuePair<string, object?>("currency", command.Currency));
            _logger.LogInformation(
                "Order {OrderId} processed successfully, amount: {Amount} {Currency}",
                command.OrderId, command.TotalAmount, command.Currency);
            return OrderResult.Success(order.Id, paymentResult.TransactionId);
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.RecordException(ex);
            // Keep the label set identical across all outcomes of this counter
            _ordersProcessed.Add(1,
                new KeyValuePair<string, object?>("status", "exception"),
                new KeyValuePair<string, object?>("currency", command.Currency));
            _logger.LogError(ex,
                "Unhandled exception processing order {OrderId}", command.OrderId);
            throw;
        }
        finally
        {
            stopwatch.Stop();
            // Outcome label is bounded: Unset, Ok, Error, or "unknown" when no listener is attached
            _processingDuration.Record(stopwatch.Elapsed.TotalMilliseconds,
                new KeyValuePair<string, object?>("order.outcome",
                    activity?.Status.ToString() ?? "unknown"));
        }
    }
}
Further Reading
- Module 3 – Distributed Systems Fundamentals — failure modes that observability must cover
- Module 13 – Reliability Engineering & SLOs — RED/USE metrics as the foundation for SLO definitions
- Module 7 – Event-Driven Architecture — tracing across message bus hops with traceparent propagation
- Module 16 – Governance — organizational standards for telemetry naming conventions
External references:
- OpenTelemetry .NET SDK: https://opentelemetry.io/docs/languages/dotnet/
- W3C TraceContext Specification: https://www.w3.org/TR/trace-context/
- Gregg, B. (2013). "USE Method." https://brendangregg.com/usemethod.html
- Wilkie, T. (2018). "The RED Method." KubeCon keynote.