
Tags: security, mtls, service-mesh, zero-trust, envoy

Mutual TLS and Service Meshes: Zero Trust in Practice

March 20, 2024·11 min read·by Bishwambhar Sen

Two service pods with Envoy sidecar proxies exchanging mTLS-encrypted traffic, with a control plane issuing certificates

Concept

Zero Trust architecture begins with a single axiom: trust nothing by default, verify everything explicitly. The traditional perimeter model assumed that traffic inside the network boundary was safe. Cloud-native deployments demolished that assumption: microservices run in shared multi-tenant clusters, traffic traverses software-defined network overlays, and a compromised pod inside the cluster can reach any other pod unless explicitly prevented.

Mutual TLS (mTLS) is the cryptographic mechanism that enforces Zero Trust at the network level in a service mesh. Unlike standard TLS where only the server presents a certificate (proving its identity to the client), mTLS requires both parties to present and verify certificates before any data is exchanged. In a service mesh context, each service's certificate is the cryptographic proof of its identity.

The mTLS Handshake in Detail

A standard TLS 1.3 handshake proceeds as follows:

1. Client sends ClientHello with supported cipher suites and a key share
2. Server responds with ServerHello, its certificate, CertificateVerify (a signature over the handshake transcript hash), and Finished
3. Client verifies the server's certificate against a trusted CA
4. Client sends Finished; the session is established

Mutual TLS adds three steps:

2b. Server sends CertificateRequest to the client
3b. Client sends its own certificate and CertificateVerify
3c. Server verifies the client's certificate against a trusted CA

The result: both parties have cryptographically proven their identity before any application data is transmitted. The certificate's Subject Alternative Name (SAN) field encodes the service identity — typically as a SPIFFE ID: spiffe://cluster.local/ns/default/sa/order-service.

Envoy Sidecar Architecture

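In a service mesh (Istio, Linkerd, Consul Connect), mTLS origination and termination are handled by a sidecar proxy injected into every pod alongside the application container. Istio and Consul Connect use Envoy Proxy for this; Linkerd ships its own purpose-built proxy. The rest of this article assumes Envoy.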

The traffic flow for a request from Order Service to Inventory Service:

Order Service sends a plaintext HTTP request for /api/inventory. The request never leaves the pod unencrypted: it reaches the local Envoy sidecar first, either through transparent iptables redirection (Istio) or through an upstream listener bound to localhost (Consul Connect).
The Envoy sidecar resolves the destination through the mesh's service registry and establishes an mTLS connection to the destination sidecar, presenting the workload certificate that the mesh CA has already issued to it.
The Inventory Service's Envoy sidecar terminates the mTLS connection, verifies the caller's certificate (confirming it's the Order Service), and forwards a plaintext HTTP request to localhost:8080 (the Inventory Service process).
The response takes the reverse path.

The application code sees only plaintext local communication. The mTLS ceremony, certificate management, and policy enforcement are handled entirely by the mesh (sidecar proxies plus control plane), not by the application.

This architecture has a profound implication: the application code does not need to change to gain mTLS. The mesh upgrades the security posture of the entire cluster transparently.

SPIFFE and SVID: The Identity Standard

The Secure Production Identity Framework For Everyone (SPIFFE) defines a workload identity specification. A SPIFFE Verifiable Identity Document (SVID) is the credential that carries that identity; in its X.509 form, it is an X.509 certificate with a SPIFFE ID as a URI entry in the SAN field. The format: spiffe://{trust-domain}/{path}.

In Kubernetes with Istio: spiffe://cluster.local/ns/{namespace}/sa/{service-account}. Every pod running with a specific Kubernetes ServiceAccount gets a certificate with that SPIFFE ID. Policies are written against SPIFFE IDs, not IP addresses — which change with every pod restart.
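To make the ID layout concrete, here is a minimal sketch that splits an Istio-style SPIFFE ID into its parts. The function name TryParseIstioSpiffeId is hypothetical, and the parser assumes the ns/{namespace}/sa/{service-account} path shape described above; other SPIFFE deployments may use different paths.

```csharp
using System;

// Hypothetical helper: splits spiffe://{trust-domain}/ns/{namespace}/sa/{service-account}.
// Assumes the Istio-style path layout and returns false for anything else.
static bool TryParseIstioSpiffeId(
    string id, out string trustDomain, out string ns, out string serviceAccount)
{
    trustDomain = ns = serviceAccount = "";
    const string scheme = "spiffe://";
    if (!id.StartsWith(scheme, StringComparison.Ordinal)) return false;

    // Expected segments: trust-domain, "ns", namespace, "sa", service-account
    var parts = id[scheme.Length..].Split('/');
    if (parts.Length != 5 || parts[1] != "ns" || parts[3] != "sa") return false;
    if (parts[0].Length == 0 || parts[2].Length == 0 || parts[4].Length == 0) return false;

    trustDomain = parts[0];
    ns = parts[2];
    serviceAccount = parts[4];
    return true;
}

// Usage
if (TryParseIstioSpiffeId("spiffe://cluster.local/ns/default/sa/order-service",
        out var domain, out var nsName, out var account))
{
    Console.WriteLine($"{domain} / {nsName} / {account}");
}
```

Because the namespace and service account are recoverable from the ID itself, a policy engine can match on them without ever consulting pod IP addresses.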

Constraints

Certificate Rotation: The Operational Heartbeat

X.509 certificates expire. In a service mesh, workload certificates are typically issued with short TTLs (24 hours to 7 days) to limit the exposure window of a compromised private key. This means the mesh's control plane must continuously rotate certificates for every workload, at scale, without service interruption.

Rotation in Istio (using Citadel / istiod):

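The istio-agent running beside Envoy detects that the workload certificate is approaching expiry (for example, at 80% of TTL elapsed)
The agent generates a new key pair and sends a certificate signing request to istiod, which signs and returns a new certificate
The agent delivers the new certificate to Envoy over the Secret Discovery Service (SDS), part of the xDS API family
Envoy hot-reloads the new certificate without dropping active connections
The old certificate is not revoked; it simply expires, which is why short TTLs matter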

The constraint: certificate rotation must complete before the certificate expires. At scale (10,000 pods), this creates a thundering herd problem if certificate TTLs are set uniformly. Production meshes stagger expiry times by adding random jitter to the issuance timestamp.
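The staggering idea can be sketched in a few lines. This is an illustration, not how any particular mesh implements it: the helper name ComputeRotationTime and the parameter values are assumptions, and the sketch jitters the rotation point rather than the issuance timestamp, which spreads renewals in a similar way.

```csharp
using System;

// Hypothetical helper: picks when to rotate a certificate.
// baseFraction: share of the lifetime that must elapse before rotation (e.g. 0.8).
// jitterFraction: maximum share of the lifetime subtracted at random (e.g. 0.1).
static DateTimeOffset ComputeRotationTime(
    DateTimeOffset notBefore, DateTimeOffset notAfter,
    double baseFraction, double jitterFraction, Random rng)
{
    var lifetime = notAfter - notBefore;

    // Uniform jitter in [0, jitterFraction): rotation is only ever moved earlier,
    // so it never slips past the base threshold toward expiry.
    var fraction = baseFraction - rng.NextDouble() * jitterFraction;
    return notBefore + lifetime * fraction;
}

// Usage: a 24-hour certificate rotated somewhere between 70% and 80% of its lifetime
var issued = new DateTimeOffset(2024, 3, 20, 0, 0, 0, TimeSpan.Zero);
var rotateAt = ComputeRotationTime(issued, issued.AddHours(24), 0.8, 0.1, new Random());
Console.WriteLine(rotateAt);
```

With these example values, 10,000 pods that started in the same minute would spread their renewal requests over roughly 2.4 hours instead of hitting the CA at once.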

Policy Enforcement: AuthorizationPolicy vs. Application-Level Auth

The service mesh can enforce coarse-grained communication policies, such as "only Order Service may call Inventory Service's /api/inventory endpoint." In Istio this is expressed as an AuthorizationPolicy. A policy that allows Order Service to call Inventory Service would specify:

Source: principal matching spiffe://cluster.local/ns/orders/sa/order-service
Destination: Inventory Service's port 8080
Operation: GET /api/inventory*

This is infrastructure-level authorization — it controls which services can communicate. It is not a replacement for application-level authorization — which controls what operations a specific authenticated user may perform within the service.

The correct mental model: the mesh enforces service-to-service identity (who is calling). The application enforces user identity (who the end user is, what they're allowed to do). These are separate concerns at separate layers, and conflating them creates both security gaps and over-coupling.
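The two layers can be modelled as two independent checks. The sketch below is only a mental model in code, not Istio's evaluation logic: MeshAllows hard-codes the example rule from this section, while AppAllows and the permission string "inventory:read" are hypothetical stand-ins for whatever user authorization the application already has.

```csharp
using System;
using System.Collections.Generic;

// Layer 1 (mesh): which service is calling, and which operation it targets.
static bool MeshAllows(string callerPrincipal, int destinationPort, string method, string path)
{
    const string allowedPrincipal = "spiffe://cluster.local/ns/orders/sa/order-service";
    const int inventoryPort = 8080;
    const string allowedPathPrefix = "/api/inventory";

    return callerPrincipal == allowedPrincipal
        && destinationPort == inventoryPort
        && method == "GET"
        && path.StartsWith(allowedPathPrefix, StringComparison.Ordinal);
}

// Layer 2 (application): what the end user behind the request may do.
// Hypothetical model: the user's permissions arrive as a set of strings.
static bool AppAllows(HashSet<string> userPermissions, string requiredPermission) =>
    userPermissions.Contains(requiredPermission);

// Usage: a request is served only if both layers agree
var meshOk = MeshAllows(
    "spiffe://cluster.local/ns/orders/sa/order-service", 8080, "GET", "/api/inventory/42");
var appOk = AppAllows(new HashSet<string> { "inventory:read" }, "inventory:read");
Console.WriteLine(meshOk && appOk);
```

Neither check can stand in for the other: a permitted service may still carry a request from a user who lacks the permission, and a fully authorized user's request should still be dropped if it arrives from a service the mesh does not expect.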

Latency Overhead of mTLS

The mTLS handshake adds latency to connection establishment. Benchmarks from the Istio project show approximately 2–5ms overhead per new mTLS connection establishment, compared to plaintext TCP. For long-lived connections with connection reuse (gRPC, HTTP/2 keep-alive), this is amortized to near-zero per-request overhead — the handshake is paid once.

The concern is connection churn: short-lived connections that are never reused. HTTP/1.1 without keep-alive, or services that aggressively close connections after each request, will pay the mTLS handshake cost repeatedly. The fix is connection pooling and HTTP/2 stream multiplexing, not disabling mTLS.
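The amortization argument is simple division, shown here as a small sketch. The helper name is hypothetical, and the 5 ms input is just the upper bound quoted above, not a measurement.

```csharp
using System;

// Hypothetical helper: average handshake cost carried by each request
// when one connection (one handshake) serves several requests.
static double HandshakeCostPerRequestMs(double handshakeMs, int requestsPerConnection)
{
    if (requestsPerConnection < 1)
        throw new ArgumentOutOfRangeException(nameof(requestsPerConnection));
    return handshakeMs / requestsPerConnection;
}

// Usage: with and without connection reuse
Console.WriteLine(HandshakeCostPerRequestMs(5.0, 1));
Console.WriteLine(HandshakeCostPerRequestMs(5.0, 1000));
```

With one request per connection, every request carries the full 5 ms; at 1,000 requests per connection the share falls to 0.005 ms, which is why reuse matters far more than the raw handshake cost.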

Trade-offs

Full Mesh mTLS vs. Permissive Mode

Istio's PeerAuthentication resource supports three mTLS modes:

DISABLE: No mTLS. Plaintext only.
PERMISSIVE: Accept both mTLS and plaintext. Used during migration.
STRICT: mTLS only. Reject all plaintext.

The PERMISSIVE mode trade-off: it allows gradual migration to mTLS without a big-bang cutover, but it means some traffic may still be unencrypted without detection. STRICT mode is the correct production posture; PERMISSIVE is a migration tool, not a steady state.

Sidecar Injection Overhead

Each Envoy sidecar consumes approximately 50–100MB of memory and 0.1–0.5 CPU cores at idle. For a cluster with 500 pods, this is 25–50GB of memory and 50–250 CPU cores dedicated exclusively to the proxy layer. This is the operational cost of a service mesh — it is non-trivial and must be budgeted.

For small clusters (< 50 pods) or teams with limited platform maturity, mTLS without a full service mesh is achievable through direct certificate management in the application code, for example with .NET's SslStream and X509Certificate2 APIs. The per-service burden is higher, because every service must load, validate, and rotate its own certificates, but the resource overhead is lower.

Approach	mTLS Coverage	Operational Complexity	Resource Overhead
Service mesh (Istio/Linkerd)	All traffic, transparent	High (mesh ops)	High (sidecar per pod)
App-layer mTLS (manual)	Explicitly instrumented paths	Medium (cert mgmt)	Low
No mTLS (VPC only)	None	Low	None

Code

The following shows how to configure an ASP.NET Core service to require client certificates (application-level mTLS) without a service mesh — useful for direct service-to-service scenarios or mesh migration:

// Program.cs — ASP.NET Core with client certificate validation for mTLS
// Used when operating without a service mesh (manual mTLS)
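using System.Net.Security;                                // SslPolicyErrors
using System.Security.Claims;                             // Claim, ClaimsPrincipal
using System.Security.Cryptography.X509Certificates;
using Microsoft.AspNetCore.Authentication.Certificate;    // AddCertificate and related types
using Microsoft.AspNetCore.Server.Kestrel.Https;          // ClientCertificateMode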

var builder = WebApplication.CreateBuilder(args);

// Configure Kestrel to require client certificates on the secure port
builder.WebHost.ConfigureKestrel(kestrel =>
{
    kestrel.ListenAnyIP(8443, listenOptions =>
    {
        listenOptions.UseHttps(https =>
        {
            // Server certificate — issued by your internal PKI
            https.ServerCertificate = LoadCertificate(
                builder.Configuration["Mtls:ServerCertPath"],
                builder.Configuration["Mtls:ServerCertPassword"]);

            // Require and validate client certificate
            https.ClientCertificateMode = ClientCertificateMode.RequireCertificate;
            https.ClientCertificateValidation = (certificate, chain, errors) =>
                ValidateClientCertificate(certificate, chain, errors,
                    builder.Configuration["Mtls:TrustedCaThumbprint"]);
        });
    });
});

// Add certificate authentication middleware
builder.Services
    .AddAuthentication(CertificateAuthenticationDefaults.AuthenticationScheme)
    .AddCertificate(options =>
    {
        options.AllowedCertificateTypes = CertificateTypes.All;
        options.RevocationMode = X509RevocationMode.Online;
        options.Events = new CertificateAuthenticationEvents
        {
            OnCertificateValidated = context =>
            {
                // Extract SPIFFE ID from SAN — enforce service identity
                var spiffeId = ExtractSpiffeId(context.ClientCertificate);
                if (spiffeId is null || !IsAllowedCaller(spiffeId))
                {
                    context.Fail("Caller identity not permitted.");
                    return Task.CompletedTask;
                }

                // Add claims from certificate identity for downstream authorization
                var claims = new[]
                {
                    new Claim(ClaimTypes.NameIdentifier, spiffeId),
                    new Claim("service-name", ExtractServiceName(spiffeId))
                };
                context.Principal = new ClaimsPrincipal(
                    new ClaimsIdentity(claims, context.Scheme.Name));
                context.Success();
                return Task.CompletedTask;
            }
        };
    });

builder.Services.AddAuthorization();
var app = builder.Build();
app.UseAuthentication();
app.UseAuthorization();

// Start the server; without this call the process exits immediately
app.Run();

static bool ValidateClientCertificate(
    X509Certificate2 certificate,
    X509Chain? chain,
    SslPolicyErrors errors,
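    string? trustedCaThumbprint)
{
    // Reject on any chain or name error. This assumes the internal CA is installed
    // in the machine trust store; otherwise the chain reports RemoteCertificateChainErrors.
    if (errors != SslPolicyErrors.None) return false;

    // Verify the certificate chains up to our internal CA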
    return chain?.ChainElements
        .Any(e => string.Equals(
            e.Certificate.Thumbprint,
            trustedCaThumbprint,
            StringComparison.OrdinalIgnoreCase)) ?? false;
}

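static string? ExtractSpiffeId(X509Certificate2 cert)
{
    // The SPIFFE ID is a URI entry in the Subject Alternative Name extension.
    // Parse the DER bytes directly: X509Extension.Format() output differs by
    // platform ("URL=" on Windows, "URI:" on Linux), so string matching is fragile.
    var sanExtension = cert.Extensions["2.5.29.17"]; // OID for SAN
    if (sanExtension is null) return null;

    // GeneralName ::= CHOICE { ..., uniformResourceIdentifier [6] IA5String, ... }
    var uriTag = new System.Formats.Asn1.Asn1Tag(
        System.Formats.Asn1.TagClass.ContextSpecific, 6);
    var names = new System.Formats.Asn1.AsnReader(
        sanExtension.RawData, System.Formats.Asn1.AsnEncodingRules.DER).ReadSequence();

    while (names.HasData)
    {
        if (!names.PeekTag().HasSameClassAndValue(uriTag))
        {
            names.ReadEncodedValue(); // skip DNS names, IP addresses, etc.
            continue;
        }

        var uri = names.ReadCharacterString(
            System.Formats.Asn1.UniversalTagNumber.IA5String, uriTag);
        if (uri.StartsWith("spiffe://", StringComparison.Ordinal)) return uri;
    }

    return null;
}

// The helpers below are referenced above but were not part of the original listing.
// They are minimal, illustrative stand-ins so the sample compiles.
static X509Certificate2 LoadCertificate(string? path, string? password) =>
    new X509Certificate2(
        path ?? throw new InvalidOperationException("Mtls:ServerCertPath is not configured."),
        password);

// Assumption: a single permitted caller; a real service would read an allow-list from config.
static bool IsAllowedCaller(string spiffeId) =>
    string.Equals(spiffeId,
        "spiffe://cluster.local/ns/orders/sa/order-service",
        StringComparison.Ordinal);

// Returns the last path segment of the SPIFFE ID, e.g. "order-service"
static string ExtractServiceName(string spiffeId) =>
    spiffeId[(spiffeId.LastIndexOf('/') + 1)..];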

The second example shows an HttpClient configured to send a client certificate for outbound mTLS calls to another service:

// MtlsHttpClientFactory.cs — configuring outbound mTLS for service-to-service calls
using System.Net;                                      // HttpVersion
using System.Net.Security;                             // SslClientAuthenticationOptions, SslPolicyErrors
using System.Security.Cryptography.X509Certificates;

public static class MtlsHttpClientExtensions
{
    public static IHttpClientBuilder AddMtlsHttpClient(
        this IServiceCollection services,
        string clientName,
        string baseAddress,
        string clientCertPath,
        string clientCertPassword,
        string serverCaThumbprint)
    {
        return services.AddHttpClient(clientName, client =>
        {
            client.BaseAddress = new Uri(baseAddress);
            client.DefaultRequestVersion = HttpVersion.Version20;
        })
        .ConfigurePrimaryHttpMessageHandler(() =>
        {
            var clientCert = new X509Certificate2(clientCertPath, clientCertPassword);

            var handler = new SocketsHttpHandler
            {
                PooledConnectionLifetime = TimeSpan.FromMinutes(10),
                SslOptions = new SslClientAuthenticationOptions
                {
                    // Present client certificate to the server
                    ClientCertificates = new X509CertificateCollection { clientCert },

                    // Validate the server's certificate against our internal CA
                    RemoteCertificateValidationCallback = (_, cert, chain, errors) =>
                    {
                        if (errors != SslPolicyErrors.None) return false;
                        if (cert is null) return false;

                        return chain?.ChainElements.Any(e =>
                            string.Equals(e.Certificate.Thumbprint,
                                serverCaThumbprint,
                                StringComparison.OrdinalIgnoreCase)) ?? false;
                    }
                }
            };

            return handler;
        });
    }
}

// Registration in Program.cs:
// services.AddMtlsHttpClient(
//     "inventory-client",
//     "https://inventory-service.internal:8443",
//     clientCertPath: "/var/secrets/certs/order-service.pfx",
//     clientCertPassword: config["Mtls:CertPassword"],
//     serverCaThumbprint: config["Mtls:ServerCaThumbprint"]);