Migrating to Mutual TLS: What Istio does not tell you

Chintan Betrabet
8 min read · Nov 18, 2020

Istio is a powerful tool for enforcing secure communication, but the path from development to production is not as simple as it sounds.

This article has been co-authored by Saurabh Shekhar.

Enabling mTLS with Istio

There are multiple resources on the subject; you may explore the internet for them or find a relevant article of mine here.

In summary, you use a combination of PeerAuthentication and DestinationRule resources to configure and enforce mTLS on both the sender side and the receiver side. However, it may not always be as simple as that.
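
For reference, a minimal sketch of the two resources for a namespace foo might look like the following (the namespace name and host pattern are illustrative):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: foo
spec:
  mtls:
    mode: STRICT
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: default
  namespace: foo
spec:
  host: "*.foo.svc.cluster.local"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL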

What to do when this does not work?

Most of the references and examples in the Istio documentation are largely toy programs intended to illustrate how mutual TLS (mTLS) works. However, there are many cases where this won't work out of the box for you. Once you have applied the STRICT policy for PeerAuthentication and the ISTIO_MUTUAL traffic policy in the DestinationRule, multiple issues can surface in a pre-existing setup.

Some common problems are documented here, but there are many more that I have encountered while migrating to Istio. The most common is a 503 error returned from the Envoy proxy when sending a request between two pods. The best way to identify the issue is to check the Envoy proxy logs using:

kubectl logs PODNAME -c istio-proxy -n NAMESPACE

However, it's best to take note of the following known causes before you proceed further into the Envoy logs.
1. Missing sidecars

This is one of the most common causes observed. When you try to communicate from a pod that does not have a sidecar to a pod that has a sidecar with a strict authentication policy, all requests will be rejected because a mutual TLS handshake cannot be established in the absence of an Envoy sidecar.

You can use Kiali for visibility on your workloads; in this case, it will flag the destination workload with a missing-sidecar warning.

This can be fixed by enabling automatic sidecar injection through labels/annotations at the namespace or deployment level.

kind: Namespace
apiVersion: v1
metadata:
  name: foo
  labels:
    sidecar.istio.io/inject: "true"

This will try to inject sidecars into all pods created in the foo namespace.

Note: A redeploy is needed for the change to take effect; existing pods are not affected by it.
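
For example, assuming a deployment named test-deploy in the foo namespace, you can trigger a redeploy with:

kubectl rollout restart deployment test-deploy -n foo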

If you want to enable for only certain pods/deployments, you can modify the label at a deployment level:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deploy
  namespace: foo
  labels:
    name: test-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      name: test-deploy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 80%
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
      labels:
        name: test-deploy
    spec:
      containers:
      - name: test
        image: test:latest

The section below denotes that the sidecar should be injected into all pods of this deployment, irrespective of what is dictated at the namespace level:

template:
  metadata:
    annotations:
      sidecar.istio.io/inject: "true"

Using these policies, you should be able to eliminate surprises introduced by new services that do not have sidecars injected.
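
As a quick sanity check, you can list the containers of each pod in a namespace and confirm that istio-proxy appears alongside your application container (the namespace foo is illustrative):

kubectl get pods -n foo -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'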

2. Naming convention of service ports

Istio relies heavily on service port naming to determine the protocol of communication. Consider the examples below:

kind: Service
apiVersion: v1
metadata:
  name: working-http-svc
  namespace: foo
  labels:
    app: working-http-svc
spec:
  selector:
    name: working-http-svc
  type: ClusterIP
  ports:
  - name: http-svc-port
    protocol: TCP
    port: 80
    targetPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: not-working-http-svc
  name: not-working-http-svc
  namespace: foo
spec:
  ports:
  - name: not-working-http-svc-port
    port: 8125
    protocol: TCP
    targetPort: 28125
  selector:
    app: not-working-http-svc
  type: ClusterIP

The correct naming convention is $protocol-$suffix, as illustrated in working-http-svc: the port name http-svc-port has http as a prefix to denote the protocol.

The not-working-http-svc service has a port named not-working-http-svc-port, which is missing the protocol prefix. The Envoy proxy on the client side is unable to recognize "not" (the prefix before the first '-') as a valid protocol for communication and hence aborts establishing a mutual TLS connection. This causes all requests sent to this service to receive 503 errors.

Kiali surfaces this issue as a port-naming warning on the affected service.

The fix for this is simple: rename the port in the configuration to a properly prefixed name such as tcp-app-port.
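
For reference, the corrected ports section of not-working-http-svc would look like:

ports:
- name: tcp-app-port
  port: 8125
  protocol: TCP
  targetPort: 28125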

3. Upstream naming and headers in Nginx

Nginx is a popular reverse proxy and a common choice to act as the ingress service for an application. In Kubernetes, it can be used to route requests to independent upstream microservices based on the path prefix. An upstream can be defined as follows:

upstream payments-backend {
  server payments.foo.svc.cluster.local;
}

This can be referred to from a location block as:

location /pay {
  proxy_pass http://payments-backend;
}

Without an Envoy proxy, this will work seamlessly. However, there are a few hiccups once the Envoy proxy is introduced. While Nginx internally handles the routing from the upstream name payments-backend to the service payments, Envoy is unable to identify this as a request to be routed within the Kubernetes cluster and routes the request via the Passthrough Cluster. Hence, it does not attempt to initiate mTLS. This will cause requests to be dropped by the receiving payments service, as it has a strict authentication policy. To fix this, we need the sidecar to understand that the request is internal to the cluster and needs mTLS to be established. There are two ways to achieve this:
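
One way to confirm this behavior (assuming access logging is enabled on the sidecar) is to look for PassthroughCluster entries in the Envoy access logs of the Nginx pod:

kubectl logs PODNAME -c istio-proxy -n NAMESPACE | grep PassthroughCluster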

a) Host headers

In the above case, Envoy will by default see the Host header as payments-backend, and since it has no lookup entry corresponding to this name inside the cluster, it treats the request as external. We can explicitly set the Host header as follows:

location /pay {
  proxy_set_header Host payments.foo.svc.cluster.local;
  proxy_pass http://payments-backend;
}

This informs the proxy that the request is destined for payments.foo.svc.cluster.local; since Envoy recognizes this as a valid service within the cluster, it honors the destination rule and initiates mTLS.

b) Renaming the upstream

Alternatively, you can rename the upstream to:

upstream payments {
  server payments.foo.svc.cluster.local;
}

and the location to:

location /pay {
  proxy_pass http://payments;
}

This is cleaner than modifying headers and more intuitive, since the upstream name matches the service it points to.

Proxy HTTP version directive

This, however, is not the end of the problem. proxy_pass needs an explicit directive to use HTTP/1.1, as Envoy does not work with HTTP/1.0 connections (Nginx defaults to HTTP/1.0 for proxied requests). The directive is as follows:

location /pay {
  proxy_http_version 1.1;
  proxy_pass http://payments-backend;
}

These suggested changes resolved the issues we faced while implementing mTLS with Nginx as a reverse proxy.
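
Putting these pieces together, a minimal sketch of an Nginx configuration that works behind the sidecar (using the renamed upstream from above; the listen port is illustrative) might look like:

upstream payments {
  server payments.foo.svc.cluster.local;
}

server {
  listen 80;

  location /pay {
    # HTTP/1.1 so the Envoy sidecar accepts the proxied connection
    proxy_http_version 1.1;
    proxy_pass http://payments;
  }
}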

4. Graceful termination of Istio containers

All requests routed to a pod are intercepted by the Envoy sidecar and then proxied to the container running the HTTP service. In most cases, these services are stateless and their pods can be terminated instantly without any loss of data.

However, we may have use cases where a pod needs to gracefully handle termination such as:

a) Sending notifications to listeners or reporting cause of termination

b) Releasing locks on shared resources in a distributed system

c) Draining existing requests and sending callbacks on completion

All of these require communication with external services and will be routed via the envoy sidecar proxy.

These operations can be handled gracefully within the application using (a sketch follows this list):

  1. Trapping SIGTERM signals within the container
  2. Overriding terminationGracePeriodSeconds for the pod
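
As an illustration, a hypothetical pod template that does both could look like the following; the server binary /usr/local/bin/server and the 60-second grace period are assumptions for the example:

template:
  spec:
    # give the pod up to 60s to finish cleanup after SIGTERM
    terminationGracePeriodSeconds: 60
    containers:
    - name: test
      image: test:latest
      command: ["/bin/sh", "-c"]
      args:
      - |
        # trap SIGTERM, forward it to the server, and wait for it to drain
        trap 'kill -TERM "$pid"; wait "$pid"' TERM
        /usr/local/bin/server &
        pid=$!
        wait "$pid"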

However, when we introduce an additional hop through the Envoy proxy, these external calls will fail because the Envoy container is terminated once Kubernetes sends a SIGTERM signal to the pod. Envoy gracefully handles all existing requests and allows them to drain, but any new request, whether ingress or egress, will fail with a 503 because Envoy has been terminated and is unable to handle new requests.

Solution

The desired solution is to ensure that the proxy sidecar is not terminated before the main container running the service is terminated. The problem is tricky since sidecar injection is performed by Istio and we do not own the lifecycle of the istio-proxy container. There is however a way to tackle the problem.

Step 1: Disable the automatic injection of sidecars

This is essential as manual injection of sidecars allows us to control the final YAML which will be applied. To perform this, disable automatic sidecar injection at the namespace level:

kind: Namespace
apiVersion: v1
metadata:
  name: foo
  labels:
    sidecar.istio.io/inject: "false"

You will accordingly need to add the annotation at the deployment level:

template:
  metadata:
    annotations:
      sidecar.istio.io/inject: "true"

Step 2: Inject the sidecars manually, along with the correct preStop lifecycle hooks

First, generate the required YAML with the istio-proxy container using:

istioctl kube-inject -f $your_yaml > $destination_yaml

After that, modify the lifecycle hook for the istio-proxy container as follows:

containers:
- name: istio-proxy
  lifecycle:
    preStop:
      exec:
        command:
        - '/bin/sh'
        - '-c'
        - 'while curl -sf localhost:3000/health; do sleep 1; done'

The above preStop hook waits as long as the process on port 3000 (the main container's process) returns 200 OK for the /health API. This ensures that while the main service is up and running, istio-proxy will not be terminated. It is assumed that graceful termination behavior was already working for the main process; istio-proxy will honor the same.

Once done, you can use kubectl to apply this modified YAML configuration:
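
kubectl apply -f $destination_yaml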

You may refer to an effort made to simplify the process here. The aim is to build a config-driven tool that generates the appropriate YAML files with preStop hooks, both with and without istio-proxy injection.

5. Confirmation of encryption of traffic

The migration path from non-Istio to Istio with mTLS has the following phases:

i) Introduce sidecars and destination rules

ii) Enforce the Permissive mode of peer authentication to provide backward compatibility

iii) Verify that traffic is encrypted using mTLS

iv) Enforce the Strict mode of peer authentication so that any new communication will not be allowed unless it is following mTLS

Step (iii) is the most critical, since we need to ensure traffic is encrypted with mTLS before we enforce strict mode; without this, requests will fail, which is not acceptable in production environments.

Apart from manually verifying the N × (N-1) combinations of service-to-service communication between N services, you can use Prometheus.

Istio exposes a Prometheus dashboard which can be used to monitor all Istio requests. It can be accessed using:

istioctl dashboard prometheus

You can also use the Grafana dashboard with:

istioctl dashboard grafana

One key metric is:

istio_requests_total{connection_security_policy!="mutual_tls"}

This metric should not increase with traffic once your services are expected to have moved to mTLS communication. If it is increasing, you should dig deeper and identify which of your communications are not secured. You can use the following labels (see the example query after this list):

a) destination_workload

b) source_workload

c) request_protocol
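
For instance, a query along these lines breaks down the non-mTLS request rate by source, destination, and protocol:

sum(rate(istio_requests_total{connection_security_policy!="mutual_tls"}[5m])) by (source_workload, destination_workload, request_protocol)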

You can check the steps mentioned earlier in the article to ascertain the cause of the communication not being secured.

Conclusion

The path to productionizing mTLS with zero downtime is a tricky one. However, if you follow the steps mentioned here, the journey will certainly be smoother. I hope the experiences documented here can be used both by engineers presently working on migrating to Istio and by those aiming to do so in the future.
