AKS with Linkerd Service Mesh

Securing internal traffic with Linkerd on AKS

Last Updated: September 4, 2021 (moved illustrations and code snippets to gists)

The two most often neglected domains of cloud operations are security and o11y (observability). This should come as no surprise, because adding security, such as encryption-in-transit with mutual TLS (where both client and server verify each other), and adding traffic monitoring and tracing on short-lived, ephemeral pods is by its very nature complex.

What if you could add automation for both security and o11y in less than 15 minutes of effort?

The solution to all of this complexity involves deploying a service mesh, and as unbelievable as it seems, the above statement can really happen with Linkerd.

This article covers using the service mesh Linkerd installed into AKS (Azure Kubernetes Service) with an example application Dgraph.

Linkerd Architecture: Control Plane vs Data Plane

A service mesh can be logically organized into two primary layers:

a control plane layer that’s responsible for configuration and management, and a data plane layer that provides network functions valuable to distributed applications. (ref)

The service mesh consists of proxies and reverse proxies deployed alongside every service that is added to a network called the mesh. This allows you to secure and monitor traffic between all members of the mesh.

A proxy, for the uninitiated, redirects outbound web traffic to an intermediary web service that can apply security policies, such as blocking access to a malicious web site, before the traffic is sent on its way to the destination.

The reverse proxy is an intermediary web service that can secure and route inbound traffic based on a set of defined rules, such as rules based on an HTTP path, a destination hostname, or ports.

Combining a proxy and a reverse proxy on every member of the mesh affords a refined level of security, where only designated services are allowed to access other designated services. This is particularly useful for isolating services in case one of them is compromised.

Articles in Series

This series shows how to both secure and load balance gRPC and HTTP traffic.

  1. AKS with Azure Container Registry
  2. AKS with Calico network policies
  3. AKS with Linkerd service mesh (this article)
  4. AKS with Istio service mesh

Previous Article

The previous article discussed Kubernetes network plugins and network policies with Azure CNI and Calico network plugins.

Requirements

To create Azure cloud resources, you will need a subscription that allows you to create resources.

  • Azure CLI tool (az): command line tool that interacts with Azure API.
  • Kubernetes client tool (kubectl): command line tool that interacts with the Kubernetes API.
  • Helm (helm): command line tool for “templating and sharing Kubernetes manifests” (ref) that are bundled as Helm chart packages.
  • helm-diff plugin: allows you to see the changes made with helm or helmfile before applying the changes.
  • Helmfile (helmfile): command line tool that uses a “declarative specification for deploying Helm charts across many environments” (ref).
  • Linkerd CLI (linkerd): command line tool that can configure, deploy, and verify the Linkerd environment and extensions.

Many of the tools, such as grpcurl, curl, and jq, will be accessible from the Docker container. For building images and running scripts, I highly recommend these tools:

  • POSIX shell (sh) such as GNU Bash (bash) or Zsh (zsh): the scripts in this guide were tested using either of these shells on macOS and Ubuntu Linux.
  • Docker (docker): command line tool to build, test, and push docker images.
  • SmallStep CLI (step): a zero-trust Swiss Army knife for working with certificates.

Project file structure

The following structure will be used:

~/azure_linkerd
├── certs
│   ├── ca.crt
│   ├── ca.key
│   ├── issuer.crt
│   └── issuer.key
├── env.sh
└── examples
    ├── dgraph
    │   ├── helmfile.yaml
    │   └── network_policy.yaml
    └── pydgraph
        ├── Dockerfile
        ├── Makefile
        ├── helmfile.yaml
        ├── load_data.py
        ├── requirements.txt
        ├── sw.nquads.rdf
        └── sw.schema

With either Bash or Zsh, you can create the file structure with the following commands:
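Here is a minimal sketch of one way to do it (the certificate files under certs are generated later with the step tool):

# Create the project directories and placeholder files
mkdir -p ~/azure_linkerd/certs
mkdir -p ~/azure_linkerd/examples/dgraph ~/azure_linkerd/examples/pydgraph
cd ~/azure_linkerd
touch env.sh \
  examples/dgraph/helmfile.yaml \
  examples/dgraph/network_policy.yaml \
  examples/pydgraph/{Dockerfile,Makefile,helmfile.yaml,load_data.py,requirements.txt,sw.nquads.rdf,sw.schema}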

Set up the environment variables below to keep a consistent environment amongst the different tools used in this article. If you are using a POSIX shell, you can save them into a script and source that script whenever needed.

Copy this source script and save as env.sh:
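The original script was moved to a gist, so the sketch below uses assumed names and values; adjust them to match your own environment.

# env.sh — shared environment variables for the tools used in this article
# NOTE: all names and values below are placeholders (assumptions), not the
# exact values from the original gist.
export AZ_RESOURCE_GROUP="azure-linkerd"      # Azure resource group (assumed)
export AZ_LOCATION="westus2"                  # Azure region (assumed)
export AZ_CLUSTER_NAME="aks-linkerd-demo"     # AKS cluster name (assumed)
export AZ_ACR_NAME="azurelinkerddemo"         # ACR name, alphanumeric only (assumed)

# DNS name of the Dgraph Alpha service used in the curl/grpcurl tests (assumed)
export DGRAPH_ALPHA_SERVER="dgraph-dgraph-alpha.dgraph.svc.cluster.local"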

NOTE: The default container registry for Linkerd uses GHCR (GitHub Container Registry). In my experiments, this has been a source of problems with AKS, so as an alternative, I recommend republishing the container images to another registry. Look for optional instructions below if you are interested in doing this as well.

Provision Azure resources

Azure cloud resources

Both the AKS cluster (with Azure CNI and Calico network policies) and the ACR cloud resources can be provisioned with the steps outlined in the script below.

NOTE: Though these instructions are oriented toward AKS and ACR, you can use any Kubernetes cluster with the Calico network plugin installed for network policies, and you can use any container registry, as long as it is accessible from the Kubernetes cluster.
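A sketch of the provisioning steps (the full script was moved to a gist); the names come from env.sh above:

source env.sh

# Resource group to hold the AKS and ACR resources
az group create --name ${AZ_RESOURCE_GROUP} --location ${AZ_LOCATION}

# Container registry
az acr create \
  --resource-group ${AZ_RESOURCE_GROUP} \
  --name ${AZ_ACR_NAME} \
  --sku Basic

# AKS cluster with Azure CNI and Calico network policies, attached to the ACR
az aks create \
  --resource-group ${AZ_RESOURCE_GROUP} \
  --name ${AZ_CLUSTER_NAME} \
  --network-plugin azure \
  --network-policy calico \
  --attach-acr ${AZ_ACR_NAME}

# Fetch credentials for kubectl
az aks get-credentials \
  --resource-group ${AZ_RESOURCE_GROUP} \
  --name ${AZ_CLUSTER_NAME}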

Verify that the AKS cluster was created and that you have a KUBECONFIG that is authorized to access the cluster by running the following:

source env.sh
kubectl get all --all-namespaces

The final results should look something like this:

The Linkerd service mesh

Kubernetes Components

Linkerd can be installed using either the linkerd CLI or the linkerd2 Helm chart. For this article, the linkerd command will be used to generate the Kubernetes manifests, which are then applied with the kubectl command.

Linkerd requires a trust anchor certificate and an issuer certificate with its corresponding key to support mutual TLS connections between meshed pods. All certificates must use the ECDSA P-256 algorithm, which is the default for the step command. Alternatively, you can use the openssl ecparam -name prime256v1 command.

To generate the certificates with the step command, run the following commands.
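This is a sketch following the Linkerd documentation, with output paths matching the certs directory above:

# Trust anchor (root CA)
step certificate create root.linkerd.cluster.local \
  certs/ca.crt certs/ca.key \
  --profile root-ca --no-password --insecure

# Issuer certificate and key, signed by the trust anchor
step certificate create identity.linkerd.cluster.local \
  certs/issuer.crt certs/issuer.key \
  --profile intermediate-ca --not-after 8760h --no-password --insecure \
  --ca certs/ca.crt --ca-key certs/ca.key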

Linkerd uses GitHub Container Registry, which has been consistently unreliable when fetching images from AKS (v1.19.11). These errors can cause deploys to take around 20 to 30 minutes.

As an optional step, the container images can be republished to ACR, which can reduce deploy time significantly, to about 3 minutes. The script below sketches the steps to republish the images.
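The image list and tag in this sketch are assumptions; verify them against your linkerd CLI version and the manifests produced by linkerd install:

source env.sh
LINKERD_VERSION="stable-2.10.2"   # assumed tag; match the output of `linkerd version`

# Image list is an assumption; check the images referenced in the install manifests
for IMAGE in proxy controller metrics-api tap web; do
  az acr import \
    --name ${AZ_ACR_NAME} \
    --source ghcr.io/linkerd/${IMAGE}:${LINKERD_VERSION} \
    --image linkerd/${IMAGE}:${LINKERD_VERSION} \
    --force
done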

You can install linkerd with the generated certificates using the linkerd command line tool:

The linkerd command generates Kubernetes manifests that are then piped to the kubectl command:
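Here is a minimal sketch using the certificates generated earlier; if you republished the images to ACR, point the installation at that registry as well:

# Install the Linkerd control plane with the generated mTLS certificates
linkerd install \
  --identity-trust-anchors-file certs/ca.crt \
  --identity-issuer-certificate-file certs/issuer.crt \
  --identity-issuer-key-file certs/issuer.key \
  | kubectl apply -f -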

When completed, run this command to verify the deployed Linkerd infrastructure:

kubectl get all --namespace linkerd

This will show something like the following:

You can also run linkerd check to verify the health of Linkerd:

The Viz extension adds graphical web dashboards, the Prometheus metrics system, and Grafana dashboards:

linkerd viz install | kubectl apply -f -

You can check the infrastructure with the following command:

kubectl get all --namespace linkerd-viz

This should show something like the following:

Additionally you can run linkerd viz check:

The Jaeger extension installs Jaeger, a distributed tracing solution.

linkerd jaeger install | kubectl apply -f -

You can check up on the success of the deployment with the following command:

kubectl get all --namespace linkerd-jaeger

This should show something like the following:

Additionally you can run linkerd jaeger check:

You can port-forward to localhost with this command:

linkerd viz dashboard &

This should show something like the following:

The Dgraph service

Dgraph is a distributed graph database. The cluster deployed here consists of three Dgraph Alpha member nodes, which host the graph data, and three Dgraph Zero nodes, which manage the state of the Dgraph cluster, including the timestamps. The Dgraph Alpha service supports both an HTTP (GraphQL) interface on port 8080 and a gRPC interface on port 9080.

The Linkerd magic takes place by injecting a proxy sidecar container into each pod that will be a member of the service mesh. This can be configured by adding template annotations to a deployment or statefulset controller.
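For example, the annotation looks like this inside a pod template (a trimmed-down illustration, not a complete manifest):

# Pod template annotation that tells Linkerd to inject its proxy sidecar
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled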

Dgraph will be deployed using the Dgraph chart, but instead of the normal route of installing Dgraph with helmfile apply, a Kubernetes manifest will be generated with helmfile template, so that the linkerd inject command can be used.

NOTE: Currently the Dgraph chart does not yet have direct support for modifying the template annotations in the statefulset. Recently, I added a pull request for this feature, and hopefully a new chart version will be published. In the meantime, the helmfile template command will work.

Run these commands to deploy Dgraph with the Linkerd proxy sidecars injected:
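This is a sketch; it assumes the chart values live in examples/dgraph/helmfile.yaml.

kubectl create namespace dgraph

# Render the chart, inject the Linkerd proxy sidecar, and apply the result
helmfile --file examples/dgraph/helmfile.yaml template \
  | linkerd inject - \
  | kubectl apply --namespace dgraph --filename -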

After about a minute, the Dgraph cluster will come up. You can verify this with the following command:

kubectl get all --namespace "dgraph"

We should see something like this:

Should you want to run a gRPC client that connects to Dgraph from the same Kubernetes cluster, you will want to generate a service profile. This allows traffic to be distributed more evenly across the Dgraph member nodes.

Generate and deploy a service profile:
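A sketch of doing this from Dgraph's protobuf API definition; the api.proto path and the service name (based on the Dgraph Helm chart defaults) are assumptions:

# Generate a ServiceProfile from the gRPC proto and apply it
linkerd profile --proto api.proto \
  dgraph-dgraph-alpha --namespace dgraph \
  | kubectl apply --filename -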

The pydgraph client

In an earlier blog, I documented the steps to build and release a pydgraph-client image, and then deploy a pod that uses this image.

The pydgraph-client pod will have all the tools needed to test both HTTP and gRPC. We’ll use this client to run through the following tests:

  1. Establish that basic connectivity works (baseline).
  2. Apply a network policy with Calico to block all non-proxy traffic and verify that connectivity no longer works.
  3. Inject a proxy into the pydgraph client and verify that connectivity through the proxy works.

Below is a script you can use to download the gists and populate the files needed to run through these steps.

NOTE: These scripts and further details are covered in an earlier article (see AKS with Azure Container Registry).

Now that all the required source files are available, build the image:
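A sketch of building, pushing, and deploying the client image; the image name and the helmfile deploy step are assumptions based on the earlier article:

source env.sh
az acr login --name ${AZ_ACR_NAME}

# Build and push the pydgraph-client image to ACR
docker build \
  --tag ${AZ_ACR_NAME}.azurecr.io/pydgraph-client:latest \
  examples/pydgraph

docker push ${AZ_ACR_NAME}.azurecr.io/pydgraph-client:latest

# Deploy the pydgraph-client pod into its own namespace
kubectl create namespace pydgraph-client
helmfile --file examples/pydgraph/helmfile.yaml apply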

After running kubectl get all --namespace pydgraph-client, this should result in something like the following:

For the next set of tests, you will need to log into the container. This can be done with the following commands:

PYDGRAPH_POD=$(kubectl get pods \
  --namespace pydgraph-client \
  --output name
)

kubectl exec -ti \
  --namespace pydgraph-client ${PYDGRAPH_POD} \
  --container pydgraph-client -- bash

Test 0 (Baseline): No Proxy

Verify that things are working without a proxy or network policies.

In this sanity check and the subsequent tests, both HTTP (port 8080) and gRPC (port 9080) will be tested.

No proxy on pydgraph-client

Log into the pydgraph-client pod and run this command:

curl ${DGRAPH_ALPHA_SERVER}:8080/health | jq

The expected results will be something similar to this:

Log into the pydgraph-client pod and run this command:

grpcurl -plaintext -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion

The expected results will be something similar to this:

Test 1: Add a network policy

The goal of this next test is to deny all traffic originating outside of the service mesh. This can be done by using network policies where only traffic from the service mesh is permitted.

After adding the policy, the expected results will be timeouts, as communication from the pydgraph-client will be blocked.

Network Policy added to block traffic outside the mesh

This policy will deny all traffic to the Dgraph Alpha pods, except for traffic from the service mesh, or more explicitly, from any pod with the label linkerd.io/control-plane-ns=linkerd.

Dgraph Network Policy for Linkerd (made with https://editor.cilium.io)

Copy the following and save as examples/dgraph/network_policy.yaml:
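This sketch assumes the Dgraph Alpha pod labels from the Dgraph Helm chart; adjust the podSelector to match your deployment:

# examples/dgraph/network_policy.yaml
# Only admit traffic from meshed pods, i.e. pods carrying the
# linkerd.io/control-plane-ns=linkerd label added by linkerd inject.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dgraph-allow-mesh-only
  namespace: dgraph
spec:
  podSelector:
    matchLabels:
      app: dgraph          # assumed label from the Dgraph chart
      component: alpha     # assumed label from the Dgraph chart
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              linkerd.io/control-plane-ns: linkerd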

When ready, apply this with the following command:

kubectl --filename ./examples/dgraph/network_policy.yaml apply

HTTP check (network policy applied)

Log into the pydgraph-client pod, and run this command:

curl ${DGRAPH_ALPHA_SERVER}:8080/health

The expected results in this case, after a very long wait (about 5 minutes), will be something similar to this:

gRPC check (network policy applied)

Log into the pydgraph-client pod and run this command:

grpcurl -plaintext -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion

The expected results for gRPC in about 10 seconds will be:

Test 2: Inject the Linkerd proxy sidecar

Now that we have verified that network connectivity is not possible, we can inject a proxy sidecar so that traffic will be permitted.
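One way to do this is to re-deploy the client through linkerd inject, mirroring the Dgraph deployment above (a sketch):

helmfile --file examples/pydgraph/helmfile.yaml template \
  | linkerd inject - \
  | kubectl apply --namespace pydgraph-client --filename -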

A new container, linkerd-proxy, is added to the pod:

View of containers (Lens tool https://k8slens.dev/)

Log into the pydgraph-client pod and run this command:

curl ${DGRAPH_ALPHA_SERVER}:8080/health | jq

The expected results will look something similar to this:

Log into the pydgraph-client pod and run this command:

grpcurl -plaintext -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion

The expected results will look something similar to this:

Test 3: Listening to traffic streams

For this step, we will monitor traffic as it goes through the proxy and then generate some traffic. For monitoring, we’ll use tap from both the command line and the dashboard to listen to traffic streams.

In a separate terminal tab or window, run this command to monitor traffic:

linkerd viz tap namespace/pydgraph-client

We can also do the same thing in the Linkerd Viz Dashboard under the Tap area:

  1. Set the Namespace field to pydgraph-client.
  2. Set the Resource field to namespace/pydgraph-client.
  3. Click on the [START] button.

With this monitoring in place, log into the pydgraph-client pod and run these commands:
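For example, the same health check and version check used earlier will generate both HTTP and gRPC traffic:

# HTTP traffic
curl ${DGRAPH_ALPHA_SERVER}:8080/health | jq

# gRPC traffic
grpcurl -plaintext -proto api.proto \
  ${DGRAPH_ALPHA_SERVER}:9080 \
  api.Dgraph/CheckVersion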

In the separate terminal tab or window, you should see output like this below:

NOTE: The colors were added manually to highlight traffic generated from the different commands.

In the Viz Dashboard, you should see something like this:

Cleanup

This will remove the AKS cluster as well as any resources provisioned through AKS, including external volumes created through the Dgraph deployment.

az aks delete \
--resource-group $AZ_RESOURCE_GROUP \
--name $AZ_CLUSTER_NAME

Resources

Here are some links to topics, articles, and tools used in this article:

This is the source code related to this blog.

These are applications that can be used to walk through the features of a service mesh.

Topics on gRPC load balancing on Kubernetes.

O11y (observability) is a term used to distinguish observability practices and patterns in cloud native infrastructure.

Here are some topics around service mesh traffic access in the community, related to an upcoming feature in the stable-2.11 release.

  • September 4, 2021: moved multiline code to gists, updated images
  • August 6, 2021: updated Linkerd architecture image

Conclusion

Linkerd is a breeze to set up and get off the ground, despite the numerous components and processes happening behind the scenes.

Beyond attractive features like automation for o11y (cloud native observability) and encryption-in-transit with mutual TLS, one often overlooked feature is load balancing, not only for HTTP traffic but for gRPC as well. This is important because of the following:

gRPC also breaks the standard connection-level load balancing, including what’s provided by Kubernetes. This is because gRPC is built on HTTP/2, and HTTP/2 is designed to have a single long-lived TCP connection, across which all requests are multiplexed — meaning multiple requests can be active on the same connection at any point in time. (ref)

With the default Kubernetes service resource (kube-proxy), load balancing happens at the connection level, which for gRPC means a single node in your highly available cluster will soak up all the traffic. Thus, the load balancing feature of Linkerd becomes, in my mind, one of its most essential features.

For security beyond encryption-in-transit with mutual TLS, restricting access to pods is also important. This area is called defense-in-depth, a layered approach to restricting which services should be able to connect to each other.

In this article, I touched on how to do a little of this with network policies using the Calico network plugin.

It would be really nice to have policies that can be applied to mesh traffic as well. This is coming in the upcoming stable-2.11 release: traffic access with two new CRDs, Server and ServerAuthorization.

Thank you for finishing this article; I hope that this helped you in your journey.

