AKS with Linkerd Service Mesh

Securing internal traffic with Linkerd on AKS

Last Updated: September 4, 2021 (moved illustrations and code snippets to gists)

The two most often neglected domains of cloud operations are security and observability. This should come as no surprise, because adding security, such as encryption-in-transit with mutual TLS (where both client and server verify each other), and adding traffic monitoring and tracing to short-lived, transitory pods is by its very nature complex.

What if you could add automation for both security and observability (o11y) in less than 15 minutes of effort?

The solution to all of this complexity involves deploying a service mesh, and as unbelievable as it seems, the above statement can really happen with Linkerd.

This article covers using the Linkerd service mesh installed into AKS (Azure Kubernetes Service) with an example application, Dgraph.

Architecture

Linkerd Architecture: Control Plane vs Data Plane

A service mesh can be logically organized into two primary layers:

a control plane layer that’s responsible for configuration and management, and a data plane layer that provides network functions valuable to distributed applications.

What is a service mesh?

The service mesh consists of forward and reverse proxies for every service that will be put into a network called a mesh. This allows you to secure and monitor traffic between all members within the mesh.

A forward proxy, for the uninitiated, redirects outbound web traffic to an intermediary web service that can apply security policies, such as blocking access to a malicious web site, before traffic is sent on its way to the destination.

A reverse proxy is an intermediary web service that can secure and route inbound traffic based on a set of defined rules, such as rules based on an HTTP path, a destination hostname, and ports.

The combination of forward and reverse proxies on every member of the mesh affords a refined level of security, where you can allow only designated services to access other designated services, which is particularly useful for isolating services in case one of them is compromised.

Articles in Series

This series shows how to both secure and load balance gRPC and HTTP traffic.

  1. AKS with Linkerd Service Mesh (this article)

Previous Article

The previous article discussed securing Dgraph on AKS with network policies using the Azure CNI and Calico network plugins.

Requirements

For creation of cloud resources, you will need an Azure subscription with permission to create resources.

Required Tools

  • Azure CLI (az): command line tool that interacts with the Azure API.
  • Kubernetes client (kubectl): command line tool that interacts with the Kubernetes API.
  • Helm (helm): command line tool for “templating and sharing Kubernetes manifests” that are bundled as Helm chart packages.
  • helm-diff plugin: allows you to see the changes made with helm or helmfile before applying the changes.
  • Helmfile (helmfile): command line tool that uses a “declarative specification for deploying Helm charts across many environments”.
  • Linkerd CLI (linkerd): command line tool that can configure, deploy, and verify the Linkerd environment and extensions.

Optional tools

Many of the tools, such as grpcurl, curl, and jq, will be accessible from the pydgraph-client container. For building images and running scripts, I highly recommend these tools:

  • POSIX shell (sh) such as bash or zsh: the scripts in this guide were tested using either of these shells on macOS and Ubuntu Linux.
  • Docker (docker): command line tool to build, test, and push docker images.
  • Smallstep CLI (step): a zero-trust swiss army knife for working with certificates.

Project file structure

The following structure will be used:

~/azure_linkerd
├── certs
│   ├── ca.crt
│   ├── ca.key
│   ├── issuer.crt
│   └── issuer.key
├── env.sh
└── examples
    ├── dgraph
    │   ├── helmfile.yaml
    │   └── network_policy.yaml
    └── pydgraph
        ├── Dockerfile
        ├── Makefile
        ├── helmfile.yaml
        ├── load_data.py
        ├── requirements.txt
        ├── sw.nquads.rdf
        └── sw.schema

With either bash or zsh, you can create the file structure with the following commands:
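A minimal sketch of those commands, using brace expansion (the certificate files under certs/ are generated later by the step command, so they are not touched here):

# Create the project layout shown above
mkdir -p ~/azure_linkerd/{certs,examples/{dgraph,pydgraph}}
cd ~/azure_linkerd
touch env.sh \
  examples/dgraph/helmfile.yaml \
  examples/dgraph/network_policy.yaml \
  examples/pydgraph/{Dockerfile,Makefile,helmfile.yaml,load_data.py,requirements.txt,sw.nquads.rdf,sw.schema}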

Project Environment Variables

Set up the environment variables below to keep a consistent environment amongst the different tools used in this article. If you are using a POSIX shell, you can save these into a script and source that script whenever needed.

Copy this source script and save as env.sh:
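A minimal sketch of what env.sh can look like; only AZ_RESOURCE_GROUP and AZ_CLUSTER_NAME are referenced directly later in this article, so the other variable names and all of the values here are assumptions to adjust for your environment:

#!/usr/bin/env bash
# env.sh -- project environment variables (values below are examples)
export AZ_RESOURCE_GROUP="linkerd-demo"   # assumed value
export AZ_LOCATION="westus2"              # assumed value
export AZ_CLUSTER_NAME="linkerd-demo"     # assumed value
export AZ_ACR_NAME="linkerddemoacr"       # assumed; ACR names must be globally unique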

NOTE: The default container registry used by Linkerd has been a source of problems with AKS in my experiments, so as an alternative, I recommend republishing the container images to another registry. Look for optional instructions below if you are interested in doing this as well.

Provision Azure resources

Azure cloud resources

Both an AKS cluster with network policies enabled and the supporting cloud resources can be provisioned with the steps outlined in the script below.

NOTE: Though these instructions are oriented toward AKS and ACR, you can use any Kubernetes cluster with the Calico network plugin installed for network policies, and you can use any container registry, as long as it is accessible from the cluster.
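A sketch of the provisioning steps, assuming an AKS cluster with the Azure CNI plugin, Calico network policies, and an attached ACR registry (adjust names, node counts, and versions for your environment):

# Provision the resource group, container registry, and AKS cluster
source env.sh

az group create \
  --name "${AZ_RESOURCE_GROUP}" \
  --location "${AZ_LOCATION}"

az acr create \
  --resource-group "${AZ_RESOURCE_GROUP}" \
  --name "${AZ_ACR_NAME}" \
  --sku Basic

az aks create \
  --resource-group "${AZ_RESOURCE_GROUP}" \
  --name "${AZ_CLUSTER_NAME}" \
  --node-count 3 \
  --network-plugin azure \
  --network-policy calico \
  --attach-acr "${AZ_ACR_NAME}"

# Fetch credentials into your KUBECONFIG
az aks get-credentials \
  --resource-group "${AZ_RESOURCE_GROUP}" \
  --name "${AZ_CLUSTER_NAME}"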

Verify AKS and KUBECONFIG

Verify that the cluster was created and that you have a KUBECONFIG that is authorized to access the cluster by running the following:

source env.sh
kubectl get all --all-namespaces

The final results should look something like this:

The Linkerd service mesh

Kubernetes Components

Linkerd can be installed using either the linkerd CLI or a Helm chart. For this article, the linkerd command will be used to generate the manifests, which are then applied with the kubectl command.

Generate Certificates

Linkerd requires a trust anchor certificate and an issuer certificate with the corresponding key to support mutual TLS connections between meshed pods. All certificates must use the ECDSA P-256 algorithm, which is the default for the step command. You can alternatively use the openssl ecparam -name prime256v1 command.

To generate the certificates with the step command, run the following commands.
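These commands mirror the standard Linkerd documentation for creating a trust anchor and an intermediate issuer with step, written into the certs/ directory from the project layout:

# Generate the trust anchor (root CA)
step certificate create root.linkerd.cluster.local \
  certs/ca.crt certs/ca.key \
  --profile root-ca --no-password --insecure

# Generate the issuer certificate and key, signed by the trust anchor
step certificate create identity.linkerd.cluster.local \
  certs/issuer.crt certs/issuer.key \
  --profile intermediate-ca --not-after 8760h --no-password --insecure \
  --ca certs/ca.crt --ca-key certs/ca.key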

Republish Linkerd Images (optional)

Linkerd uses a default container registry that has been consistently unreliable when pulling images from AKS (v1.19.11). This causes deploys to take around 20 to 30 minutes due to these errors.

As an optional step, the container images can be republished to another registry, such as ACR, which can reduce deploy time significantly to about 3 minutes. In the script below, follow the steps to republish the images.
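A rough sketch of that process; the source registry, image list, and tag below are assumptions, so check the images referenced in the output of linkerd install for the exact names and versions before running it:

# Republish Linkerd images into your own registry (sketch; adjust image list/tag)
LINKERD_VERSION="stable-2.10.2"                 # assumed release
SOURCE="cr.l5d.io/linkerd"                      # assumed default Linkerd registry
TARGET="${AZ_ACR_NAME}.azurecr.io/linkerd"      # assumed ACR name from env.sh

az acr login --name "${AZ_ACR_NAME}"
for IMAGE in proxy controller metrics-api tap web grafana; do
  docker pull "${SOURCE}/${IMAGE}:${LINKERD_VERSION}"
  docker tag  "${SOURCE}/${IMAGE}:${LINKERD_VERSION}" "${TARGET}/${IMAGE}:${LINKERD_VERSION}"
  docker push "${TARGET}/${IMAGE}:${LINKERD_VERSION}"
done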

Install Linkerd

You can install linkerd with the generated certificates using the linkerd command line tool:

The linkerd command will generate manifests that are then piped to the kubectl command.
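A sketch of that install, supplying the trust anchor and issuer generated earlier (these flags follow the documented pattern for bringing your own certificates):

linkerd install \
  --identity-trust-anchors-file certs/ca.crt \
  --identity-issuer-certificate-file certs/issuer.crt \
  --identity-issuer-key-file certs/issuer.key \
  | kubectl apply --filename -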

When completed, run this command to verify the deployed infrastructure:

kubectl get all --namespace linkerd

This will show something like the following below:

You can also run linkerd check to verify the health of Linkerd:

Install the Viz extension

The Viz extension adds a metrics system and graphical web dashboards:

linkerd viz install | kubectl apply -f -

You can check the infrastructure with the following command:

kubectl get all --namespace linkerd-viz

This should show something like the following:

Additionally you can run linkerd viz check:

Install the Jaeger extension

The Jaeger extension will install Jaeger, a distributed tracing solution.

linkerd jaeger install | kubectl apply -f -

You can check up on the success of the deployment with the following command:

kubectl get all --namespace linkerd-jaeger

This should show something like the following:

Additionally you can run linkerd jaeger check:

Access Viz Dashboard

You can port-forward to localhost with this command:

linkerd viz dashboard &

This should show something like the following:

The Dgraph service

Dgraph is a distributed graph database consisting of three Alpha member nodes, which host the graph data, and three Zero nodes, which manage the state of the cluster, including the timestamps. The Alpha service supports both an HTTP interface on port 8080 and a gRPC interface on port 9080.

Linkerd's magic takes place by injecting a proxy sidecar container into each pod that will be a member of the service mesh. This can be configured by adding template annotations, such as linkerd.io/inject: enabled, to a Deployment or StatefulSet controller.

Deploy Dgraph with Linkerd

Dgraph will be deployed using the Dgraph Helm chart, but instead of the normal route of installing with helmfile apply, a manifest will be generated with helmfile template so that the linkerd inject command can be used.

NOTE: Currently, the Dgraph Helm chart does not yet have direct support for modifying the template annotations in the StatefulSet. Recently, I submitted a pull request for this feature, and hopefully a new chart version will be published. In the meantime, the helmfile template command will work.

Run these commands to deploy Dgraph with the injected proxy sidecars:
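A sketch of those commands, assuming the helmfile path from the project layout above (the namespace handling is also an assumption and may differ depending on how your helmfile is written):

# Render the chart, inject the Linkerd proxy, then apply the result
kubectl create namespace dgraph 2> /dev/null || true

helmfile --file examples/dgraph/helmfile.yaml template \
  | linkerd inject - \
  | kubectl apply --namespace dgraph --filename -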

After about a minute, the cluster will come up. You can verify this with the following command:

kubectl get all --namespace "dgraph"

We should see something like this:

Service Profile

Should you want to run a gRPC client to connect to Dgraph on the same cluster, you will want to generate a service profile. This will allow traffic to be distributed more evenly across the Alpha member nodes.

Generate and deploy a service profile:
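A sketch of that step; it assumes you have a local copy of Dgraph's api.proto (the same file the grpcurl checks use later), and the Alpha service name here is a placeholder to match against the output of kubectl get svc --namespace dgraph:

# Generate a ServiceProfile from the gRPC definition and apply it
linkerd profile --proto api.proto \
  demo-dgraph-alpha --namespace dgraph \
  | kubectl apply --filename -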

The pydgraph client

In an earlier article, I documented steps to build and release a pydgraph-client image, and then deploy a pod that uses this image.

The pydgraph-client pod will have all the tools needed to test both HTTP and gRPC. We’ll use this client to run through the following tests:

  1. Establish that basic connectivity works (baseline).
  2. Apply a network policy to block all non-proxy traffic with Calico and verify that connectivity no longer works.
  3. Inject a proxy into the pydgraph-client pod and verify that connectivity through the proxy works.

Fetch build and deploy scripts

Below is a script you can use to download the gists and populate the files needed to run through these steps.

NOTE: These scripts and further details are covered in the earlier pydgraph article mentioned above.

Build, push, and deploy the pydgraph client

Now that all the required source files are available, build the image:
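A sketch of those steps, assuming the ACR registry named in env.sh; the exact build targets live in the project's Makefile and are covered in the earlier pydgraph article:

# Build and push the pydgraph-client image, then deploy it with helmfile
pushd examples/pydgraph
az acr login --name "${AZ_ACR_NAME}"
docker build --tag "${AZ_ACR_NAME}.azurecr.io/pydgraph-client:latest" .
docker push "${AZ_ACR_NAME}.azurecr.io/pydgraph-client:latest"
helmfile apply
popd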

After running kubectl get all --namespace pydgraph-client, this should result in something like the following:

Log into the pydgraph-client container

For the next set of tests, you will need to log into the container. This can be done with the following commands:

PYDGRAPH_POD=$(kubectl get pods \
  --namespace pydgraph-client \
  --output name
)

kubectl exec -ti \
  --namespace pydgraph-client \
  ${PYDGRAPH_POD} \
  --container pydgraph-client -- bash

Test 0 (Baseline): No Proxy

Verify that things are working without a proxy or network policies.

In this sanity check and the subsequent tests, both HTTP (port 8080) and gRPC (port 9080) will be tested.

No proxy on pydgraph-client

HTTP check (no proxy)

Log into the pydgraph-client pod and run this command:

curl ${DGRAPH_ALPHA_SERVER}:8080/health | jq

The expected results will be something similar to this:

gRPC check (no proxy)

Log into the pydgraph-client pod and run this command:

grpcurl -plaintext -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion

The expected results will be something similar to this:

Test 1: Add a network policy

The goal of this next test is to deny all traffic that originates outside of the service mesh. This can be done with a network policy that permits only traffic from the service mesh.

After adding the policy, the expected result will be timeouts, as communication from the pydgraph-client will be blocked.

Network Policy added to block traffic outside the mesh

Adding a network policy

This policy will deny all traffic to the Alpha pods, except for traffic from the service mesh, or more explicitly, from any pod with the label linkerd.io/control-plane-ns=linkerd.

Dgraph Network Policy for Linkerd

Copy the following and save as examples/dgraph/network_policy.yaml:
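As a sketch of what such a policy can look like, written here as a heredoc that creates the file; the Dgraph pod labels in the podSelector are assumptions, so verify them with kubectl get pods --namespace dgraph --show-labels and adjust as needed:

cat << 'EOF' > examples/dgraph/network_policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dgraph-allow-linkerd-mesh-only
  namespace: dgraph
spec:
  podSelector:
    matchLabels:
      app: dgraph            # assumed labels from the Dgraph chart
      component: alpha
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              linkerd.io/control-plane-ns: linkerd
EOF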

When ready, apply this with the following command:

kubectl --filename ./examples/dgraph/network_policy.yaml apply

HTTP check (network policy applied)

Log into the pydgraph-client pod, and run this command:

curl ${DGRAPH_ALPHA_SERVER}:8080/health

The expected results in this case, after a very long wait (about 5 minutes), will be something similar to this:

gRPC check (network policy applied)

Log into the pydgraph-client pod and run this command:

grpcurl -plaintext -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion

The expected results, after about 10 seconds, will be:

Test 2: Inject Linkerd proxy side car

Now that we have verified that network connectivity is not possible, we can inject a proxy sidecar so that traffic will be permitted.

Inject the proxy in order to access Dgraph
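One way to do this (a sketch; the deployment name and namespace are assumed from the earlier pydgraph deployment) is to run the existing manifest back through linkerd inject:

# Inject the Linkerd proxy into the running pydgraph-client deployment
kubectl get deployment pydgraph-client \
  --namespace pydgraph-client \
  --output yaml \
  | linkerd inject - \
  | kubectl apply --filename -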

A new container linkerd-proxy is added to the pod:

View of containers (Lens tool)

HTTP check (proxy)

Log into the pydgraph-client pod and run this command:

curl ${DGRAPH_ALPHA_SERVER}:8080/health | jq

The expected results will look something similar to this:

gRPC check (proxy)

Log into the pydgraph-client pod and run this command:

grpcurl -plaintext -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion

The expected results will look something similar to this:

Test 3: Listening to traffic streams

For this step, we will monitor traffic as it goes through the proxy and then generate some traffic. For monitoring, we’ll use Viz Tap from the command line and the Viz dashboard to listen to traffic streams.

Viz Tap from the CLI

In a separate terminal tab or window, run this command to monitor traffic:

linkerd viz tap namespace/pydgraph-client

Viz Tap from the dashboard

We can also do the same thing in the Linkerd Viz Dashboard under the Tap area:

  1. Set the Namespace field to pydgraph-client.
  2. Set the Resource field to namespace/pydgraph-client.
  3. Click on the [START] button.

Generate Traffic

With this monitoring in place, log into the pydgraph-client pod and run these commands:
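For example, repeating the same health and version checks from the earlier tests will produce traffic for the tap to observe (run these inside the pydgraph-client container):

# HTTP traffic: Dgraph Alpha health endpoint
curl ${DGRAPH_ALPHA_SERVER}:8080/health | jq

# gRPC traffic: Dgraph version check
grpcurl -plaintext -proto api.proto \
  ${DGRAPH_ALPHA_SERVER}:9080 \
  api.Dgraph/CheckVersion

# Optionally, load the sample dataset with the included load_data.py script
# (its exact flags are covered in the earlier pydgraph article).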

Observe the resulting traffic

In the separate terminal tab or window, you should see output like the following:

NOTE: The colors were added manually to highlight traffic generated from the different commands.

In the Viz Dashboard, you should see something like this:

Cleanup

This will remove the cluster as well as any provisioned resources from Azure, including external volumes created through the Dgraph deployment.

az aks delete \
--resource-group $AZ_RESOURCE_GROUP \
--name $AZ_CLUSTER_NAME

Resources

Here are some links to topics, articles, and tools used in this article:

Blog Source Code

This is the source code related to this blog.

Example Applications

These are applications that can be used to walk through the features of a service mesh.

General Service Mesh Articles

gRPC Load Balancing

Topics on gRPC load balancing on Kubernetes.

Linkerd Documentation

About o11y (cloud native observability)

O11y, short for observability, is a newer term used to distinguish observability practices and patterns in cloud native infrastructure.

Service Mesh Traffic Access

Here are some topics around service mesh traffic access in the community, related to an upcoming feature in the stable-2.11 release.

Document Changes for Blog

  • September 4, 2021: moved multiline code to gists, updated images
  • August 6, 2021: updated Linkerd architecture image

Conclusion

Linkerd is a breeze to set up and get off the ground, despite the numerous components and processes happening behind the scenes.

Load Balancing

Beyond attractive features like automation for o11y (cloud native observability) and encryption-in-transit with mutual TLS, one often overlooked feature is load balancing, not only for HTTP traffic but for gRPC as well. Why this is important:

gRPC also breaks the standard connection-level load balancing, including what’s provided by Kubernetes. This is because gRPC is built on HTTP/2, and HTTP/2 is designed to have a single long-lived TCP connection, across which all requests are multiplexed — meaning multiple requests can be active on the same connection at any point in time.

Using the default service resource (kube-proxy), connections are randomly selected, i.e., there is no request-level load balancing, and for gRPC, this means a single node in your highly available cluster will suck up all the traffic. Thus, gRPC-aware load balancing becomes, in my mind, one of the most essential features.

Restricting Traffic

For security beyond encryption-in-transit with mutual TLS, restricting access to pods is also important. This area is called defense-in-depth, a layered approach to restrict which services should be able to connect to each other.

In this article, I touched on how to do a little of this using the Calico network plugin.

It would be really nice to have some policies that can be applied to mesh traffic as well. This is happening in the upcoming stable-2.11 release, which adds traffic access with two new CRDs: Server and ServerAuthorization.

Final Thoughts

Thank you for finishing this article; I hope that this helped you in your journey.

Linux NinjaPants Automation Engineering Mutant — exploring DevOps, o11y, k8s, progressive deployment (ci/cd), cloud native infra, infra as code