AKS with Linkerd Service Mesh
Securing internal traffic with Linkerd on AKS
Last Updated: September 4, 2021 (moved illustrations and code snippets to gists)
The two most often neglected domains of cloud operations are security and o11y (observability). This should come as no surprise, because adding security, such as encryption-in-transit with mutual TLS (where both client and server verify each other), and adding traffic monitoring and tracing on short-lived pods are by their very nature complex.
What if you could add automation for both security and o11y in less than 15 minutes of effort?
The solution to all of this complexity is to deploy a service mesh, and as unbelievable as it seems, the statement above can really come true with Linkerd.
This article covers using the service mesh Linkerd installed into AKS (Azure Kubernetes Service) with an example application Dgraph.
Architecture
A service mesh can be logically organized into two primary layers:
a control plane layer that’s responsible for configuration and management, and a data plane layer that provides network functions valuable to distributed applications. (ref)
What is a service mesh?
A service mesh consists of proxies and reverse proxies deployed alongside every service that is added to a network called a mesh. This allows you to secure and monitor traffic between all members of the mesh.
A proxy, for the uninitiated, redirects outbound web traffic to an intermediary web service that can apply security policies, such as blocking access to a malicious web site, before the traffic is sent on to its destination.
The reverse proxy is an intermediary web service that secures and routes inbound traffic based on a set of defined rules, such as rules based on an HTTP path, a destination hostname, or ports.
Combining a proxy and a reverse proxy on every member of the mesh affords a refined level of security where only designated services can access other designated services, which is particularly useful for isolating services in case one of them is compromised.
Articles in Series
This series shows how to both secure and load balance gRPC and HTTP traffic.
- AKS with Azure Container Registry
- AKS with Calico network policies
- AKS with Linkerd service mesh (this article)
- AKS with Istio service mesh
Previous Article
The previous article discussed Kubernetes network plugins and network policies with Azure CNI and Calico network plugins.
Requirements
To create the Azure cloud resources, you will need a subscription that allows you to create resources.
Required Tools
- Azure CLI tool (az): command line tool that interacts with the Azure API.
- Kubernetes client tool (kubectl): command line tool that interacts with the Kubernetes API.
- Helm (helm): command line tool for “templating and sharing Kubernetes manifests” (ref) that are bundled as Helm chart packages.
- helm-diff plugin: allows you to see the changes made with helm or helmfile before applying the changes.
- Helmfile (helmfile): command line tool that uses a “declarative specification for deploying Helm charts across many environments” (ref).
- Linkerd CLI (linkerd): command line tool that can configure, deploy, and verify the Linkerd environment and extensions.
Optional tools
Many of the tools, such as grpcurl, curl, and jq, will be accessible from the Docker container. For building images and running scripts, I highly recommend these tools:
- POSIX shell (sh) such as GNU Bash (bash) or Zsh (zsh): the scripts in this guide were tested using either of these shells on macOS and Ubuntu Linux.
- Docker (docker): command line tool to build, test, and push docker images.
- SmallStep CLI (step): a zero trust swiss army knife for working with certificates.
Project file structure
The following structure will be used:
~/azure_linkerd
├── certs
│ ├── ca.crt
│ ├── ca.key
│ ├── issuer.crt
│ └── issuer.key
├── env.sh
└── examples
├── dgraph
│ ├── helmfile.yaml
│ └── network_policy.yaml
└── pydgraph
├── Dockerfile
├── Makefile
├── helmfile.yaml
├── load_data.py
├── requirements.txt
├── sw.nquads.rdf
└── sw.schema
With either Bash or Zsh, you can create the file structure with the following commands:
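For example, here is a minimal sketch of such commands (the article’s original snippet lives in the linked gist; the files under certs/ are generated in a later step, so they are not created here):

mkdir -p ~/azure_linkerd/certs \
  ~/azure_linkerd/examples/dgraph \
  ~/azure_linkerd/examples/pydgraph
cd ~/azure_linkerd
touch env.sh \
  examples/dgraph/helmfile.yaml \
  examples/dgraph/network_policy.yaml \
  examples/pydgraph/Dockerfile \
  examples/pydgraph/Makefile \
  examples/pydgraph/helmfile.yaml \
  examples/pydgraph/load_data.py \
  examples/pydgraph/requirements.txt \
  examples/pydgraph/sw.nquads.rdf \
  examples/pydgraph/sw.schema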
Project Environment Variables
Setup these environment variables below to keep a consistent environment amongst different tools used in this article. If you are using a POSIX shell, you can save these into a script and source that script whenever needed.
Copy this source script and save it as env.sh:
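As a minimal sketch of what env.sh might contain: AZ_RESOURCE_GROUP and AZ_CLUSTER_NAME are referenced later in this article, while the remaining variable names and all of the values here are placeholders you should adjust to your own environment:

# env.sh — project environment variables (values are examples only)
export AZ_RESOURCE_GROUP="azure-linkerd"         # assumed name
export AZ_LOCATION="westus2"                     # assumed region
export AZ_CLUSTER_NAME="linkerd-demo"            # assumed name
export AZ_ACR_NAME="linkerddemoacr"              # assumed name; must be globally unique
export KUBECONFIG="$HOME/.kube/${AZ_CLUSTER_NAME}.yaml"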
NOTE: The default container registry for Linkerd uses GHCR (GitHub Container Registry). In my experiments, this has been a source of problems with AKS, so as an alternative, I recommend republishing the container images to another registry. Look for optional instructions below if you are interested in doing this as well.
Provision Azure resources
The AKS cluster (with Azure CNI and Calico network policies) and the ACR cloud resources can be provisioned with the steps outlined in the script below.
NOTE: Though these instructions are oriented toward AKS and ACR, you can use any Kubernetes cluster with the Calico network plugin installed for network policies, and you can use any container registry, as long as it is accessible from the Kubernetes cluster.
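Here is a sketch of those provisioning steps, assuming the variable names from the env.sh example above (the full script is in the linked gist):

source env.sh

# resource group and container registry
az group create --name "${AZ_RESOURCE_GROUP}" --location "${AZ_LOCATION}"
az acr create --resource-group "${AZ_RESOURCE_GROUP}" --name "${AZ_ACR_NAME}" --sku Basic

# AKS cluster with the Azure CNI network plugin and Calico network policies
az aks create \
  --resource-group "${AZ_RESOURCE_GROUP}" \
  --name "${AZ_CLUSTER_NAME}" \
  --network-plugin azure \
  --network-policy calico \
  --node-count 3 \
  --attach-acr "${AZ_ACR_NAME}"

# fetch credentials for kubectl into $KUBECONFIG
az aks get-credentials \
  --resource-group "${AZ_RESOURCE_GROUP}" \
  --name "${AZ_CLUSTER_NAME}" \
  --file "${KUBECONFIG}"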
Verify AKS and KUBECONFIG
Verify that the AKS cluster was created and that you have a KUBECONFIG that is authorized to access the cluster by running the following:
source env.sh
kubectl get all --all-namespaces
The final results should look something like this:
The Linkerd service mesh
Linkerd can be installed using either the linkerd CLI or the linkerd2 Helm chart. For this article, the linkerd command will be used to generate the Kubernetes manifests, which are then applied with the kubectl command.
Generate Certificates
Linkerd requires a trust anchor certificate and an issuer certificate with its corresponding key to support mutual TLS connections between meshed pods. All certificates must use the ECDSA P-256 algorithm, which is the default for the step command. Alternatively, you can use the openssl ecparam -name prime256v1 command.
To generate the certificates with the step command, run the following commands.
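A sketch following the Linkerd “Generate Certificates” documentation (linked in the Resources section), writing the files into the certs/ directory from the project layout above:

# trust anchor (root CA)
step certificate create root.linkerd.cluster.local \
  certs/ca.crt certs/ca.key \
  --profile root-ca --no-password --insecure

# issuer certificate and key, signed by the trust anchor
step certificate create identity.linkerd.cluster.local \
  certs/issuer.crt certs/issuer.key \
  --profile intermediate-ca --not-after 8760h --no-password --insecure \
  --ca certs/ca.crt --ca-key certs/ca.key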
Republish Linkerd Images (optional)
Linkerd uses GitHub Container Registry (GHCR), which has been consistently unreliable when fetching images from AKS (v1.19.11). These errors caused deploys to take around 20 to 30 minutes.
As an optional step, the container images can be republished to ACR, which can reduce deploy time significantly, to about 3 minutes. Follow the steps in the script below to republish the images.
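A sketch of one way to do this with az acr import; the image list is derived from the manifests the linkerd CLI generates, and the registry hosts in the grep pattern are an assumption to verify against your Linkerd version:

source env.sh

# list the upstream images referenced in the generated manifests
IMAGES=$(linkerd install \
  --identity-trust-anchors-file certs/ca.crt \
  --identity-issuer-certificate-file certs/issuer.crt \
  --identity-issuer-key-file certs/issuer.key \
  | grep -oE '(ghcr\.io|cr\.l5d\.io)/linkerd/[a-zA-Z0-9._-]+:[a-zA-Z0-9._-]+' \
  | sort -u)

# copy each image into ACR, keeping the same repository name and tag
for SOURCE in ${IMAGES}; do
  az acr import \
    --name "${AZ_ACR_NAME}" \
    --source "${SOURCE}" \
    --image "linkerd/${SOURCE##*/}"
done

If you go this route, the install step also needs to reference the ACR images instead of the defaults; consult the Linkerd CLI or chart options for overriding the image registry, which this sketch does not cover.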
Install Linkerd
You can install Linkerd with the generated certificates using the linkerd command line tool:
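A minimal sketch using the default registry, following the Linkerd documentation (if you republished images to ACR in the previous section, adjust the image settings accordingly):

linkerd install \
  --identity-trust-anchors-file certs/ca.crt \
  --identity-issuer-certificate-file certs/issuer.crt \
  --identity-issuer-key-file certs/issuer.key \
  | kubectl apply -f -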
The linkerd command generates Kubernetes manifests that are then piped to the kubectl command.
When completed, run this command to verify the deployed Linkerd infrastructure:
kubectl get all --namespace linkerd
This will show something like the following below:
You can also run linkerd check to verify the health of Linkerd:
Install the Viz extension
The Viz extension adds graphical web dashboards, the Prometheus metrics system, and Grafana dashboards:
linkerd viz install | kubectl apply -f -
You can check the infrastructure with the following command:
kubectl get all --namespace linkerd-viz
This should show something like the following:
Additionally, you can run linkerd viz check:
Install the Jaeger extension
The Jaeger extension installs Jaeger, a distributed tracing solution.
linkerd jaeger install | kubectl apply -f -
You can check up on the success of the deployment with the following command:
kubectl get all --namespace linkerd-jaeger
This should show something like the following:
Additionally, you can run linkerd jaeger check:
Access Viz Dashboard
You can port-forward to localhost with this command:
linkerd viz dashboard &
This should show something like the following:
The Dgraph service
Dgraph is a distributed graph database consisting of three Dgraph Alpha member nodes, which host the graph data, and three Dgraph Zero nodes, which manage the state of the Dgraph cluster, including timestamps. The Dgraph Alpha service supports both an HTTP (GraphQL) interface on port 8080 and a gRPC interface on port 9080.
The Linkerd magic takes place by injecting a proxy sidecar container into each pod that will be a member of the service mesh. This is configured by adding template annotations to a deployment or statefulset controller, as sketched below.
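As a hypothetical illustration (not the method used later in this article, which relies on linkerd inject), the pod template annotation could be added to an existing deployment like this; my-app and my-namespace are placeholders:

# add the linkerd.io/inject annotation to a (hypothetical) deployment's pod template
kubectl patch deployment my-app --namespace my-namespace --type merge --patch \
  '{"spec":{"template":{"metadata":{"annotations":{"linkerd.io/inject":"enabled"}}}}}'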
Deploy Dgraph with Linkerd
Dgraph will be deployed using the Dgraph chart, but instead of the normal route of installing Dgraph with helmfile apply, a Kubernetes manifest will be generated with helmfile template so that the linkerd inject command can be used.
NOTE: Currently the Dgraph chart does not yet have direct support for modifying the template annotations in the statefulset. Recently, I added a pull request for this feature, and hopefully a new chart version will be published. In the meantime, the helmfile template command will work.
Run these commands to deploy Dgraph with the injected Linkerd proxy sidecars:
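A sketch of those commands, assuming the helmfile.yaml fetched into examples/dgraph (see the gist) and a namespace named dgraph:

kubectl create namespace dgraph

helmfile --file examples/dgraph/helmfile.yaml template \
  | linkerd inject - \
  | kubectl apply --namespace dgraph --filename -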
After about a minute, the Dgraph cluster will come up; you can verify this with the following command:
kubectl get all --namespace "dgraph"
We should see something like this:
Service Profile
Should you want to run a gRPC client that connects to Dgraph from the same Kubernetes cluster, you will want to generate a service profile. This allows traffic to be distributed more evenly across the Dgraph member nodes.
Generate and deploy a service profile:
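A sketch, assuming the Dgraph api.proto file is available locally and that the gRPC service name is demo-dgraph-alpha-grpc (a placeholder; adjust it to the service name created by your Dgraph release):

linkerd profile \
  --namespace dgraph \
  --proto api.proto \
  demo-dgraph-alpha-grpc \
  | kubectl apply --namespace dgraph --filename -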
The pydgraph client
In an earlier blog, I documented the steps to build and release a pydgraph-client image, and then deploy a pod that uses this image.
The pydgraph-client pod will have all the tools needed to test both HTTP and gRPC. We’ll use this client to run through the following tests:
- Establish basic connectivity works (baseline)
- Apply a network policy to block all non-proxy traffic with Calico and verify connectivity no longer works.
- Inject a proxy into the pydgraph and verify connectivity through proxy works
Fetch build and deploy scripts
Below is a script you can use to download the gists and populate the files needed to run through these steps.
NOTE: These scripts and further details are covered in an earlier article (see AKS with Azure Container Registry).
Build, push, and deploy the pydgraph client
Now that all the required source files are available, build and push the image, then deploy the client:
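A sketch of those steps, assuming the AZ_ACR_NAME variable from the env.sh example and the files fetched in the previous step (the earlier article’s Makefile wraps similar commands):

source env.sh
az acr login --name "${AZ_ACR_NAME}"

# build and push the pydgraph-client image
docker build \
  --tag "${AZ_ACR_NAME}.azurecr.io/pydgraph-client:latest" \
  examples/pydgraph
docker push "${AZ_ACR_NAME}.azurecr.io/pydgraph-client:latest"

# deploy the client pod with the fetched helmfile
helmfile --file examples/pydgraph/helmfile.yaml apply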
After running kubectl get all --namespace pydgraph-client, this should result in something like the following:
Log into the pydgraph-client container
For the next set of tests, you will need to log into the container. This can be done with the following commands:
PYDGRAPH_POD=$(kubectl get pods \
  --namespace pydgraph-client \
  --output name
)

kubectl exec -ti \
  --namespace pydgraph-client ${PYDGRAPH_POD} \
  --container pydgraph-client -- bash
Test 0 (Baseline): No Proxy
Verify that things are working without a proxy or network policies.
In this sanity check and the subsequent tests, both HTTP (port 8080) and gRPC (port 9080) will be tested.
HTTP check (no proxy)
Log into the pydgraph-client pod and run this command:
curl ${DGRAPH_ALPHA_SERVER}:8080/health | jq
The expected results will be something similar to this:
gRPC check (no proxy)
Log into the pydgraph-client pod and run this command:
grpcurl -plaintext -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion
The expected results will be something similar to this:
Test 1: Add a network policy
The goal of this next test is to deny all traffic that comes from outside the service mesh. This can be done by using network policies where only traffic from the service mesh is permitted.
After adding the policy, the expected results will be timeouts, as communication from the pydgraph-client pod will be blocked.
Adding a network policy
This policy will deny all traffic to the Dgraph Alpha pods, except for traffic from the service mesh, or more explicitly, from any pod with the label linkerd.io/control-plane-ns=linkerd.
Copy the following and save it as examples/dgraph/network_policy.yaml:
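A sketch of such a policy, written directly to the file with a heredoc; it assumes the Dgraph chart’s default pod labels (app: dgraph, component: alpha) and the dgraph namespace, so adjust the selectors to match your release (the gist contains the exact manifest used in this article):

cat > examples/dgraph/network_policy.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dgraph-allow-linkerd-mesh-only
  namespace: dgraph
spec:
  podSelector:
    matchLabels:
      app: dgraph
      component: alpha
  policyTypes:
    - Ingress
  ingress:
    - from:
        # allow only pods labeled by the Linkerd proxy injector, in any namespace
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              linkerd.io/control-plane-ns: linkerd
EOF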
When ready, apply this with the following command:
kubectl --filename ./examples/dgraph/network_policy.yaml apply
HTTP check (network policy applied)
Log into the pydgraph-client pod and run this command:
curl ${DGRAPH_ALPHA_SERVER}:8080/health
The expected results in this case, after a very long wait (about 5 minutes), will be something similar to this:
gRPC check (network policy applied)
Log into the pydgraph-client pod and run this command:
grpcurl -plaintext -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion
The expected results for gRPC in about 10 seconds will be:
Test 2: Inject the Linkerd proxy sidecar
Now that we have verified that network connectivity is not possible, we can inject a proxy sidecar so that traffic will be permitted.
Inject the proxy in order to access Dgraph
A new container, linkerd-proxy, is added to the pod:
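A sketch of the re-deploy, assuming the client was originally deployed from the same pydgraph helmfile, whose rendered manifest is piped through linkerd inject:

helmfile --file examples/pydgraph/helmfile.yaml template \
  | linkerd inject - \
  | kubectl apply --namespace pydgraph-client --filename -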
HTTP check (proxy)
Log into the pydgraph-client pod and run this command:
curl ${DGRAPH_ALPHA_SERVER}:8080/health | jq
The expected results will look something similar to this:
gRPC check (proxy)
Log into the pydgraph-client pod and run this command:
grpcurl -plaintext -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion
The expected results will look something similar to this:
Test 3: Listening to traffic streams
For this step, we will monitor traffic as it goes through the proxy and then generate some traffic. For monitoring, we’ll use tap from the command line and the dashboard to listen to traffic streams.
Viz Tap from the CLI
In a separate terminal tab or window, run this command to monitor traffic:
linkerd viz tap namespace/pydgraph-client
Viz Tap from the dashboard
We can also do the same thing in the Linkerd Viz Dashboard under the Tap area:
- set the Namespace field to pydgraph-client
- set the Resource field to namespace/pydgraph-client
- click on the [START] button
Generate Traffic
With this monitoring in place, log into the pydgraph-client pod and run these commands:
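A sketch of traffic-generating commands to run inside the pod, assuming DGRAPH_ALPHA_SERVER is set in the pod’s environment as in the earlier checks:

# HTTP traffic
for i in $(seq 1 10); do
  curl --silent "${DGRAPH_ALPHA_SERVER}:8080/health" > /dev/null
done

# gRPC traffic
for i in $(seq 1 10); do
  grpcurl -plaintext -proto api.proto \
    "${DGRAPH_ALPHA_SERVER}:9080" api.Dgraph/CheckVersion > /dev/null
done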
Observe the resulting traffic
In the separate terminal tab or window, you should see output like this below:
NOTE: The colors were added manually to highlight traffic generated from the different commands.
In the Viz Dashboard, you should see something like this:
Cleanup
This will remove the AKS cluster as well as any resources provisioned through AKS, including external volumes created by the Dgraph deployment.
az aks delete \
--resource-group $AZ_RESOURCE_GROUP \
--name $AZ_CLUSTER_NAME
Resources
Here are some links to topics, articles, and tools used in this article:
Blog Source Code
This is the source code related to this blog.
- AKS with Linkerd Service Mesh: https://github.com/darkn3rd/blog_tutorials/tree/master/kubernetes/aks/series_2_network_mgmnt/part_3_linkerd
Example Applications
These are applications that can be used to walk through the features of a service mesh.
- https://github.com/spikecurtis/yaobank
- https://github.com/BuoyantIO/emojivoto
- https://github.com/BuoyantIO/booksapp
- https://github.com/istio/istio/tree/master/samples/bookinfo
- https://github.com/argoproj/argocd-example-apps/tree/master/helm-guestbook
- https://github.com/dockersamples/example-voting-app
General Service Mesh Articles
- The History of the Service Mesh by William Morgan, 13 Feb 2018.
- Which Service Mesh Should I Use? by George Miranda, 24 Apr 2018
- Service Meshes in the Cloud Native World by Pavan Belagatti, 5 Apr 2021
- What is a Service Mesh? Redhat, Accessed 1 Aug 2021.
- Service Mesh, Wikipedia, Accessed 1 Aug 2021.
gRPC Load Balancing
Topics on gRPC load balancing on Kubernetes.
- gRPC Load Balancing on Kubernetes without Tears: https://kubernetes.io/blog/2018/11/07/grpc-load-balancing-on-kubernetes-without-tears/
- Linkerd Load Balancing: https://linkerd.io/2.10/features/load-balancing/
Linkerd Documentation
- Generate Certificates: https://linkerd.io/2.10/tasks/generate-certificates/
- Setting Up Service Profiles: https://linkerd.io/2.10/tasks/setting-up-service-profiles/
About o11y (cloud native observability)
o11y (observability) is a newer term used to distinguish the observability patterns found in cloud native infrastructure.
- Learn about observability by Honeycomb: https://docs.honeycomb.io/getting-started/learning-about-observability/
- What is the difference between monitoring and observability? https://www.splunk.com/en_us/data-insider/what-is-observability.html
Service Mesh Traffic Access
Here are some topics around service mesh traffic access in the community, related to an upcoming feature in the stable-2.11 release.
- Issue 3342 Provide Service-to-Service Authorization: https://github.com/linkerd/linkerd2/issues/3342
- Issue 2746 Support Service Network Policies: https://github.com/linkerd/linkerd2/issues/2746
- Linkerd Policy Exploration Design: https://github.com/linkerd/polixy/blob/main/DESIGN.md
- OPA: https://www.openpolicyagent.org/
- SMI: https://smi-spec.io/
- SMI Spec on GitHub: https://github.com/servicemeshinterface/smi-spec
- Envoy External Auth w/ OPA: https://blog.openpolicyagent.org/envoy-external-authorization-with-opa-578213ed567c
Document Changes for Blog
- September 4, 2021: moved multiline code to gists, updated images
- August 6, 2021: updated Linkerd architecture image
Conclusion
Linkerd is a breeze to set up and get off the ground, despite the numerous components and processes working behind the scenes.
Load Balancing
Beyond attractive features like automation for o11y (cloud native observability) and encryption-in-transit with mutual TLS, one often overlooked feature is load balancing, not only for HTTP traffic, but for gRPC as well. Why this is important comes down to the following:
gRPC also breaks the standard connection-level load balancing, including what’s provided by Kubernetes. This is because gRPC is built on HTTP/2, and HTTP/2 is designed to have a single long-lived TCP connection, across which all requests are multiplexed — meaning multiple requests can be active on the same connection at any point in time. (ref)
Using the default Kubernetes service resource (kube-proxy), a backend is chosen per connection, effectively at random, i.e. there is no request-level load balancing, and for gRPC this means a single node in your highly available cluster will soak up all the traffic. Thus, the load balancing feature of Linkerd becomes, in my mind, one of its most essential features.
Restricting Traffic
For security beyond encryption-in-transit with mutual TLS, restricting access to pods is also important. This area is called defense-in-depth, a layered approach to restricting which services are allowed to connect to each other.
In this article, I touched on how to do a little of this with network policies using the Calico network plugin.
It would be really nice to have some policies that can be applied to mesh traffic as well. Well, this is happening in the upcoming stable-2.11 release: traffic access with two new CRDs, Server and ServerAuthorization.
Final Thoughts
Thank you for finishing this article; I hope that this helped you in your journey.