AKS with Linkerd Service Mesh
Securing internal traffic with Linkerd on AKS
Last Updated: September 4, 2021 (moved illustrations and code snippets to gists)
The two most often neglected domains of cloud operations are security and o11y (observability). This should come as no surprise, because adding security, such as encryption-in-transit with mutual TLS (where both client and server verify each other), and adding traffic monitoring and tracing on short-lived pods are by their very nature complex.
What if you could add automation for both security and o11y in less than 15 minutes of effort?
The solution to all of this complexity is to deploy a service mesh, and as unbelievable as it seems, the statement above can really come true with Linkerd.
This article covers using the service mesh Linkerd installed into AKS (Azure Kubernetes Service) with an example application Dgraph.
Architecture
A service mesh can be logically organized into two primary layers:
a control plane layer that’s responsible for configuration and management, and a data plane layer that provides network functions valuable to distributed applications. (ref)
What is a service mesh?
A service mesh consists of proxies and reverse proxies deployed alongside every service that is added to a network called a mesh. This allows you to secure and monitor traffic between all members of the mesh.
A proxy, for the uninitiated, redirects outbound web traffic to an intermediary web service that can apply security policies, such as blocking access to a malicious web site, before the traffic is sent on to its destination.
The reverse proxy is an intermediary web service that secures and routes inbound traffic based on a set of defined rules, such as rules based on an HTTP path, a destination hostname, or ports.
Combining a proxy and a reverse proxy on every member of the mesh affords a refined level of security where only designated services can access other designated services, which is particularly useful for isolating services in case one of them is compromised.
Articles in Series
This series shows how to both secure and load balance gRPC and HTTP traffic.
- AKS with Azure Container Registry
- AKS with Calico network policies
- AKS with Linkerd service mesh (this article)
- AKS with Istio service mesh
Previous Article
The previous article discussed Kubernetes network plugins and network policies with Azure CNI and Calico network plugins.
Requirements
To create the Azure cloud resources, you will need a subscription that allows you to create resources.
Required Tools
- Azure CLI tool (az): command line tool that interacts with the Azure API.
- Kubernetes client tool (kubectl): command line tool that interacts with the Kubernetes API.
- Helm (helm): command line tool for “templating and sharing Kubernetes manifests” (ref) that are bundled as Helm chart packages.
- helm-diff plugin: allows you to see the changes made with helm or helmfile before applying the changes.
- Helmfile (helmfile): command line tool that uses a “declarative specification for deploying Helm charts across many environments” (ref).
- Linkerd CLI (linkerd): command line tool that can configure, deploy, and verify the Linkerd environment and extensions.
Optional tools
Many of the tools, such as grpcurl, curl, and jq, will be accessible from the Docker container. For building images and running scripts, I highly recommend these tools:
- POSIX shell (sh) such as GNU Bash (bash) or Zsh (zsh): the scripts in this guide were tested using either of these shells on macOS and Ubuntu Linux.
- Docker (docker): command line tool to build, test, and push docker images.
- SmallStep CLI (step): a zero trust swiss army knife for working with certificates.
Project file structure
The following structure will be used:
~/azure_linkerd
├── certs
│ ├── ca.crt
│ ├── ca.key
│ ├── issuer.crt
│ └── issuer.key
├── env.sh
└── examples
├── dgraph
│ ├── helmfile.yaml
│ └── network_policy.yaml
└── pydgraph
├── Dockerfile
├── Makefile
├── helmfile.yaml
├── load_data.py
├── requirements.txt
├── sw.nquads.rdf
└── sw.schema
With either Bash or Zsh, you can create the file structure with the following commands:
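For example, here is a minimal sketch of such commands (the article’s original snippet lives in the linked gist; the files under certs/ are generated in a later step, so they are not created here):

mkdir -p ~/azure_linkerd/certs \
  ~/azure_linkerd/examples/dgraph \
  ~/azure_linkerd/examples/pydgraph
cd ~/azure_linkerd
touch env.sh \
  examples/dgraph/helmfile.yaml \
  examples/dgraph/network_policy.yaml \
  examples/pydgraph/Dockerfile \
  examples/pydgraph/Makefile \
  examples/pydgraph/helmfile.yaml \
  examples/pydgraph/load_data.py \
  examples/pydgraph/requirements.txt \
  examples/pydgraph/sw.nquads.rdf \
  examples/pydgraph/sw.schema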
Project Environment Variables
Setup these environment variables below to keep a consistent environment amongst different tools used in this article. If you are using a POSIX shell, you can save these into a script and source that script whenever needed.
Copy this source script and save it as env.sh:
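As a minimal sketch of what env.sh might contain: AZ_RESOURCE_GROUP and AZ_CLUSTER_NAME are referenced later in this article, while the remaining variable names and all of the values here are placeholders you should adjust to your own environment:

# env.sh — project environment variables (values are examples only)
export AZ_RESOURCE_GROUP="azure-linkerd"         # assumed name
export AZ_LOCATION="westus2"                     # assumed region
export AZ_CLUSTER_NAME="linkerd-demo"            # assumed name
export AZ_ACR_NAME="linkerddemoacr"              # assumed name; must be globally unique
export KUBECONFIG="$HOME/.kube/${AZ_CLUSTER_NAME}.yaml"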
NOTE: The default container registry for Linkerd uses GHCR (GitHub Container Registry). In my experiments, this has been a source of problems with AKS, so as an alternative, I recommend republishing the container images to another registry. Look for optional instructions below if you are interested in doing this as well.
Provision Azure resources
The AKS cluster (with Azure CNI and Calico network policies) and the ACR cloud resources can be provisioned with the steps outlined in the script below.
NOTE: Though these instructions are oriented toward AKS and ACR, you can use any Kubernetes cluster with the Calico network plugin installed for network policies, and you can use any container registry, as long as it is accessible from the Kubernetes cluster.
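Here is a sketch of those provisioning steps, assuming the variable names from the env.sh example above (the full script is in the linked gist):

source env.sh

# resource group and container registry
az group create --name "${AZ_RESOURCE_GROUP}" --location "${AZ_LOCATION}"
az acr create --resource-group "${AZ_RESOURCE_GROUP}" --name "${AZ_ACR_NAME}" --sku Basic

# AKS cluster with the Azure CNI network plugin and Calico network policies
az aks create \
  --resource-group "${AZ_RESOURCE_GROUP}" \
  --name "${AZ_CLUSTER_NAME}" \
  --network-plugin azure \
  --network-policy calico \
  --node-count 3 \
  --attach-acr "${AZ_ACR_NAME}"

# fetch credentials for kubectl into $KUBECONFIG
az aks get-credentials \
  --resource-group "${AZ_RESOURCE_GROUP}" \
  --name "${AZ_CLUSTER_NAME}" \
  --file "${KUBECONFIG}"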
Verify AKS and KUBECONFIG
Verify that the AKS cluster was created and that you have a KUBECONFIG that is authorized to access the cluster by running the following:
source env.sh
kubectl get all --all-namespaces
The final results should look something like this:
The Linkerd service mesh
Linkerd can be installed using either the linkerd CLI or the linkerd2 Helm chart. For this article, the linkerd command will be used to generate the Kubernetes manifests, which are then applied with the kubectl command.
Generate Certificates
Linkerd requires a trust anchor certificate and an issuer certificate with its corresponding key to support mutual TLS connections between meshed pods. All certificates must use the ECDSA P-256 algorithm, which is the default for the step command. Alternatively, you can use the openssl ecparam -name prime256v1 command.
To generate the certificates with the step command, run the following commands.
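A sketch following the Linkerd “Generate Certificates” documentation (linked in the Resources section), writing the files into the certs/ directory from the project layout above:

# trust anchor (root CA)
step certificate create root.linkerd.cluster.local \
  certs/ca.crt certs/ca.key \
  --profile root-ca --no-password --insecure

# issuer certificate and key, signed by the trust anchor
step certificate create identity.linkerd.cluster.local \
  certs/issuer.crt certs/issuer.key \
  --profile intermediate-ca --not-after 8760h --no-password --insecure \
  --ca certs/ca.crt --ca-key certs/ca.key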
Republish Linkerd Images (optional)
Linkerd uses GitHub Container Registry (GHCR), which has been consistently unreliable when fetching images from AKS (v1.19.11). These errors caused deploys to take around 20 to 30 minutes.
As an optional step, the container images can be republished to ACR, which can reduce deploy time significantly, to about 3 minutes. Follow the steps in the script below to republish the images.
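A sketch of one way to do this with az acr import; the image list is derived from the manifests the linkerd CLI generates, and the registry hosts in the grep pattern are an assumption to verify against your Linkerd version:

source env.sh

# list the upstream images referenced in the generated manifests
IMAGES=$(linkerd install \
  --identity-trust-anchors-file certs/ca.crt \
  --identity-issuer-certificate-file certs/issuer.crt \
  --identity-issuer-key-file certs/issuer.key \
  | grep -oE '(ghcr\.io|cr\.l5d\.io)/linkerd/[a-zA-Z0-9._-]+:[a-zA-Z0-9._-]+' \
  | sort -u)

# copy each image into ACR, keeping the same repository name and tag
for SOURCE in ${IMAGES}; do
  az acr import \
    --name "${AZ_ACR_NAME}" \
    --source "${SOURCE}" \
    --image "linkerd/${SOURCE##*/}"
done

If you go this route, the install step also needs to reference the ACR images instead of the defaults; consult the Linkerd CLI or chart options for overriding the image registry, which this sketch does not cover.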
Install Linkerd
You can install Linkerd with the generated certificates using the linkerd command line tool:
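A minimal sketch using the default registry, following the Linkerd documentation (if you republished images to ACR in the previous section, adjust the image settings accordingly):

linkerd install \
  --identity-trust-anchors-file certs/ca.crt \
  --identity-issuer-certificate-file certs/issuer.crt \
  --identity-issuer-key-file certs/issuer.key \
  | kubectl apply -f -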
The linkerd command generates Kubernetes manifests that are then piped to the kubectl command.
When completed, run this command to verify the deployed Linkerd infrastructure:
kubectl get all --namespace linkerd
This will show something like the following below:
You can also run linkerd check to verify the health of Linkerd:
Install the Viz extension
The Viz extension adds graphical web dashboards, the Prometheus metrics system, and Grafana dashboards:
linkerd viz install | kubectl apply -f -
You can check the infrastructure with the following command:
kubectl get all --namespace linkerd-viz
This should show something like the following:
Additionally, you can run linkerd viz check:
Install the Jaeger extension
The Jaeger extension installs Jaeger, a distributed tracing solution.
linkerd jaeger install | kubectl apply -f -
You can check up on the success of the deployment with the following command:
kubectl get all --namespace linkerd-jaeger
This should show something like the following:
Additionally, you can run linkerd jaeger check:
Access Viz Dashboard
You can port-forward to localhost with this command:
linkerd viz dashboard &
This should show something like the following:
The Dgraph service
Dgraph is a distributed graph database consisting of three Dgraph Alpha member nodes, which host the graph data, and three Dgraph Zero nodes, which manage the state of the Dgraph cluster, including timestamps. The Dgraph Alpha service supports both an HTTP (GraphQL) interface on port 8080 and a gRPC interface on port 9080.
The Linkerd magic takes place by injecting a proxy sidecar container into each pod that will be a member of the service mesh. This is configured by adding template annotations to a deployment or statefulset controller, as sketched below.
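As a hypothetical illustration (not the method used later in this article, which relies on linkerd inject), the pod template annotation could be added to an existing deployment like this; my-app and my-namespace are placeholders:

# add the linkerd.io/inject annotation to a (hypothetical) deployment's pod template
kubectl patch deployment my-app --namespace my-namespace --type merge --patch \
  '{"spec":{"template":{"metadata":{"annotations":{"linkerd.io/inject":"enabled"}}}}}'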
Deploy Dgraph with Linkerd
Dgraph will be deployed using the Dgraph chart, but instead of the normal route of installing Dgraph with helmfile apply, a Kubernetes manifest will be generated with helmfile template so that the linkerd inject command can be used.
NOTE: Currently the Dgraph chart does not yet have direct support for modifying the template annotations in the statefulset. Recently, I added a pull request for this feature, and hopefully a new chart version will be published. In the meantime, the helmfile template command will work.
Run these commands to deploy Dgraph with the injected Linkerd proxy sidecars:
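A sketch of those commands, assuming the helmfile.yaml fetched into examples/dgraph (see the gist) and a namespace named dgraph:

kubectl create namespace dgraph

helmfile --file examples/dgraph/helmfile.yaml template \
  | linkerd inject - \
  | kubectl apply --namespace dgraph --filename -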
After about a minute, the Dgraph cluster will come up; you can verify this with the following command:
kubectl get all --namespace "dgraph"
We should see something like this:
Service Profile
Should you want to run a gRPC client that connects to Dgraph from the same Kubernetes cluster, you will want to generate a service profile. This allows traffic to be distributed more evenly across the Dgraph member nodes.
Generate and deploy a service profile:
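A sketch, assuming the Dgraph api.proto file is available locally and that the gRPC service name is demo-dgraph-alpha-grpc (a placeholder; adjust it to the service name created by your Dgraph release):

linkerd profile \
  --namespace dgraph \
  --proto api.proto \
  demo-dgraph-alpha-grpc \
  | kubectl apply --namespace dgraph --filename -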
The pydgraph client
In an earlier blog, I documented the steps to build and release a pydgraph-client image, and then deploy a pod that uses this image.
The pydgraph-client pod will have all the tools needed to test both HTTP and gRPC. We’ll use this client to run through the following tests:
- Establish basic connectivity works (baseline)
- Apply a network policy to block all non-proxy traffic with Calico and verify connectivity no longer works.
- Inject a proxy into the pydgraph and verify connectivity through proxy works
Fetch build and deploy scripts
Below is a script you can use to download the gists and populate the files needed to run through these steps.
NOTE: These scripts and further details are covered in an earlier article (see AKS with Azure Container Registry).
Build, push, and deploy the pydgraph client
Now that all the required source files are available, build and push the image, then deploy the client:
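A sketch of those steps, assuming the AZ_ACR_NAME variable from the env.sh example and the files fetched in the previous step (the earlier article’s Makefile wraps similar commands):

source env.sh
az acr login --name "${AZ_ACR_NAME}"

# build and push the pydgraph-client image
docker build \
  --tag "${AZ_ACR_NAME}.azurecr.io/pydgraph-client:latest" \
  examples/pydgraph
docker push "${AZ_ACR_NAME}.azurecr.io/pydgraph-client:latest"

# deploy the client pod with the fetched helmfile
helmfile --file examples/pydgraph/helmfile.yaml apply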
After running kubectl get all --namespace pydgraph-client, this should result in something like the following:
Log into the pydgraph-client container
For the next set of tests, you will need to log into the container. This can be done with the following commands:
PYDGRAPH_POD=$(kubectl get pods \
  --namespace pydgraph-client \
  --output name
)

kubectl exec -ti \
  --namespace pydgraph-client ${PYDGRAPH_POD} \
  --container pydgraph-client -- bash
Test 0 (Baseline): No Proxy
Verify that things are working without a proxy or network policies.
In this sanity check and the subsequent tests, both HTTP (port 8080) and gRPC (port 9080) will be tested.
HTTP check (no proxy)
Log into the pydgraph-client pod and run this command:
curl ${DGRAPH_ALPHA_SERVER}:8080/health | jq
The expected results will be something similar to this:
gRPC check (no proxy)
Log into the pydgraph-client pod and run this command:
grpcurl -plaintext -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion
The expected results will be something similar to this:
Test 1: Add a network policy
The goal of this next test is to deny all traffic that comes from outside the service mesh. This can be done by using network policies where only traffic from the service mesh is permitted.
After adding the policy, the expected results will be timeouts, as communication from the pydgraph-client pod will be blocked.
Adding a network policy
This policy will deny all traffic to the Dgraph Alpha pods, except for traffic from the service mesh, or more explicitly, from any pod with the label linkerd.io/control-plane-ns=linkerd.
Copy the following and save it as examples/dgraph/network_policy.yaml:
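A sketch of such a policy, written directly to the file with a heredoc; it assumes the Dgraph chart’s default pod labels (app: dgraph, component: alpha) and the dgraph namespace, so adjust the selectors to match your release (the gist contains the exact manifest used in this article):

cat > examples/dgraph/network_policy.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dgraph-allow-linkerd-mesh-only
  namespace: dgraph
spec:
  podSelector:
    matchLabels:
      app: dgraph
      component: alpha
  policyTypes:
    - Ingress
  ingress:
    - from:
        # allow only pods labeled by the Linkerd proxy injector, in any namespace
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              linkerd.io/control-plane-ns: linkerd
EOF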
When ready, apply this with the following command:
kubectl --filename ./examples/dgraph/network_policy.yaml apply
HTTP check (network policy applied)
Log into the pydgraph-client pod and run this command:
curl ${DGRAPH_ALPHA_SERVER}:8080/health
The expected results in this case, after a very long wait (about 5 minutes), will be something similar to this:
gRPC check (network policy applied)
Log into the pydgraph-client pod and run this command:
grpcurl -plaintext -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion
The expected results for gRPC in about 10 seconds will be:
Test 2: Inject the Linkerd proxy sidecar
Now that we have verified that network connectivity is not possible, we can inject a proxy sidecar so that traffic will be permitted.
Inject the proxy in order to access Dgraph
A new container, linkerd-proxy, is added to the pod:
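A sketch of the re-deploy, assuming the client was originally deployed from the same pydgraph helmfile, whose rendered manifest is piped through linkerd inject:

helmfile --file examples/pydgraph/helmfile.yaml template \
  | linkerd inject - \
  | kubectl apply --namespace pydgraph-client --filename -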
HTTP check (proxy)
Log into the pydgraph-client pod and run this command:
curl ${DGRAPH_ALPHA_SERVER}:8080/health | jq
The expected results will look something similar to this:
gRPC check (proxy)
Log into the pydgraph-client pod and run this command:
grpcurl -plaintext -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion
The expected results will look something similar to this:
Test 3: Listening to traffic streams
For this step, we will monitor traffic as it goes through the proxy and then generate some traffic. For monitoring, we’ll use tap from the command line and the dashboard to listen to traffic streams.
Viz Tap from the CLI
In a separate terminal tab or window, run this command to monitor traffic:
linkerd viz tap namespace/pydgraph-client
Viz Tap from the dashboard
We can also do the same thing in the Linkerd Viz Dashboard under the Tap area:
- set the Namespace field to pydgraph-client
- set the Resource field to namespace/pydgraph-client
- click on the [START] button
Generate Traffic
With this monitoring in place, log into the pydgraph-client pod and run these commands:
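A sketch of traffic-generating commands to run inside the pod, assuming DGRAPH_ALPHA_SERVER is set in the pod’s environment as in the earlier checks:

# HTTP traffic
for i in $(seq 1 10); do
  curl --silent "${DGRAPH_ALPHA_SERVER}:8080/health" > /dev/null
done

# gRPC traffic
for i in $(seq 1 10); do
  grpcurl -plaintext -proto api.proto \
    "${DGRAPH_ALPHA_SERVER}:9080" api.Dgraph/CheckVersion > /dev/null
done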
Observe the resulting traffic
In the separate terminal tab or window, you should see output like this below:
NOTE: The colors were added manually to highlight traffic generated from the different commands.
In the Viz Dashboard, you should see something like this:
Cleanup
This will remove the AKS cluster as well as any resources provisioned through AKS, including external volumes created by the Dgraph deployment.
az aks delete \
--resource-group $AZ_RESOURCE_GROUP \
--name $AZ_CLUSTER_NAME
Resources
Here are some links to topics, articles, and tools used in this article:
Blog Source Code
This is the source code related to this blog.
- AKS with Linkerd Service Mesh: https://github.com/darkn3rd/blog_tutorials/tree/master/kubernetes/aks/series_2_network_mgmnt/part_3_linkerd
Example Applications
These are applications that can be used to walk through the features of a service mesh.
- https://github.com/spikecurtis/yaobank
- https://github.com/BuoyantIO/emojivoto
- https://github.com/BuoyantIO/booksapp
- https://github.com/istio/istio/tree/master/samples/bookinfo
- https://github.com/argoproj/argocd-example-apps/tree/master/helm-guestbook
- https://github.com/dockersamples/example-voting-app
General Service Mesh Articles
- The History of the Service Mesh by William Morgan, 13 Feb 2018.
- Which Service Mesh Should I Use? by George Miranda, 24 Apr 2018
- Service Meshes in the Cloud Native World by Pavan Belagatti, 5 Apr 2021
- What is a Service Mesh? Redhat, Accessed 1 Aug 2021.
- Service Mesh, Wikipedia, Accessed 1 Aug 2021.
gRPC Load Balancing
Topics on gRPC load balancing on Kubernetes.
- gRPC Load Balancing on Kubernetes without Tears: https://kubernetes.io/blog/2018/11/07/grpc-load-balancing-on-kubernetes-without-tears/
- Linkerd Load Balancing: https://linkerd.io/2.10/features/load-balancing/
Linkerd Documentation
- Generate Certificates: https://linkerd.io/2.10/tasks/generate-certificates/
- Setting Up Service Profiles: https://linkerd.io/2.10/tasks/setting-up-service-profiles/
About o11y (cloud native observability)
o11y (observability) is a newer term used to distinguish the observability patterns found in cloud native infrastructure.
- Learn about observability by Honeycomb: https://docs.honeycomb.io/getting-started/learning-about-observability/
- What is the difference between monitoring and observability? https://www.splunk.com/en_us/data-insider/what-is-observability.html
Service Mesh Traffic Access
Here are some topics around service mesh traffic access in the community, related to an upcoming feature in the stable-2.11 release.
- Issue 3342 Provide Service-to-Service Authorization: https://github.com/linkerd/linkerd2/issues/3342
- Issue 2746 Support Service Network Policies: https://github.com/linkerd/linkerd2/issues/2746
- Linkerd Policy Exploration Design: https://github.com/linkerd/polixy/blob/main/DESIGN.md
- OPA: https://www.openpolicyagent.org/
- SMI: https://smi-spec.io/
- SMI Spec on GitHub: https://github.com/servicemeshinterface/smi-spec
- Envoy External Auth w/ OPA: https://blog.openpolicyagent.org/envoy-external-authorization-with-opa-578213ed567c
Document Changes for Blog
- September 4, 2021: moved multiline code to gists, updated images
- August 6, 2021: updated Linkerd architecture image
Conclusion
Linkerd is a breeze to set up and get off the ground, despite the numerous components and processes working behind the scenes.
Load Balancing
Beyond attractive features like automation for o11y (cloud native observability) and encryption-in-transit with mutual TLS, one often overlooked feature is load balancing, not only for HTTP traffic, but for gRPC as well. Why this is important comes down to the following:
gRPC also breaks the standard connection-level load balancing, including what’s provided by Kubernetes. This is because gRPC is built on HTTP/2, and HTTP/2 is designed to have a single long-lived TCP connection, across which all requests are multiplexed — meaning multiple requests can be active on the same connection at any point in time. (ref)
Using the default Kubernetes service resource (kube-proxy), a backend is chosen per connection, effectively at random, i.e. there is no request-level load balancing, and for gRPC this means a single node in your highly available cluster will soak up all the traffic. Thus, the load balancing feature of Linkerd becomes, in my mind, one of its most essential features.
Restricting Traffic
For security beyond encryption-in-transit with mutual TLS, restricting access to pods is also important. This area is called defense-in-depth, a layered approach to restricting which services are allowed to connect to each other.
In this article, I touched on how to do a little of this with network policies using the Calico network plugin.
It would be really nice to have some policies that can be applied to mesh traffic as well. Well, this is happening in the upcoming stable-2.11 release: traffic access with two new CRDs, Server and ServerAuthorization.
Final Thoughts
Thank you for finishing this article; I hope that this helped you in your journey.