GKE with Consul Service Mesh

Using Consul Connect and Envoy to build a service mesh

Joaquín Menchaca (智裕)
16 min read · Oct 12, 2022


This article shows how to set up and get started with Consul Connect service mesh (CCSM), more recently called simply Consul Service Mesh.

It covers how to install Consul Service Mesh and configure services to use it. Dgraph, a distributed graph database, serves as the example because it represents a real-world application.

📔 NOTE: This was tested with the versions below and may not work if your versions differ significantly.

* Kubernetes API v1.22
* gcloud 402.0.0
* gsutil 5.13
* kubectl v1.22
* kustomize v4.5.4
* helm v3.8.2
* helmfile v0.144.0
* Docker 20.10.17
* Dgraph v21.03.2
* Consul 1.13.2

About Consul

Consul is a popular tool for service discovery and a key-value store that was released in April-2014.

Service discovery is important for cluster members or microservices, as it provides “automatic detection of devices and services offered by these devices on a computer network (ref)”. This allows “applications and microservices locate different components on a network (ref)”.

The key-value store is a network database that stores hash maps (also called associative arrays or dictionaries). This allows services to create and retrieve configuration.

Building upon this, HashiCorp developed consul-template, which re-renders configuration files when the underlying Consul data changes, and Consul Connect, now called Consul Service Mesh, which can automatically inject sidecar proxy containers into your pods so that services can communicate securely.

About Service Mesh

A service mesh uses automation to secure internal network traffic between member nodes. It does this by inserting reverse-proxy sidecar containers into every pod that is a part of the service mesh.

With a network of side-car proxies installed, the traffic can be further secured using strict mTLS, where not only the client must authenticate the validity of the server, but the server must authenticate the validity of the client. This is on top of the encryption of traffic between all the members of the mesh.

A service mesh is divided into three planes (illustration below): a control plane that manages the overall service mesh, a data plane that consists of the members within the mesh that are secured with a proxy, and an observability plane that monitors traffic within the mesh.

📔 NOTE: Observability is not supported for Consul Service Mesh with services that use multiple ports.

Consul Connect builds on Consul to manage connectivity through service discovery, health checks, and a service catalog. Envoy is the default proxy that is injected into each of the pods to create the service mesh. This proxy can be swapped for another proxy component, such as HAProxy or NGINX.

Requirements

These are the requirements to use this solution.

Accounts

No commercial licenses are needed for either Consul or Dgraph. All of the tools are accessible from the public Internet. For creating resources on Google Cloud, you will need to create an account.

  • Google Cloud account with ownership of a project where you can deploy resources (where billing account was linked to the project)

Knowledge

You should be familiar with, or have had exposure to, the following concepts to get the most out of this tutorial:

For Kubernetes, experience deploying applications with service resources is useful, but even if you don’t have this, the guide will walk you through it. You will configure KUBECONFIG to access the Kubernetes cluster with the Kubernetes client (kubectl) and use Helm (helm), so familiarity with these is useful.

For Google Cloud, you should be familiar with using the Google Cloud SDK (gcloud tool) to set up an account and project and to provision resources. This is important because these resources incur costs.

Tools (Required)

  • Google Cloud SDK (gcloud command) to interact with Google Cloud
  • Kubernetes client (kubectl command) to interact with Kubernetes
  • Helm (helm command) to install Kubernetes packages
  • helm-diff plugin to see differences about what will be deployed.
  • helmfile (helmfile command) to automate installing many helm charts
  • Kustomize (kustomize command) to apply patches to existing Helm charts

Tools (Recommended)

These tools are useful for the automation used within this article.

  • POSIX shell (sh) such as GNU Bash (bash) or Zsh (zsh): the scripts in this guide were tested with both of these shells on macOS and Ubuntu Linux.
  • GNU stream editor (sed) and GNU grep (grep): the scripts were tested with these tools. Note that the BSD versions, such as those bundled with macOS or BSD, may NOT WORK.
  • Docker Engine (docker command) to automate building and pushing the pydgraph client image to Google Container Registry.
  • git (git command) to download source code from git code repositories.

Project Setup

This section sets up all the content for this tutorial.

Directory Structure

The directory structure should look like this:

~/projects/consul_connect
├── consul
│   └── helmfile.yaml
└── examples
    └── dgraph
        ├── helmfile.yaml
        └── pydgraph_client.yaml

In GNU Bash, you can create the above structure like this:

export PROJECT_DIR=~/projects/consul_connect
mkdir -p $PROJECT_DIR/{examples/dgraph,consul}
cd $PROJECT_DIR
touch {consul,examples/dgraph}/helmfile.yaml \
  examples/dgraph/pydgraph_client.yaml

Environment Variables

These environment variables will be used in this project. Create a file called env.sh with the contents below, changing values as appropriate, and then run source env.sh.
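The original env.sh is not reproduced in this copy of the article. As a minimal sketch, assuming hypothetical variable names such as GKE_PROJECT_ID, GKE_CLUSTER_NAME, GKE_REGION, and GKE_SA_NAME (none of these come from the article; use whatever names your own helmfiles and scripts read), the file might look like this:

```shell
# env.sh: hypothetical example values, adjust for your own project.
# None of these variable names are taken from the original article.

# Google Cloud project and cluster settings (assumed names)
export GKE_PROJECT_ID="my-consul-project"
export GKE_CLUSTER_NAME="consul-demo"
export GKE_REGION="us-central1"

# Service account used by the worker nodes (assumed name)
export GKE_SA_NAME="gke-worker-nodes-sa"
export GKE_SA_EMAIL="${GKE_SA_NAME}@${GKE_PROJECT_ID}.iam.gserviceaccount.com"

# Keep this cluster's credentials separate from ~/.kube/config
export KUBECONFIG="${HOME}/.kube/${GKE_REGION}-${GKE_CLUSTER_NAME}.yaml"
```

Pointing KUBECONFIG at a dedicated file is a design choice rather than a requirement; it keeps credentials for this experiment away from any other clusters you manage.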

Google Project Setup

For this tutorial, we’ll need to set up a Google Cloud project and grant access that allows you to create the necessary cloud resources. Here is an example of how you can set this up with gcloud:
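The original setup script is not included in this copy. One way this could look is sketched below, assuming a hypothetical helper named setup_gcp_project that takes a project ID and a billing account ID as arguments; both values are placeholders that you supply.

```shell
# Sketch: create a project, link billing, and enable the GKE API.
# The function name and its arguments are illustrative assumptions,
# not values from the article.
setup_gcp_project() {
  local project_id="$1" billing_account="$2"

  gcloud projects create "$project_id"
  gcloud config set project "$project_id"

  # A linked billing account is required before GKE can be enabled.
  gcloud beta billing projects link "$project_id" \
    --billing-account "$billing_account"

  # Enable the Kubernetes Engine API for this project.
  gcloud services enable container.googleapis.com --project "$project_id"
}
```

It would be called as, for example, setup_gcp_project "my-consul-project" "000000-AAAAAA-BBBBBB", where the billing account ID can be found with gcloud beta billing accounts list.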

Provision Cloud Resources

These instructions will create the necessary cloud resources for this project.

Provision Google Kubernetes Engine cluster

The steps below will allow you to bring up a Kubernetes cluster with 3 worker nodes.

📔 NOTE: This will deploy a robust Kubernetes cluster with 3 worker nodes that is suitable for Consul. It will also create a principal identity (Google Service Account) with the minimal privileges required to manage the Kubernetes nodes (GCE).

📔 NOTE: For production environments, you will want to explore further security measures, such as a private cluster, to block access from the public Internet.
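The provisioning script itself is missing from this copy. The sketch below shows roughly what such a step involves, under stated assumptions: the helper name create_gke_cluster, its arguments, and the e2-standard-2 machine type are all illustrative choices, and the four roles are the ones Google documents as the minimum for GKE node service accounts.

```shell
# Sketch: least-privilege node service account plus a 3-node regional
# cluster. Function name, arguments, and machine type are assumptions.
create_gke_cluster() {
  local project_id="$1" cluster_name="$2" region="$3" sa_name="$4"
  local sa_email="${sa_name}@${project_id}.iam.gserviceaccount.com"
  local role

  gcloud iam service-accounts create "$sa_name" --project "$project_id"

  # Minimal roles for logging and monitoring from the worker nodes.
  for role in roles/logging.logWriter roles/monitoring.metricWriter \
              roles/monitoring.viewer \
              roles/stackdriver.resourceMetadata.writer; do
    gcloud projects add-iam-policy-binding "$project_id" \
      --member "serviceAccount:${sa_email}" --role "$role"
  done

  # In a regional cluster --num-nodes is per zone: 1 node x 3 zones.
  gcloud container clusters create "$cluster_name" \
    --project "$project_id" --region "$region" --num-nodes 1 \
    --machine-type e2-standard-2 \
    --service-account "$sa_email"

  # Store credentials in the file named by KUBECONFIG, if set.
  gcloud container clusters get-credentials "$cluster_name" \
    --project "$project_id" --region "$region"
}
```

A call such as create_gke_cluster "my-consul-project" "consul-demo" "us-central1" "gke-worker-nodes-sa" would then bring up the cluster and configure kubectl access.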

You can test access to the cluster as well as the components installed with the following commands:

kubectl get nodes
kubectl get all --all-namespaces

Another useful command to test a new cluster is to see how many resources are available and what is consumed in the new cluster:

kubectl top nodes
kubectl top pods --all-namespaces

Deploy Kubernetes Resources

This section covers deploying Kubernetes resources such as Deployment, StatefulSet, ServiceAccount, Service, and so on. This will cover installing the Consul Connect service mesh, Dgraph, and pydgraph-client to access Dgraph through the service mesh.

Deploy Consul Connect service mesh

This will deploy the Consul Connect service mesh. Save the following as consul/helmfile.yaml:

These Helm chart configuration values will install the Consul Connect service mesh with automatic injection enabled. When you deploy a pod with the annotation consul.hashicorp.com/connect-inject: "true", containers will be injected to copy the consul binary into the pod and to set up and configure the Envoy proxy. The Kubernetes service resources will be used as a blueprint to register the service with Consul’s service catalog and configure the Envoy proxy.
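The author's helmfile is not reproduced in this copy. As a rough stand-in, the snippet below writes a minimal consul/helmfile.yaml that enables the connect injector. It is a sketch under assumptions: the values shown (global.name, connectInject.enabled) are real keys of the HashiCorp Consul chart, but the original configuration likely set more than this, and no chart version is pinned here.

```shell
# Sketch only: a minimal consul/helmfile.yaml, NOT the article's original.
# Pin a chart version that matches your Consul release before real use.
mkdir -p consul
cat > consul/helmfile.yaml <<'EOF'
repositories:
  - name: hashicorp
    url: https://helm.releases.hashicorp.com

releases:
  - name: consul
    namespace: consul
    chart: hashicorp/consul
    values:
      - global:
          # Resources are named consul-* (for example, service/consul-ui)
          name: consul
        connectInject:
          # Installs the webhook that injects the sidecar containers
          enabled: true
EOF
```

With injection enabled but not defaulted on, only pods that carry the consul.hashicorp.com/connect-inject: "true" annotation join the mesh.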

Run the following to deploy the service mesh:

source env.sh
helmfile --file ./consul/helmfile.yaml apply

You can check that everything is deployed with:

kubectl get all --namespace consul

This should show something like this:

Consul deployed Kubernetes resources

Deploy Observability

Currently observability is not supported with multi-port services like Dgraph. Hopefully this will get fixed in the future.

For further information, see:

Deploy Dgraph

Dgraph is a distributed graph database that communicates through both HTTP on port 8080 and gRPC on port 9080. Dgraph uses DQL (Dgraph Query Language) through either gRPC or HTTP, and can also use GraphQL with HTTP. Dgraph supports administrative operations using GraphQL or REST.

For this reason, to fully use Dgraph on a service mesh, you have to use the recently added multi-port configuration with Consul Connect. This requires separating the single multi-port service into two separate services: one for gRPC (9080) and one for HTTP (8080).

Save the following helmfile config below as examples/dgraph/helmfile.yaml:

This helmfile config uses some advanced features to make changes required by Consul Connect:

  • pre-install service accounts and a new gRPC service, all packaged as the dgraph-extras chart
  • render Dgraph resources with the annotations required by Consul
  • apply patches that add labels to the Dgraph headless services, instructing Consul to ignore these services when it configures the proxies
  • remove the gRPC port (9080) from the Dgraph Alpha service, as this was defined earlier as a separate gRPC service in the dgraph-extras chart

Consul Connect will inject Envoy sidecar proxy containers. Dgraph Zero will get a sidecar for port 6080, while Dgraph Alpha will have two sidecar proxy containers per pod: one for gRPC at port 9080 and another one for HTTP at port 8080.

When ready to deploy all of this, run the following command:

source env.sh
helmfile --file ./examples/dgraph/helmfile.yaml apply

You can check on the status using:

kubectl get all --namespace dgraph

This should show something like:

Dgraph deployed Kubernetes resources

You will notice the extra containers per pod in the ready state, which are the Envoy proxy sidecar containers.

Deploy Pydgraph client

The client is a small Python script that can load data into Dgraph using gRPC, and the container also has some useful tools like curl, grpcurl, and jq.

Save the following below as examples/dgraph/pydgraph_client.yaml:

When ready to deploy this, you can run the following:

source env.sh
# https://hub.docker.com/r/darknerd/pydgraph-client
export DOCKER_REGISTRY=darknerd
export CCSM_ENABLED=true
helmfile --file ./examples/dgraph/pydgraph_client.yaml apply

You can check the deployment with the following:

kubectl get all --namespace pydgraph-client

This should result in something similar to the following:

Pydgraph-client deployed Kubernetes resources

Testing Upstream Traffic

Consul Connect will set up a tunnel between the upstream ports specified in the annotation and the ports that are serviced by Dgraph.

First remote into the client container:

CLIENT_NS="pydgraph-client"
PYDGRAPH_POD=$(kubectl get pods -n $CLIENT_NS --output name)
kubectl exec -ti -c "pydgraph-client" -n $CLIENT_NS \
${PYDGRAPH_POD} -- bash

Once in the container, test that HTTP traffic is working:

curl --silent localhost:8080/health

For gRPC traffic, you can run the following:

grpcurl -plaintext -proto api.proto \
localhost:9080 api.Dgraph/CheckVersion

Also, you can try loading data:

python3 load_data.py \
  --plaintext \
  --alpha localhost:9080 \
  --files ./sw.nquads.rdf \
  --schema ./sw.schema

These should work through the tunnel that is configured by Consul Connect using the Envoy proxy side-cars.

Dgraph Graphical Viewer: Ratel

Dgraph hosts an online graphical viewer at https://play.dgraph.io/. If you would like to access the data we deployed with load_data.py, you can run this in a new terminal tab:

kubectl port-forward svc/dgraph-dgraph-alpha -n dgraph 8080:8080

Now you can point the connection configuration in Ratel to http://localhost:8080:

Click on the Console and select Query and enter the following DQL:

Click Run to see the results of the query:
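The query used in the screenshots is not preserved in this copy. As an alternative to Ratel, a DQL query can also be sent straight to the forwarded HTTP port with curl. The sketch below assumes a hypothetical helper named run_dql; the example query in the usage note is likewise an assumption, since it presumes the loaded schema has a name predicate.

```shell
# Sketch: POST a DQL query to a Dgraph Alpha HTTP endpoint.
# run_dql is an illustrative helper, not part of the article's tooling.
run_dql() {
  local endpoint="$1" query="$2"

  # Dgraph's /query endpoint accepts DQL with this content type.
  curl --silent --request POST \
    --header "Content-Type: application/dql" \
    --data "$query" \
    "${endpoint}/query"
}
```

With the port-forward from above still running, something like run_dql http://localhost:8080 '{ q(func: has(name), first: 5) { uid name } }' should return a JSON document, which can be piped through jq for readability.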

Consul User Interface

The Consul UI can be accessed by running this command in a new terminal tab:

source env.sh
kubectl port-forward service/consul-ui --namespace consul 8500:80

You can access the Consul UI through http://localhost:8500. The Consul UI should look like this below with other services appearing after Dgraph and pydgraph-client were deployed.

Consul UI

If you click on pydgraph-client, you can see the connections:

Consul UI: pydgraph-client

Cleanup

Kubernetes Resources

You can cleanup Kubernetes resources with the following:

It is important to delete the consul namespace if you intend to deploy a new version of the Consul Connect service mesh in the future. Secrets left behind in the namespace will break future installations, so deleting the namespace avoids this scenario.
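The cleanup commands are not preserved in this copy. A minimal sketch, assuming the helmfile paths used earlier in this article and a hypothetical wrapper named cleanup_k8s, removes the releases in reverse order of installation and then deletes the consul namespace:

```shell
# Sketch: tear down the releases in reverse order of installation.
# cleanup_k8s is an illustrative name; run it from the project directory.
cleanup_k8s() {
  helmfile --file ./examples/dgraph/pydgraph_client.yaml destroy
  helmfile --file ./examples/dgraph/helmfile.yaml destroy

  # Volume claims created by a StatefulSet outlive the Helm release.
  kubectl delete pvc --namespace dgraph --all

  helmfile --file ./consul/helmfile.yaml destroy

  # Removes leftover secrets that would break a later reinstall.
  kubectl delete namespace consul
}
```

Deleting the persistent volume claims discards the Dgraph data, so skip that step if you want to keep it.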

Cloud Resources

The Kubernetes cluster and the associated Google service account can be deleted with the following commands:
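The deletion commands are missing from this copy. A sketch of what they could look like follows, assuming a hypothetical helper named delete_gke_cluster whose arguments are the same placeholder values used when the cluster was created:

```shell
# Sketch: delete the cluster and its node service account.
# Function name and arguments are illustrative assumptions.
delete_gke_cluster() {
  local project_id="$1" cluster_name="$2" region="$3" sa_email="$4"

  # --quiet skips the interactive confirmation prompt.
  gcloud container clusters delete "$cluster_name" \
    --project "$project_id" --region "$region" --quiet

  gcloud iam service-accounts delete "$sa_email" \
    --project "$project_id" --quiet
}
```

Deleting the cluster stops the compute charges; any IAM role bindings that referenced the service account can be removed separately from the project's IAM policy.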

Addendum: Publishing Pydgraph-Client Images

If you would like to publish the pydgraph-client images to an alternative registry, you can follow the steps below.

Download the source code

pushd examples
git clone \
--depth 1 \
--branch "consul" \
git@github.com:darkn3rd/pydgraph-client.git
popd

Publishing to GCR

If you wish to use Google Container Registry, you can run the following.
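The original commands are not preserved here. A minimal sketch, assuming a hypothetical helper named publish_to_gcr and an image name of pydgraph-client, builds the image from the cloned source directory and pushes it to gcr.io:

```shell
# Sketch: build and push the client image to Google Container Registry.
# Function name, image name, and default tag are assumptions.
publish_to_gcr() {
  local project_id="$1" src_dir="$2" tag="${3:-latest}"
  local image="gcr.io/${project_id}/pydgraph-client:${tag}"

  # Registers gcloud as a Docker credential helper for gcr.io.
  gcloud auth configure-docker --quiet

  docker build --tag "$image" "$src_dir"
  docker push "$image"
}
```

For example, publish_to_gcr "my-consul-project" ./examples/pydgraph-client. The service account used by the GKE nodes also needs read access to the registry's storage bucket before pods can pull the image.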

Publishing to DockerHub

If you have an account on DockerHub, you can publish it there with these steps:
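The steps themselves are missing from this copy. As a sketch, assuming a hypothetical helper named publish_to_dockerhub and the image name pydgraph-client, the flow is a login followed by a build and a push:

```shell
# Sketch: build and push the client image to DockerHub.
# Function name, image name, and default tag are assumptions.
publish_to_dockerhub() {
  local user="$1" src_dir="$2" tag="${3:-latest}"
  local image="${user}/pydgraph-client:${tag}"

  # Prompts for the password or access token interactively.
  docker login --username "$user"

  docker build --tag "$image" "$src_dir"
  docker push "$image"
}
```

After publishing, setting DOCKER_REGISTRY to your DockerHub user name before running helmfile would point the deployment at your own image instead of darknerd.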

Resources

These are some resources and references that may be useful in using this solution.

Consul Documentation

Gateways and Ingress

These are links for north-south traffic into the mesh.

These are links that cover integration of either ingress controllers or API gateways with Consul. This may be using Consul as a backend database or the Consul Connect service mesh itself.

📔 NOTE: I have not tested the content of this material; I am just documenting any material I find on the topic for later exploration. If you find any useful material out there, please send me a note.

Tracing

Dgraph Documentation

Helmfile

Blog Source Code

This is some code that I developed when testing the Consul Connect service mesh solution.

Conclusion

There you have it, a small (cough) overview of how to get started with Consul Connect Service Mesh. In particular, here are some of the takeaways:

  • Provisioning Kubernetes (GKE)
  • (addendum) Provisioning GCR and publishing images to GCR
  • Deploying Consul Connect Service Mesh on GKE
  • Deploying a server and a client with multiport support: HTTP and gRPC
  • Testing HTTP traffic with curl and gRPC traffic with grpcurl.
  • Limitations and Challenges with current multi-port scenarios

Additionally, here are some extra takeaways beyond just using Consul Connect:

  • Using Helmfile to deploy Helm charts with templated chart config values, where values and branch logic are set by env vars.
  • Using Helmfile to patch using Kustomize merge and JSON Patch
  • Using the Helm raw chart to package Kubernetes manifests as templated values
  • Introduction to Dgraph distributed graph database

The Challenges with Consul

You may have noticed that Consul is, dare I say, complex, beyond complex. The documentation is good, but not all that well organized, and many things are missing.

The underlying tool, Consul, is very powerful, and the Consul Connect service mesh built on top of it is quite robust and extremely flexible: you can swap out the default CA for other solutions, like Vault CA, and swap out the Envoy proxy for another solution, like NGINX or HAProxy. For ingress into the cluster, you can use Consul API Gateway, another API gateway, or an ingress controller.

Consul Connect service mesh has some challenges or limitations (see below) when you have a service that supports multiple ports.

Complexity

I have experimented with other service meshes and I was able to get up to speed quickly: Linkerd = 1 day, Istio = 3 days, NGINX Service Mesh = 5 days, but Consul Connect service mesh took at least 11 days to get off the ground. This is by far the most complex solution available.

Unable to Update

If you need to update Consul Connect with a configuration change and use helm to update consul, the consul-server pods may not reach a healthy state. You may have to delete everything and recreate it from scratch.

Apparently there’s some way to ameliorate this by adding leave_on_terminate: true setting in the server.extraConfig (ref).
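As an illustration of that setting, the sketch below writes a small Helm values overlay. The helper name write_leave_on_terminate_values and the output path are assumptions; server.extraConfig itself is a chart value that takes a raw JSON string of extra Consul server configuration. Whether this fully resolves the upgrade problem is not something verified here.

```shell
# Sketch: values overlay that sets leave_on_terminate on Consul servers.
# The function name and file path are illustrative assumptions.
write_leave_on_terminate_values() {
  local out="$1"

  # server.extraConfig is passed to the servers as raw JSON.
  cat > "$out" <<'EOF'
server:
  extraConfig: |
    {"leave_on_terminate": true}
EOF
}
```

The resulting file can be merged into the values in consul/helmfile.yaml, or passed to Helm directly with helm upgrade consul hashicorp/consul --namespace consul --reuse-values --values <file>.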

Higher Memory Footprint

Consul Connect service mesh has a higher memory footprint, so on a small cluster with e2-medium nodes (2 vCPUs, 4 GB memory), you will only be able to support a maximum of 6 sidecar proxies. To get an application like Dgraph working, which has 6 nodes (3 Dgraph Alpha pods and 3 Dgraph Zero pods) for high availability along with at least one client, a larger footprint with more robust Kubernetes worker nodes was required.

Requirement for Service Resource

One challenge to Consul Connect service mesh is that it configures the Envoy side-car proxy based on what you specify for a service. This added some challenges.

  • A pure client that is not listening on a port still requires you to specify a service resource so that it can be added to the service mesh.
  • A StatefulSet that requires specifying a headless service in addition to service endpoint into the cluster will fail spectacularly if both service and headless service use the same port.

The docs explicitly note this:

Note: As of consul-k8s v0.26.0 and Consul Helm v0.32.0, having a Kubernetes service is required to run services on the Consul Service Mesh. (ref)

More Complexity with Multiport

The Kubernetes service API supports an array of ports that you can specify, but Consul Connect only supports a single port in transparent-proxy mode. This is very bizarre, because a service with multiple ports is quite common, such as an admin port alongside an API port, or a service that has both HTTP and gRPC interfaces.

Multiple ports are part of the Kubernetes service API specification, which Consul Connect reads to configure the Envoy proxy. So, in this sense, Kubernetes is not fully supported as far as parity with the service API goes.

Illustration: Consul requires one service per port, while other meshes support n ports per service

For the multi-port scenario, the following will need to be done on the server:

  • all services with multiple ports need to be broken up into separate services with only one port each
  • a consul.hashicorp.com/connect-service annotation needs to list each of the services that will be mapped into Consul
  • a consul.hashicorp.com/connect-service-port annotation needs to list the ports that correspond to the services in the previous annotation
  • if ACLs are enabled, a serviceaccount needs to be specified for each service
  • if ACLs are enabled and Kubernetes 1.24+ is used, a corresponding secret for the service token needs to be created as well
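To make the server-side annotations above concrete, here is a sketch that patches them onto a StatefulSet's pod template. It rests on assumptions: the helper name annotate_multiport is hypothetical, the article renders these annotations through Helm rather than kubectl patch, and the explicit transparent-proxy: "false" annotation reflects my reading of the Consul multi-port documentation.

```shell
# Sketch: add multi-port Consul annotations to a StatefulSet pod template.
# annotate_multiport and its arguments are illustrative assumptions.
annotate_multiport() {
  local ns="$1" statefulset="$2" services="$3" ports="$4"
  local patch

  # services and ports are comma-separated lists in matching order.
  patch=$(cat <<EOF
{"spec": {"template": {"metadata": {"annotations": {
  "consul.hashicorp.com/connect-inject": "true",
  "consul.hashicorp.com/transparent-proxy": "false",
  "consul.hashicorp.com/connect-service": "${services}",
  "consul.hashicorp.com/connect-service-port": "${ports}"
}}}}}
EOF
  )

  kubectl patch statefulset "$statefulset" --namespace "$ns" \
    --type merge --patch "$patch"
}
```

A call such as annotate_multiport dgraph dgraph-dgraph-alpha "dgraph-dgraph-alpha,dgraph-dgraph-alpha-grpc" "8080,9080" maps each single-port service to its port; the second service name is a placeholder for whatever the separate gRPC service is called.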

The client will need the following in order to connect to the server:

  • specify consul.hashicorp.com/connect-service-upstreams annotation listing the consul service and outbound port to use from localhost.
  • if ACLs are enabled, a serviceaccount that corresponds to the service specified for the client.
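The client-side annotation can be sketched the same way. The helper below, with the hypothetical name set_upstreams, joins service:port pairs into the comma-separated value that the upstreams annotation expects and patches it onto a Deployment's pod template; the service names in the usage note are placeholders.

```shell
# Sketch: set the upstreams annotation on a client Deployment.
# set_upstreams and its arguments are illustrative assumptions.
set_upstreams() {
  local ns="$1" deployment="$2"
  shift 2
  local upstreams

  # Join the remaining "service:local-port" arguments with commas.
  upstreams=$(IFS=,; echo "$*")

  kubectl patch deployment "$deployment" --namespace "$ns" --type merge \
    --patch "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"consul.hashicorp.com/connect-service-upstreams\":\"${upstreams}\"}}}}}"
}
```

For example, set_upstreams pydgraph-client pydgraph-client dgraph-dgraph-alpha:8080 dgraph-dgraph-alpha-grpc:9080 would expose the two Dgraph services on localhost:8080 and localhost:9080 inside the client pod.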

The client is now required to connect to localhost at the target outbound port, not to the service endpoint DNS name, such as mysvc.myns.svc.cluster.local. This will be the only way to use the service mesh. Directly connecting to the service endpoint, e.g. mysvc.myns.svc.cluster.local, will bypass the service mesh and thus will not be protected with encryption.

Insecurity with Multiport

When transparent-proxy is enabled, members can communicate using the DNS name of the service endpoint, for example: mysvc.myns.svc.cluster.local. Unfortunately, when you use the multi-port scenario, transparent-proxy is disabled.

Illustration: Consul Connect service mesh members are vulnerable to outside attack

Because of this situation, security through mTLS or ACLs (tokens) can be bypassed completely when multi-port services are configured. Any non-mesh member or mesh member that does not have access granted (through configuring an intention) can connect to the service endpoint, such as mysvc.myns.svc.cluster.local. The only thing ACLs offer at this point is blocking encrypted traffic through the mesh, and thus the ACL feature is pointless.

This issue can be ameliorated by configuring the service itself to only communicate through localhost, which forces it to use the service mesh, but then this poses problems, such as trying to use an ingress. Alternatively, you could use a firewall, such as a network policy. Ultimately, another non-Consul solution is needed.
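The bypass described in this section can be checked from a throwaway pod that is not part of the mesh. The sketch below assumes a hypothetical helper named probe_direct_access and uses the public curlimages/curl image; the URL you pass in is a placeholder for your own service endpoint.

```shell
# Sketch: call a mesh service directly from a pod outside the mesh.
# probe_direct_access and the pod name are illustrative assumptions.
probe_direct_access() {
  local url="$1" ns="${2:-default}"

  # The pod carries no connect-inject annotation, so it gets no sidecar.
  kubectl run mesh-bypass-probe --namespace "$ns" \
    --image curlimages/curl --restart Never --rm --stdin \
    --command -- curl --silent --max-time 5 "$url"
}
```

Running probe_direct_access http://dgraph-dgraph-alpha.dgraph.svc.cluster.local:8080/health and getting a health response back would confirm that the service endpoint is reachable without going through the mesh; after adding a network policy, the same probe should time out instead.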

Ingress Challenge with Multiport

An ingress controller is an interesting challenge to integrate into the service mesh, as annotations will be needed to put the ingress controller pods onto the service mesh. The ingress controller will route traffic to the backend service using the local DNS name, such as mysvc.myns.svc.cluster.local, where the service named mysvc is running in the myns namespace.

With the multi-port scenario, however, this will not work, because the ingress controller is now required to route to localhost on the specific outbound ports that are specified in the consul.hashicorp.com/connect-service-upstreams annotation. The normal ingress resource API does not support this setup, as it routes to the Kubernetes service DNS name, not to localhost.

Some ingress controllers may provide extra non-standard configuration that could support this requirement to route to localhost, but unfortunately no one at HashiCorp has even tested this common use case (ref).

No Observability with Multiport

If you are using multi-port scenario, observability is not an option. Just forget you even heard of the word observability, one of the three planes that make up the service mesh solution. The Consul Connect injection process will actually cause stack traces.

Wrapping Up

I hope this is a useful introduction to Consul Connect service mesh and helps you get started should you want to try it out. If you have services that only listen on a single port, then this is certainly an interesting solution to explore.

If, however, you have an application service that needs support for 2+ ports, because you know, Kubernetes supports this, I would recommend avoiding Consul Connect, as it does not meet the minimum requirements for a service mesh. Perhaps someday, when HashiCorp prioritizes basic functionality and usability in a future version, this product can be considered.
