Ultimate Baseline GKE cluster

Part I: Proper Google Kubernetes Engine baseline cluster

Oct 22, 2023 · 17 min read

On the surface, Google Kubernetes Engine (GKE) is easy to set up and get started with. With just one command or a few mouse clicks, you can have a full cluster ready to go. This is in stark contrast to AWS EKS, which is like the IKEA of the cloud, where much self-assembly is required.

Insecure Defaults

However, the defaults for GKE are, cough, not so secure. The current defaults not only grant Kubernetes worker nodes (GCE instances) permissions that can be used to escalate privileges, but also unnecessarily expose the cluster to the public Internet by giving each worker node a public IP address.

This article will cover how to create a minimal baseline Kubernetes cluster with GKE that is, well, more secure than the current defaults.

Two Parts

This article will have two parts:

Provision GKE cluster and demo cluster features
Demonstrate cluster features with stateful application.

The stateful application used for the demonstration will be Dgraph, a highly performant distributed graph database.

For both parts, the following features will be covered:

storage support: default (built-in, encrypted volumes)
load balancer (layer 4): default (built-in)
reverse proxy (layer 7): default (built-in)
traffic security: network policies using Calico (add-on option during installation)

Components

As versions of components will change over time, here are the versions of the components that were used in this tutorial.

* Kubernetes API 1.27.3-gke.100
* kubectl v1.27.3
* helm v3.12.2
* gcloud 455.0.0

0. Prerequisites

These are some prerequisites and initial steps needed to get started before provisioning a Kubernetes cluster and installing add-ons.

0.1 Knowledge: Systems

Basic concepts of systems, such as Linux and the shell (redirection, pipes, process substitution, command substitution, environment variables), as well as virtualization and containers are useful. The concept of a service (daemon) is important.

0.2 Knowledge: Networking

This article requires some basic understanding of networking with TCP/IP and the OSI model, specifically the Transport Layer 4 and Application Layer 7 for HTTP. This article covers using load balancing (layer 4) and reverse proxy (layer 7 load balancing).

0.3 Knowledge: Kubernetes

In Kubernetes, familiarity with the service types (ClusterIP, NodePort, LoadBalancer, ExternalName) as well as the ingress resource is vital.

Exposure to the other types of Kubernetes resource objects used in this guide is helpful: persistent volume claims, storage classes, pods, deployments, statefulsets, configmaps, service accounts, and network policies.

0.4 Tools

These are the tools used in this article.

Google Cloud SDK [gcloud command] to interact with Google Cloud
kubectl client [kubectl] is the tool that interacts with the Kubernetes cluster. This can be installed using the asdf tool.
helm [helm] is a tool that can install Kubernetes applications that are packaged as helm charts.
POSIX Shell [sh] such as bash [bash] or zsh [zsh] are used to run the commands. These come standard on Linux, and with macOS you can get the latest with brew install bash zsh if Homebrew is installed.

These tools are highly recommended:

asdf [asdf] is a tool that installs versions of popular tools like kubectl.
jq [jq] is a tool to query and print JSON data
GNU Grep [grep] supports extracting string patterns using extended Regex and PCRE. This comes default on Linux distros, and for macOS it can be installed with brew install grep if Homebrew is installed.

0.5 Setup Environment Variables

These environment variables will be used throughout this guide. You can save the content below as env.sh and source it as necessary.

# global var
export GKE_PROJECT_ID="base-gke"

# network vars
export GKE_NETWORK_NAME="base-main"
export GKE_SUBNET_NAME="base-private"
export GKE_ROUTER_NAME="base-router"
export GKE_NAT_NAME="base-nat"

# principal vars
export GKE_SA_NAME="gke-worker-nodes-sa"
export GKE_SA_EMAIL="$GKE_SA_NAME@${GKE_PROJECT_ID}.iam.gserviceaccount.com"

# gke vars
export GKE_CLUSTER_NAME="base-gke"
export GKE_REGION="us-central1"
export GKE_MACHINE_TYPE="e2-standard-2"

# kubectl client vars
export USE_GKE_GCLOUD_AUTH_PLUGIN="True"
export KUBECONFIG=~/.kube/gcp/$GKE_REGION-$GKE_CLUSTER_NAME.yaml

If you open a new terminal tab, make sure to set the environment variables again. In bash or zsh, this can be done with source env.sh.
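
Since every later command depends on these variables, a small guard can catch a terminal where env.sh was never sourced. The sketch below is one way to do that; the function name check_env is hypothetical and not part of the original setup.

```shell
# check_env (hypothetical helper): print each named variable that is
# unset or empty, and return non-zero if any are missing.
check_env() {
  missing=0
  for name in "$@"; do
    # indirect lookup through eval keeps this working in plain POSIX sh
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_env GKE_PROJECT_ID GKE_REGION GKE_CLUSTER_NAME GKE_SA_EMAIL || echo "run: source env.sh" could be run at the top of a new terminal session.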

0.6 Google Cloud Setup

Before getting started with GKE, you need to set up a Google Cloud account and the Google Cloud SDK. You can get a free trial account with $300 in free credits.

After this is set up, you need to navigate to billing in the cloud console, located at https://console.cloud.google.com/billing/, and copy the billing account number. Use this for the value of the CLOUD_BILLING_ACCOUNT environment variable, as shown below.

# billing var
export CLOUD_BILLING_ACCOUNT="<my-cloud-billing-account>" # <--CHANGEME
# other vars
source env.sh

In the same terminal window, you can run the following commands to create a project and enable the APIs that the project needs for creating a GKE cluster.

# create new project
gcloud projects create $GKE_PROJECT_ID

# set up billing to the GKE project
gcloud beta billing projects link $GKE_PROJECT_ID \
  --billing-account $CLOUD_BILLING_ACCOUNT

# authorize APIs for GKE project
gcloud config set project $GKE_PROJECT_ID
gcloud services enable "compute.googleapis.com"
gcloud services enable "container.googleapis.com"
gcloud config set compute/region $GKE_REGION
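
To confirm that the APIs were actually enabled before moving on, something like the following sketch could be used. The function name api_enabled is hypothetical, and the config.name field used in the filter and format is an assumption about how gcloud services list reports service names.

```shell
# api_enabled (hypothetical helper): succeed only if the given API
# shows up in the project's list of enabled services.
# Assumes the service name is exposed as the config.name field.
api_enabled() {
  result=$(gcloud services list --enabled \
    --project "$GKE_PROJECT_ID" \
    --filter "config.name=$1" \
    --format "value(config.name)")
  [ "$result" = "$1" ]
}
```

A check such as api_enabled "container.googleapis.com" && echo "GKE API is on" would then report the state of the GKE API.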

0.7 Kubernetes Client Setup

If you use asdf to install kubectl, you can get the latest version with the following commands:

# install kubectl plugin for asdf
asdf plugin-add kubectl \
  https://github.com/asdf-community/asdf-kubectl.git

# fetch latest kubectl 
asdf install kubectl latest
asdf global kubectl latest

# test results of latest kubectl 
kubectl version --client

This should show something like:

Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3

Also, create a directory to store the Kubernetes configurations that will be referenced by the KUBECONFIG env variable:

mkdir -p $HOME/.kube/gcp

1.0 Provision Google Cloud resources

Provisioning GKE will take place in the following order:

Create the Google Service Account
Create the private subnet
Create the GKE cluster using the Service Account and the private subnet
Configure Kubernetes client access to GKE cluster

This will create a cluster with support for storage (persistent volumes), external load balancers, an ingress controller, network policies (Calico), and workload identity.

1.1 Create a secure Service Account

A general term across cloud providers for what we call an account, such as users and groups, is a principal, or sometimes an identity.

Automation also needs some type of principal, and with Google Cloud, this is the Google Service Account (analogous to an IAM Role in AWS). When no service account is specified for GKE, the default service account that was created with the project will be used.

This default service account can be dangerous, so we need to create a more secure service account that has only the minimum privileges needed for running a GKE cluster.

The following commands will create a service account with these minimum privileges.

#######################
# list of minimal required roles
#######################################
ROLES=(
  roles/logging.logWriter
  roles/monitoring.metricWriter
  roles/monitoring.viewer
  roles/stackdriver.resourceMetadata.writer
)

#######################
# create google service account principal
#######################################
gcloud iam service-accounts create $GKE_SA_NAME \
  --display-name $GKE_SA_NAME --project $GKE_PROJECT_ID

#######################
# assign google service account to roles in GKE project
#######################################
for ROLE in ${ROLES[*]}; do
  gcloud projects add-iam-policy-binding $GKE_PROJECT_ID \
    --member "serviceAccount:$GKE_SA_EMAIL" \
    --role $ROLE
done

In most systems, you usually add permissions to a principal, but in Google Cloud, you add the principal to the permissions. Thus, we need to add the GSA (Google Service Account) to a role within the project, which will grant the necessary permissions associated with that role. This scopes the permissions to resources created within a single project.
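
Because the bindings live on the project rather than on the service account, the way to see what a service account can do is to query the project's IAM policy. A minimal sketch, with the hypothetical function name list_sa_roles:

```shell
# list_sa_roles (hypothetical helper): print the project-level roles
# whose member list includes the given service account email.
list_sa_roles() {
  gcloud projects get-iam-policy "$GKE_PROJECT_ID" \
    --flatten "bindings[].members" \
    --filter "bindings.members:serviceAccount:$1" \
    --format "value(bindings.role)"
}
```

Running list_sa_roles "$GKE_SA_EMAIL" after the loop above should print the four roles from the ROLES list, one per line.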

📓 NOTE: When running gcloud projects add-iam-policy-binding several times in succession, you may run into rate limiting from Google. You’ll have to wait a minute before running the command again.

ERROR: (gcloud.projects.add-iam-policy-binding) RESOURCE_EXHAUSTED: Quota 
exceeded for quota metric 'Write requests' and limit 
'Write requests per minute' of service 'cloudresourcemanager.googleapis.com' 
for consumer 'project_number:XXXXXXXXXX'.
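
One way to cope with this without babysitting the terminal is a small retry wrapper. The sketch below is an illustration only: the name retry and the RETRY_DELAY and MAX_TRIES variables are hypothetical, and it retries on any failure, not just on RESOURCE_EXHAUSTED.

```shell
# retry (hypothetical helper): run a command up to MAX_TRIES times,
# sleeping between attempts and doubling the delay each time.
retry() {
  tries=0
  delay=${RETRY_DELAY:-15}
  max=${MAX_TRIES:-5}
  until "$@"; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max" ]; then
      echo "giving up after $tries attempts: $*" >&2
      return 1
    fi
    echo "attempt $tries failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
  done
}
```

Inside the role loop, the call would become retry gcloud projects add-iam-policy-binding ... with the same arguments as before.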

1.2 Create a private network infrastructure

When provisioning GKE, the default VPC and subnet will be used unless you specify otherwise, which may increase the attack surface of the GKE cluster. Anything running on the subnet, and possibly traffic from the public Internet, will be able to access the GKE cluster.

The first step to change this behavior would be to create a new VPC and subnet with outbound routing to the Internet, so that the cluster can pull down images from an external container registry like Docker Hub or Quay.

📓 NOTE: The commands below will create a regional network infrastructure suitable for this demo. Organizations that need to provision GKE clusters across multiple regions will want to set up Shared VPC networking. Multi-cluster use cases will not be covered in this article.

The commands below will create a private subnet with a router and NAT for outbound traffic, which is needed for pulling container images from the Internet.

#######################
# create VPC for target region
#######################################
gcloud compute networks create $GKE_NETWORK_NAME \
  --subnet-mode=custom \
  --mtu=1460 \
  --bgp-routing-mode=regional

#######################
# create subnet (spanning all availability zones w/i region)
#######################################
gcloud compute networks subnets create $GKE_SUBNET_NAME \
  --network=$GKE_NETWORK_NAME \
  --range=10.10.0.0/24 \
  --region=$GKE_REGION \
  --enable-private-ip-google-access

#######################
# add support for outbound traffic
#######################################
gcloud compute routers create $GKE_ROUTER_NAME \
  --network=$GKE_NETWORK_NAME \
  --region=$GKE_REGION

gcloud compute routers nats create $GKE_NAT_NAME \
  --router=$GKE_ROUTER_NAME \
  --region=$GKE_REGION \
  --nat-custom-subnet-ip-ranges=$GKE_SUBNET_NAME \
  --auto-allocate-nat-external-ips
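
For a sense of how much room the 10.10.0.0/24 primary range provides: a /24 holds 256 addresses, and Google Cloud reserves four addresses in each primary IPv4 subnet range. The arithmetic can be sketched as follows; the function name usable_ips is hypothetical.

```shell
# usable_ips (hypothetical helper): addresses in an IPv4 prefix minus
# the four that Google Cloud reserves in a primary subnet range.
usable_ips() {
  prefix=$1
  total=$(( 1 << (32 - prefix) ))
  echo $(( total - 4 ))
}

usable_ips 24   # prints 252
```

With IP aliasing enabled on the cluster, Pods and Services draw their addresses from secondary ranges, so the primary range mainly has to be large enough for the worker nodes.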

📓 NOTE: Cloud NAT has historically been quite costly, and for this reason, a popular alternative is to provision your own GCE instance and configure it as a NAT router for the subnet. This alternative will not be covered in this article.

1.3 Create the Kubernetes cluster

We can create the GKE cluster that uses the private subnet and the least-privilege GSA with the following command.

As an added bonus, we will also support Workload Identity, to allow for least privilege access to cloud resources when necessary.

gcloud container clusters create $GKE_CLUSTER_NAME \
  --project $GKE_PROJECT_ID \
  --region $GKE_REGION \
  --num-nodes 1 \
  --service-account "$GKE_SA_EMAIL" \
  --machine-type $GKE_MACHINE_TYPE \
  --enable-ip-alias \
  --enable-network-policy \
  --enable-private-nodes \
  --no-enable-master-authorized-networks \
  --master-ipv4-cidr 172.16.0.32/28 \
  --network $GKE_NETWORK_NAME \
  --subnetwork $GKE_SUBNET_NAME \
  --workload-pool "$GKE_PROJECT_ID.svc.id.goog"

📓 NOTE: The settings above will create worker nodes on a private subnet, but the master nodes managed by Google will still be reachable from the public Internet, which is required for the kubectl tool; this access is secured through Google Cloud credentials. To fully secure the master nodes as well, see Creating a private cluster, but note that this will also require setting up access to the master nodes by configuring Cloud VPN, Identity-Aware Proxy, an identity-based access solution like Boundary, or an alternative such as a VPN or bastion host. This topic will not be covered in this article.
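
After the cluster is created, it can be worth confirming that the security-related settings took effect. The sketch below prints a few fields from the cluster description; the function name show_cluster_security is hypothetical, and the field names are taken from the GKE cluster resource as I understand it, so adjust them if your gcloud version reports them differently.

```shell
# show_cluster_security (hypothetical helper): print the settings that
# this section relies on, as reported by the GKE API.
show_cluster_security() {
  gcloud container clusters describe "$GKE_CLUSTER_NAME" \
    --project "$GKE_PROJECT_ID" \
    --region "$GKE_REGION" \
    --format "yaml(privateClusterConfig.enablePrivateNodes,nodeConfig.serviceAccount,workloadIdentityConfig.workloadPool,networkPolicy.enabled)"
}
```

The output should show private nodes enabled, the custom service account email rather than the project default, and the workload pool for the project.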

1.4 Setup Kubernetes Client Access

During the creation of the GKE cluster, if the KUBECONFIG environment variable is set, the client configuration should be written automatically.

Should you need to set this up again or refresh it, you can run the following command:

gcloud container clusters get-credentials $GKE_CLUSTER_NAME \
  --project $GKE_PROJECT_ID \
  --region $GKE_REGION

We need to use a kubectl client that closely matches the version of the Kubernetes master nodes. If you have asdf installed (along with the asdf-kubectl plugin), you can install a kubectl tool that matches the master nodes with the following commands:

# fetch exact version of Kubernetes server
# Requires:
#   kubectl 1.28 or greater
#   GNU Grep
VER=$(kubectl version \
  | grep -oP '(?<=Server Version: v)(\d{1,2}\.){2}\d{1,2}'
)

# setup kubectl tool
asdf list kubectl | grep -q $VER || asdf install kubectl $VER
asdf global kubectl $VER

📓 NOTE: GNU Grep supports what I like to call OP (overpowered) mode, which combines only-matching mode (-o) with perl-regexp mode (-P) for more advanced pattern matching, such as zero-length lookbehind and lookahead assertions. The default BSD Grep that comes installed on macOS does not support this.

📓 NOTE: macOS users with Homebrew can install GNU Grep with brew install grep. The PATH env var needs to be updated; see instructions with brew info grep, as the path can vary between arm64 and amd64.
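
For readers who would rather not install GNU Grep, the same extraction can be sketched with sed, which behaves the same way on Linux and macOS for this pattern. The function name server_version is hypothetical, and the sketch assumes the "Server Version: v..." line format printed by kubectl 1.28 or greater.

```shell
# server_version (hypothetical helper): read `kubectl version` output
# on stdin and print the server's major.minor.patch without the "v".
server_version() {
  sed -n 's/^Server Version: v\([0-9]*\.[0-9]*\.[0-9]*\).*/\1/p'
}
```

It would be used as VER=$(kubectl version | server_version), after which the asdf commands above work unchanged.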

1.5 Testing the client

You can test functionality with the following commands:

kubectl get nodes
kubectl get all --all-namespaces

2.0 Testing Cluster Features (optional)

These are a set of minimalist tests that you can run to test the cluster. These may be useful in testing and troubleshooting configurations on GKE in general.

2.1 Persistent Volumes

Below is a minimal test that can be used to test persistent volumes. This creates a small Ubuntu server that mounts a volume using the premium-rwo storage class for faster SSD storage.

# create pod with persistent volume
kubectl create namespace "pv-test"

# deploy application with mounted volume
# (the quoted 'EOF' stops the local shell from expanding $(date -u))
cat <<'EOF' | kubectl apply --namespace "pv-test" --filename -
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: ubuntu
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
      volumeMounts:
      - name: persistent-storage
        mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: pv-claim
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pv-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: premium-rwo
  resources:
    requests:
      storage: 4Gi
EOF

You can test the results of the volume creation with the following command:

kubectl get all,pvc --namespace "pv-test"

We can also look at the events that took place in this namespace with:

kubectl events --namespace "pv-test"
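
To confirm that the pod is really writing to the mounted volume, the file can be read back from inside the container. A minimal sketch, with the hypothetical function name check_pv_writes:

```shell
# check_pv_writes (hypothetical helper): wait for the pod to be Ready,
# then print the last few lines written to the mounted volume.
check_pv_writes() {
  kubectl wait pod/app --for=condition=Ready \
    --namespace "pv-test" --timeout=180s &&
  kubectl exec app --namespace "pv-test" -- tail -n 3 /data/out.txt
}
```

Running check_pv_writes a few times should show the file growing, which indicates the volume is mounted and writable.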

When finished, you can delete the application with:

kubectl delete pod app --namespace "pv-test"
kubectl delete pvc pv-claim --namespace "pv-test"
kubectl delete ns "pv-test"

2.2 External Load Balancer

You can provision an external load balancer in Kubernetes by deploying a service object of type LoadBalancer. On GKE, this will create an L4 load balancer.

# deploy application
kubectl create namespace httpd-svc
kubectl create deployment httpd \
  --image=httpd \
  --replicas=3 \
  --port=80 \
  --namespace=httpd-svc

# provision external load balancer
cat <<EOF | kubectl apply --namespace httpd-svc -f -
apiVersion: v1
kind: Service
metadata:
  name: httpd
spec:
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  type: LoadBalancer
  selector:
    app: httpd
EOF

Verify that all the components were installed:

kubectl get all --namespace=httpd-svc

You can run curl to fetch a response from the web service:

export SVC_LB=$(kubectl get service httpd \
  --namespace "httpd-svc" \
  --output jsonpath='{.status.loadBalancer.ingress[0].ip}'
)

curl --silent --include $SVC_LB

Additionally, for the curious, if you want to look at resources created by the controller, you can run the following command:

export SVC_LB=$(kubectl get service httpd \
  --namespace "httpd-svc" \
  --output jsonpath='{.status.loadBalancer.ingress[0].ip}'
)

gcloud compute forwarding-rules list \
  --filter $SVC_LB \
  --format \
  "table[box](name,IPAddress,target.segment(-2):label=TARGET_TYPE)"

When finished, you can delete the service with the following:

kubectl delete "service/httpd" --namespace "httpd-svc"
kubectl delete namespace "httpd-svc"

2.3 Ingress Controller

You can deploy an L7 load balancer that routes traffic back to your application, called a reverse proxy, by creating an ingress object in Kubernetes.

📓 NOTE: GKE comes with a default ingress controller called ingress-gce and is known by many names by Google: GKE Ingress for Application Load Balancers, GKE Ingress controller, GLBC (Google Load Balancer Controller), and by the Git project name ingress-gce (even though this controller is for GKE, not GCE). For this tutorial, we’ll use the Git project name for consistency.

📓 NOTE: If you configure an alternative default ingress controller and would still like to use ingress-gce, you need to set the older, deprecated annotation kubernetes.io/ingress.class to gce, as the ingress class setting is ignored (see GKE Ingress controller behavior for more info).

📓 NOTE: When using ingress-gce, the service must be of type NodePort or LoadBalancer, as the default ClusterIP will cause an error. This is likely because earlier versions of GKE created a private virtual network for Pods, so they were not accessible by an external load balancer outside of the Kubernetes cluster. Current versions of GKE use ip-alias, also called VPC-native or Alias IP, which puts Pods on the same subnet as Nodes, allowing Pods to be accessible from outside of the cluster. Thus it looks like ingress-gce was not updated to support this newer configuration.

Below is an example on how to deploy an Apache web server with GKE ingress:

# deploy application 
kubectl create namespace "httpd-ing"
kubectl create deployment httpd \
  --image "httpd" \
  --replicas 3 \
  --port 80 \
  --namespace "httpd-ing"

# create proxy to deployment
kubectl expose deployment httpd \
  --port 80 \
  --target-port 80 \
  --type NodePort \
  --namespace "httpd-ing"

# provision application load balancer
kubectl create ingress gke-ingress \
  --rule "/=httpd:80" \
  --annotation "kubernetes.io/ingress.class=gce" \
  --namespace "httpd-ing"

Verify that the components were installed with the command below. It may take a few minutes to get an IP address for the ingress resource.

kubectl get all,ing --namespace "httpd-ing"

Test the connection with the following below. Note that it will take about 3 minutes before a public IP address is allocated to the load balancer.

export ING_LB=$(kubectl get ing gke-ingress \
  --namespace "httpd-ing" \
  --output jsonpath='{.status.loadBalancer.ingress[0].ip}'
)

curl --silent --include $ING_LB
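
Since the address is not available right away, the lookup above can return an empty string if it is run too early. A polling loop is one way around this; the sketch below uses the hypothetical name wait_for_ip along with hypothetical WAIT_ATTEMPTS and WAIT_INTERVAL variables, and defaults to roughly five minutes of waiting.

```shell
# wait_for_ip (hypothetical helper): poll a resource until its load
# balancer reports an IP address, then print that address.
# usage: wait_for_ip KIND NAME NAMESPACE
wait_for_ip() {
  attempts=${WAIT_ATTEMPTS:-30}
  interval=${WAIT_INTERVAL:-10}
  while [ "$attempts" -gt 0 ]; do
    # an error or empty result both mean "not ready yet"
    ip=$(kubectl get "$1" "$2" --namespace "$3" \
      --output jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null) || true
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  echo "timed out waiting for $1/$2" >&2
  return 1
}
```

For example, export ING_LB=$(wait_for_ip ingress gke-ingress httpd-ing) blocks until the address exists. The same function should work for the service in section 2.2 with wait_for_ip service httpd httpd-svc.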

Additionally, for the curious, if you want to look at GCP resources created by the controller, you can run the following command:

export ING_LB=$(kubectl get ing gke-ingress \
  --namespace "httpd-ing" \
  --output jsonpath='{.status.loadBalancer.ingress[0].ip}'
)

gcloud compute forwarding-rules list \
  --filter $ING_LB \
  --format \
  "table[box](name,IPAddress,target.segment(-2):label=TARGET_TYPE)"

When finished with this test, you can delete it with the following:

kubectl delete "ingress/gke-ingress" --namespace "httpd-ing"
kubectl delete namespace "httpd-ing"

2.4 Network Policies

Calico has a great tutorial application that shows the connections between a front-end service, a back-end service, and a client.

STEP 1: You can install this demonstration application with the following commands:

MANIFESTS=(00-namespace 01-management-ui 02-backend 03-frontend 04-client)
APP_URL=https://docs.projectcalico.org/v3.5/getting-started/kubernetes/tutorials/stars-policy/manifests/

for MANIFEST in ${MANIFESTS[*]}; do 
  kubectl apply -f $APP_URL/$MANIFEST.yaml
done

STEP 2: You can check out the graphical application by running the command below. You can run this in another terminal tab (make sure to set KUBECONFIG so that you can access the cluster).

kubectl port-forward service/management-ui \
  --namespace management-ui 9001

STEP 3: Open a browser to http://localhost:9001/. You should see the management user interface. The C node is the client service, the F node is the front-end service, and the B node is the back-end service. Each node has full communication access to all other nodes, as indicated by the bold, colored lines.

STEP 4: Apply the following network policies to isolate the services from each other:

DENY_URL=https://docs.projectcalico.org/v3.5/getting-started/kubernetes/tutorials/stars-policy/policies/default-deny.yaml

kubectl apply --namespace client --filename $DENY_URL
kubectl apply --namespace stars --filename $DENY_URL
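
For readers who want to see what a default-deny policy looks like without opening the URL, here is a sketch of a standard Kubernetes NetworkPolicy of that shape. It is an illustration written for this article and may differ from the file that Calico publishes; the function name default_deny_manifest is hypothetical.

```shell
# default_deny_manifest (hypothetical helper): print a NetworkPolicy
# that selects every pod in a namespace and allows no ingress traffic.
default_deny_manifest() {
  cat <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF
}
```

The empty podSelector matches all pods in the namespace, and because no ingress rules are listed, all inbound traffic to those pods is dropped until another policy allows it. The policies that are actually applied can be listed with kubectl get networkpolicy --all-namespaces.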

STEP 5: If the graphical application is still open, refresh the browser. The management user interface will no longer be able to reach the services.

STEP 6: Apply the following network policies to allow the management user interface to access the services:

export ALLOW_URL=https://docs.projectcalico.org/v3.5/getting-started/kubernetes/tutorials/stars-policy/policies/

kubectl apply --filename $ALLOW_URL/allow-ui.yaml
kubectl apply --filename $ALLOW_URL/allow-ui-client.yaml

STEP 7: After refreshing the browser, you can see that the management user interface can reach the nodes again, but the nodes cannot communicate with each other.

STEP 8: Apply the following network policy to allow traffic from the front-end service to the back-end service:

kubectl apply --filename $ALLOW_URL/backend-policy.yaml

STEP 9: After refreshing the browser, you can see that the front-end can communicate with the back-end.

STEP 10: Apply the following network policy to allow traffic from the client to the front-end service.

kubectl apply --filename $ALLOW_URL/frontend-policy.yaml

After refreshing the browser, you can see that the client can communicate with the front-end service. The front-end service can still communicate with the back-end service.

CLEANUP: These commands will delete the application.

MANIFESTS=(04-client 03-frontend 02-backend 01-management-ui 00-namespace)
APP_URL=https://docs.projectcalico.org/v3.5/getting-started/kubernetes/tutorials/stars-policy/manifests/

for MANIFEST in ${MANIFESTS[*]}; do 
  kubectl delete --filename $APP_URL/$MANIFEST.yaml
done

Cleanup

Persistent Volumes

Before deleting a GKE cluster, you want to make sure that all persistent volumes are deleted. You can check for persistent volume claims across the cluster with this command:

kubectl get pvc --all-namespaces

Should you find any pvc (persistent volume claims), you will want to delete both the pvc and the corresponding application. Generally, deleting the pvc will delete the corresponding pv (persistent volume), but only once the application that uses the storage is deleted.

If you delete the GKE cluster without deleting all persistent volumes, there will be leftover disk resources eating up costs.
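
To make this check harder to skip, it can be turned into a guard that refuses to continue while claims remain. The function names pvc_count and confirm_no_pvcs below are hypothetical; this is a sketch, not a complete safety net.

```shell
# pvc_count (hypothetical helper): number of persistent volume claims
# remaining in the cluster, across all namespaces.
pvc_count() {
  kubectl get pvc --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# confirm_no_pvcs (hypothetical helper): fail if any claims remain.
confirm_no_pvcs() {
  remaining=$(pvc_count)
  if [ "$remaining" -gt 0 ]; then
    echo "$remaining PVC(s) still exist; delete them first" >&2
    return 1
  fi
}
```

The cluster deletion can then be chained behind it, as in confirm_no_pvcs && gcloud container clusters delete $GKE_CLUSTER_NAME. After the cluster is gone, gcloud compute disks list can be used to look for any disks that were left behind.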

Kubernetes Cluster

Afterwards, you can delete the GKE cluster with a single command:

gcloud container clusters delete $GKE_CLUSTER_NAME

Afterwards, you may want to clean up any configuration files on your system:

rm -f $KUBECONFIG

Network Infrastructure

We can delete the network infrastructure with the following commands:

gcloud compute routers nats delete $GKE_NAT_NAME --router $GKE_ROUTER_NAME
gcloud compute routers delete $GKE_ROUTER_NAME
gcloud compute networks subnets delete $GKE_SUBNET_NAME
gcloud compute networks delete $GKE_NETWORK_NAME

Delete the Service Account and Role Assignments

It is good practice to remove the service account's entries from the role bindings in a given project; otherwise, unused service accounts will remain listed, and it becomes hard to account for them later. Your organization can get trapped in a state of not knowing whether it is safe to remove these leftover service accounts, or whether such a service account may have undesired access to resources. So, spring cleaning now is useful, as we remove resources we no longer need.

You can remove the service account from role lists within the project using the following commands.

#######################
# list of roles configured earlier
#######################################
ROLES=(
  roles/logging.logWriter
  roles/monitoring.metricWriter
  roles/monitoring.viewer
  roles/stackdriver.resourceMetadata.writer
)


#######################
# remove service account from roles
#######################################
for ROLE in ${ROLES[*]}; do
  gcloud projects remove-iam-policy-binding $GKE_PROJECT_ID \
    --member "serviceAccount:$GKE_SA_EMAIL" \
    --role $ROLE
done

After removing access from the project, delete the unused service account:

gcloud iam service-accounts delete $GKE_SA_EMAIL --project $GKE_PROJECT_ID

Conclusion

This is a good baseline to get you started with a GKE cluster that is less insecure than the defaults that Google Cloud provides.

From this vantage point, you can explore other topics like Shared VPC, fully private clusters, identity-based access to manage GKE, and least-privilege access to cloud resources (like GCS, Cloud DNS, etc.) using Workload Identity.

This baseline will be used in other articles to explore topics like infrastructure-as-code with Terraform (or OpenTofu), policy-as-code with Sentinel or open policy agent, progressive delivery (ArgoCD, FluxCD, Spinnaker), gateways, reverse-proxies, and service meshes (Consul Connect, Istio, Linkerd), o11y (log aggregation, log shipping, metrics, continuous profiling, distributed tracing, alerting, visualization, etc), and more.