Installing the Anthos Service Mesh Google-managed control plane on GKE Autopilot with multi-cluster Mesh CA

Hil Liao
9 min read · Mar 17, 2022

It used to be hard to install Anthos Service Mesh on GKE, let alone on GKE Autopilot in 2021. Google has fixed many of the installation bugs and problems in 2022, and I am surprised by how much easier the process has become. In this post I will guide you through installing the Anthos Service Mesh Google-managed control plane on GKE Autopilot private clusters with multi-cluster Mesh CA.

I basically followed Configure managed Anthos Service Mesh for the installation, but first, let's create 2 GKE Autopilot clusters. The prerequisite is that you are a project owner in a GCP project. Set the default project and verify the project and account with gcloud config list, as sketched below. Then modify the following environment variables in Cloud Shell for your existing project; there are many hard-coded values in the gcloud commands that create the clusters, so adjust them to suit your needs.
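
A minimal sketch of setting and verifying the default project and account; the project ID below is a placeholder:

gcloud config set project my-asm-project # placeholder project ID, replace with yours
gcloud config list # verify the [core] account and project values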

VPC=[USER INPUT] # a VPC containing 2 subnets in different regions, within 10.0.0.0/8
SUBNET=[USER INPUT] # assuming both subnets have the same name
export PROJECT_1=[USER INPUT]
export LOCATION_1=us-central1
export CLUSTER_1=autopilot-cluster-1
export CTX_1="gke_${PROJECT_1}_${LOCATION_1}_${CLUSTER_1}"
export PROJECT_2=$PROJECT_1
export LOCATION_2=us-east4
export CLUSTER_2=autopilot-cluster-2
export CTX_2="gke_${PROJECT_2}_${LOCATION_2}_${CLUSTER_2}"
gcloud container --project "$PROJECT_1" clusters create-auto "$CLUSTER_1" --region $LOCATION_1 --release-channel "regular" --enable-private-nodes --enable-private-endpoint --master-ipv4-cidr "172.16.180.128/28" --enable-master-authorized-networks --master-authorized-networks 10.0.0.0/8 --network "projects/$PROJECT_1/global/networks/$VPC" --subnetwork "projects/$PROJECT_1/regions/$LOCATION_1/subnetworks/$SUBNET"
gcloud container --project "$PROJECT_1" clusters create-auto "$CLUSTER_2" --region $LOCATION_2 --release-channel "regular" --enable-private-nodes --enable-private-endpoint --master-ipv4-cidr "172.16.180.64/28" --enable-master-authorized-networks --master-authorized-networks 10.0.0.0/8 --network "projects/$PROJECT_1/global/networks/$VPC" --subnetwork "projects/$PROJECT_1/regions/$LOCATION_2/subnetworks/$SUBNET"

Once the clusters are created, enable global access for private Autopilot clusters.

gcloud container clusters update ${CLUSTER_2} --project ${PROJECT_2} --region ${LOCATION_2} --enable-master-global-access &
gcloud container clusters update ${CLUSTER_1} --project ${PROJECT_1} --region ${LOCATION_1} --enable-master-global-access &

Upon success, you'd observe output like:

Updating autopilot-cluster-2...done.
Updated [https://container.googleapis.com/v1/projects/$PROJECT_1/zones/us-east4/clusters/autopilot-cluster-2].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-east4/autopilot-cluster-2?project=$PROJECT_1

You’d need to create a bastion host in the $VPC to access the GKE control plane, since the clusters are private with the authorized network set to 10.0.0.0/8; see the sketch below. On the bastion host, test connectivity to the GKE control plane with the IP observed in Cloud Console, for example nc -vz 172.16.200.66 443, and you should see Connection to 172.16.200.66 443 port [tcp/https] succeeded!. Then install the binary dependencies.
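
A minimal sketch of creating such a bastion VM in the $VPC, assuming a zone in $LOCATION_1, no external IP, and IAP for SSH; the instance name, zone, and machine type are my own choices:

gcloud compute instances create asm-bastion \
  --project $PROJECT_1 \
  --zone ${LOCATION_1}-b \
  --machine-type e2-small \
  --network-interface subnet=projects/$PROJECT_1/regions/$LOCATION_1/subnetworks/$SUBNET,no-address
# SSH over IAP since the VM has no external IP; requires a firewall rule allowing 35.235.240.0/20
gcloud compute ssh asm-bastion --zone ${LOCATION_1}-b --tunnel-through-iap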

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && chmod a+x kubectl && sudo mv kubectl /usr/local/bin/
sudo apt install jq

Follow the steps in Configure managed Anthos Service Mesh up to chmod +x asmcli (a rough reference for that download step is sketched below). The next step is to install the mesh on the GKE Autopilot clusters. I executed the following commands sequentially without switching the kubeconfig context.
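
For reference, downloading asmcli at the time of writing looked roughly like the following; the version in the URL is an assumption, so check the ASM documentation for the current release:

curl https://storage.googleapis.com/csm-artifacts/asm/asmcli_1.12 > asmcli # version path is an assumption
chmod +x asmcli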

./asmcli install -p $PROJECT_1 -l $LOCATION_2 -n $CLUSTER_2 --managed --verbose --output_dir $CLUSTER_2 --use_managed_cni --channel rapid --enable-all &
# install ASM on the other private Autopilot GKE cluster with authorized network set to 10.0.0.0/8
./asmcli install -p $PROJECT_1 -l $LOCATION_1 -n $CLUSTER_1 --managed --verbose --output_dir $CLUSTER_1 --use_managed_cni --channel rapid --enable-all &

Create Cloud NAT to enable Internet egress in each of the 2 GKE cluster regions. Internet egress is required for a service in the mesh to reach the other cluster's control plane endpoint and to download the test apps for cross-cluster load balancing. Skip the Cloud NAT section if your VPC already has some form of Internet egress configured.

for REGION in $LOCATION_1 $LOCATION_2
do
gcloud compute routers create nat-router-$REGION --network $VPC --region $REGION && \
gcloud compute routers nats create nat-config-$REGION --router-region $REGION --router nat-router-$REGION --nat-all-subnet-ip-ranges --auto-allocate-nat-external-ips --enable-dynamic-port-allocation
done

If you followed the steps in Configure managed Anthos Service Mesh, your kubeconfig is configured for the 2 GKE clusters. Execute the following command to test that a pod can egress to the Internet.

for CTX in ${CTX_1} ${CTX_2}
do
cat <<EOF | kubectl apply --context $CTX -f -
apiVersion: v1
kind: Pod
metadata:
  name: gcloud
  labels:
    app: gcloud-cmd
spec:
  containers:
  - name: gcloud
    image: gcr.io/google.com/cloudsdktool/cloud-sdk:latest
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
EOF
done

Wait about 3 minutes and use Cloud Console to verify the gcloud pods are created (a CLI alternative is sketched below), then exec into them with kubectl.
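
If you prefer the CLI to Cloud Console, a minimal sketch that waits for the gcloud pods to become ready:

for CTX in ${CTX_1} ${CTX_2}
do
kubectl --context $CTX wait --for=condition=Ready pod/gcloud --timeout=300s
done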

for CTX in ${CTX_1} ${CTX_2}
do
kubectl --context $CTX exec -t gcloud -- curl amazon.com -I
done
# start an interactive shell
kubectl --context $CTX_1 exec -it gcloud -- bash

Expect to see 301 Moved Permanently from both curl commands. If a command hangs, the pods can't egress to the Internet; stop and troubleshoot the VPC, starting with the checks sketched below.
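
A hedged starting point for troubleshooting egress, assuming the Cloud NAT setup above: confirm each region has the NAT config and that the VPC has a route to the default Internet gateway.

for REGION in $LOCATION_1 $LOCATION_2
do
gcloud compute routers nats list --router nat-router-$REGION --router-region $REGION
done
gcloud compute routes list --filter="nextHopGateway:default-internet-gateway" # look for a route in $VPC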

Optionally apply the Google-managed data plane for both clusters.

# get the current kubectl context. For example, you get $CTX_2
kubectl config get-contexts
kubectl create ns mesh && \
kubectl annotate --overwrite namespace mesh mesh.cloud.google.com/proxy='{"managed":"true"}' && \
if kubectl get dataplanecontrols -o custom-columns=REV:.spec.revision,STATUS:.status.state | grep rapid | grep -v none > /dev/null; then echo "Managed Data Plane is ready."; else echo "Managed Data Plane is NOT ready."; fi
kubectl config use-context $CTX_1 # switch to the other GKE cluster, then repeat the commands above

Get the IP ranges for the clusters in order to create the firewall rule that allows ingress to the GKE Autopilot pods and nodes. I developed the following script and reported it to Google so they could publish the approach of getting the target tags from the existing firewall rules created by GKE.

function join_by { local IFS="$1"; shift; echo "$*"; }
ALL_CLUSTER_CIDRS=$(gcloud container clusters list --project $PROJECT_1 --format='value(clusterIpv4Cidr)' | sort | uniq)
ALL_CLUSTER_CIDRS=$(join_by , $(echo "${ALL_CLUSTER_CIDRS}"))
echo $ALL_CLUSTER_CIDRS # observe output like the following line
# 10.136.48.0/20,10.37.32.0/20
TAGS=""
for CLUSTER in ${CLUSTER_1} ${CLUSTER_2}
do
TAGS+=$(gcloud compute firewall-rules list --filter="Name:$CLUSTER*" --format="value(targetTags)" | uniq) && TAGS+=","
done
TAGS=${TAGS::-1}
echo "Network tags for pod ranges are $TAGS"
gcloud compute firewall-rules create asm-multicluster-pods \
--allow=tcp,udp,icmp,esp,ah,sctp \
--direction=INGRESS \
--priority=900 --network=$VPC \
--source-ranges="${ALL_CLUSTER_CIDRS}" \
--target-tags=$TAGS

Configure endpoint discovery between the private clusters per the section "B. For Managed Anthos Service Mesh" in the documentation. OUTPUT_DIR is the directory generated by the asmcli install command; assume istio-1.12.5-asm.0 is the sub-folder.

export OUTPUT_DIR=$CLUSTER_2
ls $OUTPUT_DIR # copy istio-* sub-folder
export SAMPLES_DIR=$OUTPUT_DIR/istio-1.12.5-asm.0
ls $SAMPLES_DIR # make sure folder exists
PUBLIC_IP=`gcloud container clusters describe "${CLUSTER_1}" --project "${PROJECT_1}" \
--region "${LOCATION_1}" --format "value(privateClusterConfig.publicEndpoint)"` && \
$OUTPUT_DIR/istioctl x create-remote-secret --context=${CTX_1} --name=${CLUSTER_1} --server=https://${PUBLIC_IP} > ${CTX_1}.secret
PUBLIC_IP=`gcloud container clusters describe "${CLUSTER_2}" --project "${PROJECT_2}" \
--region "${LOCATION_2}" --format "value(privateClusterConfig.publicEndpoint)"` && \
$OUTPUT_DIR/istioctl x create-remote-secret --context=${CTX_2} --name=${CLUSTER_2} --server=https://${PUBLIC_IP} > ${CTX_2}.secret
kubectl apply -f ${CTX_1}.secret --context=${CTX_2} && kubectl apply -f ${CTX_2}.secret --context=${CTX_1}
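
A hedged way to verify the remote secrets landed, assuming istioctl applied them to istio-system with the standard multi-cluster label:

for CTX in ${CTX_1} ${CTX_2}
do
kubectl get secrets --context=$CTX -n istio-system -l istio/multiCluster=true
done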

Install the test app per Install the HelloWorld service. Do not execute the next command until the current command succeeds.

for CTX in ${CTX_1} ${CTX_2}
do
kubectl create --context=${CTX} namespace sample
kubectl label --context=${CTX} namespace sample \
istio-injection- istio.io/rev=asm-managed-rapid --overwrite
done
kubectl create --context=${CTX_1} \
-f ${SAMPLES_DIR}/samples/helloworld/helloworld.yaml \
-l service=helloworld -n sample
kubectl create --context=${CTX_2} \
-f ${SAMPLES_DIR}/samples/helloworld/helloworld.yaml \
-l service=helloworld -n sample
kubectl create --context=${CTX_1} \
-f ${SAMPLES_DIR}/samples/helloworld/helloworld.yaml \
-l version=v1 -n sample
kubectl create --context=${CTX_2} \
-f ${SAMPLES_DIR}/samples/helloworld/helloworld.yaml \
-l version=v2 -n sample

Use Cloud Console to verify the helloworld-v* pods are running; they usually take about 3 minutes, and a CLI alternative is sketched below. Then install the sleep pods to launch the cross-cluster load-balancing tests by executing the following command.
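
As an alternative to Cloud Console, a minimal sketch that waits for the helloworld deployments to become available:

for CTX in ${CTX_1} ${CTX_2}
do
kubectl --context $CTX -n sample wait --for=condition=Available deployment --all --timeout=300s
done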

for CTX in ${CTX_1} ${CTX_2}
do
kubectl apply --context=${CTX} \
-f ${SAMPLES_DIR}/samples/sleep/sleep.yaml -n sample
done

Verify the pods are running in Cloud Console. Execute the following commands to hit the helloworld service from the sleep pods; helloworld.sample:5000 follows the $service_name.$namespace:$port format.

watch kubectl exec --context="${CTX_1}" -n sample -c sleep \
"$(kubectl get pod --context="${CTX_1}" -n sample -l \
app=sleep -o jsonpath='{.items[0].metadata.name}')" \
-- curl -sS helloworld.sample:5000/hello
watch kubectl exec --context="${CTX_2}" -n sample -c sleep \
"$(kubectl get pod --context="${CTX_2}" -n sample -l \
app=sleep -o jsonpath='{.items[0].metadata.name}')" \
-- curl -sS helloworld.sample:5000/hello

Observe the output:

Hello version: v2, instance: helloworld-v2-f6b55ccb5-jvzl6
Hello version: v1, instance: helloworld-v1-5774b8c4c9-p9cs5
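
Seeing both v1 and v2 from either cluster demonstrates cross-cluster load balancing. A minimal sketch that sends several requests from cluster 1's sleep pod, reusing the pod lookup from the commands above:

SLEEP_POD=$(kubectl get pod --context="${CTX_1}" -n sample -l app=sleep -o jsonpath='{.items[0].metadata.name}')
for i in $(seq 1 10)
do
kubectl exec --context="${CTX_1}" -n sample -c sleep "$SLEEP_POD" -- curl -sS helloworld.sample:5000/hello
done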

Create the Istio ingress gateway. The code blocks below have the following purposes. Do not execute the next command until the current command succeeds.

  1. Label the namespace created earlier with sidecar injection first.
  2. Create the ingress gateway
  3. Create a Gateway at port 80
  4. Create the NGINX deployment. Verify the deployment is running in Cloud Console.
  5. Create the ClusterIP service. Verify the created service and deployment are in the sample namespace and show green in Cloud Console. The command after exposing the service names the port so Anthos Service Mesh can collect metrics.
  6. Create a virtual service to route traffic from the gateway to the exposed ClusterIP service. Create another virtual service for the existing helloworld ClusterIP service in 1 cluster.
  7. Test the nginx and helloworld services at the gateway. Expect v1 or v2 of Hello version: v2, instance: helloworld-v2-f6b55ccb5-vdzdq and HTTP/1.1 200 OK
# 1
for CTX in ${CTX_1} ${CTX_2}
do
kubectl --context $CTX label namespace mesh istio-injection- istio.io/rev=asm-managed-rapid --overwrite
done
# 2
for CTX in ${CTX_1} ${CTX_2}
do
kubectl --context $CTX apply -n mesh -f $OUTPUT_DIR/samples/gateways/istio-ingressgateway
done
#3
for CTX in ${CTX_1} ${CTX_2}
do
cat <<EOF | kubectl apply --context $CTX -f -
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: http-80
  namespace: mesh
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
EOF
done
# 4
for CTX in ${CTX_1} ${CTX_2}
do
cat <<EOF | kubectl apply --context $CTX -n sample -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-nginx
spec:
  selector:
    matchLabels:
      run: test-nginx
  replicas: 1
  template:
    metadata:
      labels:
        run: test-nginx
    spec:
      containers:
      - name: test-nginx
        image: nginx
        ports:
        - containerPort: 80
EOF
done
# 5
for CTX in ${CTX_1} ${CTX_2}
do
kubectl --context $CTX -n sample expose deployment/test-nginx && \
kubectl --context $CTX -n sample patch service test-nginx --type='json' -p='[{"op": "replace", "path": "/spec/ports/0/name", "value": "http"}]'
done
# 6
for CTX in ${CTX_1} ${CTX_2}
do
cat <<EOF | kubectl apply --context $CTX -f -
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-nginx
  namespace: mesh
spec:
  hosts:
  - '*'
  gateways:
  - mesh/http-80
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        port:
          number: 80
        host: test-nginx.sample.svc.cluster.local
      weight: 100
EOF
done
cat <<EOF | kubectl apply --context $CTX -n mesh -f -
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: helloworld
spec:
  hosts:
  - '*'
  gateways:
  - mesh/http-80
  http:
  - match:
    - uri:
        prefix: /hello
    route:
    - destination:
        port:
          number: 5000
        host: helloworld.sample.svc.cluster.local
      weight: 100
EOF
# 7
INGRESS_IP=$(kubectl -n mesh get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "gateway IP is [$INGRESS_IP]"
watch curl http://$INGRESS_IP/hello
for CTX in ${CTX_1} ${CTX_2}
do
export INGRESS_IP=$(kubectl --context $CTX -n mesh get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "gateway IP is [$INGRESS_IP]"
curl -I http://$INGRESS_IP/
done

Create a new istio-ingressgateway with internal load balancing.

  1. Verify the istio ingress gateway yaml files exist in the asmcli installation output folder.
  2. Create the kustomization main file
  3. Create the service patch file
  4. Execute the kubectl kustomize command. Inspect that the generated YAML has the patch applied.
  5. Remove the kustomization-related files and overwrite the original service.yaml.
  6. Create a new namespace for the internal istio ingress gateway
  7. Create the internal istio ingress gateway in the 2 clusters.
  8. Test the nginx and helloworld services at the gateway. Expect v1 or v2 of Hello version: v2, instance: helloworld-v2-f6b55ccb5-vdzdq and HTTP/1.1 200 OK
# 1
GW_PATH=$OUTPUT_DIR/samples/gateways/istio-ingressgateway
ls $GW_PATH/ # verify path exists
cp -r $GW_PATH $GW_PATH/../ilb-istio-ingressgateway
GW_PATH=$GW_PATH/../ilb-istio-ingressgateway
# 2
cat << EOF > $GW_PATH/kustomization.yaml
resources:
- service.yaml
patchesStrategicMerge:
- ilb-service.yaml
EOF
# 3
cat << EOF > $GW_PATH/ilb-service.yaml
# kubectl kustomize $GW_PATH/
# files needed for the patch: kustomization.yaml, service.yaml

apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
    networking.gke.io/internal-load-balancer-allow-global-access: "true"
EOF
# 4
cp -r $GW_PATH $GW_PATH/../backup-istio-ingressgateway && \
kubectl kustomize $GW_PATH/ > $GW_PATH/internal-service.yaml
cat $GW_PATH/internal-service.yaml # content should have "Internal"
# 5
rm $GW_PATH/ilb-service.yaml $GW_PATH/kustomization.yaml && \
mv $GW_PATH/internal-service.yaml $GW_PATH/service.yaml
# 6
for CTX in ${CTX_1} ${CTX_2}
do
kubectl --context $CTX create ns ilb-mesh && kubectl --context $CTX label namespace ilb-mesh istio-injection- istio.io/rev=asm-managed-rapid --overwrite
done
# 7
for CTX in ${CTX_1} ${CTX_2}
do
kubectl --context $CTX apply -n ilb-mesh -f $GW_PATH
done
# 8
INGRESS_IP=$(kubectl -n ilb-mesh get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "internal gateway IP is [$INGRESS_IP]"
watch curl http://$INGRESS_IP/hello
for CTX in ${CTX_1} ${CTX_2}
do
export INGRESS_IP=$(kubectl --context $CTX -n ilb-mesh get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "internal gateway IP is [$INGRESS_IP]"
curl -I http://$INGRESS_IP/
done

Clean up by deleting the created cloud resources.

# Clean up by deleting istio-ingressgateway
for CTX in ${CTX_1} ${CTX_2}
do
kubectl --context $CTX delete -n mesh -f $OUTPUT_DIR/samples/gateways/istio-ingressgateway && kubectl --context $CTX delete -n ilb-mesh -f $GW_PATH
done
# Delete GKE hub memberships
gcloud container hub memberships list
gcloud container hub memberships delete $CLUSTER_1 && gcloud container hub memberships delete $CLUSTER_2
# delete GKE clusters
gcloud container clusters delete $CLUSTER_1 --region $LOCATION_1
gcloud container clusters delete $CLUSTER_2 --region $LOCATION_2
# Delete the firewall rules
gcloud compute firewall-rules delete asm-multicluster-pods
# Delete cloud routers and NAT
for REGION in $LOCATION_1 $LOCATION_2
do
gcloud compute routers delete nat-router-$REGION --region $REGION
done
