Multi-repo GKE Config sync with multi-cluster Anthos service mesh

Overview

Hil Liao
12 min read · Jan 8, 2021

Google Cloud has a great product called Anthos. Its best feature is the ability to extend Kubernetes Engine clusters to other cloud providers or to VMware vSphere in an on-premises data center. The licensing fee costs a few thousand dollars monthly per project. If you don't need multi-cloud and don't have VMware vSphere on premises, this article shows you how to configure a GKE cluster with the Istio service mesh (comparable to Anthos Service Mesh) and GKE Config Sync (comparable to Anthos Config Management).

Configuration: cluster creation

The steps are derived from Syncing from multiple repositories, using GKE workload identity as the authentication method to the git repository during the Config Sync operator deployment. First, create a GKE cluster with the Istio feature. Fill in the environment variables in the following script and execute it as a project editor.

PROJECT_ID=
GKE_NAME=istio-2
ZONE=us-central1-c
NET=default
SUBNET=default
gcloud beta container --project $PROJECT_ID clusters create $GKE_NAME --zone $ZONE --no-enable-basic-auth \
--machine-type "e2-medium" --image-type "COS" --disk-type "pd-ssd" --disk-size "50" \
--metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/cloud-platform" --max-pods-per-node "110" \
--num-nodes "3" --enable-stackdriver-kubernetes --enable-ip-alias --network $NET --subnetwork $SUBNET \
--default-max-pods-per-node "110" --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing,Istio \
--istio-config auth=MTLS_PERMISSIVE --no-enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 \
--autoscaling-profile optimize-utilization --workload-pool "$PROJECT_ID.svc.id.goog"

Upon successful cluster creation, follow the 2 steps to deploy the Config Sync operator (sketched below). You should expect the following command-line output:
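A minimal sketch of those two steps, assuming the operator manifest location Google's documentation pointed to at the time; verify the path against the current Config Sync docs:

# download the Config Sync operator manifest, then apply it
gsutil cp gs://config-management-release/released/latest/config-sync-operator.yaml config-sync-operator.yaml
kubectl apply -f config-sync-operator.yaml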

customresourcedefinition.apiextensions.k8s.io/configmanagements.configmanagement.gke.io created
clusterrole.rbac.authorization.k8s.io/config-management-operator created
clusterrolebinding.rbac.authorization.k8s.io/config-management-operator created
serviceaccount/config-management-operator created
deployment.apps/config-management-operator created
namespace/config-management-system created

Configuration: ConfigManagement

Follow the steps in Configuring syncing from the root repository: save the following content as config-management-multi-repo.yaml with the correct $GKE_NAME and execute kubectl apply -f config-management-multi-repo.yaml.

apiVersion: configmanagement.gke.io/v1
kind: ConfigManagement
metadata:
  name: config-management
spec:
  # clusterName is required and must be unique among all managed clusters
  clusterName: $GKE_NAME
  enableMultiRepo: true

Upon success, the cluster name specified in the YAML needs to match the metadata.name of the corresponding kind: Cluster object in the git repository's clusterregistry folder; ClusterSelectors then select clusters by the labels on those objects.
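For reference, a minimal sketch of such a cluster registration object in the clusterregistry folder; the labels are illustrative and are what the ClusterSelector matches on later:

# clusterregistry/cluster-$GKE_NAME.yaml
apiVersion: clusterregistry.k8s.io/v1alpha1
kind: Cluster
metadata:
  name: $GKE_NAME
  labels:
    num: "1"
    customer: hil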

Configuration: RootSync for cluster resources

Next, apply the following kind: RootSync to synchronize cluster-scoped resources from the git repository. The repository name is gke-config; the branch is master; the directory is config-management. Notice that secretRef is commented out because we are about to use GKE workload identity for git authentication. Set the correct $PROJECT_ID.

apiVersion: configsync.gke.io/v1alpha1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  git:
    repo: https://source.developers.google.com/p/$PROJECT_ID/r/gke-config
    revision: HEAD
    branch: master
    dir: "config-management"
    auth: gcenode
    # secretRef:
    #   name: SECRET_NAME

The root-reconciler deployment's git-sync container immediately logs the following errors in Cloud Logging.

\"msg\"=\"unexpected error syncing repo, will retry\" \"error\"=\"error running command: exit status 128: { stdout: \\\"\\\", stderr: \\\"Cloning into '/repo/root'...\\\\nremote: INVALID_ARGUMENT: Request contains an invalid argument\\\\nremote: [type.googleapis.com/google.rpc.LocalizedMessage]\\\\nremote: locale: \\\\\\\"en-US\\\\\\\"\\\\nremote: message: \\\\\\\"Invalid authentication credentials. Please generate a new identifier: https://source.developers.google.com/new-password\\\\\\\"\\\\nremote: \\\\nremote: [type.googleapis.com/google.rpc.RequestInfo]\\\\nremote: request_id: \\\\\\\"ff5ae9c2a19d4f97a786c2e88f5ddd68\\\\\\\"\\\\nfatal: unable to access 'https://source.developers.google.com/p/$PROJECT_ID/r/gke-config/': The requested URL returned error: 400\\\\n\\\" }\"  "
}

The solution is to create a Google service account such as gke-git-sync, bind the Source Repository Reader IAM role on the gke-config repository, and execute the following command to enable GKE workload identity. Observe that the root-reconciler deployment has serviceAccountName: root-reconciler. If you are configuring more than 1 cluster in the same project, the gcloud iam command should already have created the roles/iam.workloadIdentityUser binding during the 1st cluster's configuration. Subsequent clusters' configuration can just execute the kubectl annotate command, although re-executing the gcloud iam command should not hurt.

PROJECT_ID=
KSA=root-reconciler
GCP_SA=gke-git-sync@$PROJECT_ID.iam.gserviceaccount.com
GCP_SA_PROJECT=$PROJECT_ID
k8s_namespace=config-management-system
gcloud iam service-accounts add-iam-policy-binding \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:$PROJECT_ID.svc.id.goog[config-management-system/$KSA]" \
$GCP_SA --project $GCP_SA_PROJECT && \
kubectl annotate serviceaccount \
--namespace $k8s_namespace \
$KSA \
iam.gke.io/gcp-service-account=$GCP_SA

Within 2 minutes, observe the following log from the root-reconciler deployment’s git-sync container as a sign of success.

"message": " \"level\"=5 \"msg\"=\"running command\"  \"cmd\"=\"git clone --no-checkout -b master --depth 1 https://source.developers.google.com/p/$PROJECT_ID/r/gke-config /repo/root\" \"cwd\"=\"\"",
<<REDACTED>>

If the git repository has any cluster-scoped resources such as a ClusterRole or a Namespace, verify that they were created.
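For example, a quick spot check with kubectl; what to look for depends on what your root repository contains:

# confirm synced cluster-scoped resources exist
kubectl get clusterroles
kubectl get namespaces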

Configuration: RepoSync for namespace scoped resources

Following the instructions in Configuring syncing from namespace repositories, create a directory for the namespace under the namespaces folder containing the following 3 files. In the example below, the namespace is hil. The repo-sync has git pointing to the same repository, gke-config, but to the HEAD of a different branch, ns-hil.

# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: hil
  labels:
    istio-injection: enabled

# repo-sync.yaml
apiVersion: configsync.gke.io/v1alpha1
kind: RepoSync
metadata:
  name: repo-sync
  namespace: hil
spec:
  git:
    repo: https://source.developers.google.com/p/$PROJECT_ID/r/gke-config
    revision: HEAD
    branch: ns-hil
    dir: "config-management"
    auth: gcenode

# sync-rolebinding.yaml
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: syncs-repo
  namespace: hil
subjects:
- kind: ServiceAccount
  name: ns-reconciler-hil
  namespace: config-management-system
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io

Upon pushing the commit to the master branch, a deployment named ns-reconciler-hil (in the format ns-reconciler-$NAMESPACE) gets created. Its git-sync container shows the following error in Cloud Logging every 15 seconds, very similar to the prior error in the root-reconciler deployment's git-sync container.

"message": " \"msg\"=\"unexpected error syncing repo, will retry\" \"error\"=\"error running command: exit status 128: { stdout: \\\"\\\", stderr: \\\"Cloning into '/repo/root'...\\\\nremote: INVALID_ARGUMENT: Request contains an invalid argument
<<REDACTED>>
"Invalid authentication credentials. Please generate a new identifier: https://source.developers.google.com/new-password\\\\\\\"\\\\nremote: \\\\nremote: [type.googleapis.com/google.rpc.RequestInfo]\\\\nremote: request_id: \\\\\\\"<<REDACTED>>\\\\\\\"\\\\nfatal: unable to access 'https://source.developers.google.com/p/$PROJECT_ID/r/gke-config/': The requested URL returned error: 400\\\\n\\\" }\" "

The reason is the lack of a GKE workload identity configuration on the Kubernetes service account ns-reconciler-hil created for the namespace repo sync. Execute the following commands to enable it. Again, subsequent clusters' configuration can just execute the kubectl annotate command, although re-executing the gcloud iam command should not hurt.

PROJECT_ID=
KSA=ns-reconciler-hil
GCP_SA=gke-git-sync@$PROJECT_ID.iam.gserviceaccount.com
GCP_SA_PROJECT=$PROJECT_ID
k8s_namespace=config-management-system
gcloud iam service-accounts add-iam-policy-binding \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:$PROJECT_ID.svc.id.goog[config-management-system/$KSA]" \
$GCP_SA --project $GCP_SA_PROJECT && \
kubectl annotate serviceaccount \
--namespace $k8s_namespace \
$KSA \
iam.gke.io/gcp-service-account=$GCP_SA

Within 2 minutes, observe that the errors stop and info-level logs appear in Cloud Logging.

"message": " \"level\"=5 \"msg\"=\"running command\"  \"cmd\"=\"git clone --no-checkout -b ns-hil --depth 1 https://source.developers.google.com/p/$PROJECT_ID/r/gke-config /repo/root\" \"cwd\"=\"\"",

If you observe the following error in the root-reconciler-* pod’s reconciler container, some constraint templates have not been installed.

KNV1021: No CustomResourceDefinition is defined for the type "K8sRequiredLabels.constraints.gatekeeper.sh" in the cluster

Verify that any namespace-scoped resources, such as the KSA, were created. See the Open Policy Agent gatekeeper section below to install the missing constraint templates.

ClusterSelector works only in the root repository

The current limitations show that ClusterSelectors only work in the root repository. The namespace-reader ClusterRole is an example of creating a cluster resource in only the selected clusters. Replace $GKE_NAME-1 with the ConfigManagement's clusterName. The cluster selector selects clusters with the labels num: "1" AND customer: hil, which is the $GKE_NAME-1 cluster. To effectively test ClusterSelectors, create another cluster $GKE_NAME-2 and configure it with the steps above. Observe that the 2nd cluster does not have the namespace-reader ClusterRole but does have the namespace hil. The reason is that the namespace hil does not have a configmanagement.gke.io/cluster-selector annotation. A sketch of the pieces involved follows.
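A minimal sketch of the selector and the selected cluster resource, assuming the Cluster objects registered in the clusterregistry folder carry the labels shown earlier; the file and selector names are illustrative:

# clusterregistry/selector-customer-hil-1.yaml
apiVersion: configmanagement.gke.io/v1
kind: ClusterSelector
metadata:
  name: selector-customer-hil-1
spec:
  selector:
    matchLabels:
      num: "1"
      customer: hil

# cluster/namespace-reader.yaml -- created only in clusters matched by the selector
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: namespace-reader
  annotations:
    configmanagement.gke.io/cluster-selector: selector-customer-hil-1
rules:
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["get", "list", "watch"]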

Open Policy Agent gatekeeper as a Policy Controller

Anthos has Policy Controller, which lets cluster administrators define constraints based on constraint templates Google provides. The Open Policy Agent Gatekeeper achieves a similar goal. With a project editor IAM role binding, I executed the command from the Gatekeeper README page to install it. Then I installed the container resource limit and required labels templates.

# install Open Policy Agent Gatekeeper
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/release-3.2/deploy/gatekeeper.yaml
# verify the gatekeeper-system namespace and its pods were created
kubectl get pods -n gatekeeper-system
# install some constraint templates
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper-library/master/library/general/containerlimits/template.yaml && \
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper-library/master/library/general/requiredlabels/template.yaml
# verify the templates are installed
kubectl get ConstraintTemplates

Ideally, push a commit of the constraint to the RootSync's branch in the cluster/ directory per Google's constraint format. You'd expect the constraint to be created in the selected clusters. Administrators can also execute the kubectl apply -f command to install the constraints for experimental purposes. Verify the constraints with the command:

kubectl describe constraints pod-resource-limits
### omitted ###
Spec:
  Match:
    Kinds:
      API Groups:
      Kinds:       Pod
    Namespaces:    hil
  Parameters:
    Cpu:     1
    Memory:  2.5Gi
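The pod-resource-limits constraint behind the output above could be defined roughly as follows, a sketch assuming the K8sContainerLimits kind from the containerlimits template installed earlier; the values mirror the describe output:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sContainerLimits
metadata:
  name: pod-resource-limits
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
    namespaces:
    - hil
  parameters:
    cpu: "1"
    memory: "2.5Gi"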

Creating the following pod in the hil namespace will be rejected but not in the default namespace:

apiVersion: v1
kind: Pod
metadata:
  name: bad-limit
  labels:
    app: test-constraint
spec:
  containers:
  - name: bad-limit
    image: gcr.io/google.com/cloudsdktool/cloud-sdk:latest
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
    resources:
      limits:
        cpu: "2"
        memory: "25Gi"
      requests:
        cpu: "0.1"
        memory: "100Mi"

Expect the following error:

kubectl apply -f pod.yaml  -n hil                                                                                                                                                             
Error from server ([denied by pod-resource-limits] container <bad-limit> cpu limit <2> is higher than the maximum allowed of <1>
[denied by pod-resource-limits] container <istio-proxy> cpu limit <2> is higher than the maximum allowed of <1>): error when creating "pod.yaml": admission webhook "validation.gatekeeper.sh" denied the request: [denied by pod-resource-limits] container <bad-limit> cpu limit <2> is higher than the maximum allowed of <1>
[denied by pod-resource-limits] container <istio-proxy> cpu limit <2> is higher than the maximum allowed of <1>
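Similarly, the requiredlabels template installed above (kind K8sRequiredLabels, the type named in the earlier KNV1021 error) backs constraints like this sketch, which would require an owner label on namespaces; the constraint name and label key are illustrative:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-owner
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Namespace"]
  parameters:
    labels:
    - key: owner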

Multi-cluster service mesh on GKE with shared control-plane, single-VPC architecture

Similar to multi-cluster Ingress in GKE, it's possible to create a new GKE cluster that connects to an existing GKE cluster's Istio control plane, as described in Building a multi-cluster service mesh on GKE with shared control-plane, single-VPC architecture for Istio 1.4. While the Kubernetes services can be deployed to different clusters, the virtual service can point to hostnames in either cluster.

Problems with the Istio 1.4 multi-cluster service mesh

Let's call the GKE cluster with the Istio control plane the control cluster and the other GKE cluster the remote cluster. There are problems:

  1. A deployment in the remote cluster can't resolve a Kubernetes service in the control cluster at $SVC_NAME.$NAMESPACE.svc.cluster.local. That arguably makes sense, as both clusters could have namespaces with identical names.
  2. As the pod IP addresses of istio-pilot, istio-policy, and istio-telemetry change during cluster autoscaling, the remote cluster's istio-proxy containers will log the following error:
Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?)
[Envoy (Epoch 0)] [2021-01-21 01:02:57.958][13][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure

Anthos Service Mesh, following the latest Istio multi-cluster installation docs, appears to have solved this problem. However, GKE's Istio feature was still using Istio 1.4.10 at the time of this writing. To use Istio 1.8 with multi-cluster support, continue reading through the following section.

Install Anthos service mesh without Anthos entitlement on 2 GKE clusters

The latest Anthos Service Mesh 1.8 does not require an Anthos entitlement to install on GKE clusters. It supports a multi-cluster service mesh across projects on private-IP clusters. The example below uses 2 public clusters in different zones to demonstrate a multi-cluster setup.

  • Create 2 GKE clusters in different zones
PROJECT_ID=
GKE_NAME=$CLUS_NAME1
ZONE=$CLUS_NAME1_ZONE
NET=
SUBNET=

gcloud beta container --project $PROJECT_ID clusters create $GKE_NAME --zone $ZONE --no-enable-basic-auth \
--machine-type "e2-standard-4" --preemptible --image-type "COS" --disk-type "pd-ssd" --disk-size "50" \
--metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/cloud-platform" --max-pods-per-node "110" \
--num-nodes "4" --enable-stackdriver-kubernetes --enable-ip-alias --network $NET --subnetwork $SUBNET \
--default-max-pods-per-node "110" --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing \
--no-enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 \
--autoscaling-profile optimize-utilization --workload-pool "$PROJECT_ID.svc.id.goog"
  • Download the Istio release binary to create SSL certificates. You'd need the CA certificate, CA key, root certificate, and certificate chain files.
~/Downloads/istio-1.8.2/certs$ make -f ../tools/certs/Makefile.selfsigned.mk root-ca
~/Downloads/istio-1.8.2/certs$ make -f ../tools/certs/Makefile.selfsigned.mk asm-cacerts
~/Downloads/istio-1.8.2/certs$ ll
total 32
drwxrwxr-x 3 hil hil 4096 Jan 27 13:54 ./
drwxr-x--- 7 hil hil 4096 Jan 27 13:53 ../
drwxrwxr-x 2 hil hil 4096 Jan 27 13:54 asm/
-rw-rw-r-- 1 hil hil 362 Jan 27 13:53 root-ca.conf
-rw-rw-r-- 1 hil hil 1712 Jan 27 13:53 root-cert.csr
-rw-rw-r-- 1 hil hil 1822 Jan 27 13:53 root-cert.pem
-rw-rw-r-- 1 hil hil 41 Jan 27 13:54 root-cert.srl
-rw------- 1 hil hil 3243 Jan 27 13:53 root-key.pem
hil@xeon:~/Downloads/istio-1.8.2/certs$ ll asm
total 24
drwxrwxr-x 2 hil hil 4096 Jan 27 13:54 ./
drwxrwxr-x 3 hil hil 4096 Jan 27 13:54 ../
-rw-rw-r-- 1 hil hil 1903 Jan 27 13:54 ca-cert.pem
-rw------- 1 hil hil 3247 Jan 27 13:54 ca-key.pem
-rw-rw-r-- 1 hil hil 3725 Jan 27 13:54 cert-chain.pem
-rw-rw-r-- 1 hil hil 1822 Jan 27 13:54 root-cert.pem
hil@xeon:~/Downloads/istio-1.8.2/certs$ mkdir -p ~/anthos-install && cd ~/anthos-install && mv ~/Downloads/istio-1.8.2/certs . # make sure your working directory contains the certs dir
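# The install_asm script itself is downloaded separately. A sketch, assuming the
# ASM 1.8 artifact location documented at the time -- verify the current URL and
# checksum in Google's Anthos Service Mesh installation docs before running:
curl https://storage.googleapis.com/csm-artifacts/asm/install_asm_1.8 > install_asm
chmod +x install_asm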
$ ./install_asm \
--project_id $PROJECT_ID \
--cluster_name $CLUS_NAME1 \
--cluster_location $CLUS_NAME1_ZONE \
--mode install \
--ca citadel \
--ca_cert certs/asm/ca-cert.pem \
--ca_key certs/asm/ca-key.pem \
--root_cert certs/root-cert.pem \
--cert_chain certs/asm/cert-chain.pem \
--enable_all --enable_cluster_labels
# Sample command output
set 7 field(s) of setter "gcloud.project.environProjectID" to value "null"
asm/
set 2 field(s) of setter "anthos.servicemesh.hubMembershipID" to value ""
install_asm: Installing validation webhook fix...
service/istiod created
install_asm: Installing ASM control plane...
  • [Preview (Mesh CA with environ)] Skip this section if you don't care about using Mesh CA. Two months after publishing this article, I tested ASM 1.9 with private clusters that have public master IPs, Mesh CA with environ (fleet) in preview, ingress allowed from all internal IP ranges, and the implied allow-all egress; the installation succeeded. I had to struggle because a lack of connectivity for the Istiod pods caused the cross-cluster load balancing to fail. The errors are in the istio-asm-* deployment pods' logs at info severity. Create Cloud NATs in the clusters' regions. Don't just test curl google.com -I (which returns 200 when Private Google Access is enabled in the subnet); instead execute curl docker.com -I from any pod in the cluster to simulate the required egress from the cluster's Istiod pods to the other cluster's master IP. Verify the implied allow-all egress firewall rule is effective. Create a firewall ingress rule to allow all possible internal IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) to all instances in the network; a sketch follows after the commands below.
# create the cloud routers
REGION=us-west1 && VPC=hil-vpc && gcloud compute routers create nat-router-$REGION --network $VPC --region $REGION && gcloud compute routers nats create nat-config-$REGION --router-region $REGION --router nat-router-$REGION --nat-all-subnet-ip-ranges --auto-allocate-nat-external-ips
# Create the GKE cluster
PROJECT_ID=
NET=
SUBNET=
GKE_NAME=
MASTER_IP="172.22.16.0/28"
REGION=
gcloud beta container --project $PROJECT_ID clusters create $GKE_NAME --region $REGION --no-enable-basic-auth \
--machine-type e2-standard-4 --image-type COS --disk-type pd-ssd --disk-size 100 \
--metadata disable-legacy-endpoints=true --scopes https://www.googleapis.com/auth/cloud-platform --num-nodes 1 \
--enable-stackdriver-kubernetes --enable-private-nodes --master-ipv4-cidr $MASTER_IP --enable-master-global-access \
--enable-ip-alias --network projects/$PROJECT_ID/global/networks/$NET --subnetwork projects/$PROJECT_ID/regions/$REGION/subnetworks/$SUBNET \
--preemptible --default-max-pods-per-node 110 --no-enable-master-authorized-networks \
--addons HorizontalPodAutoscaling,HttpLoadBalancing,CloudRun,GcePersistentDiskCsiDriver --workload-pool $PROJECT_ID.svc.id.goog
./install_asm --project_id $PROJECT_ID --cluster_name $GKE_NAME --cluster_location $REGION --mode install --enable_all --enable_cluster_labels --option hub-meshca --option egressgateways
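The ingress firewall rule mentioned above could be created roughly as follows; the rule name and the $VPC variable are illustrative:

gcloud compute firewall-rules create allow-internal-all --network $VPC --direction INGRESS --action ALLOW --rules all --source-ranges 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16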
  • Repeat the installation steps on the other GKE cluster $CLUS_NAME2 in the same project with the same CA certificates. Make sure you later put the certs folder in Google Cloud Storage and secure it (see the gsutil sketch after the command below).
./install_asm \
--project_id $PROJECT_ID \
--cluster_name $CLUS_NAME2 \
--cluster_location $CLUS_NAME2_ZONE \
--mode install \
--ca citadel \
--ca_cert certs/asm/ca-cert.pem \
--ca_key certs/asm/ca-key.pem \
--root_cert certs/root-cert.pem \
--cert_chain certs/asm/cert-chain.pem \
--enable_all --enable_cluster_labels
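To preserve the CA material for future clusters, one option (a sketch; the bucket name is illustrative) is to copy the certs folder into a locked-down Cloud Storage bucket:

gsutil mb -p $PROJECT_ID gs://$PROJECT_ID-asm-certs && gsutil cp -r certs gs://$PROJECT_ID-asm-certs/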
# create the gateway. Only execute 1 of [kubectl*, cat*]
kubectl apply -f https://github.com/hilliao/enterprise-solutions/raw/master/googlecloud/canary/gateway.yaml || \
cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: http-gw
spec:
  selector:
    istio: ingressgateway # use istio default ingress gateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
EOF
# create the virtual service
cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  annotations:
  name: default-vs
spec:
  gateways:
  - http-gw
  - https-gw
  hosts:
  - '*'
  http:
  - match:
    - uri:
        prefix: /sample/
    name: helloworld-sample
    rewrite:
      uri: /
    route:
    - destination:
        host: helloworld.sample.svc.cluster.local
        port:
          number: 5000
EOF
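The virtual service above assumes the Istio helloworld sample backs helloworld.sample.svc.cluster.local, with v1 in one cluster and v2 in the other. A sketch of that deployment, following Istio's multi-cluster verification steps; it assumes CTX_1 and CTX_2 are the two clusters' kubectl contexts and that the Istio 1.8.2 release directory is the working directory:

for CTX in ${CTX_1} ${CTX_2}; do
  kubectl --context=${CTX} create namespace sample
  # with a revisioned ASM install, label with istio.io/rev=<revision> instead of istio-injection=enabled
  kubectl --context=${CTX} label namespace sample istio-injection=enabled
  kubectl --context=${CTX} apply -n sample -f samples/helloworld/helloworld.yaml -l service=helloworld
done
# v1 in the first cluster, v2 in the second so the curl tests below return alternating versions
kubectl --context=${CTX_1} apply -n sample -f samples/helloworld/helloworld.yaml -l version=v1
kubectl --context=${CTX_2} apply -n sample -f samples/helloworld/helloworld.yaml -l version=v2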
  • Execute the curl commands against the istio-ingressgateway's IP to test. Investigate and fix any errors when executing the commands. Do not execute the next command until the current command succeeds. Observe v1 and v2 pods responding from different clusters in the HTTP responses.
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
curl http://${INGRESS_HOST}:${INGRESS_PORT}/sample/hello
watch time curl http://${INGRESS_HOST}:${INGRESS_PORT}/sample/hello
export INGRESS_HOST_1=$(kubectl --context=${CTX_1} -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}') && export INGRESS_PORT_1=$(kubectl --context=${CTX_1} -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}') && watch time curl http://${INGRESS_HOST_1}:${INGRESS_PORT_1}/sample/hello
export INGRESS_HOST_2=$(kubectl --context=${CTX_2} -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}') && export INGRESS_PORT_2=$(kubectl --context=${CTX_2} -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}') && watch time curl http://${INGRESS_HOST_2}:${INGRESS_PORT_2}/sample/hello
watch -n 1 "time curl http://${INGRESS_HOST_1}:${INGRESS_PORT_1}/sample/hello && time curl http://${INGRESS_HOST_2}:${INGRESS_PORT_2}/sample/hello"
