fetch k8s_container | metric 'kubernetes.io/anthos/grpc_server_handled_total' | align rate(1m) | every 1m
An alert for a gRPC code other than OK can be safely ignored if the code is not one of the following: Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded
Workaround:
To configure Prometheus to ignore these false positive alerts, review the following options:
1. Scale down the monitoring-operator deployment to 0 replicas so that the modifications can persist.
2. Modify the prometheus-config configmap, and add grpc_method!="Watch" to the etcdHighNumberOfFailedGRPCRequests alert config as shown in the following example:

   Original:

   rate(grpc_server_handled_total{cluster="CLUSTER_NAME",grpc_code!="OK",job=~".*etcd.*"}[5m])

   Modified:

   rate(grpc_server_handled_total{cluster="CLUSTER_NAME",grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded",grpc_method!="Watch",job=~".*etcd.*"}[5m])

   Replace CLUSTER_NAME with the name of your cluster.

You can also check the kube-apiserver log to debug further.

Egress NAT connections might be dropped after 5 to 10 minutes of a connection being established if there's no traffic.
As conntrack only matters in the inbound direction (external connections to the cluster), this issue only happens if the connection doesn't transmit any information for a while and then the destination side transmits something. If the egress NAT'd Pod always initiates the messaging, this issue won't be seen.
This issue occurs because the anetd garbage collection inadvertently removes conntrack entries that the daemon thinks are unused. An upstream fix was recently integrated into anetd to correct the behavior.
Workaround:
There is no easy workaround, and we haven't seen this issue in version 1.16. If you notice long-lived connections dropped due to this issue, you can work around it by running the workload on the same node as the egress IP address, or by consistently sending messages on the TCP connection.
spec.expirationSeconds is ignored when signing certificates

If you create a CertificateSigningRequest (CSR) with expirationSeconds set, the expirationSeconds is ignored.
Workaround:
If you're affected by this issue, you can update your user cluster by adding disableNodeIDVerificationCSRSigning: true to the user cluster configuration file and running the gkectl update cluster command to update the cluster with this configuration.
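For reference, a minimal sketch of the change (this assumes the field sits at the top level of the user cluster configuration file; check your configuration reference for exact placement):

    disableNodeIDVerificationCSRSigning: true

Then apply it:

    gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG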
gkectl update cluster fails when disabling bundled ingress

If you try to disable bundled ingress for an existing cluster, the gkectl update cluster command fails with an error similar to the following example:
[FAILURE] Config: ingress IP is required in user cluster spec
This error happens because gkectl checks for a load balancer ingress IP address during preflight checks. Although this check isn't required when disabling bundled ingress, the gkectl preflight check fails when disableBundledIngress is set to true.
Workaround:
Use the --skip-validation-load-balancer parameter when you update the cluster, as shown in the following example:
gkectl update cluster \
    --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    --config USER_CLUSTER_CONFIG \
    --skip-validation-load-balancer
For more information, see how to disable bundled ingress for an existing cluster.
If you rotate admin cluster certificate authority (CA) certificates, subsequent attempts to run the gkectl update admin
command fail. The error returned is similar to the following:
failed to get last CARotationStage: configmaps "ca-rotation-stage" not found
Workaround:
If you're affected by this issue, you can update your admin cluster by using the --disable-update-from-checkpoint
flag with the gkectl update admin
command:
gkectl update admin --config ADMIN_CONFIG_file \
    --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    --disable-update-from-checkpoint
When you use the --disable-update-from-checkpoint
flag, the update command doesn't use the checkpoint file as the source of truth during the cluster update. The checkpoint file is still updated for future use.
During preflight checks, the CSI Workload validation check installs a Pod in the default
namespace. The CSI Workload Pod validates that the vSphere CSI Driver is installed and can do dynamic volume provisioning. If this Pod doesn't start, the CSI Workload validation check fails.
There are a few known issues that can prevent this Pod from starting:

- If resource limits are enforced but not defined for the Pod, the CSI Workload Pod doesn't start.
- If Istio sidecar injection is enabled in the default namespace, the CSI Workload Pod doesn't start.

If the CSI Workload Pod doesn't start, you see a timeout error like the following during preflight validations:
- [FAILURE] CSI Workload: failure in CSI Workload validation: failed to create writer Job to verify the write functionality using CSI: Job default/anthos-csi-workload-writer-<run-id> replicas are not in Succeeded phase: timed out waiting for the condition
To see if the failure is caused by missing Pod resource limits, run the following command to check the anthos-csi-workload-writer-<run-id> job status:

kubectl describe job anthos-csi-workload-writer-<run-id>
If the resource limits aren't set properly for the CSI Workload Pod, the job status contains an error message like the following:
CPU and memory resource limits is invalid, as it are not defined for container: volume-tester
If the CSI Workload Pod doesn't start because of Istio sidecar injection, you can temporarily disable the automatic Istio sidecar injection in the default
namespace. Check the labels of the namespace and use the following command to delete the label that starts with istio.io/rev:
kubectl label namespace default istio.io/rev-
If the Pod is misconfigured, manually verify that dynamic volume provisioning with the vSphere CSI Driver works:

1. Create a PVC that uses the standard-rwo StorageClass and confirm that it binds (see the sketch after this list).
2. If dynamic volume provisioning with the vSphere CSI Driver works, run gkectl diagnose or gkectl upgrade with the --skip-validation-csi-workload flag to skip the CSI Workload check.
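As a quick check for step 1, here is a minimal PVC sketch (csi-check-pvc is a hypothetical name, not part of the original instructions):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: csi-check-pvc
      namespace: default
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: standard-rwo
      resources:
        requests:
          storage: 1Gi

Apply it with kubectl apply -f, then confirm the PVC reaches the Bound state with kubectl get pvc csi-check-pvc -n default.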
When you are logged on to a user-managed admin workstation, the gkectl update cluster
command might timeout and fail to update the user cluster. This happens if the admin cluster version is 1.15 and you run gkectl update admin
before you run gkectl update cluster. When this failure happens, you see the following error when trying to update the cluster:
Preflight check failed with failed to run server-side preflight checks: server-side preflight checks failed: timed out waiting for the condition
During the update of a 1.15 admin cluster, the validation-controller
that triggers the preflight checks is removed from the cluster. If you then try to update the user cluster, the preflight check hangs until the timeout is reached.
Workaround:
1. Run the following command to redeploy the validation-controller:

   gkectl prepare --kubeconfig ADMIN_KUBECONFIG --bundle-path BUNDLE_PATH --upgrade-platform

2. After the prepare completes, run gkectl update cluster again to update the user cluster.

When you are logged on to a user-managed admin workstation, the gkectl create cluster command might timeout and fail to create the user cluster. This happens if the admin cluster version is 1.15. When this failure happens, you see the following error when trying to create the cluster:
Preflight check failed with failed to run server-side preflight checks: server-side preflight checks failed: timed out waiting for the condition
Because the validation-controller was added in 1.16, it is missing when you use a 1.15 admin cluster. The validation-controller is responsible for triggering the preflight checks, so when you try to create a user cluster, the preflight checks hang until the timeout is reached.
Workaround:
1. Run the following command to redeploy the validation-controller:

   gkectl prepare --kubeconfig ADMIN_KUBECONFIG --bundle-path BUNDLE_PATH --upgrade-platform

2. After the prepare completes, run gkectl create cluster again to create the user cluster.

When you upgrade an admin cluster from version 1.15.x to 1.16.x, or add a connect
, stackdriver
, cloudAuditLogging
, or gkeOnPremAPI
configuration when you update an admin cluster, the operation might be rejected by the admin cluster webhook. One of the following error messages might be displayed:
"projects for connect, stackdriver and cloudAuditLogging must be the same when specified during cluster creation."
"locations for connect, gkeOnPremAPI, stackdriver and cloudAuditLogging must be in the same region when specified during cluster creation."
"locations for stackdriver and cloudAuditLogging must be the same when specified during cluster creation."
An admin cluster update or upgrade requires the onprem-admin-cluster-controller
to reconcile the admin cluster in a kind cluster. When the admin cluster state is restored in the kind cluster, the admin cluster webhook can't distinguish whether the OnPremAdminCluster object is for an admin cluster creation or to resume operations for an update or upgrade. Some create-only validations are unexpectedly invoked on update and upgrade.
Workaround:
Add the onprem.cluster.gke.io/skip-project-location-sameness-validation: true
annotation to the OnPremAdminCluster
object:
1. Edit the onpremadminclusters custom resource:

   kubectl edit onpremadminclusters ADMIN_CLUSTER_NAME -n kube-system --kubeconfig ADMIN_CLUSTER_KUBECONFIG

   Replace the following:
   - ADMIN_CLUSTER_NAME: the name of the admin cluster.
   - ADMIN_CLUSTER_KUBECONFIG: the path of the admin cluster kubeconfig file.

2. Add the onprem.cluster.gke.io/skip-project-location-sameness-validation: true annotation and save the custom resource, as sketched below.
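A sketch of the annotation as it appears in the custom resource (metadata excerpt only):

    metadata:
      annotations:
        onprem.cluster.gke.io/skip-project-location-sameness-validation: "true"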
3. Depending on the operation, add the --disable-update-from-checkpoint flag to the update command, or the --disable-upgrade-from-checkpoint flag to the upgrade command. These flags are only needed for the next time that you run the update or upgrade command:

   gkectl update admin --config ADMIN_CONFIG_file --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
       --disable-update-from-checkpoint

   gkectl upgrade admin --config ADMIN_CONFIG_file --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
       --disable-upgrade-from-checkpoint
When you are logged on to a user-managed admin workstation, the gkectl delete cluster
command might timeout and fail to delete the user cluster. This happens if you have first run gkectl
on the user-managed workstation to create, update, or upgrade the user cluster. When this failure happens, you see the following error when trying to delete the cluster:
failed to wait for user cluster management namespace "USER_CLUSTER_NAME-gke-onprem-mgmt" to be deleted: timed out waiting for the condition
During deletion, a cluster first deletes all of its objects. The deletion of the Validation objects (that were created during the create, update, or upgrade) are stuck at the deleting phase. This happens because a finalizer blocks the object's deletion, which causes cluster deletion to fail.
Workaround:
1. List the Validation objects:

   kubectl --kubeconfig ADMIN_KUBECONFIG get validations \
       -n USER_CLUSTER_NAME-gke-onprem-mgmt

2. For each Validation object, remove the finalizer:

   kubectl --kubeconfig ADMIN_KUBECONFIG patch validation/VALIDATION_OBJECT_NAME \
       -n USER_CLUSTER_NAME-gke-onprem-mgmt -p '{"metadata":{"finalizers":[]}}' --type=merge
After removing the finalizer from all Validation objects, the objects are removed and the user cluster delete operation completes automatically. You don't need to take additional action.
If the source Pod and egress NAT gateway Pod are on two different worker nodes, traffic from the source Pod can't reach any external services. If the Pods are located on the same host, the connection to external service or application is successful.
This issue is caused by vSphere dropping VXLAN packets when tunnel aggregation is enabled. There's a known issue with NSX and VMware that only sends aggregated traffic on known VXLAN ports (4789).
Workaround:
Change the VXLAN port used by Cilium to 4789
:
1. Edit the cilium-config ConfigMap:

   kubectl edit cm -n kube-system cilium-config --kubeconfig USER_CLUSTER_KUBECONFIG

2. Add the following line to the cilium-config ConfigMap:

   tunnel-port: 4789

3. Restart the anetd DaemonSet:

   kubectl rollout restart ds anetd -n kube-system --kubeconfig USER_CLUSTER_KUBECONFIG
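To confirm the change took effect after the restart, you can read the value back (a generic kubectl check, not part of the original steps):

    kubectl get cm cilium-config -n kube-system --kubeconfig USER_CLUSTER_KUBECONFIG -o yaml | grep tunnel-port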
This workaround reverts every time the cluster is upgraded, so you must reconfigure it after each upgrade. VMware must resolve their issue in vSphere for a permanent fix.
The admin cluster upgrade from 1.14.x to 1.15.x with always-on secrets encryption enabled fails due to a mismatch between the controller-generated encryption key and the key that persists on the admin master data disk. The output of gkectl upgrade admin contains the following error message:
E0926 14:42:21.796444 40110 console.go:93] Exit with error: E0926 14:42:21.796491 40110 console.go:93] Failed to upgrade the admin cluster: failed to create admin cluster: failed to wait for OnPremAdminCluster "admin-cluster-name" to become ready: failed to wait for OnPremAdminCluster "admin-cluster-name" to be ready: error: timed out waiting for the condition, message: failed to wait for OnPremAdminCluster "admin-cluster-name" to stay in ready status for duration "2m0s": OnPremAdminCluster "non-prod-admin" is not ready: ready condition is not true: CreateOrUpdateControlPlane: Creating or updating credentials for cluster control plane
Running kubectl get secrets -A --kubeconfig KUBECONFIG fails with the following error:
Internal error occurred: unable to transform key "/registry/secrets/anthos-identity-service/ais-secret": rpc error: code = Internal desc = failed to decrypt: unknown jwk
Workaround:

If you have a backup of the admin cluster, do the following steps to work around the upgrade failure:

1. Disable secretsEncryption in the admin cluster configuration file, and update the cluster using the following command:

   gkectl update admin --config ADMIN_CLUSTER_CONFIG_FILE --kubeconfig KUBECONFIG

2. After the new admin master VM is created, restore the original encryption key from the backup to /opt/data/gke-k8s-kms-plugin/generatedkeys on the admin master.
3. Update the kms-plugin static Pod manifest in /etc/kubernetes/manifests to update the --kek-id to match the kid field in the original encryption key.
4. Restart the kms-plugin static Pod by moving /etc/kubernetes/manifests/kms-plugin.yaml to another directory, then moving it back.
5. Resume the admin upgrade by running gkectl upgrade admin again.

If you haven't already upgraded, we recommend that you don't upgrade to 1.15.0-1.15.4. If you must upgrade to an affected version, do the following steps before upgrading the admin cluster:
1. Disable secretsEncryption in the admin cluster configuration file, and update the cluster using the following command:

   gkectl update admin --config ADMIN_CLUSTER_CONFIG_FILE --kubeconfig KUBECONFIG
Google Distributed Cloud does not support Changed Block Tracking (CBT) on disks. Some backup software uses the CBT feature to track disk state and perform backups, which causes the disk to be unable to connect to a VM that runs Google Distributed Cloud. For more information, see the VMware KB article.
Workaround:
Don't back up the Google Distributed Cloud VMs, as third-party backup software might cause CBT to be enabled on their disks. It's not necessary to back up these VMs.
Don't enable CBT on the node, as this change won't persist across updates or upgrades.
If you already have disks with CBT enabled, follow the Resolution steps in the VMware KB article to disable CBT on the First Class Disk.
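If you want to check whether CBT is enabled on a VM, you can inspect its advanced settings with govc (a generic sketch, not from the VMware KB article; VM_NAME is a placeholder):

    govc vm.info -e VM_NAME | grep ctkEnabled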
If you use Nutanix storage arrays to provide NFSv3 shares to your hosts, you might experience data corruption, or Pods might fail to run successfully. This issue is caused by a known compatibility issue between certain VMware and Nutanix versions. For more information, see the associated VMware KB article.
Workaround:
The VMware KB article is out of date in noting that there is no current resolution. To resolve this issue, update to the latest version of ESXi on your hosts and to the latest Nutanix version on your storage arrays.
For certain Google Distributed Cloud releases, the kubelet running on the nodes uses a different version than the Kubernetes control plane. There is a mismatch because the kubelet binary preloaded on the OS image is using a different version.
The following table lists the identified version mismatches:
| Google Distributed Cloud version | kubelet version | Kubernetes version |
| --- | --- | --- |
| 1.13.10 | v1.24.11-gke.1200 | v1.24.14-gke.2100 |
| 1.14.6 | v1.25.8-gke.1500 | v1.25.10-gke.1200 |
| 1.15.3 | v1.26.2-gke.1001 | v1.26.5-gke.2100 |
Workaround:
No action is needed. The inconsistency is only between Kubernetes patch versions, and no problems have been caused by this version skew.
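If you want to see the skew on your own cluster, you can compare the kubelet versions reported by the nodes with the control plane version (generic kubectl commands, not part of the original text):

    kubectl get nodes -o custom-columns='NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion' --kubeconfig KUBECONFIG
    kubectl version --kubeconfig KUBECONFIG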
When an admin cluster has a certificate authority (CA) version greater than 1, an update or upgrade fails due to the CA version validation in the webhook. The output of gkectl
upgrade/update contains the following error message:
CAVersion must start from 1
Workaround:
1. Scale down the auto-resize-controller deployment in the admin cluster to disable node auto-resizing. This is necessary because a new field introduced to the admin cluster Custom Resource in 1.15 can cause a nil pointer error in the auto-resize-controller:

   kubectl scale deployment auto-resize-controller -n kube-system --replicas=0 --kubeconfig KUBECONFIG

2. Run gkectl commands with the --disable-admin-cluster-webhook flag. For example:

   gkectl upgrade admin --config ADMIN_CLUSTER_CONFIG_FILE --kubeconfig KUBECONFIG --disable-admin-cluster-webhook
When a non-HA Controlplane V2 cluster is deleted, it is stuck at node deletion until it times out.
Workaround:
If the cluster contains a StatefulSet with critical data, contact Cloud Customer Care to resolve this issue.
Otherwise, do the following steps:
1. Delete the remaining cluster VMs using govc vm.destroy.
2. Force-delete the cluster:

   gkectl delete cluster --cluster USER_CLUSTER_NAME --kubeconfig ADMIN_KUBECONFIG --force
When a cluster contains in-tree vSphere persistent volumes (for example, PVCs created with the standard
StorageClass), you will observe com.vmware.cns.tasks.attachvolume tasks triggered every minute from vCenter.
Workaround:
Edit the vSphere CSI feature configMap and set list-volumes to false:
kubectl edit configmap internal-feature-states.csi.vsphere.vmware.com -n kube-system --kubeconfig KUBECONFIG
Restart the vSphere CSI controller pods:
kubectl rollout restart deployment vsphere-csi-controller -n kube-system --kubeconfig KUBECONFIG
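To confirm the restart completed (a generic kubectl check, not part of the original steps):

    kubectl rollout status deployment vsphere-csi-controller -n kube-system --kubeconfig KUBECONFIG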
When a cluster contains in-tree vSphere persistent volumes, the commands gkectl diagnose
and gkectl upgrade
might raise false warnings against their persistent volume claims (PVCs) when validating the cluster storage settings. The warning message looks like the following:
CSIPrerequisites pvc/pvc-name: PersistentVolumeClaim pvc-name bounds to an in-tree vSphere volume created before CSI migration enabled, but it doesn't have the annotation pv.kubernetes.io/migrated-to set to csi.vsphere.vmware.com after CSI migration is enabled
Workaround:
Run the following command to check the annotations of a PVC with the above warning:
kubectl get pvc PVC_NAME -n PVC_NAMESPACE -o yaml --kubeconfig KUBECONFIG
If the annotations
field in the output contains the following, you can safely ignore the warning:
pv.kubernetes.io/bind-completed: "yes"
pv.kubernetes.io/bound-by-controller: "yes"
volume.beta.kubernetes.io/storage-provisioner: csi.vsphere.vmware.com
If your cluster is not using a private registry, and your component access service account key and Logging-monitoring (or Connect-register) service account keys are expired, when you rotate the service account keys, gkectl update credentials
fails with an error similar to the following:
Error: reconciliation failed: failed to update platform: ...
Workaround:
First, rotate the component access service account key. Although the same error message is displayed, you should be able to rotate the other keys after the component access service account key rotation.
If the update is still not successful, contact Cloud Customer Care to resolve this issue.
During a user cluster upgrade, after the user cluster controller is upgraded to 1.16, if you have other 1.15 user clusters managed by the same admin cluster, their user master machine might be unexpectedly recreated.
There is a bug in the 1.16 user cluster controller which can trigger the 1.15 user master machine recreation.
The workaround that you do depends on how you encounter this issue.
Workaround when upgrading the user cluster using the Google Cloud console:
Option 1: Use a 1.16.6+ version of GKE on VMware with the fix.
Option 2: Do the following steps:
1. Add the rerun annotation to the OnPremUserCluster custom resource:

   kubectl edit onpremuserclusters USER_CLUSTER_NAME -n USER_CLUSTER_NAME-gke-onprem-mgmt --kubeconfig ADMIN_KUBECONFIG

   The rerun annotation is:

   onprem.cluster.gke.io/server-side-preflight-rerun: true

2. Monitor the upgrade progress by checking the status field of the OnPremUserCluster.

Workaround when upgrading the user cluster using your own admin workstation:
Option 1: Use a 1.16.6+ version of GKE on VMware with the fix.
Option 2: Do the following steps:
Create the file /etc/cloud/build.info with the following content. This causes the preflight checks to run locally on your admin workstation rather than on the server:

gke_on_prem_version: GKE_ON_PREM_VERSION

For example:

gke_on_prem_version: 1.16.0-gke.669

Then rerun the upgrade.
During cluster creation, if you don't specify a hostname for every IP address in the IP block file, the preflight check fails with the following error message:
multiple VMs found by DNS name in xxx datacenter. Anthos Onprem doesn't support duplicate hostname in the same vCenter and you may want to rename/delete the existing VM.
There is a bug in the preflight check that treats an empty hostname as a duplicate.
Workaround:
Option 1: Use a version with the fix.
Option 2: Bypass this preflight check by adding the --skip-validation-net-config flag.
Option 3: Specify a unique hostname for each IP address in the IP block file.
For a non-HA admin cluster and a control plane v1 user cluster, when you upgrade or update the admin cluster, the admin cluster master machine recreation might happen at the same time as the user cluster master machine reboot, which can surface a race condition. This causes the user cluster control plane Pods to be unable to communicate to the admin cluster control plane, which causes volume attach issues for kube-etcd and kube-apiserver on the user cluster control plane.
To verify the issue, run the following commands for the impacted pod:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG --namespace USER_CLUSTER_NAME describe pod IMPACTED_POD_NAME
Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  101s                 kubelet  Unable to attach or mount volumes: unmounted volumes=[kube-audit], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
  Warning  FailedMount  86s (x2 over 3m28s)  kubelet  MountVolume.SetUp failed for volume "pvc-77cd0635-57c2-4392-b191-463a45c503cb" : rpc error: code = FailedPrecondition desc = volume ID: "bd313c62-d29b-4ecd-aeda-216648b0f7dc" does not appear staged to "/var/lib/kubelet/plugins/kubernetes.io/csi/csi.vsphere.vmware.com/92435c96eca83817e70ceb8ab994707257059734826fedf0c0228db6a1929024/globalmount"
Workaround:
Use SSH to connect to the user cluster control plane node, then restart the kubelet:

sudo systemctl restart kubelet
During an upgrade or update of an admin cluster, a race condition might cause the vSphere cloud controller manager to unexpectedly delete a new control plane node. This causes the clusterapi-controller to be stuck waiting for the node to be created, and eventually the upgrade/update times out. In this case, the output of the gkectl upgrade/update command is similar to the following:
controlplane 'default/gke-admin-hfzdg' is not ready: condition "Ready": condition is not ready with reason "MachineInitializing", message "Wait for the control plane machine "gke-admin-hfzdg-6598459f9zb647c8-0" to be rebooted"...
To identify the symptom, run the following command to get the logs from the vSphere cloud controller manager in the admin cluster:
kubectl get pods --kubeconfig ADMIN_KUBECONFIG -n kube-system | grep vsphere-cloud-controller-manager

kubectl logs -f vsphere-cloud-controller-manager-POD_NAME_SUFFIX --kubeconfig ADMIN_KUBECONFIG -n kube-system
Here is a sample error message from the above command:
node name: 81ff17e25ec6-qual-335-1500f723 has a different uuid. Skip deleting this node from cache.
Workaround:
On the admin control plane node, stop the vsphere-cloud-controller-manager container so that it gets recreated:

sudo crictl ps | grep vsphere-cloud-controller-manager | awk '{print $1}'

sudo crictl stop PREVIOUS_COMMAND_OUTPUT
Upgrading a 1.15 cluster or creating a 1.16 cluster with static IPs fails if there are duplicate hostnames in the same data center. This failure happens because the vSphere cloud controller manager fails to add an external IP and provider ID in the node object. This causes the cluster upgrade/create to time out.
To identify the issue, get the vSphere cloud controller manager Pod logs for the cluster. The command that you use depends on the cluster type, as follows:

Admin cluster:

kubectl get pods --kubeconfig ADMIN_KUBECONFIG -n kube-system | grep vsphere-cloud-controller-manager
kubectl logs -f vsphere-cloud-controller-manager-POD_NAME_SUFFIX --kubeconfig ADMIN_KUBECONFIG -n kube-system

User cluster with Controlplane V1 (kubeception):

kubectl get pods --kubeconfig ADMIN_KUBECONFIG -n USER_CLUSTER_NAME | grep vsphere-cloud-controller-manager
kubectl logs -f vsphere-cloud-controller-manager-POD_NAME_SUFFIX --kubeconfig ADMIN_KUBECONFIG -n USER_CLUSTER_NAME

User cluster with Controlplane V2:

kubectl get pods --kubeconfig USER_KUBECONFIG -n kube-system | grep vsphere-cloud-controller-manager
kubectl logs -f vsphere-cloud-controller-manager-POD_NAME_SUFFIX --kubeconfig USER_KUBECONFIG -n kube-system
Here is a sample error message:
I1003 17:17:46.769676 1 search.go:152] Finding node admin-vm-2 in vc=vcsa-53598.e5c235a1.asia-northeast1.gve.goog and datacenter=Datacenter
E1003 17:17:46.771717 1 datacenter.go:111] Multiple vms found VM by DNS Name. DNS Name: admin-vm-2
Check if the hostname is duplicated in the data center:
You can use the following approach to check if the hostname is duplicated, and do a workaround if needed:

export GOVC_DATACENTER=GOVC_DATACENTER
export GOVC_URL=GOVC_URL
export GOVC_USERNAME=GOVC_USERNAME
export GOVC_PASSWORD=GOVC_PASSWORD
export GOVC_INSECURE=true

govc find . -type m -guest.hostName HOSTNAME
Example commands and output:

export GOVC_DATACENTER=mtv-lifecycle-vc01
export GOVC_URL=https://mtv-lifecycle-vc01.anthos/sdk
export GOVC_USERNAME=xxx
export GOVC_PASSWORD=yyy
export GOVC_INSECURE=true

govc find . -type m -guest.hostName f8c3cd333432-lifecycle-337-xxxxxxxz

./vm/gke-admin-node-6b7788cd76-wkt8g
./vm/gke-admin-node-6b7788cd76-99sg2
./vm/gke-admin-master-5m2jb
The workaround that you do depends on the operation that failed.
Workaround for upgrades:
Do the workaround for the applicable cluster type.

User cluster:

1. Update the hostname of the affected machine in the user cluster IP block file, then run a forced update:

   gkectl update cluster --kubeconfig ADMIN_KUBECONFIG --config NEW_USER_CLUSTER_CONFIG --force

2. Rerun gkectl upgrade cluster.

Admin cluster:

1. Update the hostname of the affected machine in the admin cluster IP block file, then run a forced update:

   gkectl update admin --kubeconfig ADMIN_KUBECONFIG --config NEW_ADMIN_CLUSTER_CONFIG --force --skip-cluster-ready-check

2. If the admin master machine uses a duplicated hostname, list the machine objects and patch the hostname:

   kubectl get machine --kubeconfig ADMIN_KUBECONFIG -o wide -A

   kubectl patch machine ADMIN_MASTER_MACHINE_NAME --kubeconfig ADMIN_KUBECONFIG --type='json' -p '[{"op": "replace", "path": "/spec/providerSpec/value/networkSpec/address/hostname", "value":"NEW_ADMIN_MASTER_HOSTNAME"}]'

3. Verify the hostname change:

   kubectl get machine ADMIN_MASTER_MACHINE_NAME --kubeconfig ADMIN_KUBECONFIG -o yaml

   kubectl get machine ADMIN_MASTER_MACHINE_NAME --kubeconfig ADMIN_KUBECONFIG -o jsonpath='{.spec.providerSpec.value.networkSpec.address.hostname}'

4. Rerun the admin cluster upgrade with checkpoint disabled:

   gkectl upgrade admin --kubeconfig ADMIN_KUBECONFIG --config ADMIN_CLUSTER_CONFIG --disable-upgrade-from-checkpoint
Workaround for installations:
Do the workaround for the applicable cluster type.
- Admin cluster: update the duplicated hostname in the admin cluster IP block file, then rerun gkectl create admin.
- User cluster: update the duplicated hostname in the user cluster IP block file, then rerun gkectl create cluster.

$ and ` are not supported in vSphere username or password

The following operations fail when the vSphere username or password contains $ or `:
Use a 1.16.4+ version of Google Distributed Cloud with the fix, or perform the following workaround. The workaround that you do depends on the operation that failed.
Workaround for upgrades:
1. Update the vSphere username or password to remove the $ and ` characters.
2. Run a forced update. For a user cluster:

   gkectl update cluster --kubeconfig ADMIN_KUBECONFIG --config USER_CLUSTER_CONFIG --force

   For an admin cluster:

   gkectl update admin --kubeconfig ADMIN_KUBECONFIG --config ADMIN_CLUSTER_CONFIG --force --skip-cluster-ready-check

3. Rerun the upgrade.
Workaround for installations:
1. Update the vSphere username or password to remove the $ and ` characters.
2. Rerun gkectl create admin or gkectl create cluster.

After a node is deleted and then recreated with the same node name, there is a slight chance that a subsequent PersistentVolumeClaim (PVC) creation fails with an error like the following:
The object 'vim.VirtualMachine:vm-988369' has already been deleted or has not been completely created
This is caused by a race condition in which the vSphere CSI controller doesn't delete a removed machine from its cache.
Workaround:
Restart the vSphere CSI controller pods:
kubectl rollout restart deployment vsphere-csi-controller -n kube-system --kubeconfig KUBECONFIG
When you run the gkectl repair admin-master
command on an HA admin cluster, gkectl
returns the following error message:
Exit with error: Failed to repair: failed to select the template: failed to get cluster name from kubeconfig, please contact Google support. failed to decode kubeconfig data: yaml: unmarshal errors:
  line 3: cannot unmarshal !!seq into map[string]*api.Cluster
  line 8: cannot unmarshal !!seq into map[string]*api.Context
Workaround:
Add the --admin-master-vm-template=
flag to the command and provide the VM template of the machine to repair:
gkectl repair admin-master --kubeconfig=ADMIN_CLUSTER_KUBECONFIG \
    --config ADMIN_CLUSTER_CONFIG_FILE \
    --admin-master-vm-template=/DATA_CENTER/vm/VM_TEMPLATE_NAME
To find the VM template of the machine, browse the VM templates in the vSphere client and filter by the admin cluster name. You should see the three VM templates for the admin cluster.
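You could also list candidate templates with govc (a hypothetical filter sketch; adjust the name pattern to your cluster name):

    govc find . -type m -name 'ADMIN_CLUSTER_NAME*-tmpl'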
For example:

gkectl repair admin-master \
    --config=/home/ubuntu/admin-cluster.yaml \
    --kubeconfig=/home/ubuntu/kubeconfig \
    --admin-master-vm-template=/atl-qual-vc07/vm/gke-admin-98g94-zx...7vx-0-tmpl
If you use Seesaw as the load balancer type for your cluster and you see that a Seesaw VM is down or keeps failing to boot, you might see the following error message in the vSphere console:
GRUB_FORCE_PARTUUID set, initrdless boot failed. Attempting with initrd.
This error indicates that disk space is low on the VM because the fluent-bit running on the Seesaw VM isn't configured with correct log rotation.
Workaround:
Locate the log files that consume most of the disk space using du -sh -- /var/lib/docker/containers/* | sort -rh. Clean up the log file with the largest size, then reboot the VM.
Note: If the VM is completely inaccessible, attach the disk to a working VM (for example, the admin workstation), remove the file from the attached disk, then reattach the disk to the original Seesaw VM.
To prevent the issue from happening again, connect to the VM and modify the /etc/systemd/system/docker.fluent-bit.service file. Add --log-opt max-size=10m --log-opt max-file=5 to the Docker command, then run systemctl restart docker.fluent-bit.service.
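For illustration, the Docker invocation in that unit file might look like the following after the change (a sketch only; the actual ExecStart line in your file will differ):

    ExecStart=/usr/bin/docker run --log-opt max-size=10m --log-opt max-file=5 ... fluent-bit ...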
When you try to upgrade (gkectl upgrade admin
) or update (gkectl update admin
) a non-High-Availability admin cluster with checkpoint enabled, the upgrade or update may fail with errors like the following:
Checking admin cluster certificates...FAILURE
    Reason: 20 admin cluster certificates error(s).
    Unhealthy Resources:
    AdminMaster cluster CA bundle: failed to get cluster CA bundle on admin master, command [ssh -o IdentitiesOnly=yes -i admin-ssh-key -o StrictHostKeyChecking=no -o ConnectTimeout=30 ubuntu@AdminMasterIP -- sudo cat /etc/kubernetes/pki/ca-bundle.crt] failed with error: exit status 255, stderr: Authorized uses only. All activity may be monitored and reported.
    ubuntu@AdminMasterIP: Permission denied (publickey).

failed to ssh AdminMasterIP, failed with error: exit status 255, stderr: Authorized uses only. All activity may be monitored and reported.
    ubuntu@AdminMasterIP: Permission denied (publickey)

error dialing ubuntu@AdminMasterIP: failed to establish an authenticated SSH connection: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey]...
Workaround:
If you're unable to upgrade to a patch version of Google Distributed Cloud with the fix, contact Google Support for assistance.
When an admin cluster is enrolled in the GKE On-Prem API, upgrading the admin cluster to the affected versions could fail because the fleet membership couldn't be updated. When this failure happens, you see the following error when trying to upgrade the cluster:
failed to register cluster: failed to apply Hub Membership: Membership API request failed: rpc error: code = InvalidArgument desc = InvalidFieldError for field endpoint.on_prem_cluster.resource_link: field cannot be updated
An admin cluster is enrolled in the API when you explicitly enroll the cluster, or when you upgrade a user cluster using a GKE On-Prem API client.
Workaround:
Unenroll the admin cluster:

gcloud alpha container vmware admin-clusters unenroll ADMIN_CLUSTER_NAME --project CLUSTER_PROJECT --location=CLUSTER_LOCATION --allow-missing
When an admin cluster is enrolled in the GKE On-Prem API, its resource link annotation is applied to the OnPremAdminCluster custom resource. The annotation is not preserved during later admin cluster updates because the wrong annotation key is used. This can cause the admin cluster to be enrolled in the GKE On-Prem API again by mistake.
An admin cluster is enrolled in the API when you explicitly enroll the cluster, or when you upgrade a user cluster using a GKE On-Prem API client.
Workaround:
Unenroll the admin cluster:

gcloud alpha container vmware admin-clusters unenroll ADMIN_CLUSTER_NAME --project CLUSTER_PROJECT --location=CLUSTER_LOCATION --allow-missing
orderPolicy not recognized

OrderPolicy doesn't get recognized as a parameter and isn't used. Instead, Google Distributed Cloud always uses Random.
This issue occurs because the CoreDNS template was not updated, which causes orderPolicy
to be ignored.
Workaround:
Update the CoreDNS template and apply the fix. This fix persists until an upgrade.
kubectl edit cm -n kube-system coredns-template

Replace the template with the following:
coredns-template: |-
  .:53 {
    errors
    health {
      lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
      pods insecure
      fallthrough in-addr.arpa ip6.arpa
    }
    {{- if .PrivateGoogleAccess }}
    import zones/private.Corefile
    {{- end }}
    {{- if .RestrictedGoogleAccess }}
    import zones/restricted.Corefile
    {{- end }}
    prometheus :9153
    forward . {{ .UpstreamNameservers }} {
      max_concurrent 1000
      {{- if ne .OrderPolicy "" }}
      policy {{ .OrderPolicy }}
      {{- end }}
    }
    cache 30
    {{- if .DefaultDomainQueryLogging }}
    log
    {{- end }}
    loop
    reload
    loadbalance
  }{{ range $i, $stubdomain := .StubDomains }}
  {{ $stubdomain.Domain }}:53 {
    errors
    {{- if $stubdomain.QueryLogging }}
    log
    {{- end }}
    cache 30
    forward . {{ $stubdomain.Nameservers }} {
      max_concurrent 1000
      {{- if ne $.OrderPolicy "" }}
      policy {{ $.OrderPolicy }}
      {{- end }}
    }
  }
  {{- end }}
Certain race conditions could cause the OnPremAdminCluster
status to be inconsistent between the checkpoint and the actual CR. When the issue happens, you could encounter the following error when you update the admin cluster after you upgrade it:
Exit with error:
E0321 10:20:53.515562 961695 console.go:93] Failed to update the admin cluster: OnPremAdminCluster "gke-admin-rj8jr" is in the middle of a create/upgrade ("" -> "1.15.0-gke.123"), which must be completed before it can be updated
Failed to update the admin cluster: OnPremAdminCluster "gke-admin-rj8jr" is in the middle of a create/upgrade ("" -> "1.15.0-gke.123"), which must be completed before it can be updated
Google Distributed Cloud changes the admin certificates on admin cluster control planes with every reconciliation process, such as during a cluster upgrade. This behavior increases the possibility of getting invalid certificates for your admin cluster, especially for version 1.15 clusters.
If you're affected by this issue, you may encounter problems when you run gkectl create admin, gkectl upgrade admin, or gkectl update admin.

These commands may return authorization errors like the following:

Failed to reconcile admin cluster: unable to populate admin clients: failed to get admin controller runtime client: Unauthorized

The kube-apiserver logs for your admin cluster may contain errors like the following:

Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid...
Workaround:
Upgrade to a version of Google Distributed Cloud with the fix: 1.13.10+, 1.14.6+, 1.15.2+. If upgrading isn't feasible for you, contact Cloud Customer Care to resolve this issue.
Network gateway Pods in kube-system
might show a status of Pending
or Evicted
, as shown in the following condensed example output:
$ kubectl -n kube-system get pods | grep ang-node
ang-node-bjkkc     2/2     Running     0     5d2h
ang-node-mw8cq     0/2     Evicted     0     6m5s
ang-node-zsmq7     0/2     Pending     0     7h
These errors indicate eviction events or an inability to schedule Pods due to node resources. Because Anthos Network Gateway Pods have no PriorityClass, they have the same default priority as other workloads. When nodes are resource-constrained, the network gateway Pods might be evicted. This behavior is particularly bad for the ang-node DaemonSet, as those Pods must be scheduled on a specific node and can't migrate.
Workaround:
Upgrade to 1.15 or later.
As a short-term fix, you can manually assign a PriorityClass to the Anthos Network Gateway components. The Google Distributed Cloud controller overwrites these manual changes during a reconciliation process, such as during a cluster upgrade.
1. Assign the system-cluster-critical PriorityClass to the ang-controller-manager and autoscaler cluster controller Deployments.
2. Assign the system-node-critical PriorityClass to the ang-daemon node DaemonSet.
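As a sketch, you could apply the PriorityClasses with generic kubectl patches (assuming the Deployment and DaemonSet names from the list above, and similarly for the autoscaler Deployment; the controller may revert these changes on the next reconciliation):

    kubectl -n kube-system --kubeconfig KUBECONFIG patch deployment ang-controller-manager \
        --type=json -p '[{"op":"add","path":"/spec/template/spec/priorityClassName","value":"system-cluster-critical"}]'

    kubectl -n kube-system --kubeconfig KUBECONFIG patch daemonset ang-daemon \
        --type=json -p '[{"op":"add","path":"/spec/template/spec/priorityClassName","value":"system-node-critical"}]'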
After you use gcloud to register an admin cluster with a non-empty gkeConnect
section, you might see the following error when trying to upgrade the cluster:
failed to register cluster: failed to apply Hub Membership: Membership API request failed: rpc error: code = InvalidArgument desc = InvalidFieldError for field endpoint.on_prem_cluster.admin_cluster: field cannot be updated
Workaround:

1. Delete the gke-connect namespace:

   kubectl delete ns gke-connect --kubeconfig=ADMIN_KUBECONFIG

2. Get the admin cluster name:

   kubectl get onpremadmincluster -n kube-system --kubeconfig=ADMIN_KUBECONFIG

3. Delete the fleet membership:

   gcloud container fleet memberships delete ADMIN_CLUSTER_NAME

4. Resume upgrading the admin cluster.
gkectl diagnose snapshot --log-since fails to limit the time window for journalctl commands running on the cluster nodes

This doesn't affect the functionality of taking a snapshot of the cluster, as the snapshot still includes all logs that are collected by default by running journalctl on the cluster nodes. Therefore, no debugging information is missed.
gkectl prepare windows fails

gkectl prepare windows fails to install Docker on Google Distributed Cloud versions earlier than 1.13 because MicrosoftDockerProvider is deprecated.
Workaround:
To work around this issue, upgrade to Google Distributed Cloud 1.13 and use the 1.13 gkectl to create a Windows VM template, then create Windows node pools. There are two options to get to Google Distributed Cloud 1.13 from your current version, as shown below.

Note: There are options to work around this issue in your current version without upgrading all the way to 1.13, but they require more manual steps. Reach out to our support team if you would like to consider this option.
Option 1: Blue/Green upgrade
You can create a new cluster using a Google Distributed Cloud 1.13+ version with Windows node pools, migrate your workloads to the new cluster, and then tear down the current cluster. We recommend using the latest Google Distributed Cloud minor version.

Note: This requires extra resources to provision the new cluster, but causes less downtime and disruption for existing workloads.
Option 2: Delete Windows node pools and add them back when upgrading to Google Distributed Cloud 1.13
Note: For this option, the Windows workloads will not be able to run until the cluster is upgraded to 1.13 and Windows node pools are added back.

1. Remove the existing Windows node pools from the user cluster configuration file, then run:

   gkectl update cluster --kubeconfig=ADMIN_KUBECONFIG --config USER_CLUSTER_CONFIG_FILE

2. Upgrade the cluster to 1.13.
3. Ensure enableWindowsDataplaneV2: true is configured in the OnPremUserCluster CR. Otherwise, the cluster will keep using Docker for Windows node pools, which isn't compatible with the newly created 1.13 Windows VM template that doesn't have Docker installed. If it's not configured, or is set to false, update your cluster to set it to true in user-cluster.yaml, then run:

   gkectl update cluster --kubeconfig=ADMIN_KUBECONFIG --config USER_CLUSTER_CONFIG_FILE

4. Create the 1.13 Windows VM template:

   gkectl prepare windows --base-vm-template BASE_WINDOWS_VM_TEMPLATE_NAME --bundle-path 1.13_BUNDLE_PATH --kubeconfig=ADMIN_KUBECONFIG

5. Add the Windows node pools back to the user cluster configuration file, with the OSImage field set to the newly created Windows VM template, then run:

   gkectl update cluster --kubeconfig=ADMIN_KUBECONFIG --config USER_CLUSTER_CONFIG_FILE
RootDistanceMaxSec configuration not taking effect for ubuntu nodes

The default value of 5 seconds for RootDistanceMaxSec is used on the nodes, instead of the expected configuration of 20 seconds. If you check the node startup log at /var/log/startup.log (by using SSH to connect to the VM), you can find the following error:
+ has_systemd_unit systemd-timesyncd
/opt/bin/master.sh: line 635: has_systemd_unit: command not found
Using a 5-second RootDistanceMaxSec might cause the system clock to be out of sync with the NTP server when the clock drift is larger than 5 seconds.
Workaround:
Apply the following DaemonSet to your cluster to configure RootDistanceMaxSec
:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: change-root-distance
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: change-root-distance
  template:
    metadata:
      labels:
        app: change-root-distance
    spec:
      hostIPC: true
      hostPID: true
      tolerations:
      # Make sure pods gets scheduled on all nodes.
      - effect: NoSchedule
        operator: Exists
      - effect: NoExecute
        operator: Exists
      containers:
      - name: change-root-distance
        image: ubuntu
        command: ["chroot", "/host", "bash", "-c"]
        args:
        - |
          while true; do
            conf_file="/etc/systemd/timesyncd.conf.d/90-gke.conf"
            if [ -f $conf_file ] && $(grep -q "RootDistanceMaxSec=20" $conf_file); then
              echo "timesyncd has the expected RootDistanceMaxSec, skip update"
            else
              echo "updating timesyncd config to RootDistanceMaxSec=20"
              mkdir -p /etc/systemd/timesyncd.conf.d
              cat > $conf_file << EOF
          [Time]
          RootDistanceMaxSec=20
          EOF
              systemctl restart systemd-timesyncd
            fi
            sleep 600
          done
        volumeMounts:
        - name: host
          mountPath: /host
        securityContext:
          privileged: true
      volumes:
      - name: host
        hostPath:
          path: /
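Apply the manifest with a standard kubectl command (change-root-distance.yaml is whatever filename you saved the manifest above as):

    kubectl apply -f change-root-distance.yaml --kubeconfig KUBECONFIG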
gkectl update admin fails because of empty osImageType field

When you use version 1.13 gkectl to update a version 1.12 admin cluster, you might see the following error:
Failed to update the admin cluster: updating OS image type in admin cluster is not supported in "1.12.x-gke.x"
When you use gkectl update admin
for version 1.13 or 1.14 clusters, you might see the following message in the response:
Exit with error:
Failed to update the cluster: the update contains multiple changes. Please update only one feature at a time
If you check the gkectl
log, you might see that the multiple changes include setting osImageType
from an empty string to ubuntu_containerd.
These update errors are due to improper backfilling of the osImageType
field in the admin cluster config since it was introduced in version 1.9.
Workaround:
Upgrade to a version of Google Distributed Cloud with the fix. If upgrading isn't feasible for you, contact Cloud Customer Care to resolve this issue.
The ability to provide an additional serving certificate for the Kubernetes API server of a user cluster with authentication.sni doesn't work when Controlplane V2 is enabled (enableControlplaneV2: true).
Workaround:
Until a Google Distributed Cloud patch with the fix is available, disable Controlplane V2 (enableControlplaneV2: false) if you need to use SNI.
$ in the private registry username causes admin control plane machine startup failure

The admin control plane machine fails to start up when the private registry username contains $. When checking /var/log/startup.log on the admin control plane machine, you see the following error:
++ REGISTRY_CA_CERT=xxx
++ REGISTRY_SERVER=xxx
/etc/startup/startup.conf: line 7: anthos: unbound variable
Workaround:
Use a private registry username without $, or use a version of Google Distributed Cloud with the fix.
When you update admin clusters, you will see the following false-positive warnings in the log, and you can ignore them.
console.go:47] detected unsupported changes: &v1alpha1.OnPremAdminCluster{
    ...
    - CARotation: &v1alpha1.CARotationConfig{Generated: &v1alpha1.CARotationGenerated{CAVersion: 1}},
    + CARotation: nil,
    ...
}
After you rotate KSA signing keys and subsequently update a user cluster, gkectl update
might fail with the following error message:
Failed to apply OnPremUserCluster 'USER_CLUSTER_NAME-gke-onprem-mgmt/USER_CLUSTER_NAME': admission webhook "vonpremusercluster.onprem.cluster.gke.io" denied the request: requests must not decrement *v1alpha1.KSASigningKeyRotationConfig Version, old version: 2, new version: 1
Workaround:
Change the version of your KSA signing key back to 1, but retain the latest key data:

1. Check the secrets in the admin cluster under the USER_CLUSTER_NAME namespace, and get the name of the ksa-signing-key secret:

   kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME get secrets | grep ksa-signing-key

2. Copy the ksa-signing-key secret, and name the copied secret service-account-cert:

   kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME get secret KSA-KEY-SECRET-NAME -o yaml | \
       sed 's/ name: .*/ name: service-account-cert/' | \
       kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME apply -f -

3. Delete the previous ksa-signing-key secret:

   kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME delete secret KSA-KEY-SECRET-NAME

4. Update the data.data field in the ksa-signing-key-rotation-stage configmap to '{"tokenVersion":1,"privateKeyVersion":1,"publicKeyVersions":[1]}':

   kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME \
       edit configmap ksa-signing-key-rotation-stage

5. Disable the validation webhook to edit the version information in the OnPremUserCluster custom resource:

   kubectl --kubeconfig=ADMIN_KUBECONFIG patch validatingwebhookconfiguration onprem-user-cluster-controller -p 'webhooks:
   - name: vonpremnodepool.onprem.cluster.gke.io
     rules:
     - apiGroups:
       - onprem.cluster.gke.io
       apiVersions:
       - v1alpha1
       operations:
       - CREATE
       resources:
       - onpremnodepools
   - name: vonpremusercluster.onprem.cluster.gke.io
     rules:
     - apiGroups:
       - onprem.cluster.gke.io
       apiVersions:
       - v1alpha1
       operations:
       - CREATE
       resources:
       - onpremuserclusters'

6. Update the spec.ksaSigningKeyRotation.generated.ksaSigningKeyRotation field to 1 in your OnPremUserCluster custom resource:

   kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME-gke-onprem-mgmt \
       edit onpremusercluster USER_CLUSTER_NAME

7. Wait for the target user cluster to be ready; you can check the status from:

   kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME-gke-onprem-mgmt \
       get onpremusercluster

8. Restore the validation webhook for the user cluster:

   kubectl --kubeconfig=ADMIN_KUBECONFIG patch validatingwebhookconfiguration onprem-user-cluster-controller -p 'webhooks:
   - name: vonpremnodepool.onprem.cluster.gke.io
     rules:
     - apiGroups:
       - onprem.cluster.gke.io
       apiVersions:
       - v1alpha1
       operations:
       - CREATE
       - UPDATE
       resources:
       - onpremnodepools
   - name: vonpremusercluster.onprem.cluster.gke.io
     rules:
     - apiGroups:
       - onprem.cluster.gke.io
       apiVersions:
       - v1alpha1
       operations:
       - CREATE
       - UPDATE
       resources:
       - onpremuserclusters'
When you use Terraform to delete a user cluster with a F5 BIG-IP load balancer, the F5 BIG-IP virtual servers aren't removed after the cluster deletion.
Workaround:
To remove the F5 resources, follow the steps to clean up a user cluster F5 partition.
kind cluster pulls container images from docker.io

If you create a version 1.13.8 or version 1.14.4 admin cluster, or upgrade an admin cluster to version 1.13.8 or 1.14.4, the kind cluster pulls the following container images from docker.io:
docker.io/kindest/kindnetd
docker.io/kindest/local-path-provisioner
docker.io/kindest/local-path-helper
If docker.io
isn't accessible from your admin workstation, the admin cluster creation or upgrade fails to bring up the kind cluster. Running the following command on the admin workstation shows the corresponding containers pending with ErrImagePull
:
docker exec gkectl-control-plane kubectl get pods -A
The response contains entries like the following:
...
kube-system          kindnet-xlhmr                             0/1   ErrImagePull   0   3m12s
...
local-path-storage   local-path-provisioner-86666ffff6-zzqtp   0/1   Pending        0   3m12s
...
These container images should be preloaded in the kind cluster container image. However, kind v0.18.0 has an issue with the preloaded container images, which causes them to be pulled from the internet by mistake.
Workaround:
Run the following commands on the admin workstation, while your admin cluster is pending on creation or upgrade:
docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/kindnetd:v20230330-48f316cd@sha256:c19d6362a6a928139820761475a38c24c0cf84d507b9ddf414a078cf627497af docker.io/kindest/kindnetd:v20230330-48f316cd

docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/kindnetd:v20230330-48f316cd@sha256:c19d6362a6a928139820761475a38c24c0cf84d507b9ddf414a078cf627497af docker.io/kindest/kindnetd@sha256:c19d6362a6a928139820761475a38c24c0cf84d507b9ddf414a078cf627497af

docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/local-path-helper:v20230330-48f316cd@sha256:135203f2441f916fb13dad1561d27f60a6f11f50ec288b01a7d2ee9947c36270 docker.io/kindest/local-path-helper:v20230330-48f316cd

docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/local-path-helper:v20230330-48f316cd@sha256:135203f2441f916fb13dad1561d27f60a6f11f50ec288b01a7d2ee9947c36270 docker.io/kindest/local-path-helper@sha256:135203f2441f916fb13dad1561d27f60a6f11f50ec288b01a7d2ee9947c36270

docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/local-path-provisioner:v0.0.23-kind.0@sha256:f2d0a02831ff3a03cf51343226670d5060623b43a4cfc4808bd0875b2c4b9501 docker.io/kindest/local-path-provisioner:v0.0.23-kind.0

docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/local-path-provisioner:v0.0.23-kind.0@sha256:f2d0a02831ff3a03cf51343226670d5060623b43a4cfc4808bd0875b2c4b9501 docker.io/kindest/local-path-provisioner@sha256:f2d0a02831ff3a03cf51343226670d5060623b43a4cfc4808bd0875b2c4b9501
If your cluster VMs are connected with a switch that filters out duplicate GARP (gratuitous ARP) requests, the keepalived leader election might encounter a race condition, which causes some nodes to have incorrect ARP table entries.
The affected nodes can ping
the control plane VIP, but a TCP connection to the control plane VIP will time out.
Workaround:
Run the following command on each control plane node of the affected cluster:

iptables -I FORWARD -i ens192 --destination CONTROL_PLANE_VIP -j DROP
vsphere-csi-controller needs to be restarted after vCenter certificate rotation

vsphere-csi-controller should refresh its vCenter secret after a vCenter certificate rotation. However, the current system does not properly restart the pods of vsphere-csi-controller, causing vsphere-csi-controller to crash after the rotation.
Workaround:
For clusters created at 1.13 and later versions, follow the instructions below to restart vsphere-csi-controller:
kubectl --kubeconfig=ADMIN_KUBECONFIG rollout restart deployment vsphere-csi-controller -n kube-system
Even when cluster registration fails during admin cluster creation, the command gkectl create admin does not fail on the error and might succeed. In other words, the admin cluster creation could "succeed" without being registered to a fleet.

To identify this issue, look for the following error message in the gkectl log:

Failed to register admin cluster

You can also check whether you can find the cluster among registered clusters in the Google Cloud console.
Workaround:
For clusters created at 1.12 and later versions, follow the instructions for re-attempting the admin cluster registration after cluster creation. For clusters created at earlier versions, run gkectl update admin to re-register the admin cluster.

During admin cluster upgrade, if upgrading user control plane nodes times out, the admin cluster will not be re-registered with the updated connect agent version.
Workaround:
Check whether the cluster shows among registered clusters. As an optional step, log in to the cluster after setting up authentication. If the cluster is still registered, you can skip the following instructions for re-attempting the registration. For clusters upgraded to 1.12 and later versions, follow the instructions for re-attempting the admin cluster registration after cluster creation. For clusters upgraded to earlier versions, run gkectl update admin to re-register the admin cluster.

vCenter.dataDisk validation shows a false error for high-availability admin clusters
For a high-availability admin cluster, gkectl prepare
shows this false error message:
vCenter.dataDisk must be present in the AdminCluster spec
Workaround:
You can safely ignore this error message.
During creation of a node pool that uses VM-Host affinity, a race condition might result in multiple VM-Host affinity rules being created with the same name. This can cause node pool creation to fail.
Workaround:
Remove the old redundant rules so that node pool creation can proceed. These rules are named [USER_CLUSTER_NAME]-[HASH].
gkectl repair admin-master may fail due to "failed to delete the admin master node object and reboot the admin master VM"

The gkectl repair admin-master command may fail due to a race condition, with the following error:

Failed to repair: failed to delete the admin master node object and reboot the admin master VM
Workaround:
This command is idempotent. It can safely be rerun until the command succeeds.
After you re-create or update a control-plane node, certain Pods might be left in the Failed
state due to NodeAffinity predicate failure. These failed Pods don't affect normal cluster operations or health.
Workaround:
You can safely ignore the failed Pods or manually delete them.
If you use prepared credentials and a private registry, but you haven't configured prepared credentials for your private registry, the OnPremUserCluster might not become ready, and you might see the following error message:
failed to check secret reference for private registry …
Workaround:
Prepare the private registry credentials for the user cluster according to the instructions in Configure prepared credentials.
gkectl upgrade admin fails with "StorageClass standard sets the parameter diskformat which is invalid for CSI Migration"

During gkectl upgrade admin, the storage preflight check for CSI Migration verifies that the StorageClasses don't have parameters that are ignored after CSI Migration. For example, if there's a StorageClass with the parameter diskformat, then gkectl upgrade admin flags the StorageClass and reports a failure in the preflight validation. Admin clusters created in Google Distributed Cloud 1.10 and earlier have a StorageClass with diskformat: thin, which fails this validation. However, this StorageClass still works fine after CSI Migration. These failures should be interpreted as warnings instead.
For more information, check the StorageClass parameter section in Migrating In-Tree vSphere Volumes to vSphere Container Storage Plug-in.
Workaround:
After confirming that your cluster has a StorageClass with parameters ignored after CSI Migration, run gkectl upgrade admin with the flag --skip-validation-cluster-health.
Under certain conditions disks can be attached as readonly to Windows nodes. This results in the corresponding volume being readonly inside a Pod. This problem is more likely to occur when a new set of nodes replaces an old set of nodes (for example, cluster upgrade or node pool update). Stateful workloads that previously worked fine might be unable to write to their volumes on the new set of nodes.
Workaround:
1. Get the Pod's UID, the PV name bound to its PVC, and the node where the Pod runs:

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get pod \
       POD_NAME --namespace POD_NAMESPACE \
       -o=jsonpath='{.metadata.uid}{"\n"}'

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get pvc \
       PVC_NAME --namespace POD_NAMESPACE \
       -o jsonpath='{.spec.volumeName}{"\n"}'

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get pods \
       --namespace POD_NAMESPACE \
       -o jsonpath='{.spec.nodeName}{"\n"}'

2. On the Windows node, in a PowerShell session, find the disk number for the volume:

   PS C:\Users\administrator> $pvname = "PV_NAME"
   PS C:\Users\administrator> $podid = "POD_UID"
   PS C:\Users\administrator> $disknum = (Get-Partition -Volume (Get-Volume -UniqueId ("\\?\"+(Get-Item (Get-Item "C:\var\lib\kubelet\pods\$podid\volumes\kubernetes.io~csi\$pvname\mount").Target).Target))).DiskNumber

3. Check whether the disk is set to readonly:

   PS C:\Users\administrator> (Get-Disk -Number $disknum).IsReadonly

   The result is True.

4. Set readonly to false:

   PS C:\Users\administrator> Set-Disk -Number $disknum -IsReadonly $false
   PS C:\Users\administrator> (Get-Disk -Number $disknum).IsReadonly

5. Delete the Pod so that it restarts:

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG delete pod POD_NAME \
       --namespace POD_NAMESPACE
vsphere-csi-secret is not updated after gkectl update credentials vsphere --admin-cluster

If you update the vSphere credentials for an admin cluster following updating cluster credentials, you might find that the vsphere-csi-secret under the kube-system namespace in the admin cluster still uses the old credential.
Workaround:
1. Get the vsphere-csi-secret secret name:

   kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system get secrets | grep vsphere-csi-secret

2. Update the data of the vsphere-csi-secret secret you got from the above step:

   kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system patch secret CSI_SECRET_NAME -p \
   "{\"data\":{\"config\":\"$( \
       kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system get secrets CSI_SECRET_NAME -o jsonpath='{.data.config}' \
           | base64 -d \
           | sed -e '/user/c user = \"VSPHERE_USERNAME_TO_BE_UPDATED\"' \
           | sed -e '/password/c password = \"VSPHERE_PASSWORD_TO_BE_UPDATED\"' \
           | base64 -w 0 \
       )\"}}"

3. Restart vsphere-csi-controller:

   kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system rollout restart deployment vsphere-csi-controller

4. Track the rollout status:

   kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system rollout status deployment vsphere-csi-controller

   After the rollout completes, the updated vsphere-csi-secret should be used by the controller.

audit-proxy crashloop when enabling Cloud Audit Logs with gkectl update cluster
audit-proxy might crashloop because of an empty --cluster-name. This behavior is caused by a bug in the update logic, where the cluster name is not propagated to the audit-proxy pod / container manifest.
Workaround:
For a control plane v2 user cluster with enableControlplaneV2: true
, connect to the user control plane machine using SSH, and update /etc/kubernetes/manifests/audit-proxy.yaml
with --cluster_name=USER_CLUSTER_NAME.
For a control plane v1 user cluster, edit the audit-proxy
container in the kube-apiserver
statefulset to add --cluster_name=USER_CLUSTER_NAME:
kubectl edit statefulset kube-apiserver -n USER_CLUSTER_NAME --kubeconfig=ADMIN_CLUSTER_KUBECONFIG
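For illustration, the flag as it would appear in the audit-proxy container args (a sketch of the relevant excerpt only, not the full manifest):

    containers:
    - name: audit-proxy
      args:
      - --cluster_name=USER_CLUSTER_NAME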
Control plane pods re-deployed right after gkectl upgrade cluster

Right after gkectl upgrade cluster, the control plane pods might be re-deployed again. The cluster state from gkectl list clusters changes from RUNNING to RECONCILING. Requests to the user cluster might time out.

This behavior is because the control plane certificate rotation happens automatically after gkectl upgrade cluster.
This issue only happens to user clusters that do NOT use control plane v2.
Workaround:
Wait for the cluster state to change back to RUNNING in gkectl list clusters, or upgrade to a version with the fix: 1.13.6+, 1.14.2+, or 1.15+.
Google Distributed Cloud 1.12.7-gke.19 is a bad release and you should not use it. The artifacts have been removed from the Cloud Storage bucket.
Workaround:
Use the 1.12.7-gke.20 release instead.
gke-connect-agent continues to use the older image after registry credential update
If you update the registry credential using one of the following methods:
gkectl update credentials componentaccess if not using a private registry
gkectl update credentials privateregistry if using a private registry
you might find that gke-connect-agent continues to use the older image, or that the gke-connect-agent pods fail to start due to ImagePullBackOff.
This issue will be fixed in Google Distributed Cloud releases 1.13.8, 1.14.4, and subsequent releases.
Workaround:
Option 1: Redeploy gke-connect-agent manually:
Delete the gke-connect namespace:
kubectl --kubeconfig=KUBECONFIG delete namespace gke-connect
Redeploy gke-connect-agent with the original register service account key (no need to update the key).
For the admin cluster:
gkectl update credentials register --kubeconfig=ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG_FILE --admin-cluster
For a user cluster:
gkectl update credentials register --kubeconfig=ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG_FILE
Option 2: You can manually change the data of the image pull secret regcred, which is used by the gke-connect-agent deployment:
kubectl --kubeconfig=KUBECONFIG -n=gke-connect patch secrets regcred -p "{\"data\":{\".dockerconfigjson\":\"$(kubectl --kubeconfig=KUBECONFIG -n=kube-system get secrets private-registry-creds -ojsonpath='{.data.\.dockerconfigjson}')\"}}"
Option 3: You can add the default image pull secret for your cluster to the gke-connect-agent deployment:
Copy the private-registry-creds secret to the gke-connect namespace:
kubectl --kubeconfig=KUBECONFIG -n=kube-system get secret private-registry-creds -oyaml | sed 's/ namespace: .*/ namespace: gke-connect/' | kubectl --kubeconfig=KUBECONFIG -n=gke-connect apply -f -
Get the gke-connect-agent deployment name:
kubectl --kubeconfig=KUBECONFIG -n=gke-connect get deployment | grep gke-connect-agent
Add the private-registry-creds secret to the gke-connect-agent deployment:
kubectl --kubeconfig=KUBECONFIG -n=gke-connect patch deployment DEPLOYMENT_NAME -p '{"spec":{"template":{"spec":{"imagePullSecrets": [{"name": "private-registry-creds"}, {"name": "regcred"}]}}}}'
When you validate the configuration before creating a cluster with a manual load balancer by running gkectl check-config, the command fails with the following error message:
- Validation Category: Manual LB
Running validation check for "Network configuration"...panic: runtime error: invalid memory address or nil pointer dereference
Workaround:
Option 1: Use patch version 1.13.7 or 1.14.4, which include the fix.
Option 2: Run the same command to validate the configuration, but skip the load balancer validation:
gkectl check-config --skip-validation-load-balancer
Clusters running etcd version 3.4.13 or earlier may experience watch starvation and non-operational resource watches, which can lead to the following problems:
These problems can make the cluster non-functional.
This issue is fixed in Google Distributed Cloud releases 1.12.7, 1.13.6, 1.14.3, and subsequent releases. These newer releases use etcd version 3.4.21. All prior versions of Google Distributed Cloud are affected by this issue.
Workaround
If you can't upgrade immediately, you can mitigate the risk of cluster failure by reducing the number of nodes in your cluster. Remove nodes until the etcd_network_client_grpc_sent_bytes_total metric is less than 300 MBps.
To view this metric in Metrics Explorer, select Kubernetes Container in the filter bar, and then use the submenus to select the metric etcd_network_client_grpc_sent_bytes_total.
At cluster restarts or upgrades, GKE Identity Service can get overwhelmed with traffic consisting of expired JWT tokens forwarded from the kube-apiserver to GKE Identity Service over the authentication webhook. Although GKE Identity Service doesn't crashloop, it becomes unresponsive and ceases to serve further requests. This problem ultimately leads to higher control plane latencies.
This issue is fixed in the following Google Distributed Cloud releases:
To determine if you're affected by this issue, perform the following steps:
curl -s -o /dev/null -w "%{http_code}" \
    -X POST https://CLUSTER_ENDPOINT/api/v1/namespaces/anthos-identity-service/services/https:ais:https/proxy/authenticate -d '{}'
Replace CLUSTER_ENDPOINT with the control plane VIP and control plane load balancer port for your cluster (for example, 172.16.20.50:443
).
If you're affected by this issue, the command returns a 400 status code. If the request times out, restart the ais Pod and rerun the curl command to see if that resolves the problem. If you get a status code of 000, the problem has been resolved and you are done. If you still get a 400 status code, the GKE Identity Service HTTP server isn't starting. In this case, continue.
kubectl logs -f -l k8s-app=ais -n anthos-identity-service \
    --kubeconfig KUBECONFIG
If the log contains an entry like the following, then you are affected by this issue:
I0811 22:32:03.583448 32 authentication_plugin.cc:295] Stopping OIDC authentication for ???. Unable to verify the OIDC ID token: JWT verification failed: The JWT does not appear to be from this identity provider. To match this provider, the 'aud' claim must contain one of the following audiences:
Check the kube-apiserver logs for your clusters. In the following commands, KUBE_APISERVER_POD is the name of the kube-apiserver Pod on the given cluster.
Admin cluster:
kubectl --kubeconfig ADMIN_KUBECONFIG logs \
    -n kube-system KUBE_APISERVER_POD kube-apiserver
User cluster:
kubectl --kubeconfig ADMIN_KUBECONFIG logs \
    -n USER_CLUSTER_NAME KUBE_APISERVER_POD kube-apiserver
If the kube-apiserver logs contain entries like the following, then you are affected by this issue:
E0811 22:30:22.656085 1 webhook.go:127] Failed to make webhook authenticator request: error trying to reach service: net/http: TLS handshake timeout
E0811 22:30:22.656266 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, error trying to reach service: net/http: TLS handshake timeout]"
Workaround
If you can't upgrade your clusters immediately to get the fix, you can identify and restart the offending pods as a workaround:
kubectl patch deployment ais -n anthos-identity-service --type=json \
    -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value":"--vmodule=cloud/identity/hybrid/charon/*=9"}]' \
    --kubeconfig KUBECONFIG
kubectl logs -f -l k8s-app=ais -n anthos-identity-service \
    --kubeconfig KUBECONFIG
kubectl -n kube-system get secret SA_SECRET \
    --kubeconfig KUBECONFIG \
    -o jsonpath='{.data.token}' | base64 --decode
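While you work through the steps above, it can help to keep probing the authenticate endpoint from the detection section. The following loop is a convenience sketch, not part of the documented procedure; it assumes curl is available and that CLUSTER_ENDPOINT is the same control plane VIP and port used earlier:
# Probe the endpoint every 30 seconds and print the HTTP status code
# (400 = HTTP server responding, 000 = timeout or connection failure).
while true; do
  code=$(curl -s -o /dev/null -w "%{http_code}" -m 10 \
    -X POST https://CLUSTER_ENDPOINT/api/v1/namespaces/anthos-identity-service/services/https:ais:https/proxy/authenticate -d '{}')
  echo "$(date -u) status=${code}"
  sleep 30
done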
The etcd maintenance pods that use the etcddefrag:gke_master_etcddefrag_20210211.00_p0 image are affected. The etcddefrag container opens a new connection to the etcd server during each defrag cycle, and the old connections are not cleaned up.
Workaround:
Option 1: Upgrade to the latest patch version from 1.8 to 1.11, which contains the fix.
Option 2: If you are using a patch version earlier than 1.9.6 or 1.10.3, scale down the etcd-maintenance pod for the admin and user clusters:
kubectl scale --replicas 0 deployment/gke-master-etcd-maintenance -n USER_CLUSTER_NAME --kubeconfig ADMIN_CLUSTER_KUBECONFIG
kubectl scale --replicas 0 deployment/gke-master-etcd-maintenance -n kube-system --kubeconfig ADMIN_CLUSTER_KUBECONFIG
Both the cluster health controller and the gkectl diagnose cluster command perform a set of health checks, including pod health checks across namespaces. However, they mistakenly skip the user control plane pods. If you use control plane v2 mode, this doesn't affect your cluster.
Workaround:
This won't affect any workload or cluster management. If you want to check the health of the control plane pods, you can run the following command:
kubectl get pods -owide -n USER_CLUSTER_NAME --kubeconfig ADMIN_CLUSTER_KUBECONFIG
k8s.gcr.io -> registry.k8s.io redirect
Kubernetes redirected traffic from k8s.gcr.io to registry.k8s.io on 3/20/2023. In Google Distributed Cloud 1.6.x and 1.7.x, the admin cluster upgrades use the container image k8s.gcr.io/pause:3.2. If you use a proxy for your admin workstation, the proxy doesn't allow registry.k8s.io, and the container image k8s.gcr.io/pause:3.2 is not cached locally, the admin cluster upgrades will fail when pulling the container image.
Workaround:
Add registry.k8s.io to the allowlist of the proxy for your admin workstation.
gkectl create loadbalancer fails with the following error message:
- Validation Category: Seesaw LB - [FAILURE] Seesaw validation: xxx cluster lb health check failed: LB "xxx.xxx.xxx.xxx" is not healthy: Get "http://xxx.xxx.xxx.xxx:xxx/healthz": dial tcp xxx.xxx.xxx.xxx:xxx: connect: no route to host
This happens because the Seesaw group file already exists, and the preflight check tries to validate a Seesaw load balancer that doesn't exist yet.
Workaround:
Remove the existing Seesaw group file for this cluster. The file name is seesaw-for-gke-admin.yaml for the admin cluster, and seesaw-for-{CLUSTER_NAME}.yaml for a user cluster.
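For example, for a user cluster the cleanup is a single file deletion. A sketch, assuming you run it on the admin workstation in the directory where gkectl generated the group file, with CLUSTER_NAME as a placeholder for your cluster's name:
# Remove the stale Seesaw group file, then re-run gkectl create loadbalancer.
rm seesaw-for-CLUSTER_NAME.yaml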
Google Distributed Cloud version 1.14 is susceptible to netfilter connection tracking (conntrack) table insertion failures when using Ubuntu or COS operating system images. Insertion failures lead to random application timeouts and can occur even when the conntrack table has room for new entries. The failures are caused by changes in kernel 5.15 and higher that restrict table insertions based on chain length.
To see if you are affected by this issue, you can check the in-kernel connection tracking system statistics on each node with the following command:
sudo conntrack -S
The response looks like this:
cpu=0 found=0 invalid=4 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0
cpu=1 found=0 invalid=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0
cpu=2 found=0 invalid=16 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0
cpu=3 found=0 invalid=13 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0
cpu=4 found=0 invalid=9 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0
cpu=5 found=0 invalid=1 insert=0 insert_failed=0 drop=0 early_drop=0 error=519 search_restart=0 clash_resolve=126 chaintoolong=0
...
If a chaintoolong value in the response is non-zero, you're affected by this issue.
Workaround
The short-term mitigation is to increase the size of both the netfilter hash table (nf_conntrack_buckets) and the netfilter connection tracking table (nf_conntrack_max). Use the following commands on each cluster node to increase the size of the tables:
sysctl -w net.netfilter.nf_conntrack_buckets=TABLE_SIZE
sysctl -w net.netfilter.nf_conntrack_max=TABLE_SIZE
Replace TABLE_SIZE with the new table size (the number of entries). The default table size value is 262144. We suggest that you set a value equal to 65,536 times the number of cores on the node. For example, if your node has eight cores, set the table size to 524288.
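The sizing rule lends itself to a small helper script. This is a sketch, not from the product documentation; it assumes nproc and sysctl are available on the node, and note that values set with sysctl -w do not persist across reboots unless you also write them to a file under /etc/sysctl.d/:
#!/bin/bash
# Size the conntrack tables at 65,536 entries per CPU core.
CORES="$(nproc)"
TABLE_SIZE="$((CORES * 65536))"
echo "Sizing conntrack tables for ${CORES} cores: ${TABLE_SIZE} entries"
sudo sysctl -w "net.netfilter.nf_conntrack_buckets=${TABLE_SIZE}"
sudo sysctl -w "net.netfilter.nf_conntrack_max=${TABLE_SIZE}"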
With Controlplane V2 enabled, calico-typha or anetd-operator might be scheduled to Windows nodes and get into a crash loop. The reason is that the two deployments tolerate all taints, including the Windows node taint.
Workaround:
Either upgrade to 1.13.3+, or run the following commands to edit the calico-typha or anetd-operator deployment:
# If dataplane v2 is not used.
kubectl edit deployment -n kube-system calico-typha --kubeconfig USER_CLUSTER_KUBECONFIG
# If dataplane v2 is used.
kubectl edit deployment -n kube-system anetd-operator --kubeconfig USER_CLUSTER_KUBECONFIG
Remove the following spec.template.spec.tolerations:
- effect: NoSchedule
  operator: Exists
- effect: NoExecute
  operator: Exists
And add the following toleration:
- key: node-role.kubernetes.io/master
  operator: Exists
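If you prefer a non-interactive edit, the same change can be made with a JSON merge patch, which replaces the tolerations list wholesale. This is a sketch, not from the product documentation; swap calico-typha for anetd-operator if you use Dataplane V2:
# Replace the deployment's tolerations with only the master-taint toleration.
kubectl patch deployment calico-typha -n kube-system \
    --kubeconfig USER_CLUSTER_KUBECONFIG --type merge \
    -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"node-role.kubernetes.io/master","operator":"Exists"}]}}}}'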
You might not be able to create a user cluster if you specify the privateRegistry section with credential fileRef. Preflight might fail with the following message:
[FAILURE] Docker registry access: Failed to login.
Workaround:
Delete the privateRegistry section in your user cluster config file, or specify the privateRegistry section with inline credentials this way:
privateRegistry:
  address: PRIVATE_REGISTRY_ADDRESS
  credentials:
    username: PRIVATE_REGISTRY_USERNAME
    password: PRIVATE_REGISTRY_PASSWORD
  caCertPath: PRIVATE_REGISTRY_CACERT_PATH
Dataplane V2 takes over load balancing and creates a kernel socket instead of a packet-based DNAT. This means that Cloud Service Mesh cannot do packet inspection, as the pod is bypassed and never uses iptables.
This manifests in kube-proxy free mode as loss of connectivity or incorrect traffic routing for services with Cloud Service Mesh, because the sidecar cannot do packet inspection.
This issue is present on all versions of Google Distributed Cloud 1.10, however some newer versions of 1.10 (1.10.2+) have a workaround.
Workaround:
Either upgrade to 1.11 for full compatibility, or, if running 1.10.2 or later, run:
kubectl edit cm -n kube-system cilium-config --kubeconfig USER_CLUSTER_KUBECONFIG
Add bpf-lb-sock-hostns-only: true to the configmap, and then restart the anetd daemonset:
kubectl rollout restart ds anetd -n kube-system --kubeconfig USER_CLUSTER_KUBECONFIG
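If you would rather script the change than edit the ConfigMap interactively, a merge patch sets the same key before the restart. A sketch using standard kubectl, nothing product-specific:
# Set bpf-lb-sock-hostns-only=true in the cilium-config ConfigMap.
kubectl patch configmap cilium-config -n kube-system \
    --kubeconfig USER_CLUSTER_KUBECONFIG --type merge \
    -p '{"data":{"bpf-lb-sock-hostns-only":"true"}}'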
kube-controller-manager might detach persistent volumes forcefully after 6 minutes
kube-controller-manager might time out when detaching PV/PVCs after 6 minutes, and forcefully detach the PV/PVCs. Detailed logs from kube-controller-manager show events similar to the following:
$ cat kubectl_logs_kube-controller-manager-xxxx | grep "DetachVolume started" | grep expired kubectl_logs_kube-controller-manager-gke-admin-master-4mgvr_--container_kube-controller-manager_--kubeconfig_kubeconfig_--request-timeout_30s_--namespace_kube-system_--timestamps:2023-01-05T16:29:25.883577880Z W0105 16:29:25.883446 1 reconciler.go:224] attacherDetacher.DetachVolume started for volume "pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^126f913b-4029-4055-91f7-beee75d5d34a") on node "sandbox-corp-ant-antho-0223092-03-u-tm04-ml5m8-7d66645cf-t5q8f" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching
To verify the issue, log into the node and run the following commands:
# See all the mounting points with disks
lsblk -f
# See some ext4 errors
sudo dmesg -T
In the kubelet log, errors like the following are displayed:
Error: GetDeviceMountRefs check failed for volume "pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^126f913b-4029-4055-91f7-beee75d5d34a") on node "sandbox-corp-ant-antho-0223092-03-u-tm04-ml5m8-7d66645cf-t5q8f" : the device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16/globalmount" is still mounted by other references [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16/globalmount
Workaround:
Connect to the affected node using SSH and reboot the node.
You might not be able to upgrade a cluster if you use a third-party CSI driver. The gkectl diagnose cluster command might return the following error:
"virtual disk "kubernetes.io/csi/csi.netapp.io^pvc-27a1625f-29e3-4e4f-9cd1-a45237cc472c" IS NOT attached to machine "cluster-pool-855f694cc-cjk5c" but IS listed in the Node.Status"
Workaround:
Perform the upgrade using the --skip-validation-all option.
gkectl repair admin-master creates the admin master VM without upgrading its VM hardware version
The admin master node created via gkectl repair admin-master may use a lower VM hardware version than expected. When the issue happens, you see the following error in the gkectl diagnose cluster report:
CSIPrerequisites [VM Hardware]: The current VM hardware versions are lower than vmx-15 which is unexpected. Please contact Anthos support to resolve this issue.
Workaround:
Shut down the admin master node, follow https://kb.vmware.com/s/article/1003746 to upgrade the node to the expected version described in the error message, and then start the node.
In systemd v244, systemd-networkd has a default behavior change on the KeepConfiguration configuration. Before this change, VMs did not send a DHCP lease release message to the DHCP server on shutdown or reboot. After this change, VMs send such a message and return the IPs to the DHCP server. As a result, the released IP may be reallocated to a different VM and/or a different IP may be assigned to the VM, resulting in IP conflict (at the Kubernetes level, not the vSphere level) and/or IP change on the VMs, which can break the clusters in various ways.
For example, you may see the following symptoms.
kubectl get nodes -o wide returns nodes with duplicate IPs:
NAME    STATUS     AGE   VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node1   Ready      28h   v1.22.8-gke.204   10.180.85.130   10.180.85.130   Ubuntu 20.04.4 LTS   5.4.0-1049-gkeop   containerd://1.5.13
node2   NotReady   71d   v1.22.8-gke.204   10.180.85.130   10.180.85.130   Ubuntu 20.04.4 LTS   5.4.0-1049-gkeop   containerd://1.5.13
calico-node fails to start with the following error:
2023-01-19T22:07:08.817410035Z 2023-01-19 22:07:08.817 [WARNING][9] startup/startup.go 1135: Calico node 'node1' is already using the IPv4 address 10.180.85.130.
2023-01-19T22:07:08.817514332Z 2023-01-19 22:07:08.817 [INFO][9] startup/startup.go 354: Clearing out-of-date IPv4 address from this node IP="10.180.85.130/24"
2023-01-19T22:07:08.825614667Z 2023-01-19 22:07:08.825 [WARNING][9] startup/startup.go 1347: Terminating
2023-01-19T22:07:08.828218856Z Calico node failed to start
Workaround:
Deploy the following DaemonSet on the cluster to revert the systemd-networkd
default behavior change. The VMs that run this DaemonSet will not release the IPs to the DHCP server on shutdown/reboot. The IPs will be freed automatically by the DHCP server when the leases expire.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: set-dhcp-on-stop
spec:
  selector:
    matchLabels:
      name: set-dhcp-on-stop
  template:
    metadata:
      labels:
        name: set-dhcp-on-stop
    spec:
      hostIPC: true
      hostPID: true
      hostNetwork: true
      containers:
      - name: set-dhcp-on-stop
        image: ubuntu
        tty: true
        command:
        - /bin/bash
        - -c
        - |
          set -x
          date
          while true; do
            export CONFIG=/host/run/systemd/network/10-netplan-ens192.network;
            grep KeepConfiguration=dhcp-on-stop "${CONFIG}" > /dev/null
            if (( $? != 0 )) ; then
              echo "Setting KeepConfiguration=dhcp-on-stop"
              sed -i '/\[Network\]/a KeepConfiguration=dhcp-on-stop' "${CONFIG}"
              cat "${CONFIG}"
              chroot /host systemctl restart systemd-networkd
            else
              echo "KeepConfiguration=dhcp-on-stop has already been set"
            fi;
            sleep 3600
          done
        volumeMounts:
        - name: host
          mountPath: /host
        resources:
          requests:
            memory: "10Mi"
            cpu: "5m"
        securityContext:
          privileged: true
      volumes:
      - name: host
        hostPath:
          path: /
      tolerations:
      - operator: Exists
        effect: NoExecute
      - operator: Exists
        effect: NoSchedule
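Assuming you save the manifest above as set-dhcp-on-stop.yaml (the file name is arbitrary), applying and verifying it is standard kubectl:
# Apply the DaemonSet and confirm a Pod is scheduled on every node.
kubectl apply -f set-dhcp-on-stop.yaml --kubeconfig USER_CLUSTER_KUBECONFIG
kubectl get daemonset set-dhcp-on-stop --kubeconfig USER_CLUSTER_KUBECONFIG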
This issue will only affect admin clusters which are upgraded from 1.11.x, and won't affect admin clusters which are newly created after 1.12.
After upgrading a 1.11.x cluster to 1.12.x, the component-access-sa-key field in the admin-cluster-creds secret is wiped out to empty. You can check this by running the following command:
kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get secret admin-cluster-creds -o yaml | grep 'component-access-sa-key'
After the component access service account key has been deleted, installing new user clusters or upgrading existing user clusters will fail. The following lists some error messages you might encounter:
"Failed to create the test VMs: failed to get service account key: service account is not configured."
gkectl prepare failed with error message: "Failed to prepare OS images: dialing: unexpected end of JSON input"
If you run gkectl update admin --enable-preview-user-cluster-central-upgrade to deploy the upgrade platform controller, the command fails with the message: "failed to download bundle to disk: dialing: unexpected end of JSON input" (You can see this message in the status field in the output of kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get onprembundle -oyaml).
Workaround:
Add the component access service account key back into the secret manually by running the following command:
kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get secret admin-cluster-creds -ojson | jq --arg casa "$(cat COMPONENT_ACCESS_SERVICE_ACCOUNT_KEY_PATH | base64 -w 0)" '.data["component-access-sa-key"]=$casa' | kubectl --kubeconfig ADMIN_KUBECONFIG apply -f -
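After the patch, re-running the check from earlier should now print a non-empty value; a quick sketch:
# component-access-sa-key should show a non-empty base64 value.
kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get secret admin-cluster-creds -o yaml | grep 'component-access-sa-key'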
For user clusters created with Controlplane V2 enabled, node pools with autoscaling enabled always use their autoscaling.minReplicas count from the user-cluster.yaml. The log of the cluster-autoscaler pod shows errors similar to the following:
> kubectl --kubeconfig $USER_CLUSTER_KUBECONFIG -n kube-system \
    logs $CLUSTER_AUTOSCALER_POD --container cluster-autoscaler
TIMESTAMP 1 gkeonprem_provider.go:73] error getting onprem user cluster ready status: Expected to get a onprem user cluster with id foo-user-cluster-gke-onprem-mgmt/foo-user-cluster
TIMESTAMP 1 static_autoscaler.go:298] Failed to get node infos for groups: Expected to get a onprem user cluster with id foo-user-cluster-gke-onprem-mgmt/foo-user-cluster
> kubectl --kubeconfig $USER_CLUSTER_KUBECONFIG -n kube-system \
    get pods | grep cluster-autoscaler
cluster-autoscaler-5857c74586-txx2c   4648017n   48076Ki   30s
Workaround:
Disable autoscaling in all the node pools with gkectl update cluster until upgrading to a version with the fix.
When users use CIDR in the IP block file, the config validation fails with the following error:
- Validation Category: Config Check
- [FAILURE] Config: AddressBlock for admin cluster spec is invalid: invalid IP: 172.16.20.12/30
Workaround:
Include individual IPs in the IP block file until upgrading to a version with the fix: 1.12.5, 1.13.4, 1.14.1+.
When updating the control plane OS image type in admin-cluster.yaml, and if the corresponding user cluster was created via Controlplane V2, the user control plane machines may not have finished their re-creation when the gkectl command finishes.
Workaround:
After the update is finished, keep waiting for the user control plane machines to also finish their re-creation by monitoring their node OS image types using kubectl --kubeconfig USER_KUBECONFIG get nodes -owide. For example, if updating from Ubuntu to COS, wait for all the control plane machines to completely change from Ubuntu to COS even after the update command completes.
An issue with Calico in Google Distributed Cloud 1.14.0 causes Pod creation and deletion to fail with the following error message in the output of kubectl describe pods:
error getting ClusterInformation: connection is unauthorized: Unauthorized
This issue is only observed 24 hours after the cluster is created or upgraded to 1.14 using Calico.
Admin clusters always use Calico. For user clusters, this is controlled by the enableDataPlaneV2 field in user-cluster.yaml: if that field is set to false, or not specified, the user cluster uses Calico.
The nodes' install-cni container creates a kubeconfig with a token that is valid for 24 hours. This token needs to be periodically renewed by the calico-node Pod. The calico-node Pod is unable to renew the token, as it doesn't have access to the directory that contains the kubeconfig file on the node.
Workaround:
This issue was fixed in Google Distributed Cloud version 1.14.1. Upgrade to this or a later version.
If you can't upgrade right away, apply the following patch on the calico-node DaemonSet in your admin and user clusters:
kubectl -n kube-system get daemonset calico-node \
    --kubeconfig ADMIN_CLUSTER_KUBECONFIG -o json \
    | jq '.spec.template.spec.containers[0].volumeMounts += [{"name":"cni-net-dir","mountPath":"/host/etc/cni/net.d"}]' \
    | kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f -

kubectl -n kube-system get daemonset calico-node \
    --kubeconfig USER_CLUSTER_KUBECONFIG -o json \
    | jq '.spec.template.spec.containers[0].volumeMounts += [{"name":"cni-net-dir","mountPath":"/host/etc/cni/net.d"}]' \
    | kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f -
Replace the following:
ADMIN_CLUSTER_KUBECONFIG: the path of the admin cluster kubeconfig file.
USER_CLUSTER_KUBECONFIG: the path of the user cluster kubeconfig file.
Cluster creation fails despite the user having the proper configuration. The user sees creation failing because the cluster doesn't have enough IPs.
Workaround:
Split CIDRs into several smaller CIDR blocks, such as 10.0.0.0/30 becoming 10.0.0.0/31 and 10.0.0.2/31. As long as there are N+1 CIDRs, where N is the number of nodes in the cluster, this should suffice.
When the always-on secrets encryption feature is enabled along with cluster backup, the admin cluster backup fails to include the encryption keys and configuration required by always-on secrets encryption feature. As a result, repairing the admin master with this backup using gkectl repair admin-master --restore-from-backup
causes the following error:
Validating admin master VM xxx ...
Waiting for kube-apiserver to be accessible via LB VIP (timeout "8m0s")...  ERROR
Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master
Waiting for kube-apiserver to be accessible via LB VIP (timeout "13m0s")...  ERROR
Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master
Waiting for kube-apiserver to be accessible via LB VIP (timeout "18m0s")...  ERROR
Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master
Repairing the admin master (gkectl repair admin-master) fails if the always-on secrets encryption feature was enabled using the gkectl update command: if the feature is not enabled at cluster creation, but enabled later via gkectl update, then gkectl repair admin-master fails to repair the admin cluster control plane node. We recommend enabling the always-on secrets encryption feature at cluster creation. There is no current mitigation.
Upgrading the first user cluster from 1.9 to 1.10 could recreate nodes in other user clusters under the same admin cluster. The recreation is performed in a rolling fashion.
The disk_label was removed from MachineTemplate.spec.template.spec.providerSpec.machineVariables, which unexpectedly triggered an update on all MachineDeployments.
Workaround:
Upgrading a user cluster to 1.10.0 might cause docker to restart frequently.
You can detect this issue by running kubectl describe node NODE_NAME --kubeconfig USER_CLUSTER_KUBECONFIG. A node condition shows whether docker restarts frequently. Here is an example output:
Normal   FrequentDockerRestart   41m (x2 over 141m)   systemd-monitor   Node condition FrequentDockerRestart is now: True, reason: FrequentDockerRestart
To understand the root cause, you need to SSH to the node that has the symptom and run commands like sudo journalctl --utc -u docker or sudo journalctl -x.
Workaround:
If you are using a Google Distributed Cloud version below 1.12, and have manually set up Google-managed Prometheus (GMP) components in the gmp-system namespace for your cluster, the components are not preserved when you upgrade to version 1.12.x.
From version 1.12, GMP components in the gmp-system namespace and CRDs are managed by the stackdriver object, with the enableGMPForApplications flag set to false by default. If you manually deploy GMP components in the namespace prior to upgrading to 1.12, the resources will be deleted by stackdriver.
Workaround:
Cluster snapshot missing resources in the system scenario
In the system scenario, the cluster snapshot doesn't include any resources under the default namespace.
However, some Kubernetes resources like Cluster API objects that are under this namespace contain useful debugging information. The cluster snapshot should include them.
Workaround:
You can manually run the following commands to collect the debugging information.
export KUBECONFIG=USER_CLUSTER_KUBECONFIG
kubectl get clusters.cluster.k8s.io -o yaml
kubectl get controlplanes.cluster.k8s.io -o yaml
kubectl get machineclasses.cluster.k8s.io -o yaml
kubectl get machinedeployments.cluster.k8s.io -o yaml
kubectl get machines.cluster.k8s.io -o yaml
kubectl get machinesets.cluster.k8s.io -o yaml
kubectl get services -o yaml
kubectl describe clusters.cluster.k8s.io
kubectl describe controlplanes.cluster.k8s.io
kubectl describe machineclasses.cluster.k8s.io
kubectl describe machinedeployments.cluster.k8s.io
kubectl describe machines.cluster.k8s.io
kubectl describe machinesets.cluster.k8s.io
kubectl describe services
USER_CLUSTER_KUBECONFIG is the user cluster's kubeconfig file.
When deleting, updating or upgrading a user cluster, node drain may be stuck in the following scenarios:
To identify the symptom, run the command below:
kubectl logs clusterapi-controllers-POD_NAME_SUFFIX --kubeconfig ADMIN_KUBECONFIG -n USER_CLUSTER_NAMESPACE
Here is a sample error message from the above command:
E0920 20:27:43.086567 1 machine_controller.go:250] Error deleting machine object [MACHINE]; Failed to delete machine [MACHINE]: failed to detach disks from VM "[MACHINE]": failed to convert disk path "kubevols" to UUID path: failed to convert full path "ds:///vmfs/volumes/vsan:[UUID]/kubevols": ServerFaultCode: A general system error occurred: Invalid fault
kubevols is the default directory for the vSphere in-tree driver. When there are no PVC/PV objects created, you may hit a bug where node drain gets stuck finding kubevols, since the current implementation assumes that kubevols always exists.
Workaround:
Create the directory kubevols in the datastore where the node is created. This is defined in the vCenter.datastore field in the user-cluster.yaml or admin-cluster.yaml files.
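One way to create the directory is with govc. This is a sketch, assuming your GOVC_URL, GOVC_USERNAME, and GOVC_PASSWORD environment variables already point at the vCenter, and that DATASTORE_NAME matches the vCenter.datastore value from your config:
# Create the kubevols folder at the root of the datastore.
govc datastore.mkdir -ds DATASTORE_NAME kubevols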
Cluster Autoscaler clusterrolebinding and clusterrole are deleted after deleting a user cluster
On user cluster deletion, the corresponding clusterrole and clusterrolebinding for cluster-autoscaler are also deleted. This affects all other user clusters on the same admin cluster with cluster autoscaler enabled, because the same clusterrole and clusterrolebinding are used for all cluster autoscaler pods within the same admin cluster.
The symptoms are the following errors in the cluster-autoscaler logs:
kubectl logs --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system \
    cluster-autoscaler
2023-03-26T10:45:44.866600973Z W0326 10:45:44.866463 1 reflector.go:424] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: failed to list *unstructured.Unstructured: onpremuserclusters.onprem.cluster.gke.io is forbidden: User "..." cannot list resource "onpremuserclusters" in API group "onprem.cluster.gke.io" at the cluster scope
2023-03-26T10:45:44.866646815Z E0326 10:45:44.866494 1 reflector.go:140] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: onpremuserclusters.onprem.cluster.gke.io is forbidden: User "..." cannot list resource "onpremuserclusters" in API group "onprem.cluster.gke.io" at the cluster scope
Workaround:
cluster-health-controller and vsphere-metrics-exporter do not work after deleting a user cluster
On user cluster deletion, the corresponding clusterrole is also deleted, which results in auto repair and the vSphere metrics exporter not working.
The symptoms are the following:
Errors in the cluster-health-controller logs:
kubectl logs --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system \
    cluster-health-controller
error retrieving resource lock default/onprem-cluster-health-leader-election: configmaps "onprem-cluster-health-leader-election" is forbidden: User "system:serviceaccount:kube-system:cluster-health-controller" cannot get resource "configmaps" in API group "" in the namespace "default": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-health-controller-role" not found
Errors in the vsphere-metrics-exporter logs:
kubectl logs --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system \
    vsphere-metrics-exporter
vsphere-metrics-exporter/cmd/vsphere-metrics-exporter/main.go:68: Failed to watch *v1alpha1.Cluster: failed to list *v1alpha1.Cluster: clusters.cluster.k8s.io is forbidden: User "system:serviceaccount:kube-system:vsphere-metrics-exporter" cannot list resource "clusters" in API group "cluster.k8s.io" in the namespace "default"
Workaround:
gkectl check-config fails at OS image validation
A known issue that could fail gkectl check-config when it is run without first running gkectl prepare. This is confusing because we suggest running gkectl check-config before gkectl prepare.
The symptom is that the gkectl check-config command fails with the following error message:
Validator result: {Status:FAILURE Reason:os images [OS_IMAGE_NAME] don't exist, please run `gkectl prepare` to upload os images. UnhealthyResources:[]}
Workaround:
Option 1: Run gkectl prepare to upload the missing OS images.
Option 2: Use gkectl check-config --skip-validation-os-images to skip the OS images validation.
gkectl update admin/cluster fails at updating anti affinity groups
A known issue that could fail gkectl update admin/cluster when updating anti affinity groups.
The symptom is that the gkectl update command fails with the following error message:
Waiting for machines to be re-deployed...  ERROR
Exit with error:
Failed to update the cluster: timed out waiting for the condition
Workaround:
Node registration fails during cluster creation, upgrade, update, and node auto repair when ipMode.type is static and the configured hostname in the IP block file contains one or more periods. In this case, Certificate Signing Requests (CSR) for a node are not automatically approved.
To see pending CSRs for a node, run the following command:
kubectl get csr -A -o wide
Check the following logs for error messages:
Logs from the clusterapi-controller-manager container in the clusterapi-controllers Pod.
In the admin cluster:
kubectl logs clusterapi-controllers-POD_NAME \
    -c clusterapi-controller-manager -n kube-system \
    --kubeconfig ADMIN_CLUSTER_KUBECONFIG
In the user cluster namespace:
kubectl logs clusterapi-controllers-POD_NAME \
    -c clusterapi-controller-manager -n USER_CLUSTER_NAME \
    --kubeconfig ADMIN_CLUSTER_KUBECONFIG
"msg"="failed to validate token id" "error"="failed to find machine for node node-worker-vm-1" "validate"="csr-5jpx9"
kubelet logs on the problematic node:
journalctl -u kubelet
"Error getting node" err="node \"node-worker-vm-1\" not found"
If you specify a domain name in the hostname field of an IP block file, any characters following the first period will be ignored. For example, if you specify the hostname as bob-vm-1.bank.plc, the VM hostname and node name will be set to bob-vm-1.
When node ID verification is enabled, the CSR approver compares the node name with the hostname in the Machine spec, and fails to reconcile the name. The approver rejects the CSR, and the node fails to bootstrap.
Workaround:
User cluster
Disable node ID verification by completing the following steps:
Add the following fields to your user cluster configuration file:
disableNodeIDVerification: true
disableNodeIDVerificationCSRSigning: true
Then update the cluster:
gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    --config USER_CLUSTER_CONFIG_FILE
Replace the following:
ADMIN_CLUSTER_KUBECONFIG: the path of the admin cluster kubeconfig file.
USER_CLUSTER_CONFIG_FILE: the path of your user cluster configuration file.
Admin cluster
Open the OnPremAdminCluster custom resource for editing:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    edit onpremadmincluster -n kube-system
Add the following annotation to the custom resource:
features.onprem.cluster.gke.io/disable-node-id-verification: enabled
Edit the kube-controller-manager manifest in the admin cluster control plane. Open the kube-controller-manager manifest for editing:
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
For controllers, change:
--controllers=*,bootstrapsigner,tokencleaner,-csrapproving,-csrsigning
to:
--controllers=*,bootstrapsigner,tokencleaner
Open the clusterapi-controllers deployment for editing:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    edit deployment clusterapi-controllers -n kube-system
Set node-id-verification-enabled and node-id-verification-csr-signing-enabled to false:
--node-id-verification-enabled=false
--node-id-verification-csr-signing-enabled=false
The admin cluster creation/upgrade is stuck at the following log forever and eventually times out:
Waiting for Machine gke-admin-master-xxxx to become ready...
The Cluster API controller log in the external cluster snapshot (see the 1.11 version of the documentation) includes the following entry:
Invalid value 'XXXX' specified for property startup-data
Here is an example file path for the Cluster API controller log:
kubectlCommands/kubectl_logs_clusterapi-controllers-c4fbb45f-6q6g6_--container_vsphere-controller-manager_--kubeconfig_.home.ubuntu..kube.kind-config-gkectl_--request-timeout_30s_--namespace_kube-system_--timestamps
VMware has a 64k vApp property size limit. In the identified versions, the data passed via the vApp property is close to the limit. When the private registry certificate contains a certificate bundle, it may cause the final data to exceed the 64k limit.
Workaround:
Only include the required certificates in the private registry certificate file configured in privateRegistry.caCertPath in the admin cluster config file, or upgrade to a version with the fix when available.
NetworkGatewayNodes marked unhealthy from concurrent status update conflict
In networkgatewaygroups.status.nodes, some nodes switch between NotHealthy and Up.
Logs for the ang-daemon Pod running on that node reveal repeated errors:
2022-09-16T21:50:59.696Z ERROR ANGd Failed to report status {"angNode": "kube-system/my-node", "error": "updating Node CR status: sending Node CR update: Operation cannot be fulfilled on networkgatewaynodes.networking.gke.io \"my-node\": the object has been modified; please apply your changes to the latest version and try again"}
The NotHealthy status prevents the controller from assigning additional floating IPs to the node. This can result in a higher burden on other nodes or a lack of redundancy for high availability.
Dataplane activity is otherwise not affected.
Contention on the networkgatewaygroup object causes some status updates to fail due to a fault in retry handling. If too many status updates fail, ang-controller-manager sees the node as past its heartbeat time limit and marks the node NotHealthy.
The fault in retry handling has been fixed in later versions.
Workaround:
Upgrade to a fixed version, when available.
A known issue that could cause the cluster upgrade or update to be stuck at waiting for the old machine object to be deleted. This is because the finalizer cannot be removed from the machine object. This affects any rolling update operation for node pools.
The symptom is that the gkectl command times out with the following error message:
E0821 18:28:02.546121 61942 console.go:87] Exit with error: E0821 18:28:02.546184 61942 console.go:87] error: timed out waiting for the condition, message: Node pool "pool-1" is not ready: ready condition is not true: CreateOrUpdateNodePool: 1/3 replicas are updated Check the status of OnPremUserCluster 'cluster-1-gke-onprem-mgmt/cluster-1' and the logs of pod 'kube-system/onprem-user-cluster-controller' for more detailed debugging information.
In the clusterapi-controller Pod logs, the errors look like the following:
$ kubectl logs clusterapi-controllers-[POD_NAME_SUFFIX] -n cluster-1 -c vsphere-controller-manager --kubeconfig [ADMIN_KUBECONFIG] | grep "Error removing finalizer from machine object" [...] E0821 23:19:45.114993 1 machine_controller.go:269] Error removing finalizer from machine object cluster-1-pool-7cbc496597-t5d5p; Operation cannot be fulfilled on machines.cluster.k8s.io "cluster-1-pool-7cbc496597-t5d5p": the object has been modified; please apply your changes to the latest version and try again
The error repeats for the same machine for several minutes. This can happen even in successful runs without the issue; most of the time the finalizer removal goes through quickly, but in rare cases it can be stuck in this race condition for several hours.
The issue is that the underlying VM is already deleted in vCenter, but the corresponding machine object cannot be removed: it is stuck at the finalizer removal due to very frequent updates from other controllers. This can cause the gkectl command to time out, but the controller keeps reconciling the cluster, so the upgrade or update process eventually completes.
Workaround:
We have prepared several different mitigation options for this issue, which depend on your environment and requirements.
Identify the affected machine object:
kubectl --kubeconfig CLUSTER_KUBECONFIG get machines
Annotate the affected machine:
kubectl annotate --kubeconfig CLUSTER_KUBECONFIG \
    machine MACHINE_NAME \
    onprem.cluster.gke.io/repair-machine=true
If you encounter this issue and the upgrade or update still can't complete after a long time, contact our support team for mitigations.
gkectl prepare OS image validation preflight failure
The gkectl prepare command failed with:
- Validation Category: OS Images - [FAILURE] Admin cluster OS images exist: os images [os_image_name] don't exist, please run `gkectl prepare` to upload os images.
The preflight checks of gkectl prepare included an incorrect validation.
Workaround:
Run the same command with the additional flag --skip-validation-os-images.
https:// or http:// prefix may cause cluster startup failure
Admin cluster creation failed with:
Exit with error: Failed to create root cluster: unable to apply admin base bundle to external cluster: error: timed out waiting for the condition, message: Failed to apply external bundle components: failed to apply bundle objects from admin-vsphere-credentials-secret 1.x.y-gke.z to cluster external: Secret "vsphere-dynamic-credentials" is invalid: [data[https://xxx.xxx.xxx.username]: Invalid value: "https://xxx.xxx.xxx.username": a valid config key must consist of alphanumeric characters, '-', '_' or '.' (e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+'), data[https://xxx.xxx.xxx.password]: Invalid value: "https://xxx.xxx.xxx.password": a valid config key must consist of alphanumeric characters, '-', '_' or '.' (e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+')]
The URL is used as part of a Secret key, which doesn't support "/" or ":".
Workaround:
Remove the https:// or http:// prefix from the vCenter.Address field in the admin cluster or user cluster config yaml.
gkectl prepare panic on util.CheckFileExists
gkectl prepare can panic with the following stacktrace:
panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xde0dfa] goroutine 1 [running]: gke-internal.googlesource.com/syllogi/cluster-management/pkg/util.CheckFileExists(0xc001602210, 0x2b, 0xc001602210, 0x2b) pkg/util/util.go:226 +0x9a gke-internal.googlesource.com/syllogi/cluster-management/gkectl/pkg/config/util.SetCertsForPrivateRegistry(0xc000053d70, 0x10, 0xc000f06f00, 0x4b4, 0x1, 0xc00015b400)gkectl/pkg/config/util/utils.go:75 +0x85 ...
The issue is that gkectl prepare created the private registry certificate directory with the wrong permissions.
Workaround:
To fix this issue, run the following commands on the admin workstation:
sudo mkdir -p /etc/docker/certs.d/PRIVATE_REGISTRY_ADDRESS
sudo chmod 0755 /etc/docker/certs.d/PRIVATE_REGISTRY_ADDRESS
gkectl repair admin-master and resumable admin upgrade do not work together
After a failed admin cluster upgrade attempt, don't run gkectl repair admin-master. Doing so may cause subsequent admin upgrade attempts to fail with issues such as admin master power-on failure or the VM being inaccessible.
Workaround:
If you've already encountered this failure scenario, contact support.
If the admin control plane machine isn't recreated after a resumed admin cluster upgrade attempt, the admin control plane VM template is deleted. The admin control plane VM template is the template of the admin master that is used to recover the control plane machine with gkectl repair admin-master.
Workaround:
The admin control plane VM template will be regenerated during the next admin cluster upgrade.
In version 1.12.0, cgroup v2 (unified) is enabled by default for Container Optimized OS (COS) nodes. This could potentially cause instability for your workloads in a COS cluster.
Workaround:
We switched back to cgroup v1 (hybrid) in version 1.12.1. If you are using COS nodes, we recommend that you upgrade to version 1.12.1 as soon as it is released.
gkectl update reverts any manual changes that you have made to the ClientConfig custom resource.
Workaround:
We strongly recommend that you back up the ClientConfig resource after every manual change.
gkectl check-config validation fails: can't find F5 BIG-IP partitions
Validation fails because F5 BIG-IP partitions can't be found, even though they exist.
An issue with the F5 BIG-IP API can cause validation to fail.
Workaround:
Try running gkectl check-config again.
You might see an installation failure due to cert-manager-cainjector in a crashloop when the apiserver/etcd is slow:
# These are logs from `cert-manager-cainjector`, from the command # `kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system cert-manager-cainjector-xxx` I0923 16:19:27.911174 1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election: timed out waiting for the condition E0923 16:19:27.911110 1 leaderelection.go:321] error retrieving resource lock kube-system/cert-manager-cainjector-leader-election-core: Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/cert-manager-cainjector-leader-election-core": context deadline exceeded I0923 16:19:27.911593 1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election-core: timed out waiting for the condition E0923 16:19:27.911629 1 start.go:163] cert-manager/ca-injector "msg"="error running core-only manager" "error"="leader election lost"
Workaround:
If vCenter, for versions lower than 7.0U2, is restarted after an upgrade or otherwise, the network name in the VM information from vCenter is incorrect, which results in the machine being in an Unavailable state. This eventually leads to the nodes being auto-repaired to create new ones.
Workaround:
This workaround is provided by VMware support:
For Google Distributed Cloud version 1.7.2 and above, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.
To meet the CIS rule "5.2.16 Ensure SSH Idle Timeout Interval is configured", /etc/ssh/sshd_config has the following settings:
ClientAliveInterval 300
ClientAliveCountMax 0
The purpose of these settings is to terminate a client session after 5 minutes of idle time. However, the ClientAliveCountMax 0 value causes unexpected behavior. When you use an SSH session on the admin workstation or a cluster node, the SSH connection might be disconnected even if your SSH client is not idle, such as when running a time-consuming command, and your command could get terminated with the following message:
Connection to [IP] closed by remote host. Connection to [IP] closed.
Workaround:
You can either:
Use nohup to prevent your command from being terminated on SSH disconnection:
nohup gkectl upgrade admin --config admin-cluster.yaml \
    --kubeconfig kubeconfig
Or update sshd_config to use a non-zero ClientAliveCountMax value. The CIS rule recommends using a value less than 3:
sudo sed -i 's/ClientAliveCountMax 0/ClientAliveCountMax 1/g' \
    /etc/ssh/sshd_config
sudo systemctl restart sshd
Make sure you reconnect your SSH session.
Conflicts with your own cert-manager installation
In 1.13 releases, monitoring-operator will install cert-manager in the cert-manager namespace. If for certain reasons you need to install your own cert-manager, follow the instructions below to avoid conflicts:
You only need to apply this workaround once for each cluster, and the changes will be preserved across cluster upgrades.
Note: One common symptom of installing your own cert-manager is that the cert-manager version or image (for example v1.7.2) may revert back to its older version. This is caused by monitoring-operator trying to reconcile cert-manager, and reverting the version in the process.
Workaround:
Avoid conflicts during upgrade
Uninstall your version of cert-manager. If you defined your own resources, you may want to back them up.
Perform the upgrade.
Follow the instructions below to restore your own cert-manager.
Restore your own cert-manager in user clusters
Scale the monitoring-operator Deployment to 0:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    -n USER_CLUSTER_NAME \
    scale deployment monitoring-operator --replicas=0
Scale the cert-manager deployments managed by monitoring-operator to 0:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
    -n cert-manager scale deployment cert-manager --replicas=0
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
    -n cert-manager scale deployment cert-manager-cainjector \
    --replicas=0
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
    -n cert-manager scale deployment cert-manager-webhook --replicas=0
Re-install your version of cert-manager. Restore your customized resources if you have them.
You can skip this step if your own cert-manager is installed in the cert-manager namespace. Otherwise, copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from cert-manager to the cluster resource namespace of your installed cert-manager:
relevant_fields='{
  apiVersion: .apiVersion,
  kind: .kind,
  metadata: {
    name: .metadata.name,
    namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
  },
  spec: .spec
}'
f1=$(mktemp)
f2=$(mktemp)
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
    get issuer -n cert-manager metrics-pki.cluster.local -o json \
    | jq "${relevant_fields}" > $f1
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
    get certificate -n cert-manager metrics-ca -o json \
    | jq "${relevant_fields}" > $f2
kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f1
kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f2
Restore your own cert-manager in admin clusters
In general, you shouldn't need to re-install cert-manager in admin clusters, because admin clusters only run Google Distributed Cloud control plane workloads. In the rare cases that you also need to install your own cert-manager in admin clusters, follow the instructions below to avoid conflicts. Note that if you are an Apigee customer and you only need cert-manager for Apigee, you do not need to run the admin cluster commands.
Scale the monitoring-operator deployment to 0:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    -n kube-system scale deployment monitoring-operator --replicas=0
Scale the cert-manager deployments managed by monitoring-operator to 0:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    -n cert-manager scale deployment cert-manager \
    --replicas=0
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    -n cert-manager scale deployment cert-manager-cainjector \
    --replicas=0
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    -n cert-manager scale deployment cert-manager-webhook \
    --replicas=0
Re-install your version of cert-manager. Restore your customized resources if you have them.
You can skip this step if your own cert-manager is installed in the cert-manager namespace. Otherwise, copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from cert-manager to the cluster resource namespace of your installed cert-manager:
relevant_fields='{
  apiVersion: .apiVersion,
  kind: .kind,
  metadata: {
    name: .metadata.name,
    namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
  },
  spec: .spec
}'
f3=$(mktemp)
f4=$(mktemp)
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    get issuer -n cert-manager metrics-pki.cluster.local -o json \
    | jq "${relevant_fields}" > $f3
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    get certificate -n cert-manager metrics-ca -o json \
    | jq "${relevant_fields}" > $f4
kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f3
kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f4
The Docker, containerd, and runc in the Ubuntu OS images shipped with Google Distributed Cloud are pinned to special versions using Ubuntu PPA. This ensures that any container runtime changes will be qualified by Google Distributed Cloud before each release.
However, the special versions are unknown to the Ubuntu CVE Tracker, which is used as the vulnerability feeds by various CVE scanning tools. Therefore, you will see false positives in Docker, containerd, and runc vulnerability scanning results.
For example, you might see the following false positives from your CVE scanning results. These CVEs are already fixed in the latest patch versions of Google Distributed Cloud.
Refer to the release notes for any CVE fixes.
Workaround:
Canonical is aware of this issue, and the fix is tracked at https://github.com/canonical/sec-cvescan/issues/73.
If you are upgrading non-HA clusters from 1.9 to 1.10, you might notice that kubectl exec, kubectl logs, and webhooks against user clusters might be unavailable for a short time. This downtime can be up to one minute. This happens because the incoming request (kubectl exec, kubectl logs, and webhook) is handled by the kube-apiserver for the user cluster. The user kube-apiserver is a StatefulSet. In a non-HA cluster, there is only one replica for the StatefulSet, so during an upgrade there is a chance that the old kube-apiserver is unavailable while the new kube-apiserver is not yet ready.
Workaround:
This downtime only happens during the upgrade process. If you want a shorter downtime during upgrades, we recommend switching to HA clusters.
If you are creating or upgrading an HA cluster and notice konnectivity readiness check failed in cluster diagnose, in most cases it will not affect the functionality of Google Distributed Cloud (kubectl exec, kubectl log and webhook). This happens because sometimes one or two of the konnectivity replicas might be unready for a period of time due to unstable networking or other issues.
Workaround:
Konnectivity will recover by itself. Wait for 30 minutes to 1 hour and rerun cluster diagnose.
/etc/cron.daily/aide CPU and memory spike issue
Starting from Google Distributed Cloud version 1.7.2, the Ubuntu OS images are hardened with the CIS L1 Server Benchmark.
As a result, the cron script /etc/cron.daily/aide has been installed so that an aide check is scheduled, ensuring that the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is regularly checked" is followed.
The cron job runs daily at 6:25 AM UTC. Depending on the number of files on the filesystem, you may experience CPU and memory usage spikes around that time caused by this aide process.
Workaround:
If the spikes are affecting your workload, you can disable the daily cron job:
sudo chmod -x /etc/cron.daily/aide
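If you would rather keep the integrity check but move it off the 6:25 AM UTC slot, you can disable the daily hook and schedule the same script yourself. This is a sketch, assuming a 23:00 run is acceptable; the /etc/cron.d format adds a user field before the command:
# Remove the script from the cron.daily run...
sudo chmod -x /etc/cron.daily/aide
# ...and invoke the same script at 23:00 instead.
echo '0 23 * * * root sh /etc/cron.daily/aide' | sudo tee /etc/cron.d/aide-check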
When deploying Google Distributed Cloud version 1.9 or later with the Seesaw bundled load balancer in an environment that uses NSX-T stateful distributed firewall rules, stackdriver-operator might fail to create the gke-metrics-agent-conf ConfigMap and cause gke-connect-agent Pods to be in a crash loop.
The underlying issue is that the stateful NSX-T distributed firewall rules terminate the connection from a client to the user cluster API server through the Seesaw load balancer because Seesaw uses asymmetric connection flows. The integration issues with NSX-T distributed firewall rules affect all Google Distributed Cloud releases that use Seesaw. You might see similar connection problems on your own applications when they create large Kubernetes objects whose sizes are bigger than 32K.
Workaround:
In the 1.13 version of the documentation, follow these instructions to disable NSX-T distributed firewall rules, or to use stateless distributed firewall rules for Seesaw VMs.
If your clusters use a manual load balancer, follow these instructions to configure your load balancer to reset client connections when it detects a backend node failure. Without this configuration, clients of the Kubernetes API server might stop responding for several minutes when a server instance goes down.
For Google Distributed Cloud versions 1.10 to 1.15, some customers have found unexpectedly high billing for Metrics volume on the Billing page. This issue affects you only when all of the following circumstances apply:
Application monitoring is enabled (enableStackdriverForApplications=true)
Application Pods have the prometheus.io/scrape=true annotation. (Installing Cloud Service Mesh can also add this annotation.)
To confirm whether you are affected by this issue, list your user-defined metrics. If you see billing for unwanted metrics with the external.googleapis.com/prometheus name prefix, and also see enableStackdriverForApplications set to true in the response of kubectl -n kube-system get stackdriver stackdriver -o yaml, then this issue applies to you.
Workaround
If you are affected by this issue, we recommend that you upgrade your clusters to version 1.12 or above, stop using the enableStackdriverForApplications flag, and switch to the new application monitoring solution, managed-service-for-prometheus, which no longer relies on the prometheus.io/scrape=true annotation. With the new solution, you can also control logs and metrics collection separately for your applications, with the enableCloudLoggingForApplications and enableGMPForApplications flags, respectively.
To stop using the enableStackdriverForApplications flag, open the stackdriver object for editing:
kubectl --kubeconfig=USER_CLUSTER_KUBECONFIG --namespace kube-system edit stackdriver stackdriver
Remove the enableStackdriverForApplications: true line, then save and close the editor.
If you can't switch away from the annotation based metrics collection, use the following steps:
Find the Pods and Services with the scrape annotation:
kubectl --kubeconfig KUBECONFIG \
    get pods -A -o yaml | grep 'prometheus.io/scrape: "true"'
kubectl --kubeconfig KUBECONFIG \
    get services -A -o yaml | grep 'prometheus.io/scrape: "true"'
Remove the prometheus.io/scrape=true annotation from the Pod or Service. If the annotation is added by Cloud Service Mesh, consider configuring Cloud Service Mesh without the Prometheus option, or turning off the Istio Metrics Merging feature.
When the role binding is incorrect, creating a vSphere data disk with govc hangs, and the disk is created with a size of 0. To fix the issue, bind the custom role at the vSphere vCenter level (root).
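To check whether you have hit this failure mode, you can list the cluster's files on the datastore and look for a zero-size disk; a sketch assuming govc is already configured, with the datastore and folder names as placeholders:

# list files and their sizes in the cluster's datastore folder
govc datastore.ls -l -ds=DATASTORE_NAME CLUSTER_FOLDER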
Workaround:
If you want to bind the custom role at the DC level (or lower than root), you also need to bind the read-only role to the user at the root vCenter level.
For more information on role creation, see vCenter user account privileges.
You might see high network traffic to monitoring.googleapis.com
, even in a new cluster that has no user workloads.
This issue affects version 1.10.0-1.10.1 and version 1.9.0-1.9.4. This issue is fixed in version 1.10.2 and 1.9.5.
Workaround:

Upgrade your cluster to a version with the fix: 1.9.5 or later, or 1.10.2 or later.
gke-metrics-agent has frequent CrashLoopBackOff errors

For Google Distributed Cloud version 1.10 and above, the `gke-metrics-agent` DaemonSet has frequent CrashLoopBackOff errors when `enableStackdriverForApplications` is set to `true` in the `stackdriver` object.
Workaround:
To mitigate this issue, disable application metrics collection by running the following commands. These commands will not disable application logs collection.
1. Scale down stackdriver-operator:

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
     --namespace kube-system scale deploy stackdriver-operator \
     --replicas=0
2. Open the gke-metrics-agent-conf ConfigMap for editing:

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
     --namespace kube-system edit configmap gke-metrics-agent-conf
3. Under services.pipelines, comment out the entire metrics/app-metrics section:

   services:
     pipelines:
       #metrics/app-metrics:
       #  exporters:
       #  - googlecloud/app-metrics
       #  processors:
       #  - resource
       #  - metric_to_resource
       #  - infer_resource
       #  - disk_buffer/app-metrics
       #  receivers:
       #  - prometheus/app-metrics
       metrics/metrics:
         exporters:
         - googlecloud/metrics
         processors:
         - resource
         - metric_to_resource
         - infer_resource
         - disk_buffer/metrics
         receivers:
         - prometheus/metrics
4. Restart the gke-metrics-agent DaemonSet:

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
     --namespace kube-system rollout restart daemonset gke-metrics-agent
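Optionally, wait for the restart to finish before checking Pod health; a small sketch using standard kubectl:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
  --namespace kube-system rollout status daemonset gke-metrics-agent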
If deprecated metrics are used in your OOTB dashboards, you will see some empty charts. To find deprecated metrics in the Monitoring dashboards, run the following commands:
gcloud monitoring dashboards list > all-dashboard.json

# find deprecated metrics
cat all-dashboard.json | grep -E \
'kube_daemonset_updated_number_scheduled\
|kube_node_status_allocatable_cpu_cores\
|kube_node_status_allocatable_pods\
|kube_node_status_capacity_cpu_cores'
The following deprecated metrics should be migrated to their replacements.
| Deprecated | Replacement |
|---|---|
| kube_daemonset_updated_number_scheduled | kube_daemonset_status_updated_number_scheduled |
| kube_node_status_allocatable_cpu_cores, kube_node_status_allocatable_memory_bytes, kube_node_status_allocatable_pods | kube_node_status_allocatable |
| kube_node_status_capacity_cpu_cores, kube_node_status_capacity_memory_bytes, kube_node_status_capacity_pods | kube_node_status_capacity |
| kube_hpa_status_current_replicas | kube_horizontalpodautoscaler_status_current_replicas |
Workaround:

Replace the deprecated metrics in your custom dashboards or alerting policies with their replacements from the table above. This deprecation is due to the upgrade of the kube-state-metrics agent from v1.9 to v2.4, which is required for Kubernetes 1.22. All deprecated kube-state-metrics metrics have the kube_ prefix.
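For example, a chart built on kube_node_status_allocatable_cpu_cores would move to the consolidated metric with a resource label. This is a sketch in the same query style as other examples on this page, assuming the kube-state-metrics v2 label scheme:

# before (kube-state-metrics v1.9)
kube_node_status_allocatable_cpu_cores

# after (kube-state-metrics v2.4)
kube_node_status_allocatable{resource="cpu", unit="core"}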
For Google Distributed Cloud version 1.10 and above, the data for clusters in Cloud Monitoring may contain irrelevant summary metrics entries such as the following:
Unknown metric: kubernetes.io/anthos/go_gc_duration_seconds_summary_percentile
Other metric types that may have irrelevant summary metrics include:

- apiserver_admission_step_admission_duration_seconds_summary
- go_gc_duration_seconds
- scheduler_scheduling_duration_seconds
- gkeconnect_http_request_duration_seconds_summary
- alertmanager_nflog_snapshot_duration_seconds_summary
While these summary type metrics are in the metrics list, they are not supported by gke-metrics-agent
at this time.
You might find that the following metrics are missing on some, but not all, nodes:
kubernetes.io/anthos/container_memory_working_set_bytes
kubernetes.io/anthos/container_cpu_usage_seconds_total
kubernetes.io/anthos/container_network_receive_bytes_total
Workaround:
To fix this issue, perform the following steps as a workaround. For versions 1.9.5+, 1.10.2+, and 1.11.0, increase the CPU for gke-metrics-agent by following steps 1 through 4:
1. Open your stackdriver resource for editing:

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
     --namespace kube-system edit stackdriver stackdriver
2. To increase the CPU request for gke-metrics-agent from 10m to 50m, and its CPU limit from 100m to 200m, add the following resourceAttrOverride section to the stackdriver manifest:

   spec:
     resourceAttrOverride:
       gke-metrics-agent/gke-metrics-agent:
         limits:
           cpu: 100m
           memory: 4608Mi
         requests:
           cpu: 10m
           memory: 200Mi
   Your edited manifest should look like the following example:

   spec:
     anthosDistribution: on-prem
     clusterLocation: us-west1-a
     clusterName: my-cluster
     enableStackdriverForApplications: true
     gcpServiceAccountSecretName: ...
     optimizedMetrics: true
     portable: true
     projectID: my-project-191923
     proxyConfigSecretName: ...
     resourceAttrOverride:
       gke-metrics-agent/gke-metrics-agent:
         limits:
           cpu: 200m
           memory: 4608Mi
         requests:
           cpu: 50m
           memory: 200Mi
3. Save your changes and close the editor.

4. To verify whether your changes have taken effect, run:

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
     --namespace kube-system get daemonset gke-metrics-agent -o yaml \
     | grep "cpu: 50m"

   The command returns cpu: 50m if your edits have taken effect.

If your admin cluster is affected by this issue, scheduler and controller-manager metrics are missing. For example, these two metrics are missing:
# scheduler metric example
scheduler_pending_pods

# controller-manager metric example
replicaset_controller_rate_limiter_use
Workaround:
Upgrade to v1.11.3+, v1.12.1+, or v1.13+.
If your user cluster is affected by this issue, scheduler and controller-manager metrics are missing. For example, these two metrics are missing:
# scheduler metric example
scheduler_pending_pods

# controller-manager metric example
replicaset_controller_rate_limiter_use
Workaround:
This issue is fixed in Google Distributed Cloud version 1.13.0 and later. Upgrade your cluster to a version with the fix.
If you create an admin cluster for version 1.9.x or 1.10.0, and if the admin cluster fails to register with the provided gkeConnect
spec during its creation, you will get the following error:

Failed to create root cluster: failed to register admin cluster: failed to register cluster: failed to apply Hub Membership: Membership API request failed: rpc error: code = PermissionDenied desc = Permission 'gkehub.memberships.get' denied on PROJECT_PATH
You will still be able to use this admin cluster, but you will get the following error if you later attempt to upgrade the admin cluster to version 1.10.y.
failed to migrate to first admin trust chain: failed to parse current version "": invalid version: ""
Workaround:
If you are using the GKE Identity Service feature to manage GKE Identity Service ClientConfig, the Connect Agent might restart unexpectedly.
Workaround:
If you have experienced this issue with an existing cluster, you can do one of the following:

- Disable the GKE Identity Service fleet feature:

  gcloud container fleet identity-service disable \
    --project PROJECT_ID
Seesaw runs in DSR mode, and by default it doesn't work in Cisco ACI because of data-plane IP learning.
Workaround:
A possible workaround is to disable IP learning by adding the Seesaw IP address as a L4-L7 Virtual IP in the Cisco Application Policy Infrastructure Controller (APIC).
You can configure the L4-L7 Virtual IP option by going to Tenant > Application Profiles > Application EPGs or uSeg EPGs. Failure to disable IP learning results in IP endpoint flapping between different locations in the Cisco ACI fabric.
VMware has recently identified critical issues with the following vSphere 7.0 Update 3 releases:
Workaround:
VMware has since removed these releases. You should upgrade your ESXi and vCenter Server versions to a newer release.
Cannot mount emptyDir volume as exec in Pod running on COS nodes

For Pods running on nodes that use Container-Optimized OS (COS) images, you cannot mount an emptyDir volume as exec. It mounts as noexec, and you will get the following error: exec user process caused: permission denied. For example, you will see this error message if you deploy the following test Pod:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: test
  name: test
spec:
  containers:
  - args:
    - sleep
    - "5000"
    image: gcr.io/google-containers/busybox:latest
    name: test
    volumeMounts:
    - name: test-volume
      mountPath: /test-volume
    resources:
      limits:
        cpu: 200m
        memory: 512Mi
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - emptyDir: {}
    name: test-volume
And in the test Pod, if you run mount | grep test-volume, it shows the noexec option:
/dev/sda1 on /test-volume type ext4 (rw,nosuid,nodev,noexec,relatime,commit=30)
Workaround:
Node pool replicas do not update once autoscaling has been enabled and disabled on a node pool.
Workaround:
Remove the cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size and cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size annotations from the machine deployment of the corresponding node pool, as shown in the sketch below.
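A minimal sketch of removing the annotations with kubectl; the namespace and MachineDeployment name are placeholders that you need to look up in your admin cluster:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAMESPACE \
  annotate machinedeployment MACHINE_DEPLOYMENT_NAME \
  cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size- \
  cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size-

The trailing hyphen after each key tells kubectl annotate to remove that annotation.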
From version 1.11, on the out-of-the-box monitoring dashboards, the Windows Pod status dashboard and Windows node status dashboard also show data from Linux clusters. This is because the Windows node and Pod metrics are also exposed on Linux clusters.
stackdriver-log-forwarder in constant CrashLoopBackOff

For Google Distributed Cloud versions 1.10, 1.11, and 1.12, the stackdriver-log-forwarder DaemonSet might have CrashLoopBackOff errors when there are broken buffered logs on the disk.
Workaround:
To mitigate this issue, you need to clean up the buffered logs on the nodes:
1. Stop the stackdriver-log-forwarder Pods by patching in a non-existing node selector:

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
     -n kube-system patch daemonset stackdriver-log-forwarder \
     -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
2. Deploy the following cleanup DaemonSet to remove the broken buffers:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
  -n kube-system apply -f - << EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit-cleanup
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluent-bit-cleanup
  template:
    metadata:
      labels:
        app: fluent-bit-cleanup
    spec:
      containers:
      - name: fluent-bit-cleanup
        image: debian:10-slim
        command: ["bash", "-c"]
        args:
        - |
          rm -rf /var/log/fluent-bit-buffers/
          echo "Fluent Bit local buffer is cleaned up."
          sleep 3600
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        securityContext:
          privileged: true
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: node-role.gke.io/observability
        effect: NoSchedule
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
EOF
3. Verify that the cleanup DaemonSet has cleaned up all the chunks. The output of the following two commands should be equal to the number of nodes in the cluster:

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
     logs -n kube-system -l app=fluent-bit-cleanup | grep "cleaned up" | wc -l

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
     -n kube-system get pods -l app=fluent-bit-cleanup --no-headers | wc -l
4. Delete the cleanup DaemonSet:

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
     -n kube-system delete ds fluent-bit-cleanup
5. Restart the stackdriver-log-forwarder Pods by removing the node selector:

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
     -n kube-system patch daemonset stackdriver-log-forwarder \
     --type json -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing"}]'
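To confirm the Pods come back, you can wait on the rollout; a small sketch using standard kubectl:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
  -n kube-system rollout status daemonset stackdriver-log-forwarder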
stackdriver-log-forwarder doesn't send logs to Cloud Logging

If you don't see logs in Cloud Logging from your clusters, and you notice the following error in your logs:

2023-06-02T10:53:40.444017427Z [2023/06/02 10:53:40] [error] [input chunk] chunk 1-1685703077.747168499.flb would exceed total limit size in plugin stackdriver.0
2023-06-02T10:53:40.444028047Z [2023/06/02 10:53:40] [error] [input chunk] no available chunk

then the logging agent is hitting its buffer chunk limit, which causes stackdriver-log-forwarder to not send logs. This issue occurs in all Google Distributed Cloud versions.

Workaround:
To mitigate this issue, you need to increase the resource limit on the logging agent.
1. Open your stackdriver resource for editing:

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
     --namespace kube-system edit stackdriver stackdriver
2. To increase the CPU request and limit for stackdriver-log-forwarder, add the following resourceAttrOverride section to the stackdriver manifest:

   spec:
     resourceAttrOverride:
       stackdriver-log-forwarder/stackdriver-log-forwarder:
         limits:
           cpu: 1200m
           memory: 600Mi
         requests:
           cpu: 600m
           memory: 600Mi
   Your edited manifest should look like the following example:

   spec:
     anthosDistribution: on-prem
     clusterLocation: us-west1-a
     clusterName: my-cluster
     enableStackdriverForApplications: true
     gcpServiceAccountSecretName: ...
     optimizedMetrics: true
     portable: true
     projectID: my-project-191923
     proxyConfigSecretName: ...
     resourceAttrOverride:
       stackdriver-log-forwarder/stackdriver-log-forwarder:
         limits:
           cpu: 1200m
           memory: 600Mi
         requests:
           cpu: 600m
           memory: 600Mi
3. To verify whether your changes have taken effect, run:

   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
     --namespace kube-system get daemonset stackdriver-log-forwarder -o yaml \
     | grep "cpu: 1200m"

   The command returns cpu: 1200m if your edits have taken effect.

There is a short period when a node is ready but its kubelet server certificate is not. kubectl exec and kubectl logs are unavailable during these tens of seconds, because it takes time for the new server certificate approver to see the updated valid IP addresses of the node.

This issue affects the kubelet server certificate only; it does not affect Pod scheduling.
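If you want to watch the certificate become available during this window, a sketch using standard kubectl, assuming the kubelet serving certificate is requested through the Kubernetes CSR API:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get csr | grep kubelet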
User cluster upgrade failed with:
.LBKind in body is required (Check the status of OnPremUserCluster 'cl-stg-gdl-gke-onprem-mgmt/cl-stg-gdl' and the logs of pod 'kube-system/onprem-user-cluster-controller' for more detailed debugging information.
The admin cluster is not fully upgraded, and its status version is still 1.10. The user cluster upgrade to 1.12 isn't blocked by any preflight check, and fails with a version skew issue.
Workaround:
Complete the admin cluster upgrade to 1.11 first, and then upgrade the user cluster to 1.12.
The gkectl diagnose cluster command fails with:
Checking VSphere Datastore FreeSpace...FAILURE Reason: vCenter datastore: [DATASTORE_NAME] insufficient FreeSpace, requires at least [NUMBER] GB
The validation of datastore free space should not be used for existing cluster node pools, and was added in gkectl diagnose cluster
by mistake.
Workaround:
You can ignore the error message, or skip the validation by using the --skip-validation-infra flag, as shown in the sketch below.
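A sketch of skipping the validation; the kubeconfig and cluster-name arguments follow the usual gkectl diagnose cluster invocation:

gkectl diagnose cluster \
  --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
  --cluster-name CLUSTER_NAME \
  --skip-validation-infra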
You may not be able to add a new user cluster if your admin cluster is set up with a MetalLB load balancer configuration.
The user cluster deletion process may get stuck for some reason, which results in an invalidation of the MetalLB ConfigMap. It won't be possible to add a new user cluster in this state.
Workaround:
You can force delete your user cluster.
If osImageType is set to cos for the admin cluster, and gkectl check-config is executed after admin cluster creation and before user cluster creation, it fails with:
Failed to create the test VMs: VM failed to get IP addresses on the network.
The test VM created for the user cluster check-config by default uses the same osImageType as the admin cluster, and the test VM is currently not compatible with COS.
Workaround:
To avoid the slow preflight check that creates the test VM, use:

gkectl check-config --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG --fast
This issue affects customers using Grafana in the admin cluster to monitor user clusters in Google Distributed Cloud versions 1.12.0 and 1.12.1. It comes from a mismatch of pushprox-client certificates in user clusters and the allowlist in the pushprox-server in the admin cluster. The symptom is pushprox-client in user clusters printing error logs like the following:
level=error ts=2022-08-02T13:34:49.41999813Z caller=client.go:166 msg="Error reading request:" err="invalid method \"RBAC:\""
Workaround:
gkectl repair admin-master does not provide the VM template to be used for recovery

The gkectl repair admin-master command fails with:
Failed to repair: failed to select the template: no VM templates is available for repairing the admin master (check if the admin cluster version >= 1.4.0 or contact support
gkectl repair admin-master is not able to fetch the VM template to be used for repairing the admin control plane VM if the name of the admin control plane VM ends with the characters t, m, p, or l.
Workaround:
Rerun the command with --skip-validation, as shown in the sketch below.
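A sketch of the rerun; the config and kubeconfig arguments follow the usual gkectl repair admin-master invocation:

gkectl repair admin-master \
  --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
  --config ADMIN_CLUSTER_CONFIG \
  --skip-validation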
Cloud Audit Logs needs a special permission setup that is currently only performed automatically for user clusters through GKE Hub. It is recommended to have at least one user cluster that uses the same project ID and service account as the admin cluster for Cloud Audit Logs, so that the admin cluster will have the required permission.

However, in cases where the admin cluster uses a different project ID or a different service account than any user cluster, audit logs from the admin cluster fail to be injected into Google Cloud. The symptom is a series of Permission Denied errors in the audit-proxy Pod in the admin cluster.
Workaround:
gkectl diagnose checking certificates failure

If your workstation does not have access to user cluster worker nodes, you will get the following failures when running gkectl diagnose:
Checking user cluster certificates...FAILURE
    Reason: 3 user cluster certificates error(s).
    Unhealthy Resources:
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
If your workstation does not have access to admin cluster worker nodes, you will get the following failures when running gkectl diagnose:
Checking admin cluster certificates...FAILURE
    Reason: 3 admin cluster certificates error(s).
    Unhealthy Resources:
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
Workaround:
It is safe to ignore these messages.
/var/log/audit/ filling up disk space on VMs

/var/log/audit/ is filled with audit logs. You can check the disk usage by running sudo du -h -d 1 /var/log/audit.
Certain gkectl commands on the admin workstation, for example gkectl diagnose snapshot, contribute to disk space usage.
Since Google Distributed Cloud v1.8, the Ubuntu image is hardened with the CIS Level 2 Benchmark. One of the compliance rules, "4.1.2.2 Ensure audit logs are not automatically deleted", enforces the auditd setting max_log_file_action = keep_logs, which results in all audit logs being kept on the disk.
Workaround:
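One option, if you accept deviating from the CIS keep_logs rule, is to switch auditd to log rotation; a sketch with hypothetical retention values:

# switch from keep_logs to rotation (hypothetical retention of 5 files)
sudo sed -i 's/max_log_file_action = keep_logs/max_log_file_action = rotate/' /etc/audit/auditd.conf
sudo sed -i 's/^num_logs = .*/num_logs = 5/' /etc/audit/auditd.conf
# some images refuse a manual auditd restart; reboot the VM in that case
sudo systemctl restart auditd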
NetworkGatewayGroup Floating IP conflicts with node address

Users are unable to create or update NetworkGatewayGroup objects because of the following validating webhook error:
[1] admission webhook "vnetworkgatewaygroup.kb.io" denied the request: NetworkGatewayGroup.networking.gke.io "default" is invalid: [Spec.FloatingIPs: Invalid value: "10.0.0.100": IP address conflicts with node address with name: "my-node-name"
In affected versions, the kubelet can erroneously bind to a floating IP address assigned to the node and report it as a node address in node.status.addresses
. The validating webhook checks NetworkGatewayGroup
floating IP addresses against all node.status.addresses
in the cluster and sees this as a conflict.
Workaround:
In the same cluster where creating or updating NetworkGatewayGroup objects is failing, temporarily disable the ANG validating webhook and submit your change:

1. Save the webhook config so that you can reapply it at the end:

   kubectl -n kube-system get validatingwebhookconfiguration \
     ang-validating-webhook-configuration -o yaml > webhook-config.yaml

2. Edit the webhook config:

   kubectl -n kube-system edit validatingwebhookconfiguration \
     ang-validating-webhook-configuration

3. Remove the vnetworkgatewaygroup.kb.io item from the webhook config list and close to apply the changes.

4. Create or update your NetworkGatewayGroup object.

5. Reapply the original webhook config:

   kubectl -n kube-system apply -f webhook-config.yaml
During an admin cluster upgrade attempt, the admin control plane VM might get stuck during creation. The admin control plane VM goes into an infinite waiting loop during the boot up, and you will see the following infinite loop error in the /var/log/cloud-init-output.log
file:
+ echo 'waiting network configuration is applied'
waiting network configuration is applied
++ get-public-ip
+++ ip addr show dev ens192 scope global
+++ head -n 1
+++ grep -v 192.168.231.1
+++ grep -Eo 'inet ([0-9]{1,3}\.){3}[0-9]{1,3}'
+++ awk '{print $2}'
++ echo
+ '[' -n '' ']'
+ sleep 1
+ echo 'waiting network configuration is applied'
waiting network configuration is applied
++ get-public-ip
+++ ip addr show dev ens192 scope global
+++ grep -Eo 'inet ([0-9]{1,3}\.){3}[0-9]{1,3}'
+++ awk '{print $2}'
+++ grep -v 192.168.231.1
+++ head -n 1
++ echo
+ '[' -n '' ']'
+ sleep 1
This is because when Google Distributed Cloud tries to get the node IP address in the startup script, it uses grep -v ADMIN_CONTROL_PLANE_VIP
to skip the admin cluster control-plane VIP which can be assigned to the NIC too. However, the command also skips over any IP address that has a prefix of the control-plane VIP, which causes the startup script to hang.
For example, suppose that the admin cluster control-plane VIP is 192.168.1.25. If the IP address of the admin cluster control-plane VM has the same prefix, for example, 192.168.1.254, then the control-plane VM will get stuck during creation. This issue can also be triggered if the broadcast address has the same prefix as the control-plane VIP, for example, 192.168.1.255.
Workaround:
1. On the stuck admin control plane VM, manually add the node IP address so that the startup script can find it:

   ip addr add ${ADMIN_CONTROL_PLANE_NODE_IP}/32 dev ens192

2. After the startup script unblocks and the VM finishes booting, remove the manually added address:

   ip addr del ${ADMIN_CONTROL_PLANE_NODE_IP}/32 dev ens192
DataDisk can't be mounted correctly to the admin cluster master node when using a COS image, and the state of an admin cluster using the COS image will be lost upon admin cluster upgrade or admin master repair. (An admin cluster using the COS image is a preview feature.)
Workaround:
Re-create the admin cluster with osImageType set to ubuntu_containerd.

After you create the admin cluster with osImageType set to cos, grab the admin cluster SSH key and SSH into the admin master node. The df -h result contains /dev/sdb1 98G 209M 93G 1% /opt/data, and the lsblk result contains -sdb1 8:17 0 100G 0 part /opt/data.
Name resolution fails for .local domains

In Google Distributed Cloud version 1.10.0, name resolutions on Ubuntu are routed to local systemd-resolved listening on 127.0.0.53
by default. The reason is that on the Ubuntu 20.04 image used in version 1.10.0, /etc/resolv.conf
is sym-linked to /run/systemd/resolve/stub-resolv.conf
, which points to the 127.0.0.53
localhost DNS stub.
As a result, the localhost DNS name resolution refuses to check the upstream DNS servers (specified in /run/systemd/resolve/resolv.conf
) for names with a .local
suffix, unless the names are specified as search domains.
This causes any lookups for .local
names to fail. For example, during node startup, kubelet
fails on pulling images from a private registry with a .local
suffix. Specifying a vCenter address with a .local
suffix will not work on an admin workstation.
Workaround:
You can avoid this issue for cluster nodes if you specify the searchDomainsForDNS
field in your admin cluster configuration file and the user cluster configuration file to include the domains.
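A sketch of the relevant configuration snippet; the placement under network.hostConfig and the example domain are assumptions for illustration:

network:
  hostConfig:
    searchDomainsForDNS:
    - "my-registry.local"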
gkectl update doesn't yet support updating the searchDomainsForDNS field.
Therefore, if you haven't set up this field before cluster creation, you must SSH into the nodes and bypass the local systemd-resolved stub by changing the symlink of /etc/resolv.conf
from /run/systemd/resolve/stub-resolv.conf
(which contains the 127.0.0.53
local stub) to /run/systemd/resolve/resolv.conf
(which points to the actual upstream DNS):
sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf
As for the admin workstation, gkeadm doesn't support specifying search domains, so you must work around this issue with this manual step.

This solution does not persist across VM re-creations. You must reapply this workaround whenever VMs are re-created.
Docker bridge IP uses 172.17.0.1/16 instead of 169.254.123.1/24

Google Distributed Cloud specifies a dedicated subnet for the Docker bridge IP address with --bip=169.254.123.1/24, so that it won't reserve the default 172.17.0.1/16 subnet. However, in version 1.10.0, there is a bug in the Ubuntu OS image that caused the customized Docker config to be ignored.
As a result, Docker picks the default 172.17.0.1/16
as its bridge IP address subnet. This might cause an IP address conflict if you already have workload running within that IP address range.
Workaround:
To work around this issue, you must rename the following systemd config file for dockerd, and then restart the service:
sudo mv /etc/systemd/system/docker.service.d/50-cloudimg-settings.cfg \
  /etc/systemd/system/docker.service.d/50-cloudimg-settings.conf
sudo systemctl daemon-reload
sudo systemctl restart docker
Verify that Docker picks the correct bridge IP address:
ip a | grep docker0
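With the corrected config, the output should show the dedicated subnet instead of the default one; the interface details here are illustrative:

inet 169.254.123.1/24 brd 169.254.123.255 scope global docker0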
This solution does not persist across VM re-creations. You must reapply this workaround whenever VMs are re-created.
In Google Distributed Cloud version 1.11.0, there are changes in the definition of custom resources related to logging and monitoring:

- The group of the stackdriver custom resource changed from addons.sigs.k8s.io to addons.gke.io.
- The group of the monitoring and metricsserver custom resources changed from addons.k8s.io to addons.gke.io.
- The stackdriver custom resource now needs to have string-type values for the CPU, memory, and storage size requests and limits.

The group name changes are made to comply with CustomResourceDefinition updates in Kubernetes 1.22.
There is no action required if you do not have additional logic that applies or edits the affected custom resources. The Google Distributed Cloud upgrade process will take care of the migration of the affected resources and keep their existing specs after the group name change.
However, if you run any logic that applies or edits the affected resources, special attention is needed. First, the resources need to be referenced with the new group name in your manifest file. For example:

apiVersion: addons.gke.io/v1alpha1  # instead of `addons.sigs.k8s.io/v1alpha1`
kind: Stackdriver
Secondly, make sure the resourceAttrOverride
and storageSizeOverride
spec values are of string type. For example:
spec:
  resourceAttrOverride:
    stackdriver-log-forwarder/stackdriver-log-forwarder:
      limits:
        cpu: 1000m  # or "1"
        # cpu: 1  # an integer value like this would not work
        memory: 3000Mi
Otherwise, the applies and edits will not take effect and may lead to unexpected status in logging and monitoring components. Potential symptoms may include:

- Reconciliation errors in onprem-user-cluster-controller, for example:

  potential reconciliation error: Apply bundle components failed, requeue after 10s, error: failed to apply addon components: failed to apply bundle objects from stackdriver-operator-addon 1.11.2-gke.53 to cluster my-cluster: failed to create typed live object: .spec.resourceAttrOverride.stackdriver-log-forwarder/stackdriver-log-forwarder.limits.cpu: expected string, got &value.valueUnstructured{Value:1}

- Errors when running kubectl edit stackdriver stackdriver, for example:

  Error from server (NotFound): stackdrivers.addons.gke.io "stackdriver" not found
If you encounter the above errors, it means an unsupported type under the stackdriver CR spec was already present before the upgrade. As a workaround, you can manually edit the stackdriver CR under the old group name (kubectl edit stackdrivers.addons.sigs.k8s.io stackdriver) and do the following:

1. Change the resource requests and limits to string type.
2. Remove the addons.gke.io/migrated-and-deprecated: true annotation if present.
annotation if present.Whenever there is a fault in a ESXi server and the vCenter HA function has been enabled for the server, all VMs in the faulty ESXi server trigger the vMotion mechanism and are moved to another normal ESXi server. Migrated COS VMs would lose their IPs.
Workaround:
Reboot the VM.
The periodic GARP (Gratuitous ARP) sent by Seesaw every 20 seconds doesn't set the target IP in the ARP header. Some networks might not accept such packets (like Cisco ACI). This can cause longer service downtime after a split brain (due to VRRP packet drops) is recovered.
Workaround:
Trigger a Seesaw failover by running sudo seesaw -c failover on either of the Seesaw VMs. This should restore the traffic.
"staticPodPath" was mistakenly set for worker nodes
Workaround:
Manually create the folder /etc/kubernetes/manifests on the worker nodes, as shown in the sketch below.
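A minimal sketch, assuming SSH access to each affected worker node:

sudo mkdir -p /etc/kubernetes/manifests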