Scale to zero using KEDA


This tutorial shows how to scale your GKE workloads down to zero Pods using KEDA. Scaling the deployments to zero Pods saves resources during periods of inactivity (such as weekends and non-office hours), or for intermittent workloads such as periodic jobs.

Objectives

This tutorial describes the following use cases:

  • Scale your Pub/Sub workload to zero: Scale the number of Pods in proportion to the number of messages queued on the Pub/Sub topic. When the queue is empty, the workload automatically scales down to zero Pods.
  • Scale your LLM workload to zero. Deploy your LLM model servers on nodes with GPU. When the service is idle, the workload automatically scales down to zero Pods.

Costs

In this document, you use the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

In this tutorial, you use Cloud Shell to run commands. Cloud Shell is a shell environment for managing resources hosted on Google Cloud. It comes preinstalled with the Google Cloud CLI, kubectl, Helm and Terraform command-line tools. If you don't use Cloud Shell, you must install the Google Cloud CLI and Helm.

  1. To run the commands on this page, set up the gcloud CLI in one of the following development environments:

    Cloud Shell

    To use an online terminal with the gcloud CLI already set up, activate Cloud Shell:

    At the bottom of this page, a Cloud Shell session starts and displays a command-line prompt. It can take a few seconds for the session to initialize.

    Local shell

    To use a local development environment, follow these steps:

    1. Install the gcloud CLI.
    2. Initialize the gcloud CLI.
    3. Install Helm, a Kubernetes package management tool.
  2. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  3. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  4. Make sure that billing is enabled for your Google Cloud project.

  5. Enable the Resource Manager, Compute Engine, GKE, Pub/Sub APIs.

    Enable the APIs


Setting up your environment

To set up your environment with Cloud Shell, follow these steps:

  1. Set environment variables:

    export PROJECT_ID=PROJECT_ID
    export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format 'get(projectNumber)')
    export LOCATION=LOCATION

    Replace PROJECT_ID with your Google Cloud project ID and LOCATION with the region or zone where your GKE cluster should be created.

    If you don't follow the entire tutorial in a single session, or if your environment variables become unset, run these commands again to reset them.

  2. Create a Standard GKE cluster with cluster autoscaling and Workload Identity Federation for GKE enabled:

    gcloud container clusters create scale-to-zero \
        --project=${PROJECT_ID} --location=${LOCATION} \
        --machine-type=n1-standard-2 \
        --enable-autoscaling --min-nodes=1 --max-nodes=5 \
        --workload-pool=${PROJECT_ID}.svc.id.goog
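Because later sections depend on these variables, a small guard can confirm they are still set before you run any gcloud commands. This helper is not part of the tutorial; it's a hypothetical sketch that assumes bash:

```shell
# Hypothetical helper (not part of the tutorial): fail fast if any required
# environment variable is unset before running the gcloud commands below.
check_env() {
  local missing=0
  for var in PROJECT_ID PROJECT_NUMBER LOCATION; do
    # ${!var} is bash indirect expansion: the value of the variable named by $var.
    if [ -z "${!var:-}" ]; then
      echo "ERROR: $var is not set" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] && echo "All variables set"
}
```

Run `check_env` before each section; if it prints an error, rerun the export commands in step 1.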

Install KEDA

KEDA is a component that complements Kubernetes Horizontal Pod Autoscaler. With KEDA, you can scale a Deployment to zero Pods and up from zero Pods to one Pod. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster. The standard Horizontal Pod Autoscaler algorithm applies after GKE creates at least one Pod.

After GKE scales the Deployment to zero Pods, because no Pods are running, autoscaling cannot rely on Pod metrics such as CPU utilization. As a consequence, KEDA allows fetching metrics originating from outside the cluster using an implementation of the Kubernetes External Metrics API. You can use this API to autoscale based on metrics such as the number of outstanding messages on a Pub/Sub subscription. See the KEDA documentation for a list of all supported metric sources.

Install KEDA on your cluster with Helm or with kubectl.

Helm

Run the following commands to add the KEDA Helm repository, install the KEDA Helm chart, and give the KEDA service account read access to Cloud Monitoring:

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --create-namespace --namespace keda
gcloud projects add-iam-policy-binding projects/${PROJECT_ID} \
    --role roles/monitoring.viewer \
    --member=principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/keda/sa/keda-operator

Note that this command also sets up authorization rules that require the cluster to be set up with Workload Identity Federation for GKE.

kubectl

Run the following commands to install KEDA using kubectl apply and to give the KEDA service account read access to Cloud Monitoring:

kubectl apply --server-side -f https://github.com/kedacore/keda/releases/download/v2.15.1/keda-2.15.1.yaml
gcloud projects add-iam-policy-binding projects/${PROJECT_ID} \
    --role roles/monitoring.viewer \
    --member=principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/keda/sa/keda-operator

Note that this command also sets up authorization rules that require the cluster to be set up with Workload Identity Federation for GKE.

Confirm that all KEDA resources appear under the keda namespace:

kubectl get all -n keda

For more information about KEDA design and resources, see the KEDA documentation.

Scale your Pub/Sub workload to zero

This section describes a workload that processes messages from a Pub/Sub subscription, handling each message and acknowledging its completion. The workload scales dynamically: as the number of unacknowledged messages increases, autoscaling instantiates more Pods to ensure timely processing.

Scaling to zero ensures that no Pods are instantiated when no messages have been received for a while. This saves resources as no Pods stay idle for long periods of time.

Deploy a Pub/Sub workload

Deploy a sample workload that processes messages queued on a Pub/Sub topic. To simulate a realistic workload, this sample program waits three seconds before acknowledging a message. The workload is configured to run under the keda-pubsub-sa service account.

Run the following commands to create the Pub/Sub topic and subscription, configure their permissions, and create the Deployment that starts the workload in the keda-pubsub namespace.

gcloud pubsub topics create keda-echo
gcloud pubsub subscriptions create keda-echo-read --topic=keda-echo
gcloud projects add-iam-policy-binding projects/${PROJECT_ID} \
    --role=roles/pubsub.subscriber \
    --member=principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/keda-pubsub/sa/keda-pubsub-sa
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/refs/heads/main/cost-optimization/gke-keda/cloud-pubsub/deployment/keda-pubsub-with-workload-identity.yaml

Configure scale-to-zero

To configure your Pub/Sub workload to scale to zero, use KEDA to define a ScaledObject resource to specify how the deployment should scale. KEDA will then automatically create and manage the underlying HorizontalPodAutoscaler (HPA) object.

  1. Create the ScaledObject resource to describe the expected autoscaling behavior:

    curl https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/refs/heads/main/cost-optimization/gke-keda/cloud-pubsub/deployment/keda-pubsub-scaledobject.yaml | envsubst | kubectl apply -f -

    This creates the following object:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: keda-pubsub
      namespace: keda-pubsub
    spec:
      maxReplicaCount: 5
      scaleTargetRef:
        name: keda-pubsub
      triggers:
        - type: gcp-pubsub
          authenticationRef:
            name: keda-auth
          metadata:
            subscriptionName: "projects/${PROJECT_ID}/subscriptions/keda-echo-read"
  2. Inspect the HorizontalPodAutoscaler (HPA) object that KEDA creates based on the ScaledObject object:

    kubectl get hpa keda-hpa-keda-pubsub -n keda-pubsub -o yaml

    You can read more about autoscaling in the Kubernetes documentation.

  3. Wait until KEDA acknowledges that the Pub/Sub subscription is empty, and scales the Deployment to zero replicas.

    Inspect the workload autoscaler:

    kubectl describe hpa keda-hpa-keda-pubsub -n keda-pubsub

    Observe that in the command response, the ScalingActive condition is false. The associated message shows that the Horizontal Pod Autoscaler acknowledges that KEDA scaled the deployment to zero, at which point it stops operating until the Deployment scales back up to one Pod.

    Name:             keda-hpa-keda-pubsub
    Namespace:        keda-pubsub
    Metrics:          ( current / target )
      "s0-gcp-ps-projects-[...]" (target average value):  0 / 10
    Min replicas:     1
    Max replicas:     5
    Deployment pods:  5 current / 5 desired
    Conditions:
      Type            Status  Reason               Message
      ----            ------  ------               -------
      AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one [...]
      ScalingActive   False   ScalingDisabled      scaling is disabled since the replica count of the target is zero
      ScalingLimited  True    TooManyReplicas      the desired replica count is more than the maximum replica count

Trigger the scale-up

To trigger the Deployment to scale up:

  1. Enqueue messages on the Pub/Sub topic:

    for num in {1..20}
    do
      gcloud pubsub topics publish keda-echo --project=${PROJECT_ID} --message="Test"
    done
  2. Verify that the Deployment is scaling up:

    kubectl get deployments -n keda-pubsub

    In the output, observe that the 'Ready' column shows one replica:

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    keda-pubsub   1/1     1            1           2d

KEDA scales up the Deployment after it observes that the queue is not empty.
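If you'd rather not rerun the check from step 2 by hand, the wait can be scripted. This is a hypothetical sketch (not part of the tutorial) that assumes kubectl is configured for the cluster created above:

```shell
# Hypothetical sketch: poll the keda-pubsub Deployment until at least one
# replica is ready, or give up after roughly 120 seconds.
wait_for_scale_up() {
  local deadline=$(( $(date +%s) + 120 ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    local replicas
    replicas="$(kubectl get deployment keda-pubsub -n keda-pubsub \
        -o jsonpath='{.status.readyReplicas}' 2>/dev/null)"
    if [ "${replicas:-0}" -gt 0 ]; then
      echo "Scaled up: ${replicas} ready replica(s)"
      return 0
    fi
    sleep 5
  done
  echo "Timed out waiting for scale-up" >&2
  return 1
}
```

Call `wait_for_scale_up` right after publishing the messages; it returns as soon as the first Pod reports ready.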

Scale your LLM workload to zero

This section describes a Large Language Model (LLM) workload that deploys an Ollama server with an attached GPU. Ollama lets you run popular LLMs such as Gemma and Llama 2, and exposes its features primarily over HTTP.

Install KEDA-HTTP add-on

Scaling an HTTP service down to zero Pods during periods of inactivity causes request failures, since there's no backend to handle the requests.

This section shows how to solve this problem using the KEDA-HTTP add-on. KEDA-HTTP starts an HTTP proxy that receives user requests and forwards them to the Services configured to scale-to-zero. When the Service has no Pod, the proxy triggers the Service to scale up, and buffers the request until the Service has scaled up to at least one Pod.

Install the KEDA-HTTP add-on using Helm. For more information, refer to KEDA-HTTP documentation.

helm repo add kedacore https://kedacore.github.io/charts
helm repo update

# Set the proxy timeout to 120s, giving Ollama time to start.
helm install http-add-on kedacore/keda-add-ons-http \
    --create-namespace --namespace keda \
    --set interceptor.responseHeaderTimeout=120s

Deploy an Ollama LLM workload

To deploy an Ollama LLM workload:

  1. Create a node pool containing g2-standard-4 nodes with attached GPUs, and configure cluster autoscaling to provide between zero and two nodes:

    gcloud container node-pools create gpu --machine-type=g2-standard-4 \
        --location=${LOCATION} --cluster=scale-to-zero \
        --min-nodes 0 --max-nodes 2 --num-nodes=1 --enable-autoscaling
  2. Add the official Ollama Helm chart repository, and update your local Helm client's repository:

    helm repo add ollama-helm https://otwld.github.io/ollama-helm/
    helm repo update
  3. Deploy the Ollama server using the Helm chart:

    helm install ollama ollama-helm/ollama --create-namespace --namespace ollama \
        -f https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/refs/heads/main/cost-optimization/gke-keda/ollama/helm-values-ollama.yaml

    The helm-values-ollama.yaml configuration specifies the LLM models to load, the GPU requirements, and the TCP port for the Ollama server.

Configure scale-to-zero

To configure your Ollama workload to scale to zero, define a KEDA-HTTP HTTPScaledObject resource that specifies how the Deployment should scale.

  1. Create the HTTPScaledObject resource to describe the expected autoscaling behavior:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/refs/heads/main/cost-optimization/gke-keda/ollama/keda-ollama-httpscaledobject.yaml

    This creates the HTTPScaledObject object that defines the following fields:

    • scaleTargetRef: specifies the Service to which KEDA-HTTP should forward the requests. In this example, all requests with the host ollama.ollama are routed to the Ollama server.
    • scaledownPeriod: specifies how long (in seconds) to wait after the last request before scaling down.
    • replicas: specifies the minimum and maximum number of Pods to maintain for the Ollama deployment.
    • scalingMetric: specifies the metrics used to drive autoscaling, such as request rate in this example. For more metric options, see the KEDA-HTTP documentation.
    kind: HTTPScaledObject
    apiVersion: http.keda.sh/v1alpha1
    metadata:
      namespace: ollama
      name: ollama
    spec:
      hosts:
        - ollama.ollama
      scaleTargetRef:
        name: ollama
        kind: Deployment
        apiVersion: apps/v1
        service: ollama
        port: 11434
      replicas:
        min: 0
        max: 2
      scaledownPeriod: 3600
      scalingMetric:
        requestRate:
          targetValue: 20
  2. Run the following command to verify that KEDA-HTTP has successfully processed the HTTPScaledObject created in the previous step.

    kubectl get hpa,scaledobject -n ollama

    The output shows the HorizontalPodAutoscaler (created by KEDA), and the ScaledObject (created by KEDA-HTTP) resources:

    NAME                                                  REFERENCE           TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
    horizontalpodautoscaler.autoscaling/keda-hpa-ollama   Deployment/ollama   0/100 (avg)   1         2         1          2d

    NAME                          SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   TRIGGERS        AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
    scaledobject.keda.sh/ollama   apps/v1.Deployment   ollama            0     2     external-push                    True    False    False      Unknown   2d
  3. Verify that the Deployment scales down to zero Pods.

    Wait the period of time set in the scaledownPeriod field and run the command:

    kubectl get deployments -n ollama

    The output shows that KEDA scaled down the Ollama deployment, and that no Pods are running:

    NAME     READY   UP-TO-DATE   AVAILABLE   AGE
    ollama   0/0     0            0           2d

Trigger the scale-up

To trigger the Deployment to scale up, call the Ollama service using the proxy set up by the KEDA-HTTP add-on. This increases the value of the request rate metric and triggers the creation of a first Pod.

Use kubectl port forwarding capabilities to access the proxy because the proxy is not exposed externally.

kubectl port-forward svc/keda-add-ons-http-interceptor-proxy -n keda 8080:8080 &

# Set the 'Host' HTTP header so that the proxy routes requests to the Ollama server.
curl -H "Host: ollama.ollama" \
    http://localhost:8080/api/generate \
    -d '{ "model": "gemma:7b", "prompt": "Hello!" }'

The curl command sends the prompt "Hello!" to a Gemma model. Observe the answer tokens coming back in the response. For the specification of the API, see the Ollama guide.
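Ollama streams its answer as newline-delimited JSON objects, each carrying a fragment of the answer in a `response` field. The following sketch (an assumption, not part of the tutorial; requires `jq`) joins those fragments into a single string:

```shell
# Hypothetical helper: join the "response" fragments from Ollama's streaming
# NDJSON output into a single answer string (requires jq).
extract_answer() {
  # -j prints raw output with no trailing newlines, so fragments concatenate.
  jq -j '.response'
}

# Usage against the port-forwarded proxy (commented out; needs the live cluster):
# curl -s -H "Host: ollama.ollama" http://localhost:8080/api/generate \
#     -d '{ "model": "gemma:7b", "prompt": "Hello!" }' | extract_answer
```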

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  1. Clean up the Pub/Sub subscription and topic:

    gcloud pubsub subscriptions delete keda-echo-read
    gcloud pubsub topics delete keda-echo
  2. Delete your GKE cluster:

    gcloud container clusters delete scale-to-zero --location=${LOCATION}

What's next