Scale to zero using KEDA


This tutorial shows how to scale your GKE workloads down to zero Pods using KEDA. Scaling the deployments to zero Pods saves resources during periods of inactivity (such as weekends and non-office hours), or for intermittent workloads such as periodic jobs.

Objectives

This tutorial describes the following use cases:

  • Scale your Pub/Sub workload to zero: Scale the number of Pods in proportion to the number of messages queued on the Pub/Sub topic. When the queue is empty, the workload automatically scales down to zero Pods.
  • Scale your LLM workload to zero. Deploy your LLM model servers on nodes with GPU. When the service is idle, the workload automatically scales down to zero Pods.

Costs

In this document, you use the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

In this tutorial, you use Cloud Shell to run commands. Cloud Shell is a shell environment for managing resources hosted on Google Cloud. It comes preinstalled with the Google Cloud CLI, kubectl, Helm and Terraform command-line tools. If you don't use Cloud Shell, you must install the Google Cloud CLI and Helm.

  1. To run the commands on this page, set up the gcloud CLI in one of the following development environments:

    Cloud Shell

    To use an online terminal with the gcloud CLI already set up, activate Cloud Shell:

    At the bottom of this page, a Cloud Shell session starts and displays a command-line prompt. It can take a few seconds for the session to initialize.

    Local shell

    To use a local development environment, follow these steps:

    1. Install the gcloud CLI.
    2. Initialize the gcloud CLI.
    3. Install Helm, a Kubernetes package management tool.
  2. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  3. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  4. Make sure that billing is enabled for your Google Cloud project.

  5. Enable the Resource Manager, Compute Engine, GKE, Pub/Sub APIs.

    Enable the APIs


Setting up your environment

To set up your environment with Cloud Shell, follow these steps:

  1. Set environment variables:

    export PROJECT_ID=PROJECT_ID
    export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format 'get(projectNumber)')
    export LOCATION=LOCATION

    Replace PROJECT_ID with your Google Cloud project ID and LOCATION with the region or zone where your GKE cluster should be created.

    If you don't follow the entire tutorial in a single session, or if your environment variables become unset, run these commands again to reset them.

  2. Create a Standard GKE cluster with cluster autoscaling and Workload Identity Federation for GKE enabled:

    gcloud container clusters create scale-to-zero \
        --project=${PROJECT_ID} --location=${LOCATION} \
        --machine-type=n1-standard-2 \
        --enable-autoscaling --min-nodes=1 --max-nodes=5 \
        --workload-pool=${PROJECT_ID}.svc.id.goog
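Because later sections depend on these variables, a small guard can confirm they are still set before you run any gcloud commands. This helper is not part of the tutorial; it's a hypothetical sketch that assumes bash:

```shell
# Hypothetical helper (not part of the tutorial): fail fast if any required
# environment variable is unset before running the gcloud commands below.
check_env() {
  local missing=0
  for var in PROJECT_ID PROJECT_NUMBER LOCATION; do
    # ${!var} is bash indirect expansion: the value of the variable named by $var.
    if [ -z "${!var:-}" ]; then
      echo "ERROR: $var is not set" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] && echo "All variables set"
}
```

Run `check_env` before each section; if it prints an error, rerun the export commands in step 1.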

Install KEDA

KEDA is a component that complements Kubernetes Horizontal Pod Autoscaler. With KEDA, you can scale a Deployment to zero Pods and up from zero Pods to one Pod. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster. The standard Horizontal Pod Autoscaler algorithm applies after GKE creates at least one Pod.

After GKE scales the Deployment to zero Pods, because no Pods are running, autoscaling cannot rely on Pod metrics such as CPU utilization. As a consequence, KEDA allows fetching metrics originating from outside the cluster using an implementation of the Kubernetes External Metrics API. You can use this API to autoscale based on metrics such as the number of outstanding messages on a Pub/Sub subscription. See the KEDA documentation for a list of all supported metric sources.

Install KEDA on your cluster with Helm or with kubectl.

Helm

Run the following commands to add the KEDA Helm repository, install the KEDA Helm chart, and give the KEDA service account read access to Cloud Monitoring:

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --create-namespace --namespace keda
gcloud projects add-iam-policy-binding projects/${PROJECT_ID} \
    --role roles/monitoring.viewer \
    --member=principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/keda/sa/keda-operator

Note that this command also sets up authorization rules that require the cluster to be set up with Workload Identity Federation for GKE.

kubectl

Run the following commands to install KEDA using kubectl apply and to give the KEDA service account read access to Cloud Monitoring:

kubectl apply --server-side -f https://github.com/kedacore/keda/releases/download/v2.15.1/keda-2.15.1.yaml
gcloud projects add-iam-policy-binding projects/${PROJECT_ID} \
    --role roles/monitoring.viewer \
    --member=principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/keda/sa/keda-operator

Note that this command also sets up authorization rules that require the cluster to be set up with Workload Identity Federation for GKE.

Confirm that all KEDA resources appear under the keda namespace:

kubectl get all -n keda

For more information about KEDA design and resources, see the KEDA documentation.

Scale your Pub/Sub workload to zero

This section describes a workload that processes messages from a Pub/Sub subscription, handling each message and acknowledging its completion. The workload scales dynamically: as the number of unacknowledged messages increases, autoscaling instantiates more Pods to ensure timely processing.

Scaling to zero ensures that no Pods are instantiated when no messages have been received for a while. This saves resources as no Pods stay idle for long periods of time.

Deploy a Pub/Sub workload

Deploy a sample workload that processes messages queued on a Pub/Sub topic. To simulate a realistic workload, this sample program waits three seconds before acknowledging a message. The workload is configured to run under the keda-pubsub-sa service account.

Run the following commands to create the Pub/Sub topic and subscription, configure their permissions, and create the Deployment that starts the workload in the keda-pubsub namespace.

gcloud pubsub topics create keda-echo
gcloud pubsub subscriptions create keda-echo-read --topic=keda-echo
gcloud projects add-iam-policy-binding projects/${PROJECT_ID} \
    --role=roles/pubsub.subscriber \
    --member=principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/keda-pubsub/sa/keda-pubsub-sa
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/refs/heads/main/cost-optimization/gke-keda/cloud-pubsub/deployment/keda-pubsub-with-workload-identity.yaml

Configure scale-to-zero

To configure your Pub/Sub workload to scale to zero, use KEDA to define a ScaledObject resource to specify how the deployment should scale. KEDA will then automatically create and manage the underlying HorizontalPodAutoscaler (HPA) object.

  1. Create the ScaledObject resource to describe the expected autoscaling behavior:

    curl https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/refs/heads/main/cost-optimization/gke-keda/cloud-pubsub/deployment/keda-pubsub-scaledobject.yaml | envsubst | kubectl apply -f -

    This creates the following object:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: keda-pubsub
      namespace: keda-pubsub
    spec:
      maxReplicaCount: 5
      scaleTargetRef:
        name: keda-pubsub
      triggers:
        - type: gcp-pubsub
          authenticationRef:
            name: keda-auth
          metadata:
            subscriptionName: "projects/${PROJECT_ID}/subscriptions/keda-echo-read"
  2. Inspect the HorizontalPodAutoscaler (HPA) object that KEDA creates based on the ScaledObject object:

    kubectl get hpa keda-hpa-keda-pubsub -n keda-pubsub -o yaml

    You can read more about autoscaling in the Kubernetes documentation.

  3. Wait until KEDA acknowledges that the Pub/Sub subscription is empty, and scales the Deployment to zero replicas.

    Inspect the workload autoscaler:

    kubectl describe hpa keda-hpa-keda-pubsub -n keda-pubsub

    Observe that in the command response, the ScalingActive condition is false. The associated message shows that the Horizontal Pod Autoscaler acknowledges that KEDA scaled the deployment to zero, at which point it stops operating until the Deployment scales back up to one Pod.

    Name:             keda-hpa-keda-pubsub
    Namespace:        keda-pubsub
    Metrics:          ( current / target )
      "s0-gcp-ps-projects-[...]" (target average value):  0 / 10
    Min replicas:     1
    Max replicas:     5
    Deployment pods:  5 current / 5 desired
    Conditions:
      Type            Status  Reason               Message
      ----            ------  ------               -------
      AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one [...]
      ScalingActive   False   ScalingDisabled      scaling is disabled since the replica count of the target is zero
      ScalingLimited  True    TooManyReplicas      the desired replica count is more than the maximum replica count

Trigger the scale-up

To trigger the Deployment to scale up:

  1. Enqueue messages on the Pub/Sub topic:

    for num in {1..20}
    do
      gcloud pubsub topics publish keda-echo --project=${PROJECT_ID} --message="Test"
    done
  2. Verify that the Deployment is scaling up:

    kubectl get deployments -n keda-pubsub

    In the output, observe that the 'Ready' column shows one replica:

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    keda-pubsub   1/1     1            1           2d

KEDA scales up the Deployment after it observes that the queue is not empty.
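If you'd rather not rerun the check from step 2 by hand, the wait can be scripted. This is a hypothetical sketch (not part of the tutorial) that assumes kubectl is configured for the cluster created above:

```shell
# Hypothetical sketch: poll the keda-pubsub Deployment until at least one
# replica is ready, or give up after roughly 120 seconds.
wait_for_scale_up() {
  local deadline=$(( $(date +%s) + 120 ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    local replicas
    replicas="$(kubectl get deployment keda-pubsub -n keda-pubsub \
        -o jsonpath='{.status.readyReplicas}' 2>/dev/null)"
    if [ "${replicas:-0}" -gt 0 ]; then
      echo "Scaled up: ${replicas} ready replica(s)"
      return 0
    fi
    sleep 5
  done
  echo "Timed out waiting for scale-up" >&2
  return 1
}
```

Call `wait_for_scale_up` right after publishing the messages; it returns as soon as the first Pod reports ready.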

Scale your LLM workload to zero

This section describes a Large Language Model (LLM) workload that deploys an Ollama server with an attached GPU. Ollama lets you run popular LLMs such as Gemma and Llama 2, and exposes its features primarily over HTTP.

Install KEDA-HTTP add-on

Scaling an HTTP service down to zero Pods during periods of inactivity causes request failures, since there's no backend to handle the requests.

This section shows how to solve this problem using the KEDA-HTTP add-on. KEDA-HTTP starts an HTTP proxy that receives user requests and forwards them to the Services configured to scale-to-zero. When the Service has no Pod, the proxy triggers the Service to scale up, and buffers the request until the Service has scaled up to at least one Pod.

Install the KEDA-HTTP add-on using Helm. For more information, refer to KEDA-HTTP documentation.

helm repo add kedacore https://kedacore.github.io/charts
helm repo update

# Set the proxy timeout to 120s, giving Ollama time to start.
helm install http-add-on kedacore/keda-add-ons-http \
    --create-namespace --namespace keda \
    --set interceptor.responseHeaderTimeout=120s

Deploy an Ollama LLM workload

To deploy an Ollama LLM workload:

  1. Create a node pool containing g2-standard-4 nodes with attached GPUs, and configure cluster autoscaling to provide between zero and two nodes:

    gcloud container node-pools create gpu --machine-type=g2-standard-4 \
        --location=${LOCATION} --cluster=scale-to-zero \
        --min-nodes 0 --max-nodes 2 --num-nodes=1 --enable-autoscaling
  2. Add the official Ollama Helm chart repository, and update your local Helm client's repository:

    helm repo add ollama-helm https://otwld.github.io/ollama-helm/
    helm repo update
  3. Deploy the Ollama server using the Helm chart:

    helm install ollama ollama-helm/ollama --create-namespace --namespace ollama \
        -f https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/refs/heads/main/cost-optimization/gke-keda/ollama/helm-values-ollama.yaml

    The helm-values-ollama.yaml configuration specifies the LLM models to load, the GPU requirements, and the TCP port for the Ollama server.

Configure scale-to-zero

To configure your Ollama workload to scale to zero, define a KEDA-HTTP HTTPScaledObject resource that specifies how the Deployment should scale.

  1. Create the HTTPScaledObject resource to describe the expected autoscaling behavior:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/refs/heads/main/cost-optimization/gke-keda/ollama/keda-ollama-httpscaledobject.yaml

    This creates the HTTPScaledObject object that defines the following fields:

    • scaleTargetRef: specifies the Service to which KEDA-HTTP should forward the requests. In this example, all requests with the host ollama.ollama are routed to the Ollama server.
    • scaledownPeriod: specifies how long (in seconds) to wait after the last request before scaling down.
    • replicas: specifies the minimum and maximum number of Pods to maintain for the Ollama deployment.
    • scalingMetric: specifies the metrics used to drive autoscaling, such as request rate in this example. For more metric options, see the KEDA-HTTP documentation.
    kind: HTTPScaledObject
    apiVersion: http.keda.sh/v1alpha1
    metadata:
      namespace: ollama
      name: ollama
    spec:
      hosts:
        - ollama.ollama
      scaleTargetRef:
        name: ollama
        kind: Deployment
        apiVersion: apps/v1
        service: ollama
        port: 11434
      replicas:
        min: 0
        max: 2
      scaledownPeriod: 3600
      scalingMetric:
        requestRate:
          targetValue: 20
  2. Run the following command to verify that KEDA-HTTP has successfully processed the HTTPScaledObject created in the previous step.

    kubectl get hpa,scaledobject -n ollama

    The output shows the HorizontalPodAutoscaler (created by KEDA), and the ScaledObject (created by KEDA-HTTP) resources:

    NAME                                                  REFERENCE           TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
    horizontalpodautoscaler.autoscaling/keda-hpa-ollama   Deployment/ollama   0/100 (avg)   1         2         1          2d

    NAME                          SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   TRIGGERS        AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
    scaledobject.keda.sh/ollama   apps/v1.Deployment   ollama            0     2     external-push                    True    False    False      Unknown   2d
  3. Verify that the Deployment scales down to zero Pods.

    Wait the period of time set in the scaledownPeriod field and run the command:

    kubectl get deployments -n ollama

    The output shows that KEDA scaled down the Ollama deployment, and that no Pods are running:

    NAME     READY   UP-TO-DATE   AVAILABLE   AGE
    ollama   0/0     0            0           2d

Trigger the scale-up

To trigger the Deployment to scale up, call the Ollama service using the proxy set up by the KEDA-HTTP add-on. This increases the value of the request rate metric and triggers the creation of a first Pod.

Use kubectl port forwarding capabilities to access the proxy because the proxy is not exposed externally.

kubectl port-forward svc/keda-add-ons-http-interceptor-proxy -n keda 8080:8080 &

# Set the 'Host' HTTP header so that the proxy routes requests to the Ollama server.
curl -H "Host: ollama.ollama" \
    http://localhost:8080/api/generate \
    -d '{ "model": "gemma:7b", "prompt": "Hello!" }'

The curl command sends the prompt "Hello!" to a Gemma model. Observe the answer tokens coming back in the response. For the specification of the API, see the Ollama guide.
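Ollama streams its answer as newline-delimited JSON objects, each carrying a fragment of the answer in a `response` field. The following sketch (an assumption, not part of the tutorial; requires `jq`) joins those fragments into a single string:

```shell
# Hypothetical helper: join the "response" fragments from Ollama's streaming
# NDJSON output into a single answer string (requires jq).
extract_answer() {
  # -j prints raw output with no trailing newlines, so fragments concatenate.
  jq -j '.response'
}

# Usage against the port-forwarded proxy (commented out; needs the live cluster):
# curl -s -H "Host: ollama.ollama" http://localhost:8080/api/generate \
#     -d '{ "model": "gemma:7b", "prompt": "Hello!" }' | extract_answer
```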

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  1. Clean up the Pub/Sub subscription and topic:

    gcloud pubsub subscriptions delete keda-echo-read
    gcloud pubsub topics delete keda-echo
  2. Delete your GKE cluster:

    gcloud container clusters delete scale-to-zero --location=${LOCATION}

What's next