To roll out a node update, follow these steps:
Save the following sample manifest as routes-to-llm.yaml:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: routes-to-llm
spec:
  parentRefs:
  - name: my-inference-gateway
  rules:
  - backendRefs:
    - name: llm
      kind: InferencePool
      weight: 90
    - name: llm-new
      kind: InferencePool
      weight: 10
Apply the sample manifest to your cluster:
kubectl apply -f routes-to-llm.yaml
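Optionally, before you start shifting more traffic, you can confirm that the Gateway accepted the route by inspecting its status. The resource name below matches the sample manifest; look for an Accepted condition in the route's status:
kubectl get httproute routes-to-llm -o yaml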
The original llm InferencePool receives most of the traffic, while the llm-new InferencePool receives the rest of the traffic. Increase the traffic weight gradually for the llm-new InferencePool to complete the node update rollout.
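As an illustrative sketch of the final step (the exact intermediate weights you step through are up to you), you could edit routes-to-llm.yaml so that llm-new receives all traffic, then reapply the manifest with kubectl apply -f routes-to-llm.yaml:
  rules:
  - backendRefs:
    - name: llm
      kind: InferencePool
      weight: 0      # the original pool no longer receives traffic
    - name: llm-new
      kind: InferencePool
      weight: 100    # the new pool now serves all traffic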
A base model update rolls out a new base LLM in phases while retaining compatibility with existing LoRA adapters. You can use base model update rollouts to upgrade to improved model architectures or to address model-specific issues.
To roll out a base model update:
Create new infrastructure: create a new InferencePool configured with the new base model that you chose.
Split traffic: use an HTTPRoute to split traffic between the existing InferencePool (which uses the old base model) and the new InferencePool (which uses the new base model). The backendRefs weight field controls the traffic percentage allocated to each pool.
Maintain InferenceModel integrity: keep your InferenceModel configuration unchanged. This ensures that the system applies the same LoRA adapters consistently across both base model versions.
Preserve rollback capability: retain the previous nodes and InferencePool during the rollout to facilitate a rollback if necessary.
In the following example, you create a new InferencePool named llm-pool-version-2. This pool deploys a new version of the base model on a new set of nodes. By configuring an HTTPRoute, as shown in the example, you can incrementally split traffic between the original llm-pool and llm-pool-version-2. This lets you control base model updates in your cluster.
To perform the base model update rollout, follow these steps:
Save the following sample manifest as routes-to-llm.yaml:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: routes-to-llm
spec:
  parentRefs:
  - name: my-inference-gateway
  rules:
  - backendRefs:
    - name: llm-pool
      kind: InferencePool
      weight: 90
    - name: llm-pool-version-2
      kind: InferencePool
      weight: 10
Apply the sample manifest to your cluster:
kubectl apply -f routes-to-llm.yaml
The original llm-pool InferencePool receives most of the traffic, while the llm-pool-version-2 InferencePool receives the rest. Increase the traffic weight gradually for the llm-pool-version-2 InferencePool to complete the base model update rollout.
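For example, an intermediate stage of this rollout might be an even split between the two pools. This is only a sketch; you would typically watch error rates and latency at each stage before increasing the weight further:
  rules:
  - backendRefs:
    - name: llm-pool
      kind: InferencePool
      weight: 50   # old base model still serves half the traffic
    - name: llm-pool-version-2
      kind: InferencePool
      weight: 50   # new base model serves the other half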
LoRA adapter update rollouts let you deploy new versions of fine-tuned models in phases, without altering the underlying base model or infrastructure. Use LoRA adapter update rollouts to test improvements, bug fixes, or new features in your LoRA adapters.
To update a LoRA adapter, follow these steps:
Make adapters available: Ensure that the new LoRA adapter versions are available on the model servers. For more information, see Adapter roll out.
Modify the InferenceModel configuration: In your existing InferenceModel configuration, define multiple versions of your LoRA adapter. Assign a unique modelName to each version (for example, llm-v1, llm-v2).
Distribute traffic: Use the weight field in the InferenceModel specification to control the traffic distribution among the different LoRA adapter versions.
Maintain a consistent poolRef: Ensure that all LoRA adapter versions reference the same InferencePool. This prevents node or InferencePool redeployments. Retain previous LoRA adapter versions in the InferenceModel configuration to enable rollbacks.
The following example shows two LoRA adapter versions, llm-v1 and llm-v2. Both versions use the same base model. You define llm-v1 and llm-v2 within the same InferenceModel. You assign weights to incrementally shift traffic from llm-v1 to llm-v2. This gives you a controlled rollout without requiring any changes to your nodes or InferencePool configuration.
To roll out LoRA adapter updates, follow these steps:
Save the following sample manifest as inferencemodel-sample.yaml:
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  versions:
  - modelName: llm-v1
    criticality: Critical
    weight: 90
    poolRef:
      name: llm-pool
  - modelName: llm-v2
    criticality: Critical
    weight: 10
    poolRef:
      name: llm-pool
Apply the sample manifest to your cluster:
kubectl apply -f inferencemodel-sample.yaml
The llm-v1 version receives most of the traffic, while the llm-v2 version receives the rest. Increase the traffic weight gradually for the llm-v2 version to complete the LoRA adapter update rollout.
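To sketch the final step under the same sample configuration, you could set llm-v2 to receive all traffic while keeping llm-v1 defined, so that a rollback only requires another weight change. Reapply the updated manifest with kubectl apply -f inferencemodel-sample.yaml:
spec:
  versions:
  - modelName: llm-v1
    criticality: Critical
    weight: 0        # retained for rollback, no longer serving traffic
    poolRef:
      name: llm-pool
  - modelName: llm-v2
    criticality: Critical
    weight: 100      # new adapter version serves all traffic
    poolRef:
      name: llm-pool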