Source: llm/vllm
vLLM: Easy, Fast, and Cheap LLM Inference
This README contains instructions to run a demo of vLLM, an open-source library for fast LLM inference and serving that delivers up to 24x higher throughput than HuggingFace Transformers.
Prerequisites
Install the latest SkyPilot and check your cloud credential setup:
pip install git+https://github.com/skypilot-org/skypilot.git
sky check
See the vLLM SkyPilot YAMLs.
Serving Llama-2 with vLLM’s OpenAI-compatible API server
Before you get started, you need access to the Llama-2 model weights on Hugging Face. Please check the prerequisites section in the Llama-2 example for more details.
Start serving the Llama-2 model:
sky launch -c vllm-llama2 serve-openai-api.yaml --env HF_TOKEN=YOUR_HUGGING_FACE_API_TOKEN
Optional: Currently, only GCP offers the specified L4 GPUs. To use other clouds, pass the --gpus flag to request other GPUs. For example, to use H100 GPUs:
sky launch -c vllm-llama2 serve-openai-api.yaml --gpus H100:1 --env HF_TOKEN=YOUR_HUGGING_FACE_API_TOKEN
Tip: You can also use the vLLM Docker container for faster setup. Refer to serve-openai-api-docker.yaml for details.
Check the IP for the cluster with:
IP=$(sky status --ip vllm-llama2)
You can now use the OpenAI API to interact with the model.
Query the models hosted on the cluster:
curl http://$IP:8000/v1/models
Query a model with input prompts for text completion:
curl http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
You should get a response similar to the following:
{ "id":"cmpl-50a231f7f06a4115a1e4bd38c589cd8f", "object":"text_completion","created":1692427390, "model":"meta-llama/Llama-2-7b-chat-hf", "choices":[{ "index":0, "text":"city in Northern California that is known", "logprobs":null,"finish_reason":"length" }], "usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}
Query a model with input prompts for chat completion:
curl http://$IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who are you?"}
    ]
  }'
You should get a response similar to the following:
{ "id": "cmpl-879a58992d704caf80771b4651ff8cb6", "object": "chat.completion", "created": 1692650569, "model": "meta-llama/Llama-2-7b-chat-hf", "choices": [{ "index": 0, "message": { "role": "assistant", "content": " Hello! I'm just an AI assistant, here to help you" }, "finish_reason": "length" }], "usage": { "prompt_tokens": 31, "total_tokens": 47, "completion_tokens": 16 }}
Serving Llama-2 with vLLM for more traffic using SkyServe
To scale up the model serving for more traffic, we introduced SkyServe, which enables you to easily deploy multiple replicas of the model:

Add a service section to the above serve-openai-api.yaml file to make it a SkyServe service YAML:
# The newly-added `service` section to the `serve-openai-api.yaml` file.
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe: /v1/models
  # How many replicas to manage.
  replicas: 2
The entire Service YAML can be found here: service.yaml.
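SkyServe decides that a replica is ready by probing the readiness_probe path over HTTP and waiting for a successful response. If you want to mimic that check by hand against a single replica, here is a sketch (REPLICA_IP is a placeholder for one replica's IP from the status output below):

# readiness_check.py -- illustrative only; SkyServe runs this probe automatically.
import os
import urllib.request

url = f"http://{os.environ['REPLICA_IP']}:8000/v1/models"
# urlopen raises an HTTPError for non-2xx responses, so reaching the
# print below means the replica answered successfully.
with urllib.request.urlopen(url, timeout=5) as resp:
    print(f"replica ready (HTTP {resp.status})")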
Start serving with the SkyServe CLI:
sky serve up -n vllm-llama2 service.yaml
Use sky serve status to check the status of the service:
sky serve status vllm-llama2
You should see output similar to the following:
Services
NAME         UPTIME  STATUS  REPLICAS  ENDPOINT
vllm-llama2  7m 43s  READY   2/2       3.84.15.251:30001

Service Replicas
SERVICE_NAME  ID  IP            LAUNCHED     RESOURCES          STATUS  REGION
vllm-llama2   1   34.66.255.4   11 mins ago  1x GCP({'L4': 1})  READY   us-central1
vllm-llama2   2   35.221.37.64  15 mins ago  1x GCP({'L4': 1})  READY   us-east4
Check the endpoint of the service:
ENDPOINT=$(sky serve status --endpoint vllm-llama2)
Once its status is READY, you can use the endpoint to interact with the model:
curl $ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who are you?"}
    ]
  }'
Notice that this is the same curl command as before. You should get a response similar to the following:
{ "id": "cmpl-879a58992d704caf80771b4651ff8cb6", "object": "chat.completion", "created": 1692650569, "model": "meta-llama/Llama-2-7b-chat-hf", "choices": [{ "index": 0, "message": { "role": "assistant", "content": " Hello! I'm just an AI assistant, here to help you" }, "finish_reason": "length" }], "usage": { "prompt_tokens": 31, "total_tokens": 47, "completion_tokens": 16 }}
Serving Mistral AI’s Mixtral 8x7b model with vLLM
Please refer to the Mixtral 8x7b example for more details.
Included files
serve-openai-api-docker.yaml
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  image_id: docker:vllm/vllm-openai:latest
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports:
    - 8000

setup: |
  conda deactivate
  python3 -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda deactivate
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0
serve-openai-api.yaml
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports:
    - 8000

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate vllm
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0
serve.yaml
envs:
  MODEL_NAME: decapoda-research/llama-65b-hf

resources:
  accelerators: A100-80GB:8

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  # Clone the vllm repo for the gradio demo script and install dependencies.
  git clone https://github.com/vllm-project/vllm.git || true
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  pip install gradio

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  python -u -m vllm.entrypoints.api_server \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --tokenizer hf-internal-testing/llama-tokenizer 2>&1 | tee api_server.log &
  echo 'Waiting for vllm api server to start...'
  while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
  echo 'Starting gradio server...'
  python vllm/examples/gradio_webserver.py
service-with-auth.yaml
# service-with-auth.yaml
# The newly-added `service` section to the `serve-openai-api.yaml` file.
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe:
    path: /v1/models
    # Set authorization headers here if needed.
    headers:
      Authorization: Bearer $AUTH_TOKEN
  # How many replicas to manage.
  replicas: 1

# Fields below are the same as in `serve-openai-api.yaml`.
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
  AUTH_TOKEN: # TODO: Fill with your own auth token (a random string), or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports: 8000

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate vllm
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0 --port 8000 --api-key $AUTH_TOKEN
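With service-with-auth.yaml, the server is started with --api-key, so every request must carry the matching bearer token or it is rejected as unauthorized. A sketch of an authenticated query (the script name is hypothetical), assuming the service is up and AUTH_TOKEN holds the same value passed at launch:

# query_with_auth.py -- hypothetical; pairs with service-with-auth.yaml above.
import os

from openai import OpenAI

# The client sends the key as an "Authorization: Bearer <token>" header,
# which must match the --api-key value in the run section above.
client = OpenAI(
    base_url=f"http://{os.environ['ENDPOINT']}/v1",
    api_key=os.environ["AUTH_TOKEN"],
)

for model in client.models.list().data:
    print(model.id)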
service.yaml
# service.yaml
# The newly-added `service` section to the `serve-openai-api.yaml` file.
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe: /v1/models
  # How many replicas to manage.
  replicas: 2

# Fields below are the same as in `serve-openai-api.yaml`.
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports:
    - 8000

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate vllm
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0