Source: llm/deepseek-r1
Distributed DeepSeek-R1 Serving with High Throughput using SGLang and SkyPilot#
On Jan 20, 2025, DeepSeek AI released DeepSeek-R1, a family of models with up to 671B parameters.
DeepSeek-R1 naturally emerged with numerous powerful and interesting reasoning behaviors. It outperforms state-of-the-art proprietary models such as OpenAI-o1-mini and is the first open LLM to closely rival closed-source models like OpenAI-o1.
In this example, we use SGLang to serve the model across multiple nodes with high throughput.
Note: This example is for the original DeepSeek-R1 671B model. For smaller distilled models, please refer to deepseek-r1-distilled.
Run 671B DeepSeek-R1 on Kubernetes or any Cloud#
SkyPilot lets you launch a multi-node deployment of the model with a single command, using SGLang as the serving framework.
sky launch -c r1 llm/deepseek-r1/deepseek-r1-671B.yaml --retry-until-up
Below is the SkyPilot YAML configuration for DeepSeek-R1 671B, as provided in llm/deepseek-r1/deepseek-r1-671B.yaml:

name: deepseek-r1

resources:
  accelerators: {H200:8, H100:8}
  disk_size: 1024 # Large disk for model weights
  disk_tier: best
  ports: 30000
  any_of:
    - use_spot: true
    - use_spot: false

num_nodes: 2 # Specify number of nodes to launch; requirements may vary based on accelerators

setup: |
  # Install sglang with all dependencies using uv
  uv pip install "sglang[all]>=0.4.2.post4" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer

  # Set up shared memory for better performance
  sudo bash -c "echo 'vm.max_map_count=655300' >> /etc/sysctl.conf"
  sudo sysctl -p

run: |
  # Launch the server with appropriate configuration
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  # TP should be number of GPUs per node times number of nodes
  TP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  python -m sglang.launch_server \
    --model deepseek-ai/DeepSeek-R1 \
    --tp $TP \
    --dist-init-addr ${MASTER_ADDR}:5000 \
    --nnodes ${SKYPILOT_NUM_NODES} \
    --node-rank ${SKYPILOT_NODE_RANK} \
    --trust-remote-code \
    --enable-dp-attention \
    --enable-torch-compile \
    --torch-compile-max-bs 8 \
    --host 0.0.0.0 \
    --port 30000
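The run section derives two values on every node: the head node address (the first entry of SKYPILOT_NODE_IPS) and the tensor-parallel size (GPUs per node times number of nodes). The following is a minimal Python sketch of that same arithmetic, intended only as an illustration. The function name `sglang_launch_args` and its parameters are hypothetical; the flags are the ones used in the YAML above.

```python
def sglang_launch_args(node_ips, num_nodes, gpus_per_node, node_rank, init_port=5000):
    """Mirror the shell logic of the run section (illustrative helper, not part of SkyPilot).

    node_ips is a newline-separated list of IPs, as in SKYPILOT_NODE_IPS;
    the first entry is treated as the head node.
    """
    master_addr = node_ips.splitlines()[0]
    # TP is the total number of GPUs across all nodes.
    tp = gpus_per_node * num_nodes
    return [
        "--tp", str(tp),
        "--dist-init-addr", f"{master_addr}:{init_port}",
        "--nnodes", str(num_nodes),
        "--node-rank", str(node_rank),
    ]


# Two nodes with 8 GPUs each, evaluated on the worker node (rank 1).
args = sglang_launch_args("10.0.0.1\n10.0.0.2", num_nodes=2, gpus_per_node=8, node_rank=1)
print(" ".join(args))
```

Every node computes the same TP value and the same head address; only the node rank differs between nodes.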
You can also adjust accelerators and num_nodes to fit your needs. Common configurations include:
GPU | Num Nodes |
---|---|
H200:8 | 1 |
H100:8 | 2 |
A100-80GB:8 | 4 |
A100:8 (40GB) | 8 |
You can override the accelerators and num_nodes on the command line without modifying the YAML file. For example:
sky launch -c r1-A100 llm/deepseek-r1/deepseek-r1-671B-A100.yaml --retry-until-up --gpus A100-80GB:8 --num-nodes 4
[!NOTE] For A100 GPUs, use deepseek-r1-671B-A100.yaml, which includes a preprocessing step to convert the model from FP8 to BF16, as A100 does not support FP8. This conversion process takes an additional 30-40 minutes. Alternatively, you can use a pre-converted BF16 model from the Hugging Face community to skip the conversion step.
Since BF16 weights consume twice the memory of FP8 weights, A100-80GB deployments require twice as many nodes as H100 deployments, and A100-40GB deployments require four times as many. That is, if an H100 setup requires 2 nodes, an A100-80GB setup requires 4 nodes, and an A100-40GB setup requires 8 nodes.
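The node counts above can be approximated with back-of-the-envelope arithmetic. The sketch below is a rough estimate under stated assumptions, not a sizing tool: it assumes 1 byte per parameter for FP8 and 2 bytes for BF16, per-GPU memory of 141 GB (H200), 80 GB (H100, A100-80GB), and 40 GB (A100), and it rounds the node count up to a power of two. The rounding rule and the function name `estimate_nodes` are assumptions for illustration, and the estimate ignores KV cache and activation memory.

```python
import math

PARAMS_BILLION = 671  # DeepSeek-R1 parameter count


def estimate_nodes(gpu_mem_gb, bytes_per_param, gpus_per_node=8):
    """Estimate nodes needed to hold the weights only (hypothetical helper)."""
    # 1e9 parameters times bytes per parameter gives the size in GB.
    weights_gb = PARAMS_BILLION * bytes_per_param
    node_mem_gb = gpu_mem_gb * gpus_per_node
    min_nodes = math.ceil(weights_gb / node_mem_gb)
    # Assumption: round up to the next power of two.
    return 1 << (min_nodes - 1).bit_length()


for name, mem_gb, bytes_per_param in [
    ("H200:8 (FP8)", 141, 1),
    ("H100:8 (FP8)", 80, 1),
    ("A100-80GB:8 (BF16)", 80, 2),
    ("A100:8 40GB (BF16)", 40, 2),
]:
    print(name, estimate_nodes(mem_gb, bytes_per_param))
```

Under these assumptions the estimate reproduces the table (1, 2, 4, and 8 nodes). Real deployments also need headroom for the KV cache, so treat the result as a lower bound.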
For more configuration options, refer to the DeepSeek SGLang Docs.
SkyPilot finds the cheapest candidate resources for you, and automatically fails over through different regions, clouds, or Kubernetes clusters to find the resources to launch the model.
It may take a while (30-40 minutes) for SGLang to download the model weights, compile, and start the server.
Query the endpoint#
After the initialization, you can access the model with the endpoint:
ENDPOINT=$(sky status --endpoint 30000 r1)

curl http://$ENDPOINT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-ai/DeepSeek-R1-671B",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "how many rs are in strawberry"
        }
      ]
    }' | jq .
You will get a response like the following. The message content contains the model's chain of thought, closed by a </think> marker, followed by the final answer.
How many Rs are in strawberry: So, the answer is 3. 🍓
```console
{"id":"01add72820794f5c884c4d5c126d2a62","object":"chat.completion","created":1739493784,"model":"deepseek-ai/DeepSeek-R1-671B","choices":[{"index":0,"message":{"role":"assistant","content":"Okay, let's figure out how many times the letter \"r\" appears in the word \"strawberry.\" First, I need to make sure I'm spelling \"strawberry\" correctly. Sometimes people might miss letters or add extra ones. Let me write it out: S-T-R-A-W-B-E-R-R-Y. Wait, is that right? Let's double-check. Strawberry is spelled S-T-R-A-W-B-E-R-R-Y. Yes, that's correct. Now, I need to go through each letter one by one and count the number of \"r\"s.\n\nStarting with the first letter: S (no), T (no), R (yes, that's one). Then A (no), W (no), B (no), E (no), R (that's two), R (that's three), Y (no). Wait, wait, hold on. Let me write out the letters with their positions to be precise.\n\nBreaking down \"strawberry\" letter by letter:\n1. S\n2. T\n3. R\n4. A\n5. W\n6. B\n7. E\n8. R\n9. R\n10. Y\n\nSo, looking at positions 3, 8, and 9: that's three \"r\"s. But wait, does that match the actual spelling? Let me confirm again. The word is strawberry. Sometimes people might think it's \"strawberry\" with two \"r\"s, but actually, according to correct spelling, it's S-T-R-A-W-B-E-R-R-Y. So after the B and E, there are two R's, right? Let me check a dictionary or maybe think of the pronunciation. Straw-ber-ry. The \"ber\" part is one R, but the correct spelling includes two R's after the E. So yes, that makes three R's in total. Hmm, but let me make sure I'm not miscounting. So positions 3, 8, 9: R, then two R's at the end before Y. That's three R's. Wait, actually, in the breakdown above, position 3 is R, then positions 8 and 9 are the two R's. So total three. Yes, that's right. So the answer should be three. Let me see if I can find any source that confirms this. Alternatively, I can write the word again and count: S T R A W B E R R Y. So R appears once at the beginning (third letter) and then twice towards the end (8th and 9th letters). So total of three times. Therefore, the correct answer is three.\n</think>\n\nThe word \"strawberry\" contains **3** instances of the letter \"r\". Here's the breakdown:\n\n1. **S** \n2. **T** \n3. **R** (1st \"r\") \n4. **A** \n5. **W** \n6. **B** \n7. **E** \n8. **R** (2nd \"r\") \n9. **R** (3rd \"r\") \n10. **Y** \n\nSo, the answer is **3**. 🍓","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":1}],"usage":{"prompt_tokens":17,"total_tokens":688,"completion_tokens":671,"prompt_tokens_details":null}}
```
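The same request can be sent from Python, and the reasoning can be separated from the final answer by splitting on the </think> marker seen in the response above. This is a minimal sketch using only the standard library; the helper names `build_request`, `split_reasoning`, and `ask` are hypothetical, and it assumes the content always uses a single </think> marker as in the sample response.

```python
import json
import urllib.request


def build_request(endpoint, question, model="deepseek-ai/DeepSeek-R1-671B"):
    """Build the same chat-completions request as the curl command (illustrative helper)."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
    }
    return urllib.request.Request(
        f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_reasoning(content, marker="</think>"):
    """Return (reasoning, answer); reasoning is empty if the marker is absent."""
    reasoning, sep, answer = content.partition(marker)
    if not sep:
        return "", content.strip()
    return reasoning.strip(), answer.strip()


def ask(endpoint, question):
    """Send the question and return (reasoning, answer). Requires a running server."""
    with urllib.request.urlopen(build_request(endpoint, question)) as resp:
        body = json.load(resp)
    return split_reasoning(body["choices"][0]["message"]["content"])
```

For example, `ask(endpoint, "how many rs are in strawberry")` with the value of `sky status --endpoint 30000 r1` as `endpoint` would return the reasoning text and the final answer as two separate strings.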
Speed for Generation#
You can find the generation speed in the log of the server.
Example speed for 2 H100:8 nodes on GCP with a single request (you may get better performance with gvnic enabled):
(head, rank=0, pid=18260) [2025-02-14 00:42:22 DP2 TP2] Decode batch. #running-req: 1, #token: 210, token usage: 0.00, gen throughput (token/s): 11.45, #queue-req: 0
(head, rank=0, pid=18260) [2025-02-14 00:42:25 DP2 TP2] Decode batch. #running-req: 1, #token: 250, token usage: 0.00, gen throughput (token/s): 11.53, #queue-req: 0
(head, rank=0, pid=18260) [2025-02-14 00:42:29 DP2 TP2] Decode batch. #running-req: 1, #token: 290, token usage: 0.00, gen throughput (token/s): 11.42, #queue-req: 0
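To summarize the speed over a longer run, the throughput values can be extracted from the server log. The sketch below assumes the log keeps the `gen throughput (token/s): <number>` field shown above; the helper names are hypothetical.

```python
import re

# Matches the throughput field of the decode-batch log lines shown above.
THROUGHPUT_RE = re.compile(r"gen throughput \(token/s\): ([0-9]+(?:\.[0-9]+)?)")


def decode_throughputs(log_text):
    """Return every generation throughput value found in the log text."""
    return [float(value) for value in THROUGHPUT_RE.findall(log_text)]


def mean_throughput(log_text):
    """Return the average throughput, or None if no decode lines are present."""
    values = decode_throughputs(log_text)
    if not values:
        return None
    return sum(values) / len(values)


SAMPLE_LOG = """\
[2025-02-14 00:42:22 DP2 TP2] Decode batch. #running-req: 1, #token: 210, token usage: 0.00, gen throughput (token/s): 11.45, #queue-req: 0
[2025-02-14 00:42:25 DP2 TP2] Decode batch. #running-req: 1, #token: 250, token usage: 0.00, gen throughput (token/s): 11.53, #queue-req: 0
[2025-02-14 00:42:29 DP2 TP2] Decode batch. #running-req: 1, #token: 290, token usage: 0.00, gen throughput (token/s): 11.42, #queue-req: 0
"""

print(decode_throughputs(SAMPLE_LOG))
print(mean_throughput(SAMPLE_LOG))
```

These figures are for a single running request; throughput under concurrent load would need to be measured separately.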
Deploy the Service with Multiple Replicas#
The launching command above only starts a single replica (with 2 nodes) for the service. SkyServe helps deploy the service with multiple replicas with out-of-the-box load balancing, autoscaling and automatic recovery. Importantly, it also enables serving on spot instances resulting in 30% lower cost.
The only change needed is to add a service section with the serving-specific configuration:

service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe:
    path: /health
    # Allow up to 1 hour for cold start
    initial_delay_seconds: 3600
  # Autoscaling from 0 to 2 replicas
  replica_policy:
    min_replicas: 0
    max_replicas: 2
And run the SkyPilot YAML with a single command:
sky serve up -n r1-serve deepseek-r1-671B.yaml
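The readiness probe in the service section amounts to polling /health until the server answers, while tolerating failures during the cold-start window. The sketch below illustrates that idea from the client side; it is not SkyServe's implementation, and the names `http_probe` and `wait_until_ready` and the 10-second interval are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout_s=5.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, initial_delay_s=3600, interval_s=10,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it succeeds or the cold-start window elapses.

    sleep and clock are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + initial_delay_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller could run `wait_until_ready(lambda: http_probe(f"http://{endpoint}/health"))`, where `endpoint` is the address of the deployed service, before sending traffic.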
Included files#
deepseek-r1-671B-A100.yaml
# Adjusted from deepseek-r1-671B.yaml for A100.
name: deepseek-r1-A100

resources:
  accelerators: {A100-80GB:8}
  disk_size: 2048 # The model in BF16 format takes about 1.3TB
  disk_tier: best
  ports: 30000
  any_of:
    - use_spot: true
    - use_spot: false

num_nodes: 4 # Specify number of nodes to launch, the requirement might be different for different accelerators

setup: |
  # Install sglang with all dependencies using uv
  uv pip install "sglang[all]>=0.4.2.post4" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer

  # Set up shared memory for better performance
  sudo bash -c "echo 'vm.max_map_count=655300' >> /etc/sysctl.conf"
  sudo sysctl -p

  echo "FP8 is not supported on A100, we need to convert the model to BF16 format"

  # Conversion script
  git clone https://github.com/deepseek-ai/DeepSeek-V3.git deepseek_repo
  # A workaround for running conversion script on A100. See https://github.com/deepseek-ai/DeepSeek-V3/issues/4
  CONVERSION_SCRIPT="deepseek_repo/inference/fp8_cast_bf16.py"
  sed -i 's/new_state_dict\[weight_name\] = weight_dequant(weight, scale_inv)/new_state_dict[weight_name] = weight_dequant(weight.float(), scale_inv)/' $CONVERSION_SCRIPT

  uv venv venv_convert && source venv_convert/bin/activate
  # setuptools is needed by triton
  uv pip install huggingface_hub setuptools -r deepseek_repo/inference/requirements.txt

  # Download the model weights and convert to BF16 format
  echo "Downloading model weights..."
  FP8_MODEL_DIR="DeepSeek-R1-FP8"
  python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='deepseek-ai/DeepSeek-R1', local_dir='./$FP8_MODEL_DIR')"

  # Convert the model to BF16 format
  MODEL_DIR="DeepSeek-R1-BF16"
  python $CONVERSION_SCRIPT \
    --input-fp8-hf-path $FP8_MODEL_DIR \
    --output-bf16-hf-path $MODEL_DIR
  if [ $? -ne 0 ]; then
    echo "BF16 conversion failed"
    exit 1
  fi

  MODEL_FILES=(
    "config.json"
    "generation_config.json"
    "modeling_deepseek.py"
    "configuration_deepseek.py"
    "tokenizer.json"
    "tokenizer_config.json"
    # the bf16 directory has its own model.safetensors.index.json
  )
  cp "${MODEL_FILES[@]/#/$FP8_MODEL_DIR/}" $MODEL_DIR/

  # See https://github.com/sgl-project/sglang/issues/3491
  sed -i '/"quantization_config": {/,/}/d' $MODEL_DIR/config.json
  echo "BF16 conversion completed. Model saved to $(realpath $MODEL_DIR)"
  ls -lh "$MODEL_DIR" # List files for verification

run: |
  # Launch the server with appropriate configuration
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  # TP should be number of GPUs per node times number of nodes
  TP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  # For A100, we only export the head node for serving requests
  if [ "$SKYPILOT_NODE_RANK" -eq 0 ]; then
    HEAD_NODE_ARGS="--host 0.0.0.0 --port 30000"
  else
    HEAD_NODE_ARGS=""
  fi

  python -m sglang.launch_server \
    --model-path DeepSeek-R1-BF16 \
    --tp $TP \
    --dist-init-addr ${MASTER_ADDR}:5000 \
    --nnodes ${SKYPILOT_NUM_NODES} \
    --node-rank ${SKYPILOT_NODE_RANK} \
    --trust-remote-code \
    --enable-dp-attention \
    --enable-torch-compile \
    --torch-compile-max-bs 8 \
    $HEAD_NODE_ARGS

# Optional: Service configuration for SkyServe deployment
# This will be ignored when deploying with `sky launch`
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe:
    path: /health
    # Allow up to 1 hour for cold start
    initial_delay_seconds: 3600
  # Autoscaling from 0 to 2 replicas
  replica_policy:
    min_replicas: 0
    max_replicas: 2
deepseek-r1-671B.yaml
name: deepseek-r1

resources:
  accelerators: {H200:8, H100:8, A100-80GB:8}
  disk_size: 1024 # Large disk for model weights
  disk_tier: best
  ports: 30000
  any_of:
    - use_spot: true
    - use_spot: false

num_nodes: 2 # Specify number of nodes to launch

setup: |
  # Install sglang with all dependencies using uv
  uv pip install "sglang[all]==0.4.2.post4" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer

  # Set up shared memory for better performance
  sudo bash -c "echo 'vm.max_map_count=655300' >> /etc/sysctl.conf"
  sudo sysctl -p

run: |
  # Launch the server with appropriate configuration
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  # TP should be number of GPUs per node times number of nodes
  TP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  python -m sglang.launch_server \
    --model deepseek-ai/DeepSeek-R1 \
    --tp $TP \
    --dist-init-addr ${MASTER_ADDR}:5000 \
    --nnodes ${SKYPILOT_NUM_NODES} \
    --node-rank ${SKYPILOT_NODE_RANK} \
    --trust-remote-code \
    --enable-dp-attention \
    --enable-torch-compile \
    --torch-compile-max-bs 8 \
    --host 0.0.0.0 \
    --port 30000

# Optional: Service configuration for SkyServe deployment
# This will be ignored when deploying with `sky launch`
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe:
    path: /health
    # Allow up to 1 hour for cold start
    initial_delay_seconds: 3600
  # Autoscaling from 0 to 2 replicas
  replica_policy:
    min_replicas: 0
    max_replicas: 2