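Preface

While recently testing deployments of the Qwen3 model family, I noticed that vLLM has already moved on to v0.8.5. The previous tutorial on this topic was still based on vLLM v0.4.2, so this one focuses on building and deploying the newer vLLM releases.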

System requirements

System image: docker pull nvidia/cuda:12.4.1-cudnn-devel-rockylinux8
GPU: A100-SXM4-40GB

Installation

PyTorch 2.7.0 and later no longer provide CUDA 12.4 builds, so PyTorch 2.6.0 has to be installed instead.

conda create -n vllm python=3.10
conda activate vllm
pip install uv
uv -v pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 xformers==0.0.29.post2 flashinfer-python --index-url https://download.pytorch.org/whl/cu124 --extra-index-url https://flashinfer.ai/whl/cu124/torch2.6
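Before moving on, it can help to confirm that the environment really ended up with the PyTorch 2.6.0 + CUDA 12.4 build. The following is a minimal sketch, not part of the original setup: the helper names `check_torch_build` and `check_installed_torch` are hypothetical, and the only PyTorch attributes it relies on are `torch.__version__` and `torch.version.cuda`.

```python
def check_torch_build(torch_version, cuda_version,
                      expected_torch="2.6.0", expected_cuda="12.4"):
    """Return a list of mismatches between the installed and expected build."""
    problems = []
    # torch.__version__ usually looks like "2.6.0+cu124"; drop the local suffix.
    base_version = torch_version.split("+", 1)[0]
    if base_version != expected_torch:
        problems.append(f"torch is {base_version}, expected {expected_torch}")
    # torch.version.cuda is None for CPU-only builds.
    if cuda_version != expected_cuda:
        problems.append(f"CUDA build is {cuda_version}, expected {expected_cuda}")
    return problems


def check_installed_torch():
    """Run the check against the torch package in the current environment."""
    import torch  # imported lazily so the helper above works without torch

    return check_torch_build(torch.__version__, torch.version.cuda)
```

Calling `check_installed_torch()` inside the `vllm` conda environment should return an empty list when the versions match the ones pinned above.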

Prebuilt vLLM (recommended)

pip install https://github.com/vllm-project/vllm/releases/download/v0.8.5.post1/vllm-0.8.5.post1-cp38-abi3-manylinux1_x86_64.whl

Building vLLM from source

Later prebuilt vLLM releases depend on PyTorch 2.7.0 or newer, so vLLM has to be built from source to keep working with PyTorch 2.6.0 + CUDA 12.4.

<=v0.9.2

git clone -b v0.9.2 https://github.com/vllm-project/vllm.git
cd vllm
# Edit the pinned versions in requirements/cuda.txt so that they match
# PyTorch 2.6.0; otherwise a later install step pulls in a mismatched PyTorch
# torch==2.6.0
# torchaudio==2.6.0
# torchvision==0.21.0
# xformers>=0.0.29.post2,<=0.0.29.post3

# Install build dependencies
pip install -U cmake packaging pip ninja setuptools setuptools-scm regex
yum install -y git git-lfs gcc-toolset-11
source /opt/rh/gcc-toolset-11/enable

# Build and install
# Adjust TORCH_CUDA_ARCH_LIST to the architecture of your actual GPU model
# https://developer.nvidia.com/cuda-gpus
TORCH_CUDA_ARCH_LIST="8.0 8.6 8.9 9.0a" \
  MAX_JOBS=4 \
  NVCC_THREADS=2 \
  CMAKE_BUILD_TYPE=Release \
  pip wheel \
  -v \
  --no-build-isolation \
  --no-cache-dir \
  --no-deps \
  .
pip install ./vllm-*.whl
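Instead of looking up the compute capability by hand, the value for TORCH_CUDA_ARCH_LIST can be derived from the GPUs in the build machine. This is a rough sketch under the assumption that PyTorch is already installed; `arch_list` and `local_arch_list` are hypothetical helper names, and the output only contains plain entries such as "8.0", so architecture-specific variants like "9.0a" still have to be added manually.

```python
def arch_list(capabilities):
    """Turn (major, minor) compute capabilities into a TORCH_CUDA_ARCH_LIST value."""
    # Deduplicate (multi-GPU hosts usually report the same capability repeatedly)
    # and sort so the result is stable.
    unique = sorted(set(capabilities))
    return " ".join(f"{major}.{minor}" for major, minor in unique)


def local_arch_list():
    """Query the visible GPUs through PyTorch and build the arch list for them."""
    import torch  # imported lazily so arch_list() works without torch

    capabilities = [
        torch.cuda.get_device_capability(index)
        for index in range(torch.cuda.device_count())
    ]
    return arch_list(capabilities)
```

On the A100 used in this tutorial, `local_arch_list()` is expected to yield "8.0"; restricting the list to the architectures you actually deploy on generally shortens the build.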

>=v0.10.0

Building v0.10.0 additionally requires patching some of the code by hand so that the custom operators register successfully.

requirements/cuda.txt

--- a/requirements/cuda.txt
+++ b/requirements/cuda.txt
@@ -6,9 +6,8 @@ numba == 0.61.2; python_version > '3.9'
 
 # Dependencies for NVIDIA GPUs
 ray[cgraph]>=2.43.0, !=2.44.* # Ray Compiled Graph, required for pipeline parallelism in V1.
-torch==2.7.1
-torchaudio==2.7.1
+torch==2.6.0
+torchaudio==2.6.0
 # These must be updated alongside torch
-torchvision==0.22.1 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version
-# https://github.com/facebookresearch/xformers/releases/tag/v0.0.31
-xformers==0.0.31; platform_system == 'Linux' and platform_machine == 'x86_64'  # Requires PyTorch >= 2.7
+torchvision==0.21.0 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version
+xformers>=0.0.29.post2,<=0.0.29.post3; platform_system == 'Linux' and platform_machine == 'x86_64'  # Requires PyTorch >= 2.6

vllm/utils/__init__.py

--- a/vllm/utils/__init__.py
+++ b/vllm/utils/__init__.py
@@ -46,7 +46,7 @@ from concurrent.futures.process import ProcessPoolExecutor
 from dataclasses import dataclass, field
 from functools import cache, lru_cache, partial, wraps
 from types import MappingProxyType
-from typing import (TYPE_CHECKING, Any, Callable, Generic, Literal, NamedTuple,
+from typing import (TYPE_CHECKING, Any, Callable, Generic, List, Literal, NamedTuple,
                     Optional, Tuple, TypeVar, Union, cast, overload)
 from urllib.parse import urlparse
 from uuid import uuid4
@@ -2516,6 +2516,12 @@ def direct_register_custom_op(
             "the required dependencies.")
         return
 
+    # https://github.com/vllm-project/vllm-ascend/pull/837
+    for k, v in op_func.__annotations__.items():
+        if v == list[int]:
+            op_func.__annotations__[k] = List[int]
+        if v == Optional[list[int]]:
+            op_func.__annotations__[k] = Optional[List[int]]
     import torch.library
     if hasattr(torch.library, "infer_schema"):
         schema_str = torch.library.infer_schema(op_func,
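The effect of the annotation patch can be reproduced outside vLLM. The sketch below is a standalone illustration, not vLLM code: `normalize_annotations` is a hypothetical helper that applies the same two rewrites as the patch, turning built-in `list[int]` annotations into their `typing.List[int]` equivalents. It assumes Python 3.9 or newer, where `list[int]` is valid at runtime.

```python
from typing import List, Optional


def normalize_annotations(func):
    """Rewrite list[int] style annotations to typing.List[int] in place."""
    for name, annotation in func.__annotations__.items():
        # list[int] and typing.List[int] are distinct objects that do not
        # compare equal, so each spelling has to be matched explicitly.
        if annotation == list[int]:
            func.__annotations__[name] = List[int]
        elif annotation == Optional[list[int]]:
            func.__annotations__[name] = Optional[List[int]]
    return func


def example_op(sizes: list[int], strides: Optional[list[int]] = None,
               scale: int = 1) -> None:
    """Placeholder operator used only to demonstrate the rewrite."""
```

After `normalize_annotations(example_op)`, the `sizes` and `strides` annotations use the `typing` spellings while `scale` and the return annotation are left untouched.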

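Deployment

Starting an OpenAI-compatible server takes a single command, and the command-line arguments are almost identical to SGLang's. Taking the Qwen3-32B model as an example (v0.8.5.post1 does not yet support the --reasoning-parser qwen3 argument, so the latest main branch has to be built from source):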

vllm serve Qwen3-32B --enable-reasoning --reasoning-parser qwen3 -tp 4 --trust-remote-code --host 0.0.0.0 --port 8081

Note: starting with v0.10.0, the --enable-reasoning option is deprecated and has to be removed from the launch command.
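If the same launch script has to work across vLLM versions, the version-dependent flag can be decided programmatically. The following is a small sketch of that idea; `version_tuple` and `build_serve_command` are hypothetical helpers, and the only version rule encoded is the one stated in this tutorial (pass --enable-reasoning before v0.10.0, omit it from v0.10.0 on).

```python
import re


def version_tuple(version):
    """Parse strings like "0.8.5.post1" or "v0.10.0" into (major, minor, patch)."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def build_serve_command(model, vllm_version, tensor_parallel=4,
                        host="0.0.0.0", port=8081):
    """Assemble the vllm serve argument list used in this tutorial."""
    command = ["vllm", "serve", model]
    # Per the note above, the flag is only passed to versions before v0.10.0.
    if version_tuple(vllm_version) < (0, 10, 0):
        command.append("--enable-reasoning")
    command += [
        "--reasoning-parser", "qwen3",
        "-tp", str(tensor_parallel),
        "--trust-remote-code",
        "--host", host,
        "--port", str(port),
    ]
    return command
```

The returned list can be handed to `subprocess.run`; the installed version string could be read with `importlib.metadata.version("vllm")`.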

Testing

Request:

curl http://127.0.0.1:8081/v1/chat/completions \
     -X POST \
     -H "Content-Type: application/json" \
     -d '{"messages":[{"role":"user","content":"你是谁?"}]}'

Response:

{
  "id": "chatcmpl-7e212cb8dbd7493e820c06d850dfc5c5",
  "object": "chat.completion",
  "created": 1746348395,
  "model": "Qwen3-32B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "\n嗯，用户问“你是谁？”，我需要先确定用户的需求是什么。可能他们想了解我的基本功能，或者想确认我的身份。首先，我应该按照指示给出明确的身份说明，包括我的中文名和英文名，以及我的开发公司。然后，要强调我是超大规模的语言模型，能够处理多种任务，比如回答问题、创作文字、编程等。接下来，需要分点列出我的主要功能，这样用户看起来更清晰。同时，要提到我支持多语言，这样用户知道如果他们需要用其他语言交流，我也能应对。最后，用友好的语气邀请用户提问或给出任务，这样可以促进进一步的互动。要注意避免使用技术术语，保持回答简洁易懂。另外，用户可能有不同的使用场景，比如学习、工作或娱乐，所以需要保持回答的通用性，同时准备好根据后续问题调整回答。还要确保没有遗漏任何关键信息，比如我的训练数据截止日期，但根据之前的指示，可能不需要提到这个。总之，回答要准确、友好，并引导用户继续互动。\n",
        "content": "\n\n你好！我是通义千问，英文名Qwen，是由阿里巴巴集团旗下的通义实验室自主研发的超大规模语言模型。我能够回答问题、创作文字，比如写故事、写公文、写邮件、写剧本、逻辑推理、编程等等，还能表达观点，玩游戏等。我支持多种语言，包括但不限于中文、英文、德语、法语、西班牙语等。\n\n如果你有任何问题或需要帮助，随时告诉我，我会尽力为你提供支持！😊",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 11,
    "total_tokens": 336,
    "completion_tokens": 325,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
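The same request can be sent from Python using only the standard library, which also makes it easy to separate the reasoning trace from the final answer. This is a minimal sketch, assuming the server from the deployment step is listening on 127.0.0.1:8081 and returns the response shape shown above; `build_request`, `split_reply`, and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_request(base_url, prompt):
    """Create the POST request for the OpenAI-compatible chat endpoint."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_reply(response):
    """Return (reasoning, answer) from a parsed chat completion response."""
    message = response["choices"][0]["message"]
    # reasoning_content may be missing or null when no reasoning parser is active.
    reasoning = (message.get("reasoning_content") or "").strip()
    answer = (message.get("content") or "").strip()
    return reasoning, answer


def ask(base_url, prompt, timeout=600.0):
    """Send the prompt to a running server and split the reply."""
    request = build_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as http_response:
        return split_reply(json.load(http_response))
```

With the server running, `ask("http://127.0.0.1:8081", "你是谁?")` would return the reasoning text and the final answer as two separate strings.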
