Reference code
src/llamafactory/chat/vllm_engine.py
Environment versions
At the time of the latest update to this article, the library versions in the author's environment were:
- Python: 3.10.14
- GCC: 9.4.0
- G++: 9.4.0
- LLaMA-Factory: 0.7.1
- PyTorch: 2.2.2+cuda11.8
- vLLM: 0.4.2
Launch script
CUDA_VISIBLE_DEVICES=0,1 \
API_PORT=8081 \
llamafactory-cli api \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--adapter_name_or_path saves/xxx/checkpoint-yyy \
--template llama3 \
--infer_backend vllm \
--vllm_enforce_eager \
--vllm_maxlen 16000 \
--vllm_max_lora_rank 64 \
--finetuning_type lora
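The command above serves an OpenAI-compatible API on port 8081. As a quick check that the server is up, it can be queried with the `openai` Python client. The sketch below is not part of the reference code; the port, model name, and dummy API key are assumptions taken from the launch script.
# Minimal sketch: query the OpenAI-compatible endpoint started by `llamafactory-cli api`.
# The base_url (port 8081) and model name follow the launch script above; the api_key is a dummy value.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8081/v1", api_key="none")
completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)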
Getting started with vLLM
Environment setup
Installing vLLM directly with pip will force-update the PyTorch version already installed in your environment (for example, vLLM v0.4.2 officially depends on PyTorch 2.3.0), so building from source is recommended. In the author's tests, vLLM v0.4.2 failed to compile under CUDA 11.6 but compiled successfully under CUDA 11.8.
# Configure the CUDA environment
export CUDA_HOME=/usr/local/cuda-11.8
export PATH="${CUDA_HOME}/bin:$PATH"
nvcc --version # Cuda compilation tools, release 11.8, V11.8.89
# Install build dependencies
pip install -U cmake ninja packaging setuptools wheel
# --no-build-isolation: let the build see the PyTorch already installed in the current environment
# --no-deps: keep the currently installed PyTorch from being force-updated
# LoRA support requires building with the `VLLM_INSTALL_PUNICA_KERNELS=1` environment variable
VLLM_INSTALL_PUNICA_KERNELS=1 pip install git+https://github.com/vllm-project/vllm.git@v0.4.2 -v --no-build-isolation --no-cache-dir --no-deps
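Once the build completes, a short import check (a sketch for verification only, not part of the reference code) confirms that the expected vLLM version was installed and that the pre-existing PyTorch build was not replaced:
# Sanity check after building from source: the vLLM version should match the checked-out tag,
# and the previously installed PyTorch / CUDA build should be unchanged.
import torch
import vllm

print(vllm.__version__)    # expected: 0.4.2
print(torch.__version__)   # expected: the pre-existing 2.2.2 build, not 2.3.0
print(torch.version.cuda)  # expected: 11.8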
Model inference
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="facebook/opt-125m")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
LoRA inference
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Path to a trained LoRA adapter directory (placeholder; replace with your own adapter).
sql_lora_path = "/path/to/sql_lora_adapter"

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["[/assistant]"]
)
prompts = [
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
)
Common errors
- `ImportError: punica LoRA kernels could not be imported. If you built vLLM from source, make sure VLLM_INSTALL_PUNICA_KERNELS=1 env var was set.`
  Cause: vLLM was built from source without LoRA support enabled.
  Fix: rebuild with the `VLLM_INSTALL_PUNICA_KERNELS=1` environment variable set.
- `Input prompt (4992 tokens) is too long and exceeds limit of 4096.`
  Cause: the input prompt is longer than the maximum context length currently configured for the engine.
  Fix: raise the `max_model_len` parameter when creating the `LLMEngine`.
- `ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (3664).`
  Cause: the available GPU memory cannot hold a KV cache large enough for a 4096-token context.
  Fix: when creating the `LLMEngine`, either raise the `gpu_memory_utilization` ratio or lower `max_model_len` to 3664 or below.
- `ValueError: LoRA rank 256 is greater than max_lora_rank 16.`
  Cause: the rank of the loaded LoRA adapter exceeds the maximum configured for the engine (note that the LoRA rank is fixed when the adapter is trained and cannot be changed at inference time).
  Fix: pass `max_lora_rank=64` when creating the `LLMEngine`; as of v0.4.2, the supported values of `max_lora_rank` are (8, 16, 32, 64). If the adapter's rank exceeds 64, consider merging the adapter weights into the base model weights before inference, as shown in the sketch after this list.
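For the last case, one way to merge a high-rank adapter into the base model is `peft`'s `merge_and_unload`; the merged checkpoint can then be served by vLLM without `enable_lora`. This is a minimal sketch rather than the project's own export path (the model name and directories are placeholders); LLaMA-Factory's `llamafactory-cli export` command can produce the same kind of merged checkpoint.
# Minimal sketch: merge a LoRA adapter whose rank exceeds max_lora_rank into the base model,
# then serve the merged directory with vLLM directly (no enable_lora / max_lora_rank needed).
# The model name and paths below are placeholders.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "saves/xxx/checkpoint-yyy")
merged = model.merge_and_unload()       # fold the LoRA deltas into the base weights
merged.save_pretrained("merged-model")  # serve this directory with vLLM
AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct").save_pretrained("merged-model")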
VllmEngine: model loading
Configure the engine arguments
# Reference: https://github.com/vllm-project/vllm/blob/0650e5935b0f6af35fb2acf71769982c47b804d7/vllm/entrypoints/llm.py#L30
engine_args = {
    "model": model_args.model_name_or_path,
    "trust_remote_code": True,
    "download_dir": model_args.cache_dir,
    "dtype": infer_dtype,  # bfloat16 / float16 / float32
    # The model context length.
    # Usually determined from the `model_max_length` field in the model's `tokenizer_config.json`
    # (e.g. 32k for Qwen1.5-32B), but it often needs to be lowered to fit the available GPU memory,
    # i.e. it must not exceed what the KV cache can hold. With two A100s and
    # `gpu_memory_utilization=0.9`, at least 16k is feasible.
    "max_model_len": model_args.vllm_maxlen,
    # Distributed tensor-parallel inference and serving.
    # Reference: https://docs.vllm.ai/en/latest/serving/distributed_serving.html
    "tensor_parallel_size": get_device_count() or 1,
    "gpu_memory_utilization": model_args.vllm_gpu_util,
    "disable_log_stats": True,
    "disable_log_requests": True,
    # Whether to enforce eager execution. If True, we will
    # disable CUDA graph and always execute the model in eager mode.
    # If False, we will use CUDA graph and eager execution in hybrid.
    # Reference: https://github.com/vllm-project/vllm/issues/4449
    "enforce_eager": model_args.vllm_enforce_eager,
    # Enable inference with LoRA adapters.
    "enable_lora": model_args.adapter_name_or_path is not None,
    # Maximum supported LoRA adapter rank.
    "max_lora_rank": model_args.vllm_max_lora_rank,
}
Initialize the engine
from vllm import AsyncEngineArgs, AsyncLLMEngine
self.model = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**engine_args))
VllmEngine: model inference
Tokenize the input prompt
# Append an empty assistant turn so the template adds the assistant prompt prefix.
paired_messages = messages + [{"role": "assistant", "content": ""}]
prompt_ids, _ = self.template.encode_oneturn(
    tokenizer=self.tokenizer, messages=paired_messages, system=system, tools=tools
)
prompt_length = len(prompt_ids)
Configure SamplingParams
# Parse the request parameters.
# Number of output sequences to return (how many answers the model generates).
num_return_sequences = input_kwargs.pop("num_return_sequences", 1)
# Temperature; a value of 0 corresponds to greedy sampling.
# Reference: https://huggingface.co/blog/how-to-generate#greedy-search
temperature = input_kwargs.pop("temperature", self.generating_args["temperature"])
# Top-p (nucleus) sampling. Reference: https://huggingface.co/blog/how-to-generate#top-p-nucleus-sampling
top_p = input_kwargs.pop("top_p", self.generating_args["top_p"])
# Top-k sampling. Reference: https://huggingface.co/blog/how-to-generate#top-k-sampling
top_k = input_kwargs.pop("top_k", self.generating_args["top_k"])
# Values > 1 push the model toward new tokens; values < 1 push it toward repeating (old) tokens.
repetition_penalty = input_kwargs.pop("repetition_penalty", self.generating_args["repetition_penalty"])
# Whether to use beam search. Reference: https://huggingface.co/blog/how-to-generate#beam-search
use_beam_search = self.generating_args["num_beams"] > 1
# Only effective with beam search: penalizes (weights) the beam score by the output sequence length,
# `self.get_cumulative_logprob() / (seq_len**length_penalty)`
# Reference: https://github.com/vllm-project/vllm/blob/v0.4.2/vllm/sequence.py#L338
length_penalty = input_kwargs.pop("length_penalty", self.generating_args["length_penalty"])
# Used to compute the maximum length (max_tokens) of each output sequence.
max_length = input_kwargs.pop("max_length", None)
max_new_tokens = input_kwargs.pop("max_new_tokens", None)
max_tokens = self.generating_args["max_new_tokens"] or self.generating_args["max_length"]
if max_length:
    max_tokens = max_length - prompt_length if max_length > prompt_length else 1
if max_new_tokens:
    max_tokens = max_new_tokens
# List of strings that stop the generation when they are generated.
# The returned output will not contain the stop strings.
stop = input_kwargs.pop("stop", None)
# Convert generating_args into SamplingParams.
sampling_params = SamplingParams(
    n=num_return_sequences,
    repetition_penalty=repetition_penalty,
    temperature=temperature,
    top_p=top_p,
    top_k=top_k,
    use_beam_search=use_beam_search,
    length_penalty=length_penalty,
    # List of tokens that stop the generation when they are generated.
    # The returned output will contain the stop tokens unless the stop tokens are special tokens.
    stop_token_ids=[self.tokenizer.eos_token_id] + self.tokenizer.additional_special_tokens_ids,
    # Whether to skip special tokens in the output.
    skip_special_tokens=True,
    max_tokens=max_tokens,
    stop=stop,
)
Generate the output
if model_args.adapter_name_or_path is not None:
    # The first parameter of LoRARequest is a human identifiable name,
    # the second parameter is a globally unique ID for the adapter
    # and the third parameter is the path to the LoRA adapter.
    self.lora_request = LoRARequest("default", 1, model_args.adapter_name_or_path[0])
else:
    self.lora_request = None
# Start the generation.
results_generator = self.model.generate(
    prompt=None,
    sampling_params=sampling_params,
    request_id=request_id,
    prompt_token_ids=prompt_ids,
    # LoRA request to use for generation, if any.
    lora_request=self.lora_request,
    multi_modal_data=None,
)
# Get the results.
final_output = None
async for request_output in results_generator:
    final_output = request_output
results = []
for output in final_output.outputs:
    results.append(
        Response(
            response_text=output.text,
            response_length=len(output.token_ids),
            prompt_length=len(final_output.prompt_token_ids),
            finish_reason=output.finish_reason,
        )
    )
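In practice this engine is driven through LLaMA-Factory's `ChatModel` wrapper rather than instantiated by hand. The sketch below is an assumption about how the 0.7.x `ChatModel` interface is called, not an excerpt from the reference code; the arguments mirror the launch script above and the adapter path is a placeholder.
# Minimal sketch: exercise VllmEngine through LLaMA-Factory's ChatModel wrapper.
# Arguments mirror the launch script above; the adapter path is a placeholder.
from llamafactory.chat import ChatModel

chat_model = ChatModel(dict(
    model_name_or_path="meta-llama/Meta-Llama-3-8B-Instruct",
    adapter_name_or_path="saves/xxx/checkpoint-yyy",
    template="llama3",
    finetuning_type="lora",
    infer_backend="vllm",
    vllm_enforce_eager=True,
    vllm_maxlen=16000,
    vllm_max_lora_rank=64,
))
responses = chat_model.chat(messages=[{"role": "user", "content": "Hello!"}])
print(responses[0].response_text)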