Reference code
src/llamafactory/chat/vllm_engine.py
Environment versions
At the time of the latest update to this article, the library versions in the author's environment were:
- Python: 3.10.14
- GCC: 9.4.0
- G++: 9.4.0
- LLaMA-Factory: 0.7.1
- PyTorch: 2.2.2+cuda11.8
- vLLM: 0.4.2
Launch script
CUDA_VISIBLE_DEVICES=0,1 \
API_PORT=8081 \
llamafactory-cli api \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--adapter_name_or_path saves/xxx/checkpoint-yyy \
--template llama3 \
--infer_backend vllm \
--vllm_enforce_eager \
--vllm_maxlen 16000 \
--vllm_max_lora_rank 64 \
--finetuning_type lora
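The command above serves an OpenAI-compatible API on port 8081. As a quick check that the server is up, it can be queried with the `openai` Python client. The sketch below is not part of the reference code; the port, model name, and dummy API key are assumptions taken from the launch script.
# Minimal sketch: query the OpenAI-compatible endpoint started by `llamafactory-cli api`.
# The base_url (port 8081) and model name follow the launch script above; the api_key is a dummy value.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8081/v1", api_key="none")
completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)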
Getting started with vLLM
Environment setup
Installing vLLM directly with pip will force-update the PyTorch version already installed in your environment (for example, vLLM v0.4.2 officially depends on PyTorch 2.3.0), so building from source is recommended. In the author's tests, vLLM v0.4.2 failed to compile under CUDA 11.6 but compiled successfully under CUDA 11.8.
# Configure the CUDA environment
export CUDA_HOME=/usr/local/cuda-11.8
export PATH="${CUDA_HOME}/bin:$PATH"
nvcc --version # Cuda compilation tools, release 11.8, V11.8.89
# Install build dependencies
pip install -U cmake ninja packaging setuptools wheel
# --no-build-isolation: let the build see the PyTorch already installed in the current environment
# --no-deps: keep the currently installed PyTorch from being force-updated
# LoRA support requires building with the `VLLM_INSTALL_PUNICA_KERNELS=1` environment variable
VLLM_INSTALL_PUNICA_KERNELS=1 pip install git+https://github.com/vllm-project/vllm.git@v0.4.2 -v --no-build-isolation --no-cache-dir --no-deps
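Once the build completes, a short import check (a sketch for verification only, not part of the reference code) confirms that the expected vLLM version was installed and that the pre-existing PyTorch build was not replaced:
# Sanity check after building from source: the vLLM version should match the checked-out tag,
# and the previously installed PyTorch / CUDA build should be unchanged.
import torch
import vllm

print(vllm.__version__)    # expected: 0.4.2
print(torch.__version__)   # expected: the pre-existing 2.2.2 build, not 2.3.0
print(torch.version.cuda)  # expected: 11.8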
Model inference
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="facebook/opt-125m")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
LoRA inference
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Path to a trained LoRA adapter directory (placeholder; replace with your own adapter).
sql_lora_path = "/path/to/sql_lora_adapter"

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["[/assistant]"]
)
prompts = [
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
)
Common errors
- `ImportError: punica LoRA kernels could not be imported. If you built vLLM from source, make sure VLLM_INSTALL_PUNICA_KERNELS=1 env var was set.`
  Cause: vLLM was built from source without LoRA support enabled.
  Fix: rebuild with the `VLLM_INSTALL_PUNICA_KERNELS=1` environment variable set.
- `Input prompt (4992 tokens) is too long and exceeds limit of 4096.`
  Cause: the input prompt is longer than the maximum context length currently configured for the engine.
  Fix: raise the `max_model_len` parameter when creating the `LLMEngine`.
- `ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (3664).`
  Cause: the available GPU memory cannot hold a KV cache large enough for a 4096-token context.
  Fix: when creating the `LLMEngine`, either raise the `gpu_memory_utilization` ratio or lower `max_model_len` to 3664 or below.
- `ValueError: LoRA rank 256 is greater than max_lora_rank 16.`
  Cause: the rank of the loaded LoRA adapter exceeds the maximum configured for the engine (note that the LoRA rank is fixed when the adapter is trained and cannot be changed at inference time).
  Fix: pass `max_lora_rank=64` when creating the `LLMEngine`; as of v0.4.2, the supported values of `max_lora_rank` are (8, 16, 32, 64). If the adapter's rank exceeds 64, consider merging the adapter weights into the base model weights before inference, as shown in the sketch after this list.
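For the last case, one way to merge a high-rank adapter into the base model is `peft`'s `merge_and_unload`; the merged checkpoint can then be served by vLLM without `enable_lora`. This is a minimal sketch rather than the project's own export path (the model name and directories are placeholders); LLaMA-Factory's `llamafactory-cli export` command can produce the same kind of merged checkpoint.
# Minimal sketch: merge a LoRA adapter whose rank exceeds max_lora_rank into the base model,
# then serve the merged directory with vLLM directly (no enable_lora / max_lora_rank needed).
# The model name and paths below are placeholders.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "saves/xxx/checkpoint-yyy")
merged = model.merge_and_unload()       # fold the LoRA deltas into the base weights
merged.save_pretrained("merged-model")  # serve this directory with vLLM
AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct").save_pretrained("merged-model")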
VllmEngine: model loading
Configure the engine arguments
# Reference: https://github.com/vllm-project/vllm/blob/0650e5935b0f6af35fb2acf71769982c47b804d7/vllm/entrypoints/llm.py#L30
engine_args = {
    "model": model_args.model_name_or_path,
    "trust_remote_code": True,
    "download_dir": model_args.cache_dir,
    "dtype": infer_dtype,  # bfloat16 / float16 / float32
    # The model context length.
    # Usually determined from the `model_max_length` field in the model's `tokenizer_config.json`
    # (e.g. 32k for Qwen1.5-32B), but it often needs to be lowered to fit the available GPU memory,
    # i.e. it must not exceed what the KV cache can hold. With two A100s and
    # `gpu_memory_utilization=0.9`, at least 16k is feasible.
    "max_model_len": model_args.vllm_maxlen,
    # Distributed tensor-parallel inference and serving.
    # Reference: https://docs.vllm.ai/en/latest/serving/distributed_serving.html
    "tensor_parallel_size": get_device_count() or 1,
    "gpu_memory_utilization": model_args.vllm_gpu_util,
    "disable_log_stats": True,
    "disable_log_requests": True,
    # Whether to enforce eager execution. If True, we will
    # disable CUDA graph and always execute the model in eager mode.
    # If False, we will use CUDA graph and eager execution in hybrid.
    # Reference: https://github.com/vllm-project/vllm/issues/4449
    "enforce_eager": model_args.vllm_enforce_eager,
    # Enable inference with LoRA adapters.
    "enable_lora": model_args.adapter_name_or_path is not None,
    # Maximum supported LoRA adapter rank.
    "max_lora_rank": model_args.vllm_max_lora_rank,
}
Initialize the engine
from vllm import AsyncEngineArgs, AsyncLLMEngine
self.model = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**engine_args))
VllmEngine: model inference
Tokenize the input prompt
# Append an empty assistant turn so the template adds the assistant prompt prefix.
paired_messages = messages + [{"role": "assistant", "content": ""}]
prompt_ids, _ = self.template.encode_oneturn(
    tokenizer=self.tokenizer, messages=paired_messages, system=system, tools=tools
)
prompt_length = len(prompt_ids)
Configure SamplingParams
# Parse the request parameters.
# Number of output sequences to return (how many answers the model generates).
num_return_sequences = input_kwargs.pop("num_return_sequences", 1)
# Temperature; a value of 0 corresponds to greedy sampling.
# Reference: https://huggingface.co/blog/how-to-generate#greedy-search
temperature = input_kwargs.pop("temperature", self.generating_args["temperature"])
# Top-p (nucleus) sampling. Reference: https://huggingface.co/blog/how-to-generate#top-p-nucleus-sampling
top_p = input_kwargs.pop("top_p", self.generating_args["top_p"])
# Top-k sampling. Reference: https://huggingface.co/blog/how-to-generate#top-k-sampling
top_k = input_kwargs.pop("top_k", self.generating_args["top_k"])
# Values > 1 push the model toward new tokens; values < 1 push it toward repeating (old) tokens.
repetition_penalty = input_kwargs.pop("repetition_penalty", self.generating_args["repetition_penalty"])
# Whether to use beam search. Reference: https://huggingface.co/blog/how-to-generate#beam-search
use_beam_search = self.generating_args["num_beams"] > 1
# Only effective with beam search: penalizes (weights) the beam score by the output sequence length,
# `self.get_cumulative_logprob() / (seq_len**length_penalty)`
# Reference: https://github.com/vllm-project/vllm/blob/v0.4.2/vllm/sequence.py#L338
length_penalty = input_kwargs.pop("length_penalty", self.generating_args["length_penalty"])
# Used to compute the maximum length (max_tokens) of each output sequence.
max_length = input_kwargs.pop("max_length", None)
max_new_tokens = input_kwargs.pop("max_new_tokens", None)
max_tokens = self.generating_args["max_new_tokens"] or self.generating_args["max_length"]
if max_length:
    max_tokens = max_length - prompt_length if max_length > prompt_length else 1
if max_new_tokens:
    max_tokens = max_new_tokens
# List of strings that stop the generation when they are generated.
# The returned output will not contain the stop strings.
stop = input_kwargs.pop("stop", None)
# Convert generating_args into SamplingParams.
sampling_params = SamplingParams(
    n=num_return_sequences,
    repetition_penalty=repetition_penalty,
    temperature=temperature,
    top_p=top_p,
    top_k=top_k,
    use_beam_search=use_beam_search,
    length_penalty=length_penalty,
    # List of tokens that stop the generation when they are generated.
    # The returned output will contain the stop tokens unless the stop tokens are special tokens.
    stop_token_ids=[self.tokenizer.eos_token_id] + self.tokenizer.additional_special_tokens_ids,
    # Whether to skip special tokens in the output.
    skip_special_tokens=True,
    max_tokens=max_tokens,
    stop=stop,
)
Generate the output
if model_args.adapter_name_or_path is not None:
    # The first parameter of LoRARequest is a human identifiable name,
    # the second parameter is a globally unique ID for the adapter
    # and the third parameter is the path to the LoRA adapter.
    self.lora_request = LoRARequest("default", 1, model_args.adapter_name_or_path[0])
else:
    self.lora_request = None
# Start the generation.
results_generator = self.model.generate(
    prompt=None,
    sampling_params=sampling_params,
    request_id=request_id,
    prompt_token_ids=prompt_ids,
    # LoRA request to use for generation, if any.
    lora_request=self.lora_request,
    multi_modal_data=None,
)
# Get the results.
final_output = None
async for request_output in results_generator:
    final_output = request_output
results = []
for output in final_output.outputs:
    results.append(
        Response(
            response_text=output.text,
            response_length=len(output.token_ids),
            prompt_length=len(final_output.prompt_token_ids),
            finish_reason=output.finish_reason,
        )
    )
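In practice this engine is driven through LLaMA-Factory's `ChatModel` wrapper rather than instantiated by hand. The sketch below is an assumption about how the 0.7.x `ChatModel` interface is called, not an excerpt from the reference code; the arguments mirror the launch script above and the adapter path is a placeholder.
# Minimal sketch: exercise VllmEngine through LLaMA-Factory's ChatModel wrapper.
# Arguments mirror the launch script above; the adapter path is a placeholder.
from llamafactory.chat import ChatModel

chat_model = ChatModel(dict(
    model_name_or_path="meta-llama/Meta-Llama-3-8B-Instruct",
    adapter_name_or_path="saves/xxx/checkpoint-yyy",
    template="llama3",
    finetuning_type="lora",
    infer_backend="vllm",
    vllm_enforce_eager=True,
    vllm_maxlen=16000,
    vllm_max_lora_rank=64,
))
responses = chat_model.chat(messages=[{"role": "user", "content": "Hello!"}])
print(responses[0].response_text)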