Preface
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graph
Quantization: GPTQ, AWQ, INT4, INT8, and FP8
Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
Speculative decoding
Chunked prefill
vLLM is flexible and easy to use with:
Seamless integration with popular Hugging Face models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more (see the sketch after this list)
Tensor parallelism and pipeline parallelism support for distributed inference
Streaming outputs
An OpenAI-compatible API server
Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, Intel Gaudi® accelerators, PowerPC CPUs, TPU, and AWS Trainium and Inferentia accelerators
Prefix caching support
Multi-LoRA support
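As a quick taste of the decoding flexibility above, here is a minimal sketch of parallel sampling with the offline API: SamplingParams(n=3) asks vLLM for three independent completions per prompt in a single call. The model name simply matches the one used later in this post; any small instruct model would do.
from vllm import LLM, SamplingParams

# Parallel sampling: n=3 returns three independent completions per prompt.
llm = LLM(model="Qwen/Qwen2.5-Coder-0.5B-Instruct")
params = SamplingParams(n=3, temperature=0.8, max_tokens=64)
result = llm.generate(["The capital of France is"], params)[0]
for i, candidate in enumerate(result.outputs):  # one entry per sample
    print(f"sample {i}: {candidate.text!r}")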
Operating System: Ubuntu 22.04.4 LTS
References
- Welcome to vLLM!
- vllm-project/vllm
- Quickstart
- Distributed Inference and Serving
- OpenAI Chat Completion Client
Offline Batched Inference
# pip install vllm
from vllm import LLM, SamplingParams

# Prompts to run as a single offline batch.
prompts = [
    "write a quick sort algorithm.",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Sampling configuration shared by all prompts.
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=2048)
# Load the model; weights are downloaded from Hugging Face on first use.
llm = LLM(model="Qwen/Qwen2.5-Coder-0.5B-Instruct")
# Generate completions for the whole batch at once.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
OpenAI-Compatible Server
"""
$ pip install vllm
$ vllm serve Qwen/Qwen2.5-Coder-0.5B-Instruct --tensor-parallel-size 8
$ curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Coder-0.5B-Instruct",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Coder-0.5B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"}
]
}'
"""
from openai import OpenAI

# vLLM's server does not check API keys by default, so any placeholder works.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="Qwen/Qwen2.5-Coder-0.5B-Instruct",
    prompt="write a quick sort algorithm in Python.",
    max_tokens=256,
)
print(f"Completion result:\n{completion}")
print(f"{'-'*42}")
print(f"response:\n{completion.choices[0].text}")
print(f"{'-'*42}{'qwq'}{'-'*42}")
chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-0.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "write a quick sort algorithm in Python."},
    ],
)
print(f"Chat response:\n{chat_response}")
print(f"{'-'*42}")
print(chat_response.choices[0].message.content)
print(f"{'-'*42}{'qwq'}{'-'*42}")
models = client.models.list()
model = models.data[0].id  # use the first model the server advertises
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"},
    ],
    model=model,
)
print(f"Chat completion results:\n{chat_completion}")
print(f"{'-'*42}")
print(chat_completion.choices[0].message.content)
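The streaming output listed in the preface works through the same OpenAI client: passing stream=True makes the server send chunks as tokens are generated instead of one final response. A minimal sketch, reusing the client and model from above:
# Streaming: iterate over chunks as the server produces tokens.
stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content  # None for role-only chunks
    if delta:
        print(delta, end="", flush=True)
print()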
Closing Remarks
My 242nd blog post is finished. So happy!!!!
Today is another day full of hope.