vLLM
vLLM is a fast and user-friendly library for LLM inference and serving. It provides an OpenAI-compatible server, so the standard OpenAI kinds can be used for chat and embedding models, while completions require a dedicated vLLM-specific kind.
Important requirements for all model types:
- model_name must exactly match the one used to run vLLM
- api_endpoint should follow the format http://host:port/v1
- api_key should be identical to the one used to run vLLM
Please note that models differ in their capabilities: some support only completion or only chat, while others can serve both purposes. For detailed information, please refer to the Model Registry.
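To confirm these values before editing config.toml, you can query the running vLLM server directly. The following is a minimal sketch using the official openai Python client; the endpoint, API key, and model name are placeholders and must match whatever you used when starting vLLM.

# Sanity check: list the models served by the vLLM OpenAI-compatible server.
# base_url and api_key are placeholders; use the same values as in config.toml.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # api_endpoint in config.toml
    api_key="your-api-key",               # api_key in config.toml
)

# The returned ids are the values that model_name must match exactly.
for model in client.models.list():
    print(model.id)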
Chat model
vLLM provides an OpenAI-compatible chat API interface.
~/.tabby/config.toml
[model.chat.http]
kind = "openai/chat"
model_name = "your_model"   # Please make sure to use a chat model
api_endpoint = "http://localhost:8000/v1"
api_key = "your-api-key"
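Because this is a standard OpenAI-compatible chat endpoint, you can exercise the same model outside Tabby. Below is a minimal sketch with the openai Python client, reusing the placeholder endpoint, key, and model name from the configuration above.

# Send a chat request to vLLM's OpenAI-compatible chat API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-api-key")

response = client.chat.completions.create(
    model="your_model",  # must match model_name in config.toml
    messages=[{"role": "user", "content": "Write a hello-world function in Python."}],
)
print(response.choices[0].message.content)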
Completion model
Due to differences in the completion API implementation, completion models use the dedicated vllm/completion kind and require a prompt_template specific to the model being used.
~/.tabby/config.toml
[model.completion.http]
kind = "vllm/completion"
model_name = "your_model"  # Please make sure to use a completion model
api_endpoint = "http://localhost:8000/v1"
api_key = "your-api-key"
prompt_template = "<PRE> {prefix} <SUF>{suffix} <MID>"  # Example prompt template for the CodeLlama model series
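The prompt_template is a fill-in-the-middle template: the code before and after the cursor is substituted into {prefix} and {suffix} before the prompt is sent. The sketch below illustrates that substitution against vLLM's completion endpoint using the openai Python client; the CodeLlama-style template, endpoint, key, and model name are placeholders, and other model families use different FIM tokens.

# Illustrate how the prompt_template placeholders are filled for a FIM request.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-api-key")

prompt_template = "<PRE> {prefix} <SUF>{suffix} <MID>"   # CodeLlama-style template
prefix = "def fib(n):\n    "                             # code before the cursor
suffix = "\n    return fib(n - 1) + fib(n - 2)"          # code after the cursor

response = client.completions.create(
    model="your_model",  # must match model_name in config.toml
    prompt=prompt_template.format(prefix=prefix, suffix=suffix),
    max_tokens=64,
)
print(response.choices[0].text)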
Embeddings model
vLLM provides an OpenAI-compatible embeddings API interface.
~/.tabby/config.toml
[model.embedding.http]
kind = "openai/embedding"
model_name = "your_model"
api_endpoint = "http://localhost:8000/v1"
api_key = "your-api-key"
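As with chat, the embeddings endpoint follows the OpenAI API, so it can be verified independently. A minimal sketch with the openai Python client, assuming the same placeholder endpoint, key, and model name:

# Request an embedding from vLLM's OpenAI-compatible embeddings API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-api-key")

response = client.embeddings.create(
    model="your_model",  # must match model_name in config.toml
    input=["fn main() { println!(\"hello\"); }"],
)
print(len(response.data[0].embedding))  # dimensionality of the returned vector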