llamafile

llamafile is a Mozilla Builders project that allows you to distribute and run LLMs with a single file.

llamafile embeds a llama.cpp server and provides an OpenAI API-compatible chat-completions endpoint, so you can use the openai/chat, llama.cpp/completion, and llama.cpp/embedding model kinds.

By default, llamafile uses port 8080, which is also used by Tabby. Therefore, it is recommended to run llamafile with the --port option to serve on a different port, such as 8081.
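For example, you can start the chat/completion instance on port 8081 like this (the filename below is only a placeholder for whichever llamafile you downloaded):

./your-model.llamafile --port 8081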

For embeddings, the embedding endpoint is no longer supported by the standard llamafile server, so you need to run a separate llamafile instance with the --embedding and --port options.
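For example, a second instance dedicated to embeddings can be started on its own port, matching the configuration below (again, the filename is a placeholder):

./your-embedding-model.llamafile --embedding --port 8082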

Below is an example configuration:

~/.tabby/config.toml
# Chat model
[model.chat.http]
kind = "openai/chat" # llamafile uses openai/chat kind
model_name = "your_model"
api_endpoint = "http://localhost:8081/v1" # Please make sure the endpoint ends with the `/v1` suffix
api_key = ""

# Completion model
[model.completion.http]
kind = "llama.cpp/completion" # llamafile uses llama.cpp/completion kind
model_name = "your_model"
api_endpoint = "http://localhost:8081" # DO NOT append the `/v1` suffix
api_key = ""
prompt_template = "<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>" # Example prompt template for the Qwen2.5 Coder model series.

# Embedding model
[model.embedding.http]
kind = "llama.cpp/embedding" # llamafile uses llama.cpp/embedding kind
model_name = "your_model"
api_endpoint = "http://localhost:8082" # DO NOT append the `/v1` suffix
api_key = ""
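
After saving the configuration, restart Tabby so it picks up the new HTTP model settings (for a local installation, this typically means re-running `tabby serve`).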