llama.cpp
llama.cpp is a popular C++ library for serving GGUF-based models. It provides a server implementation that supports completion, chat, and embedding functionality through HTTP APIs.
Chat model
llama.cpp provides an OpenAI-compatible chat API interface.
~/.tabby/config.toml
[model.chat.http]
kind = "openai/chat"
api_endpoint = "http://localhost:8888"
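With this configuration, Tabby sends chat requests to the server's OpenAI-compatible chat endpoint. As a quick sanity check that the endpoint is reachable, you can issue a request directly; the sketch below is an assumption-based example, not part of the official docs, and assumes a llama.cpp server (llama-server) listening on port 8888 and serving a single model (the model field is typically ignored in that case).
# Minimal sketch: verify the OpenAI-compatible chat endpoint Tabby will use.
# Assumes llama-server is listening on http://localhost:8888.
import requests

resp = requests.post(
    "http://localhost:8888/v1/chat/completions",
    json={
        "model": "default",  # usually ignored by a single-model llama.cpp server
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.2,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])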
Completion model
llama.cpp offers a specialized completion API interface for code completion tasks.
~/.tabby/config.toml
[model.completion.http]
kind = "llama.cpp/completion"
api_endpoint = "http://localhost:8888"
prompt_template = "<PRE> {prefix} <SUF>{suffix} <MID>" # Example prompt template for the CodeLlama model series.
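The prompt_template follows the model's fill-in-the-middle convention: the code before the cursor is substituted for {prefix} and the code after it for {suffix}, and the assembled prompt is sent to llama.cpp's /completion endpoint. The sketch below illustrates that substitution under the assumptions of the CodeLlama-style template above and a server on port 8888; it is an illustrative example, not Tabby's internal code.
# Minimal sketch of assembling a fill-in-the-middle prompt from the template.
# Assumes the CodeLlama template above and a llama.cpp server on http://localhost:8888.
import requests

template = "<PRE> {prefix} <SUF>{suffix} <MID>"
prefix = "def add(a, b):\n    return "
suffix = "\n\nprint(add(1, 2))"
prompt = template.format(prefix=prefix, suffix=suffix)

resp = requests.post(
    "http://localhost:8888/completion",
    json={"prompt": prompt, "n_predict": 32, "temperature": 0.1},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["content"])  # the generated middle section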
Embeddings model
llama.cpp provides embedding functionality through its HTTP API.
The llama.cpp embedding API interface and response format changed in version b4356. Therefore, we provide two different kinds to accommodate both variants of the llama.cpp embedding interface.
For version b4356 and later:
~/.tabby/config.toml
[model.embedding.http]
kind = "llama.cpp/embedding"
api_endpoint = "http://localhost:8888"
For versions prior to b4356:
~/.tabby/config.toml
[model.embedding.http]
kind = "llama.cpp/before_b4356_embedding"
api_endpoint = "http://localhost:8888"
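In both cases Tabby retrieves embeddings over HTTP; the two kinds differ only in the request and response shape expected from llama.cpp. As a rough sanity check for a b4356-or-later server, the OpenAI-style embeddings endpoint can be queried directly; the sketch below is an assumption-based example and presumes the server was started with embeddings enabled on port 8888.
# Minimal sketch: query the embeddings endpoint of a llama.cpp server (b4356 or later).
# Assumes llama-server runs with embeddings enabled on http://localhost:8888.
import requests

resp = requests.post(
    "http://localhost:8888/v1/embeddings",
    json={"input": "fn main() { println!(\"hello\"); }"},
    timeout=60,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))  # dimensionality of the embedding vector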