
· 5 min read

Excited to share that Tabby Agent is now available on npm and supports running as a language server 🎉. This new feature provides a uniform protocol to easily integrate Tabby into different text editors. Let's dive deeper together to unfold the stories behind it!

What is Tabby Agent​

Tabby Agent is a Node.js package that communicates with the Tabby server and implements several essential features for code completion, including:

  • Debouncing: Tabby Agent handles code completion requests by implementing an appropriate debouncing mechanism. This ensures that requests are not sent too frequently, reducing server load and improving performance, as the automatic inline code completion often listens for text input which can be very frequent.
  • Caching: Tabby Agent prevents redundant completion requests with KV caching. When a completion is dismissed but requested again at the same location, the cached content is used directly. Cached completions are also matched when the prefix of a request aligns with a previously cached completion, so no additional server request is needed. This is especially useful when users type the same text as the ghost text suggestions.
  • Post-processing: Tabby Agent enhances completion results through post-processing steps, including filtering out low-quality completions, removing duplicate suggestions, and limiting the length of suggestions to the focused scope. All of these steps help users focus on the most relevant suggestions.
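To make the caching behavior described above more concrete, here is a minimal sketch in Python. It is not Tabby Agent's actual code (the agent is a Node.js package), and the class name `CompletionCache` and its methods are hypothetical. It only illustrates the two lookups mentioned: an exact hit at the same location, and a prefix hit when the user has typed the beginning of a cached suggestion.

```python
from typing import Optional


class CompletionCache:
    """Toy key-value cache keyed by the text before the cursor (hypothetical)."""

    def __init__(self) -> None:
        self._entries: dict[str, str] = {}

    def put(self, prefix: str, completion: str) -> None:
        self._entries[prefix] = completion

    def get(self, prefix: str) -> Optional[str]:
        # Exact hit: the same location was requested before.
        if prefix in self._entries:
            return self._entries[prefix]
        # Prefix hit: the user typed the start of a cached suggestion,
        # so the remainder of that suggestion can be reused.
        for cached_prefix, completion in self._entries.items():
            if prefix.startswith(cached_prefix):
                typed = prefix[len(cached_prefix):]
                if typed and completion.startswith(typed):
                    return completion[len(typed):]
        # Miss: a real client would now send a request to the server.
        return None


cache = CompletionCache()
cache.put("def add(a, b):\n    ", "return a + b")
print(cache.get("def add(a, b):\n    "))         # exact hit
print(cache.get("def add(a, b):\n    return "))  # user typed part of the ghost text
print(cache.get("def add(a, b):\n    raise "))   # miss
```

A production cache would also bound its size and evict old entries; that is left out here to keep the idea visible.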

These features were initially developed as part of the Tabby VSCode extension. Now it's desirable to reuse the client code logic as Tabby expands to more text editors. Therefore, we are building Tabby Agent as a standalone Node.js package that can be used by other editors to communicate with Tabby.


Why Language Server​

Tabby Agent previously utilized a customized protocol based on JSON Lines, designed to be compatible with VIM's JSON mode channel. However, this protocol was not widely adopted, making it hard to integrate Tabby Agent into different editors. With a more universal protocol, we can offer a more flexible and streamlined experience in creating Tabby plugins for various editors.

The Language Server Protocol defines a standardized protocol for communication between a language server and its clients. It provides methods to implement a wide range of features, including code completion.


Running Tabby as a Language Server provides code completion with the standard textDocument/completion protocol. It can suggest code completions based on the context of the code, whether it's a line or a block, rather than just a single word.
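For readers who have not seen LSP traffic before, the sketch below shows what a textDocument/completion request looks like on the wire. It relies only on the LSP base protocol (a Content-Length header followed by a JSON-RPC body). The file URI and cursor position are made-up values, and a real client would first send initialize and textDocument/didOpen before asking for completions.

```python
import json


def encode_lsp_message(payload: dict) -> bytes:
    """Frame a JSON-RPC payload with the Content-Length header used by LSP."""
    body = json.dumps(payload).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


# Hypothetical request: the URI and position below are illustrative only.
completion_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/completion",
    "params": {
        "textDocument": {"uri": "file:///tmp/example.py"},
        # LSP positions are zero-based (line, character).
        "position": {"line": 2, "character": 4},
    },
}

print(encode_lsp_message(completion_request).decode("utf-8"))
```

Bytes framed this way are what an editor would write to the language server's stdin when it is started in stdio mode.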

I'm also looking forward to the proposed textDocument/inlineCompletion feature in the upcoming version 3.18 of the LSP Specification. It will provide better support for multi-line code completions. Stay tuned for more updates on this topic in the future!

Running Tabby as a Language Server​

To run Tabby as a language server, follow these steps:

  1. Set up your Tabby server following this documentation.

  2. Make sure you have Node.js version 18 or above installed on your system.

  3. Run the following command in your terminal:

    npx tabby-agent --lsp --stdio

    Follow the instructions displayed in the console. Once the installation is complete, the Tabby agent will start listening for requests on StdIO. If there are no error messages, you can assume that the Tabby Agent script is running correctly. You can stop it by pressing Ctrl+C.

    Alternatively, you can install tabby-agent as a global package and run it by using the following command:

    # Install tabby-agent as a global package
    npm install --global tabby-agent
    # Run the agent as a language server
    tabby-agent --lsp --stdio
    # Press `Ctrl+C` to stop the agent
  4. You can configure the agent's settings by editing the config file located at ~/.tabby-client/agent/config.toml. If your Tabby server uses a different port or requires authentication, modify these settings accordingly:

    endpoint = "" # Replace with your server's endpoint
    token = "your_token"

    For more details on configuration options, refer to this documentation.

Connect Your Editor to Tabby​

Most text editors support built-in LSP clients or popular LSP client plugins, making it easy to connect them to the Tabby agent language server. Let's take NeoVim and coc.nvim as an example to show you how to configure your editor to connect to Tabby.

  1. Install coc.nvim by following the guide

  2. Start NeoVim, and use the :CocConfig command to open the configuration file. Add the following configuration:

    {
      "languageserver": {
        "tabby-agent": {
          "command": "npx",
          "args": ["tabby-agent", "--lsp", "--stdio"],
          "filetypes": ["*"]
        }
      }
    }

    The "filetypes": ["*"] setting enables Tabby for all filetypes. You can modify it according to your needs.

  3. Save the configuration file, and restart NeoVim.

  4. Open a file and start typing code to see code completion suggestions from Tabby.


For more examples of connecting Tabby to other editors, refer to the Tabby Agent documentation. If you have configurations for your favorite editors that you'd like to share, feel free to submit a pull request!

Create a Plugin for a New Editor​

In the previous examples, Tabby completions are displayed in the dropdown completion list. However, this method may not be very convenient for displaying multi-line code completions. As most LSP clients do not yet support inline completion, you may want to create a plugin for an editor that provides inline completion. To demonstrate how to communicate with Tabby via LSP, we have provided an example project here.

Please note that language server support is still in its early stages, and your feedback will be invaluable in making it even better. If you have any ideas or suggestions, feel free to create an issue or join our Slack community.

Happy coding with Tabby! 🐱💻

· 2 min read

Tabby's ROCm support is currently only in our nightly builds. It will become stable in version 0.9.

For those using (compatible) AMD graphics cards, you can now run Tabby locally with GPU acceleration using AMD's ROCm toolkit! 🎉

ROCm is AMD's equivalent of NVIDIA's CUDA library, making it possible to run highly parallelized computations on the GPU. ROCm is open source and supports using multiple GPUs at the same time to perform the same computation.

Currently, Tabby with ROCm is only supported on Linux, and can only be run directly from a compiled binary. In the future, Tabby will be able to run with ROCm on Windows, and we will distribute a Docker container capable of running with ROCm on any platform.

Install ROCm​

Before starting, please make sure you are on a supported system and have ROCm installed. The AMD website details how to install it; find the instructions for your platform. Shown below is a successful installation of ROCm packages on Arch Linux.

ROCm installed on Arch Linux

Deploy Tabby with ROCm from Docker​

Once you've installed ROCm, you're ready to start using Tabby! Simply use the following command to run the container with GPU passthrough:

docker run \
--device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video \
-p 8080:8080 -v $HOME/.tabby:/data \
tabbyml/tabby-rocm \
serve --device rocm --model StarCoder-1B

The command output should look similar to the below:

Tabby running inside Docker

Build Tabby with ROCm locally​

If you would rather run Tabby directly on your machine, you can compile Tabby yourself. If compiling yourself, make sure to use the flag --features rocm to enable it.

Once you have a compiled binary, you can run it with this command:

./tabby serve --model TabbyML/StarCoder-1B --device rocm

If the command is used correctly and the environment is configured properly, you should see command output similar to the following:
Tabby running
And enjoy GPU-accelerated code completions! This should be considerably faster than with CPU (I saw a ~5x speedup with StarCoder-1B using a Ryzen 7 5800X and an RX 6950XT).

Completions demo

· 2 min read

In our previous post on Cracking the Coding Evaluation, we shed light on the limitations of relying on HumanEval pass@1 as a code completion benchmark. In response, we've launched the Coding LLMs Leaderboard, embracing Next Line Accuracy as a metric inspired by academic works such as RepoCoder, RepoBench, and CCEval.


But what exactly is line accuracy? In code completion, a model typically predicts a block of code spanning multiple lines. A naive approach would be to compare the predicted block directly with the code that was actually committed. While this approach might seem ideal, it is often considered "too sparse" to be a revealing metric. On the other hand, next-line accuracy serves as a dependable proxy for overall block match accuracy.

Next Line Accuracy

Only the content inside the red box is compared with the ground truth to compute the accuracy metric.

CCEval utilizes the next statement, but based on our observations, it strongly correlates with next line exact match. Therefore, we've opted for next line accuracy due to its ease of implementation across languages, eliminating the need for language-specific Tree Sitter parsers.
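As a rough illustration of the metric, the Python sketch below computes next-line exact match over a list of predictions. The normalization used here (skipping blank lines and stripping surrounding whitespace) is our assumption for the example; the leaderboard's exact rules are not spelled out in this post, and the function names are hypothetical.

```python
def first_nonempty_line(text: str) -> str:
    """Return the first non-blank line of a block, stripped of surrounding whitespace."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def next_line_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of examples whose first predicted line exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        first_nonempty_line(p) == first_nonempty_line(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(predictions)


predictions = [
    "return a + b\n\ndef sub(a, b):\n    return a - b",  # first line matches
    "for i in range(n):\n    total += i",                # first line differs
]
references = [
    "return a + b",
    "for i in range(10):\n    total += i",
]
print(next_line_accuracy(predictions, references))  # 0.5
```

Because only the first line is compared, no language-specific parser is needed, which is the ease-of-implementation point made above.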

For data preparation, our initial release exclusively leverages the dataset from CCEval. This dataset provides well-structured left context, right context, cross-file context with BM25, and oracle information.

At present, evaluation is limited to prefix text + cross-file context. Our future plans involve more in-depth analyses:

  1. Comparing accuracy in completing a function's argument list.
  2. Computing accuracy in completing a function's docstring.

We genuinely believe that this leaderboard can assist Tabby's users in navigating the tradeoff between service cost, quality, and other factors. We are committed to enhancing and refining this leaderboard in the future.

· 7 min read

Tabby offers an open-source alternative solution to GitHub Copilot with easy setup and self-host options. We embrace an open ecosystem to support major open source coding LLMs (e.g. StarCoder, CodeLlama, WizardCoder, etc.), and enable easy integration of proprietary models. In addition, Tabby performs retrieval-augmented code completion to suggest code from your private codebase. We firmly believe in the continuous advancement in open source coding LLMs, yet we need quantitative measurements to guide the direction of product improvement, and help developers decide their model of choice.

Evaluating coding LLMs has also been a hot topic in academia. Many different metrics targeting different coding tasks have been proposed over the past year. At Tabby, we prioritize metrics that best resemble real-world development workflows, and of course, the metrics should be constructed with non-biased data sources. In this blog post, we will discuss our thoughts on desired code completion benchmarks, and also review the latest academic progress in this area.

Existing Paradigms

Existing coding LLM benchmarks mostly focus on the Pass@k metric: generating k code samples and measuring how often the results successfully pass the given unit tests. OpenAI initially introduced this metric in Evaluating Large Language Models Trained on Code in July 2021, along with the release of the HumanEval benchmark dataset.
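For reference, pass@k is usually not computed by literally drawing k samples. A common approach is the unbiased estimator from the paper cited above: draw n ≥ k samples per problem, count the c samples that pass, and compute 1 - C(n-c, k) / C(n, k). The Python sketch below implements that formula for a single problem; the function name is ours, and a benchmark would average it over all problems.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them pass, budget of k."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 of which pass the unit tests.
print(round(pass_at_k(n=10, c=3, k=1), 3))
print(round(pass_at_k(n=10, c=3, k=5), 3))
```

With k=1 the estimate reduces to the plain pass rate c/n, which is the pass@1 number most leaderboards report.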

🤖 HumanEval

HumanEval is a hand-crafted dataset, consisting of 164 Python programming problems with unit tests. An example task looks like:

from typing import List

def below_zero(operations: List[int]) -> bool:
    """ You're given a list of deposit and withdrawal operations on a bank account that starts with
    zero balance. Your task is to detect if at any point the balance of account fallls below zero, and
    at that point function should return True. Otherwise it should return False.
    >>> below_zero([1, 2, 3])
    False
    >>> below_zero([1, 2, -4, 5])
    True
    """


HumanEval was a pioneering research effort, but now suffers from some unfortunate drawbacks:

  1. Data is likely contaminated. HumanEval dataset has been around for over two years and it has been discussed and documented widely online. The latest coding LLMs are likely to have included its test data in training data crawling, which would make the evaluation no longer valid.

  2. Trivial coding questions that don't mimic real engineering setups. HumanEval includes mostly LeetCode-style interview questions, each consisting of a single function for the LLM to fill in the body. In a more realistic corporate setup, developers often add code in multiple files in a single PR, and constantly refer to functions implemented in other files. These are more interesting yet challenging tasks for LLMs to perform, but are critical scenarios for AI coding assistants to land in enterprises.

  3. Unit tests are too weak. Researchers noticed that test cases in HumanEval tasks (on average 7.7 tests per problem) aren't enough to guarantee the correctness of the generated code (e.g. a wrong implementation could still pass all existing tests), and thus augmented test cases in HumanEval benchmark by 80x in HumanEvalPlus.


  4. Limited coverage of programming languages. This one is obvious as HumanEval only includes Python code. We ❤️ all programming languages!

🧩 Mostly Basic Programming Problems (MBPP)​

MBPP is another popular benchmark for code generation. Researchers from Google introduced it in the paper Program Synthesis with Large Language Models in August 2021, one month after the release of HumanEval. It contains 974 entry-level Python (as the name clearly suggests) programming tasks. An example looks like:

Write a python function to remove first and last occurrence of a given character from the string.

assert remove_Occ("hello","l") == "heo"
assert remove_Occ("abcda","a") == "bcd"
assert remove_Occ("PHP","P") == "H"


Unlike HumanEval, MBPP targets basic tasks commonly encountered by engineers, such as string manipulation, simple arithmetic, and basic data structure operations. However, it still faces drawbacks similar to those of HumanEval mentioned above.

What are we looking for in coding LLM evaluations?

🔬 Scientific and Relevant Setup

The top thing on our mind is the metric setup. As mentioned above, most existing coding LLM evaluations focus on function-level code generation: given a docstring or a function signature at most, the LLM is expected to generate the entire function body.

Here is what we think a trustworthy evaluation setup should cover:

  1. Non-trivial code. Definitely no more LeetCode-style coding questions! The ideal evaluation should target projects with substantial engineering complexity. Evidence such as lines of code, number of files, or number of contributors could serve as good indicators of code complexity.

  2. Cross-file references. This is a key factor that differentiates a more reliable and practical evaluation from something that only scratches the surface of the coding world. Engineers do not code in silos, but are greatly encouraged to reuse functions or APIs implemented in the existing codebase.

  3. Code completion. Code completion is the most widely adopted LLM-powered feature in developer tools. Millions of developers worldwide have employed AI code completions in their daily workflow. Tabby provides a low-barrier solution in code completion, and is committed to continue to improve the end-to-end product quality.

⚖️ Ease and Low-Cost to Run

The ease and cost of running evaluations are directly correlated with the number of models we can evaluate and how often we can afford to update the results (in the case of refreshed evaluation data, for example). There are efforts to leverage crowdsourcing to rate the quality of LLM responses (e.g. Glaive arena), which excel at collecting high-quality human feedback and provide valuable insights into user behavior. However, crowdsourced ratings are harder to scale and take longer to produce results. We are iterating quickly on Tabby, and decided that scalability and ease are critical to us now.

🔍 Data Quality and Inclusion

The data quality is critical to maintain the legitimacy of such evaluation. Here's what's important for evaluation data:

  1. Train/Eval Data Split. It's one of the most important concepts in your Machine Learning 101 course. Yet it is so basic that folks often neglect the challenge of maintaining it in real-world applications over time. For example, HumanEval started as a manually drafted dataset to firmly ensure the data separation. Nevertheless, over time, it still faces data contamination issues.

  2. Evaluation Quality. HumanEvalPlus mentioned above is a great example of this. Understanding the quality of the evaluation is important for developing a fair sense of the true model performance. We also encourage continuous efforts in improving evaluation quality! 💪🏻

  3. Data Inclusion / Coverage. In the case of coding, inclusion includes efforts like increasing the support of different programming languages. In practice, choosing the reasonable ratio of each programming language is also tricky yet important.

Highlights of recent research innovations​

In this section, we showcase a few recent research works from academia toward building reliable and sound evaluations for coding LLMs. Following these papers, we observe a growing emphasis on evaluating coding LLMs with repository-level context, which indeed aligns with what we have been looking for.

🗂️ CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

The CrossCodeEval benchmark specifically aims to address the gap that "existing code completion datasets such as HumanEval and MBPP mostly focus on single-file tasks, neglecting the real-world complexity of multi-file software projects". To achieve this goal, CrossCodeEval uses a static-analysis-based method to strictly require cross-file context for accurate code completion. Experiments show that cross-file context improves end-to-end system performance (LLM + code retriever), yet there's still a lot of room to improve.


🧪 RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

RepoBench also recognizes that current benchmarks primarily focus on single-file tasks, which creates a gap in assessing these systems in more complex, real-world, multi-file programming scenarios. Therefore, RepoBench introduces three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline) to measure the quality of each module and also the end-to-end system.


💾 RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

RepoCoder presents an innovative approach that combines a similarity-based retriever and LLM prediction into an iterative retrieval-generation pipeline. To demonstrate the effectiveness of this method, the authors also introduced RepoEval, covering scenarios like line, API invocation, and function body completion from high-quality real-world repositories.


· 4 min read

In the context of the Transformer model, which is widely used across LLMs, decoding refers to the process of generating an output sequence from an encoded input. Tabby recently implemented incremental decoding as part of the greedy search. This blog will explain our thoughts behind this 🛠️💡.

Common Decoding Methods​

Here's an example to facilitate understanding different decoding methods:

Let's say a developer starts to write a list comprehension in Python to get even numbers from a list:

numbers = [1, 2, 3, 4, 5]
evens = [x for x in numbers

To simplify the scenario, we assume that the language model maintains a probability distribution as shown below,


Here's how different decoding methods might suggest 🔍:

Beam Search 🌈

Beam search maintains multiple possible sequences (beams) of active candidates at each time step. By increasing the beam size, the decoding performance can increase at the expense of higher computation cost.

Assuming num_beams=2, in this case, it'll produce if x % 2, as 0.4 * 0.7 = 0.28 gives the highest probability when considering a sequence of 2.


Greedy Decoding 🏆

Greedy decoding selects the most probable next token at each step, which is the most intuitive method but can sometimes lead to sub-optimal sequences. This is because it only considers one token at a time and makes each choice greedily.

In this particular case, greedy decoding would complete the code with ]\n print, as each of these tokens has the maximum probability given the previously chosen token.
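The difference between the two methods can be reproduced with a toy next-token table. In the Python sketch below, only the values 0.4 and 0.7 come from the text above; all other probabilities, and the way the completion is split into tokens, are assumptions chosen so that both methods behave as described.

```python
# Toy next-token table keyed by the tokens generated so far.
# Only 0.4 and 0.7 come from the post; the other numbers are assumptions.
TABLE = {
    (): {"]": 0.5, " if": 0.4, " for": 0.1},
    ("]",): {"\n print": 0.5, "\n": 0.3, " #": 0.2},
    (" if",): {" x % 2": 0.7, " x > 0": 0.3},
    (" for",): {" y": 1.0},
}


def greedy(table, steps):
    """Pick the single most probable token at every step."""
    seq, prob = (), 1.0
    for _ in range(steps):
        options = table[seq]
        token = max(options, key=options.get)
        prob *= options[token]
        seq += (token,)
    return seq, prob


def beam_search(table, steps, num_beams):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [((), 1.0)]
    for _ in range(steps):
        candidates = [
            (seq + (token,), prob * p)
            for seq, prob in beams
            for token, p in table[seq].items()
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


for name, (seq, prob) in {
    "greedy": greedy(TABLE, steps=2),
    "beam": beam_search(TABLE, steps=2, num_beams=2),
}.items():
    print(f"{name}: {''.join(seq)!r} with probability {prob:.2f}")
```

Under these assumed numbers, greedy decoding commits to ] first and ends with a sequence probability of 0.25, while beam search keeps the if branch alive and finds the 0.28 sequence.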


Sampling-based methods 🎲

The two methods above always produce deterministic results given the language model probability distribution. Often times this isn't the ideal case, especially in conversational scenarios where users often retry to expect an alternative answer (or think about language translation). Alternatively, sampling-based methods like random sampling, top-k, and top-p sampling introduce randomness to achieve diverse outputs.

However, since this is now a non-deterministic approach, the models can sometimes generate incoherent gibberish. There are many different sampling methods that sharpen the distribution or redistribute the probability mass to ensure a higher chance of generating meaningful results. Here we also want to emphasize that in practical implementations, sampling-based methods are often applied on top of beam search or greedy decoding to combine the best of both worlds.
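As a small illustration of how sampling methods reshape the distribution before drawing a token, here is a Python sketch of top-k and top-p (nucleus) filtering. The toy distribution and function names are assumptions for the example; real implementations operate on logits over the full vocabulary.

```python
import random


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches top_p."""
    kept, mass = [], 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token according to its (already filtered) probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Assumed toy distribution over the next token.
probs = {"]": 0.5, " if": 0.4, " for": 0.08, " banana": 0.02}
rng = random.Random(0)  # seeded so repeated runs give the same draws
print(sample(top_k_filter(probs, k=2), rng))
print(sample(top_p_filter(probs, top_p=0.8), rng))
```

Both filters remove the unlikely " banana" token here, which is how they reduce the chance of incoherent output while still leaving room for variation.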

Era of Streaming for LLM​

Latency is key in user experience for many LLM applications. In order to minimize the idle time for users, streaming response is commonly adopted. In LLM streaming, we start decoding the response as soon as it's available, instead of waiting for the entire response to be returned.

Considering the streaming process in LLM decoding, although greedy decoding often produces sub-optimal results compared to beam search or sampling-based methods, it wins with its fast and parallelizable computation. Most LLM applications today (e.g. ChatGPT, Bard, Anthropic, etc.) have adopted greedy decoding with certain sampling strategies and carefully tuned them for different tasks: creative tasks such as chatbots or article writing receive diverse responses from sampling; input-grounded tasks such as translation or coding benefit from greedy decoding to get the immediate "correct" result. (Indeed, ⌨️ coding tasks emphasize consistency with the given context, the lines of code you just wrote, more than variation among possible responses. 😆)

Incremental Decoding ⏩​

However, often times decoding a sequence of tokens one-by-one without considering previous decoded results could produce undesired results. For example,

Decoding first token:                   ......, 211       ->   "......[ llo]"
Independently decoding the next token:  ......, 207, 211  ->   "......[ he][ llo]"

In the case above, the final decoded string would be " he llo" with an awkward space in between. To resolve issues like this, we could cache the already-decoded prefix and append it to the current token to decode together. It is the core idea of incremental decoding to take the prefix token into consideration for decoding current tokens. With incremental decoding, we get the desired result for the example above:

Incremental decoding:  ......, 207, 211  ->   "......[ hello]"  ✅
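The same idea can be shown with a self-contained toy. The Python sketch below is not Tabby's implementation and uses a different failure mode than the space example above: tokens are raw byte chunks, and a multi-byte UTF-8 character is split across two tokens. Decoding each token independently yields replacement characters, while decoding together with the cached prefix and emitting only the new text yields the right string. The class name and the hold-back rule are assumptions for illustration.

```python
class IncrementalDecoder:
    """Decode tokens together with the cached prefix and emit only the new text."""

    def __init__(self) -> None:
        self._tokens: list[bytes] = []
        self._emitted = ""

    def push(self, token: bytes) -> str:
        self._tokens.append(token)
        # Decode the whole sequence so the prefix is taken into account.
        text = b"".join(self._tokens).decode("utf-8", errors="replace")
        if text.endswith("\ufffd"):
            # The last character is still incomplete: wait for more tokens.
            return ""
        new_text = text[len(self._emitted):]
        self._emitted = text
        return new_text


tokens = [b"caf", b"\xc3", b"\xa9", b"!"]  # "é" is split across two tokens

independent = "".join(t.decode("utf-8", errors="replace") for t in tokens)
decoder = IncrementalDecoder()
incremental = "".join(decoder.push(t) for t in tokens)

print(repr(independent))  # contains two replacement characters
print(repr(incremental))  # 'café!'
```

This version re-decodes the full sequence on every step for clarity; a production decoder would typically keep only a short window of recent tokens.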

For interested folks, you can refer to Tabby's exact implementation in the IncrementalDecoding function in crates/tabby-inference/src/

Have you found our new decoding methods effective? Share your thoughts with us in our Slack channel 🌍😊!

· 6 min read

Using a Large Language Model (LLM) pretrained on coding data proves incredibly useful for "self-contained" coding tasks, like conjuring up a completely new function that operates independently 🚀.

However, employing an LLM for code completion within a vast and intricate pre-existing codebase poses certain challenges 🤔. To tackle this, the LLM needs to comprehend the dependencies and APIs that intricately link its subsystems. We must provide this "repository context" to the LLM when requesting it to complete a snippet.

To be more specific, we should:

  1. Aid LLM in understanding the overall codebase, allowing it to grasp the intricate code with dependencies and generate fresh code that utilizes existing abstractions.

  2. Efficiently convey all of this "code context" in a manner that fits within the context window (~2000 tokens), keeping completion latency reasonably low.

To demonstrate the effectiveness of this approach, below is an example showcasing TabbyML/StarCoder-1B performing code completion within Tabby's own repository.

Completion request
.unwrap_or_else(|err| fatal!("Error happens during serving: {}", err))

fn api_router(args: &ServeArgs) -> Router {
    let index_server = Arc::new(IndexServer::new());
    let completion_state = {
        let (
            engine,
            EngineInfo {
                prompt_template, ..
            },
        ) = create_engine(&args.model, args);
        let engine = Arc::new(engine);
        let state = completions::CompletionState::new(

Without access to the repository context, LLM can only complete snippets based on the current editor window, generating a wrong function call to CompletionState::new.

Without repository context
fn api_router(args: &ServeArgs) -> Router {
let engine = Arc::new(engine);
let state = completions::CompletionState::new(

However, we can include the repository context (specifically, the entire file of crates/tabby/src/serve/) in the prompt:

Prepend to the completion request
// === crates/tabby/serve/ ===
// ......
// ......

We can then generate a snippet that properly calls CompletionState::new (with the second parameter being index_server.clone()).

With repository context
fn api_router(args: &ServeArgs) -> Router {
let engine = Arc::new(engine);
let state = completions::CompletionState::new(

The Problem: Repository Context​

One obvious solution is to pack the whole codebase into LLM with each completion request. Voila✨! LLM has all the context it needs! But alas, this approach falls short for even moderately sized repositories. They're simply too massive to squeeze into the context window, causing a slowdown in inference speed.

A more efficient approach is to be selective, hand-picking the snippets to send. For instance, in the example above, we send the file containing the declaration of the CompletionState::new method. This strategy works like a charm, as illustrated in the example.

However, manually pinpointing the right set of context to transmit to the LLM isn't ideal. Plus, sending entire files is a bulky way to relay code context, wasting the precious context window. The LLM doesn't need a grand tour of the complete codebase, only a robust enough understanding to utilize it effectively. If you continually dispatch multiple files' worth of code just for context, you'll soon hit a wall with the context window limit.

Code snippet to provide context.​

In the v0.3.0 release, we introduced Retrieval Augmented Code Completion, a nifty feature that taps into the repository context to enhance code suggestions. Here's a sneak peek of a snippet we pulled from the repository context:

Snippet from the Repository Context: A Glimpse into the Magic
// Path: crates/tabby/src/serve/
// impl CompletionState {
//     pub fn new(
//         engine: Arc<Box<dyn TextGeneration>>,
//         index_server: Arc<IndexServer>,
//         prompt_template: Option<String>,
//     ) -> Self {
//         Self {
//             engine,
//             prompt_builder: prompt::PromptBuilder::new(prompt_template, Some(index_server)),
//         }
//     }
// }
// Path: crates/tabby/src/serve/
// Router::new()
//     .merge(api_router(args))

By snagging snippets like this, LLM gets to peek into variables, classes, methods, and function signatures scattered throughout the repo. This context allows LLM to tackle a multitude of tasks. For instance, it can cleverly decipher how to utilize APIs exported from a module, all thanks to the snippet defining / invoking that API.

Use tree-sitter to create snippets​

Tabby, under the hood, leverages 🌳 Tree-sitter queries to construct its index. Tree-sitter is capable of scanning source code written in various languages and extracting data about all the symbols defined in each file.

Historically, Tree-sitter was utilized by IDEs or code editors to facilitate the creation of language formatters or syntax highlighters, among other things. However, we're taking a different approach and using Tree-sitter to aid LLM in understanding the codebase.

Here's an example of the output you'll get when you run the following query on Go source code:

Tree-sitter query to collect all type definitions
(type_declaration (type_spec name: (type_identifier) @name)) @definition.type
Snippets captured by the above query
type payload struct {
    Data string `json:"data"`
}

These snippets are then compiled into an efficient token reverse index for use during querying. For each request, we tokenize the text segments and perform a BM25 search in the repository to find relevant snippets. We format these snippets in the line comment style, as illustrated in the example above. This format ensures it doesn't disrupt the existing semantics of the code, making it easy for LLM to understand.
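To give a feel for the retrieval step, here is a small pure-Python sketch of BM25 ranking followed by line-comment formatting. It is a stand-in for illustration, not Tabby's index: the snippets, file paths, tokenizer, and the parameter values k1=1.2 and b=0.75 are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split source text into lowercase identifier-like tokens."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())


def bm25_rank(query: str, documents: list[str], k1: float = 1.2, b: float = 0.75) -> list[int]:
    """Return document indices ordered from most to least relevant."""
    docs = [Counter(tokenize(d)) for d in documents]
    lengths = [sum(d.values()) for d in docs]
    avg_len = sum(lengths) / len(docs)
    terms = set(tokenize(query))
    scores = []
    for tf, length in zip(docs, lengths):
        score = 0.0
        for term in terms:
            if tf[term] == 0:
                continue
            n = sum(1 for d in docs if term in d)  # documents containing the term
            idf = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * length / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)


def as_line_comments(path: str, snippet: str) -> str:
    """Format a snippet as line comments so it does not change code semantics."""
    lines = [f"// Path: {path}"] + [f"// {line}" for line in snippet.splitlines()]
    return "\n".join(lines)


# Hypothetical indexed snippets: (path, code).
snippets = [
    ("src/completions.rs", "impl CompletionState {\n    pub fn new(engine: Engine, index_server: IndexServer) -> Self"),
    ("src/router.rs", "Router::new().merge(api_router(args))"),
    ("src/health.rs", "pub struct HealthState { model: String }"),
]
query = "let state = completions::CompletionState::new("

ranking = bm25_rank(query, [code for _, code in snippets])
best_path, best_code = snippets[ranking[0]]
print(as_line_comments(best_path, best_code))
```

The top-ranked snippet is the one defining CompletionState::new, and the commented block is what would be prepended to the completion prompt.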


The current approach to extracting snippets and performing ranking is relatively simple. We're actively working on various aspects to fully iterate through this approach and elevate its efficiency and effectiveness:

  1. Snippet Indexing: We are aiming to achieve a detailed understanding of what snippets should be incorporated into the index for each programming language. 📚

  2. Retrieval Algorithm: Our focus is on refining the retrieval algorithm using attention weight heatmaps. Ideally, snippets with higher attention weights from Language Models (LLMs) should be prioritized in the retrieval process. ⚙️

We are incredibly enthusiastic about the potential for enhancing the quality and are eager to delve deeper into this exciting development! 🌟

Give it a try​

To use this repository context feature:

  1. Install Tabby.
  2. Navigate to the Configuration page and set up your ~/.tabby/config.toml.
  3. Finally, run tabby scheduler to build an index and unlock the full potential of this innovative feature! 🛠️

· 2 min read

We are excited to announce that TabbyML has raised a $3.2M seed round to move towards our goal of building an open ecosystem to supercharge developer experience with LLM 🎉🎉🎉.

Why Tabby 🐾 ?​

With over 10 years of coding experience, we recognize the transformative potential of LLMs in reshaping developer toolchains. While many existing products lean heavily on cloud-based end-to-end solutions, we firmly believe that for AI to be genuinely the core of every developer's toolkit, the next-gen LLM-enhanced developer tools should embrace an open ecosystem. This approach promotes not just flexibility for easy customization, but also fortifies security.

Today, Tabby stands out as the most popular and user-friendly solution to enable a coding assistant experience fully owned by users. Looking ahead, we are poised to delve even further into the developer lifecycle, and innovate across the full spectrum. At TabbyML, developers aren't just participants; they are at the heart of the LLM revolution.

Release v0.3.0 - Retrieval Augmented Code Completion 🎁

Tabby also comes to a v0.3.0 release, with support for retrieval-augmented code completion enabled by default. Enhanced by repo-level retrieval, Tabby gets smarter about your codebase and will quickly reference a related function / code example from another file in your repository.

A blog series detailing the technical designs of retrieval-augmented code completion will be published soon. Stay tuned! 🔔

Example prompt for retrieval-augmented code completion:

// Path: crates/tabby/src/serve/
// fn create_llama_engine(model_dir: &ModelDir) -> Box<dyn TextGeneration> {
// let options = llama_cpp_bindings::LlamaEngineOptionsBuilder::default()
// .model_path(model_dir.ggml_q8_0_file())
// .tokenizer_path(model_dir.tokenizer_file())
// .build()
// .unwrap();
// Box::new(llama_cpp_bindings::LlamaEngine::create(options))
// }
// Path: crates/tabby/src/serve/
// create_local_engine(args, &model_dir, &metadata)
// Path: crates/tabby/src/serve/
// args.device.to_string()
// Path: crates/tabby/src/serve/
// download_model(&args.model, &args.device)
} else {

fn create_ctranslate2_engine(
args: &crate::serve::ServeArgs,
model_dir: &ModelDir,
metadata: &Metadata,
) -> Box<dyn TextGeneration> {
let device = format!("{}", args.device);
let options = CTranslate2EngineOptionsBuilder::default()

· 4 min read

This blog post focuses on understanding stream laziness in Tabby. You do not need to know this information to use Tabby, but for those interested, it offers a deeper dive into why and how Tabby handles its LLM workload.

What is streaming?​

Let's begin by setting up a simple example program:


const express = require('express');

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function* llm() {
  let i = 1;
  while (true) {
    console.log(`producing ${i}`);
    yield i++;

    // Mimic LLM inference latency.
    await sleep(1000);
  }
}

function server(llm) {
  const app = express();
  app.get('/', async (req, res) => {
    res.writeHead(200, {
      'Content-Type': 'application/jsonstream',
      'Transfer-Encoding': 'chunked',
    });

    let value, done;
    do {
      ({ value, done } = await llm.next());
      res.write(JSON.stringify(value));
      res.write('\n');
    } while (!done);
  });

  app.listen(8080);
}

async function client() {
  const resp = await fetch('http://localhost:8080');

  // Read values from our stream
  const reader = resp.body.pipeThrough(new TextDecoderStream()).getReader();
  // We're only reading 3 items this time:
  for (let i = 0; i < 3; i++) {
    // we know our stream is infinite, so there's no need to check `done`.
    const { value } = await reader.read();
    console.log(`read ${value}`);
    await sleep(10);
  }
}

server(llm());
client();


In this example, we create an async generator to mimic an LLM that produces a stream of tokens (plain integers here). We then create an HTTP endpoint that wraps the generator, as well as a client that reads values from the HTTP stream. It's important to note that our generator logs producing ${i}, and our client logs read ${value}. The LLM inference could take an arbitrary amount of time to complete, simulated by a 1000ms sleep in the generator.

Stream Laziness​

If you were to run this program, you'd notice something interesting. We'll observe the LLM continuing to output producing ${i} even after the client has finished reading three times. This might seem obvious, given that the LLM is generating an infinite stream of integers. However, it represents a problem: our server must maintain an ever-expanding queue of items that have been pushed in but not pulled out.

Moreover, the workload involved in creating the stream is typically both expensive and time-consuming, such as computation workload on the GPU. But what if the client aborts the in-flight request due to a network issue or other intended behaviors?

This is where the concept of stream laziness comes into play. We should perform computations only when the client requests them. If the client no longer needs a response, we should halt production and pause the stream, thereby saving valuable GPU resources.
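
As a minimal sketch of what "compute only on request" means (an illustration, not Tabby's implementation), the consumer below calls next() only when it wants another item and then closes the generator with return(). The names tokens and takeLazily are hypothetical, and the producer records its work in an array instead of logging so the effect can be inspected.

```javascript
// Hypothetical stand-in for an LLM: records every item it produces.
async function* tokens(produced) {
  let i = 1;
  while (true) {
    produced.push(i);
    yield i++;
  }
}

// Pull at most `n` items, then close the generator so no further work is done.
async function takeLazily(stream, n) {
  const out = [];
  for (let k = 0; k < n; k++) {
    const { value, done } = await stream.next();
    if (done) break;
    out.push(value);
  }
  // Tell the generator that the consumer is finished.
  await stream.return();
  return out;
}
```

Because the generator body only advances inside next(), the producer never runs ahead of the consumer: after taking three items, only three items have been produced.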


How to handle cancellation?​

The core idea is straightforward: on the server side, we need to listen to the close event and check if the connection is still valid before pulling data from the LLM stream.

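app.get('/', async (req, res) => {
  // Stop pulling from the LLM stream once the client disconnects.
  let canceled = false;
  req.on('close', () => canceled = true);

  let value, done;
  do {
    ({ value, done } = await llm.next());
    if (!canceled) res.write(JSON.stringify(value) + '\n');
  } while (!done && !canceled);
  res.end();
});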
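
The same idea can be tried without an HTTP server. The sketch below is a standalone approximation of the handler above: a plain EventEmitter stands in for the request object, and pump and counter are hypothetical names introduced only for this illustration.

```javascript
const { EventEmitter } = require('events');

// Pulls from `stream` and forwards items to `write` until the stream ends
// or `connection` emits 'close'.
async function pump(stream, connection, write) {
  let canceled = false;
  connection.once('close', () => { canceled = true; });

  while (!canceled) {
    const { value, done } = await stream.next();
    if (done) break;
    if (canceled) break; // closed while we were waiting for the item
    write(value);
  }
  // Close the generator so it stops producing.
  await stream.return();
}

// Hypothetical producer that tracks how many items it has generated.
async function* counter(stats) {
  let i = 1;
  while (true) {
    stats.produced = i;
    yield i++;
    // Give control back to the event loop, mimicking asynchronous inference.
    await new Promise((resolve) => setImmediate(resolve));
  }
}
```

Emitting 'close' on the emitter after a few items ends the loop, and the producer's counter stops at the last item that was actually pulled.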

Implement cancellation for Tabby​

In Tabby, effective management of code completion cancellations is crucial for promptly responding to users' inputs while optimizing model usage to enhance performance.

On the client side, whenever we receive a new input from a user, it's essential to abort the previous query and promptly retrieve a new response from the server.

// Demo code on the client side

let controller;

const callServer = async (prompt) => {
  controller = new AbortController();
  const signal = controller.signal;
  // 2. Call the server API to get the result for the prompt
  const response = await fetch("/v1/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json"
    },
    body: JSON.stringify({ prompt }),
    signal
  });
};

const onChange = (e) => {
  if (controller) controller.abort(); // Abort the previous request
  callServer(e.target.value);
};

// 1. Debounce the input 100ms for example
// (`debounce` is assumed to come from a utility library such as lodash)
<input onChange={debounce(onChange)} />
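
The demo above relies on a debounce helper that it does not define. One possible shape for it, together with a small "latest request wins" wrapper around AbortController, is sketched below. Both debounce and latestOnly are illustrative assumptions, not code from Tabby's clients, and the send callback is injected so the sketch runs without a server.

```javascript
// Delay calls to `fn` until `waitMs` has passed without a newer call.
function debounce(fn, waitMs = 100) {
  let timer;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Wrap `send` so that starting a new request aborts the previous one.
// `send` is a hypothetical (prompt, signal) => Promise, e.g. a fetch wrapper.
function latestOnly(send) {
  let controller;
  return (prompt) => {
    if (controller) controller.abort(); // Abort the previous request
    controller = new AbortController();
    return send(prompt, controller.signal);
  };
}
```

In a browser, send could pass the signal to fetch("/v1/completions", ...) as in the demo, and the input handler could be wired as debounce((e) => request(e.target.value), 100), where request is the function returned by latestOnly.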

By employing streaming and implementing laziness semantics appropriately, all components operate smoothly and efficiently!


That's it​

We would love to invite you to join our Slack community! Please feel free to reach out to us on Slack. We have channels for discussing all aspects of the product and tech, and everyone is welcome to join the conversation.

Happy hacking 😁💪🏻

· 2 min read

We are thrilled to announce the release of Tabby v0.1.1 👏🏻.

Staring tabby riding on llama.cpp

Created with SDXL-botw and a twitter post of BigCode

Tabby users on Apple M1 and M2 chips can now use Metal inference by passing the --device metal flag, thanks to llama.cpp's awesome Metal support.

The Tabby team made a contribution by adding support for the StarCoder series models (1B/3B/7B) in llama.cpp, enabling more appropriate model usage on the edge for completion use cases.

llama_print_timings:        load time =   105.15 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 25.07 ms / 6 tokens ( 4.18 ms per token, 239.36 tokens per second)
llama_print_timings: eval time = 311.80 ms / 28 runs ( 11.14 ms per token, 89.80 tokens per second)
llama_print_timings: total time = 340.25 ms

Inference benchmarking with StarCoder-1B on Apple M2 Max now takes approximately 340ms, compared to the previous time of around 1790ms. This represents a roughly 5x speed improvement.
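
The rounded figures can be re-derived from the numbers quoted above. The short calculation below is only an arithmetic check; the variable names are arbitrary, and the previous total of 1790ms is the approximate value given in the text.

```javascript
// Figures copied from the timing log and the paragraph above.
const evalMs = 311.80;   // eval time for 28 runs
const evalRuns = 28;
const totalMs = 340.25;  // total time with Metal
const previousMs = 1790; // approximate previous total time

const msPerToken = evalMs / evalRuns;               // about 11.14 ms
const tokensPerSecond = evalRuns / (evalMs / 1000); // about 89.80
const speedup = previousMs / totalMs;               // about 5.26

console.log(msPerToken.toFixed(2), tokensPerSecond.toFixed(2), speedup.toFixed(2));
```

The ratio of about 5.26 is where the "roughly 5x" figure comes from.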

This enhancement leads to a significant inference speed upgrade 🚀 and marks a meaningful milestone in Tabby's adoption on Apple devices. Check out our Model Directory to discover LLM models with Metal support! 🎁


Check out the latest Tabby updates on LinkedIn and in our Slack community! Our Tabby community is eager for your participation. ❤️

· 4 min read

Hugging Face Spaces offers an easy-to-use Nvidia GPU hosting runtime, allowing anyone to host their machine learning models or AI applications.

In this blog post, we are going to show you how to deploy a Tabby instance in Hugging Face Spaces. If you have not heard of Tabby, it's an open-source GitHub Copilot alternative that supports code completion. Check out more details here.

How it works​

Let's first take a look at the steps needed to deploy a Tabby instance in Hugging Face. It's super easy and you don't need much coding knowledge. Buckle up and let's get started.

Step 1: Create a new Hugging Face Space (link). Spaces are code repositories that host application code.

Step 2: Create a Dockerfile to capture your machine learning models' logic, and bring up a server to serve requests.

Step 3: After the Space is built, you will be able to send requests to the APIs.

That's it! With the hosted APIs, now you can connect Tabby's IDE extensions to the API endpoint. Next, we will deep dive into each step with screenshots!! Everything will be done in the Hugging Face UI. No local setup is needed.


Looking to quickly start a Tabby instance? You can skip the tutorial entirely and simply create a Space from this template.

Deep Dive​

Create a new Space​

After you create a Hugging Face account, you should be able to see the following page by clicking this link. The owner name will be your account name. Fill in a Space name, e.g. "tabbyml", and select Docker as Space SDK. Then click "Create Space" at the bottom.

In this walkthrough, we recommend using an Nvidia T4 instance to deploy a model of ~1B parameters.

Create a new Space

Uploading Dockerfile​

Advanced users can leverage the Git workspace. In this blog post, we will show you the UI flow instead. After you click "Create Space" in the last step, you will be directed to this page. Just ignore the main text and click "Files" in the top right corner.

Docker Space

After clicking on "Files", you will see an "Add file" button. Click it, then click "Create a new file".

Empty Space

Then you will be redirected to the page below. Set the filename to "Dockerfile" and copy the content to the "Edit" input box. You can copy the code from the appendix here to bring up the SantaCoder-1B model. Once ready, click the button "Commit new file to main" at the bottom.

Edit Dockerfile

Edit Readme​

You also need to add a new line to the README.md file. Click the "edit" button in the README.md file.


Add this line "app_port: 8080" after "sdk: docker"


Click the button "Commit to main" to save the changes.

Verify Tabby is running​

Click on the "App" button, and you should see that the container is building:

Space Building

If the App is up successfully, you should see this page:

Tabby Swagger

Call code completion API​

Now, you are able to call the completion API. The full URL is https://{YOUR-ACCOUNT-NAME} In this post, the URL is

To test if your APIs are up and running, use this online tool to send curl commands:


The complete curl command can also be located in the appendix. Ensure that you have adjusted the URL to align with your Hugging Face Spaces settings!

(If you set the Space to private, you will need to provide your Hugging Face access token as a bearer token in the HTTP headers, like Authorization: Bearer $HF_ACCESS_TOKEN.)
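
For readers who prefer calling the API from code rather than curl, the sketch below builds an equivalent request, including the optional bearer token for private Spaces. It is an illustration under assumptions: buildCompletionRequest is a hypothetical helper, the /v1/completions path is assumed to match the endpoint used by Tabby clients, and the payload mirrors the curl command in the appendix. Nothing is sent until you pass the result to fetch.

```javascript
// Builds the arguments for fetch(); no request is made here.
// `spaceUrl` is a placeholder for your own Space URL; `token` is optional.
function buildCompletionRequest(spaceUrl, token) {
  const headers = {
    'Content-Type': 'application/json',
    Accept: 'application/json',
  };
  // Private Spaces need the Hugging Face access token as a bearer token.
  if (token) headers.Authorization = `Bearer ${token}`;

  // Illustrative payload, following the curl command in the appendix.
  const body = {
    language: 'python',
    segments: {
      prefix: 'def fib(n):\n    ',
      suffix: '\n        return fib(n - 1) + fib(n - 2)',
    },
  };

  return {
    url: new URL('/v1/completions', spaceUrl).toString(),
    options: { method: 'POST', headers, body: JSON.stringify(body) },
  };
}
```

With your own Space URL in place, the result can be used as const { url, options } = buildCompletionRequest(spaceUrl, token); followed by await fetch(url, options).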


In this post, we covered the detailed steps for deploying a Tabby instance to Hugging Face Spaces. By following these steps, anyone is able to bring up their own code completion APIs easily.



FROM tabbyml/tabby

USER root
RUN mkdir -p /data
RUN chown 1000 /data

USER 1000
CMD ["serve", "--device", "cuda", "--model", "TabbyML/SantaCoder-1B"]

CURL Command​

curl -L '' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-d '{
"language": "python",
"segments": {
"prefix": "def fib(n):\n ",
"suffix": "\n return fib(n - 1) + fib(n - 2)"