Blog | Tabby

Decode the Decoding in Tabby

October 21, 2023 · 4 min read

In the context of the Transformer model, which is widely used across LLMs, decoding refers to the process of generating an output sequence from an encoded input. Tabby recently implemented incremental decoding as part of the greedy search. This blog will explain our thoughts behind this 🛠️💡.

Common Decoding Methods

Here's an example to facilitate understanding different decoding methods:

Let's say a developer starts to write a list comprehension in Python to get even numbers from a list:

numbers = [1, 2, 3, 4, 5]
evens = [x for x in numbers

To simplify the scenario, we assume that the language model maintains a probability distribution as shown below:

[Figure: probability distribution]

Here's how different decoding methods might suggest 🔍:

Beam Search 🌈

Beam search maintains multiple candidate sequences (beams) at each time step. Increasing the beam size can improve decoding quality at the expense of higher computation cost.

Assuming num_beams=2, in this case, it'll produce if x % 2, as 0.4 * 0.7 = 0.28 gives the highest probability when considering a sequence of 2.

[Figure: beam search]
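The beam search step above can be sketched in a few lines of Python. This is only a toy illustration, not Tabby's implementation: the `TOY_MODEL` table and the `beam_search` helper are hypothetical, and only the 0.4 and 0.7 probabilities come from the example above (every other number is made up so that the table is complete).

```python
import math

# Toy next-token table: prefix (tuple of tokens) -> {next_token: probability}.
# Only 0.4 and 0.7 come from the example in the text; the rest is invented.
TOY_MODEL = {
    (): {"]\n": 0.5, " if": 0.4, " and": 0.1},
    ("]\n",): {"print": 0.5, "evens": 0.3, "\n": 0.2},
    (" if",): {" x % 2": 0.7, " x > 2": 0.3},
    (" and",): {" x": 1.0},
}


def beam_search(model, num_beams, steps):
    """Keep the `num_beams` highest-probability sequences at every step."""
    beams = [((), 0.0)]  # each beam is (tokens, log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, logp in beams:
            for token, p in model.get(tokens, {}).items():
                candidates.append((tokens + (token,), logp + math.log(p)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


best_tokens, best_logp = beam_search(TOY_MODEL, num_beams=2, steps=2)[0]
best_prob = math.exp(best_logp)
print(best_tokens, round(best_prob, 2))
```

With this made-up table, the winning beam is `if x % 2` with probability 0.4 * 0.7 = 0.28, even though `if` was not the single most likely first token.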

Greedy Decoding 🏆

Greedy decoding selects the most probable next token at each step. It is the most intuitive method but can sometimes lead to sub-optimal sequences, because it only considers one token at a time and makes each choice greedily.

In this particular case, greedy decoding would complete the code with ]\n print, as each token here has the maximum probability given the token chosen before it.

[Figure: greedy decoding]
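For comparison, here is a minimal greedy decoding sketch over a toy probability table. The `TOY_MODEL` table and the `greedy_decode` helper are hypothetical; only the 0.4 and 0.7 values appear in this post, and the remaining probabilities are invented for illustration.

```python
# Toy next-token table: prefix (tuple of tokens) -> {next_token: probability}.
# Only 0.4 and 0.7 come from the text; the other numbers are invented.
TOY_MODEL = {
    (): {"]\n": 0.5, " if": 0.4, " and": 0.1},
    ("]\n",): {"print": 0.5, "evens": 0.3, "\n": 0.2},
    (" if",): {" x % 2": 0.7, " x > 2": 0.3},
    (" and",): {" x": 1.0},
}


def greedy_decode(model, steps):
    """Pick the single most probable next token at every step."""
    tokens, prob = (), 1.0
    for _ in range(steps):
        choices = model.get(tokens)
        if not choices:
            break
        token = max(choices, key=choices.get)
        prob *= choices[token]
        tokens += (token,)
    return tokens, prob


greedy_tokens, greedy_prob = greedy_decode(TOY_MODEL, steps=2)
print(greedy_tokens, greedy_prob)
```

Under these assumed numbers, greedy decoding commits to `]\n` first and ends with a sequence probability of 0.25, lower than the 0.28 that a two-token lookahead would find.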

Sampling-based methods 🎲

The two methods above always produce deterministic results given the language model's probability distribution. Oftentimes this isn't ideal, especially in conversational scenarios where users often retry expecting an alternative answer (or think about language translation). Sampling-based methods like random sampling, top-k, and top-p sampling instead introduce randomness to achieve diverse outputs.

However, because the approach is now non-deterministic, the model can sometimes generate incoherent gibberish. There are many different sampling methods that sharpen the distribution or redistribute the probability mass to give a higher chance of generating meaningful results. Here we also want to emphasize that in practical implementations, sampling-based methods are often applied on top of beam search or greedy decoding to combine the best of both worlds.
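To make "sharpening" and "redistributing" concrete, here is a rough sketch of temperature, top-k, and top-p filtering applied before sampling. The `sample_next_token` function and its parameter names are assumptions for illustration and do not describe how any particular product configures sampling.

```python
import random


def sample_next_token(probs, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token after temperature scaling, top-k and top-p filtering."""
    # Temperature < 1 sharpens the distribution; temperature > 1 flattens it.
    scaled = {t: p ** (1.0 / temperature) for t, p in probs.items() if p > 0}
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative mass reaches top_p.
        total = sum(weight for _, weight in ranked)
        kept, cumulative = [], 0.0
        for token, weight in ranked:
            kept.append((token, weight))
            cumulative += weight / total
            if cumulative >= top_p:
                break
        ranked = kept
    tokens, weights = zip(*ranked)
    # random.choices renormalizes the remaining weights for us.
    return rng.choices(tokens, weights=weights, k=1)[0]


probs = {"]\n": 0.5, " if": 0.4, " and": 0.1}  # made-up distribution
rng = random.Random(0)
samples = [sample_next_token(probs, top_p=0.8, rng=rng) for _ in range(200)]
print(set(samples))
```

Setting `top_k=1` collapses this back to greedy decoding, which is one way to see sampling as a layer on top of the deterministic methods.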

Era of Streaming for LLM

Latency is key in user experience for many LLM applications. In order to minimize the idle time for users, streaming response is commonly adopted. In LLM streaming, we start decoding the response as soon as it's available, instead of waiting for the entire response to be returned.

In the context of streaming, although greedy decoding often produces sub-optimal results compared to beam search or sampling-based methods, it wins with its fast and parallelizable computation. Most LLM applications today (e.g. ChatGPT, Bard, Anthropic, etc.) have adopted greedy decoding with certain sampling strategies and carefully tuned them for different tasks: creative tasks such as chatbots or article writing receive diverse responses from sampling; input-grounded tasks such as translation or coding benefit from greedy decoding to get the immediate "correct" result. (Indeed, ⌨️ coding tasks put more emphasis on consistency with the given context, the lines of code you just wrote, than on the variety of possible responses.😆)

Incremental Decoding ⏩

However, oftentimes decoding a sequence of tokens one-by-one without considering previously decoded results could produce undesired results. For example,

Decoding first token:                   ......, 211       ->   "......[ llo]"
Independently decoding the next token:  ......, 207, 211  ->   "......[ he][ llo]"

In the case above, the final decoded string would be " he llo" with an awkward space in between. To resolve issues like this, we could cache the already-decoded prefix and append it to the current token to decode together. It is the core idea of incremental decoding to take the prefix token into consideration for decoding current tokens. With incremental decoding, we get the desired result for the example above:

Incremental decoding:  ......, 207, 211  ->   "......[ hello]"  ✅
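Here is a small Python sketch of the idea. It is not Tabby's Rust implementation: `TOY_VOCAB`, `toy_decode`, and the `IncrementalDecoder` class below are hypothetical stand-ins, where the toy decoder adds a leading space to whatever it decodes in order to reproduce the awkward-space problem.

```python
# Hypothetical vocabulary; only the ids 207 and 211 come from the example above.
TOY_VOCAB = {207: "he", 211: "llo"}


def toy_decode(token_ids):
    """Stand-in for a tokenizer decode(): joins pieces and adds a leading space."""
    if not token_ids:
        return ""
    return " " + "".join(TOY_VOCAB[i] for i in token_ids)


class IncrementalDecoder:
    """Decode the cached prefix together with each new token, emit only the new text."""

    def __init__(self, decode):
        self.decode = decode
        self.token_ids = []  # all tokens seen so far
        self.text = ""       # text already emitted for those tokens

    def next_token(self, token_id):
        self.token_ids.append(token_id)
        full_text = self.decode(self.token_ids)
        # Assumes the previously emitted text stays a prefix of the new decoding.
        new_text = full_text[len(self.text):]
        self.text = full_text
        return new_text


independent = "".join(toy_decode([t]) for t in (207, 211))
decoder = IncrementalDecoder(toy_decode)
incremental = "".join(decoder.next_token(t) for t in (207, 211))
print(repr(independent), repr(incremental))
```

Decoding each token on its own yields " he llo", while decoding with the cached prefix yields " hello". A production version would likely cap how much of the prefix it re-decodes to keep the cost per token bounded.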

For interested folks, you can refer to Tabby's exact implementation in the IncrementalDecoding function in crates/tabby-inference/src/decoding.rs.

Have you found our new decoding methods effective? Share your thoughts with us in our Slack channel 🌍😊!

Repository context for LLM assisted code completion

October 16, 2023 · 5 min read

Meng Zhang

Whiteboard Credit: Elon Musk's tweet

Using a Large Language Model (LLM) pretrained on coding data proves incredibly useful for "self-contained" coding tasks, like conjuring up a completely new function that operates independently 🚀.

However, employing LLM for code completion within a vast and intricate pre-existing codebase poses certain challenges 🤔. To tackle this, LLM needs to comprehend the dependencies and APIs that intricately link its subsystems. We must provide this "repository context" to LLM when requesting it to complete a snippet.

To be more specific, we should:

Aid LLM in understanding the overall codebase, allowing it to grasp the intricate code with dependencies and generate fresh code that utilizes existing abstractions.
Efficiently convey all of this "code context" in a manner that fits within the context window (~2000 tokens), keeping completion latency reasonably low.

To demonstrate the effectiveness of this approach, below is an example showcasing TabbyML/StarCoder-1B performing code completion within Tabby's own repository.

Completion request
start_heartbeat(args);
Server::bind(&address)
    .serve(app.into_make_service())
    .await
    .unwrap_or_else(|err| fatal!("Error happens during serving: {}", err))
}

fn api_router(args: &ServeArgs) -> Router {
    let index_server = Arc::new(IndexServer::new());
    let completion_state = {
        let (
            engine,
            EngineInfo {
                prompt_template, ..
            },
        ) = create_engine(&args.model, args);
        let engine = Arc::new(engine);
        let state = completions::CompletionState::new(
            ║
    }

Without access to the repository context, LLM can only complete snippets based on the current editor window, generating a wrong function call to CompletionState::new.

Without repository context
fn api_router(args: &ServeArgs) -> Router {
        ...
        let engine = Arc::new(engine);
        let state = completions::CompletionState::new(
            engine,
            prompt_template,
        );
        Arc::new(state);
        ...
}

However, things change with the repository context. Specifically, suppose we include the entire file crates/tabby/src/serve/completions.rs in the prompt.

Prepend to the completion request
// === crates/tabby/serve/completions.rs ===
// ......
// ......

We can generate a snippet that properly calls CompletionState::new (with the second parameter being index_server.clone()).

With repository context
fn api_router(args: &ServeArgs) -> Router {
        ...
        let engine = Arc::new(engine);
        let state = completions::CompletionState::new(
            engine,
            index_server.clone(),
            prompt_template,
        );
        Arc::new(state);
        ...
}

The Problem: Repository Context

One obvious solution is to pack the whole codebase into LLM with each completion request. Voila✨! LLM has all the context it needs! But alas, this approach falls short for even moderately sized repositories. They're simply too massive to squeeze into the context window, causing a slowdown in inference speed.

A more efficient approach is to be selective, hand-picking the snippets to send. For instance, in the example above, we send the file containing the declaration of the CompletionState::new method. This strategy works like a charm, as illustrated in the example.

However, manually pinpointing the right set of context to transmit to LLM isn't ideal. Plus, sending entire files is a bulky way to relay code context, wasting the precious context window. LLM doesn't need a grand tour of the complete completions.rs, only a robust enough understanding to utilize it effectively. If you continually dispatch multiple files' worth of code just for context, you'll soon hit a wall with the context window limit.

Code snippet to provide context.

In the v0.3.0 release, we introduced Retrieval Augmented Code Completion, a nifty feature that taps into the repository context to enhance code suggestions. Here's a sneak peek of a snippet we pulled from the repository context:

Snippet from the Repository Context: A Glimpse into the Magic
// Path: crates/tabby/src/serve/completions.rs
// impl CompletionState {
//     pub fn new(
//         engine: Arc<Box<dyn TextGeneration>>,
//         index_server: Arc<IndexServer>,
//         prompt_template: Option<String>,
//     ) -> Self {
//         Self {
//             engine,
//             prompt_builder: prompt::PromptBuilder::new(prompt_template, Some(index_server)),
//         }
//     }
// }
//
// Path: crates/tabby/src/serve/mod.rs
// Router::new()
//         .merge(api_router(args))

By snagging snippets like this, LLM gets to peek into variables, classes, methods, and function signatures scattered throughout the repo. This context allows LLM to tackle a multitude of tasks. For instance, it can cleverly decipher how to utilize APIs exported from a module, all thanks to the snippet defining / invoking that API.

Use tree-sitter to create snippets

Tabby, under the hood, leverages 🌳 Tree-sitter queries to construct its index. Tree-sitter is capable of scanning source code written in various languages and extracting data about all the symbols defined in each file.

Historically, Tree-sitter was utilized by IDEs or code editors to facilitate the creation of language formatters or syntax highlighters, among other things. However, we're taking a different approach and using Tree-sitter to aid LLM in understanding the codebase.

Here's an example of the output you'll get when you run the following query on Go source code:

Tree-sitter query to collect all type definitions
(type_declaration (type_spec name: (type_identifier) @name)) @definition.type

Snippets captured by the above query
type payload struct {
	Data string `json:"data"`
}
...

These snippets are then compiled into an efficient token reverse index for use during querying. For each request, we tokenize the text segments and perform a BM25 search in the repository to find relevant snippets. We format these snippets in the line comment style, as illustrated in the example above. This format ensures it doesn't disrupt the existing semantics of the code, making it easy for LLM to understand.
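To give a feel for the retrieval step, here is a simplified BM25 ranking sketch in Python, followed by the line-comment formatting. It is not Tabby's indexing code: the `tokenize`, `BM25Index`, and `format_as_comments` names, the regex tokenizer, and the `k1`/`b` defaults are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    def __init__(self, documents, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(d)) for d in documents]
        self.lengths = [sum(c.values()) for c in self.docs]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        # Number of documents containing each term.
        self.doc_freq = Counter(term for c in self.docs for term in c)

    def score(self, query, i):
        tf, n, total = self.docs[i], len(self.docs), 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            df = self.doc_freq[term]
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            norm = tf[term] + self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avg_len)
            total += idf * tf[term] * (self.k1 + 1) / norm
        return total

    def search(self, query, top_n=1):
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:top_n]


def format_as_comments(path, code):
    """Render a snippet in line-comment style so it does not change code semantics."""
    return "\n".join([f"// Path: {path}"] + [f"// {line}" for line in code.splitlines()])


snippets = [
    ("crates/tabby/src/serve/completions.rs",
     "impl CompletionState {\n    pub fn new(\n        engine: Arc<Box<dyn TextGeneration>>,\n"
     "        index_server: Arc<IndexServer>,\n        prompt_template: Option<String>,\n    ) -> Self {"),
    ("crates/tabby/src/serve/mod.rs", "Router::new()\n        .merge(api_router(args))"),
    ("crates/tabby/src/serve/health.rs", "args.device.to_string()"),
]
index = BM25Index([code for _, code in snippets])
best = index.search("let state = completions::CompletionState::new(")[0]
prompt_prefix = format_as_comments(*snippets[best])
print(prompt_prefix)
```

In this toy corpus, the rare term `CompletionState` outweighs the common term `new`, so the snippet defining the constructor is ranked first and gets prepended to the prompt.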

Roadmap

The current approach to extracting snippets and performing ranking is relatively simple. We're actively working on various aspects to fully iterate through this approach and elevate its efficiency and effectiveness:

Snippet Indexing: We are aiming to achieve a detailed understanding of what snippets should be incorporated into the index for each programming language. 📚
Retrieval Algorithm: Our focus is on refining the retrieval algorithm using attention weight heatmaps. Ideally, snippets with higher attention weights from Language Models (LLMs) should be prioritized in the retrieval process. ⚙️

We are incredibly enthusiastic about the potential for enhancing the quality and are eager to delve deeper into this exciting development! 🌟

Give it a try

To use this repository context feature:

Install Tabby.
Navigate to the Repository Context page and follow the instructions to set it up.

Announcing our $3.2M seed round, and the long-awaited RAG release in Tabby v0.3.0

October 14, 2023 · 2 min read

Meng Zhang

Lucy Gao

We are excited to announce that TabbyML has raised a $3.2M seed round to move towards our goal of building an open ecosystem to supercharge developer experience with LLM 🎉🎉🎉.

Why Tabby 🐾 ?

With over 10 years of coding experience, we recognize the transformative potential of LLMs in reshaping developer toolchains. While many existing products lean heavily on cloud-based end-to-end solutions, we firmly believe that for AI to be genuinely the core of every developer's toolkit, the next-gen LLM-enhanced developer tools should embrace an open ecosystem. This approach promotes not just flexibility for easy customization, but also fortifies security.

Today, Tabby stands out as the most popular and user-friendly solution to enable coding assistant experience fully owned by users. Looking ahead, we are poised to delve even further into the developer lifecycle, and innovate across the full spectrum. At TabbyML, developers aren't just participants — they are at the heart of the LLM revolution.

Release v0.3.0 - Retrieval Augmented Code Completion 🎁

Tabby also reaches its v0.3.0 release, with retrieval-augmented code completion enabled by default. Enhanced by repo-level retrieval, Tabby gets smarter about your codebase and will quickly reference a related function / code example from another file in your repository.

A blog series detailing the technical designs of retrieval-augmented code completion will be published soon. Stay tuned!🔔

Example prompt for retrieval-augmented code completion:

// Path: crates/tabby/src/serve/engine.rs
// fn create_llama_engine(model_dir: &ModelDir) -> Box<dyn TextGeneration> {
//     let options = llama_cpp_bindings::LlamaEngineOptionsBuilder::default()
//         .model_path(model_dir.ggml_q8_0_file())
//         .tokenizer_path(model_dir.tokenizer_file())
//         .build()
//         .unwrap();
//
//     Box::new(llama_cpp_bindings::LlamaEngine::create(options))
// }
//
// Path: crates/tabby/src/serve/engine.rs
// create_local_engine(args, &model_dir, &metadata)
//
// Path: crates/tabby/src/serve/health.rs
// args.device.to_string()
//
// Path: crates/tabby/src/serve/mod.rs
// download_model(&args.model, &args.device)
    } else {
        create_llama_engine(model_dir)
    }
}

fn create_ctranslate2_engine(
    args: &crate::serve::ServeArgs,
    model_dir: &ModelDir,
    metadata: &Metadata,
) -> Box<dyn TextGeneration> {
    let device = format!("{}", args.device);
    let options = CTranslate2EngineOptionsBuilder::default()
        .model_path(model_dir.ctranslate2_dir())
        .tokenizer_path(model_dir.tokenizer_file())
        .device(device)
        .model_type(metadata.auto_model.clone())
        .device_indices(args.device_indices.clone())
        .build()
        .

Stream laziness in Tabby

September 30, 2023 · 4 min read

Wayne Wang

Lucy Gao

Meng Zhang

This blog focuses on understanding stream laziness in Tabby. You do not need to know this information to use Tabby, but for those interested, it offers a deeper dive into why and how Tabby handles its LLM workload.

What is streaming?

Let's begin by setting up a simple example program:

[Figure: intro]

const express = require('express');

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function* llm() {
  let i = 1;
  while (true) {
    console.log(`producing ${i}`);
    yield i++;

    // Mimic LLM inference latency.
    await sleep(1000);
  }
}

function server(llm) {
  const app = express();
  app.get('/', async (req, res) => {
    res.writeHead(200, {
      'Content-Type': 'application/jsonstream',
      'Transfer-Encoding': 'chunked',
    });

    let value, done;
    do {
      ({ value, done } = await llm.next());
      res.write(JSON.stringify(value));
      res.write('\n');
    } while (!done);
  });

  app.listen(8080);
}

async function client() {
  const resp = await fetch('http://localhost:8080');

  // Read values from our stream
  const reader = resp.body.pipeThrough(new TextDecoderStream()).getReader();
  // We're only reading 3 items this time:
  for (let i = 0; i < 3; i++) {
    // we know our stream is infinite, so there's no need to check `done`.
    const { value } = await reader.read();
    console.log(`read ${value}`);
    await sleep(10);
  }
}

server(llm());
client();

In this example, we are creating an async generator to mimic an LLM that produces tokens (plain integers here, for simplicity). We then create an HTTP endpoint that wraps the generator, as well as a client that reads values from the HTTP stream. It's important to note that our generator logs producing ${i}, and our client logs read ${value}. The LLM inference could take an arbitrary amount of time to complete, simulated by a 1000ms sleep in the generator.

Stream Laziness

If you were to run this program, you'd notice something interesting. We'll observe the LLM continuing to output producing ${i} even after the client has finished reading three times. This might seem obvious, given that the LLM is generating an infinite stream of integers. However, it represents a problem: our server must maintain an ever-expanding queue of items that have been pushed in but not pulled out.

Moreover, the workload involved in creating the stream is typically both expensive and time-consuming, such as computation workload on the GPU. But what if the client aborts the in-flight request due to a network issue or other intended behaviors?

This is where the concept of stream laziness comes into play. We should perform computations only when the client requests them. If the client no longer needs a response, we should halt production and pause the stream, thereby saving valuable GPU resources.
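The pull-based behavior we are after can be illustrated with a plain Python generator. This is only an analogy to the JavaScript example, with a hypothetical `produced` list standing in for the expensive GPU work.

```python
produced = []  # records every "inference step" that actually ran


def llm():
    """Lazy token source: the body only runs when the consumer asks for an item."""
    i = 1
    while True:
        produced.append(i)  # stands in for an expensive inference step
        yield i
        i += 1


stream = llm()
consumed = [next(stream) for _ in range(3)]  # the client reads three items
stream.close()  # the client is done, so nothing further is computed

print(consumed, produced)
```

Because the consumer drives the generator, exactly three items are produced for three reads, and no queue of unread items builds up.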

[Figure: Cancellation]

How to handle cancellation?

The core idea is straightforward: on the server side, we need to listen to the close event and check if the connection is still valid before pulling data from the LLM stream.

app.get('/', async (req, res) => {
  ...
  let canceled;
  req.on('close', () => canceled = true);
  do {
    ({ value, done } = await llm.next());
    ...
  } while (!done && !canceled);
});
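The same pattern can be sketched with Python's asyncio, using an `asyncio.Event` in place of the close event. The `llm`, `serve`, and `main` coroutines below are hypothetical and only mirror the shape of the JavaScript handler above.

```python
import asyncio


async def llm(produced):
    i = 1
    while True:
        produced.append(i)
        yield i
        i += 1
        await asyncio.sleep(0.01)  # mimic inference latency


async def serve(stream, closed, sent):
    """Pull from the stream only while the connection is still open."""
    while not closed.is_set():
        sent.append(await stream.__anext__())
    await stream.aclose()  # stop the generator once the client is gone


async def main():
    produced, sent = [], []
    closed = asyncio.Event()
    task = asyncio.create_task(serve(llm(produced), closed, sent))
    await asyncio.sleep(0.035)  # the client reads for a while...
    closed.set()                # ...then disconnects
    await task                  # an in-flight step may still finish here
    count_at_close = len(produced)
    await asyncio.sleep(0.03)   # give the generator a chance to misbehave
    return produced, sent, count_at_close


produced, sent, count_at_close = asyncio.run(main())
print(len(produced), count_at_close)
```

Once the handler observes the closed flag, it stops pulling, so the number of produced items no longer grows after the client disconnects.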

Implement cancellation for Tabby

In Tabby, effective management of code completion cancellations is crucial for promptly responding to users' inputs while optimizing model usage to enhance performance.

On the client side, whenever we receive a new input from a user, it's essential to abort the previous query and promptly retrieve a new response from the server.

// Demo code in the client side

let controller;

const callServer = async (prompt) => {
  controller = new AbortController();
  const signal = controller.signal;
  // 2. calling server API to get the result with the prompt
  const response = await fetch("/v1/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json"
    },
    body: JSON.stringify({ prompt }),
    signal
  });
}

const onChange = (e) => {
  if (controller) controller.abort(); // Abort the previous request
  callServer(e.target.value);
};

// 1. Debounce the input 100ms for example
<input onChange={debounce(onChange)} />
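For readers more at home in Python, the "abort the previous request" idea maps roughly onto cancelling an asyncio task. The `CompletionClient` class and `fake_fetch` coroutine below are hypothetical, and debouncing is left out to keep the sketch short.

```python
import asyncio


class CompletionClient:
    """Keeps at most one in-flight request; a new input cancels the previous one."""

    def __init__(self, fetch):
        self.fetch = fetch   # hypothetical coroutine function: prompt -> completion
        self.current = None  # the task for the request currently in flight

    def on_change(self, prompt):
        if self.current is not None and not self.current.done():
            self.current.cancel()  # abort the previous request
        self.current = asyncio.create_task(self.fetch(prompt))
        return self.current


async def fake_fetch(prompt):
    await asyncio.sleep(0.05)  # stands in for the HTTP round trip
    return f"completion for {prompt!r}"


async def main():
    client = CompletionClient(fake_fetch)
    first = client.on_change("def f")
    await asyncio.sleep(0.01)  # the user keeps typing
    second = client.on_change("def fib")
    result = await second
    return first.cancelled(), result


first_cancelled, result = asyncio.run(main())
print(first_cancelled, result)
```

Only the latest input produces a completion; the stale request is cancelled before it can return.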

By employing streaming and implementing laziness semantics appropriately, all components operate smoothly and efficiently!

[Figure: Streaming]

That's it

We would love to invite you to join our Slack community! Please feel free to reach out to us on Slack - we have channels for discussing all aspects of the product and tech, and everyone is welcome to join the conversation.

Happy hacking 😁💪🏻

Tabby v0.1.1: Metal inference and StarCoder supports!

September 18, 2023 · 2 min read

Meng Zhang

We are thrilled to announce the release of Tabby v0.1.1 👏🏻.

[Figure: Staring tabby riding on llama.cpp]

Created with SDXL-botw and a twitter post of BigCode

Tabby users on Apple M1/M2 machines can now harness Metal inference by using the --device metal flag, thanks to llama.cpp's awesome Metal support.

The Tabby team made a contribution by adding support for the StarCoder series models (1B/3B/7B) in llama.cpp, enabling more appropriate model usage on the edge for completion use cases.

llama_print_timings:        load time =   105.15 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    25.07 ms /     6 tokens (    4.18 ms per token,   239.36 tokens per second)
llama_print_timings:        eval time =   311.80 ms /    28 runs   (   11.14 ms per token,    89.80 tokens per second)
llama_print_timings:       total time =   340.25 ms

Inference benchmarking with StarCoder-1B on Apple M2 Max now takes approximately 340ms, compared to the previous time of around 1790ms. This represents a roughly 5x speed improvement.
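As a quick sanity check, the per-token figures and the speedup can be recomputed from the numbers in the log above. The variable names are ours, and the 1790ms baseline is the approximate figure quoted in the text.

```python
# Values copied from the llama_print_timings output above.
eval_ms, eval_runs = 311.80, 28
total_ms = 340.25
previous_total_ms = 1790.0  # approximate baseline quoted in the text

ms_per_token = eval_ms / eval_runs
tokens_per_second = eval_runs / (eval_ms / 1000)
speedup = previous_total_ms / total_ms

print(f"{ms_per_token:.2f} ms per token")
print(f"{tokens_per_second:.2f} tokens per second")
print(f"{speedup:.1f}x faster")
```

The results line up with the log (about 11.14 ms per token and 89.80 tokens per second) and with the "roughly 5x" claim.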

This enhancement leads to a significant inference speed upgrade 🚀 and marks a meaningful milestone in Tabby's adoption on Apple devices. Check out our Model Directory to discover LLM models with Metal support! 🎁

tip

Check out the latest Tabby updates on LinkedIn and in our Slack community! Our Tabby community is eager for your participation. ❤️

Deploying a Tabby Instance in Hugging Face Spaces

September 5, 2023 · 4 min read

Rand Xie

Meng Zhang

Lucy Gao

Hugging Face Spaces offers an easy-to-use Nvidia GPU hosting runtime, allowing anyone to host their machine learning models or AI applications.

In this blog post, we are going to show you how to deploy a Tabby instance in Hugging Face Spaces. If you have not heard of Tabby, it’s an open-source GitHub Copilot alternative that supports code completion. Check out more details here.

How it works

Let’s first take a look at the steps needed to deploy a Tabby instance in Hugging Face. It’s super easy and you don’t need much coding knowledge. Buckle up and let’s get started.

Step 1: Create a new Hugging Face Space (link). Spaces are code repositories that host application code.

Step 2: Create a Dockerfile to capture your machine learning models' logic, and bring up a server to serve requests.

Step 3: After the Space is built, you will be able to send requests to the APIs.

That's it! With the hosted APIs, now you can connect Tabby's IDE extensions to the API endpoint. Next, we will deep dive into each step with screenshots!! Everything will be done in the Hugging Face UI. No local setup is needed.

tip

Looking to quickly start a Tabby instance? You can skip the tutorial entirely and simply create a Space from this template.

Deep Dive

Create a new Space

After you create a Hugging Face account, you should be able to see the following page by clicking this link. The owner name will be your account name. Fill in a Space name, e.g. "tabbyml", and select Docker as Space SDK. Then click "Create Space" at the bottom.

In this walkthrough, we recommend using an Nvidia T4 instance to deploy a model of ~1B parameters.

[Screenshot: Create a new Space]

Uploading Dockerfile

For advanced users, you can leverage the Git workspace. In this blog post, we will show you the UI flow instead. After you click "Create Space" in the last step, you will be directed to this page. Just ignore the main text and click "Files" in the top right corner.

[Screenshot: Docker Space]

After clicking "Files", you will see an "Add file" button. Click it, then click "Create a new file".

[Screenshot: Empty Space]

Then you will be redirected to the page below. Set the filename to “Dockerfile” and copy the content to the “Edit” input box. You can copy the code from the appendix here to bring up the SantaCoder-1B model. Once ready, click the button “Commit new file to main” on the bottom.

[Screenshot: Edit Dockerfile]

Edit Readme

You also need to add a new line to README.md. Click the "edit" button in the README.md file.

[Screenshot: Empty README]

Add this line "app_port: 8080" after "sdk: docker"

[Screenshot: Edit README]

Click the button "Commit to main" to save the changes.

Verify Tabby is running

Click on the "App" button, and you should see that the container is building:

[Screenshot: Space Building]

If the App is up successfully, you should see this page:

[Screenshot: Tabby Swagger]

Call code completion API

Now, you are able to call the completion API. The full URL is https://YOUR-ACCOUNT-NAME-tabbyml.hf.space/v1/completions. In this post, the URL is https://randxie-tabbyml.hf.space/v1/completions.

To test if your APIs are up and running, use this online tool to send curl commands:

[Screenshot: curl]

The complete curl command can also be located in the appendix. Ensure that you have adjusted the URL to align with your Hugging Face Spaces settings!

(If you set the Space to private, you will need to supply your Hugging Face access token as a bearer token in the HTTP headers, like Authorization: Bearer $HF_ACCESS_TOKEN.)

Conclusion

In this post, we covered the detailed steps for deploying a Tabby instance to Hugging Face Spaces. By following these steps, anyone is able to bring up their own code completion APIs easily.

Appendix

Dockerfile

FROM tabbyml/tabby

USER root
RUN mkdir -p /data
RUN chown 1000 /data

USER 1000
CMD ["serve", "--device", "cuda", "--model", "TabbyML/SantaCoder-1B"]

CURL Command

curl -L 'https://randxie-tabbyml.hf.space/v1/completions' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-d '{
  "language": "python",
  "segments": {
    "prefix": "def fib(n):\n    ",
    "suffix": "\n        return fib(n - 1) + fib(n - 2)"
  }
}'
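If you would rather call the API from Python than from curl, the same request can be assembled with the standard library. This is a sketch only: the `build_completion_request` helper is hypothetical, and the URL is the one used in this post, so adjust it to your own Space.

```python
import json
import urllib.request


def build_completion_request(base_url, prefix, suffix, language="python", access_token=None):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {"language": language, "segments": {"prefix": prefix, "suffix": suffix}}
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    if access_token:  # only needed when the Space is private
        headers["Authorization"] = f"Bearer {access_token}"
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_completion_request(
    "https://randxie-tabbyml.hf.space",
    prefix="def fib(n):\n    ",
    suffix="\n        return fib(n - 1) + fib(n - 2)",
)

# To actually send the request, uncomment the lines below:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Building the request separately from sending it makes it easy to inspect the payload before pointing it at a live Space.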

Introducing First Stable Release: v0.0.1

August 31, 2023 · One min read

Meng Zhang

We're thrilled to announce Tabby's first stable version, v0.0.1! 🎉 This marks a significant milestone in our continuous efforts to refine our platform. The Tabby API specification is now officially stable, providing a reliable foundation for future development. 🚀

📦 To enjoy these improvements, simply upgrade your Tabby instance to v0.0.1 using the image tag: tabbyml/tabby:v0.0.1. This ensures access to the latest features and optimizations.

📚 For more details and to access the release, visit our GitHub repository at https://github.com/TabbyML/tabby/releases/tag/v0.0.1.

We deeply appreciate your ongoing support and feedback, which has been instrumental in shaping Tabby's development. 🙏 We're committed to delivering excellence and innovation as we move forward.

Thank you for being part of the Tabby community! ❤️
