Skip to main content

Announcing our $3.2M seed round, and the long-awaited RAG release in Tabby v0.3.0

Β· 2 min read

We are excited to announce that TabbyML has raised a $3.2M seed round to move towards our goal of building an open ecosystem to supercharge developer experience with LLM πŸŽ‰πŸŽ‰πŸŽ‰.

Why Tabby 🐾 ?​

With over 10 years coding experience, we recognize the transformative potential of LLMs in reshaping developer toolchains. While many existing products lean heavily on cloud-based end-to-end solutions, we firmly believe that for AI to be genuinely the core of every developer's toolkit, the next-gen LLM-enhanced developer tools should embrace an open ecosystem. This approach promotes not just flexibility for easy customization, but also fortifies security.

Today, Tabby stands out as the most popular and user-friendly solution to enable coding assistant experience fully owned by users. Looking ahead, we are poised to delve even further into the developer lifecycle, and innovate across the full spectrum. At TabbyML, developers aren't just participants β€” they are at the heart of the LLM revolution.

Release v0.3.0 - Retrieval Augmented Code Completion πŸŽβ€‹

Tabby also comes to a v0.3.0 release, with the support of retrieval-augmented code completion enabled by default. Enhanced by repo-level retrieval, Tabby gets smarter at your codebase and will quickly reference to a related function / code example from another file in your repository.

A blog series detailing the technical designs of retrieval-augmented code completion will be published soon. Stay tuned!πŸ””

Example prompt for retrieval-augmented code completion:

// Path: crates/tabby/src/serve/engine.rs
// fn create_llama_engine(model_dir: &ModelDir) -> Box<dyn TextGeneration> {
// let options = llama_cpp_bindings::LlamaEngineOptionsBuilder::default()
// .model_path(model_dir.ggml_q8_0_file())
// .tokenizer_path(model_dir.tokenizer_file())
// .build()
// .unwrap();
//
// Box::new(llama_cpp_bindings::LlamaEngine::create(options))
// }
//
// Path: crates/tabby/src/serve/engine.rs
// create_local_engine(args, &model_dir, &metadata)
//
// Path: crates/tabby/src/serve/health.rs
// args.device.to_string()
//
// Path: crates/tabby/src/serve/mod.rs
// download_model(&args.model, &args.device)
} else {
create_llama_engine(model_dir)
}
}

fn create_ctranslate2_engine(
args: &crate::serve::ServeArgs,
model_dir: &ModelDir,
metadata: &Metadata,
) -> Box<dyn TextGeneration> {
let device = format!("{}", args.device);
let options = CTranslate2EngineOptionsBuilder::default()
.model_path(model_dir.ctranslate2_dir())
.tokenizer_path(model_dir.tokenizer_file())
.device(device)
.model_type(metadata.auto_model.clone())
.device_indices(args.device_indices.clone())
.build()
.

Stream laziness in Tabby

Β· 4 min read

This blog focuses on understanding stream laziness in Tabby. You do not need to know this information to use the Tabby, but for those interested, it offers a deeper dive on why and how the Tabby handle its LLM workload.

What is streaming?​

Let's begin by setting up a simple example program:

intro

const express = require('express');

function sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}

async function* llm() {
let i = 1;
while (true) {
console.log(`producing ${i}`);
yield i++;

// Mimic LLM inference latency.
await sleep(1000);
}
}

function server(llm) {
const app = express();
app.get('/', async (req, res) => {
res.writeHead(200, {
'Content-Type': 'application/jsonstream',
'Transfer-Encoding': 'chunked',
});

let value, done;
do {
({ value, done } = await llm.next());
res.write(JSON.stringify(value));
res.write('\n');
} while (!done);
});

app.listen(8080);
}

async function client() {
const resp = await fetch('http://localhost:8080');

// Read values from our stream
const reader = resp.body.pipeThrough(new TextDecoderStream()).getReader();
// We're only reading 3 items this time:
for (let i = 0; i < 3; i++) {
// we know our stream is infinite, so there's no need to check `done`.
const { value } = await reader.read();
console.log(`read ${value}`);
await sleep(10ms);
}
}

server(llm());
client();

In this example, we are creating an async generator to mimic a LLM that produces string tokens. We then create an HTTP endpoint that wraps the generator, as well as a client that reads values from the HTTP stream. It's important to note that our generator logs producing ${i}, and our client logs read ${value}. The LLM inference could take an arbitrary amount of time to complete, simulated by a 1000ms sleep in the generator.

Stream Laziness​

If you were to run this program, you'd notice something interesting. We'll observe the LLM continuing to output producing ${i} even after the client has finished reading three times. This might seem obvious, given that the LLM is generating an infinite stream of integers. However, it represents a problem: our server must maintain an ever-expanding queue of items that have been pushed in but not pulled out.

Moreover, the workload involved in creating the stream is typically both expensive and time-consuming, such as computation workload on the GPU. But what if the client aborts the in-flight request due to a network issue or other intended behaviors?

This is where the concept of stream laziness comes into play. We should perform computations only when the client requests them. If the client no longer needs a response, we should halt production and pause the stream, thereby saving valuable GPU resources.

Cancellation

How to handle cancellation?​

The core idea is straightforward: on the server side, we need to listen to the close event and check if the connection is still valid before pulling data from the LLM stream.

app.get('/', async (req, res) => {
...
let canceled;
req.on('close', () => canceled = true);
do {
({ value, done } = await llm.next());
...
} while (!done && !canceled);
});

Implement cancellation for Tabby​

In Tabby, effective management of code completion cancellations is crucial for promptly responding to users' inputs while optimizing model usage to enhance performance.

On the client side, whenever we receive a new input from a user, it's essential to abort the previous query and promptly retrieve a new response from the server.

// Demo code in the client side

let controller;

const callServer = (prompt) => {
controller = new AbortController();
const signal = controller.signal;
// 2. calling server API to get the result with the prompt
const response = await fetch("/v1/completions", {
method: "POST",
headers: {
"Content-Type": "application/json"
},
body: JSON.stringify({ prompt })
signal
});
}

const onChange = (e) => {
if (controller) controller.abort(); // Abort the previous request
callServer(e.target.value);
};

// 1. Debounce the input 100ms for example
<input onChange={debounce(onChange)} />

By employing streaming and implementing laziness semantics appropriately, all components operate smoothly and efficiently!

Streaming

That's it​

We would love to invite to join our Slack community! Please feel free to reach out to us on Slack - we have channels for discussing all aspects of the product and tech, and everyone is welcome to join the conversation.

Happy hacking 😁πŸ’ͺ🏻

Tabby v0.1.1: Metal inference and StarCoder supports!

Β· 2 min read

We are thrilled to announce the release of Tabby v0.1.1 πŸ‘πŸ».

Staring tabby riding on llama.cpp

Created with SDXL-botw and a twitter post of BigCode

Apple M1/M2 Tabby users can now harness Metal inference support on Apple's M1 and M2 chips by using the --device metal flag, thanks to llama.cpp's awesome metal support.

The Tabby team made a contribution by adding support for the StarCoder series models (1B/3B/7B) in llama.cpp, enabling more appropriate model usage on the edge for completion use cases.

llama_print_timings:        load time =   105.15 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 25.07 ms / 6 tokens ( 4.18 ms per token, 239.36 tokens per second)
llama_print_timings: eval time = 311.80 ms / 28 runs ( 11.14 ms per token, 89.80 tokens per second)
llama_print_timings: total time = 340.25 ms

Inference benchmarking with StarCoder-1B on Apple M2 Max now takes approximately 340ms, compared to the previous time of around 1790ms. This represents a roughly 5x speed improvement.

This enhancement leads to a significant inference speed upgradeπŸš€, for example, It marks a meaningful milestone in Tabby's adoption on Apple devices. Check out our Model Directory to discover LLM models with Metal support! 🎁

tip

Check out latest Tabby updates on Linkedin and Slack community! Our Tabby community is eager for your participation. ❀️

Deploying a Tabby Instance in Hugging Face Spaces

Β· 4 min read

Hugging Face Spaces offers an easy-to-use Nvidia GPU hosting runtime, allowing anyone to host their machine learning models or AI applications.

In this blog post, we are going to show you how to deploy a Tabby instance in Hugging Face Spaces. If you have not heard of Tabby, it’s an open-source Github Copilot alternative that supports code completion. Check out more details here.

How it works​

Let’s firstly take a look at what steps are needed to deploy a Tabby instance in Hugging Face. It’s super easy and you don’t need much coding knowledge. Buckle up and let’s get started.

Step 1: Create a new Hugging Face Space (link). Spaces are code repositories that host application code.

Step 2: Create a Dockerfile to capture your machine learning models' logic, and bring up a server to serve requests.

Step 3: After space is built, you will be able to send requests to the APIs.

That's it! With the hosted APIs, now you can connect Tabby's IDE extensions to the API endpoint. Next, we will deep dive into each step with screenshots!! Everything will be done in the Hugging Face UI. No local setup is needed.

tip

Looking to quickly start a Tabby instance? You can skip the tutorials entirely and simply create space from this template.

Deep Dive​

Create a new Space​

After you create a Hugging Face account, you should be able to see the following page by clicking this link. The owner name will be your account name. Fill in a Space name, e.g. "tabbyml", and select Docker as Space SDK. Then click "Create Space" at the bottom.

In this walkthrough we recommend using Nvidia T4 instance to deploying a model of ~1B parameter size.

Create a new Space

Uploading Dockerfile​

For advanced users, you can leverage the Git workspace. In this blog post, we will show you the UI flow instead. After you click the "Create a Space" in the last step, you will be directed to this page. Just ignore the main text and click the "Files" on the top right corner.

Docker Space

After clicking on the "Files", you will be able to see a "Add file" button, click that, then click on "Create a new file"

Empty Space

Then you will be redirected to the page below. Set the filename to β€œDockerfile” and copy the content to the β€œEdit” input box. You can copy the code from the appendix here to bring up the SantaCoder-1B model. Once ready, click the button β€œCommit new file to main” on the bottom.

Edit Dockerfile

Edit Readme​

You also need to add a new line the README.md. Click the "edit" button in the README.md file.

Empty README

Add this line "app_port: 8080" after "sdk: docker"

Edit README

Click the button "Commit to main" to save the changes.

Verify Tabby is running​

Click on the "App" button, you should be able to see the container is building:

Space Building

If the App is up successfully, you should see this page:

Tabby Swagger

Call code completion API​

Now, you are able to call the completion API. The full URL is https://YOUR-ACCOUNT-NAME-tabbyml.hf.space/v1/completions. In this post, the URL is https://randxie-tabbyml.hf.space/v1/completions.

To test if your APIs are up and running, use this online tool to send curl commands:

curl

The complete curl command can also be located in the appendix. Ensure that you have adjusted the URL to align with your Hugging Face Spaces settings!

(If you are setting the space to private, you will need to fill in your Huggingface Access Token as bearer token in HTTP Headers, like Authorization: Bearer $HF_ACCESS_TOKEN.)

Conclusion​

In this post, we covered the detailed steps for deploying a Tabby instance to Hugging Face Spaces. By following these steps, anyone is able to bring up their own code completion APIs easily.

Appendix​

Dockerfile​

FROM tabbyml/tabby

USER root
RUN mkdir -p /data
RUN chown 1000 /data

USER 1000
CMD ["serve", "--device", "cuda", "--model", "TabbyML/SantaCoder-1B"]

CURL Command​

curl -L 'https://randxie-tabbyml.hf.space/v1/completions' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-d '{
"language": "python",
"segments": {
"prefix": "def fib(n):\n ",
"suffix": "\n return fib(n - 1) + fib(n - 2)"
}
}'

Introducing First Stable Release: v0.0.1

Β· One min read

We're thrilled to announce Tabby's first stable version, v0.0.1! πŸŽ‰ This marks a significant milestone in our continuous efforts to refine our platform. The Tabby API specification is now officially stable, providing a reliable foundation for future development. πŸš€

πŸ“¦ To enjoy these improvements, simply upgrade your Tabby instance to v0.0.1 using the image tag: tabbyml/tabby:v0.0.1. This ensures access to the latest features and optimizations.

πŸ“š For more details and to access the release, visit our GitHub repository at https://github.com/TabbyML/tabby/releases/tag/v0.0.1.

We deeply appreciate your ongoing support and feedback, which has been instrumental in shaping Tabby's development. πŸ™ We're committed to delivering excellence and innovation as we move forward.

Thank you for being part of the Tabby community! ❀️