
· 4 min read

This blog post focuses on understanding stream laziness in Tabby. You do not need to know this to use Tabby, but for those interested, it offers a deeper dive into why and how Tabby handles its LLM workload.

What is streaming?

Let's begin by setting up a simple example program:


const express = require('express');

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function* llm() {
  let i = 1;
  while (true) {
    console.log(`producing ${i}`);
    yield i++;

    // Mimic LLM inference latency.
    await sleep(1000);
  }
}

function server(llm) {
  const app = express();
  app.get('/', async (req, res) => {
    res.writeHead(200, {
      'Content-Type': 'application/jsonstream',
      'Transfer-Encoding': 'chunked',
    });

    let value, done;
    do {
      ({ value, done } = await llm.next());
      res.write(JSON.stringify(value));
      res.write('\n');
    } while (!done);
  });

  app.listen(8080);
}

async function client() {
  const resp = await fetch('http://localhost:8080');

  // Read values from our stream.
  const reader = resp.body.pipeThrough(new TextDecoderStream()).getReader();
  // We're only reading 3 items this time:
  for (let i = 0; i < 3; i++) {
    // We know our stream is infinite, so there's no need to check `done`.
    const { value } = await reader.read();
    console.log(`read ${value}`);
    await sleep(10);
  }
}

server(llm());
client();

In this example, we create an async generator to mimic an LLM that produces tokens (integers here). We then create an HTTP endpoint that wraps the generator, as well as a client that reads values from the HTTP stream. It's important to note that our generator logs producing ${i}, and our client logs read ${value}. LLM inference could take an arbitrary amount of time to complete, simulated here by a 1000ms sleep in the generator.

Stream Laziness

If you run this program, you'll notice something interesting: the LLM keeps logging producing ${i} even after the client has finished its three reads. This might seem obvious, given that the LLM generates an infinite stream of integers. However, it represents a problem: our server must maintain an ever-expanding queue of items that have been pushed in but never pulled out.

Moreover, the work involved in producing the stream is typically expensive and time-consuming, such as computation on a GPU. And what if the client aborts the in-flight request due to a network issue or some other intentional behavior?

This is where the concept of stream laziness comes into play. We should perform computations only when the client requests them. If the client no longer needs a response, we should halt production and pause the stream, thereby saving valuable GPU resources.
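On the client side, being lazy means releasing the stream as soon as we have what we need. Here is a minimal sketch of the client above that cancels its reader after three reads, closing the underlying connection so the server has a chance to notice (how the server reacts is covered in the next section):

async function lazyClient() {
  const resp = await fetch('http://localhost:8080');
  const reader = resp.body.pipeThrough(new TextDecoderStream()).getReader();

  for (let i = 0; i < 3; i++) {
    const { value } = await reader.read();
    console.log(`read ${value}`);
  }

  // We no longer need the stream: cancelling the reader tears down the
  // underlying HTTP connection, giving the server a signal to stop producing.
  await reader.cancel();
}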

Cancellation

How to handle cancellation?

The core idea is straightforward: on the server side, we need to listen for the close event and check whether the connection is still valid before pulling data from the LLM stream.

app.get('/', async (req, res) => {
  ...
  let canceled;
  req.on('close', () => canceled = true);
  do {
    ({ value, done } = await llm.next());
    ...
  } while (!done && !canceled);
});
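One more detail, not shown in the snippet: exiting the loop leaves the async generator suspended at its last yield. If the generator holds resources (for example, a try/finally that frees a GPU slot), it is worth also asking it to finish. A small sketch of that idea, building on the handler above:

req.on('close', async () => {
  canceled = true;
  // Drive the generator to completion so any try/finally cleanup runs
  // (a no-op for our toy llm(), but meaningful for a real inference backend).
  await llm.return();
});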

Implementing cancellation in Tabby

In Tabby, effective handling of code completion cancellations is crucial: it lets the server respond promptly to users' inputs while avoiding wasted model computation.

On the client side, whenever we receive a new input from a user, it's essential to abort the previous query and promptly retrieve a new response from the server.

// Demo code on the client side

let controller;

const callServer = async (prompt) => {
  controller = new AbortController();
  const signal = controller.signal;
  // 2. Call the server API to get the result for the prompt.
  const response = await fetch("/v1/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json"
    },
    body: JSON.stringify({ prompt }),
    signal
  });
};

const onChange = (e) => {
  if (controller) controller.abort(); // Abort the previous request
  callServer(e.target.value);
};

// 1. Debounce the input, e.g. by 100ms.
<input onChange={debounce(onChange, 100)} />

By employing streaming and implementing laziness semantics appropriately, all components operate smoothly and efficiently!


That's it

We would love to invite you to join our Slack community! Please feel free to reach out to us on Slack - we have channels for discussing all aspects of the product and tech, and everyone is welcome to join the conversation.

Happy hacking 😁💪🏻

· 2 min read

We are thrilled to announce the release of Tabby v0.1.1 👏🏻.

(Image: a staring tabby riding on llama.cpp, created with SDXL-botw and a Twitter post of BigCode)

Tabby users on Apple M1/M2 chips can now harness Metal inference support by using the --device metal flag, thanks to llama.cpp's awesome Metal support.
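For example, assuming a locally installed tabby binary and a Metal-capable model from the Model Directory (the model name below is just an illustration), the server can be started along these lines:

# Serve completions with Metal acceleration on an M1/M2 Mac.
tabby serve --device metal --model TabbyML/StarCoder-1B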

The Tabby team contributed support for the StarCoder series models (1B/3B/7B) to llama.cpp, making these models a better fit for on-device completion use cases.

llama_print_timings:        load time =   105.15 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    25.07 ms /     6 tokens (    4.18 ms per token,   239.36 tokens per second)
llama_print_timings:        eval time =   311.80 ms /    28 runs   (   11.14 ms per token,    89.80 tokens per second)
llama_print_timings:       total time =   340.25 ms

Inference benchmarking with StarCoder-1B on Apple M2 Max now takes approximately 340ms, compared to the previous time of around 1790ms. This represents a roughly 5x speed improvement.

This enhancement delivers a significant inference speed upgrade 🚀 and marks a meaningful milestone in Tabby's adoption on Apple devices. Check out our Model Directory to discover LLM models with Metal support! 🎁

Tip: Check out the latest Tabby updates on LinkedIn and in our Slack community! Our Tabby community is eager for your participation. ❤️

· 4 min read

Hugging Face Spaces offers an easy-to-use Nvidia GPU hosting runtime, allowing anyone to host their machine learning models or AI applications.

In this blog post, we are going to show you how to deploy a Tabby instance in Hugging Face Spaces. If you have not heard of Tabby, it’s an open-source GitHub Copilot alternative that supports code completion. Check out more details here.

How it works

Let’s first take a look at the steps needed to deploy a Tabby instance in Hugging Face. It’s super easy and you don’t need much coding knowledge. Buckle up and let’s get started.

Step 1: Create a new Hugging Face Space (link). Spaces are code repositories that host application code.

Step 2: Create a Dockerfile to capture your machine learning models' logic, and bring up a server to serve requests.

Step 3: After the space is built, you will be able to send requests to the APIs.

That's it! With the hosted APIs, now you can connect Tabby's IDE extensions to the API endpoint. Next, we will deep dive into each step with screenshots!! Everything will be done in the Hugging Face UI. No local setup is needed.

Tip: Looking to quickly start a Tabby instance? You can skip the tutorial entirely and simply create a space from this template.

Deep Dive

Create a new Space

After you create a Hugging Face account, you should be able to see the following page by clicking this link. The owner name will be your account name. Fill in a Space name, e.g. "tabbyml", and select Docker as the Space SDK. Then click "Create Space" at the bottom.

In this walkthrough, we recommend using an Nvidia T4 instance for deploying a model of ~1B parameters.

(Screenshot: Create a new Space)

Uploading Dockerfile

For advanced users, you can leverage the Git workflow; in this blog post, we will show you the UI flow instead. After you click "Create Space" in the last step, you will be directed to this page. Just ignore the main text and click "Files" in the top right corner.

(Screenshot: Docker Space)

After clicking "Files", you will see an "Add file" button. Click that, then click "Create a new file".

(Screenshot: Empty Space)

Then you will be redirected to the page below. Set the filename to “Dockerfile” and copy the content into the “Edit” input box. You can copy the code from the appendix to bring up the SantaCoder-1B model. Once ready, click the “Commit new file to main” button at the bottom.

(Screenshot: Edit Dockerfile)

Edit README

You also need to add a new line to the README.md. Click the "edit" button in the README.md file.

(Screenshot: Empty README)

Add this line "app_port: 8080" after "sdk: docker"
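After this edit, the front matter at the top of your README.md should look roughly like this (other auto-generated fields can stay as they are; the title value here is just an example):

---
title: tabbyml
sdk: docker
app_port: 8080
---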

(Screenshot: Edit README)

Click the button "Commit to main" to save the changes.

Verify Tabby is running

Click on the "App" button, you should be able to see the container is building:

(Screenshot: Space Building)

If the app comes up successfully, you should see this page:

(Screenshot: Tabby Swagger UI)

Call code completion API

Now, you are able to call the completion API. The full URL is https://{YOUR-ACCOUNT-NAME}-tabbyml.hf.space/v1/completions. In this post, the URL is https://randxie-tabbyml.hf.space/v1/completions.

To test if your APIs are up and running, use this online tool to send curl commands:

(Screenshot: curl command in the online tool)

The complete curl command can also be found in the appendix. Ensure that you have adjusted the URL to align with your Hugging Face Spaces settings!

(If you set the space to private, you will need to provide your Hugging Face access token as a bearer token in the HTTP headers, e.g. Authorization: Bearer $HF_ACCESS_TOKEN.)
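If you prefer calling the endpoint from code rather than curl, here is a minimal JavaScript sketch mirroring the request in the appendix (the URL placeholder and HF_ACCESS_TOKEN are yours to fill in):

const ENDPOINT = 'https://{YOUR-ACCOUNT-NAME}-tabbyml.hf.space/v1/completions';

const resp = await fetch(ENDPOINT, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    // Only needed if your space is private:
    // 'Authorization': `Bearer ${HF_ACCESS_TOKEN}`,
  },
  body: JSON.stringify({
    language: 'python',
    segments: {
      prefix: 'def fib(n):\n ',
      suffix: '\n return fib(n - 1) + fib(n - 2)',
    },
  }),
});
console.log(await resp.json());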

Conclusion

In this post, we covered the detailed steps for deploying a Tabby instance to Hugging Face Spaces. By following these steps, anyone is able to bring up their own code completion APIs easily.

Appendix

Dockerfile

FROM tabbyml/tabby

USER root
RUN mkdir -p /data
RUN chown 1000 /data

USER 1000
CMD ["serve", "--device", "cuda", "--model", "TabbyML/SantaCoder-1B"]

CURL Command

curl -L 'https://randxie-tabbyml.hf.space/v1/completions' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -d '{
    "language": "python",
    "segments": {
      "prefix": "def fib(n):\n ",
      "suffix": "\n return fib(n - 1) + fib(n - 2)"
    }
  }'

· One min read

We're thrilled to announce Tabby's first stable version, v0.0.1! 🎉 This marks a significant milestone in our continuous efforts to refine our platform. The Tabby API specification is now officially stable, providing a reliable foundation for future development. 🚀

📦 To enjoy these improvements, simply upgrade your Tabby instance to v0.0.1 using the image tag: tabbyml/tabby:v0.0.1. This ensures access to the latest features and optimizations.
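For instance, if you run Tabby with Docker on a GPU machine, the upgrade is just a matter of pulling the pinned tag and restarting with your usual serve flags (the flags below are illustrative; keep whatever you use today):

# Pull the pinned release and restart your instance on it.
docker pull tabbyml/tabby:v0.0.1
docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby:v0.0.1 serve --model TabbyML/SantaCoder-1B --device cuda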

📚 For more details and to access the release, visit our GitHub repository at https://github.com/TabbyML/tabby/releases/tag/v0.0.1.

We deeply appreciate your ongoing support and feedback, which has been instrumental in shaping Tabby's development. 🙏 We're committed to delivering excellence and innovation as we move forward.

Thank you for being part of the Tabby community! ❤️