Modal
Modal is a serverless GPU provider. By leveraging Modal, your Tabby instance will run on demand. When there are no requests to the Tabby server for a certain amount of time, Modal will schedule the container to sleep, thereby saving GPU costs.
Setup
First, we import the components we need from modal.
from modal import Image, App, asgi_app, gpu
Next, we set the base Docker image version and which models to serve, taking care to specify a GPU configuration with enough VRAM to fit the models.
IMAGE_NAME = "tabbyml/tabby"
MODEL_ID = "TabbyML/StarCoder-1B"
CHAT_MODEL_ID = "TabbyML/Qwen2-1.5B-Instruct"
EMBEDDING_MODEL_ID = "TabbyML/Nomic-Embed-Text"
GPU_CONFIG = gpu.T4()
TABBY_BIN = "/opt/tabby/bin/tabby"
Currently supported GPUs in Modal:
- T4: Low-cost GPU option, providing 16GiB of GPU memory.
- L4: Mid-tier GPU option, providing 24GiB of GPU memory.
- A100: The most powerful GPU available in the cloud. Available in 40GiB and 80GiB GPU memory configurations.
- H100: The flagship data center GPU of the Hopper architecture. Enhanced support for FP8 precision and a Transformer Engine that provides up to 4X faster training over the prior generation for GPT-3 (175B) models.
- A10G: A10G GPUs deliver up to 3.3x better ML training performance, 3x better ML inference performance, and 3x better graphics performance, in comparison to NVIDIA T4 GPUs.
- Any: Selects any one of the GPU classes available within Modal, according to availability.
For detailed usage, please check the official Modal GPU reference.
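For example, if your model does not fit in the T4's 16GiB, you can swap in a larger card by changing only the GPU configuration. A minimal sketch, using the same gpu module imported above:

# Swap the T4 for a 24GiB A10G; the other classes listed above
# (gpu.L4(), gpu.A100(), gpu.H100(), gpu.Any()) follow the same pattern.
GPU_CONFIG = gpu.A10G()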
Define the container image
We want to create a Modal image which has the Tabby model cache pre-populated. The benefit of this is that the container no longer has to re-download the models - instead, it will take advantage of Modal's internal filesystem for faster cold starts.
Download the weights
def download_model(model_id: str):
    import subprocess

    # Fail the image build early if the download does not succeed.
    subprocess.run(
        [
            TABBY_BIN,
            "download",
            "--model",
            model_id,
        ],
        check=True,
    )
Image definition
We'll start from the official tabby image and override the default ENTRYPOINT so that Modal can run its own entrypoint, which enables seamless serverless deployments.
Next, we run the download step to pre-populate the image with our model weights.
Finally, we install asgi-proxy-lib to interface with Modal's ASGI webserver over localhost.
image = (
    Image.from_registry(
        IMAGE_NAME,
        add_python="3.11",
    )
    .dockerfile_commands("ENTRYPOINT []")
    .run_function(download_model, kwargs={"model_id": EMBEDDING_MODEL_ID})
    .run_function(download_model, kwargs={"model_id": CHAT_MODEL_ID})
    .run_function(download_model, kwargs={"model_id": MODEL_ID})
    .pip_install("asgi-proxy-lib")
)
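Note that each .run_function step is cached as its own image layer, so changing one of the model IDs above should only rebuild that step and the ones after it, rather than re-downloading every model.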
The app function
The endpoint function is decorated with Modal's @app.function. Here, we:
- Launch the Tabby process and wait for it to be ready to accept requests.
- Create an ASGI proxy to tunnel requests from the Modal web endpoint to the local Tabby server.
- Specify that each container is allowed to handle up to 10 requests simultaneously.
- Keep idle containers for 2 minutes before spinning them down.
app = App("tabby-server", image=image)


@app.function(
    gpu=GPU_CONFIG,
    allow_concurrent_inputs=10,
    container_idle_timeout=120,
    timeout=360,
)
@asgi_app()
def app_serve():
    import socket
    import subprocess
    import time

    from asgi_proxy import asgi_proxy

    launcher = subprocess.Popen(
        [
            TABBY_BIN,
            "serve",
            "--model",
            MODEL_ID,
            "--chat-model",
            CHAT_MODEL_ID,
            "--port",
            "8000",
            "--device",
            "cuda",
            "--parallelism",
            "1",
        ]
    )

    # Poll until the webserver at 127.0.0.1:8000 accepts connections before running inputs.
    def tabby_ready():
        try:
            socket.create_connection(("127.0.0.1", 8000), timeout=1).close()
            return True
        except (socket.timeout, ConnectionRefusedError):
            # Check if the launcher webserving process has exited.
            # If so, a connection can never be made.
            retcode = launcher.poll()
            if retcode is not None:
                raise RuntimeError(f"launcher exited unexpectedly with code {retcode}")
            return False

    while not tabby_ready():
        time.sleep(1.0)

    print("Tabby server ready!")
    return asgi_proxy("http://localhost:8000")
Serve the app
Once we deploy this app with modal serve app.py, it will output the URL of the web endpoint, in the form https://<USERNAME>--tabby-server-app-serve-dev.modal.run.
If you encounter any issues, particularly related to caching, you can force a rebuild by running MODAL_FORCE_BUILD=1 modal serve app.py. This ensures that the latest image tag is used by ignoring cached layers.
Now it can be used as the Tabby server URL in Tabby editor extensions! See app.py for the full code used in this tutorial.
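To sanity-check the deployment before wiring up an editor, you can query Tabby's health endpoint. A minimal sketch, assuming the requests package is installed; substitute the URL printed by modal serve:

import requests

# Placeholder endpoint URL; replace with the one printed by `modal serve`.
TABBY_URL = "https://<USERNAME>--tabby-server-app-serve-dev.modal.run"

# A 200 response confirms the container started and the ASGI proxy is
# forwarding requests. The generous timeout allows for a cold start.
resp = requests.get(f"{TABBY_URL}/v1/health", timeout=120)
resp.raise_for_status()
print(resp.json())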