Deploy BLIP with Cog HTTP Prediction Server
Cog is an open-source tool that helps simplify the work of building inference applications for various AI models. It provides CLI tools, Python modules for prediction and fine-tuning, and an HTTP prediction server using FastAPI.
Using the Cog HTTP prediction server, we mainly need to provide two Python functions: one to download and load the models, and another to run the inference. The server handles everything else, such as input/output, logging, health checks and exception handling. It supports synchronous prediction, streaming output and asynchronous prediction with a webhook; its health-check feature is particularly useful, reporting the server's running status (STARTING, READY, BUSY or FAILED).
The Cog prediction server can be further customized to meet specific needs. For instance, its path operation functions for health checks and predictions are declared with `async def` by default, so they all run on the event loop in the main thread. By declaring the predictions function with `def` instead, a worker thread is used to run the prediction when a request arrives. This setup can prevent long-running synchronous predictions from blocking timely responses to health queries. Another example is IPv6 support: the server is hardcoded to listen on an IPv4 address, but we can modify the code to use IPv6 by replacing `0.0.0.0` with `::` when launching its underlying Uvicorn server. The sketch below illustrates both customizations.
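The following is a minimal FastAPI/Uvicorn sketch of these two customizations; it is not Cog's actual code, and the endpoint names and port here are placeholders that only mirror the server's behavior:

```python
import time

import uvicorn
from fastapi import FastAPI

app = FastAPI()


@app.get("/health-check")
async def health_check():
    # Runs on the event loop and returns immediately.
    return {"status": "READY"}


@app.post("/predictions")
def predictions():
    # Declared with plain `def`: FastAPI runs it in a worker thread, so a
    # long-running synchronous prediction does not block the health checks.
    time.sleep(30)  # stand-in for slow, blocking inference
    return {"status": "succeeded"}


if __name__ == "__main__":
    # Listen on IPv6 by using "::" instead of the default "0.0.0.0".
    uvicorn.run(app, host="::", port=5000)
```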
BLIP (Bootstrapping Language-Image Pre-training) supports multiple image-to-text tasks, such as Image Captioning, Visual Question Answering and Image-Text Matching. Each task requires a dedicated, fine-tuned BLIP model of 1~2 GB. We can run the three task-specific models simultaneously on a SaladCloud node with a GPU that has 8 GB of VRAM.
LAVIS (A Library for Language-Vision Intelligence) is a Python deep learning library that provides unified access to pretrained models, datasets and tasks for multimodal applications, including BLIP, CLIP and others.
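To give a sense of the unified API, here is a minimal captioning sketch based on the LAVIS documentation; the model name and type (blip_caption, base_coco) are taken from the LAVIS model zoo and may differ from the code in the repo:

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained BLIP captioning model and its image preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

raw_image = Image.open("demo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Generate a caption for the image.
print(model.generate({"image": image}))
```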
Let’s use BLIP as an example to see how to build a publicly-accessible and scalable inference endpoint using the Cog HTTP prediction server on SaladCloud, capable of handling various image-to-text tasks.
Build the container image
The following four files are necessary for building the image, and we also provide some test code in the GitHub repo.
The yaml file defines how to build a Docker image and how to run predictions. Because this example only uses the Cog HTTP prediction server (not its CLI tools), the file is quite simple. When the prediction server is launched, it reads the file, sets the number of Uvicorn worker processes to 1 (when the GPU is enabled), and runs the provided code, `predict.py`, for inference.
Based on the PyTorch base image, we only need to install two Python packages and their dependencies for this application.
You can download the models in advance and build them into the container image. This way, when the workload is running on SaladCloud, it can start inference immediately. Alternatively, the models can be downloaded dynamically when the container starts. This approach has the advantage of a smaller image, allowing for faster builds and pushes. A pre-download helper might look like the sketch below.
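The following is a sketch of a hypothetical build-time helper (here called download_models.py) that pre-fetches the three BLIP models into the local cache; the model names and types are assumptions based on the LAVIS model zoo:

```python
# download_models.py (hypothetical helper): pre-fetch the three BLIP models at
# image-build time so the container can start inference immediately.
from lavis.models import load_model_and_preprocess

for name, model_type in [
    ("blip_caption", "base_coco"),
    ("blip_vqa", "vqav2"),
    ("blip_image_text_matching", "base"),
]:
    # Loading on CPU is enough to populate the local cache with the weights.
    load_model_and_preprocess(
        name=name, model_type=model_type, is_eval=True, device="cpu"
    )
```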
For the inbound connection, the containers running on SaladCloud need to listen on an IPv6 port. The Cog HTTP prediction server is currently hardcoded to use an IPv4 port, but this can be easily modified by a `sed` command in the Dockerfile.
The Predictor class is implemented with two member functions that are called by the Cog prediction server (a sketch of the class follows the list):
setup() downloads and loads the three models onto the GPU.
predict() runs inference on the inputs and returns the results.
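Here is a minimal sketch of `predict.py`, assuming Cog's Python API (BasePredictor, Input, Path) and the LAVIS model zoo names; the actual implementation in the repo may differ:

```python
import torch
from PIL import Image
from cog import BasePredictor, Input, Path
from lavis.models import load_model_and_preprocess


class Predictor(BasePredictor):
    def setup(self):
        """Download (if needed) and load the three BLIP models onto the GPU."""
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.models = {}
        for task, (name, model_type) in {
            "image_captioning": ("blip_caption", "base_coco"),
            "visual_question_answering": ("blip_vqa", "vqav2"),
            "image_text_matching": ("blip_image_text_matching", "base"),
        }.items():
            model, vis_processors, txt_processors = load_model_and_preprocess(
                name=name, model_type=model_type, is_eval=True, device=self.device
            )
            self.models[task] = (model, vis_processors, txt_processors)

    def predict(
        self,
        image: Path = Input(description="Input image"),
        task: str = Input(
            default="image_captioning",
            choices=[
                "image_captioning",
                "visual_question_answering",
                "image_text_matching",
            ],
        ),
        question: str = Input(default="", description="Required for VQA"),
        caption: str = Input(default="", description="Required for image-text matching"),
    ) -> str:
        """Run inference for the selected task and return the result as text."""
        model, vis_processors, txt_processors = self.models[task]
        raw_image = Image.open(str(image)).convert("RGB")
        img = vis_processors["eval"](raw_image).unsqueeze(0).to(self.device)

        if task == "image_captioning":
            return model.generate({"image": img})[0]

        if task == "visual_question_answering":
            q = txt_processors["eval"](question)
            return model.predict_answers(
                samples={"image": img, "text_input": q}, inference_method="generate"
            )[0]

        # image_text_matching: return the matching probability as text.
        txt = txt_processors["eval"](caption)
        itm_logits = model({"image": img, "text_input": txt}, match_head="itm")
        prob = torch.softmax(itm_logits, dim=1)[:, 1].item()
        return f"{prob:.4f}"
```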
Test the image
After the container is running, you can log into it and run some health-check and prediction tests, for example with the script below.
The Cog HTTP prediction server now listens on IPv6, and the port number is configurable via the `PORT` environment variable.
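A local test script might look like this sketch, assuming the server listens on [::1] at $PORT (default 5000) and exposes Cog's /health-check and /predictions endpoints; the image URL is a placeholder:

```python
import os

import requests

port = os.environ.get("PORT", "5000")
base = f"http://[::1]:{port}"

# Health check: the returned status should be READY once the models are loaded.
print(requests.get(f"{base}/health-check").json())

# Synchronous prediction: inputs are passed under the "input" key; image inputs
# can be provided as an HTTP(S) URL or a base64 data URI.
payload = {
    "input": {
        "image": "https://example.com/demo.jpg",  # placeholder URL
        "task": "image_captioning",
    }
}
resp = requests.post(f"{base}/predictions", json=payload)
print(resp.json())  # expected fields include "status" and "output"
```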
Deploy the image on SaladCloud
Create a container group with the following parameters:
The readiness probe is used to evaluate whether a container is ready to accept traffic from the load balancer. With the exec protocol, the probe runs the given command inside the container; if the command returns an exit code of 0, the container is considered healthy, and any other exit code indicates the container is not ready yet. A Python script (see the sketch below) is provided and run regularly to check whether the models have been loaded successfully and the Cog HTTP prediction server is ready.
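A readiness-check script along these lines could be used; this is a sketch that assumes the server's /health-check endpoint and the `PORT` environment variable, and the actual script in the repo may differ:

```python
# check_readiness.py: exit 0 only when the Cog HTTP prediction server
# reports READY (i.e., the models have been loaded successfully).
import json
import os
import sys
import urllib.request

port = os.environ.get("PORT", "5000")
url = f"http://[::1]:{port}/health-check"

try:
    with urllib.request.urlopen(url, timeout=5) as resp:
        status = json.load(resp).get("status")
except Exception:
    sys.exit(1)  # server not reachable yet -> not ready

sys.exit(0 if status == "READY" else 1)
```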
Test the inference endpoint
After the container group is deployed, an access domain name is created and can be used to access the application, for example:
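The sketch below sends a VQA request to the endpoint; the access domain and image URL are placeholders to be replaced with your own values from the SaladCloud portal:

```python
import requests

base = "https://your-access-domain.example"  # placeholder access domain name

payload = {
    "input": {
        "image": "https://example.com/demo.jpg",  # placeholder image URL
        "task": "visual_question_answering",
        "question": "What is in the picture?",
    }
}
resp = requests.post(f"{base}/predictions", json=payload, timeout=120)
print(resp.json().get("output"))
```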