Run NVIDIA Triton Server on SaladCloud
Introduction
Triton Inference Server is an open-source, high-performance inference serving software that facilitates the deployment of machine learning models in production environments. Developed by NVIDIA, Triton provides a flexible and scalable solution for serving deep learning models.
Triton Inference Server supports a wide range of deep learning and machine learning frameworks, including TensorFlow, PyTorch, Python, ONNX, NVIDIA® TensorRT™, RAPIDS™ cuML, XGBoost, scikit-learn RandomForest, OpenVINO, custom C++, and more. Its primary use cases are:
- Serving multiple models from a single server instance.
- Dynamic model loading and unloading without server restart.
- Ensemble inference, allowing multiple models to be chained together to produce a single result.
- Model versioning for A/B testing and rolling updates.
Prerequisites
- Pick an official NVIDIA Triton Server Docker image. Official images are published on NVIDIA NGC as nvcr.io/nvidia/tritonserver.
- Create a Model Repository. The model repository is the directory where you place the models that you want Triton to serve. Ensure your models are organized in the following folder structure:
- model-name/: The name can be anything, but keep in mind you need it to send requests to the correct model. Example: yolo
- 1/: A numbered subdirectory for each model version, with "1" representing the first version.
- config.pbtxt: This file stores the model configuration in a text-based format, specifying details like input/output tensors, metadata, and backend used.
- model_file: The model itself.
You can upload multiple models and multiple versions of each model.
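For reference, a single-model repository could be laid out like this (the folder name model_repository and the file names yolo and model.onnx are placeholders):

```
model_repository/
└── yolo/                 # model-name: used when sending requests to this model
    ├── config.pbtxt      # model configuration: input/output tensors, backend, etc.
    └── 1/                # version subdirectory
        └── model.onnx    # the model file itself
```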
Deploy on SaladCloud
To deploy a Triton Server on SaladCloud, create a Docker image using the official Triton Server image as a base. Include your models in the image and route IPv6 requests to the Triton Server HTTP port.
Step 1: Create a Dockerfile and add a Base Image
Create a new file called Dockerfile and open it in your preferred text editor. Add the base image to your Dockerfile. Example:
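A minimal starting point might look like the following; the specific tag is only an example, so pick the Triton release that matches your CUDA driver and framework versions:

```dockerfile
# Example base image — the tag 24.08-py3 is illustrative; choose the release you need
FROM nvcr.io/nvidia/tritonserver:24.08-py3
```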
Step 2: Copy Your Models to the Image
Make sure the folder containing your model is structured as described in the prerequisites, then copy it into your image by updating the Dockerfile. Example:
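For instance, assuming your local model repository folder is named model_repository and you want it at /models inside the image:

```dockerfile
# Copy the local model repository into the image (both paths are illustrative)
COPY model_repository/ /models/
```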
Step 3: Enable IPv6
Refer to the official SaladCloud documentation on enabling IPv6: Enabling IPv6
Create a shell script that runs the Triton server with your model and routes traffic from container port 80 to the Triton HTTP port 8000 using socat. That shell script will be the entrypoint of the image.
start_services.sh
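A minimal sketch of what start_services.sh could look like, assuming the model repository was copied to /models inside the image:

```bash
#!/bin/bash
# Sketch of start_services.sh — assumes the model repository lives at /models.

# Start Triton; its HTTP endpoint listens on port 8000 by default.
tritonserver --model-repository=/models &

# Forward IPv6 traffic arriving on container port 80 to Triton's HTTP port 8000.
socat TCP6-LISTEN:80,fork,reuseaddr TCP4:127.0.0.1:8000 &

# Exit if either process stops so the container restarts cleanly.
wait -n
```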
Step 4: Complete the Dockerfile
Add the script to the Docker image and configure it to run on container startup. Here is the full Dockerfile example:
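The sketch below pulls the previous steps together; the image tag and paths are assumptions, so adjust them to your setup:

```dockerfile
# Illustrative full Dockerfile — tag and paths are placeholders
FROM nvcr.io/nvidia/tritonserver:24.08-py3

# Install socat so port 80 traffic can be forwarded to Triton's HTTP port
RUN apt-get update && \
    apt-get install -y --no-install-recommends socat && \
    rm -rf /var/lib/apt/lists/*

# Copy the model repository and the startup script
COPY model_repository/ /models/
COPY start_services.sh /usr/local/bin/start_services.sh
RUN chmod +x /usr/local/bin/start_services.sh

# SaladCloud routes IPv6 requests to this container port
EXPOSE 80

ENTRYPOINT ["/usr/local/bin/start_services.sh"]
```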
Step 5: Build your image and push it to Docker Hub (or the container registry of your choice)
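For example (the image name is a placeholder; replace it with your own registry and repository):

```bash
# Build the image from the Dockerfile in the current directory, then push it
docker build -t yourusername/triton-salad:latest .
docker push yourusername/triton-salad:latest
```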
Step 6: Deploy your image on SaladCloud, using either the Portal or the SaladCloud Public API
Follow these steps to deploy your container on SaladCloud: Quickstart - SaladCloud
Usage Example
Accessing the Application: Copy the Access Domain Name from the Container Group created above. Detailed instructions on how to find it can be found here: Setup Container Gateway
To test the solution, we deployed a Triton server running a YOLO model. To send requests to it, run the following Python script:
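The script below is a sketch of such a client using the tritonclient HTTP API. The input/output tensor names ("images" and "output0"), the 640x640 FP32 input layout, the [0, 1] scaling, and the test.jpg file are assumptions based on a typical YOLO ONNX export, so adjust them to match your model's config.pbtxt:

```python
import numpy as np
from PIL import Image
import tritonclient.http as httpclient

# Replace with your container group's Access Domain Name and your model's name
TRITON_URL = "romaine-fennel.salad.cloud"
MODEL_NAME = "yolo"

# Image preprocessing: load, resize to 640x640, scale to [0, 1], convert HWC -> CHW,
# and add a batch dimension, giving shape (1, 3, 640, 640)
image = Image.open("test.jpg").convert("RGB").resize((640, 640))
img = np.asarray(image, dtype=np.float32) / 255.0
img = np.transpose(img, (2, 0, 1))[np.newaxis, ...]

# Triton client setup; ssl=True because the access domain is served over HTTPS
client = httpclient.InferenceServerClient(url=TRITON_URL, ssl=True)

# Prepare the input tensor and attach the image data
inputs = [httpclient.InferInput("images", list(img.shape), "FP32")]
inputs[0].set_data_from_numpy(img)

# Specify the output tensor to retrieve
outputs = [httpclient.InferRequestedOutput("output0")]

# Run inference and print the raw detection results
result = client.infer(model_name=MODEL_NAME, inputs=inputs, outputs=outputs)
print(result.as_numpy("output0"))
```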
Explanation:
- Image Preprocessing: The image is loaded, resized to 640x640, and converted to the required format (CHW).
- Triton Client Setup: A Triton client is created to communicate with the server.
- Inputs and Outputs: The input tensor is prepared and set with the image data. The output tensor is specified to retrieve the inference results.
- Inference: The infer method is called to perform inference, and the results are printed.
Replace romaine-fennel.salad.cloud with your Access Domain Name, and yolo with the name of your model.
By following these steps, you can successfully deploy, manage, and test the NVIDIA Triton Inference Server on SaladCloud, enabling high-performance serving of your machine learning models.