Run Hugging Face Models with Ollama (45,000 models)
Use Ollama with any GGUF Model on Hugging Face Hub on SaladCloud.
Introduction to Ollama
Ollama, built on the llama.cpp framework, now seamlessly integrates with a vast collection of GGUF format language models available on Hugging Face. With over 45,000 public GGUF checkpoints, users can effortlessly run any of these models on SaladCloud with minimal setup. This integration offers flexibility in selecting models, customizing quantization schemes, and other options, making it one of the simplest and most efficient ways to deploy and use language models.
Run any Hugging Face Model with Ollama on SaladCloud
You can deploy any Hugging Face LLM with Ollama on SaladCloud by passing the model as an environment variable during deployment. Pick the model you want here: HF models. The environment variable MODEL should follow the format below, allowing you to specify the model from Hugging Face, including optional quantization settings:
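A minimal sketch of the expected value, following Ollama's hf.co model reference format and assuming the container image passes `MODEL` straight through to Ollama:

```
MODEL=hf.co/{username}/{repository}
```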
Here are examples of models you can try:
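For instance (these are public GGUF repositories on the Hugging Face Hub; any repository with GGUF files works the same way):

```
MODEL=hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
MODEL=hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF
MODEL=hf.co/arcee-ai/SuperNova-Medius-GGUF
```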
Custom Quantization
By default, Ollama uses the Q4_K_M quantization scheme if it's available in the model repository. You can manually select a quantization scheme by specifying it in the MODEL environment variable. To find the available quantization options, open the model's Hugging Face page, choose ollama from the "Use this model" dropdown, and pick the quantization you want.
To specify a custom quantization, follow this format:
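A sketch of the value, with the quantization tag appended after a colon (tags correspond to the GGUF quantization names listed in the repository, e.g. Q4_K_M, Q8_0, IQ3_M):

```
MODEL=hf.co/{username}/{repository}:{quantization}
```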
Example with Custom Quantization:
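For instance, pinning the IQ3_M build of a public Llama 3.2 GGUF repository (a representative choice; substitute any quantization tag the repository actually publishes):

```
MODEL=hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:IQ3_M
```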
Deploying Hugging Face Models with Ollama on SaladCloud
To run Hugging Face models on SaladCloud using Ollama, follow one of these deployment options:
Option 1: Fastest Way (Pre-built Recipe)
We have a pre-built recipe for deploying Llama 3.1 with Ollama on SaladCloud. This recipe can also be used to run any additional model from Hugging Face, so you will end up with both Llama 3.1 and the model of your choice. Full configuration takes less than a minute.
- Click Deploy a Container Group and choose the Ollama Llama 3.1 recipe.
- Add an environment variable `MODEL` set to the desired model (as specified above).
- Continue through the steps (the default setup is 8 vCPUs, 8 GB RAM, and a 12 GB GPU, RTX 3060). For better performance, select a higher-end GPU and adjust the other parameters.
- On the final page, ensure Autostart is checked, then click Deploy.
Option 2: Custom Container Group
- Click Deploy a Container Group and choose Custom Container Group.
- Set a deployment name. Edit the image source, enter `saladtechnologies/ollama-hf:1.0.0` as the image name, then click Configure.
- Edit the Environment Variables and add `MODEL` set to the desired Hugging Face model (as specified above), then move to the next page.
- Select the desired CPU, RAM, GPU, storage, and priority for the deployment.
- Add a Container Gateway:
  - Click Enable, set the port to `11434`, and select Least number of connections as the load balancer algorithm.
  - Optionally, limit each server to a single active connection.
- Add a Startup Probe:
  - Click Enable, set the path to `/` and the port to `11434`. Set the probe type to `HTTP` and the initial delay to the desired number of seconds.
- Ensure Autostart is checked, then click Deploy.
Use your Deployment
Once the deployment is complete, click on the deployment name to access the deployment details. To verify the model was loaded, you can open the terminal and run the following command:
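For example, assuming the standard Ollama CLI is available inside the container, `ollama list` prints every model the server has pulled; the model you set in `MODEL` should appear in its output:

```
ollama list
```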
How To Send Requests
Once your Ollama server is running, you can send requests to interact with the model. Follow the instructions provided in the OpenAI Documentation file to learn how to properly structure and send requests to the API.
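As a sketch, here is a chat completion request against Ollama's OpenAI-compatible endpoint. The access domain and model name are placeholders; use the Container Gateway URL from your deployment page and the value you set in `MODEL`:

```
curl https://your-access-domain.salad.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF",
    "messages": [
      {"role": "user", "content": "Write a haiku about GPUs."}
    ]
  }'
```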