Key Considerations

To run LLMs on SaladCloud, several key decisions need to be made first:

Model: There are many LLMs available, open-source or proprietary, in various sizes, precisions and formats. LLM inference typically requires a significant amount of VRAM, not just for the model parameters but also for the KV-Cache built up during inference, which is heavily influenced by the batch size and context length. Models in the 7B, 8B and 9B range, as well as quantized 13B and 34B models, can run smoothly on a GPU with 24 GB of VRAM on SaladCloud, while larger models are not recommended.

Inference Server: Several inference servers are available for running LLM models, including vLLM, TGI, Ollama, TensorRT-LLM, RayLLM, LMDeploy and llama.cpp. These servers differ in various aspects, including supported models, VRAM usage and efficiency, performance, throughput and additional features; and you should choose one that best meets your needs. Most of these servers provide container images that can be run directly on SaladCloud without modification. However, you can also create a custom wrapper image based on official images to add features like an I/O worker or model preloading. Alternatively, you can build your own inference server tailored to specific requirements.

Service Access: If you need to provide real-time and on-demand chat services, deploying a container group with a container gateway is the easiest approach. However, if the requirement is to build a batch processing system where millions of prompts are processed without the need for immediate responses, a container group integrated with a job queue offers a more resilient and cost-effective solution.

VRAM Usage

LLM inference consists of two stages: prefill and decode. In the prefill stage, the model processes the entire input prompt and computes its embeddings. In the decode stage, the model generates one token at a time, conditioning on the prompt and all previously generated tokens.

VRAM usage during LLM inference is primarily driven by the model parameters and the KV-Cache. The KV-Cache stores the key/value embeddings of the input prompt and all previously generated tokens. These embeddings are computed once and retained in memory for reuse during the decode stage of inference, which can significantly speed up the process. As the context length (encompassing both the input prompt and generated text) and batch size (the number of input prompts processed simultaneously) increase, the VRAM consumption of the KV-Cache can grow significantly.

For example, with LLaMA 3.1 8B Instruct in 16-bit precision, the model parameters alone require approximately 16 GB of VRAM. The KV-Cache, for a batch size of 1 and a context length of 4096, uses about 2 GB of VRAM. A GPU with 24 GB of VRAM can therefore handle batched inference with roughly the following combinations (see the estimation sketch after this list):

  • A batch size of 16 and a context length of 1024
  • A batch size of 8 and a context length of 2048
  • A batch size of 4 and a context length of 4096
  • A batch size of 1 and a context length of 16384
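
As a sanity check, the KV-Cache size can be estimated directly from the model architecture. Below is a minimal Python sketch; the layer count, head count and head dimension are assumptions taken from the LLaMA 3.1 8B architecture, and the formula ignores grouped-query attention and framework overhead, so treat the result as a rough upper bound:

# Rough upper-bound estimate of KV-Cache VRAM, ignoring grouped-query attention
# and framework overhead. Architecture numbers are assumptions for LLaMA 3.1 8B
# in 16-bit precision.

def kv_cache_gib(batch_size: int, context_length: int,
                 n_layers: int = 32, n_heads: int = 32,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    # Two tensors (K and V) per layer, per head, per token, per sequence.
    total_bytes = (2 * n_layers * n_heads * head_dim
                   * context_length * batch_size * bytes_per_value)
    return total_bytes / 1024**3

model_weights_gib = 16  # ~8B parameters x 2 bytes

for batch, ctx in [(16, 1024), (8, 2048), (4, 4096), (1, 16384)]:
    kv = kv_cache_gib(batch, ctx)
    print(f"batch={batch:>2}, context={ctx:>5}: "
          f"KV-Cache ~{kv:.1f} GiB, total ~{model_weights_gib + kv:.1f} GiB")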

Processing too many requests simultaneously with excessively long context lengths can cause a node to run out of VRAM, forcing the inference process to use RAM instead. This shift can significantly degrade performance or result in errors.

Most inference servers provide options to manage VRAM usage, including MAX_INPUT_TOKENS (the maximum prompt length that users can send), MAX_TOTAL_TOKENS (the maximum context length, including both the input prompt and the generated text) and MAX_BATCH_TOTAL_TOKENS (the maximum context length multiplied by the maximum batch size). Given how sensitive LLM inference is to VRAM, these parameters must be carefully planned, tested and configured to meet service requirements and prevent VRAM exhaustion during inference.
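
For illustration, these limits should be mutually consistent and, together with the model weights, fit within the GPU's VRAM. Here is a hedged sketch of that check, reusing the rough KV-Cache formula from the earlier sketch (all numbers are illustrative, not prescriptive):

# Illustrative consistency check for the token limits of an inference server.
MAX_INPUT_TOKENS = 2048        # longest prompt a user may send
MAX_TOTAL_TOKENS = 4096        # prompt + generated text per request
MAX_BATCH_SIZE = 4             # requests batched together for inference
MAX_BATCH_TOTAL_TOKENS = MAX_TOTAL_TOKENS * MAX_BATCH_SIZE  # 16384

assert MAX_INPUT_TOKENS < MAX_TOTAL_TOKENS

# The rough upper-bound KV-Cache estimate plus ~16 GiB of model weights
# must stay within the 24 GiB VRAM budget.
kv_cache_gib = 2 * 32 * 32 * 128 * MAX_BATCH_TOTAL_TOKENS * 2 / 1024**3  # = 8.0
assert 16 + kv_cache_gib <= 24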

Performance and Throughput

There are two key indicators for measuring the performance of LLM inference on a node:

Time to First Token (TTFT): the speed of prefill, as nothing can be generated before the embeddings of the input prompt are computed. This determines how long users wait before seeing the first generated token.

Time Per Output Token (TPOT): the decoding speed, i.e., the average time taken to generate each token after prefill; its reciprocal is the decode throughput in tokens per second. This affects the overall processing time required to deliver the entire generated text.

Typically, the prefill stage is brief because the prompt is processed in parallel. However, the decode stage takes much longer since tokens are generated one at a time. For the same context length, the longer the generated text, the greater the total processing time.

Here is the test data for LLaMA 3.1 8B Instruct using Hugging Face’s TGI, measured on a PC equipped with an RTX 3090 running WSL2 on Windows:

  • Processing Time = TPOT x Number of Generated Tokens
  • Throughput = 1 / TPOT
  • Total Time = TTFT + Processing Time

| Context Length: 16384, Batch Size: 1 | Prefill: TTFT | Decode: TPOT | Decode: Processing Time | Decode: Throughput | Total Time |
|---|---|---|---|---|---|
| Input Prompt: 1024, Generated Text: 15360 | 245.98 ms | 22.33 ms | 342955 ms | 44.78 tokens/sec | 343201 ms |
| Input Prompt: 4096, Generated Text: 12288 | 934.13 ms | 22.66 ms | 278377 ms | 44.14 tokens/sec | 279312 ms |
| Input Prompt: 8192, Generated Text: 8192 | 1943.71 ms | 23.06 ms | 188904 ms | 43.36 tokens/sec | 190848 ms |
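
The relationships listed above can be checked against the measured numbers. Here is a quick Python verification using the first row of the table (values are copied from the measurements, so small rounding differences are expected):

# Verify the relationships using the first row of the table above.
ttft_ms, tpot_ms, generated_tokens = 245.98, 22.33, 15360

processing_time_ms = tpot_ms * generated_tokens   # ~342,989 ms (measured: 342,955 ms)
throughput = 1000 / tpot_ms                       # ~44.78 tokens/sec
total_time_ms = ttft_ms + processing_time_ms      # ~343,235 ms (measured: 343,201 ms)

print(f"{processing_time_ms:.0f} ms, {throughput:.2f} tokens/sec, {total_time_ms:.0f} ms")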

Longer context lengths not only increase waiting time and negatively impact the user experience, but may also lead to timeout errors at the load balancer in front of the inference servers, which is typically set to a 100-second timeout. Enabling token streaming on the servers allows tokens to be returned one by one, rather than waiting for the entire response to be generated. This shows the generation progress in real time, significantly enhancing the user experience and helping to avoid timeout errors.

When more VRAM is available, batched inference can significantly increase throughput by effectively leveraging GPU cache and parallel processing, while only slightly increasing the processing time. Here is the test data from the same PC:

| Context Length: 4096 (2048 + 2048) | Prefill: TTFT | Decode: TPOT | Decode: Processing Time | Decode: Throughput | Total Time |
|---|---|---|---|---|---|
| Batch Size 1 | 459.07 ms | 21.15 ms | 43288 ms | 47.29 tokens/sec | 43746 ms |
| Batch Size 2 | 882.91 ms | 22.35 ms | 45756 ms | 89.47 tokens/sec | 46639 ms |
| Batch Size 3 | 1326.06 ms | 23.06 ms | 47553 ms | 129.14 tokens/sec | 48880 ms |
| Batch Size 4 | 1731.33 ms | 23.96 ms | 49046 ms | 166.94 tokens/sec | 50778 ms |

Compared to batched inference, concurrent inference using multiprocessing or multithreading on a single GPU might limit optimal GPU cache utilization and significantly impact performance, and is generally not recommended. This occurs because each inference process operates independently, reducing the efficiency of GPU resource usage. Additionally, multiprocessing consumes more VRAM, as each process requires its own CUDA context and loads a separate instance of the model into GPU VRAM for inference.

Many inference servers have a local queue to buffer requests and also support continuous or dynamic batching. This means that, upon completing a request within a batch, the server can automatically add a new request from the queue to the batch and continue the inference process. These behaviors can be managed using parameters such as MAX_CONCURRENT_REQUESTS (the number of requests handled simultaneously, i.e., the queue length) and MAX_BATCH_SIZE (the number of requests grouped together for batched inference).

These parameters must be carefully planned, tested and configured to increase throughput, enhance the user experience and avoid system errors. Typically, MAX_CONCURRENT_REQUESTS should be larger than MAX_BATCH_SIZE to buffer burst traffic effectively. If it is set too high, however, requests accumulate in the inference servers’ local queues, increasing user waiting time and potentially causing timeout errors at the load balancer due to prolonged delays in responding to these requests. Requests exceeding the MAX_CONCURRENT_REQUESTS limit (the local queue length) are rejected by the servers as a backpressure mechanism. Client applications interacting with the inference servers should incorporate traffic control and retry logic to enhance resilience, as sketched below.
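
Below is a minimal sketch of such client-side retry logic with exponential backoff. The endpoint path follows TGI's OpenAI-compatible API shown later in this guide, and treating 429 responses as backpressure is an assumption about the server's behavior; adapt both to your inference server:

import time
import requests

URL = "https://your-access-domain.salad.cloud/v1/chat/completions"  # placeholder
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "How to learn AI and Machine Learning?"}],
    "max_tokens": 512,
    "stream": False,
}

def generate_with_retries(max_retries: int = 5, base_delay: float = 2.0) -> dict:
    for attempt in range(max_retries):
        try:
            resp = requests.post(URL, json=payload, timeout=100)
            if resp.status_code == 200:
                return resp.json()
            if resp.status_code not in (429, 500, 502, 503, 504):
                resp.raise_for_status()          # non-transient error: fail immediately
            # 429 (server queue full) or 5xx: treat as transient and retry below.
        except (requests.ConnectionError, requests.Timeout):
            pass                                 # network error or timeout: retry below
        time.sleep(base_delay * 2 ** attempt)    # exponential backoff
    raise RuntimeError("Request failed after retries; apply upstream traffic control")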

Container Gateway or Job Queue?

Container Gateway

Deploying a container group with a container gateway (load balancer) is the simplest approach for providing real-time and on-demand services on SaladCloud. The inference server on each instance should listen on an IPv6 port, and the container gateway can map a public URL to this IPv6 port. Optionally, you can enable authentication to access the URL using an API token.

The container gateway uses least-connection load balancing without buffering requests and has a timeout set to 100 seconds. This setup may not be ideal for long-running tasks, such as generating responses with longer context lengths or keeping a high volume of requests waiting in the local queue of inference servers.

To use this solution effectively, system requirements and capabilities should be clearly defined and planned so that the inference servers can be configured properly. Deploying readiness probes is also essential to ensure that requests are only forwarded to containers when they are ready. Additionally, client applications should implement retries and traffic control mechanisms to further enhance system resilience.

Currently, all container instances, regardless of their locations, are centrally exposed by the load balancer in the U.S. While this setup is optimal for instances running in America, it may introduce additional latency for instances in other regions. However, this is generally acceptable given that LLM inference normally takes much longer (tens of seconds) than this. If applications are latency-sensitive and require local access, consider using open-source tools or services like ngrok, and implementing your own load balancer in the relevant regions.

Job Queue

A container group integrated with a job queue offers a more resilient and cost-effective solution for batch processing systems where immediate responses are not required. You can submit millions of jobs to a job queue, which could be Salad JQ or AWS SQS. Instances running an I/O worker alongside the inference server will then pull and process the jobs, subsequently uploading the generated text to predefined locations, such as cloud storage or a webhook.

The job queue features a large buffer for queuing requests and includes built-in retry logic. If an instance does not complete a job within a specified time, the job becomes available to other instances for processing. Each instance pulls new jobs only after finishing the existing ones, eliminating the need for mechanisms like backpressure, retries, and traffic control in both the inference servers and client applications.

To implement this solution, you need to build a wrapper image that adds an I/O worker to the inference server. This worker will be responsible for pulling jobs, calling the inference server, uploading the generated texts and finally returning job results or status to the job queue.
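
Here is a minimal sketch of such an I/O worker, assuming AWS SQS as the job queue and S3 for output; the queue URL, bucket name and job message format are placeholders, and a Salad JQ or webhook setup would follow the same pull, infer, upload and acknowledge loop:

import json
import boto3     # AWS SDK; pip install boto3
import requests

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/llm-jobs"  # placeholder
BUCKET = "llm-results"                                                   # placeholder
TGI_URL = "http://localhost:80/v1/chat/completions"  # local inference server

while True:
    # Long-poll for one job; pull a new job only after finishing the current one.
    messages = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20).get("Messages", [])
    for msg in messages:
        job = json.loads(msg["Body"])  # assumed format: {"id": "...", "messages": [...]}
        resp = requests.post(TGI_URL, json={
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": job["messages"],
            "max_tokens": 512,
            "stream": False,
        }, timeout=300)
        resp.raise_for_status()
        text = resp.json()["choices"][0]["message"]["content"]
        # Upload the generated text, then acknowledge the job so it is not redelivered.
        s3.put_object(Bucket=BUCKET, Key=f"results/{job['id']}.json",
                      Body=json.dumps({"id": job["id"], "output": text}))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])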

Local Deployment To SaladCloud

Before deploying an image of these inference servers on SaladCloud, we recommend testing it in a local environment first. Troubleshooting and fine-tuning parameters in the cloud can be time-consuming and complex.

For local deployment and testing, focus on the following objectives:

  • Ensure everything is functioning as expected.
  • Monitor resource usage, including CPU, RAM, GPU, and VRAM, to inform resource allocation on SaladCloud (see the monitoring sketch after this list).
  • Test system behaviors with different parameters and finalize the configuration settings.
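
For the second objective, a simple approach is to sample resource usage periodically while test traffic is running; this sketch polls nvidia-smi and the psutil library, and is only one of many ways to do it:

import subprocess
import time
import psutil  # third-party; pip install psutil

while True:
    # GPU utilization and VRAM usage via nvidia-smi.
    gpu = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True).stdout.strip()
    cpu = psutil.cpu_percent()
    ram = psutil.virtual_memory()
    print(f"CPU {cpu:5.1f}% | RAM {ram.used / 1024**3:5.1f}/{ram.total / 1024**3:.1f} GiB | "
          f"GPU util %, VRAM used/total MiB: {gpu}")
    time.sleep(10)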

An ideal testing environment is a Windows PC with an NVIDIA GPU, WSL2 and Docker Desktop installed. If you don’t have access to this setup, you can perform a similar test on SaladCloud using the interactive shell (refer to Scenario 3) before proceeding with a full deployment.

Let’s use LLaMA 3.1 8B Instruct and Hugging Face’s TGI inference server as examples to guide the deployment and testing process for different scenarios. Once the local testing is complete, we can easily translate the local deployment settings to the configurations on SaladCloud.

Scenario 1: Use environment variables to supply the working parameters to the TGI server.

# Define the environment variables in the host (WSL2).

# The token is needed for authorized users to download private models or those that require agreement to terms of use on Hugging Face.
# The volume is to map a host folder to the container, allowing the model to be stored persistently and preventing the need to download it every time the container is run. 

ubuntu@server:~$ token=hf_XXXXXXXXXXXXXXXXXXXXXXXX
ubuntu@server:~$ volume=/home/ubuntu/.cache/huggingface
ubuntu@server:~$ model=meta-llama/Meta-Llama-3.1-8B-Instruct

# Run the latest TGI image.
# Pass environment variables to the container.
# Map port 8080 on the host to port 80 in the container. 
# The '--shm-size' option is not required if using only one GPU. 
 
ubuntu@server:~$ docker run -it --rm --gpus all -p 8080:80 -v $volume:/data \
-e HF_TOKEN=$token \
-e HF_HUB_ENABLE_HF_TRANSFER=0 \
-e MODEL_ID=$model \
-e HOSTNAME=0.0.0.0 \
-e PORT=80 \
-e MAX_TOTAL_TOKENS=4096 \
-e MAX_INPUT_TOKENS=2048 \
-e MAX_CONCURRENT_REQUESTS=8 \
-e MAX_BATCH_SIZE=4 \
ghcr.io/huggingface/text-generation-inference:latest

# Test the health check from the host.

ubuntu@server:~$ curl -i http://127.0.0.1:8080/health
HTTP/1.1 200 OK
vary: origin, access-control-request-method, access-control-request-headers
access-control-allow-origin: *
content-length: 0
date: Tue, 03 Sep 2024 02:30:38 GMT

# Test the inference from the host.

ubuntu@server:~$ curl http://127.0.0.1:8080/v1/chat/completions     -X POST     -d '{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a good guy!"
    },
    {
      "role": "user",
      "content": "How to learn AI and Machine Learning?"
    }
  ],
  "stream": false,
  "max_tokens": 512
}'     -H 'Content-Type: application/json'

# Test the inference (token streaming) from the host.

ubuntu@server:~$ curl http://127.0.0.1:8080/v1/chat/completions     -X POST     -d '{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a good guy!"
    },
    {
      "role": "user",
      "content": "How to learn AI and Machine Learning?"
    }
  ],
  "stream": true,
  "max_tokens": 512
}'     -H 'Content-Type: application/json'
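
The same endpoint can also be exercised from Python. TGI exposes an OpenAI-compatible /v1/chat/completions route, so the openai client library can simply be pointed at the local server (a sketch; the api_key value is a dummy because the local server does not validate it):

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="-")  # dummy key for local TGI

# Non-streaming request.
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "system", "content": "You are a good guy!"},
              {"role": "user", "content": "How to learn AI and Machine Learning?"}],
    max_tokens=512,
)
print(resp.choices[0].message.content)

# Token streaming: print tokens as they arrive.
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "How to learn AI and Machine Learning?"}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)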

# Enter the container and do some checks.

ubuntu@server:~$ docker ps
CONTAINER ID   IMAGE                                                  COMMAND                CREATED         STATUS         PORTS                  NAMES
b9d7aeb218a8   ghcr.io/huggingface/text-generation-inference:latest   "/tgi-entrypoint.sh"   4 minutes ago   Up 4 minutes   0.0.0.0:8080->80/tcp   infallible_dewdney

ubuntu@server:~$ docker exec -it b9d /bin/bash
root@b9d7aeb218a8:/usr/src#

root@b9d7aeb218a8:/usr/src# curl -i http://localhost:80/health
HTTP/1.1 200 OK
vary: origin, access-control-request-method, access-control-request-headers
access-control-allow-origin: *
content-length: 0
date: Tue, 03 Sep 2024 02:35:39 GMT

# To use the container gateway on SaladCloud, the server must be configured to listen on an IPv6 port.
# Change the HOSTNAME from '0.0.0.0' to '::' to use IPv6.

ubuntu@server:~$ docker run -it --rm --gpus all -p 8080:80 -v $volume:/data \
-e HF_TOKEN=$token \
-e HF_HUB_ENABLE_HF_TRANSFER=0 \
-e MODEL_ID=$model \
-e HOSTNAME=:: \
-e PORT=80 \
-e MAX_TOTAL_TOKENS=4096 \
-e MAX_INPUT_TOKENS=2048 \
-e MAX_CONCURRENT_REQUESTS=8 \
-e MAX_BATCH_SIZE=4 \
ghcr.io/huggingface/text-generation-inference:latest

# Enter the container and test the inference and the health check.
# Both '[::1]' and 'localhost' can be used to access the local IPv6 port.

root@a4ecfb06bab4:/usr/src# curl -i http://[::1]:80/health
HTTP/1.1 200 OK
vary: origin, access-control-request-method, access-control-request-headers
access-control-allow-origin: *
content-length: 0
date: Tue, 03 Sep 2024 03:03:57 GMT

root@a4ecfb06bab4:/usr/src# curl -i http://localhost:80/health
HTTP/1.1 200 OK
vary: origin, access-control-request-method, access-control-request-headers
access-control-allow-origin: *
content-length: 0
date: Tue, 03 Sep 2024 03:04:40 GMT

root@a4ecfb06bab4:/usr/src# curl http://[::1]:80/v1/chat/completions     -X POST     -d '{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a good guy!"
    },
    {
      "role": "user",
      "content": "How to learn AI and Machine Learning?"
    }
  ],
  "stream": true,
  "max_tokens": 512
}'     -H 'Content-Type: application/json'

Here are the corresponding configurations on SaladCloud:

# Image Source

ghcr.io/huggingface/text-generation-inference:latest

# Replica Count

3+ for testing and 5+ for production 

# Resource
# Based on the local test, 4 vCPUs and 12 GB of RAM are sufficient, and the GPU must have 24 GB of VRAM.
# You can choose multiple GPU types simultaneously.

4 vCPUs, 12GB Memory
GPU with 24GB of VRAM, RTX 3090/3090Ti/4090

# Environment Variables

HF_TOKEN ********************
HF_HUB_ENABLE_HF_TRANSFER 0
MODEL_ID meta-llama/Meta-Llama-3.1-8B-Instruct
HOSTNAME :: 
PORT 80 
MAX_TOTAL_TOKENS 4096 
MAX_INPUT_TOKENS 2048
MAX_CONCURRENT_REQUESTS 8
MAX_BATCH_SIZE 4

# Container Gateway
# The port must match the IPv6 port used by the server. 

Enabled, Port 80

# Readiness Probe
# The probe is essential to ensure that requests are only forwarded to containers when they are ready. 
# Both '[::1]' and 'localhost' can be used to access the local IPv6 port.

Enabled
Protocol: exec

Command: python
Argument1: -c
Argument2: import requests,sys;sys.exit(0 if requests.get('http://localhost:80/health').status_code == 200 else -1)

Initial Delay Seconds: 300
Period Seconds: 10
Timeout Seconds: 5
Success Threshold: 1
Failure Threshold: 3

You don’t need to add double or single quotes in Argument 2 of the readiness probe command in the Salad Portal.

After the instances are running and have passed the readiness probes, we can perform some tests using the generated access domain name:

# Test the health check

curl -i https://pomelo-turnip-ztmw72fai31is8bp.salad.cloud/health

# Test the inference 

curl https://pomelo-turnip-ztmw72fai31is8bp.salad.cloud/v1/chat/completions     -X POST     -d '{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a good guy!"
    },
    {
      "role": "user",
      "content": "How to learn AI and Machine Learning?"
    }
  ],
  "stream": false,
  "max_tokens": 512
}'     -H 'Content-Type: application/json'

# Test the inference (token streaming) 

curl https://pomelo-turnip-ztmw72fai31is8bp.salad.cloud/v1/chat/completions     -X POST     -d '{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a good guy!"
    },
    {
      "role": "user",
      "content": "How to learn AI and Machine Learning?"
    }
  ],
  "stream": true,
  "max_tokens": 512
}'     -H 'Content-Type: application/json'
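
If authentication is enabled on the container gateway, the API token must accompany each request. Below is a Python sketch of the same calls; the Salad-Api-Key header name is an assumption to be confirmed against the SaladCloud documentation, and the domain is the example one above:

import requests

BASE = "https://pomelo-turnip-ztmw72fai31is8bp.salad.cloud"  # your access domain name

# Header name is an assumption; confirm it in the SaladCloud documentation.
headers = {"Salad-Api-Key": "YOUR_SALAD_API_KEY"}

# Health check.
print(requests.get(f"{BASE}/health", headers=headers, timeout=10).status_code)

# Inference.
resp = requests.post(f"{BASE}/v1/chat/completions", headers=headers, json={
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "How to learn AI and Machine Learning?"}],
    "max_tokens": 512,
    "stream": False,
}, timeout=100)
print(resp.json()["choices"][0]["message"]["content"])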

Scenario 2: Override both the ENTRYPOINT and CMD in the image to start the TGI server.

Sometimes it is preferable to start the TGI server with additional parameters set to specific values, rather than passing many environment variables. SaladCloud offers the option to override the ENTRYPOINT and CMD in the image, which easily accommodates this requirement.

# Override both the ENTRYPOINT and CMD in the image.

# Start the TGI server with additional parameters using specific values and environment variables.
# The environment variables MAX_TOTAL_TOKENS and MAX_INPUT_TOKENS can be removed, because these parameters are explicitly provided when starting the TGI server.

ubuntu@server:~$ docker run -it --rm --gpus all -p 8080:80 -v $volume:/data \
-e HF_TOKEN=$token \
-e HF_HUB_ENABLE_HF_TRANSFER=0 \
-e MODEL_ID=$model \
-e HOSTNAME=:: \
-e PORT=80 \
-e MAX_TOTAL_TOKENS=4096 \
-e MAX_INPUT_TOKENS=2048 \
-e MAX_CONCURRENT_REQUESTS=8 \
-e MAX_BATCH_SIZE=4 \
--entrypoint 'sh' \
ghcr.io/huggingface/text-generation-inference:latest -c 'text-generation-launcher --model-id $MODEL_ID --max-total-tokens 4096 --max-input-tokens 2048'

Here is the corresponding configuration to override both the ENTRYPOINT and CMD in the image on SaladCloud:

# Command 

Command: sh
Argument1: -c
Argument2: text-generation-launcher --model-id $MODEL_ID --max-total-tokens 4096 --max-input-tokens 2048

# The other configurations remain the same as in Scenario 1.

You don’t need to add double or single quotes in Argument 2 of Command in the Salad Portal, but you do need to add them when running the ‘docker run’ command locally.

Scenario 3: Run the TGI server interactively.

Salad Portal provides an interactive terminal for each instance in SCE deployments, allowing you to interact directly with each instance to troubleshoot issues or reconfigure your application after deployment.

If you don’t have access to a local test environment with a GPU, you can use the interactive terminal on SaladCloud to test your image, check its resource usage, fine-tune and finalize server settings. Keep in mind that this is just a complement and cannot fully replace your local R&D environments for daily tasks, as Salad nodes are interruptible and may experience cold starts and occasional downtime.

# Override both the ENTRYPOINT and CMD in the image.
# Make the container sleep while running.

ubuntu@server:~$ docker run -it --rm --gpus all -p 8080:80 -v $volume:/data \
-e HF_TOKEN=$token \
-e HF_HUB_ENABLE_HF_TRANSFER=0 \
-e MODEL_ID=$model \
-e HOSTNAME=:: \
-e PORT=80 \
-e MAX_TOTAL_TOKENS=4096 \
-e MAX_INPUT_TOKENS=2048 \
-e MAX_CONCURRENT_REQUESTS=8 \
-e MAX_BATCH_SIZE=4 \
--entrypoint 'sh' \
ghcr.io/huggingface/text-generation-inference:latest -c "sleep infinity"

# Enter the container and start the TGI server manually.
root@5693c4212042:/usr/src# text-generation-launcher
root@5693c4212042:/usr/src# text-generation-launcher --model-id $MODEL_ID --max-input-tokens 2048 --max-total-tokens 4096

# Enter the container with another terminal and run the TGI benchmark.
root@5693c4212042:/usr/src# text-generation-benchmark --tokenizer-name $MODEL_ID
root@5693c4212042:/usr/src# text-generation-benchmark --tokenizer-name $MODEL_ID --batch-size 1 --sequence-length 2048 --decode-length 2048  --runs 1

Here is the corresponding configuration to override both the ENTRYPOINT and CMD in the image on SaladCloud:

# Command 

Command: sh
Argument1: -c
Argument2: sleep infinity

# The other configurations remain the same as in Scenario 1. 

You don’t need to add double or single quotes in Argument 2 of Command in the Salad Portal, but you do need to add them when running the ‘docker run’ command locally.

Once the instances are running (sleeping), you can use the interactive terminal to manually run tasks, perform various tests, and troubleshoot issues.

Recommendation for Production

The official images of these inference servers may not be sufficient for production. You can enhance your application by building a custom wrapper image with the following features:

  • Install the necessary software and tools, such as those for monitoring, troubleshooting and logging.
  • Add an I/O worker to the inference server for pulling jobs, uploading generated texts and returning job results, especially if you’re building a batch processing system.
  • Pre-load the model parameters into the image to reduce costs, as there’s no charge when nodes are downloading your images.
  • Implement initial and real-time performance checks to ensure that nodes remain in an optimal state for application execution, as the performance of Salad nodes may fluctuate over time due to their shared nature. If a node’s performance falls below a predefined threshold, the code should exit with a status of 1, triggering a reallocation (see the sketch after this list). These functions can also be achieved by deploying Startup Probes and Liveness Probes.
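
Here is a minimal sketch of such a performance check, intended to run at container start before launching the inference server; the matrix-multiplication workload and the TFLOPS threshold are placeholders that should be calibrated against measurements on known-good nodes:

import sys
import time
import torch  # assumes the image already ships PyTorch with CUDA support

THRESHOLD_TFLOPS = 20.0  # placeholder; calibrate on known-good nodes

def measure_matmul_tflops(n: int = 4096, runs: int = 10) -> float:
    # Time repeated fp16 matrix multiplications on the GPU.
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        a @ b
    torch.cuda.synchronize()
    return runs * 2 * n ** 3 / (time.time() - start) / 1e12

if __name__ == "__main__":
    tflops = measure_matmul_tflops()
    print(f"Measured ~{tflops:.1f} TFLOPS")
    if tflops < THRESHOLD_TFLOPS:
        sys.exit(1)  # non-zero exit triggers reallocation to another node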

LLM inference is both VRAM-intensive and time-consuming, and user experience is crucial for real-time services. Inference time can also vary significantly depending on the length of the generated text and the performance of the node. To successfully run LLMs behind the container gateway on SaladCloud, the system requirements, capabilities and parameters must be clearly defined, planned, tested and deployed. Here are some examples:

Context length and batch size can significantly impact system performance, throughput, and cost-effectiveness. To utilize resources efficiently, several container groups can be deployed for different scenarios, such as short and long conversations.

Parameters such as Max Total Tokens, Maximum Concurrent Requests and Batch Size are crucial and affect every aspect of the system. These parameters need to be thoroughly tested and deployed based on the specific system requirements.

Requests exceeding the limit on the inference servers will be rejected as a backpressure mechanism. To use the load balancer efficiently, client applications should implement traffic control and retry logic to enhance resilience. This includes resending requests to the load balancer for occasional errors and stopping the acceptance of new requests from users early during congestion, rather than allowing them to be dropped within the system.

The system capacities need to be carefully calculated and provisioned, or adjusted as needed, to adequately support the system requirements. Since instances require time to download images and start up, we recommend adjusting the container group to the expected capacity 30 minutes before peak traffic.
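
As an illustration of the kind of calculation involved, the sketch below estimates a replica count from an assumed traffic profile and the per-node throughput measured earlier; every input value is a placeholder to be replaced with your own requirements and test results:

import math

# Assumed traffic profile and measured per-node figures (replace with your own).
peak_requests_per_minute = 600
avg_generated_tokens = 512
node_throughput_tokens_per_sec = 160   # e.g., batched inference with batch size 4
headroom = 1.3                         # margin for interrupted or cold-starting nodes

required_tokens_per_sec = peak_requests_per_minute * avg_generated_tokens / 60
replicas = math.ceil(headroom * required_tokens_per_sec / node_throughput_tokens_per_sec)
print(f"~{required_tokens_per_sec:.0f} tokens/sec at peak -> provision {replicas} replicas")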