> ## Documentation Index
> Fetch the complete documentation index at: https://docs.salad.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Load Balancing Options

*Last Updated: September 3, 2025*

When you enable the container gateway during container group creation, several load balancing (LB) options determine how
requests are forwarded to the container instances, including algorithms, concurrency and timeouts. These settings cannot
be changed once a container group is created. To modify the container gateway settings, you can duplicate the existing
container group, create a new one, and apply the desired changes.

# Simulator

Try the [Container Gateway Simulator](https://gateway-simulator.saladexamples.com/index.html) to better understand all
of the concepts described here, and how they impact system performance.

# Settings

<img src="https://mintcdn.com/salad/pEBrSzH4UQRtJw00/container-engine/images/load-balancer-options-1.png?fit=max&auto=format&n=pEBrSzH4UQRtJw00&q=85&s=e0a9b79f15e2c554c2c8a59852d11bde" width="464" height="780" data-path="container-engine/images/load-balancer-options-1.png" />

## Algorithms

There are two load balancing algorithms currently supported by SaladCloud: Round Robin and Least Number of Connections.

**Round Robin:** This algorithm evenly distributes incoming requests to instances within a container group in a
sequential and cyclical order, without considering the instance’s current load or processing power. It is ideal for a
container group using the single GPU type, especially when all requests have similar processing time and the workload is
light.

**Least Number of Connections:** This algorithm continuously monitors the load on each instance and directs new requests
to those with the fewest active connections. By prioritizing the least-loaded instances and preventing any one from
becoming overwhelmed, it balances workloads more effectively and optimizes system performance, resulting in greater
efficiency and reliability.

Processing time for AI inference can vary significantly based on factors like image size, context length, and audio
duration. Additionally, deploying a container group with multiple GPU types that share the same VRAM size but have
different processing capabilities can further complicate this variability. **In most cases, the Least Number of
Connections algorithm can effectively manage these differences and is recommended for optimal performance.**

## Concurrency

To enhance overall performance and throughput, the container gateway typically utilizes multiple connections to each
instance and sends multiple requests to the instance concurrently.

For inference servers, it is ideal to design them to handle multiple simultaneous requests. This includes incorporating
a local queue to buffer requests, the ability to accept new requests during inference (non-blocking I/O), explicitly
rejecting excessive requests as a back-pressure mechanism when the queue is full, and supporting batched inference by
grouping incoming requests or performing single inference. Some inference servers support streaming mode, enabling
real-time responses during the inference process rather than waiting for the entire process to complete. This capability
is particularly useful for LLMs.

In contrast to batched or single inference, using multiprocessing or multithreading for concurrent inference on a single
GPU may limit optimal GPU cache utilization and negatively impact performance; therefore, this approach should generally
be avoided.

**To support various inference servers and use cases, two settings are available for concurrency:**

**When the option - “Limit each server to a single, active connection” is unselected**, the container gateway will
establish multiple connections per instance and forward concurrent requests to the inference server in the container.
This server should be capable of handling multiple requests simultaneously, including a local queue, use non-blocking
I/O and have a back-pressure mechanism.

**When the option - “Limit each server to a single, active connection” is selected**, the container gateway will use a
single active connection per instance and only forward one request at a time to an instance, even when using HTTP/2
multiplexing. **New requests will wait and be queued by the container gateway until the current one is processed and a
response is returned by the instance.** This configuration ensures that an instance receives and processes only one
request at any time. Note that a WebSocket connection treated as a single request.

For single inference tasks, such as generating an image at a time, choosing this option can significantly simplify the
implementation of the inference server by using synchronous calls. **Otherwise, this option will reduce system
throughput and should be avoided.**

## Client Request Timeout

This setting (millisecond) determines how long a request can be waiting and queued in the container gateway before being
sent, after which a timeout error is returned to the client. The default is **100 seconds** (or 100000 milliseconds),
which is the maximum allowable waiting time for queued requests.

**When a container gateway is configured to use multiple connections per instance, this setting can be ignored**, as all
requests are sent to the instance immediately and would be queued by the inference server.

**However, if a container gateway is configured to use a single active connection per instance, the setting is crucial
in influencing the system behavior:**

* **If set too low, the system may struggle to handle burst traffic.**
* **If set too high, it could result in longer waiting times on the client side.**

This setting should be determined carefully based on the application requirements. In the long term, the system
resources must align with the workload.

If the client frequently encounters errors, such as client request timeout errors from the container gateway, or queue
full errors from the inference server, you may need to increase or adjust resources to enhance processing capability.

Alternatively, you can implement flow control and retry mechanisms on the client side to stop accepting new user
requests, rather than letting them be dropped within the system.

## Server Response Timeout

This setting (millisecond) determines how long the container gateway will wait for a response from the instance after
sending a request. If no response is received within this timeframe, a timeout error is returned to the client,
preventing prolonged waits for responses that may never arrive. The default value is **100 seconds** (100000
milliseconds）, which is the maximum allowable waiting time for a response.

A 100-second timeout is generally sufficient for many tasks, including image generation and classification. However, for
LLMs, the decoding process for very long conversions may take a few minutes. **If the inference server operates in
streaming mode - providing responses immediately after the first token is generated rather than waiting for the entire
process to complete, LLM applications can still function effectively with the container gateway.**

**The container gateway is not intended for long-running tasks, such as molecular simulations and long audio
transcriptions, which would take hours or days to complete for each job. These tasks should instead be managed using a
job queue.**

# Common Errors and Their Causes

Generally, the inference server handles multiple tasks for each request, including downloading/uploading,
pro-processing, post-processing and GPU inference. These tasks take time and may encounter errors along the way.

When the client sends a burst of requests to the instances through the container gateway, the following common errors
related to these LB options may arise:

-The client may encounter server response timeout errors (524) if processing certain requests on the inference server
takes longer than the server response timeout.

-If the inference server runs a single thread with blocking I/O, it cannot accept new requests during inference, which
may result in connection timeout errors (504).

-If requests wait in the container gateway for more than the client request timeout, the client may receive connection
timeout errors (504).

-The client may receive server-side errors (500) if the inference server encounters issues or crashes during inference.

-If the local queue of the inference server is full, the server may reject new requests and return errors (400 or 500)
with custom messages.

**You may receive some of these error messages from Cloudflare, as the container gateway is built on top of it.** For
more information about error messages, please refer to [this link](/container-engine/explanation/gateway/error-pages).
