When you enable the container gateway during container group creation, several load balancing (LB) options determine how requests are forwarded to the container instances, including the algorithm, concurrency, and timeouts. These settings cannot be changed once a container group is created. To modify the container gateway settings, duplicate the existing container group to create a new one and apply the desired changes.

Algorithms

There are two load balancing algorithms currently supported by SaladCloud: Round Robin and Least Number of Connections.

Round Robin: This algorithm evenly distributes incoming requests to instances within a container group in a sequential and cyclical order, without considering each instance’s current load or processing power. It is ideal for a container group using a single GPU type, especially when all requests have similar processing times and the workload is light.

Least Number of Connections: This algorithm continuously monitors the load on each instance and directs new requests to those with the fewest active connections. By prioritizing the least-loaded instances and preventing any one from becoming overwhelmed, it balances workloads more effectively and optimizes system performance, resulting in greater efficiency and reliability.

Processing time for AI inference can vary significantly based on factors like image size, context length, and audio duration. Additionally, deploying a container group with multiple GPU types that share the same VRAM size but have different processing capabilities can further increase this variability. In most cases, the Least Number of Connections algorithm can effectively manage these differences and is recommended for optimal performance.
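As a simplified illustration (not SaladCloud's actual implementation), the selection logic of the two algorithms can be sketched as follows, using a hypothetical snapshot of instances and their active connection counts:

```python
from itertools import cycle

# Hypothetical view of a container group: instance ID -> active connections.
instances = {"instance-a": 3, "instance-b": 0, "instance-c": 1}

# Round Robin: cycle through instances in a fixed order, ignoring load.
round_robin = cycle(instances)
next_rr_instance = next(round_robin)

# Least Number of Connections: pick the instance with the fewest active connections.
next_lc_instance = min(instances, key=instances.get)

print(next_rr_instance)  # "instance-a" (next in sequential order)
print(next_lc_instance)  # "instance-b" (lowest current load)
```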

Concurrency

To enhance overall performance and throughput, the container gateway typically utilizes multiple connections to each instance and sends multiple requests to the instance concurrently.

Ideally, inference servers should be designed to handle multiple simultaneous requests. This includes incorporating a local queue to buffer requests, accepting new requests during inference (non-blocking I/O), explicitly rejecting excess requests as a back-pressure mechanism when the queue is full, and performing either batched inference (grouping incoming requests) or single inference. Some inference servers also support streaming mode, enabling real-time responses during the inference process rather than waiting for the entire process to complete. This capability is particularly useful for LLMs.
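A minimal sketch of this pattern, assuming a FastAPI application and a hypothetical run_inference function (batched inference is omitted for brevity); the bounded semaphore acts as the local queue and provides back-pressure by rejecting requests with an explicit error when it is full:

```python
import asyncio
from fastapi import FastAPI, HTTPException

app = FastAPI()
MAX_QUEUE_SIZE = 16                       # back-pressure threshold (assumption)
queue_slots = asyncio.Semaphore(MAX_QUEUE_SIZE)
gpu_lock = asyncio.Lock()                 # serialize inference on the single GPU

async def run_inference(payload: dict) -> dict:
    # Placeholder for the actual model call (assumption).
    await asyncio.sleep(1)
    return {"result": "ok"}

@app.post("/generate")
async def generate(payload: dict):
    # Non-blocking I/O: new requests are still accepted while inference runs.
    # Back-pressure: explicitly reject excess requests instead of queuing forever.
    if queue_slots.locked():
        raise HTTPException(status_code=503, detail="queue full")
    async with queue_slots:              # take a slot in the local queue
        async with gpu_lock:             # one inference at a time on the GPU
            return await run_inference(payload)
```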

In contrast to batched or single inference, using multiprocessing or multithreading for concurrent inference on a single GPU may limit optimal GPU cache utilization and negatively impact performance; therefore, this approach should generally be avoided.

To support various inference servers and use cases, two settings are available for concurrency:

When the option “Limit each server to a single, active connection” is unselected, the container gateway establishes multiple connections per instance and forwards concurrent requests to the inference server in the container. This server should be capable of handling multiple requests simultaneously: it should maintain a local queue, use non-blocking I/O, and provide a back-pressure mechanism.

When the option “Limit each server to a single, active connection” is selected, the container gateway uses a single active connection per instance and forwards only one request at a time to an instance, even when using HTTP/2 multiplexing. New requests wait and are queued by the container gateway until the current one is processed and a response is returned by the instance. This configuration ensures that an instance receives and processes only one request at any time. Note that a WebSocket connection is treated as a single request.

For single inference tasks, such as generating one image at a time, choosing this option can significantly simplify the implementation of the inference server by allowing synchronous calls. In other scenarios, this option will reduce system throughput and should be avoided.
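With a single active connection per instance, the server never receives overlapping requests from the gateway, so a simple blocking handler is enough. A minimal sketch, assuming a Flask app and a hypothetical generate_image function:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_image(prompt: str) -> str:
    # Placeholder for a blocking model call (assumption).
    return f"image for: {prompt}"

@app.post("/generate")
def generate():
    # The gateway forwards one request at a time, so a synchronous,
    # blocking call is sufficient here.
    prompt = request.get_json().get("prompt", "")
    return jsonify({"image": generate_image(prompt)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```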

Client Request Timeout

This setting (in milliseconds) determines how long a request can wait in the container gateway queue before being sent to an instance, after which a timeout error is returned to the client. The default is 100 seconds (100,000 milliseconds), which is also the maximum allowable waiting time for queued requests.

When a container gateway is configured to use multiple connections per instance, this setting can be ignored, as all requests are sent to the instance immediately and queued by the inference server instead.

However, if a container gateway is configured to use a single active connection per instance, this setting significantly influences system behavior:

  • If set too low, the system may struggle to handle burst traffic.
  • If set too high, it could result in longer waiting times on the client side.

This setting should be determined carefully based on the application requirements. In the long term, the system resources must align with the workload.

If the client frequently encounters errors, such as client request timeout errors from the container gateway or queue-full errors from the inference server, you may need to add or adjust resources to enhance processing capability.

Alternatively, you can implement flow control and retry mechanisms on the client side to throttle or stop accepting new user requests, rather than letting them be dropped within the system.
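A minimal client-side sketch of this idea, assuming the requests library and a hypothetical endpoint URL; it caps the number of in-flight requests and retries with backoff on gateway timeout errors:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "https://example.salad.cloud/generate"   # hypothetical URL
MAX_IN_FLIGHT = 4                                   # flow-control limit (assumption)
RETRYABLE = {504, 524}                              # gateway timeout errors

def send_with_retry(payload: dict, retries: int = 3) -> requests.Response:
    for attempt in range(retries):
        resp = requests.post(ENDPOINT, json=payload, timeout=120)
        if resp.status_code not in RETRYABLE:
            return resp
        time.sleep(2 ** attempt)        # exponential backoff before retrying
    return resp

# Limit concurrency so the system is not flooded with more requests
# than it can process.
with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
    futures = [pool.submit(send_with_retry, {"prompt": f"job {i}"}) for i in range(10)]
    results = [f.result() for f in futures]
```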

Server Response Timeout

This setting (in milliseconds) determines how long the container gateway will wait for a response from the instance after sending a request. If no response is received within this timeframe, a timeout error is returned to the client, preventing prolonged waits for responses that may never arrive. The default value is 100 seconds (100,000 milliseconds), which is also the maximum allowable waiting time for a response.

A 100-second timeout is generally sufficient for many tasks, including image generation and classification. However, for LLMs, the decoding process for very long conversations may take a few minutes. If the inference server operates in streaming mode, providing responses immediately after the first token is generated rather than waiting for the entire process to complete, LLM applications can still function effectively with the container gateway.
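A minimal sketch of streaming, assuming a FastAPI server and a hypothetical token generator; the first chunk is returned as soon as it is available, so the gateway receives a response well before the full generation completes:

```python
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):
    # Placeholder for an LLM decoding loop (assumption).
    for token in ["Hello", ", ", "world", "!"]:
        await asyncio.sleep(0.1)        # simulate per-token decoding time
        yield token

@app.post("/chat")
async def chat(payload: dict):
    # Stream tokens as they are produced instead of waiting for the
    # entire response to be generated.
    return StreamingResponse(generate_tokens(payload.get("prompt", "")),
                             media_type="text/plain")
```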

The container gateway is not intended for long-running tasks, such as molecular simulations and long audio transcriptions, which would take hours or days to complete for each job. These tasks should instead be managed using a job queue.

Legacy Container Gateway Settings

Container groups created before these settings were available are configured with the Least Number of Connections algorithm, a single active connection per instance and 100 seconds for both Client Request Timeout and Server Response Timeout. These settings cannot be modified.

To accommodate your application’s needs, you can create a new container group by duplicating the existing one and selecting the appropriate LB options.

Common Errors and Their Causes

Generally, the inference server handles multiple tasks for each request, including downloading/uploading, pre-processing, post-processing, and GPU inference. These tasks take time and may encounter errors along the way.

When the client sends a burst of requests to the instances through the container gateway, the following common errors related to these LB options may arise:

- The client may encounter server response timeout errors (524) if processing certain requests on the inference server takes longer than the server response timeout.

- If the inference server runs a single thread with blocking I/O, it cannot accept new requests during inference, which may result in connection timeout errors (504).

- If requests wait in the container gateway for more than the client request timeout, the client may receive connection timeout errors (504).

- The client may receive server-side errors (500) if the inference server encounters issues or crashes during inference.

- If the local queue of the inference server is full, the server may reject new requests and return errors (400 or 500) with custom messages.
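On the client side, these errors can be distinguished by status code to decide how to react; a minimal sketch, assuming the requests library, a hypothetical endpoint URL, and an inference server whose queue-full message contains the word "queue":

```python
import requests

ENDPOINT = "https://example.salad.cloud/generate"   # hypothetical URL

def classify(resp: requests.Response) -> str:
    body = resp.text.lower()
    if resp.status_code in (400, 500) and "queue" in body:
        return "queue full - apply client-side flow control"
    if resp.status_code == 524:
        return "server response timeout - shorten the task or use a job queue"
    if resp.status_code == 504:
        return "timeout at the gateway - retry with backoff"
    if resp.status_code == 500:
        return "inference server error - inspect instance logs"
    return "ok" if resp.ok else f"unexpected status {resp.status_code}"

resp = requests.post(ENDPOINT, json={"prompt": "hello"}, timeout=120)
print(classify(resp))
```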

You may receive some of these error messages from Cloudflare, as the container gateway is built on top of it. For more information about error messages, please refer to this link.