How to Manage a Large Number of Stable Diffusion Models
High Level
Limited local storage (up to 50gb) and the inevitability of node interruptions introduce complexity when managing services that serve inference from hundreds or even thousands of models.
We can combine 3 techniques to manage that complexity in an arbitrarily scalable way:
- Pre-load Popular Models - Include the most popular models in the container image, so your container can become productive immediately on start while it downloads more models in the background.
- Local Least-Recently-Used Caching - Only keep the most recently used 50gb of models stored locally on any given node.
- Smart Job Scheduling - Assign jobs to nodes in a way that minimizes model downloading and swapping.
Pre-loading Popular Models
Your service likely has some models that are significantly more popular than others. SaladCloud allows a maximum container image size of 35gb compressed. SaladCloud also does not charge during the downloading phase, only charging once the instance enters the running state. This means it is often prudent to include some of your most popular models in the container image, as you effectively get that download time for free. Additionally, it means your container can start doing inference work immediately once it’s started, even as it downloads more models in the background. Finally, SaladCloud’s 50gb storage limit is in addition to any space taken up by your container, so you can get more total local storage by including models in the container image. The main downside is reduced elasticity in scaling, as the larger images will take longer to download to new nodes.
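For example, a container entrypoint might kick off a background download of additional models before starting the inference server. This is only a sketch; the model names, URLs, and directory below are placeholders for illustration:

```python
import os
import threading
import urllib.request

MODEL_DIR = "/models"  # same directory the pre-loaded models were baked into

# Placeholder names/URLs: models popular enough to want locally,
# but not popular enough to justify space in the container image.
EXTRA_MODELS = {
    "model-b.safetensors": "https://example.com/models/model-b.safetensors",
    "model-c.safetensors": "https://example.com/models/model-c.safetensors",
}


def download_extra_models():
    """Fetch additional models in the background after the worker starts."""
    for filename, url in EXTRA_MODELS.items():
        path = os.path.join(MODEL_DIR, filename)
        if not os.path.exists(path):  # models baked into the image are skipped
            urllib.request.urlretrieve(url, path)


# Start downloading in the background, then begin serving inference right away
# using only the models that were pre-loaded into the container image.
threading.Thread(target=download_extra_models, daemon=True).start()
# start_inference_server()  # your existing entrypoint goes here
```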
Local LRU Caching
When you have potentially terabytes of model checkpoints, LoRAs, upscale models, and more, it’s never going to be possible to cache it all locally on a SaladCloud node, due to the storage size restrictions. Beyond that, you wouldn’t want to pay for node uptime while downloading all of those models at start, especially when the node may be interrupted at any time and the process would have to start over from the beginning on another node.
You also don’t want to download 2-6gb checkpoints for potentially every single request, as it introduces unacceptable latency to user generation requests.
The solution here is to implement an LRU cache, keeping only the most recently used 50gb of models and automatically clearing out others as needed. A simple Python implementation, which assumes each model is a single file in one cache directory, might look like this:
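```python
import os
from collections import OrderedDict

CACHE_DIR = "/models"               # illustrative path
MAX_CACHE_BYTES = 50 * 1024**3      # stay under the 50gb local storage limit


class ModelLRUCache:
    """Tracks model files on disk and deletes the least recently used
    ones whenever the total size exceeds max_bytes."""

    def __init__(self, cache_dir=CACHE_DIR, max_bytes=MAX_CACHE_BYTES):
        self.cache_dir = cache_dir
        self.max_bytes = max_bytes
        self.entries = OrderedDict()  # filename -> size in bytes
        os.makedirs(cache_dir, exist_ok=True)
        # Seed the index from files already on disk, oldest access first,
        # so eviction order stays sensible after a restart.
        paths = [(f, os.path.join(cache_dir, f)) for f in os.listdir(cache_dir)]
        for name, path in sorted(paths, key=lambda p: os.path.getatime(p[1])):
            self.entries[name] = os.path.getsize(path)

    def touch(self, name):
        """Mark a model as recently used; call this every time it is loaded."""
        if name in self.entries:
            self.entries.move_to_end(name)

    def add(self, name):
        """Register a newly downloaded model file, evicting old ones if needed."""
        path = os.path.join(self.cache_dir, name)
        self.entries[name] = os.path.getsize(path)
        self.entries.move_to_end(name)
        while sum(self.entries.values()) > self.max_bytes and len(self.entries) > 1:
            oldest, _ = self.entries.popitem(last=False)
            os.remove(os.path.join(self.cache_dir, oldest))
```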
Smart Job Scheduling
Minimizing model downloading and model swapping on any given node is key to maintaining good overall performance, as these processes may take significantly longer than the generations themselves.
Doing this requires using a “pull” method of job distribution, where inference workers request work from an API, rather than having work “pushed” to nodes through a load balancer. However, a simple job queue will be insufficient, as nodes are likely to receive requests for models they have not yet downloaded locally.
The basic pattern here is that workers should include information about themselves and their cache when they request work. For example, a worker may include every model they have loaded in VRAM, and also every model they have downloaded locally. Then, the API can use that information to preferentially return inference jobs for models that are already loaded, or at least downloaded. The API response can also include instructions to begin downloading models not required for the currently assigned generation, in order to expand system capacity for a model that may be increasing in popularity.
Example Request
POST /availability
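The exact schema is up to you; a request body might look something like this (field names and values are illustrative):

```json
{
  "machine_id": "a1b2c3d4",
  "loaded_models": ["model-a.safetensors"],
  "downloaded_models": ["model-a.safetensors", "model-b.safetensors"],
  "free_disk_bytes": 32212254720
}
```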
Example Response
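A matching response might return a job for a model the worker already has loaded, plus an instruction to start downloading another model in the background (again, field names are illustrative):

```json
{
  "job": {
    "id": "gen-12345",
    "model": "model-a.safetensors",
    "prompt": "a watercolor painting of a lighthouse at dawn",
    "steps": 30
  },
  "prefetch": ["model-c.safetensors"]
}
```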
In this way, the API can proactively keep models cached locally on n nodes, ensuring coverage in the event of node interruptions and maintaining an adequate supply of warm inference servers for any given model. It also lets nodes do other useful work while preparing their local caches for future jobs.
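On the API side, the core of this preference can be a simple tiered check. A minimal sketch, assuming pending jobs are dicts with a "model" key:

```python
def select_job(pending_jobs, loaded_models, downloaded_models):
    """Prefer jobs whose model is already in VRAM, then jobs whose model is
    already on the worker's disk, then fall back to anything pending."""
    for candidates in (
        [j for j in pending_jobs if j["model"] in loaded_models],
        [j for j in pending_jobs if j["model"] in downloaded_models],
        pending_jobs,
    ):
        if candidates:
            return candidates[0]
    return None  # nothing pending; the worker can prefetch or idle
```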
Detecting interrupted nodes typically involves a heartbeat mechanism: if a node hasn’t requested work within a certain amount of time, assume it is dead (this can be verified with the SaladCloud API) and reassign its work as needed.
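A minimal sketch of that bookkeeping on the API side (the timeout value is arbitrary):

```python
import time

HEARTBEAT_TIMEOUT = 120  # seconds without a work request before a node is suspect

last_seen = {}  # machine_id -> timestamp of its most recent work request


def record_heartbeat(machine_id):
    """Call this every time a worker hits the availability endpoint."""
    last_seen[machine_id] = time.time()


def find_suspect_nodes():
    """Return nodes that have gone quiet; verify them against the SaladCloud API,
    then requeue any jobs assigned to them."""
    now = time.time()
    return [m for m, t in last_seen.items() if now - t > HEARTBEAT_TIMEOUT]
```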