Long-Running Jobs

Run ML/AI Training Jobs, Monte Carlo Simulations, and More

Salad can run all kinds of long-running, compute-intensive jobs, from fine-tuning SDXL to molecular simulations, with the following restrictions:

  • Each job or block must be completable on a single GPU with up to 24 GB VRAM. Very large simulations are typically broken up into smaller blocks so they can be processed concurrently on many GPUs.
  • Your job should periodically save checkpoints of its progress so that it can be resumed in the event of an interruption (see the sketch after this list).
  • Your job must be packaged in a Docker container.
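
To make the checkpointing requirement concrete, here is a minimal sketch of a PyTorch-style training loop that resumes from a checkpoint directory. The directory, file name, and save interval are hypothetical; adapt them to your own trainer.

import os

import torch

# Hypothetical path; Kelpie syncs this directory to your checkpoint bucket
CHECKPOINT_DIR = os.getenv("CHECKPOINT_DIR", "/checkpoints")
CHECKPOINT_PATH = os.path.join(CHECKPOINT_DIR, "state.pt")

def train(model, optimizer, dataloader, epochs):
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    start_epoch = 0
    # Resume from the checkpoint that was synced down before the job started
    if os.path.exists(CHECKPOINT_PATH):
        state = torch.load(CHECKPOINT_PATH, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"] + 1
    for epoch in range(start_epoch, epochs):
        for batch in dataloader:
            ...  # your forward/backward/step logic
        # Save progress every epoch; new files here get uploaded automatically
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "epoch": epoch},
            CHECKPOINT_PATH,
        )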

Kelpie

Kelpie is an open-source framework for managing long-running jobs across interruptible hardware. It coordinates with the Kelpie API (free for Salad customers) to maximize utilization of your container group and to ensure that jobs succeed despite node interruptions.

Preparing Your Job

Make it a Docker container

The first step to preparing any workload for Salad is to get it packaged up as a Docker container. If you are unfamiliar with Docker, I recommend checking out this guide to Docker for Data Science.

In addition, your job will need a very small amount of instrumentation to be fully compatible with Kelpie. You can find an example application here.

Add Kelpie

Add Kelpie to your Dockerfile by downloading it from the release page.

# Download the kelpie binary from the release page and make it executable
RUN wget https://github.com/SaladTechnologies/kelpie/releases/download/0.0.10/kelpie -O /kelpie && chmod +x /kelpie

# Launch kelpie instead of your original script; kelpie will run your jobs
CMD ["/kelpie"]

Note that you also need to override your CMD directive to launch Kelpie instead of your original script.

Deploy Your Container Group

Navigate to the Salad Portal, creating a user account, organization, and project if necessary.

Click "Deploy a Container Group", and fill out the form, providing the docker image you just built, and specifying the hardware required to run the job. For now, you can leave Replica Count as 1, since Kelpie is capable of scaling your container group for you once configured.

Your job almost definitely needs more hardware than this

You do not need to override the command, and you do not need to enable Container Gateway.

While not required, it is recommended to set up an external logging provider to help you debug when things go wrong. Salad has integrated logging to help with debugging, but it is not as full-featured as stand-alone products.

You will need to set up some environment variables to enable Kelpie to interact with the Kelpie API, and to enable it to sync your checkpoints and outputs to your preferred S3-compatible storage provider (an example set is sketched below). Your KELPIE_API_KEY will be provided by Salad upon request. Note that it is not your Salad API Key.

I highly recommend choosing a storage provider that does not charge data egress fees. My personal favorite is Cloudflare R2, but there are other options on the market as well. Just google for "S3-compatible storage".
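
As a concrete sketch, the environment for a Kelpie workload might look like the following, shown in env-file form. KELPIE_API_KEY is the key Salad provides; the AWS_* names follow the standard S3 SDK conventions that S3-compatible providers like R2 use. The exact set of variables Kelpie reads is an assumption here, so confirm it against the Kelpie README.

KELPIE_API_KEY=<provided by Salad, not your Salad API Key>
AWS_ACCESS_KEY_ID=<your storage access key>
AWS_SECRET_ACCESS_KEY=<your storage secret key>
AWS_REGION=auto
AWS_ENDPOINT_URL=https://<account-id>.r2.cloudflarestorage.com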

You can go ahead and disable auto-start, and then hit deploy. You'll see your container group in a "preparing" phase while it pulls your Docker image into our cache.

(Optional) Add Kelpie to your Salad Organization

Kelpie can start, stop, and scale your container group in response to job volume. It can also automatically reallocate nodes that fail too many jobs. In order to take advantage of these features, you must add the Kelpie user to your Salad Organization.

From the left panel in the portal, select "Team"

Click "Invite New Team Member"

And invite the Kelpie user. During our beta, this is [email protected]. Once we exit beta, this will likely become [email protected].

Once your invite is accepted (ping us on Discord, Slack, WhatsApp, or email to let us know you're ready), Kelpie will be able to manage your container group.

Get Your Container Group ID

In order to make sure your jobs get scheduled on the correct container group, you need to retrieve the container group's unique ID via the API. You can do this from the interactive docs.
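
If you'd rather script it than use the interactive docs, a minimal sketch with the Salad public API looks like this. The endpoint and header follow the current Salad API docs; the organization, project, and key are placeholders.

import requests

SALAD_API_KEY = "<your Salad API key>"
ORG = "<organization-name>"
PROJECT = "<project-name>"

# List the container groups in the project and print their IDs
resp = requests.get(
    f"https://api.salad.com/api/public/organizations/{ORG}/projects/{PROJECT}/containers",
    headers={"Salad-Api-Key": SALAD_API_KEY},
)
resp.raise_for_status()
for group in resp.json()["items"]:
    print(group["name"], group["id"])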

(Optional) Create Scaling Rules

You can use the Kelpie API to create a scaling rule for your container group, setting minimum and maximum replicas, and how long an idle instance should still be counted as active before scaling down.
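
A request to do that might look like the sketch below. The endpoint path, auth header, and field names are assumptions for illustration, not the authoritative schema, so check the Kelpie API docs before using them.

import requests

# Hypothetical endpoint and field names; consult the Kelpie API docs
resp = requests.post(
    "https://kelpie.saladexamples.com/scaling-rules",
    headers={"X-Kelpie-Key": "<your Kelpie API key>"},
    json={
        "container_group_id": "<container-group-id>",
        "min_replicas": 0,
        "max_replicas": 10,
        "idle_threshold_seconds": 300,
    },
)
resp.raise_for_status()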

Prepare Data For Your Job

Kelpie deals with data in three categories:

  • Inputs - Data used as an input to your job. This might be images, text, or anything else. This data gets synced one-way from your cloud storage to a container's local storage, prior to starting a job. Changes made to the input directory locally are not detected, and are not pushed back up to the cloud.
  • Checkpoints - Data that saves the state of a job in a format that allows resuming. This directory is synced bi-directionally to and from the cloud. Data is downloaded from the cloud to the container's local storage prior to starting a job. Files added to the checkpoint directory locally are automatically pushed up to the cloud.
  • Outputs - The result of your job. This might be model weights, CSV files, or anything else. This directory is synced one-way to your cloud storage from a container's local storage, throughout a job's running time and at completion. Files added to the output directory locally are automatically pushed up to the cloud.

Prior to starting a batch of jobs, you need to make sure your inputs are organized appropriately in your storage bucket.
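
For example, you might give each job its own prefix so that a node only syncs down the inputs for the job it is running. Here is a sketch using boto3; the bucket names, prefixes, and endpoint URL are placeholders.

import boto3

# Point the client at your S3-compatible provider (placeholder endpoint)
s3 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",
    aws_access_key_id="<your storage access key>",
    aws_secret_access_key="<your storage secret key>",
)

# One prefix per job, so jobs stay isolated from each other
s3.upload_file("local/job-0001/config.json", "my-inputs", "jobs/job-0001/config.json")
s3.upload_file("local/job-0001/data.tar", "my-inputs", "jobs/job-0001/data.tar")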

Submit Jobs

Use the Kelpie API to submit your jobs. For each job, you'll submit a JSON payload like this (a scripted example follows the field reference below):

{
  "command": "string",
  "arguments": [],
  "environment": {},
  "input_bucket": "string",
  "input_prefix": "string",
  "checkpoint_bucket": "string",
  "checkpoint_prefix": "string",
  "output_bucket": "string",
  "output_prefix": "string",
  "max_failures": 3,
  "heartbeat_interval": 30,
  "webhook": "string",
  "container_group_id": "string"
}
  • command (string, required): The command to execute.
  • arguments (array, default []): List of arguments for the command.
  • environment (object, default {}): Key-value pairs defining the environment variables.
  • input_bucket (string, required): Name of the S3-compatible bucket for input files.
  • input_prefix (string, required): Prefix for input files in the bucket.
  • checkpoint_bucket (string, required): Name of the S3-compatible bucket for checkpoint files.
  • checkpoint_prefix (string, required): Prefix for checkpoint files in the bucket.
  • output_bucket (string, required): Name of the S3-compatible bucket for output files.
  • output_prefix (string, required): Prefix for output files in the bucket.
  • max_failures (integer, default 3): Maximum number of allowed failures before the job is marked failed.
  • heartbeat_interval (integer, default 30): Time interval (in seconds) for sending heartbeat signals.
  • webhook (string, optional): URL for the webhook to notify upon completion or failure.
  • container_group_id (string, required): ID of the container group where the command will be executed.
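
Putting it together, a submission script might look like the sketch below. The base URL and auth header are assumptions from the Kelpie beta, and the buckets, prefixes, and command are placeholders; confirm the details against the Kelpie API docs.

import requests

job = {
    "command": "python",
    "arguments": ["/app/train.py"],
    "environment": {"LEARNING_RATE": "3e-4"},
    "input_bucket": "my-inputs",
    "input_prefix": "jobs/job-0001/",
    "checkpoint_bucket": "my-checkpoints",
    "checkpoint_prefix": "jobs/job-0001/",
    "output_bucket": "my-outputs",
    "output_prefix": "jobs/job-0001/",
    "max_failures": 3,
    "heartbeat_interval": 30,
    "container_group_id": "<container-group-id>",
}

# Base URL and header are assumptions; see the Kelpie API docs
resp = requests.post(
    "https://kelpie.saladexamples.com/jobs",
    headers={"X-Kelpie-Key": "<your Kelpie API key>"},
    json=job,
)
resp.raise_for_status()
print(resp.json())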

(Optional) Manually Start Your Container Group

If you've set up scaling rules and added the Kelpie user to your Salad organization, you can skip this part, as it will be managed automatically for you.

Otherwise, now is the time to start your container group and set the desired number of replicas. Nodes will begin pulling work from the Kelpie API as soon as they are running. You can set the replica count from the "Edit" button in the container group view.
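
If you'd rather script this step too, the Salad public API supports updating the replica count with a JSON merge-patch and starting the group with a start endpoint. The sketch below reflects the current API docs, with placeholder names.

import requests

headers = {"Salad-Api-Key": "<your Salad API key>"}
base = (
    "https://api.salad.com/api/public/organizations/<organization-name>"
    "/projects/<project-name>/containers/<container-group-name>"
)

# Set the desired replica count (the update endpoint uses JSON merge-patch)
requests.patch(
    base,
    headers={**headers, "Content-Type": "application/merge-patch+json"},
    json={"replicas": 10},
).raise_for_status()

# Start the container group
requests.post(f"{base}/start", headers=headers).raise_for_status()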