
Run Automatic Speech Recognition Tasks easily with Whisper models, including Whisper Large v3 and Distil-Whisper Large v2

We have a ready-to-go inference server for automatic speech recognition tasks, with a super simple API. Check out the code.

Automatic Speech Recognition API

A minimalist, performance-oriented inference server for automatic speech recognition.

Here are some preliminary performance numbers. They measure total round-trip request time, including downloading the audio file and parsing the response. All models use the default configuration.

More extensive benchmarks are available here:

RTX 3080 Ti w/ BetterTransformers

Model                          | Input Audio Length | Realtime Multiple
-------------------------------|--------------------|------------------
OpenAI Whisper Large v3        | 19 min 51 s        | 50x
Distil Whisper Distil Large v2 | 19 min 51 s        | 78x

RTX 4090 w/ BetterTransformers

Model                          | Input Audio Length | Realtime Multiple
-------------------------------|--------------------|------------------
OpenAI Whisper Large v3        | 19 min 51 s        | 68x
Distil Whisper Distil Large v2 | 19 min 51 s        | 83x

RTX 4090 w/ Flash Attention 2

Model                          | Input Audio Length | Realtime Multiple
-------------------------------|--------------------|------------------
OpenAI Whisper Large v3        | 19 min 51 s        | 63x
Distil Whisper Distil Large v2 | 19 min 51 s        | 93x
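
To put the tables in wall-clock terms, the sketch below converts a realtime multiple back into an expected request time. It assumes the multiple is the audio duration divided by the total round-trip time; the function names are illustrative and not part of the server.

```python
def realtime_multiple(audio_seconds: float, elapsed_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / elapsed_seconds


def estimated_round_trip(audio_seconds: float, multiple: float) -> float:
    """Invert the ratio: expected request time for a given realtime multiple."""
    return audio_seconds / multiple


# The benchmark clip is 19 min 51 s long.
clip_seconds = 19 * 60 + 51  # 1191 seconds

# At 50x (Whisper Large v3 on the RTX 3080 Ti) the request takes about 24 s.
print(round(estimated_round_trip(clip_seconds, 50), 1))  # 23.8
```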


GET /hc

This healthcheck will not respond until the server is fully ready to accept requests.

{
  "status": "ok",
  "version": "0.0.5"
}

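Because the healthcheck stays silent until the server is ready, a client can poll it before sending work. The following is a minimal sketch using only the standard library; the function names, the retry interval, and the 180 second limit are assumptions chosen for illustration.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_health(base_url: str, timeout: float = 2.0) -> dict:
    """GET /hc and return the parsed JSON body."""
    with urllib.request.urlopen(base_url + "/hc", timeout=timeout) as resp:
        return json.load(resp)


def wait_until_ready(base_url, max_wait=180.0, interval=1.0, fetch=fetch_health):
    """Poll /hc until it answers with status "ok", or raise TimeoutError."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        try:
            body = fetch(base_url)
            if body.get("status") == "ok":
                return body
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server not accepting connections yet, or reply was not JSON
        time.sleep(interval)
    raise TimeoutError(f"{base_url}/hc did not become ready within {max_wait}s")
```

For a server on the default port of the local machine, this would be called as `wait_until_ready("http://localhost:8000")`. The `fetch` parameter exists only so the loop can be exercised without a running server.
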
POST /asr

Request - JSON

URL should be a download link to an audio file. It can also be a local filepath, if the server is running on the same machine as the file.

Verified extension support:

  • mp3
  • ogg
  • wav
  • webm
  • flac

Other formats may work as well, since the server uses ffmpeg and Soundfile under the hood.

{
  "url": "https://example.com/audio.mp3"
}

or, for a file on the same machine as the server:

{
  "url": "/path/to/local/audio.mp3"
}

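A JSON request of this shape could be sent from Python as sketched below. The helper names and the 300 second timeout are illustrative, and the `Content-Type: application/json` header is an assumption about what the server expects for JSON bodies.

```python
import json
import urllib.request


def build_asr_request(base_url: str, audio_url: str) -> urllib.request.Request:
    """Build a POST /asr request whose JSON body points at the audio file."""
    body = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/asr",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def transcribe_url(base_url: str, audio_url: str, timeout: float = 300.0) -> dict:
    """Send the request and return the parsed JSON response."""
    request = build_asr_request(base_url, audio_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Splitting the request construction from the network call keeps the payload easy to inspect before anything is sent.
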
Request - Upload

You can also upload an audio file directly. Use the raw bytes of the file as the request body.


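import requests

base_url = "http://example.com"  # placeholder server address
file_path = "/path/to/your/audiofile.mp3"  # placeholder audio file
with open(file_path, "rb") as f:
    # Upload the file's raw bytes as the request body
    response = requests.post(base_url + "/asr", data=f).json()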


curl -X POST http://example.com/asr \
--data-binary @/path/to/your/audiofile.mp3 \
-H "Content-Type: application/octet-stream"

Response

{
  "text": "hello world",
  "chunks": [
    {
      "timestamp": [...],
      "text": "hello world"
    }
  ]
}

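The chunks in the response can be turned into a simple timestamped transcript. This sketch assumes each `timestamp` is a `[start, end]` pair in seconds, as the Hugging Face ASR pipeline returns, and that the final end value may be missing; the sample values are made up for illustration.

```python
def format_chunks(response: dict) -> list[str]:
    """Render each chunk as "[start - end] text"."""
    lines = []
    for chunk in response.get("chunks", []):
        start, end = chunk["timestamp"]  # assumed [start, end] in seconds
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s - {end_label}] {chunk['text'].strip()}")
    return lines


# Hypothetical response; the timestamp values are invented for this example.
example = {
    "text": "hello world",
    "chunks": [{"timestamp": [0.0, 1.5], "text": "hello world"}],
}
print(format_chunks(example))  # ['[0.00s - 1.50s] hello world']
```
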
GET /docs

Swagger docs for the API.


All configuration is via environment variables.

See documentation for the ASR Pipeline for more information on the model configuration options:

Name | Description | Default
-----|-------------|--------
HOST | The host to listen on. | *
PORT | The port to listen on. | 8000
MODEL_ID | The model to use. See Automatic Speech Recognition Models. | openai/whisper-large-v3
CACHE_DIR | The directory to cache models in. | /data
FLASH_ATTENTION_2 | Whether to use Flash Attention 2. Must be 1 to enable. Enabled by default in the -fa2 images. If your GPU's compute capability is below 8.9, BetterTransformers is used instead. | None
BATCH_SIZE | The batch size to use. | 16
MAX_NEW_TOKENS | Presumably the max_new_tokens generation setting of the Hugging Face pipeline, which caps the number of tokens generated per chunk. | 128
CHUNK_LENGTH_S | The length of each chunk in seconds. | 30
STRIDE_LENGTH_S | The stride length in seconds. Defaults to 1/6 of CHUNK_LENGTH_S. | CHUNK_LENGTH_S / 6
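
The defaults above, including the derived stride, can be mirrored in a deployment script. The sketch below is one way to read these variables; it is not the server's own code, and the `load_config` name and dictionary keys are assumptions.

```python
import os


def load_config(env=None) -> dict:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    chunk_length_s = float(env.get("CHUNK_LENGTH_S", "30"))
    stride = env.get("STRIDE_LENGTH_S")
    return {
        "host": env.get("HOST", "*"),
        "port": int(env.get("PORT", "8000")),
        "model_id": env.get("MODEL_ID", "openai/whisper-large-v3"),
        "cache_dir": env.get("CACHE_DIR", "/data"),
        # Only the exact string "1" enables Flash Attention 2.
        "flash_attention_2": env.get("FLASH_ATTENTION_2") == "1",
        "batch_size": int(env.get("BATCH_SIZE", "16")),
        "max_new_tokens": int(env.get("MAX_NEW_TOKENS", "128")),
        "chunk_length_s": chunk_length_s,
        # The stride defaults to one sixth of the chunk length.
        "stride_length_s": float(stride) if stride is not None else chunk_length_s / 6,
    }


config = load_config({"CHUNK_LENGTH_S": "24"})
print(config["stride_length_s"])  # 4.0
```

With no overrides, the stride works out to 30 / 6 = 5 seconds.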

Docker Images

Note: The -fa2 images are larger, and require a GPU with compute capability >= 8.9. If your GPU does not support this, use the non -fa2 images.

  • saladtechnologies/asr-api:latest, saladtechnologies/asr-api:0.0.5 - The base image, no models included. Does not support flash attention 2, but is a smaller base image. Will download the model at runtime.
  • saladtechnologies/asr-api:latest-fa2, saladtechnologies/asr-api:0.0.5-fa2 - The base image, no models included. Supports flash attention 2, but is a larger base image. Will download the model at runtime.
  • saladtechnologies/asr-api:latest-openai-whisper-large-v3, saladtechnologies/asr-api:0.0.5-openai-whisper-large-v3 - The base image, with the OpenAI Whisper Large v3 model included. Does not support flash attention 2.
  • saladtechnologies/asr-api:latest-fa2-openai-whisper-large-v3, saladtechnologies/asr-api:0.0.5-fa2-openai-whisper-large-v3 - The base image, with the OpenAI Whisper Large v3 model included. Supports flash attention 2.
  • saladtechnologies/asr-api:latest-distil-whisper-distil-large-v2, saladtechnologies/asr-api:0.0.5-distil-whisper-distil-large-v2 - The base image, with the Distil Whisper Distil Large v2 model included. Does not support flash attention 2.
  • saladtechnologies/asr-api:latest-fa2-distil-whisper-distil-large-v2, saladtechnologies/asr-api:0.0.5-fa2-distil-whisper-distil-large-v2 - The base image, with the Distil Whisper Distil Large v2 model included. Supports flash attention 2.
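
The tags listed above follow a regular pattern: version, an optional `fa2` marker, then an optional model suffix. The sketch below composes a tag from those parts. The rule that the suffix is the model ID with "/" replaced by "-" is inferred from the listed tags rather than documented, and `image_tag` is a hypothetical helper.

```python
from typing import Optional


def image_tag(version: str = "latest", fa2: bool = False,
              model_id: Optional[str] = None) -> str:
    """Compose an image name following the pattern of the tags listed above."""
    parts = [version]
    if fa2:
        parts.append("fa2")
    if model_id:
        # Inferred rule: the tag suffix is the model ID with "/" replaced by "-".
        parts.append(model_id.replace("/", "-"))
    return "saladtechnologies/asr-api:" + "-".join(parts)


print(image_tag("0.0.5", fa2=True, model_id="distil-whisper/distil-large-v2"))
# saladtechnologies/asr-api:0.0.5-fa2-distil-whisper-distil-large-v2
```

Only the combinations in the list above are known to exist, so check a composed tag against that list before deploying it.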

Deploying On Salad

Deploying with the API

You can deploy this API on Salad using the following command. Replace the elided gpu_classes entries with the IDs of the GPU classes you want to run on.

See API Docs for more information.

curl -X POST \
  --url https://api.salad.com/api/public/organizations/${organization_name}/projects/${project_name}/containers \
  --header "Salad-Api-Key: ${salad_api_key}" \
  --header "Content-Type: application/json" \
  --data '
{
  "name": "asr-api-distil-whisper-lg-v2",
  "display_name": "asr-api-distil-whisper-lg-v2",
  "container": {
    "image": "saladtechnologies/asr-api:latest-distil-whisper-distil-large-v2",
    "resources": {
      "cpu": 2,
      "memory": 8192,
      "gpu_classes": [
        ...
      ]
    },
    "command": []
  },
  "autostart_policy": true,
  "restart_policy": "always",
  "replicas": 3,
  "networking": {
    "protocol": "http",
    "port": 8000,
    "auth": false
  },
  "startup_probe": {
    "http": {
      "path": "/hc",
      "port": 8000,
      "scheme": "http",
      "headers": []
    },
    "initial_delay_seconds": 1,
    "period_seconds": 1,
    "timeout_seconds": 1,
    "success_threshold": 1,
    "failure_threshold": 20
  }
}'

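The same deployment request could be built from Python, which makes it easier to vary the image or replica count. This is a sketch using only the standard library; `build_deploy_request` and its parameters are hypothetical, the GPU class IDs must be supplied by the caller, and the JSON content type header is an assumption.

```python
import json
import urllib.request


def build_deploy_request(org, project, api_key, gpu_class_ids, image, replicas=3):
    """Build the POST request that creates the container group."""
    payload = {
        "name": "asr-api-distil-whisper-lg-v2",
        "display_name": "asr-api-distil-whisper-lg-v2",
        "container": {
            "image": image,
            "resources": {"cpu": 2, "memory": 8192, "gpu_classes": list(gpu_class_ids)},
            "command": [],
        },
        "autostart_policy": True,
        "restart_policy": "always",
        "replicas": replicas,
        "networking": {"protocol": "http", "port": 8000, "auth": False},
        "startup_probe": {
            "http": {"path": "/hc", "port": 8000, "scheme": "http", "headers": []},
            "initial_delay_seconds": 1,
            "period_seconds": 1,
            "timeout_seconds": 1,
            "success_threshold": 1,
            "failure_threshold": 20,
        },
    }
    url = f"https://api.salad.com/api/public/organizations/{org}/projects/{project}/containers"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Salad-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request.
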
Deploying With The Portal

You can also deploy this API on Salad using the Salad Portal.

Select or create the organization and project you want to work with, then click the "Deploy a Container Group" button.

  1. Give your container group a name that is unique within this organization and project.

  2. Select the saladtechnologies/asr-api:latest-distil-whisper-distil-large-v2 image to deploy distil-whisper large v2, using BetterTransformers.

  3. Set your replica count. We recommend at least 3 replicas for production use.

  4. Set the CPU to 2, and the memory to 8 GB.

  5. Set the GPU to 1x RTX 3080 Ti (or another GPU; we haven't done comprehensive testing on all GPUs, so your mileage may vary).

  6. Configure the Startup Probe. This is used to determine when the container is ready to accept requests. Select the HTTP protocol, set the path to /hc, and the port to 8000. Set the initial delay, period, and timeout to 1. Set the success threshold to 1, and the failure threshold to 20. If you are using an image that downloads the model weights at runtime, increase the initial delay to 10 or more and set the failure threshold to 180, which allows up to 3 minutes for the container to start.

  7. Enable the Container Gateway for port 8000, and choose authenticated or not authenticated. If you choose authenticated, you will need to provide an API key when making requests.

  8. Click "Deploy" to deploy your container group.
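
The startup probe settings translate into a rough time budget for the container to become ready. The sketch below estimates it as the initial delay plus one period per allowed failure. It ignores the time each probe itself takes, and the exact scheduling Salad uses is an assumption here.

```python
def startup_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Rough upper bound on how long the probe waits before giving up."""
    return initial_delay + period * failure_threshold


# Image with the model baked in: the default probe settings.
print(startup_budget_seconds(1, 1, 20))    # 21

# Image that downloads weights at runtime: 180 probes at 1 s each is the
# 3 minute window mentioned above, plus the 10 s initial delay.
print(startup_budget_seconds(10, 1, 180))  # 190
```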