BGE-M3 Embeddings with Text Embeddings Inference

Last Updated: June 29, 2026

Overview

This recipe runs BAAI/bge-m3 with Hugging Face Text Embeddings Inference (TEI) on a Salad GPU and serves dense embeddings for RAG, semantic search, document retrieval, and similarity search workloads. BGE-M3 is useful when you need one embedding model that handles multilingual content, short queries, and longer documents: it supports more than 100 languages, accepts inputs up to 8192 tokens, and returns 1024-dimensional dense vectors that can be stored in a vector database.

The BGE-M3 model family also supports sparse and ColBERT-style retrieval modes through FlagEmbedding. This recipe focuses on the dense embedding endpoint exposed by TEI because that is the interface most RAG frameworks and vector search systems expect.
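To make the similarity search use case concrete, here is a minimal sketch of ranking documents against a query by cosine similarity. The helper names `cosine_similarity` and `rank_documents` are illustrative and not part of TEI or the recipe; the short vectors in the usage note stand in for the 1024-dimensional vectors the service returns.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

For example, `rank_documents([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])` places the second document first. In practice a vector database usually performs this ranking for you; the sketch only shows what the stored vectors are used for.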

Quick Start

1. Open the SaladCloud Portal.
2. Deploy the BGE-M3 Embeddings recipe.
3. Enter a Container Group Name.
4. Decide whether to enable Require Container Gateway Authentication:
   - Enabled: requests must include your SaladCloud API key.
   - Disabled: anyone with the URL can call the embedding service.
5. Deploy and wait for the first startup to finish.

The model is downloaded from Hugging Face at startup, so it can take several minutes before the deployment becomes ready.

Once the container is ready, call /embed for TEI’s native embedding API or /v1/embeddings for the OpenAI-compatible API.
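Because the first startup can take several minutes, a client may want to poll the health endpoint before sending embedding requests. The following is a rough sketch that uses only the Python standard library. The function names, retry count, and delay are assumptions chosen for illustration, not values prescribed by the recipe.

```python
import time
import urllib.error
import urllib.request


def health_ok(base_url, api_key=None, timeout=5.0):
    """Return True if GET /health answers with HTTP 200."""
    headers = {"Salad-Api-Key": api_key} if api_key else {}
    request = urllib.request.Request(base_url.rstrip("/") + "/health", headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError, OSError):
        # Connection errors and non-2xx responses both count as "not ready yet".
        return False


def wait_until_ready(probe, attempts=60, delay_seconds=10.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A typical call would be `wait_until_ready(lambda: health_ok("https://<your-dns>.salad.cloud", "<api-key>"))`. The probe is passed in as a function so the retry loop can be exercised without a live deployment.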

Defaults

The recipe comes preconfigured with these defaults:

Server: Hugging Face Text Embeddings Inference
Model ID: BAAI/bge-m3
Container image: ghcr.io/huggingface/text-embeddings-inference:89-1.9
Command equivalent: text-embeddings-router --model-id BAAI/bge-m3 --port 3000
Container port: 3000
Host bind: ::
Data type: float16
Max batch tokens: 16384
Max client batch size: 32
Readiness probe: GET /health
Authentication: enabled by default
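With the default max client batch size of 32, a client that has more texts than that should split them across several requests. A minimal sketch of that splitting, assuming a hypothetical helper named `chunk_inputs`:

```python
def chunk_inputs(texts, max_client_batch_size=32):
    """Split a list of texts into batches of at most max_client_batch_size items."""
    if max_client_batch_size < 1:
        raise ValueError("max_client_batch_size must be at least 1")
    return [
        texts[start:start + max_client_batch_size]
        for start in range(0, len(texts), max_client_batch_size)
    ]
```

If you change the max client batch size on the deployment, pass the same value here so client and server stay in agreement.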

API Endpoints

Useful TEI endpoints include:

GET /health - readiness probe and health check
GET /info - model and server metadata
GET /docs - Swagger documentation
POST /embed - TEI native dense embedding endpoint
POST /v1/embeddings - OpenAI-compatible embeddings endpoint
POST /similarity - TEI similarity endpoint
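The similarity endpoint is the only one in this list without an example elsewhere on this page, so here is a hedged sketch of preparing a request body and reading the result. The payload shape (an `inputs` object with `source_sentence` and `sentences`) and the response shape (a flat list of scores, one per sentence) reflect my understanding of TEI's API and should be confirmed against GET /docs on your deployment. The helper names are illustrative.

```python
import json


def build_similarity_payload(source_sentence, sentences):
    """Serialize a /similarity request body (assumed shape; verify in /docs)."""
    return json.dumps(
        {"inputs": {"source_sentence": source_sentence, "sentences": list(sentences)}}
    )


def best_match(sentences, scores):
    """Pair each sentence with its score and return the highest-scoring pair."""
    if not sentences or len(sentences) != len(scores):
        raise ValueError("sentences and scores must be non-empty and the same length")
    index = max(range(len(scores)), key=scores.__getitem__)
    return sentences[index], scores[index]
```

The string returned by `build_similarity_payload` can be sent as the body of a POST to /similarity with the same headers as the /embed examples.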

Authentication

Require Container Gateway Authentication is available in the deployment form and is enabled by default.

Enabled: every request must include the Salad-Api-Key header.
Disabled: anyone with the deployment URL can call the API.

If authentication is enabled, see Sending Requests for the header format.
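In client code, the two authentication modes differ only in whether the Salad-Api-Key header is sent. A small sketch, with `build_headers` as a hypothetical helper name:

```python
def build_headers(api_key=None):
    """Return request headers, adding Salad-Api-Key only when a key is given."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Salad-Api-Key"] = api_key
    return headers
```

Passing `None` (or an empty string) produces headers suitable for a deployment with authentication disabled.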

Example Request

Use /embed to generate dense vectors with TEI’s native API:

curl https://<your-dns>.salad.cloud/embed \
  -X POST \
  -H 'Content-Type: application/json' \
  -H 'Salad-Api-Key: <api-key>' \
  -d '{
    "inputs": [
      "BGE-M3 is useful for multilingual semantic search.",
      "Text embeddings help retrieve documents for RAG applications."
    ],
    "normalize": true
  }'

If you disabled authentication during deployment, omit the Salad-Api-Key header. Each returned vector should contain 1024 floating-point values.
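The same /embed request can be made from Python. This is a minimal sketch using only the standard library; the function names and the timeout are assumptions for illustration, and the request body mirrors the curl example above.

```python
import json
import urllib.request


def build_embed_request(base_url, texts, api_key=None, normalize=True):
    """Build a POST request for TEI's native /embed endpoint."""
    body = json.dumps({"inputs": list(texts), "normalize": normalize}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Salad-Api-Key"] = api_key
    return urllib.request.Request(
        base_url.rstrip("/") + "/embed", data=body, headers=headers, method="POST"
    )


def embed(base_url, texts, api_key=None, timeout=30.0):
    """Send the request and return the parsed JSON: one vector per input text."""
    request = build_embed_request(base_url, texts, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `embed("https://<your-dns>.salad.cloud", ["hello world"], api_key="<api-key>")` should return a list containing one 1024-value vector. Building the request separately from sending it makes the payload easy to inspect before it goes over the network.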

OpenAI-Compatible Request

Use /v1/embeddings if your client expects an OpenAI-compatible embeddings API:

curl https://<your-dns>.salad.cloud/v1/embeddings \
  -X POST \
  -H 'Content-Type: application/json' \
  -H 'Salad-Api-Key: <api-key>' \
  -d '{
    "model": "BAAI/bge-m3",
    "input": [
      "What is BGE-M3 good for?",
      "Use embeddings for document retrieval and similarity search."
    ],
    "encoding_format": "float"
  }'

If you disabled authentication during deployment, omit the Salad-Api-Key header.
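The OpenAI-compatible endpoint wraps each vector in an object rather than returning a bare list of vectors. The sketch below extracts the vectors in input order, assuming the response follows the OpenAI embeddings format (a `data` list whose items carry `index` and `embedding`). The helper name and the sample values are made up for illustration.

```python
def extract_embeddings(response_body):
    """Return the embedding vectors from an OpenAI-style response, in input order."""
    items = sorted(response_body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Illustrative response with shortened vectors; real vectors have 1024 values.
sample_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.3, 0.4]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2]},
    ],
    "model": "BAAI/bge-m3",
}

vectors = extract_embeddings(sample_response)
```

Sorting by `index` is a defensive step so the result lines up with the order of the `input` array even if the items arrive out of order.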

Test A Deployment

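Check health:

curl https://<your-dns>.salad.cloud/health \
  -H 'Salad-Api-Key: <api-key>'

Check model metadata:

curl https://<your-dns>.salad.cloud/info \
  -H 'Salad-Api-Key: <api-key>'

Confirm the embedding dimension:

curl -s https://<your-dns>.salad.cloud/embed \
  -X POST \
  -H 'Content-Type: application/json' \
  -H 'Salad-Api-Key: <api-key>' \
  -d '{"inputs":["hello world"],"normalize":true}' \
  | jq '.[0] | length'

The expected result is 1024.

If you disabled authentication during deployment, omit the Salad-Api-Key header from these commands.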
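The dimension check can also be automated on the client side. The sketch below validates a parsed /embed response: each vector should have 1024 values and, when `normalize` is true, a length close to 1. The function name and the tolerance are assumptions; the tolerance is deliberately loose because the recipe serves the model in float16.

```python
import math

EXPECTED_DIM = 1024  # dense vector size stated for BGE-M3 in this recipe


def check_embeddings(vectors, expected_dim=EXPECTED_DIM, tolerance=1e-2):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            problems.append(f"vector {i}: expected {expected_dim} values, got {len(vec)}")
            continue
        norm = math.sqrt(sum(x * x for x in vec))
        if abs(norm - 1.0) > tolerance:
            problems.append(f"vector {i}: norm {norm:.4f} is not close to 1")
    return problems
```

Skip or relax the norm check if you send requests with `"normalize": false`.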

Tuning Notes

- Keep MODEL_ID set to BAAI/bge-m3 unless you intentionally want to repurpose the recipe.
- BGE-M3 supports long inputs, but long batches use more VRAM. Lower MAX_BATCH_TOKENS if replicas run out of memory.
- Increase MAX_CLIENT_BATCH_SIZE only after load testing your expected request shape.
- If you want private or gated Hugging Face models in a customized deployment, add HF_TOKEN in Advanced Configuration.
