OpenVoice Text-to-Speech and Voice Cloning Guide

Introduction

Text-to-speech (TTS) technology has undergone significant advancements in recent years, becoming more affordable and efficient than ever before. Modern TTS models leverage deep learning and artificial intelligence to produce natural-sounding speech with remarkable accuracy. These models find applications in various real-life scenarios, such as voice assistants, audiobook narration, and accessibility tools for those with visual impairments or reading difficulties. In this article, we will focus on using one such TTS model, OpenVoice, on Salad Cloud, demonstrating how to harness its capabilities in a cloud-based environment.

If you are looking for fast deployment of OpenVoice on Salad, skip ahead to Deploying OpenVoice Application to Salad.

Discover OpenVoice: The Open-Source Voice Cloning Tool

OpenVoice is an open-source instant voice cloning technology that enables the creation of realistic and customizable speech from just a short audio clip of a reference speaker. OpenVoice stands out for its ability to precisely replicate the voice's tone color while offering extensive control over various speech attributes such as emotion and rhythm. Remarkably, it also supports zero-shot cross-lingual voice cloning, enabling the generation of speech in languages not originally included in its extensive training set.

OpenVoice is not only versatile but also exceptionally efficient, requiring significantly lower computational resources compared to commercially available text-to-speech APIs, often at a fraction of the cost and with superior performance. For developers and organizations interested in exploring or integrating OpenVoice, the technical report and source code are available at arXiv and GitHub.

Exploring OpenVoice Framework

The OpenVoice technology encompasses a sophisticated framework designed to replicate human speech with remarkable accuracy and versatility. The process involves several key steps, each contributing to the creation of natural-sounding and personalized voice output. Here’s a closer look at the OpenVoice framework:

  1. Text-to-Speech (TTS) Synthesis: At the core of the OpenVoice framework is its TTS engine, which converts written text into spoken words. This initial step utilizes a base speaker model to generate speech that serves as the foundation for further customization.
  2. Tone Extraction: Following the TTS synthesis, OpenVoice extracts the tone characteristics from a reference voice sample.
  3. Tone Color Embodiment: The final step involves integrating the extracted tone color into the speech generated by the TTS engine. This ensures that the output not only replicates the voice tone of the reference speaker but also carries distinctive vocal signatures such as rhythm and intonation.
    Here is an illustration from the official technical report:

Exploring Open Voice Capabilities

In our benchmarking efforts, we discovered that OpenVoice can be executed on any GPU available on Salad Cloud, including those with lower memory capacities. However, when it comes to voice cloning, there is an exception: it currently cannot be run on RTX 40-series GPUs due to driver/library incompatibilities, which are expected to be resolved soon.

Based on our analysis, the RTX 2070 emerges as the best choice for balancing cost and performance. Our benchmarks reveal that when using the RTX 2070 on Salad Cloud GPUs, OpenVoice can process an impressive 4 million words per dollar for text-to-speech plus cloning, and over 6 million words per dollar for text-to-speech alone, making it an efficient and economical option for voice synthesis and cloning tasks. You can check our benchmark here: https://blog.salad.com/openvoice/
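To make those figures concrete, here is the same benchmark expressed as a cost per million words (simple arithmetic on the numbers quoted above, nothing more):

# RTX 2070 benchmark figures quoted above
words_per_dollar_tts_and_cloning = 4_000_000
words_per_dollar_tts_only = 6_000_000
print(1_000_000 / words_per_dollar_tts_and_cloning)  # ~0.25 USD per million words (TTS + cloning)
print(1_000_000 / words_per_dollar_tts_only)         # ~0.17 USD per million words (TTS only)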

Project Overview: TTS and Voice Cloning using Open Voice and Salad Cloud

In this project, our aim is to deploy an OpenVoice-based solution that offers the flexibility to choose between plain text-to-speech conversion and adding a narrator's voice tone on top of it. This solution will be deployed as an API.

The Workflow:

  1. Request: The process begins with a request sent to the API.
  2. Input Data: We store our text file and reference voice (if cloning is required) on Azure.
  3. TTS Conversion: OpenVoice processes the input text file and performs text-to-speech (TTS) conversion using the base TTS model. Style parameters, such as speed and emotions, can be specified at this stage.
  4. Extract Tone Color (Optional): If voice cloning is desired, the tone color extractor uses the reference voice file to create a voice model.
  5. Add Tone Color (Optional): The extracted tone color is applied to the TTS-generated speech file, adding the cloned voice's characteristics.
  6. Storage and Accessibility: The resulting audio file is uploaded back to Azure for accessibility and further use.
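Sketched in code, the workflow boils down to roughly the following (a simplified pseudocode sketch; every helper name here is a placeholder, and the concrete OpenVoice and Azure calls are shown in the sections below):

def handle_request(text_file, reference_voice=None, speed=1.0, language="English", tone="default"):
    text = download_from_azure("input", text_file)                                 # steps 1-2 (placeholder helper)
    tts_audio = run_base_tts(text, speaker=tone, language=language, speed=speed)   # step 3
    result = tts_audio
    if reference_voice is not None:                                                # steps 4-5 are optional
        voice_sample = download_from_azure("voices", reference_voice)
        target_embedding = extract_tone_color(voice_sample)                        # step 4
        result = apply_tone_color(tts_audio, target_embedding)                     # step 5
    upload_to_azure("output", result)                                              # step 6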

Through this project, we aim to demonstrate that advanced voice cloning and text-to-speech synthesis are accessible to a broader audience, not just large organizations with significant resources. By combining OpenVoice with Salad Cloud, we democratize access to state-of-the-art voice technology, enabling users to create realistic and customizable speech with minimal effort. This initiative highlights the synergy between cloud computing and AI models in addressing real-world applications in voice synthesis and cloning, providing value in various scenarios such as content creation, accessibility, and personalized communication.

Average processing price can be found in our benchmarks: https://blog.salad.com/openvoice/

Reference Architecture

  • Process Flow:
    • API Request: The FastAPI application receives a request containing all necessary parameters to initiate the text-to-speech or voice cloning task.
    • Task Execution: Based on the provided parameters, the process either performs text-to-speech conversion only or adds voice cloning to enhance the output.
    • Result Storage: Upon completion, the generated audio file is stored in an Azure storage container for easy access and retrieval.
  • Deployment:
    • The FastAPI application is containerized using Docker, ensuring a consistent and isolated environment for deployment.
    • This Docker container is then deployed on Salad compute resources to take advantage of GPU processing capabilities.
    • The Docker image itself is hosted in the public Salad Docker Container Registry for convenient access.

Folder Structure

Our full solution is stored here: git repo

openvoice-on-salad/
├─ src/
│  ├─ infrastructure/
│  │  ├─ main.bicep (Azure resources deployment)
│  ├─ python/
│  │  ├─ api/
│  │  │  ├─ inference/
│  │  │  │  ├─ dev/
│  │  │  │  │  ├─ setup
│  │  │  │  ├─ fast.py
│  │  │  │  ├─ other OpenVoice Python scripts
│  │  │  ├─ .dockerignore
│  │  │  ├─ Dockerfile

Local Development Setup and Testing

To make it easier for you to customize the script to fit your use case, we have made our git repo public. Start by setting up an efficient local development environment. Run the setup file to install all dependencies and download the OpenVoice checkpoints. These files help verify that the dependencies function correctly during the development phase. We also provide the complete contents of the setup script below.

The Setup Script:

#! /bin/bash

set -e

echo "setup the curent environment"
CURRENT_DIRECTORY="$( dirname "${BASH_SOURCE[0]}" )"
cd "${CURRENT_DIRECTORY}"
echo "current directory: $( pwd )"
echo "setup development environment for inference"
OPENVOICE_DIR="$( cd .. && pwd )"
echo "dev directory set to: ${OPENVOICE_DIR}"
echo "remove old virtual environment"
rm -rf "${OPENVOICE_DIR}/.venv"
echo "create new virtual environment"
python3.9 -m venv "${OPENVOICE_DIR}/.venv"
echo "activate virtual environment"
source "${OPENVOICE_DIR}/.venv/bin/activate"
echo "installing dependencies ..."

(cd "${OPENVOICE_DIR}" && pip install --upgrade pip && pip install -r requirements.txt)
pip install torch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1
# download the latest model
wget -P ${OPENVOICE_DIR} https://myshell-public-repo-hosting.s3.amazonaws.com/checkpoints_1226.zip
unzip ${OPENVOICE_DIR}/checkpoints_1226.zip -d ${OPENVOICE_DIR}
rm -r ${OPENVOICE_DIR}/checkpoints_1226.zip

To establish a clean virtual environment and install all the necessary libraries, you simply need to execute the script using this command:

bash dev/setup
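Once the script completes, you can activate the new environment and run a quick sanity check (on a machine without a CUDA-capable GPU the last value simply prints False):

source .venv/bin/activate
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"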

Voice cloning test with OpenVoice on Salad Cloud

To explore the capabilities of OpenVoice, we followed the instructions provided in the OpenVoice documentation and conducted our experiments on Salad Cloud using Salad Jupyter Lab. The same experiment can also be run on a local machine. We adapted the code to run as a single script. The code snippet below outlines the process, from initialization to inference, demonstrating how to control voice style and speed:

import os
import torch
from openvoice import se_extractor
from openvoice.api import BaseSpeakerTTS, ToneColorConverter

# Initialization
ckpt_base = 'checkpoints/base_speakers/EN'
ckpt_converter = 'checkpoints/converter'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
output_dir = 'outputs'

base_speaker_tts = BaseSpeakerTTS(f'{ckpt_base}/config.json', device=device)
base_speaker_tts.load_ckpt(f'{ckpt_base}/checkpoint.pth')

tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

os.makedirs(output_dir, exist_ok=True)

# Obtain Tone Color Embedding
source_se = torch.load(f'{ckpt_base}/en_default_se.pth').to(device)

reference_speaker = 'resources/example_reference.mp3'
target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, target_dir='processed', vad=True)

# Inference
save_path = f'{output_dir}/output_en_default.wav'

# Run the base speaker tts
text = "This audio is generated by OpenVoice."
src_path = f'{output_dir}/tmp.wav'
base_speaker_tts.tts(text, src_path, speaker='default', language='English', speed=1.0)

# Run the tone color converter
encode_message = "@MyShell"
tone_color_converter.convert(
    audio_src_path=src_path, 
    src_se=source_se, 
    tgt_se=target_se, 
    output_path=save_path,
    message=encode_message)

Speed and speaker emotion can be controlled with the base_speaker_tts.tts method. Available speaker emotions include: default, whispering, shouting, excited, cheerful, terrified, angry, sad, and friendly.
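For instance, to render the same sentence in a faster, cheerful style, you could call the same method with different arguments (an illustrative variation on the script above; the output file name is arbitrary):

base_speaker_tts.tts(text, f'{output_dir}/tmp_cheerful.wav', speaker='cheerful', language='English', speed=1.2)

Note that if you then run the tone color converter on a non-default style, the source_se embedding should correspond to that style; the downloaded checkpoints include an en_style_se.pth alongside en_default_se.pth for this purpose.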

Here is an example of the results:

Separating TTS and Voice Cloning

In our implementation, we first handle the text-to-speech conversion and then proceed to the optional voice cloning step. This way you can choose whether to use just the TTS part of the process or to add custom voice cloning on top of it:

Step 1: Text-to-Speech (TTS) Conversion

# Step 1: TTS with base speaker

base_speaker_tts.tts(text, tts_path, speaker=speaker_tone, language=language, speed=speed)
result_path = tts_path
result_file_name = tts_file_name

In this step, we use the BaseSpeakerTTS model from OpenVoice to convert the input text into speech. The tts method takes parameters such as the text, output path, speaker tone, language, and speed to generate the speech file. The resulting audio file is stored at tts_path, and its name is saved in result_file_name. Those paths are used later in the cloning step, if enabled, or to save the results back to Azure.
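The snippet assumes that tts_path and tts_file_name have been derived from the input file name earlier in the script. One way to construct them (hypothetical naming, not necessarily what the repo uses) is:

tts_file_name = f"{text_file.rsplit('.', 1)[0]}_tts.wav"   # e.g. story.txt -> story_tts.wav
tts_path = f"{tts_dir}/{tts_file_name}"                     # tts_dir is a local working directory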

Step 2: Voice Cloning

# Step 2: Voice Cloning
if clone is True:
    voice_file_path = f'{voice_dir}/{reference_voice}'
    # download the voice file
    if reference_voice not in os.listdir(voice_dir):
        blob_client = voices_blob.get_blob_client(reference_voice)
        voice_file = blob_client.download_blob()
        # save voice file to voice_file_path
        with open(voice_file_path, "wb") as my_blob:
            voice_file.readinto(my_blob)
    target_se, audio_name = se_extractor.get_se(voice_file_path, tone_color_converter, target_dir='.data/tmp', vad=True)
    encode_message = "@MyShell"
    clone_result_name = f"{text_file.rsplit('.', 1)[0]}_cloned.wav"
    clone_result_path = f"{clone_dir}/{clone_result_name}"
    tone_color_converter.convert(
        audio_src_path=tts_path, 
        src_se=source_se, 
        tgt_se=target_se, 
        output_path=clone_result_path,
        message=encode_message)
    result_path = clone_result_path
    result_file_name = clone_result_name

In the voice cloning step, we first check if cloning is enabled (clone is True). If so, we proceed to download the reference voice file from Azure storage if it's not already present locally. Using the se_extractor, we extract the tone color embedding (target_se) from the reference voice. Then, we use the ToneColorConverter to apply this tone color to the previously generated TTS audio, creating a cloned voice output. The final cloned audio is saved at clone_result_path, and its name is updated in result_file_name.
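Note that the converter also needs source_se, the tone color embedding of the base speaker. As in the standalone script shown earlier, it is loaded once from the base checkpoint:

# Base speaker tone color embedding (same as in the standalone script above)
source_se = torch.load(f'{ckpt_base}/en_default_se.pth').to(device)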

Integrating Azure Storage

To handle input text files, reference voices, and output results, we integrate Azure Blob Storage into our workflow. To do that, we created a storage account in Azure with several storage containers: input, voices, and output. This allows us to fetch input files dynamically and store the processed audio files for easy access. You can use any other storage provider you prefer.


from azure.storage.blob import ContainerClient

def azure_initiate(
    result_blob: str,
    storage_connection_string: str,
):
    azure_client = ContainerClient.from_connection_string(
        storage_connection_string, result_blob
    )
    return azure_client

# Initialize Azure Blob clients for input, voices, and results
input_blob = azure_initiate(input_container_name, connection_string)
voices_blob = azure_initiate(voices_container_name, connection_string)
result_blob = azure_initiate(output_container_name, connection_string)

# Download input text file from Azure Blob Storage
blob_client = input_blob.get_blob_client(text_file)
data = blob_client.download_blob().readall()
with open(f"{text_dir}/{text_file}", "wb") as f:
    f.write(data)

# (Optional) Download reference voice file from Azure Blob Storage
blob_client = voices_blob.get_blob_client(reference_voice)
voice_file = blob_client.download_blob()
with open(voice_file_path, "wb") as my_blob:
    voice_file.readinto(my_blob)

# Upload the resulting audio file to Azure Blob Storage
output_blob_client = result_blob.get_blob_client(result_file_name)
with open(result_path, "rb") as bytes_data:
    output_blob_client.upload_blob(bytes_data, overwrite=True)
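After the upload completes, a quick way to confirm that the result landed in the output container is to list its blobs (a verification step for development, not part of the API flow itself):

# Optional check during development: list everything in the results container
for blob in result_blob.list_blobs():
    print(blob.name)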

Packaging as an API

We've successfully tested our OpenVoice model, set up the logic for saving results, and configured our Azure storage account. The next step is to package and deploy our solution to the cloud.

For deployment, we chose Python FastAPI for its high performance and asynchronous support, which are essential for handling real-time data processing. FastAPI also provides automatic interactive documentation with a Swagger interface, making our API user-friendly and easy to navigate.

Our service includes the following API endpoints:

  • Process Endpoint: This endpoint initiates the text-to-speech or voice cloning process. It accepts parameters such as the Azure storage connection string, container names for input and output files, optional parameters for voice speed, language, and speaker tone, and the text file name for processing.
from typing import Optional

from fastapi import FastAPI, Query

app = FastAPI()

@app.post("/process")
async def process(
    connection_string: str = Query("DefaultEndpointsProtocol=https;AccountName=accountname;AccountKey=key;EndpointSuffix=core.windows.net", description="Azure Storage Connection String"),
    input_container_name: str = Query("requests", description="Container name for input files"),
    output_container_name: str = Query("results", description="Container name for output files"),
    voices_container_name: Optional[str] = Query("voices", description="Container name for voice files"),
    reference_voice: Optional[str] = Query(None, description="Voice file to be used as reference"),
    speed: float = Query(1.0, description="Speed of the voice"),
    language: str = Query("English", description="Language of the voice"),
    speaker_tone: str = Query("default", description="Tone of voice. Options: default, whispering, shouting, excited, cheerful, terrified, angry, sad, friendly"),
    text_file: str = Query(description="Text file to be used for TTS"),
):
    result = inference(connection_string, input_container_name, output_container_name, voices_container_name, reference_voice, speed, language, speaker_tone, text_file)
    return result

  • Health Check Endpoint: A simple endpoint to check the health of the service, ensuring it's operational and ready to accept requests.
@app.get("/hc") async def health_check(): return {"status": "healthy"}

Local Testing with Uvicorn:

Before deploying our FastAPI application to the cloud, it's crucial to test it locally to ensure everything is functioning as expected. For this purpose, we use Uvicorn, a lightning-fast ASGI server implementation that's ideal for running FastAPI applications. Uvicorn not only serves as a local development server but also plays a key role in running our application in a cloud environment.

If you haven't already installed Uvicorn, you can do so using the following command:

pip install uvicorn

If you used our setup script to install all the dependencies, Uvicorn should already be installed.

To start the FastAPI application locally with Uvicorn, run the following command from the directory that contains fast.py:

uvicorn fast:app --host 0.0.0.0 --port 8000

After running the command, you should see output in your terminal indicating that Uvicorn is running and serving your FastAPI application.

You can then access the interactive API documentation at http://localhost:8000/docs to test your endpoints.
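Alternatively, you can exercise the endpoints straight from the command line (the parameter values below are placeholders, and connection strings need to be URL-encoded when passed as query parameters):

curl http://localhost:8000/hc
curl -X POST "http://localhost:8000/process?connection_string=<url-encoded-connection-string>&input_container_name=requests&output_container_name=results&text_file=example.txt"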

By testing locally with Uvicorn, we can ensure our FastAPI application is ready for deployment and can smoothly transition to a cloud environment.

Containerizing the FastAPI Application with Docker

After testing and verifying our FastAPI application, we need to containerize it using Docker. This process will ensure that our application can be deployed reliably in the cloud. The Dockerfile provided below is configured to use the nvidia/cuda:11.7.1-devel-ubuntu22.04 base image, which is compatible with the NVIDIA CUDA toolkit, making it suitable for running our OpenVoice application with GPU acceleration.

The Dockerfile sets up the necessary environment variables for NVIDIA compatibility, installs essential packages, and sets the working directory to /app. It then copies the inference folder containing our application code into the container. The Dockerfile also installs specific versions of PyTorch, torchvision, and torchaudio that are compatible with CUDA 11.7, along with other required Python packages. Additionally, it downloads the AzCopy tool for efficient data transfer to and from Azure storage, and the OpenVoice cloning model from a public repository.

Here is the Dockerfile:

FROM nvidia/cuda:11.7.1-devel-ubuntu22.04 as cuda-base
# Set some environment variables for better NVIDIA compatibility
ENV PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

ENV DEBIAN_FRONTEND=noninteractive

# Set the working directory to /app
WORKDIR /app
# Copy the inference folder to /app/inference
COPY /inference /app/inference

# Install Python 3.9 and other required system packages
RUN apt-get update && apt-get install -y software-properties-common
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt-get update && apt-get install -y python3.9
RUN apt-get update && apt-get install -y curl wget ffmpeg unzip git python3-pip
# Update pip and install requirements
RUN pip install --upgrade pip
RUN pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117 --no-cache-dir
RUN pip install -r inference/requirements.txt
RUN pip install uvicorn

# Download AzCopy
RUN wget -O azcopy.tar.gz https://aka.ms/downloadazcopy-v10-linux && tar -xf azcopy.tar.gz --strip-components=1

# Move AzCopy to the /usr/bin directory
RUN mv azcopy /usr/bin/

WORKDIR /app/inference
# Download the cloning model
RUN wget https://myshell-public-repo-hosting.s3.amazonaws.com/checkpoints_1226.zip && \
    unzip -o checkpoints_1226.zip && \
    rm -r checkpoints_1226.zip

CMD ["uvicorn", "fast:app", "--host", "::", "--port", "80"]

By following this Dockerfile, our FastAPI application is prepared for deployment to a cloud environment with Docker support, ensuring consistent performance and compatibility. We use Docker container registry to store the image, but you can use any container registry you prefer.
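If you build a customized image instead of using our prebuilt one, the standard Docker workflow applies (the registry name and tag below are placeholders; run the build from the api folder that contains the Dockerfile and the inference directory):

docker build -t <your-registry>/openvoice-api:1.0.0 .
docker push <your-registry>/openvoice-api:1.0.0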

Deploying OpenVoice Application to Salad

We've reached the final and most exciting stage of our project: deploying our solution to Salad. If you're not making any additional customizations, you can directly proceed to this step.

Deploying your containerized FastAPI application to Salad's GPU Cloud is a very efficient and cost-effective way to run your text-to-speech solutions. Here's how to deploy the solution using the Salad portal:

  1. Create an Account: Sign up for an account on Salad's Portal if you haven't already.
  2. Create an Organization: Once logged in, set up your organization within the Salad platform to manage your deployments and resources.
  3. Deploy Container Group: Go to the "Container Groups" section in the Salad portal and select "Deploy a Container Group" to begin deploying your FastAPI application to Salad's cloud infrastructure.

We now need to set up all of our container group parameters:

Configure Container Group:

  1. Create a unique name for your Container group

  2. Pick the Image Source: In our case we are using a public Salad registry. Click Edit next to Image source. Under image name paste the image path: saladtechnologies/openvoice-api:1.0.0
    If you are using your custom solution, specify your image location.

  3. Replica count: It is recommended to use 3 or more replicas for production. We will use just 1 for testing.

  4. Pick compute resources: This is the best part. Pick how much CPU, RAM, and GPU you want to allocate to your process. The prices are very low compared to other cloud solutions, so be creative. The TTS process can run on any GPU. Check out our benchmark to choose which GPU best fits your needs.

  5. Optional Settings: Salad gives you some great options like health check probe, external logging and passing environment variables.

  6. Container Gateway. Click “Edit“ next to it, check “Enable Container Gateway“ and set port to 80:


In addition, you can add an extra layer of security by turning Authentication on. If you enable it, you will need to provide your personal token together with the API call. Your token can be found here: https://portal.salad.com/api-key
With everything in place, deploying your FastAPI application on Salad is just a few clicks away. By taking advantage of Salad's platform, you can ensure that your text-to-speech API runs on reliable infrastructure that can handle intensive tasks at a fraction of the cost.
Now check “AutoStart container group once image is pulled“, and hit “Deploy“. We are all set; let's wait until our solution deploys and then test it.

Benefits of Using Salad:

  • Affordability: Salad's GPU cloud solutions are competitively priced compared to other cloud providers, enabling you to access more resources for your application at a lower cost.
  • User-Friendly Interface: Salad prioritizes user experience, offering an intuitive interface that simplifies the deployment and management of cloud-based applications.
  • Comprehensive Documentation and Support: Salad offers extensive documentation to guide you through deployment, configuration, and troubleshooting, complemented by a dedicated support team ready to assist you whenever required.

Test Full Solution deployed to Salad

Once your solution is deployed on Salad, the next step is to interact with your FastAPI application using its public endpoint. Salad provides you with a deployment URL, which allows you to send requests to your API using Salad's infrastructure, just as you would locally.

You can use this URL to access your FastAPI application's Swagger page, which is now hosted in the cloud. Replace localhost in your local URL with the provided deployment URL to access the Swagger page. For example:

https://tomato-cayenne-zjomiph125nsc021.salad.cloud/docs

You will see a Swagger page similar to this:

On the Swagger page, you can interact with your API by providing the required parameters to run the process. Some parameters are optional, and you may not need to override them if you're using the same Azure container names. Additionally, some parameters are only relevant for voice cloning, so you can skip them if you're only running text-to-speech (TTS). Note that we are using Azure storage in the current solution, so make sure you deploy your Azure resources in advance. If you want to use another storage provider, check out the full solution documentation. Here is a full list of the arguments:

  • connection_string: Azure Storage Connection String (e.g., "DefaultEndpointsProtocol=https;AccountName=accountname;AccountKey=key;EndpointSuffix=core.windows.net")
  • input_container_name: Container name for input files (e.g., "requests")
  • output_container_name: Container name for output files (e.g., "results")
  • voices_container_name: (Optional) Container name for voice files (e.g., "voices")
  • reference_voice: (Optional) Voice file to be used as a reference
  • speed: Speed of the voice (default: 1.0)
  • language: Language of the voice (default: "English")
  • speaker_tone: Tone of voice. Options include "default", "whispering", "shouting", "excited", "cheerful", "terrified", "angry", "sad", "friendly" (default: "default")
  • text_file: Text file to be used for TTS

By providing these parameters, you can run the TTS or voice cloning process through your FastAPI application deployed on Salad Cloud. Now hit “execute” and wait for a reply.
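The same request can also be sent from the command line against the deployed gateway (the URL reuses the example deployment URL above, the parameter values are placeholders, and the Salad-Api-Key header is only needed if you enabled authentication on the Container Gateway):

curl -X POST "https://tomato-cayenne-zjomiph125nsc021.salad.cloud/process?connection_string=<url-encoded-connection-string>&text_file=example.txt&reference_voice=example_reference.mp3" -H "Salad-Api-Key: <your-api-key>"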

We can see that it took 18 seconds to process a 1,000-word file doing both TTS and cloning. For the test we used Stable Diffusion-compatible compute. Once you see a 200 response, your output audio file should be available in Azure.