Collect and Transcribe Millions of Hours of YouTube Videos on SaladCloud!

To collect and transcribe massive videos from YouTube using SaladCloud, the process typically involves three key steps:

Data Collection from YouTube: Gather relevant video URLs from specific channels or based on particular topics, creating video metadata files. These files include essential details such as video title, URL, duration and codec, etc.

Data Processing on SaladCloud: Input the gathered video URLs into a transcription pipeline, comprising a job queue such as AWS SQS, accompanied by a job filler that monitors progress and injects jobs regularly, and numerous Salad nodes, each equipped with dedicated GPUs for transcribing. Additionally, utilize a NoSQL database like AWS DynamoDB to store job results, while generated assets can be stored in Cloud Storage platforms such as Cloudflare R2.

Validation, Analysis and Delivery: Validate outcomes using both the metadata files and the job results. The videos that failed to be transcribed can be further analyzed and may be refilled to the pipeline for reprocessing. Then, deliver the final output, potentially including index files containing YouTube video URLs paired with their corresponding transcript URLs on Cloudflare. This facilitates easy access to transcribed content for further analysis and utilization.

The three steps can either constitute a single end-to-end pipeline or operate as three independent steps, depending on the specific business requirements.

Transcribing YouTube videos doesn't inherently require pre-downloading and storage of the videos. Simultaneous downloading and transcribing are feasible. Additionally, if a YouTube video supports an audio codec, efficiency can be improved by exclusively downloading its audio. This approach not only reduces bandwidth demands but also significantly saves time, considering that video sizes are typically ten times larger than their corresponding audio files.

By leveraging the solution outlined in the documentation, we've managed to transcribe over 1 million hours of YouTube videos in just a week. This was made possible by utilizing a resource group of 100 low to mid-end Salad nodes, each equipped with 2vCPU, 12GB RAM, and a GPU with 8GB or more VRAM.

STEP 1: Data Collection from YouTube

We need to develop a job generator capable of accepting named channels or playlists, as well as keywords, as input parameters. This generator will then produce multiple channel metadata files containing detailed information such as video title, URL, duration, codec, and other relevant data. Additionally, it will create an index file that maps each channel to its corresponding metadata file, ensuring organized and accessible storage of the collected data.

The index file comprises the following details for each channel: channel title, ID, URL, playlist number, total video number, total video duration (hour), and a link to the corresponding metadata file.

A metadata file offers extensive details of videos within a channel, encompassing URL, title, duration (second), codec, and more. Extracting codec information for each video requires additional operations and time. Hence, we opt to randomly select a subset of videos to collect this information. Through our testing, we've noted that all YouTube videos encountered thus far support audio codecs.

Downloading video metadata from a channel is time-consuming, ranging from a few minutes to several hours, depending on the channel's size. Additionally, there are constraints on concurrent access from a single IP address to YouTube.

Through testing, we've implemented and executed the job generator on a standard computer, achieving a collection rate of approximately 1 million video URLs per day using combined solutions over a single IP address. To exceed this limit or to further optimize time efficiency, leveraging the global distributed network provided by SaladCloud is recommended. Building a job generation pipeline over a small pool of Salad nodes distributed across the global Internet can significantly accelerate the process.

Build the Job Generator

We can utilize a variety of tools and libraries to construct the job generator:

This tool offers all the necessary functionalities for interacting with YouTube, such as searching channels, playlists, and videos based on keywords, retrieving all playlists from a particular channel, and uploading/downloading videos.

To utilize the API, we must possess a Google Account to request an API key and choose a client library, such as Python. While the API is free to use, each GCP project enabled with the YouTube Data API has a default quota allocation of 10,000 units per day. All API requests, including invalid requests, incur a quota cost.

Now, let's calculate the quota cost for collecting video URLs from a search on a specific keyword:

AssumptionQuota CostExplanations
Search and get 100 channels based on a keyword200Executing a search.list needs 100 units and it can return a maximum of 50 items:
100 x 2
Get 10,000 playlists from the 100 channels, assuming each channel contains 100 playlists200Executing a playlist.list requires 1 unit and it can return a maximum of 50 items:
100 x 1 x 2
Get 2,000,000 video URLs from the 10,000 playlists of the 100 channels, assuming each playlist contains 200 videos40000Executing a video.list requires 1 unit and it can return a maximum of 50 items:
100 x 100 x 1 x 4

The default quota allocation of 10,000 units per day may prove insufficient if we need to collect numerous video URLs in a short timeframe. We can create several GCP projects with each providing 10,000 units per day, or we could also complete an audit and request an additional quota following the process. Here we provide a combined solution by using both YouTube Data API and Pytube.

Pytube, a lightweight library for downloading YouTube videos

Pytube bypasses the YouTube Data API by directly scraping the YouTube website, thereby avoiding any quota limits imposed by the API. However, it does not offer access to the complete range of functionalities available through the YouTube API. For instance, it lacks the ability to perform searches with diverse request parameters such as location and relevance language. Nevertheless, Pytube excels at retrieving all video URLs and associated metadata (such as title, codec, video or audio format, etc.) within a specified playlist, and filtering and downloading streams, the tasks that would otherwise consume the most significant portion of the quota cost units if executed using the YouTube Data API.

A combined solution enables us to harness the strengths of both tools while alleviating their respective limitations. Utilizing the YouTube Data API and its Python Client, we can search for channels based on keywords and retrieve all playlists within each channel. Subsequently, Pytube can be utilized to download all video metadata from these playlists.

It is noteworthy that Pytube does not have a limit on API quota, but it is still constrained by the number of concurrent accesses from a single IP address imposed by YouTube.

The reference design for the job generator

We use a two-tier system for managing our collection: first, we create a metadata file for each channel, containing comprehensive details of all videos within the channel. Then, we maintain an index file that consolidates all the collected channels, with each channel linked to its corresponding metadata file.

This helps prevent duplicate channels by checking the index file when processing a new channel. However, there may still be instances of duplicated videos stemming from different channels or playlists. For example, a playlist within Channel A might reference a video originally uploaded on Channel B. This scenario requires additional consideration to handle duplicated videos across various channels or playlists effectively.

Downloading video metadata from a playlist can be I/O-bound and time-consuming, taking from a few minutes to several hours, especially for playlists with numerous videos. To expedite this process, we employ multiple processes to concurrently handle the playlists within a channel.

For instance, if a channel contains 100 playlists, we divide and store these playlists into 10 temporary files. Each file serves as input for a separate process, responsible for downloading video metadata from its designated input file and saving the results to a corresponding temporary output file. Once all processes have completed their tasks, we merge the temporary output files into the final metadata file for the channel and update the index file accordingly.

We conducted tests with different numbers of processes. When the number was over 10, the IP address was promptly blocked by YouTube. Despite this limitation, we managed to collect approximately 1 million video URLs, equivalent to around 300,000 hours of video content, per day using the aforementioned method.

It is important to note that not all collected video URLs remain valid over time as new videos are continuously being added to the channels and playlists that we have already collected. Some videos may experience changes in permissions, being publicly accessible today but removed or restricted to membership-only access or accessible only to logged-in users tomorrow. Additionally, some videos may be static pictures or lengthy monitoring videos without any meaningful content. Therefore, it's crucial to filter the content and inject the valid URLs into the transcription pipeline. Moreover, implementing robust mechanisms in the code to handle various exceptions is essential to ensure resilience.

Please refer to the example code for the job generator implementation.

Build the job generation pipeline

To further enhance the efficiency of the job generation process, we can divide the job generator into several components and build a job generation pipeline:

Channel Manager: This component manages the index file and supports operations such as search, creation, and update. It needs to be publicly accessible and can be implemented using AWS Lambda backed by AWS DynamoDB.

Channel Collector: This component gathers relevant channels based on specific topics. It interacts with the Channel Manager to verify the existence of a channel and creates a new record if it doesn’t already exist. Once validated, it sends the channel URL to the job queue, such as AWS SQS, for further processing.

Metadata Downloader: This component runs on a pool of Salad nodes distributed across the global Internet. Each node retrieves a channel URL from the job queue, gathers all playlists within the channel, and downloads video metadata from these playlists using the multiprocessing approach. Upon completion, it creates the channel metadata file in Cloudflare, updates the corresponding record through the Channel Manager, and removes the processed job from the queue.

With this solution, we can effectively harness the global distributed network provided by SaladCloud, significantly accelerating the job generation process by tens or even hundreds of times.

STEP 2: Data Processing on SaladCloud

The transcription pipeline consists of:

YouTube and its CDN: YouTube utilizes a Global Content Delivery Network (CDN) to distribute content efficiently. The CDN edge servers are strategically dispersed across various geographical locations, serving content in close proximity to users, and enhancing the speed and performance of applications.

GPU Resource Pool and Global Distributed Network: Hundreds of Salad nodes, equipped with dedicated GPUs, are utilized for tasks such as downloading/uploading, pre-processing/post-processing and transcribing. These nodes assigned to GPU workloads are positioned within a global, high-speed distributed network infrastructure, and can effectively align with YouTube’s Global CDN, ensuring optimal system throughput.

Job Queue System: The Salad nodes retrieve jobs via the message queue like AWS SQS, providing the video URLs and where to store the generated assets (Cloudflare URLs).

Job Filler: It generates jobs based on the index file and metadata files during the STEP 1, monitors the transcription pipeline process and ensures a consistent supply of tasks by replenishing them regularly.

Cloud Storage: Generated assets stored in Cloudflare R2, which is AWS S3-compatible and incurs zero egress fees.

Job Recording System: Job results, including YouTube video URLs, audio length, processing time, Cloudflare URLs, word count etc., are stored in NoSQL databases like AWS DynamoDB.

Job Filler

The job filler must be versatile in supporting multiple job injection strategies. It should be capable of injecting millions of hours of video URLs to the job queue instantly and remains idle until the pipeline completes all tasks. However, a potential issue with this approach arises when certain nodes in the pipeline experience downtime and fail to process and remove jobs from the queue. Consequently, these jobs may reappear for other nodes to attempt processing, potentially causing earlier injected jobs to be processed last, which may not be suitable for certain use cases.

Another approach is to continuously monitor the size of the job queue. The job filler will inject new jobs only when there are nearly no available tasks left in the queue. This method ensures that the pipeline can complete time-sensitive tasks efficiently. The job filler must maintain high availability, as any failure could potentially cause the pipeline to run idle, leading to delays in task completion.

Here is an example: Every day, we initially inject a large batch of jobs into the pipeline and monitor progress. When the queue is nearly empty, we start injecting only a few jobs and aim to keep the number of available jobs in the queue as low as possible for a period of time. This strategy allows us to prioritize completing older jobs before injecting a massive influx of new ones.

For time-sensitive tasks, implementing autoscaling is also beneficial. By continually monitoring the job count in the queue, the job filler dynamically adjusts the number of Salad node groups. This adaptive approach ensures that specific quantities of tasks can be completed within a predefined timeframe while also offering the flexibility to manage costs efficiently during periods of reduced demand.

The reference design for the job filler

Job Filter: It filters the original video URLs based on video length and channels, removing unqualified content such as videos that are too short or too long, purely musical, or in a non-English language, etc. Subsequently, generates a specified number of hours of tasks from the index file and metadata files, and adds them to the local job queue. The job filter utilizes a cursor to track progress and generate new tasks after the position of the last generated task, thereby supporting various job injection strategies.

Job Uploader: Operates on multiple threads to boost throughput; each thread retrieves jobs from the local job queue, and uses AWS SendMessageBatch to transmit up to 10 jobs simultaneously.

Through testing, the job filler has demonstrated the capability to inject up to 1 million jobs to AWS SQS within a single hour.

Below are the examples of jobs sent from the job filler to the job queue:

Job Queue System Settings

We set the AWS SQS Visibility Timeout to 1 hour. This allows sufficient time for downloading, chunking, buffering, and processing by most of the nodes in SaladCloud until final results are merged and uploaded to Cloudflare. If a node fails to process and remove polled jobs within the hour, the jobs become available again for other nodes to process.

Additionally, the AWS SQS Retention Period is set to 2 days. Once the message retention quota is reached, messages are automatically deleted. This measure prevents jobs from lingering in the queue for an extended period without being processed for any reason, thereby avoiding wastage of node resources.

Node Implementation on SaladCloud (using Parakeet TDT 1.1B)

Within each node in the GPU resource pool in SaladCloud, we follow best practices by utilizing two processes: the Inference Process, dedicated to GPU-based transcription inference; and the Benchmark Worker Process, focused on I/O- and CPU-bound tasks, such as downloading/uploading, pre-processing, and post-processing.

Inference Process

The transcription for audio involves resource-intensive operations on both CPU and GPU, including format conversion, re-sampling, segmentation, transcription and merging. The more CPU operations involved, the lower the GPU utilization experienced.

While having the capacity to fully leverage the CPU, multiprocessing or multithreading-based concurrent inference over a single GPU might limit optimal GPU cache utilization and impact performance. This is attributed to each inference running at its own layer or stage. The multiprocessing approach also consumes more VRAM as every process requires a CUDA context and loads its own model into GPU VRAM for inference.

Following best practices, we delegate more CPU-bound pre-processing and post-processing tasks to the benchmark worker process, allowing the inference process to concentrate on GPU operations and run on a single thread. It begins by loading the model, warming up the GPU, and then listens on a TCP port by running a Python/FastAPI app on a Unicorn server. Upon receiving a request, it invokes the transcription inference and promptly returns the generated assets.

Batch inference can be employed to enhance performance by effectively leveraging GPU cache and parallel processing capabilities. However, it requires more VRAM and might delay the return of the result until every single sample in the input is processed. The choice of using batch inference and determining the optimal batch size depends on model, dataset, hardware characteristics and use case; and requires experimentation and ongoing performance monitoring.

Benchmark Worker Process

The benchmark worker process primarily handles various I/O- and CPU-bound tasks, such as downloading/uploading, pre-processing, and post-processing.

The Global Interpreter Lock (GIL) in Python permits only one thread to execute Python code at a time within a process. While the GIL can impact the performance of multithreaded applications, certain operations remain unaffected, such as I/O operations and calling external programs.

To maximize performance with better scalability, we adopt multiple threads to concurrently handle various tasks, with two queues created to facilitate information exchange among these threads.

DownloaderIn most cases, we require 3 threads to concurrently pull jobs from the job queue and download audio files from YouTube, and efficiently feed the inference server. It also performs the following pre-processing steps:
1)Removal of bad audio files.
2)Format conversion from Mp4A to MP3.
3)Chunking very long audio into 10-minute clips.
4)Metadata extraction (URL, file/clid ID, length).

The pre-processed audio files are stored in a shared folder, and their metadata are added to the transcribing queue simultaneously.

To prevent the download of excessive audio files, we enforce a maximum length limit on the transcribing queue. When the queue reaches its capacity, the downloader will sleep for a while.
CallerIt reads metadata from the transcribing queue, and subsequently sends a synchronous request, including the audio filename, to the inference server.

Upon receiving the response, it forwards the generated texts along with statistics, and the transcribed audio filename to the reporting queue.

The simplicity of the caller is crucial as it directly influences the inference performance.
ReporterThe reporter, upon reading the reporting queue, deletes the processed audio files from the shared folder and manages post-processing tasks, including merging results and calculating real-time factor and word count.

Eventually, it uploads the generated assets to Cloudflare, reports the job results to AWS DynamoDB and removes the processed jobs from AWS SQS.

By running two processes to segregate GPU-bound tasks from I/O and CPU-bound tasks, and fetching and preparing the next audio clips concurrently and in advance while the current one is still being transcribed, we can eliminate any waiting period. After one audio clip is completed, the next is immediately ready for transcription. This approach not only reduces the overall processing time for batch jobs but also leads to even more significant cost savings.

Please refer to the example code for the node implementation.

Optimization of Performance and Throughput

You can define a resource group on SaladCloud that encompasses various GPU types. These resources are distributed across the global Internet, each with varying bandwidth and latency to specific endpoints. Moreover, the performance of each node and the number of running replicas within a group may fluctuate over time due to their shared nature.

To optimize node performance and system throughput, consider the following best practices:

Define Initialization Time Threshold: Establish a threshold for initialization time, encompassing tasks such as model loading and warm-up. If a node's initialization time exceeds this threshold, such as 300 seconds, the code should exit with a status of 1. SaladCloud will allocate a new node in response, ensuring that nodes are adequately prepared to execute your application.

Monitor Real-Time Performance: Continuously monitor application performance within a specific time window. If performance metrics, such as the real-time factor (calculated as audio length divided by processing time, serving as an efficient measure of transcription performance), fall below a predefined threshold, the code should exit with a status of 1, prompting reallocation. This practice ensures that nodes remain in an optimal state for application execution. For example, most nodes can achieve a real-time factor of 80 or higher for long audio; if the real-time factor of a node falls below 20 during a monitoring period, it should be removed from the resource pool immediately.

Adopt Adaptive Algorithms: Recognize that nodes may vary in GPU types and network performance. High-performance nodes can handle longer chunks with a large transcription queue (100), whereas low-performance nodes are better suited for processing shorter chunks with a smaller transcription queue (50). By employing adaptive algorithms, resource utilization can be optimized, while preventing low-performance nodes from impeding overall progress.

Single Node Test

Before deploying the application container image on a large scale on SaladCloud, we can build a specialized application image with JupyterLab and conduct the single-node test across various types of Salad nodes.

With JupyterLab’s terminal, we can log into a container instance running on SaladCloud, gaining OS-level access. This enables us to conduct various tests and optimize the configurations and parameters of the model and application. These include:

Measure the time to download the model and then load it into the GPU, including warm-up, to define the appropriate health check count.

Analyze the VRAM usage variations during the inference process based on long audio lengths and different batch sizes, and select the best performing or most cost-effective GPU types for the application.

Determine the minimum number of downloader threads and maximum length limit of the transcribing queue to efficiently feed the inference server.

Identify and handle various possible exceptions during application runtime, etc.

STEP 3: Validation, Delivery and Analysis

Job results, which include video URL, audio length, processing time, Cloudflare URL, and word count for each video, serve multiple purposes.

When combined with the index file and metadata files, the job results become instrumental in validating the pipeline output. They provide crucial information such as processing time, word count and Cloudflare URL for each video, as well as identify how many and which videos failed to be transcribed. In cases where videos failed to be transcribed, we can analyze the reasons and may refill them into the transcription pipeline for reprocessing.

Job results also provide the information like how many nodes have been actually allocated and their GPU types, and how many transcription jobs have been done by each node and each GPU type, and the corresponding real-time factors. Leveraging this information enables us to identify the best-performing and most cost-effective GPU types for specific tasks, thereby optimizing our future applications for enhanced efficiency and performance.

The final output comprises index files containing mappings from YouTube URLs to Cloudflare URLs, organized according to specific channels, topics, and title keywords. These index files adhere to a predefined schema and are intended to serve as input for subsequent pipelines, facilitating further analysis.

job_generator.pyGenerates metadata files in the directory "./metadata/" and updates the index file located at "./index.csv" based on named channels and search queries.
Creates temporary input and output files in the directory "./temp_files/” for multiprocessing downloads.

The file “topics.csv” contains a list of popular topics.

Please ensure that the environment variable GCP_API_KEY is set:
export GCP_API_KEY=’######################’
job_queue_access.pyProvides the access to the job queue and is imported by Job Filler, Job Cleaner and Job Monitor.

Please ensure that the following environment variables are set:
export AWS_ID='######################'
export AWS_KEY='######################'
export QUEUE_URL=''
job_filler.pySupports two job injection strategies: one-time injection and send-monitor-replenish.

Need access to the index file located at "./index.csv” and metadata files in the directory "./metadata/”.

Keep the cursor at the file “./last_position” to track progress, and send new tasks to the job queue after the position of the last generated task.
job_queue_cleaner.pyPurge the job queue mainly for tests.
job_queue_monitor.pyMonitor the available jobs in the queue and forecast when the queue will be empty.
The inference process and the benchmark worker process.
node/Dockerfile.parakeet.yt1mThe dockerfile to build the image.

Build the image:
docker image build -t -f Dockerfile.parakeet.yt1m .

Run the image and needs 10 environment variables:
docker run --rm --gpus all
--env AWS_ID=$AWS_ID

Access the Job Queue System (AWS SQS):
export AWS_ID='######################'
export AWS_KEY='######################'
export QUEUE_URL=''

Access the Cloud Storage (Cloudflare R2):
export CLOUDFLARE_ID='######################'
export CLOUDFLARE_KEY='######################'

Access the Job Reporting System (AWS Lambda to AWS DynamoDB):
export BENCHMARK_ID="######################"
export REPORTING_AUTH_HEADER="######################"
export REPORTING_API_KEY="######################"