ZOO Digital provides end-to-end localization and media services to adapt original TV and movie content to different languages, regions, and cultures. It makes globalization easier for the world’s best content creators. Trusted by the biggest names in entertainment, ZOO Digital delivers high-quality localization and media services at scale, including dubbing, subtitling, scripting, and compliance.
Typical localization workflows require manual speaker diarization, whereby an audio stream is segmented based on the identity of the speaker. This time-consuming process must be completed before content can be dubbed into another language. With manual methods, a 30-minute episode can take between 1–3 hours to localize. Through automation, ZOO Digital aims to achieve localization in under 30 minutes.
In this post, we discuss deploying scalable machine learning (ML) models for diarizing media content using Amazon SageMaker, with a focus on the WhisperX model.
Background
ZOO Digital’s vision is to provide faster turnaround of localized content. This goal is bottlenecked by the manually intensive nature of the exercise, compounded by the small workforce of skilled people who can localize content manually. ZOO Digital works with over 11,000 freelancers and localized over 600 million words in 2022 alone. However, the supply of skilled people is being outstripped by the increasing demand for content, requiring automation to assist with localization workflows.
With an objective to accelerate content localization workflows through machine learning, ZOO Digital engaged AWS Prototyping, an investment program by AWS to co-build workloads with customers. The engagement focused on delivering a functional solution for the localization process, while providing hands-on training to ZOO Digital developers on SageMaker, Amazon Transcribe, and Amazon Translate.
Customer challenge
After a title (a movie or an episode of a TV series) has been transcribed, speakers must be assigned to each segment of speech so that they can be correctly assigned to the voice artists who are cast to play the characters. This process is called speaker diarization. ZOO Digital faces the challenge of diarizing content at scale while remaining economically viable.
Solution overview
In this prototype, we stored the original media files in a specified Amazon Simple Storage Service (Amazon S3) bucket. This S3 bucket was configured to emit an event when new files are detected within it, invoking an AWS Lambda function. For instructions on configuring this trigger, refer to the tutorial Using an Amazon S3 trigger to invoke a Lambda function. Subsequently, the Lambda function invoked the SageMaker endpoint for inference using the Boto3 SageMaker Runtime client, as sketched in the following example.
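The following Lambda handler is a minimal sketch of that invocation, assuming the asynchronous endpoint name is supplied through an environment variable and the S3 event notification carries the bucket and key of the newly uploaded media file; the endpoint name, environment variable, and content type are illustrative assumptions rather than values from the original solution.
import json
import os
import urllib.parse

import boto3

# Client for invoking SageMaker endpoints
sagemaker_runtime = boto3.client("sagemaker-runtime")

# Name of the asynchronous endpoint (illustrative; set this for your deployment)
ENDPOINT_NAME = os.environ.get("ENDPOINT_NAME", "whisperx-async-endpoint")

def lambda_handler(event, context):
    """Invoke the asynchronous SageMaker endpoint for each newly uploaded media file."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # Asynchronous endpoints read their input directly from S3
    response = sagemaker_runtime.invoke_endpoint_async(
        EndpointName=ENDPOINT_NAME,
        InputLocation=f"s3://{bucket}/{key}",
        ContentType="application/octet-stream",  # assumption; match what your inference handler expects
    )

    # The result will appear at this S3 location once inference completes
    return {"statusCode": 200, "body": json.dumps({"outputLocation": response["OutputLocation"]})}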
The WhisperX model, based on OpenAI’s Whisper, performs transcription and diarization for media assets. It’s built upon the Faster Whisper reimplementation, offering up to four times faster transcription with improved word-level timestamp alignment compared to Whisper. Additionally, it introduces speaker diarization, not present in the original Whisper model. WhisperX uses the Whisper model for transcription, the Wav2Vec2 model to enhance timestamp alignment (ensuring synchronization of transcribed text with audio timestamps), and the pyannote model for diarization. FFmpeg is used for loading audio from source media, supporting various media formats. The clean and modular model architecture allows flexibility, because each component of the model can be swapped out as needed in the future. However, it’s essential to note that WhisperX lacks full management features and isn’t an enterprise-level product. Without maintenance and support, it may not be suitable for production deployment.
In this collaboration, we deployed and evaluated WhisperX on SageMaker, using an asynchronous inference endpoint to host the model. SageMaker asynchronous endpoints support upload sizes up to 1 GB and incorporate auto scaling features that efficiently mitigate traffic spikes and save costs during off-peak times. Asynchronous endpoints are particularly well-suited for processing large files, such as movies and TV series in our use case.
The following diagram illustrates the core elements of the experiments we conducted in this collaboration.
In the following sections, we delve into the details of deploying the WhisperX model on SageMaker, and evaluate the diarization performance.
Download the model and its components
WhisperX is a system that includes multiple models for transcription, forced alignment, and diarization. For smooth SageMaker operation without the need to fetch model artifacts during inference, it’s essential to pre-download all model artifacts. These artifacts are then loaded into the SageMaker serving container during initiation. Because these models aren’t directly accessible, we offer descriptions and sample code from the WhisperX source, providing instructions on downloading the model and its components.
WhisperX uses six models: the Faster Whisper transcription model, a voice activity detection (VAD) segmentation model, the Wav2Vec2 alignment model, the pyannote speaker diarization pipeline, and the speaker segmentation and speaker embedding models that the pyannote pipeline depends on.
Most of these models can be obtained from Hugging Face using the huggingface_hub library. We use the following download_hf_model() function to retrieve these model artifacts. An access token from Hugging Face, generated after accepting the user agreements for the pyannote models, is required:
import huggingface_hub
import yaml
import torchaudio
import urllib.request
import os

CONTAINER_MODEL_DIR = "/opt/ml/model"
WHISPERX_MODEL = "guillaumekln/faster-whisper-large-v2"
VAD_MODEL_URL = "https://whisperx.s3.eu-west-2.amazonaws.com/model_weights/segmentation/0b5b3216d60a2d32fc086b47ea8c67589aaeb26b7e07fcbe620d6d0b83e209ea/pytorch_model.bin"
WAV2VEC2_MODEL = "WAV2VEC2_ASR_BASE_960H"
DIARIZATION_MODEL = "pyannote/speaker-diarization"

def download_hf_model(model_name: str, hf_token: str, local_model_dir: str) -> str:
    """
    Fetches the provided model from HuggingFace and returns the subdirectory it is downloaded to
    :param model_name: HuggingFace model name (and an optional version, appended with @[version])
    :param hf_token: HuggingFace access token authorized to access the requested model
    :param local_model_dir: The local directory to download the model to
    :return: The subdirectory within local_model_dir that the model is downloaded to
    """
    model_subdir = model_name.split('@')[0]
    huggingface_hub.snapshot_download(model_subdir, token=hf_token, local_dir=f"{local_model_dir}/{model_subdir}", local_dir_use_symlinks=False)
    return model_subdir
The VAD model is fetched from Amazon S3, and the Wav2Vec2 model is retrieved from the torchaudio.pipelines module. Based on the following code, we can retrieve all the models’ artifacts, including those from Hugging Face, and save them to the specified local model directory:
def fetch_models(hf_token: str, local_model_dir="./models"):
    """
    Fetches all required models to run WhisperX locally without downloading models every time
    :param hf_token: A huggingface access token to download the models
    :param local_model_dir: The directory to download the models to
    """
    # Fetch Faster Whisper's Large V2 model from HuggingFace
    download_hf_model(model_name=WHISPERX_MODEL, hf_token=hf_token, local_model_dir=local_model_dir)

    # Fetch WhisperX's VAD Segmentation model from S3
    vad_model_dir = "whisperx/vad"
    if not os.path.exists(f"{local_model_dir}/{vad_model_dir}"):
        os.makedirs(f"{local_model_dir}/{vad_model_dir}")
    urllib.request.urlretrieve(VAD_MODEL_URL, f"{local_model_dir}/{vad_model_dir}/pytorch_model.bin")

    # Fetch the Wav2Vec2 alignment model
    torchaudio.pipelines.__dict__[WAV2VEC2_MODEL].get_model(dl_kwargs={"model_dir": f"{local_model_dir}/wav2vec2/"})

    # Fetch pyannote's Speaker Diarization model from HuggingFace
    download_hf_model(model_name=DIARIZATION_MODEL,
                      hf_token=hf_token,
                      local_model_dir=local_model_dir)

    # Read in the Speaker Diarization model config to fetch models and update with their local paths
    with open(f"{local_model_dir}/{DIARIZATION_MODEL}/config.yaml", 'r') as file:
        diarization_config = yaml.safe_load(file)

    embedding_model = diarization_config['pipeline']['params']['embedding']
    embedding_model_dir = download_hf_model(model_name=embedding_model,
                                            hf_token=hf_token,
                                            local_model_dir=local_model_dir)
    diarization_config['pipeline']['params']['embedding'] = f"{CONTAINER_MODEL_DIR}/{embedding_model_dir}"

    segmentation_model = diarization_config['pipeline']['params']['segmentation']
    segmentation_model_dir = download_hf_model(model_name=segmentation_model,
                                               hf_token=hf_token,
                                               local_model_dir=local_model_dir)
    diarization_config['pipeline']['params']['segmentation'] = f"{CONTAINER_MODEL_DIR}/{segmentation_model_dir}/pytorch_model.bin"

    with open(f"{local_model_dir}/{DIARIZATION_MODEL}/config.yaml", 'w') as file:
        yaml.safe_dump(diarization_config, file)

    # Read in the Speaker Embedding model config to update it with its local path
    speechbrain_hyperparams_path = f"{local_model_dir}/{embedding_model_dir}/hyperparams.yaml"
    with open(speechbrain_hyperparams_path, 'r') as file:
        speechbrain_hyperparams = file.read()
    speechbrain_hyperparams = speechbrain_hyperparams.replace(embedding_model_dir, f"{CONTAINER_MODEL_DIR}/{embedding_model_dir}")
    with open(speechbrain_hyperparams_path, 'w') as file:
        file.write(speechbrain_hyperparams)
Select the appropriate AWS Deep Learning Container for serving the model
After the model artifacts are saved using the preceding sample code, you can choose pre-built AWS Deep Learning Containers (DLCs) from the following GitHub repo. When selecting the Docker image, consider the following settings: framework (Hugging Face), task (inference), Python version, and hardware (for example, GPU). We recommend using the following image: 763104351884.dkr.ecr.[REGION].amazonaws.com/huggingface-pytorch-inference:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04
This image has all the necessary system packages pre-installed, such as ffmpeg. Remember to replace [REGION] with the AWS Region you are using.
For other required Python packages, create a requirements.txt file with a list of packages and their versions. These packages will be installed when the AWS DLC is built. The following are the additional packages needed to host the WhisperX model on SageMaker:
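As an illustrative starting point (the version pin and the choice to install WhisperX directly from its GitHub repository are assumptions to validate in your own environment), a requirements.txt for this setup could look like the following:
faster-whisper==0.7.1
git+https://github.com/m-bain/whisperx.git
ffmpeg-python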
Create an inference script to load the models and run inference
Next, we create a custom inference.py script to outline how the WhisperX model and its components are loaded into the container and how the inference process should be run. The script contains two functions: model_fn and transform_fn. The model_fn function is invoked to load the models from their respective locations. Subsequently, these models are passed to the transform_fn function during inference, where the transcription, alignment, and diarization processes are performed. The following is a code sample for inference.py:
import io
import json
import logging
import tempfile
import time

import torch
import whisperx

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

def model_fn(model_dir: str) -> dict:
    """
    Deserialize and return the models
    """
    logging.info("Loading WhisperX model")
    model = whisperx.load_model(whisper_arch=f"{model_dir}/guillaumekln/faster-whisper-large-v2",
                                device=DEVICE,
                                language="en",
                                compute_type="float16",
                                vad_options={'model_fp': f"{model_dir}/whisperx/vad/pytorch_model.bin"})

    logging.info("Loading alignment model")
    align_model, metadata = whisperx.load_align_model(language_code="en",
                                                      device=DEVICE,
                                                      model_name="WAV2VEC2_ASR_BASE_960H",
                                                      model_dir=f"{model_dir}/wav2vec2")

    logging.info("Loading diarization model")
    diarization_model = whisperx.DiarizationPipeline(model_name=f"{model_dir}/pyannote/speaker-diarization/config.yaml",
                                                     device=DEVICE)

    return {
        'model': model,
        'align_model': align_model,
        'metadata': metadata,
        'diarization_model': diarization_model
    }

def transform_fn(model: dict, request_body: bytes, request_content_type: str, response_content_type="application/json") -> (str, str):
    """
    Load in audio from the request, transcribe and diarize, and return JSON output
    """
    # Start a timer so that we can log how long inference takes
    start_time = time.time()

    # Unpack the models
    whisperx_model = model['model']
    align_model = model['align_model']
    metadata = model['metadata']
    diarization_model = model['diarization_model']

    # Load the media file (the request_body as bytes) into a temporary file, then use WhisperX to load the audio from it
    logging.info("Loading audio")
    with io.BytesIO(request_body) as file:
        tfile = tempfile.NamedTemporaryFile(delete=False)
        tfile.write(file.read())
        audio = whisperx.load_audio(tfile.name)

    # Run transcription
    logging.info("Transcribing audio")
    result = whisperx_model.transcribe(audio, batch_size=16)

    # Align the outputs for better timings
    logging.info("Aligning outputs")
    result = whisperx.align(result["segments"], align_model, metadata, audio, DEVICE, return_char_alignments=False)

    # Run diarization
    logging.info("Running diarization")
    diarize_segments = diarization_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)

    # Calculate the time it took to perform the transcription and diarization
    end_time = time.time()
    elapsed_time = end_time - start_time
    logging.info(f"Transcription and Diarization took {int(elapsed_time)} seconds")

    # Return the results to be stored in S3
    return json.dumps(result), response_content_type
Within the model’s directory, alongside the requirements.txt file, ensure the presence of inference.py in a code subdirectory. The models directory should resemble the following:
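The exact layout depends on which models the pyannote configuration references, so the following tree is an illustrative sketch; the segmentation and speaker embedding model directories that fetch_models downloads based on that configuration will also appear alongside the entries shown here.
models
├── code
│   ├── inference.py
│   └── requirements.txt
├── guillaumekln
│   └── faster-whisper-large-v2
├── pyannote
│   └── speaker-diarization
│       └── config.yaml
├── wav2vec2
└── whisperx
    └── vad
        └── pytorch_model.bin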
Create a tarball of the models
After you create the models and code directories, you can use the following command lines to compress the model into a tarball (.tar.gz file) and upload it to Amazon S3. At the time of writing, using the faster-whisper Large V2 model, the resulting tarball representing the SageMaker model is 3 GB in size. For more information, refer to Model hosting patterns in Amazon SageMaker, Part 2: Getting started with deploying real time models on SageMaker.
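For example, assuming the artifacts live in a local models directory and you have an S3 bucket to hold the tarball (the bucket name below is a placeholder), the commands could look like the following:
cd models
tar -czvf model.tar.gz *
aws s3 cp model.tar.gz s3://<YOUR_BUCKET>/whisperx/model.tar.gz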
Create a SageMaker model and deploy an endpoint with an asynchronous predictor
Now you can create the SageMaker model, endpoint config, and asynchronous endpoint with AsyncPredictor using the model tarball created in the previous step. For instructions, refer to Create an Asynchronous Inference Endpoint.
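The following sketch shows one way to do this with the SageMaker Python SDK, matching the DLC recommended earlier; the S3 paths, instance type, and concurrency setting are illustrative assumptions rather than the configuration used in the engagement.
import sagemaker
from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.huggingface.model import HuggingFaceModel

# Assumes this runs where an execution role can be resolved (for example, SageMaker Studio)
role = sagemaker.get_execution_role()

# The tarball created in the previous step; inference.py and requirements.txt are picked up
# from the code/ directory inside the tarball (placeholder S3 path)
model = HuggingFaceModel(
    model_data="s3://<YOUR_BUCKET>/whisperx/model.tar.gz",
    role=role,
    transformers_version="4.28.1",
    pytorch_version="2.0.0",
    py_version="py310",
)

# Asynchronous inference writes results to S3, which suits large media inputs
async_config = AsyncInferenceConfig(
    output_path="s3://<YOUR_BUCKET>/whisperx/output/",
    max_concurrent_invocations_per_instance=1,
)

# deploy() returns an AsyncPredictor when an async inference config is supplied
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # assumption; choose a GPU instance that fits your workload
    async_inference_config=async_config,
)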
Evaluate diarization performance
To assess the diarization performance of the WhisperX model in various scenarios, we selected three episodes each from two English titles: one drama title consisting of 30-minute episodes, and one documentary title consisting of 45-minute episodes. We utilized pyannote’s metrics toolkit, pyannote.metrics, to calculate the diarization error rate (DER). In the evaluation, manually transcribed and diarized transcripts provided by ZOO served as the ground truth.
We defined the DER as follows:
DER = (FA + Miss + Error) / Total
Total is the length of the ground truth video. FA (False Alarm) is the length of segments that are considered as speech in the prediction, but not in the ground truth. Miss is the length of segments that are considered as speech in the ground truth, but not in the prediction. Error, also called Confusion, is the length of segments that are assigned to different speakers in the prediction and ground truth. All units are measured in seconds. Typical values for DER can vary depending on the specific application, dataset, and the quality of the diarization system. Note that DER can be larger than 1.0. A lower DER is better.
To be able to calculate the DER for a piece of media, a ground truth diarization is required as well as the WhisperX transcribed and diarized outputs. These must be parsed into lists of tuples containing a speaker label, speech segment start time, and speech segment end time for each segment of speech in the media. The speaker labels don’t need to match between the WhisperX and ground truth diarizations; the results are based primarily on the timing of the segments. pyannote.metrics takes these tuples of ground truth diarizations and output diarizations (referred to in the pyannote.metrics documentation as reference and hypothesis) to calculate the DER; a short sketch of this calculation follows the results table. The following table summarizes our results.
Video Type | DER | Correct | Miss | Error | False Alarm |
--- | --- | --- | --- | --- | --- |
Drama | 0.738 | 44.80% | 21.80% | 33.30% | 18.70% |
Documentary | 1.29 | 94.50% | 5.30% | 0.20% | 123.40% |
Average | 0.901 | 71.40% | 13.50% | 15.10% | 61.50% |
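To make the calculation concrete, the following is a minimal sketch of computing the DER with pyannote.metrics, assuming the ground truth and WhisperX outputs have already been parsed into (speaker, start, end) tuples; the example segments and speaker labels are invented for illustration.
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

def to_annotation(segments):
    """Convert a list of (speaker, start, end) tuples into a pyannote Annotation."""
    annotation = Annotation()
    for speaker, start, end in segments:
        annotation[Segment(start, end)] = speaker
    return annotation

# Hypothetical segments (times in seconds); labels need not match between the two lists
reference_segments = [("ALICE", 0.0, 4.2), ("BOB", 4.5, 9.8)]
hypothesis_segments = [("SPEAKER_00", 0.1, 4.0), ("SPEAKER_01", 4.6, 10.2)]

metric = DiarizationErrorRate()
der = metric(to_annotation(reference_segments), to_annotation(hypothesis_segments))
print(f"DER: {der:.3f}")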
These results reveal a significant performance difference between the drama and documentary titles, with the model achieving notably better results (using DER as an aggregate metric) for the drama episodes compared to the documentary title. A closer analysis of the titles provides insights into potential factors contributing to this performance gap. One key factor could be the frequent presence of background music overlapping with speech in the documentary title. Although preprocessing media to enhance diarization accuracy, such as removing background noise to isolate speech, was beyond the scope of this prototype, it opens avenues for future work that could potentially improve the performance of WhisperX.
Conclusion
In this post, we explored the collaborative partnership between AWS and ZOO Digital, using machine learning techniques with SageMaker and the WhisperX model to enhance the diarization workflow. The AWS team played a pivotal role in assisting ZOO in prototyping, evaluating, and understanding the effective deployment of custom ML models specifically designed for diarization. This included incorporating auto scaling for scalability using SageMaker.
Harnessing AI for diarization will lead to substantial savings in both cost and time when producing localized content for ZOO. By helping transcribers swiftly and precisely create and identify speakers, this technology addresses the traditionally time-consuming and error-prone nature of the task. The conventional process often involves multiple passes through the video and additional quality control steps to minimize errors. The adoption of AI for diarization enables a more targeted and efficient approach, thereby increasing productivity within a shorter timeframe.
We’ve outlined the key steps to deploy the WhisperX model on a SageMaker asynchronous endpoint, and encourage you to try it yourself using the provided code. For further insights into ZOO Digital’s services and technology, visit ZOO Digital’s official site. For details on deploying the OpenAI Whisper model on SageMaker and various inference options, refer to Host the Whisper Model on Amazon SageMaker: exploring inference options. Feel free to share your thoughts in the comments.
About the Authors
Ying Hou, PhD, is a Machine Learning Prototyping Architect at AWS. Her primary areas of interest include Deep Learning, with a focus on GenAI, Computer Vision, NLP, and time series data prediction. In her spare time, she relishes spending quality moments with her family, immersing herself in novels, and hiking in the national parks of the UK.
Ethan Cumberland is an AI Research Engineer at ZOO Digital, where he works on using AI and Machine Learning as assistive technologies to improve workflows in speech, language, and localisation. He has a background in software engineering and research in the security and policing domain, focusing on extracting structured information from the web and leveraging open-source ML models for analysing and enriching collected data.
Gaurav Kaila leads the AWS Prototyping team for UK & Ireland. His team works with customers across diverse industries to ideate & co-develop business-critical workloads with a mandate to accelerate adoption of AWS services.