MICROSOFT’S NEW AI BOT VALL-E CAN REPLICATE ANYONE’S VOICE WITH JUST A 3-SECOND AUDIO SAMPLE

January 10, 2023

A team of researchers at Microsoft has developed a new text-to-speech AI model called VALL-E that can closely simulate a person’s voice. To do so, all the model needs is a three-second audio sample of that person speaking.

Moreover, the researchers claim that once the AI bot learns a specific voice, VALL-E can synthesize audio of that person saying anything and do it in a way that attempts to preserve the speaker’s emotional tone.

The developers of VALL-E speculate that it could be used for high-quality text-to-speech applications, for speech editing in which a recording of a person is changed by editing a text transcript, and for audio content creation in combination with other generative AI models like GPT-3.

Microsoft’s VALL-E builds on a technology called EnCodec, which Meta announced in October 2022. Unlike other text-to-speech methods, which typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. It analyzes how a person sounds and breaks that voice down into tokens. It then uses its training data to match what it “knows” about how that voice would sound if it spoke other phrases.
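To give a rough feel for what "breaking a voice into tokens" means, the sketch below turns a short waveform into a sequence of integer codes and back again. It is only an illustration: EnCodec is a neural codec with learned codebooks, whereas this toy uses a fixed set of evenly spaced amplitude levels, and the names make_codebook, encode, and decode are hypothetical.

```python
import math

def make_codebook(levels: int) -> list[float]:
    """Evenly spaced amplitude levels in [-1, 1].

    Assumption: a stand-in for the learned codebooks of a real neural codec.
    """
    return [-1.0 + 2.0 * i / (levels - 1) for i in range(levels)]

def encode(samples: list[float], codebook: list[float]) -> list[int]:
    """Map each sample to the index of the nearest codebook entry (its token)."""
    return [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - s))
        for s in samples
    ]

def decode(tokens: list[int], codebook: list[float]) -> list[float]:
    """Turn tokens back into an approximate waveform."""
    return [codebook[t] for t in tokens]

# One cycle of a sine wave stands in for a snippet of recorded audio.
samples = [math.sin(2 * math.pi * n / 16) for n in range(16)]
codebook = make_codebook(8)

tokens = encode(samples, codebook)          # discrete codes, one per sample
reconstructed = decode(tokens, codebook)    # lossy approximation of the input
```

Once audio is expressed as integers like these, a language-model-style system can predict the next code in a sequence much as a text model predicts the next word.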

Microsoft trained VALL-E’s speech synthesis capabilities on LibriLight, an audio library assembled by Meta. It contains 60,000 hours of English-language speech from more than 7,000 speakers, most of it taken from LibriVox public domain audiobooks. For VALL-E to produce a good result, the voice in the three-second sample must closely resemble a voice in the training data.

In addition to preserving a speaker’s vocal timbre and emotional tone, VALL-E can also imitate the “acoustic environment” of the sample audio. If the sample comes from a telephone call, for instance, the synthesized output will reproduce the call’s acoustic and frequency characteristics, which is a fancy way of saying that it will sound like a telephone call too. Additionally, Microsoft’s samples (included in the “Synthesis of Diversity” section) show that VALL-E can produce variations in voice tone by changing the random seed used during generation.
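The role of the random seed can be shown with a toy example. The sketch below draws a sequence of tokens from a fixed probability distribution: the same seed always yields the same sequence, while a different seed yields a different "take". The distribution and the sample_tokens helper are hypothetical and have nothing to do with VALL-E's actual model; they only illustrate why seeded sampling produces repeatable variation.

```python
import random

def sample_tokens(probs: list[float], length: int, seed: int) -> list[int]:
    """Draw `length` token ids according to `probs`, using a seeded generator."""
    rng = random.Random(seed)  # private generator, so each seed is reproducible
    vocab = list(range(len(probs)))
    return rng.choices(vocab, weights=probs, k=length)

# Hypothetical next-token probabilities over a five-token vocabulary.
probs = [0.1, 0.2, 0.4, 0.2, 0.1]

take_a = sample_tokens(probs, 20, seed=0)
take_b = sample_tokens(probs, 20, seed=1)    # different seed, different take
repeat_a = sample_tokens(probs, 20, seed=0)  # same seed, identical to take_a
```

In a real speech model the probabilities would come from the network at every step, but the principle is the same: the seed decides which of the many plausible outputs is produced.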