Meta's artificial intelligence division has presented Voicebox, an AI model capable of generating speech for tasks it was not specifically trained to perform.
In its announcement article, Meta AI introduced Voicebox as the “first model able to adapt to speech generation tasks for which it was not trained, with peak performance.”
Meta wants to create music and voices… from scratch
Mark Zuckerberg’s company presents Voicebox as an automatic generation system based on artificial intelligence, comparing it to text or image generation tools. This time, the goal is to create voices.
The distinguishing feature of this model is that it does not need prior recordings to create a voice: it has been trained extensively beforehand. Voicebox is built on a model called Flow Matching, which does not require specially prepared recordings for training. This allows Voicebox to learn from more diverse data and, above all, from larger quantities of it. Voicebox has “ingested” 50,000 hours of speech and transcriptions from public domain audiobooks in English, French, Spanish, German, Polish and Portuguese. The AI has been trained “to predict a segment of speech when given the surrounding speech and the transcription of a segment.” This means that, given some context, Voicebox is able to produce a voice.
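The training objective described above (predicting a hidden segment from the surrounding speech and the transcription) can be pictured with a toy sketch. This is only an illustration under simplifying assumptions, not Meta's code: audio frames are represented as plain floats, and the names `InfillExample`, `make_infill_example` and `fill_in` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class InfillExample:
    """One toy training example for speech infilling (hypothetical structure)."""
    context: List[Optional[float]]  # frames, with the hidden span set to None
    transcript: str                 # text of the utterance
    target: List[float]             # the hidden frames the model must predict
    span: Tuple[int, int]           # (start, end) of the hidden span, end exclusive


def make_infill_example(frames: List[float], transcript: str,
                        start: int, end: int) -> InfillExample:
    """Hide frames[start:end]; keep the surrounding frames as context."""
    if not 0 <= start < end <= len(frames):
        raise ValueError("span must lie inside the recording")
    context = [None if start <= i < end else f for i, f in enumerate(frames)]
    return InfillExample(context, transcript, list(frames[start:end]), (start, end))


def fill_in(example: InfillExample, predicted: List[float]) -> List[float]:
    """Splice predicted frames back into the hidden span."""
    start, end = example.span
    if len(predicted) != end - start:
        raise ValueError("prediction must have the same length as the hidden span")
    out = list(example.context)
    out[start:end] = predicted
    return out


# Toy usage: hide frames 1 and 2 of a five-frame "recording".
example = make_infill_example([0.1, 0.2, 0.3, 0.4, 0.5], "hello world", 1, 3)
print(example.context, example.target)
```

In a real system the frames would be spectrogram features and the prediction would come from a trained model; here the point is only the shape of the task: context plus text in, missing segment out.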
Meta states that “the model can synthesize speech in six languages, as well as remove noise, edit content, convert style, and generate various samples.” For now, Meta has announced that it does not want to make the model or code publicly available “because of the potential risks of misuse.” Indeed, the technology could be used to create deepfakes, fake recordings of public figures (including politicians). The company writes that it wants to “find the right balance between openness and responsibility.”
Voicebox wants to do better than the others
Meta wants to make Voicebox a versatile tool, capable of handling multiple audio tasks. For example, it can modify any part of a track, not just the end. The noise reduction function is reminiscent of the RTX Voice feature available on Nvidia graphics cards, which uses artificial intelligence to reduce noise when you use a microphone. AMD adopted a similar solution last year for its own graphics cards.
Meta also wants to compete with Microsoft. In January, the latter presented Vall-E, an AI model for voice generation. Its distinguishing feature was that it required only three seconds of recording to reproduce a voice. Voicebox is said to be better than Vall-E “on text-to-speech in terms of intelligibility […] and audio similarity […] while being up to 20 times faster.”
What uses for AI voice generation?
Meta has naturally imagined several possible uses for Voicebox and described them in detail.
Two seconds of a voice is enough to reproduce it
First, there is speech synthesis, namely the generation of a voice from text. Using a two-second voice sample, Voicebox is said to be able to generate speech in that same voice from text given to it.
Meta imagines that this would allow “people who are unable to speak to express themselves or to customize the voices used by non-player characters and virtual assistants.” Apple, for example, already uses this kind of technology for its audiobooks.
Translate your voice, in all languages, with a perfect accent
The French are known for not being comfortable with foreign languages and for having a very bad accent. That may change in the future, though not thanks to a few more language classes. Voicebox could make it possible to reproduce a voice in another language. The AI is already capable of this in English, French, German, Spanish, Polish and Portuguese.
We can imagine concrete applications in Google Translate for example. In a foreign country, we could dictate to our smartphone what we want to translate and the AI would speak with our voice, but in the destination language. Another practical case: videoconferencing. We could translate our voice in real time within Zoom, Microsoft Teams or Google Meet.
Do voice processing
Imagine that you are recording a podcast, or any other audio. Listening to it again, you realize that a glitch or a knock on the microphone makes the sound almost inaudible, or at least unpleasant.
Voicebox is able to solve this problem by resynthesizing the corrupted part. That is enough to save a recording and avoid redoing it.
Train speech recognition tools
Voicebox can also… train other AI models, specifically voice recognition models. Meta says that since Voicebox can precisely generate audio, these voice recordings can be used to train speech recognition AIs.
The recordings that Voicebox generates are already labeled: we know what is said, since they were generated from text. The published blog post states that “speech recognition models trained on synthetic data generated by Voicebox perform almost as well as models trained on real data.” Meta claims that the error rate degrades by only 1% with Voicebox compared with real training recordings.
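To make the notion of “error rate degradation” concrete, here is a minimal sketch of how a word error rate (WER) is commonly computed for speech recognition, and how two models can then be compared. The article does not say whether Meta's 1% figure is an absolute or a relative difference, so the sketch shows both; the function names and all numbers below are hypothetical and purely illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def absolute_degradation(wer_real: float, wer_synthetic: float) -> float:
    """Difference in WER, in the same unit as the inputs."""
    return wer_synthetic - wer_real


def relative_degradation(wer_real: float, wer_synthetic: float) -> float:
    """Increase in WER as a fraction of the baseline WER."""
    return (wer_synthetic - wer_real) / wer_real


# Hypothetical example: one wrong word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))

# Hypothetical WERs for a recognizer trained on real vs. synthetic speech.
print(absolute_degradation(0.10, 0.11), relative_degradation(0.10, 0.11))
```

Because the synthetic recordings come with their source text, the reference transcript needed for this kind of measurement, and for training, is available without manual labeling.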