Microsoft has introduced SpeechX, a speech-generating artificial intelligence. More than a voice generator, the tool can also edit spoken words or remove ambient noise. The company's objective: to make it a versatile tool and, above all, one that is better than the others.
Last January, Microsoft unveiled Vall-E, an AI model that reproduces a voice from three seconds of recording. A few months later, the firm is presenting a new model that aims to be more versatile. It is called SpeechX, and Microsoft is already planning several uses for this voice-focused artificial intelligence.
SpeechX: the tool that can do (almost) everything with voice
SpeechX appears in the research section of the Microsoft site, on a page put online on August 14. There we learn that it is a “versatile speech generation model that relies on audio and text prompts.” It was trained on 60,000 hours of audio data. According to Microsoft, existing models are “still limited in handling various generation tasks,” especially in poor acoustic conditions.
Microsoft envisions several uses. The company mentions text-to-speech (i.e. generating voice from text), ambient noise removal, extraction of a target speaker's voice, and speech removal and editing (the target speech can be modified while preserving the rest of the audio track).
Some pretty impressive demonstrations of Microsoft’s AI
Still on the SpeechX page, Microsoft has published some demos. Take text-to-speech, for example: like Vall-E, SpeechX reproduces a voice from three seconds of recording while changing the words. Microsoft then had the real voices pronounce the same sentences as its AI, to allow a comparison. Even without that comparison, the results are quite impressive: if we assume the audio quality is simply poor, the somewhat robotic character of the generated voices is easy to overlook. Heard side by side, the difference is obvious; heard alone, it is much less so.
Where it becomes even more deceptive is when a sentence is modified in the middle. SpeechX is able to replace a few words within a spoken sentence. In this case, the artificial voice is camouflaged by the natural voice, and it is really difficult to tell the two apart. The same goes for misspelled words. As for ambient noise suppression, the published demonstrations seem to perform less well than RTX Voice, Nvidia's equivalent. Its rival AMD offers similar technology on its graphics cards.
Microsoft is not the only company working on AI specialized in audio. Meta, for example, presented Voicebox a few months ago, a tool capable of translating a speaker's voice into another language. Apple is already using AI to narrate audiobooks.