Microsoft has developed VALL-E 2, an advanced text-to-speech AI tool that is so realistic that the company has decided against releasing it to the public due to concerns over potential misuse, such as impersonation of people’s voices.
“VALL-E 2 is purely a research project,” Microsoft’s researchers stated. “Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public.”
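According to the tech giant’s researchers, VALL-E 2 has achieved “human parity” in speech generation. In practice, this means that on the benchmark datasets used for evaluation, the AI’s generated speech matched recordings of real speakers in robustness, naturalness, and speaker similarity, making it very hard to tell apart from a real human voice.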
VALL-E 2 could synthesize speech while maintaining speaker identity, making it useful for educational and entertainment purposes.
The AI could also be used in journalistic and self-authored content, providing realistic voiceovers and narrations.
VALL-E 2 could enhance accessibility features, such as screen readers, by providing more natural and engaging speech.
Interactive voice response systems and chatbots could benefit from VALL-E 2’s realistic speech generation, improving user interactions.
The AI could be used in translation services to deliver translated text as natural-sounding speech in various languages.
VALL-E 2 can accurately replicate a person’s voice by using just a few seconds of audio from the speaker.
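Repetition-aware sampling helps the AI avoid monotonous or looping speech. As it generates each small unit of audio, the model checks how often that unit has already appeared in its recent output, and if the unit is repeating too much, it picks the next one more freely so the speech sounds more natural.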
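The idea can be illustrated with a short Python sketch. It is not Microsoft's code: the function names, the window size, and the thresholds are hypothetical values chosen for illustration. It assumes the model supplies a probability for each possible audio token, draws a candidate with nucleus (top-p) sampling, and falls back to sampling from the full distribution when the candidate already dominates the recent history.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample a token index from the smallest set of tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, top_p=0.9, window=10,
                            max_ratio=0.5, rng=None):
    """Pick the next token, avoiding runs of the same token.

    window and max_ratio are illustrative assumptions, not published settings.
    """
    rng = rng or random.Random()
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # The candidate repeats too often: resample from the full distribution.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token
```

With a fresh history the sketch behaves like ordinary nucleus sampling. Only when one token fills most of the recent window does the fallback widen the choice to every token.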
Grouped code modeling reduces the sequence length, allowing the AI to process fewer units of speech, which speeds up speech generation and minimizes the challenge of processing long sentences.
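A rough Python sketch shows why grouping shortens the sequence. The helper names, the group size, and the padding token are assumptions made for illustration, not details of VALL-E 2 itself. Consecutive audio codes are bundled into fixed-size groups, so the model takes one step per group instead of one step per code.

```python
def group_codes(codes, group_size, pad_token=0):
    """Bundle consecutive codes into tuples of group_size.

    pad_token is a hypothetical filler for an incomplete final group.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    padding = (group_size - len(codes) % group_size) % group_size
    padded = list(codes) + [pad_token] * padding
    return [tuple(padded[i:i + group_size])
            for i in range(0, len(padded), group_size)]


def ungroup_codes(groups, original_length):
    """Flatten the groups back into a code sequence and drop the padding."""
    flat = [code for group in groups for code in group]
    return flat[:original_length]


codes = list(range(1, 11))           # 10 audio codes
groups = group_codes(codes, 4)       # 3 groups instead of 10 steps
restored = ungroup_codes(groups, len(codes))
```

Here ten codes become three groups, so a model working on groups would need three steps rather than ten. The larger the group, the shorter the sequence, at the cost of predicting more codes per step.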