Microsoft’s research team has introduced VALL-E 2, a new AI system for speech synthesis that generates voices with human-like quality using minimal audio input. This advancement in neural codec language models marks a significant milestone in the field of zero-shot text-to-speech synthesis.
Unlike other voice cloning techniques, VALL-E 2 combines a method called “Repetition Aware Sampling” with adaptive switching between sampling techniques, which improves consistency and addresses common problems in earlier generative voice models. The system can synthesize high-quality speech even for complex or repetitive phrases, and could help people who lose the ability to speak.
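To make the idea of switching between sampling techniques more concrete, here is a minimal sketch in Python. It is not Microsoft's implementation. It assumes the decoder first picks a token with nucleus (top-p) sampling and, if that token already dominates a recent window of output, falls back to plain random sampling from the full distribution. The function names, the window size, and the threshold values are all illustrative assumptions.

```python
import random
from typing import List, Optional, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample a token index from the smallest set of tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: Sequence[int],
    top_p: float = 0.8,      # assumed value, for illustration only
    window: int = 10,        # assumed value, must be > 0
    threshold: float = 0.1,  # assumed value, for illustration only
    rng: Optional[random.Random] = None,
) -> int:
    """Pick the next token, switching sampling strategy if it repeats too often."""
    if rng is None:
        rng = random.Random()

    # Step 1: the default strategy is nucleus sampling.
    token = nucleus_sample(probs, top_p, rng)

    # Step 2: measure how often that token appears in the recent history.
    recent = list(history[-window:])
    if recent:
        ratio = recent.count(token) / len(recent)
        # Step 3: if it repeats too much, resample from the full distribution.
        if ratio > threshold:
            token = rng.choices(list(range(len(probs))), weights=list(probs), k=1)[0]
    return token
```

In a real decoder, `probs` would come from the model at each step and `history` would hold the tokens generated so far. The fallback gives less likely tokens a chance to break a loop that nucleus sampling alone might keep reinforcing.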
Despite its impressive performance, Microsoft has no plans to make VALL-E 2 available to the public, citing risks such as voice imitation without consent and the use of AI voices in scams and other criminal activities. The research team also emphasizes the need for a standard method to digitally mark AI-generated content and to ensure that speakers approve the use of their voice.
Other AI companies like Meta and OpenAI have also withheld their voice cloning models due to similar concerns. As the use of generative AI continues to grow, ethical guidelines are becoming increasingly important within the AI community, especially as regulators begin to address the impact of AI on our daily lives.