OpenAI’s voice cloning AI model only needs a 15-second sample to work

OpenAI is offering limited access to a text-to-voice generation platform it developed called Voice Engine, which can create a synthetic voice based on a 15-second clip of someone’s voice. The AI-generated voice can read out text prompts on command in the same language as the speaker or in a number of other languages. “These small scale deployments are helping to inform our approach, safeguards, and thinking about how Voice Engine could be used for good across various industries,” OpenAI said in its blog post.

Companies with access include the education technology company Age of Learning, visual storytelling platform HeyGen, frontline health software maker Dimagi, AI communication app creator Livox, and health system Lifespan.

In these samples posted by OpenAI, you can hear what Age of Learning has been doing with the technology: generating pre-scripted voice-over content and reading out “real-time, personalized responses” that GPT-4 writes for students.

[Audio samples: the reference audio in English, followed by three AI-generated clips based on that sample.]

OpenAI said it began developing Voice Engine in late 2022 and that the technology has already powered preset voices for the text-to-speech API and ChatGPT’s Read Aloud feature. In an interview with TechCrunch, Jeff Harris, a member of OpenAI’s product team for Voice Engine, said the model was trained on “a mix of licensed and publicly available data.” OpenAI told the publication the model will only be available to about 10 developers.

AI text-to-audio generation is an area of generative AI that’s continuing to evolve. While most tools focus on instrumental or natural sounds, fewer focus on voice generation, partly because of the questions OpenAI cited. Companies in the space include Podcastle and ElevenLabs, which provide AI voice cloning technology and tools the Vergecast explored last year.

According to OpenAI, its partners agreed to abide by its usage policies, which say they will not use Voice Engine to impersonate people or organizations without their consent. The policies also require partners to get the “explicit and informed consent” of the original speaker, to not build ways for individual users to create their own voices, and to disclose to listeners that the voices are AI-generated. OpenAI also added watermarking to the audio clips to trace their origin, and it actively monitors how the audio is used.

OpenAI suggested several steps that it thinks could limit the risks around tools like these, including phasing out voice-based authentication for access to bank accounts, adopting policies to protect the use of people’s voices in AI, expanding education on AI deepfakes, and developing systems for tracking AI content.