OpenAI’s Voice Engine: everything you need to know about the voice cloning model
It takes comedians months of study to imitate a famous person’s speech, while it takes Voice Engine only 15 seconds of audio to clone a voice. The gap is enormous, and just as striking is the quality of the results delivered by the latest deep learning model announced by OpenAI, an increasingly dominant player in generative artificial intelligence. Voice Engine is not the first synthetic voice: several AIs can already simulate a human’s timbre, intonation and vocal rhythm. Still, as ChatGPT‘s texts and DALL-E‘s images have shown, the fact that the company led by Sam Altman is behind the technology attracts a lot of attention.
OpenAI’s Voice Engine, as demonstrated by the company itself, holds immense potential in the realm of audio. The possibilities are vast, from translating content while preserving a specific speaker’s voice to creating digital avatars that pair a person’s likeness with their voice. The technology also offers a unique opportunity to give a voice back to those who are no longer with us, and to those who are still here but unable to speak for themselves.
Few partners, many risks
After beginning experiments in late 2022, and specifying that the model was ‘trained on a mix of licensed and publicly available data’, OpenAI made Voice Engine available to a small group of partners to gather feedback, identify the most promising directions, and decide if and when to expand the user base. The company said it gave priority to low-risk, beneficial applications.
This caution matters because, as is often the case with new technologies, misuse of the tool carries many risks. Technologies, it is worth remembering, are never inherently good or bad; the judgment depends on how people use them and with what intentions. To get an idea of what is at stake, consider that making someone appear to say a word or phrase they never uttered can fuel scams, misinformation, security breaches and other damaging incidents.
Because it needs only a text input and a single 15-second audio sample to generate natural-sounding speech that closely resembles the original speaker, Voice Engine amplifies the problem of audio deepfakes, which generative-AI tools now allow even people with little technical knowledge to produce quickly and easily.
A concrete example came in recent months, when some New Hampshire residents received phone calls playing a recorded audio message in which President Joe Biden urged them not to vote in the Democratic Party primaries. Only later did it emerge that the voice had been cloned with AI, prompting an investigation by the state attorney general.
Aware that the new AI will play a key role in disseminating knowledge and breaking down language barriers, but also that ‘generating speech that resembles people’s voices carries serious risks, particularly felt in an election year’, OpenAI is working with governments, media, entertainment companies and education organisations to understand how to continue the development of Voice Engine. ‘We are taking a cautious and informed approach to a broader release due to the potential for synthetic voice misuse. We hope to start a dialogue on the responsible deployment of synthetic voices, and how society can adapt to these new capabilities,’ reads the audio model presentation.
To steer development in the right direction, the company has imposed several conditions on partners using the preview, such as a ban on impersonating another person without explicit consent. Partners must also clearly disclose to the public when voices have been generated by AI. On the security front, OpenAI applies watermarking that allows the origin of any audio generated by Voice Engine to be traced.
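OpenAI has not disclosed how its watermarking works, but the general idea behind audio watermarking is to embed an inaudible, machine-readable signature in the waveform that a detector can later recover. The sketch below is a deliberately simplified toy illustration of that concept, hiding bits in the least-significant bits of 16-bit PCM samples; real provenance schemes are far more robust (surviving compression, resampling and noise) and bear no relation to OpenAI’s actual method.

```python
# Toy illustration of the watermarking concept: hide a bit sequence in the
# least-significant bits (LSBs) of integer audio samples. This is NOT how
# Voice Engine's watermark works; it only shows the embed/extract idea.

def embed_watermark(samples, bits):
    """Return a copy of `samples` with `bits` written into the LSBs."""
    marked = list(samples)
    for i, bit in enumerate(bits):
        marked[i] = (marked[i] & ~1) | bit  # clear the LSB, then set it to `bit`
    return marked

def extract_watermark(samples, n_bits):
    """Read back the first `n_bits` least-significant bits."""
    return [s & 1 for s in samples[:n_bits]]

# Fake 16-bit PCM samples standing in for a generated audio clip.
audio = [1000, -2431, 517, 24, -9, 3120, 77, -64]
signature = [1, 0, 1, 1]

tagged = embed_watermark(audio, signature)
recovered = extract_watermark(tagged, len(signature))
print(recovered)  # the embedded signature comes back intact
```

Changing a sample’s LSB alters its amplitude by at most one quantization step, which is why such marks are inaudible; the trade-off is fragility, since any re-encoding destroys them, which is exactly what production-grade watermarks are designed to survive.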