Audio tech combined with AI for something different

February 8, 2024

People in the tech industry like giving big names to things we already know. Audio technology is one of them: a term for a range of technologies that concern recording, editing, processing, or delivering sound. Well-known audio tech products include radio, television, and even the smartphone you may be holding to read this article.

In South Korea, audio tech has taken some creative turns over the past few months with the adoption of artificial intelligence. It is not difficult to find YouTubers narrating their videos with AI-generated voices from text-to-speech tools. K-pop fans, meanwhile, use generative AI audio tools to hear their favourite singers cover songs they have never sung in real life.

As online virtual worlds become more important in our everyday lives, people may come to want more realistic, immersive audio experiences in the near future. For the audio equipment market alone, Mordor Intelligence expects the $US 15.2 billion market to grow by 140 per cent by 2029, with Asia Pacific being the largest market. Combined with AI, audio technologies seem to be paving new ways for content creation and consumption. 4i Magazine looks at what kinds of products and services South Korean audio tech start-ups offer as of 2024.

AI voice technologies grow big on video platforms

A successful example of AI-assisted audio technology is text-to-speech software, and Neosapience leads that field in South Korea. Founded in 2017, the audio tech start-up specialises in creating virtual voices and images with AI. After a year and a half of research, the company developed a technology that can recreate a specific person’s voice from just six seconds of audio data. Its test results also garnered much public attention: Neosapience’s edited video of Donald Trump speaking Korean was viewed by millions of people upon release.

What made the company better known to the public is its text-to-speech tool, Typecast. The tool offers up to 300 different virtual voices, available under monthly subscription plans priced from 9,900 won ($US 7.44) to 90,000 won. What sets it apart from competitors is that each virtual voice can express emotions: users type a script, pick an emotion for it, and choose the tone of delivery that conveys the content best. These emotions can also be made more complex to the user’s liking, such as “sad but confident” instead of just “sad”. The platform also offers ten virtual humans who can act based on the script (or prompts) users put in.

The virtual voices of Typecast, which now has more than 1.2 million users worldwide, seem to resonate with internet figures who wish to remain anonymous when making videos. Many of its users are short-form creators making vlogs or cooking videos.

Beyond text-to-speech, AI voice synthesis technology has also been used by K-pop artists. Last May, BTS’ home agency HYBE LABELS released its artist MIDNATT’s debut single “Masquerade” in six different languages. The agency explained that this was possible through Supertone, the audio tech start-up that HYBE LABELS took over in the same year.

The recording process for the six languages seems rather simple: the artist first records the song in each language, then Supertone steps in with its AI technology to correct the pronunciation and intonation of the artist’s singing.

Image: screen capture of Supertone

Considering that Bang Si-hyuk, the chairperson of HYBE LABELS, thinks highly of the power of AI, we can assume that collaboration between the music industry and AI will become more prominent in the country in the next few years.

Separating one music source from remixes

For musicians who want to take their productions in a different direction, stem separation tools may come in handy. Stem separation refers to isolating individual parts of a track, such as the drums, guitar, or bass. With the separated sources, musicians can make a range of mashups by adding new vocals or other instruments. It can also help producers remove unwanted noise.

While there are several free-to-use stem separation software programs available, Gaudio Lab, a start-up based in Seoul, is known to offer one of the best on-browser apps.

Gaudio Lab’s studio tool can separate individual instrumental tracks from a mixture of audio sources, assisted by AI. Users can remove noise or extract a certain music source and reduce the volume of high-definition audio. The company said in a press release that Gaudio Studio recorded a “higher score than global tech companies like Meta and Sony Music for the signal-to-distortion ratio (SDR)”, an index that indicates stem separation quality.
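To make the SDR figure more concrete, here is a minimal sketch of one common definition, in which the energy of the reference stem is divided by the energy of the difference between the reference and the separated estimate, then expressed in decibels. The article does not say which SDR variant Gaudio Lab used, so the function name `sdr_db` and the exact formula are assumptions for illustration only.

```python
import math


def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in decibels (illustrative definition).

    reference: samples of the true, isolated stem
    estimate:  samples of the stem produced by a separation tool
    A higher value means the estimate is closer to the reference.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")

    # Energy of the clean stem.
    signal_power = sum(r * r for r in reference)
    # Energy of everything the separator got wrong.
    error_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))

    if signal_power == 0:
        raise ValueError("reference is silent; SDR is undefined")
    if error_power == 0:
        return math.inf  # perfect reconstruction

    return 10 * math.log10(signal_power / error_power)


# Toy example: the estimate is the reference scaled down by 10 per cent.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9 * sample for sample in reference]
print(round(sdr_db(reference, estimate), 2))
```

In this toy case the error is a tenth of the signal's amplitude, so the ratio of energies is about 100, or roughly 20 dB. Real evaluations average such scores over many tracks and stems.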

Audiobooks powered by AI audio tech

Sometimes, an audio tech start-up’s main product is not the technology itself but content made with it. For example, the book browsing and reading platform Millie Seojae introduced a do-it-yourself audiobook service, where users read a book of their choice out loud, record it, and turn it into an official audiobook available to others. They can also make an audiobook with a virtual voice available on the platform.

Last October, software company SELVAS AI launched an audiobook editor program similar to Millie Seojae’s project.

The editor runs on the company’s text-to-speech tool Selvy Deep TTS, which imitates various vocal components, including breathing, intonation, and emotion. SELVAS AI explains that the program can make a “more human-like, natural” audiobook using this tool. Users can also choose from voices themed to the content of a book, ranging from economy to self-development.

It also uses AI document analysis technology to automatically transform a book into a voice recording script. When a user uploads their book files, the program groups the text into bubbles by category, such as headings, subheadings, or bullet points. Users can designate different AI voice effects for each category. This step has been time-consuming for many publishers, as sorting sentences and paragraphs into such categories has not been done by machines until now. In addition, users can add their own sound effects or background music in the editor.
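The categorise-then-assign-a-voice workflow described above can be sketched roughly as follows. This is a toy illustration, not SELVAS AI's method: the article does not describe how its document analysis works, so the rule-based `categorise` function, the `build_script` helper, and the voice preset names are all hypothetical stand-ins.

```python
import re

# Hypothetical voice presets, one per text category (names are made up).
VOICE_FOR_CATEGORY = {
    "heading": "narrator_strong",
    "subheading": "narrator_calm",
    "bullet": "narrator_brisk",
    "body": "narrator_default",
}


def categorise(line: str) -> str:
    """Guess a category for one line using simple hand-written rules."""
    text = line.strip()
    # Lines starting with a dash, asterisk, bullet, or "1." / "1)" are list items.
    if re.match(r"^(-|\*|•|\d+[.)])\s+", text):
        return "bullet"
    # Lines such as "Chapter 1" or "Part Two" are treated as headings.
    if re.match(r"^(chapter|part)\s+\w+", text, re.IGNORECASE):
        return "heading"
    # Short lines without closing punctuation are treated as subheadings.
    if text and len(text.split()) <= 8 and not text.endswith((".", "?", "!", ":")):
        return "subheading"
    return "body"


def build_script(book_text: str) -> list[dict]:
    """Turn raw text into a list of script entries with a voice per entry."""
    script = []
    for line in book_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        category = categorise(line)
        script.append(
            {
                "category": category,
                "voice": VOICE_FOR_CATEGORY[category],
                "text": line.strip(),
            }
        )
    return script


sample = (
    "Chapter 1\n"
    "Why audio matters\n"
    "- Radio\n"
    "- Podcasts\n"
    "Audio technology covers recording, editing and delivery of sound.\n"
)
for entry in build_script(sample):
    print(entry["category"], "->", entry["voice"])
```

A production system would presumably rely on layout information and a trained model rather than regular expressions; the point here is only the shape of the output, a script in which every block of text carries a category and a voice setting.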

The company explained that, with deep learning-based voice synthesis and AI document analysis tools, users can make an audiobook 90 per cent faster than the usual process, which takes about four to eight weeks. Kwon Hyuk-min, the odiro project owner at SELVAS AI, said that the company plans to expand the use of AI in automating audiobook production.
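What “90 per cent faster” means in days depends on how the phrase is read, and the article does not specify. The short sketch below works out both readings for the four-to-eight-week baseline; the function name and both interpretations are assumptions, not figures from SELVAS AI.

```python
def production_days(baseline_weeks: float, percent_faster: float,
                    as_time_cut: bool = True) -> float:
    """Estimate production time in days under one of two readings.

    as_time_cut=True:  "90 per cent faster" means 90 per cent less time.
    as_time_cut=False: it means working at 1.9 times the original speed.
    """
    baseline_days = baseline_weeks * 7
    fraction = percent_faster / 100
    if as_time_cut:
        return baseline_days * (1 - fraction)
    return baseline_days / (1 + fraction)


for weeks in (4, 8):
    cut = production_days(weeks, 90)
    speed = production_days(weeks, 90, as_time_cut=False)
    print(f"{weeks} weeks -> {cut:.1f} days (time cut) or {speed:.1f} days (speed-up)")
```

Read as a cut in time, the four-to-eight-week process shrinks to roughly three to six days; read as a speed increase, it shrinks to roughly two to four weeks.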

AI can also help stabilise the quality of audiobooks. Last March, Welaaa, the biggest audiobook platform in South Korea, introduced a couple of features that let users speed up the reading voice on its platform without undermining the audio quality. The company said that it will improve the features by developing other AI-powered playback-speed technologies, such as MFA methods.