Google announced that it has developed MusicLM, an artificial intelligence capable of creating music from a textual request. Following the success of ChatGPT, which brought text-based generative AI to the fore with large language models trained through machine learning to generate content from an initial description, Big G has now taken a step forward in the field of music, with researchers describing a model that ‘outperforms previous systems both in audio quality and adherence to the text description’.
This further confirms AI’s capacity for innovation across multiple fields of application: alongside text and music, generative artificial intelligence also makes it possible to produce images, videos and code from a simple input.
How it works
MusicLM is a system that generates music at 24 kHz in response to a text description. Alongside the model, Google released MusicCaps, a dataset of 5,500 music clips extracted from YouTube and paired with rich textual descriptions. The work derives from AudioLM, an earlier audio-generation experiment from Mountain View, used here as a foundation together with technologies such as SoundStream, a neural audio codec for speech and music that can run in real time on a smartphone CPU.
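To make that pipeline concrete, here is a minimal Python sketch of the hierarchical text-to-audio flow described above. MusicLM is not publicly available, so every function below is a hypothetical stand-in for a stage of the model (text to semantic tokens, semantic to acoustic tokens, SoundStream-style decoding), not a real API, and the "audio" it produces is placeholder noise.

```python
import numpy as np

SAMPLE_RATE = 24_000  # MusicLM generates audio at 24 kHz

def text_to_semantic_tokens(prompt: str) -> list[int]:
    """Stand-in for the AudioLM-derived stage that maps a text description
    to coarse semantic tokens sketching melody and long-term structure."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.integers(0, 1024, size=250).tolist()

def semantic_to_acoustic_tokens(semantic: list[int]) -> list[int]:
    """Stand-in for the stage that refines semantic tokens into fine-grained
    acoustic tokens in the codebook of a SoundStream-style neural codec."""
    rng = np.random.default_rng(semantic[0])
    return rng.integers(0, 1024, size=len(semantic) * 4).tolist()

def decode_to_waveform(acoustic: list[int], seconds: float = 10.0) -> np.ndarray:
    """Stand-in for the SoundStream decoder that turns acoustic tokens
    back into a 24 kHz waveform (placeholder noise here)."""
    rng = np.random.default_rng(acoustic[0])
    return 0.01 * rng.standard_normal(int(SAMPLE_RATE * seconds)).astype(np.float32)

prompt = "a calming violin melody backed by a distorted guitar riff"
tokens = semantic_to_acoustic_tokens(text_to_semantic_tokens(prompt))
audio = decode_to_waveform(tokens)
print(f"{audio.shape[0] / SAMPLE_RATE:.1f} s of audio at {SAMPLE_RATE} Hz")
```

The point of the hierarchy is that coarse semantic tokens pin down the long-range structure of the piece before the acoustic stage fills in fine detail, which is how the real system keeps multi-minute generations coherent.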
The thirteen researchers on the Google team dedicated to the project have published a series of listenable examples. As with ChatGPT for text and DALL·E 2 or Midjourney for images, the level and quality of the final result depend on the textual request: the more specific the prompt, the more defined the AI-generated melodies.
In the examples released by Google there are audio tracks generated from rich captions, tracks of more than three minutes obtained from generic text prompts, and a Story mode in which a sequence of four or more different prompts varies the song’s rhythm while maintaining the background melody. MusicLM can also recognise whistled or hummed tunes: a user can hum a few bars of a melody and the model merges that input with the text prompt, using it as a precise melodic guide for the generated track. Again, the examples provided by the researchers are worth hearing, because they suggest where this may lead in the future.
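Since the model has not been released, the following sketch only illustrates the idea behind Story mode under stated assumptions: a hypothetical generate_clip helper stands in for a single MusicLM call, and the driver stitches together segments whose guiding prompt changes at each timestamp. In the real system the background melody carries over between segments; this toy version simply concatenates clips. The segment prompts are taken from Google’s published examples.

```python
import numpy as np

SAMPLE_RATE = 24_000

def generate_clip(prompt: str, seconds: float) -> np.ndarray:
    """Hypothetical stand-in for one MusicLM-style generation call."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return 0.01 * rng.standard_normal(int(SAMPLE_RATE * seconds)).astype(np.float32)

def story_mode(segments: list[tuple[float, str]], total_seconds: float) -> np.ndarray:
    """Toy Story mode: each (timestamp, prompt) pair guides the track until
    the next timestamp. The real model keeps the melody consistent across
    segments, while this sketch merely concatenates the clips."""
    chunks = []
    for i, (start, prompt) in enumerate(segments):
        end = segments[i + 1][0] if i + 1 < len(segments) else total_seconds
        chunks.append(generate_clip(prompt, end - start))
    return np.concatenate(chunks)

track = story_mode(
    [(0.0, "electronic song played in a videogame"),
     (15.0, "meditation song played next to a river"),
     (30.0, "fire"),
     (45.0, "fireworks")],
    total_seconds=60.0,
)
print(f"{track.shape[0] / SAMPLE_RATE:.0f} s story-mode track")
```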
Too many risks, so the AI will not be released for now
On a first listen to the tracks generated by MusicLM, an ear trained on specific genres of music can pick out differences from tracks played by professional musicians: Google’s audio sounds too standardised and sometimes lacks nuance.
But these are the first steps of an already advanced project, one with the potential to recreate songs that replicate the original versions in every respect. The Google Research team has plenty of time to keep training the AI, because MusicLM will remain confined to Big G’s laboratories for now.
The researchers have chosen not to publish the code, both to avoid copyright issues and, as in all AI development, to limit potential biases against cultures underrepresented in the training datasets. Technical improvements also remain to be made, such as raising audio quality and generating structured songs with an intro, verse and chorus, an ambitious goal that will still require time and experimentation.