What I want to talk about today, even though I'm pretty sure no-one but me wants to hear it, is Suno. I wasn't expecting to write about AI music again but that's because, as Vic Reeves used to say, I hadn't thought it through.
Here's the thing: I spent two months generating literally hundreds of versions of about thirty songs until I had what I thought was a definitive version of all of them. I moved on to making the videos and I didn't need Suno for those, so I thought I was probably done with it, for a while at least.
I was just about to unsubscribe when, just before the sub was due to renew, Version 4.5 arrived. Which was a complete surprise to me. Even though the "Cover" feature I'd been using was clearly flagged as "Beta" and there was a drop-down menu of previous models in the app itself, it never occurred to me that meant development was still going on.
I mean, I knew, obviously. I just didn't know.
Several things changed with the update, all of them potentially significant for the project. For a start, the cover feature was much improved in 4.5. I'd been very satisfied with the 4.0 beta version but things can always get better so naturally the first thing I did with my remaining credits was generate a few covers for comparison. The difference was hard to miss.
There was a noticeable improvement in quality and the annoying glitches almost completely vanished. Most importantly, the new covers I was generating stuck much more reliably to the uploads. Under 4.0 that had not always been the case.
One of the reasons I generated so many covers of the same songs was to get one where the AI didn't decide it had a better melody for the bridge than the one I'd given it or that it could phrase my lyrics more evocatively than I could. Which might have been okay except that nine times out of ten the AI was wrong. The original was better. (The other reason I keep making more and more covers, of course, is that it's incredibly satisfying and tons of fun. I'm not saying it'll never get old but it sure hasn't yet.)
With 4.5 the AI rarely deviates from the template it's given. Hardly at all. When it does, it's usually because I'm trying to make it accommodate my lyrics and melody to an entirely different and wholly unsuitable genre, something I do mostly for my own amusement. I can hardly blame the AI for losing patience with me there.

As well as producing much more accurate versions of my originals, the new model can also make them longer. Output from 4.0 was limited to four minutes. That's doubled in 4.5.
Not that I've got anything that's eight minutes long. This isn't prog rock. But I was having problems with a handful of songs that naturally run just over four minutes. They kept getting cut off in the coda or the final verse.
There are ways to get around that with extensions in the old model but it's fiddly and usually doesn't sound quite right. The new model just keeps going until the song's over, which is infinitely better. And it gets the timing right so the songs have an actual ending rather than just stopping like the machine's been switched off.
All of that meant I've had to think again about the whole project. I might need to go back over all the "finished" versions to see if they really are as good as they could be. I suspect some of them are not. More to the point, I'm now aware there may be further improvements to the AI in future releases, so I'm just going to have to draw a line somewhere for sanity's sake.
Either that or keep playing around with it until I finally get bored, something I'm showing absolutely no sign of doing so far. I have to say that at £8 a month I'm getting at least as much use and entertainment out of the subscription as I ever got from an MMORPG sub. It's cheap entertainment by the Dollar per Hour scale and you have something to show for your time and money at the end.
Perhaps the most intriguing addition to Suno's capabilities in this latest model is the new Prompt Enhancement Helper. The AI as a whole has had its interpretative and creative powers increased and extended, something that's apparent from the results. That's probably the most important change but the most fascinating is the way it now lifts the curtain to show how it's doing it.
I had been under the impression that to get the best results I needed to keep my prompts as concise and tight as possible. I didn't think long, descriptive prompts full of adjectives and metaphors did all that much. I was convinced that even listing multiple genres or naming more than one or two specific instruments led to the AI cherry-picking a couple of them and ignoring the rest.
Based on the new feature, I had it all backwards. What happens when you use the Prompt Enhancement Helper is that your brief summary gets translated into a long paragraph, full of clauses and imagery and opinion. The exact opposite of what I expected, in other words.
For example, if I prompt
"1960s garage rock, female vocals"
the Prompt Enhancer turns it into
"Raw, punchy 1960s garage rock led by gritty female vocals. Begins with fuzzed-out guitar riff and tambourine accents; bass and drums drive a stripped-back rhythm. The chorus bursts open with call-and-response backing vocals and harmonized guitar lines. Short, energetic instrumental break before a brisk finale."
That's some "enhancement".
Because I can't help myself, I tested the results on a few songs using the basic and enhanced prompts and the enhanced ones always come closer to what I was imagining. It's still true that the end result appears to have ignored a lot of the specifics but the overall effect is much closer to the intent. That's great in itself but the best part is that, now I know what the AI is looking for, I can write more detailed prompts myself.
Of course, once I'd started playing with the prompt enhancer, I started getting ideas. The first thing I thought of was to see just what the AI would make of that mysterious, seemingly meaningless prompt I wrote a whole post about a while back, Post-Post-Vibe Cassette.
With enhancement
"Post-Post-Vibe Cassette, female vocals"
becomes
"Dreamy lo-fi "post-post-vibe" cassette track: begins with washed-out, tape-warped synth pads and muffled drum machine groove. Minimalist bass pulses subtly beneath airy, melodic female vocals, layered and filtered. Interspersed analog samples weave throughout, adding texture."
Seriously? Where is it getting all that detail from? As far as I can tell, the original prompt is devoid of all context other than "female vocals". You can't be "post-post" a "vibe" unless you know what the vibe is, and "cassette" isn't a genre. Cassettes don't even sound significantly different to other forms of recording. (Except worse.) And the enhanced prompt admits that by putting the phrase in quotes. It doesn't recognize it as a genre. It just takes my word for it that it might be one.
I was curious to see if the prompt enhancer would always interpret post-post-vibe cassette the same way. It doesn't. It's not a fixed response.
Here's another version it gave me:
"Atmospheric synths lay a hazy, lo-fi bed, layered with warbly cassette textures and saturated drums. Minimalist bass pulsates beneath crisp, delayed guitar plucks. Female vocals glide with dreamy reverb, while tape flutter and subtle noise accent shifting, moody sections."
That explains why the results vary when I just use the basic prompt. Even so, the long prompts are very similar. It's clearly not just making stuff up at random. I have no clue how it's landing where it does but it clearly thinks post-post-vibe cassette means something fairly specific, which explains why the variations always sit within a relatively limited range of styles and sounds. They're always a bit woozy and untethered, which is how the descriptions feel.
I had fun playing with that for a while and then I started to wonder... if the AI can interpret an abstract phrase as effectively as that, what could it do with something more concrete? Quite a lot, it turns out.
All the time I'd been using Suno I'd been seeding my prompts with musical genres and styles, specific instrumentation, production and arrangement details, and some mood descriptors like "sad" or "wistful". It never occurred to me to leave all that out and just tell the AI the context instead.
So I tried this:
"music heard coming from the local radio station of an orbiting space habitat over a vast conurbation in an anime set in the far future, female vocals"
Suno didn't blink. It gave me two very nice interpretations, neither of which felt particularly "spacey" but either of which I could easily imagine hearing in the scenario described.
I have to say that one of the best things about the generative AIs is the way you can ask them to do all kinds of things that would be too embarrassing to ask a person. They really don't care. Bearing that in mind, I asked Suno to enhance that description:
"Dreamy synth layers open the song, with glitchy electronic percussion and pulsing sub-bass creating a weightless atmosphere. Bright, airy pads ebb and flow, supporting spacious female vocals processed with subtle delay and reverb. Futuristic textures blend with downtempo beats, weaving ethereal melodies and brief robotized vocal fragments between verses and choruses. The arrangement gradually intensifies, introducing swirling arpeggios, metallic accents, and digital noise, cresting with a lush, enveloping bridge before gently dissolving into haunting echoes and cosmic ambience."
I see the 200 character prompt limit is a thing of the past, then...
Obviously, I wanted to hear the results and I wasn't disappointed. They were excellent, although once again I can't quite see what's futuristic about them. But then, it's an impossible task, predicting the future, even in musical style.
Clearly I am nowhere near done with Suno yet. It's a fantastically entertaining toy. I have a bunch more ideas of things to try and projects to work on. I certainly feel more excited about playing around with it than I do about playing games these days.
In fact, I'm starting to wonder if generative AI might not end up becoming a new entertainment medium in its own right. It's clearly not great at facts but it's pretty good at making stuff up. Maybe that's what it'll end up being used for. I can imagine a whole raft of purpose-specific AIs being marketed and sold just as games are now.
And on that happy thought I'll leave you. I have some songs to generate.
Man, I thought you were getting into Japanese wrestling.
This wasn't sumo at all. :(
I watched quite a bit of Sumo in the '90s, when there was a fad for it on British TV. It only lasted a couple of years here, though. Pretty much never heard it mentioned again since.
Yep, it's fascinating how this whole AI prompt thing works. ChatGPT essentially demo'ed this for me too with image prompts. At some point, it went, "Would you like me to generate an image prompt for you, that you can cut and paste into any image generation AI?" (regarding what you described).
On accepting, it then laid out a massively verbose chunk of text, which I just pasted into another chat tab, and it generated something rather more decent than I had expected.
If one really wants to go down the rabbit hole further, start editing the generated text prompt with -more- of your own words and see what results. It's like the AI produces a decent baseline that you as the human can then riff on.
I have been asking Gemini to write prompts for Suno to get covers that sound like specific bands since Suno itself doesn't allow (And claims not to recognize.) names of people or groups. When I started doing it, Gemini gave me long paragraphs of description but at that time Suno had a 200 character limit for prompts so I had to ask Gemini to stay within it. Those prompts worked to an extent but not enough that you'd guess which band it was meant to be. More like the general genre that band would have been known for. Now the character limit seems to have been raised I might try using the full version and see how close that gets.
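(For anyone wrestling with that old 200-character cap, trimming an over-long generated prompt at a word boundary rather than mid-word is a one-liner's worth of work. A hypothetical helper, not anything Suno or Gemini actually provides:)

```python
def trim_prompt(prompt: str, limit: int = 200) -> str:
    """Trim a style prompt to a character limit, cutting at the last
    space so no word is chopped in half."""
    if len(prompt) <= limit:
        return prompt
    # find the last space at or before the limit
    cut = prompt.rfind(" ", 0, limit + 1)
    if cut == -1:  # one unbroken token longer than the limit
        cut = limit
    # drop any trailing punctuation left dangling at the cut
    return prompt[:cut].rstrip(" ,;")
```

Paste the long enhanced prompt in, get something the old limit would accept back out.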
There's a good deal of derision for the idea of "prompt scripting" but I wonder if, in time, it won't turn out to be an actual job description akin to copywriter or layout artist. Since it seems clear even the people who create and operate the AIs don't know how they work, there's probably room for a "priest" class to develop that claims to know how to control them.
It's astounding how fast this stuff is progressing. I found a youtube channel, "AI Revolution" (complete with an AI-generated voice and a weird lip-synced robot avatar delivering it), that does almost daily videos on updates from the AI world, and it seems to have plenty of news to cover.
I find it all both fascinating and a little daunting to try to get into.
Just wait 'til the AIs start creating their own, new AIs! That's generally the point when the robot revolution begins.
More seriously, I think this is one of those technological watersheds that will soon become too awkward to opt out of, like television and mobile/cell phones were. Obviously lots of people will choose not to get involved but increasingly they'll find that choice leaves them culturally and eventually practically isolated. Either that or it'll all disappear up its own back end like VR or crypto. I kind of think not, though.