Prompt Engineering for Music AI

Aug. 14, 2024 | Ryan Fitzpatrick

Some techniques for prompting music AI, along with some experiments and audio results.

Introduction

AI tools like UdIO (v1.5 as of writing) are becoming invaluable for generating dynamic and genre-specific tracks. By leveraging the memory feature of ChatGPT-4, I was able to create a consistent song structure through prompt engineering. Now, I can simply request a UdIO song, and it provides me with exact parameters ready to copy and paste, making the process incredibly efficient. Further below is the song sample we used for the style and lyrics.

Using ChatGPT-4 for Consistent Song Structures

One of the most beneficial aspects of using ChatGPT-4 is its ability to remember and reproduce consistent song structures. I can prompt it to generate a UdIO song in the style of a specific artist and about a particular topic, and it provides the exact parameters ready for use. This integration streamlines the creative process, ensuring that each song adheres to the desired style and thematic content.

Song Style Prompt

metalcore, melodic metalcore, heavy metal, aggressive, powerful, melodic, intense, emotional, dynamic, anthemic, energetic, passionate, raw, heavy, driving, relentless, male vocalist, harsh vocals for verses, clean vocals for choruses

I had ChatGPT memorize that a UdIO song request should consist of one core music genre, then up to three sub-genres, then up to ten descriptive, sound-shaping adjectives. Having ChatGPT structure its output this way provides variety across songs while still aiming for a particular overall sound.
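The same structure can be written down as a small helper, which is handy for checking a prompt before pasting it into UdIO. This is only a minimal sketch: the function name `build_style_prompt`, its parameters, and the separate `vocal_tags` group are assumptions made for illustration, not part of UdIO or ChatGPT.

```python
def build_style_prompt(core_genre, sub_genres=(), descriptors=(), vocal_tags=()):
    """Join style tags into one comma-separated prompt string.

    Hypothetical helper: enforces one core genre, at most 3 sub-genres,
    and at most 10 descriptors, as described in the text above.
    """
    if not core_genre.strip():
        raise ValueError("a core genre is required")
    if len(sub_genres) > 3:
        raise ValueError("use at most 3 sub-genres")
    if len(descriptors) > 10:
        raise ValueError("use at most 10 descriptors")

    tags = [core_genre, *sub_genres, *descriptors, *vocal_tags]

    # Drop empty tags and case-insensitive duplicates, keeping first-seen order.
    seen = set()
    unique = []
    for tag in tags:
        key = tag.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(tag.strip())
    return ", ".join(unique)


prompt = build_style_prompt(
    "metalcore",
    sub_genres=["melodic metalcore", "heavy metal"],
    descriptors=["aggressive", "powerful", "melodic", "intense", "emotional"],
    vocal_tags=["male vocalist", "harsh vocals for verses", "clean vocals for choruses"],
)
print(prompt)
```

Keeping the vocal tags in their own group means they do not count against the descriptor limit, which is one possible reading of the rule above.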

Lyrics Prompt

[Verse 1 - Harsh Vocals]
In the grip of anguish, where comfort fades
Through the depths of trials, I'll forge my way
Every shiver, every fear, they mark my soul
In this realm of torment, I'll take control

[Chorus - Clean Vocals]
Rise from the shadows, we will not fall (we will not fall)
In the face of discomfort, we stand tall (we stand tall)
With every breath, with every cry (with every cry)
Our will unbroken, we will defy (we will defy)

[Verse 2 - Harsh Vocals]
Chained by uncertainty, we'll break these ties
In the storm of chaos, we'll rise and rise
Voices of doubt, they call our names
But in this burning resolve, we feel no shame

[Chorus - Clean Vocals]
Rise from the shadows, we will not fall (we will not fall)
In the face of discomfort, we stand tall (we stand tall)
With every breath, with every cry (with every cry)
Our will unbroken, we will defy (we will defy)

[Bridge - Clean Vocals]
In the silence of despair, we hear the call
Through the echoes of our doubts, we'll stand tall
With the fire in our hearts, we'll light the way
In this battle for our peace, we'll seize the day

[Special Resolution - Clean Vocals]
With every struggle, with every scar (with every scar)
We rise anew, no matter how far (no matter how far)
Bound by courage, we'll find our way
In this fight, we won't decay

[Chorus - Clean Vocals]
Rise from the shadows, we will not fall (we will not fall)
In the face of discomfort, we stand tall (we stand tall)
With every breath, with every cry (with every cry)
Our will unbroken, we will defy (we will defy)

[Outro - Clean Vocals]
In the void, we find our light
Through the pain, we'll win this fight
Our hearts beat as one, until the end (until the end)
In this battle, we will not bend (we will not bend)

Here again, I had ChatGPT memorize a few structural components. The section headings should always follow this format, naming both the song part and the vocal type. I have ChatGPT change the vocal types depending on the genre and the sound I am trying to create.
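As a rough illustration of that heading convention, the sketch below assembles lyrics from (part, vocal type, lines) entries. The helper names `format_section` and `format_lyrics` are hypothetical. The only thing taken from the lyrics above is the `[Part - Vocal Type]` heading shape and the blank line between sections.

```python
def format_section(part, vocal_type, lines):
    """Return one lyrics section with a '[Part - Vocal Type]' heading."""
    header = f"[{part} - {vocal_type}]"
    return "\n".join([header, *lines])


def format_lyrics(sections):
    """Join sections with a blank line between them."""
    return "\n\n".join(format_section(*section) for section in sections)


lyrics = format_lyrics([
    ("Verse 1", "Harsh Vocals", [
        "In the grip of anguish, where comfort fades",
        "Through the depths of trials, I'll forge my way",
    ]),
    ("Chorus", "Clean Vocals", [
        "Rise from the shadows, we will not fall (we will not fall)",
        "In the face of discomfort, we stand tall (we stand tall)",
    ]),
])
print(lyrics)
```

Swapping the vocal type per genre then only means changing the second item of each entry.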

Creating Dual Vocalists and Backing Vocals

When crafting lyrics, I often aim for dual vocalists to add depth and contrast to the music. By tagging sections specifically for harsh and clean vocals, I attempt to guide the AI in producing distinct vocal parts. Additionally, using backing vocals strategically can sometimes trick the AI into generating two different voices, enhancing the complexity of the track.
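In the lyrics above, the backing vocals appear as a parenthetical repeat of the clause after the last comma in a line. Assuming that pattern is the intended way to mark them, a small hypothetical helper (`add_backing_echo` is an illustrative name, not a UdIO feature) could add the echo automatically:

```python
def add_backing_echo(line):
    """Append the clause after the last comma as a parenthetical echo.

    Lines without a comma are returned unchanged.
    """
    _head, sep, tail = line.rpartition(",")
    echo = tail.strip()
    if not sep or not echo:
        return line
    return f"{line.rstrip()} ({echo})"


chorus = [
    "Rise from the shadows, we will not fall",
    "In the face of discomfort, we stand tall",
]
for line in chorus:
    print(add_backing_echo(line))
```

Whether the AI actually sings the parenthetical part in a second voice is not guaranteed; as the experiments below show, the effect depends heavily on the settings.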

Experimentation and Results

Through a series of experiments with different UdIO settings, we aimed to uncover optimal prompting techniques to produce high-quality music that aligns with specific stylistic preferences. Below, I detail the results of each test, accompanied by sound samples to illustrate the outcomes.
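The seven configurations tested below can be laid out as a small test matrix, which makes it easier to see which settings each test changes relative to the usual starting point (Test 1). This is a sketch only: the `Settings` class and its field names are made up for illustration and do not correspond to any UdIO API.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Settings:
    """One combination of the four settings varied in the tests."""
    manual: bool
    prompt_strength: str  # "low" or "high"
    lyrics: str           # "low" or "high"
    clarity: str          # "low" or "high"


# Test number -> settings, as listed in the test headings.
TESTS = {
    1: Settings(False, "low", "low", "low"),
    2: Settings(True, "low", "low", "low"),
    3: Settings(False, "high", "low", "low"),
    4: Settings(False, "low", "high", "low"),
    5: Settings(False, "low", "low", "high"),
    6: Settings(True, "low", "high", "low"),
    7: Settings(True, "high", "low", "low"),
}


def changed_from_baseline(test_id, baseline_id=1):
    """Return only the settings that differ from the baseline test."""
    base = asdict(TESTS[baseline_id])
    current = asdict(TESTS[test_id])
    return {name: value for name, value in current.items() if value != base[name]}


for test_id in TESTS:
    print(test_id, changed_from_baseline(test_id))
```

Laid out this way, Tests 2 through 5 each change a single setting from the baseline, while Tests 6 and 7 combine manual mode with one raised setting. Combinations such as manual mode with high clarity were not tested.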

Test 1: Not Manual, Low Prompt Strength, Low Lyrics, Low Clarity (where I usually start)

In this test, we aimed to see how minimal manual intervention and low settings would affect the output. Unfortunately, the results were underwhelming. The track produced lacked coherence, with backing vocals and dual voices being completely ignored. The overall sound was muddy and failed to capture the intended intensity and dynamism. This test highlighted the limitations of low clarity and minimal prompt strength, emphasizing the need for more refined settings to achieve desirable results.

Test 2: Manual, Low Prompt Strength, Low Lyrics, Low Clarity

Switching to manual mode with low prompt strength yielded mixed results. Both samples correctly utilized two voices, which was a step forward. However, the overall output remained unremarkable, with the chorus being the only part that stood out as decent. This test underscored that while manual adjustments can improve vocal variety, the overall impact of low settings still leaves much to be desired.

Test 3: Not Manual, High Prompt Strength, Low Lyrics, Low Clarity

Increasing the prompt strength while keeping other settings low provided some improvements but also introduced new challenges. The dual voices were ignored, but backing vocals worked well, making the song clearer. However, this clarity came at a cost: the vocals sounded boring and stiff, and the lyrics were largely incomprehensible when screamed. This experiment demonstrated that higher prompt strength could enhance certain elements but may compromise vocal expressiveness.

Test 4: Not Manual, Low Prompt Strength, High Lyrics, Low Clarity

When we adjusted the lyrics setting to high, the vocals became more distinct and varied. However, dual vocals still did not work, and the overall sound remained somewhat mid-tier. The second sample produced the same vocal quality, reinforcing the notion that while high lyrics settings can improve vocal clarity, other issues like dual voice integration need separate tweaking.

Test 5: Not Manual, Low Prompt Strength, Low Lyrics, High Clarity

This setting was a complete washout. The output was marred by a weird thumping noise and an unnaturally long instrumental opening, with no lyrics being produced. This test clearly showed that high clarity settings, when paired with low prompt strength and low lyrics, can result in an unnatural and unusable track. It's essential to balance clarity with other parameters to avoid such pitfalls.

Test 6: Manual, Low Prompt Strength, High Lyrics, Low Clarity

Manual control with high lyrics and low clarity produced mixed yet promising results. The first sample had a long instrumental opening and delayed lyrics but sounded good for metalcore. The second sample was better: it ignored dual voices, but the backing vocals worked and the words were very clear despite being screamed. The result was a bit mid-sounding but included a nice spoken word breakdown, making the sample more interesting despite clean vocals being ignored.

Test 7: Manual, High Prompt Strength, Low Lyrics, Low Clarity

Our final test with high prompt strength and manual control showcased the most potential. While dual voices were again ignored, the backing vocals were accurate, and the track featured a creative spoken word breakdown. The voice had a slightly better quality, and the guitars were notably melodic. Despite some lyrical structure being sacrificed, the dynamic vocals and genre adherence were impressive. This test illustrated that higher prompt strength, combined with manual adjustments, could produce high-quality, genre-specific music, albeit with some trade-offs in lyric integration.

Results in Summary

The experiments revealed a trade-off between lyric clarity, prompt strength, and manual control. Higher prompt strength tended to enhance genre-specific elements but often at the expense of lyric structure and dual voices. Manual control generally yielded better results, allowing for more precise adjustments to achieve the desired vocal variety and backing vocals. Overall, using manual settings with careful prompt engineering in ChatGPT-4 led to more consistent and high-quality outputs, demonstrating the power of combining AI tools for music creation.

One more small setting I always like to adjust is the lyrics timing. By default it sits at 0s, which makes the singing start instantly. I am not a big fan of that unless I am specifically looking for it, so I prefer the variety of the auto setting: you will get instrumental intros, and very rarely you won't get vocals until the end of the song.

Conclusion

Through these experiments, we have gained valuable insights into how different UdIO settings impact music generation. While there is no one-size-fits-all solution, understanding the nuances of prompt strength, manual control, and lyric clarity can help fine-tune AI-generated music to better meet stylistic and quality expectations. By sharing these findings and sound samples, we hope to assist fellow creators in optimizing their use of music AI tools.