The State of Creative Gen-AI in Summer 2024

Aug. 3, 2024 | Ryan Fitzpatrick

In the summer of 2024, AI continues to make remarkable strides in video generation, music composition, image creation, and open-source innovation.

How We Got Here

Since late 2022, AI has moved from niche applications to mainstream tech. However, we should exercise caution; there is a sense that AI progress might be stalling. Amid the hype, we need to recognize the challenges and limitations ahead, without overestimating AI's capabilities, to determine whether it will ever be truly useful as a product.

AI Content Pipeline

We are approaching the point where AI-generated media will start saturating engagement-driven content platforms such as YouTube and social media in general.

My process is very simple and uses the following platforms:

  • Runway Gen-3 Alpha - Video Clips
  • Udio/Suno - Music Composition
  • Midjourney - Image Source Style
  • ChatGPT - Prompts for Midjourney, style and lyric composition for music, prompt engineering for video clips

ChatGPT does a lot of heavy lifting as far as coming up with ideas and expanding on concepts that I already had.

Runway Gen-3 Alpha

Runway has been steadily iterating on AI video generation, and Gen-3 Alpha brings some impressive creative potential to the table. This tech lets users create highly realistic and dynamic videos from simple text prompts, which is pretty cool for digital storytelling and content creation.

As a hobbyist user, it's tough. The Highway Symphony video cost around $5 CAD across all the AI services involved, so I have to be really sure I want to gamble my credits on making what is in my head. And that is really what it comes down to with these AI services: you essentially gamble with credits to generate what you have in your mind.
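To put rough numbers on that gamble, here is a minimal sketch. It assumes every generation costs the same and has the same independent chance of being usable, which is a simplification. The function names and the dollar figures are hypothetical and are not taken from any service's pricing.

```python
def expected_cost(cost_per_attempt, success_probability):
    """Average spend until the first usable result, assuming independent attempts."""
    if not 0 < success_probability <= 1:
        raise ValueError("success_probability must be in (0, 1]")
    # Attempts until the first success follow a geometric distribution
    # with mean 1 / p, so the average spend is cost / p.
    return cost_per_attempt / success_probability


def chance_of_success(success_probability, attempts):
    """Probability that at least one of `attempts` generations is usable."""
    return 1 - (1 - success_probability) ** attempts


# Hypothetical numbers: $0.50 per clip, roughly 1 in 4 clips usable
print(expected_cost(0.50, 0.25))
print(chance_of_success(0.25, 4))
```

Under these made-up numbers, a clip that looks like it costs $0.50 really costs about four times that on average, and four attempts still leave a real chance of walking away with nothing usable.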

Suno and Udio vs. The Music Industry

Shadows in the Cave - Suno

Whispers of the Untamed - Udio

Suno and Udio continue to innovate, releasing new models that create increasingly engaging music compositions with longer generation lengths and greater macro control over the output. However, these advancements have sparked controversy: major music labels are suing both companies, raising significant ethical issues in AI. The lawsuits focus on how training data is obtained and test the boundaries of fair use under copyright law, highlighting the need for clear regulations in the evolving landscape of AI-generated music.

Midjourney Updates

Midjourney, the image-generating AI, has also seen remarkable progress with the release of Version 6.1. This latest iteration enhances image quality and creativity, making it a powerful tool for artists and designers. Alongside this update, Midjourney has introduced a new website frontend, moving away from its initial Discord-based interface and making it more user-friendly and accessible. Additionally, I have continued to enjoy the monthly coffee table magazine they publish, which showcases the best creations from the community and inspires users with the diverse and stunning possibilities of AI-generated art.

Comparing the same subject across versions, progress is noticeable, but it has become slower and more incremental, no longer a monumental shift in capability.

Version 4 (Spring 2023)

Version 5 (Fall 2023)

Version 6 (Spring 2024)

Version 6.1 (Summer 2024)

I have to give some accolades to Gemini and ChatGPT in image generation; each occasionally does better than the others in certain areas. For instance, my FinTech app Simple Cents got its Goosey logo from Gemini, by a stroke of luck I suppose, after Midjourney frustrated me for days while I tried to come up with this mascot.

ChatGPT 4o

This summer, ChatGPT received a significant boost with the 4o update (omni), enhancing its language understanding and generation capabilities across diverse applications. This update delivers more accurate, context-aware responses, making interactions smoother and more intuitive.

ChatGPT's voice mode is a significant advancement, allowing users to interact with the AI through spoken commands. This functionality enhances the user experience by enabling hands-free interaction, perfect for multitasking. Its ability to understand and process natural language accurately makes it useful for various applications, from drafting emails to setting reminders.

However, maintaining a natural flow of conversation can be difficult. Often, the AI responds before I have fully articulated my thoughts, disrupting the conversation and leading to incomplete or misunderstood commands. This is a common challenge across all voice command technologies, requiring improvements in the AI's ability to discern when a user has finished speaking.
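To illustrate why this is hard, here is a minimal sketch of the simplest way to guess that someone has finished speaking: wait for a run of quiet audio frames. This is not how ChatGPT's voice mode works internally (I do not know its implementation); the function name, the energy values, and the thresholds are all hypothetical.

```python
def find_end_of_speech(frame_energies, silence_threshold=0.1, min_silence_frames=3):
    """Return the index of the frame where the speaker is judged to be done, or None.

    A frame counts as silent when its energy is below `silence_threshold`.
    The speaker is judged done after `min_silence_frames` silent frames in a
    row, but only once some speech has been heard.
    """
    silent_run = 0
    heard_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            heard_speech = True
            silent_run = 0  # speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= min_silence_frames:
                return i
    return None


# Hypothetical energies: speech, a short thinking pause, more speech, then silence
energies = [0.0, 0.5, 0.6, 0.05, 0.05, 0.7, 0.04, 0.03, 0.02, 0.01]

print(find_end_of_speech(energies, min_silence_frames=2))  # cuts in during the pause
print(find_end_of_speech(energies, min_silence_frames=3))  # waits for the real ending
```

A short silence window makes the assistant feel responsive but interrupts anyone who pauses to think; a long one avoids interruptions but makes every reply feel sluggish. A fixed timeout cannot get both right, which is roughly the trade-off I keep running into.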

OpenAI has announced a major update for this fall, expected to enhance the AI's conversational abilities, making interactions smoother and more intuitive. This update could allow users to express their thoughts more naturally without interruptions.

The current voice mode in ChatGPT offers impressive capabilities but struggles to capture natural conversation flow. The forthcoming update has the potential to address this, paving the way for more intuitive and effective communication with AI.

ChatGPT has also been very handy for writing scripts that perform rudimentary media editing. I use a script ChatGPT wrote to compress the images I dump in and convert them to WebP to save on storage and bandwidth.

import os
from PIL import Image


def compress_image(input_path, output_path, quality=80):
    # Open the input image with Pillow; the context manager closes the file afterwards
    with Image.open(input_path) as image:
        # Save the image in WebP format using Pillow's built-in WebP support
        image.save(output_path, format='WEBP', quality=quality)

    print(f"Image saved as {output_path} with quality={quality}.")


def process_directory(directory_path, quality=80):
    for filename in os.listdir(directory_path):
        if filename.lower().endswith('.png'):
            input_path = os.path.join(directory_path, filename)
            output_filename = f"{os.path.splitext(filename)[0]}_compressed.webp"
            output_path = os.path.join(directory_path, output_filename)

            compress_image(input_path, output_path, quality)


# Example usage (only runs when the file is executed directly)
if __name__ == "__main__":
    directory_path = 'images'
    process_directory(directory_path, quality=80)
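To check whether the conversion is actually saving space, a small follow-up like the one below can compare file sizes. This is a sketch using only the standard library; the function name `savings_report` is hypothetical, and it assumes the `_compressed.webp` naming used by the script above.

```python
from pathlib import Path


def savings_report(directory, suffix="_compressed.webp"):
    """Return (name, png_bytes, webp_bytes, percent_saved) for each converted PNG."""
    rows = []
    for png in sorted(Path(directory).iterdir()):
        if png.suffix.lower() != ".png":
            continue
        # Same naming rule as the conversion script: <stem>_compressed.webp
        webp = png.with_name(png.stem + suffix)
        if not webp.exists():
            continue  # not converted yet
        before = png.stat().st_size
        after = webp.stat().st_size
        percent_saved = 100 * (before - after) / before if before else 0.0
        rows.append((png.name, before, after, round(percent_saved, 1)))
    return rows
```

Calling `savings_report('images')` after a run returns one row per converted image, which makes it easy to spot files where WebP at the chosen quality barely helps.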

This compression script lives in a project called "imageconverter", which at this point does much more than image work: the poor lil project also holds video editing scripts for making YouTube Shorts-format videos.

from moviepy.editor import VideoFileClip


def convert_to_portrait(input_path, output_path):
    # Load the video
    clip = VideoFileClip(input_path)

    # Get the original video dimensions
    original_width, original_height = clip.size

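    # Define the target portrait dimensions (9:16), keeping the width even
    # because H.264 with the common yuv420p pixel format needs even dimensions
    target_height = original_height
    target_width = int(target_height * 9 / 16) // 2 * 2

    # Calculate whole-pixel cropping offsets for a centered crop
    crop_x = (original_width - target_width) // 2
    crop_y = 0

    # Crop the video to the centered portrait region
    cropped_clip = clip.crop(x1=crop_x, y1=crop_y, x2=crop_x + target_width, y2=crop_y + target_height)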

    # Save the new video
    cropped_clip.write_videofile(output_path, codec='libx264')


# Example usage
input_video_path = "input.mp4"
output_video_path = "input Short.mp4"
convert_to_portrait(input_video_path, output_video_path)
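The crop arithmetic can be checked on its own, without rendering any video. The sketch below is a hypothetical helper, not part of the script above. It also covers a case the script does not consider: a source that is already narrower than 9:16, where it trims the height instead of the width.

```python
def portrait_crop_box(width, height, aspect=(9, 16)):
    """Return (x1, y1, x2, y2) for a centered crop with the given aspect ratio."""
    ratio_w, ratio_h = aspect

    # Try to keep the full height and trim the sides
    target_w = height * ratio_w // ratio_h
    target_h = height
    if target_w > width:
        # Source is narrower than the target ratio: keep the width, trim the height
        target_w = width
        target_h = width * ratio_h // ratio_w

    # Round both dimensions down to even numbers
    target_w -= target_w % 2
    target_h -= target_h % 2

    # Center the crop box inside the original frame
    x1 = (width - target_w) // 2
    y1 = (height - target_h) // 2
    return x1, y1, x1 + target_w, y1 + target_h


print(portrait_crop_box(1920, 1080))  # landscape 1080p source
print(portrait_crop_box(1080, 2400))  # tall phone recording
```

For a 1920x1080 source this gives a 606-pixel-wide strip from the middle of the frame, which is a reminder that a landscape shot loses more than two thirds of its width when it becomes a Short.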

LangChain

LangGraph has emerged as a powerful extension within the LangChain ecosystem, enabling the development of complex, stateful, multi-agent applications. Key features include cyclical execution, enhanced state management, and fine-grained control over agent workflows. This allows developers to create sophisticated AI systems that can reason, make decisions, and interact with multiple data sources effectively.
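To make "cyclical execution" and "state management" a bit more concrete, here is a plain-Python sketch of the idea. It deliberately does not use LangGraph's API; every name in it (`run_graph`, `draft`, `review`, the state keys) is hypothetical and only illustrates the concept of nodes that update shared state, with a routing step that can loop back.

```python
def draft(state):
    """Node: produce a new draft and count the attempt."""
    attempts = state["attempts"] + 1
    return {**state, "attempts": attempts, "draft": f"draft v{attempts}"}


def review(state):
    """Node: approve once there have been at least two attempts."""
    return {**state, "approved": state["attempts"] >= 2}


def route_after_review(state):
    """Edge: finish when approved, otherwise loop back to drafting."""
    return "END" if state["approved"] else "draft"


def run_graph(nodes, edges, start, state, max_steps=10):
    """Run nodes in sequence, following edges until END or the step limit."""
    current, steps = start, 0
    while current != "END":
        if steps >= max_steps:
            raise RuntimeError("graph did not finish within max_steps")
        state = nodes[current](state)      # node returns the updated state
        current = edges[current](state)    # edge picks the next node from the state
        steps += 1
    return state


nodes = {"draft": draft, "review": review}
edges = {"draft": lambda state: "review", "review": route_after_review}

final = run_graph(nodes, edges, "draft", {"attempts": 0, "approved": False})
print(final)
```

The loop from `review` back to `draft` is the part a straight-line chain cannot express, and the step limit is the kind of guard rail any cyclical workflow needs so an agent cannot spin forever.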

LangGraph Cloud provides scalable infrastructure for deploying these agents, with support for double-texting, asynchronous background jobs, and cron jobs. This makes it easier to manage large-scale, real-world AI applications. Integration with LangSmith adds monitoring and debugging capabilities, which helps with performance and reliability.

LangGraph's state management and error handling, along with support for human-in-the-loop workflows, make it versatile for building advanced AI applications, from chatbots to complex decision-making systems in various industries.

Looking Ahead

AI has made substantial advancements across various domains in 2024, significantly impacting industries like healthcare, finance, and creative arts. Key developments include the integration of AI in diagnostic tools, the rise of generative models for content creation, and the improvement of natural language processing (NLP) technologies. These advancements have been driven by the increased availability of large datasets and the evolution of AI models that can process and analyze this data more effectively.

By the end of 2024, it is likely that AI-driven personalization will become more sophisticated and widespread. This could manifest in enhanced user experiences across digital platforms, where AI systems can tailor content, recommendations, and interactions more precisely to individual preferences and behaviors. This shift towards more intuitive and responsive AI could improve customer satisfaction and engagement in sectors ranging from e-commerce to entertainment.