Stability AI, known for its generative AI technology in image and code generation, has now expanded into text-to-audio generation. The organization recently launched Stable Audio, a technology that enables users to generate short audio clips from simple text prompts. This follows its successful ventures Stable Diffusion and StableCode, which focus on text-to-image and text-to-code generation respectively. Stable Audio uses a diffusion model trained on audio data to generate new and unique audio clips based on user descriptions.
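Stability AI has not published Stable Audio's internals, but the core idea of diffusion-based generation can be sketched in a few lines: start from random noise shaped like a waveform and iteratively denoise it, guided by an embedding of the text prompt. The network, schedule, and update rule below are deliberately toy stand-ins, not Stability AI's implementation.

```python
import torch

class ToyDenoiser(torch.nn.Module):
    """Stand-in for a real denoising network; predicts the noise in x."""
    def __init__(self, length):
        super().__init__()
        self.proj = torch.nn.Linear(length, length)

    def forward(self, x, t, text_embedding):
        # A real model would fuse the timestep t and the text embedding;
        # this toy version ignores them and just transforms the waveform.
        return self.proj(x)

def generate_audio(model, text_embedding, num_steps=50, length=1024):
    x = torch.randn(1, length)                  # start from pure noise
    for t in reversed(range(num_steps)):
        alpha = 1.0 - t / num_steps             # toy linear noise schedule
        x = x - (1.0 - alpha) * model(x, t, text_embedding)
    return x                                    # "denoised" waveform

model = ToyDenoiser(length=1024)                # length shortened for the sketch
clip = generate_audio(model, text_embedding=None)
```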
According to Ed Newton-Rex, VP of Audio at Stability AI, Stable Audio is the organization’s first product for music and audio generation. Newton-Rex, who previously founded the startup Jukedeck, brings his expertise in computer-generated music to Stability AI. However, the roots of Stable Audio lie in Stability AI’s internal research studio called Harmonai, founded by Zach Evans. This studio applies similar AI techniques used in image generation to the domain of audio, creating a community-oriented space for generative audio research.
Moving Beyond Symbolic Generation
While symbolic generation techniques based on MIDI files have long allowed users to produce basic audio tracks, Stable Audio takes a different approach. Instead of relying on repetitive notes and predefined patterns, it works directly with raw audio samples, yielding higher-quality output. The model was trained on more than 800,000 pieces of licensed music from the AudioSparx library, providing not only high-quality audio but also comprehensive metadata.
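To make the distinction concrete: a symbolic representation describes music as discrete note events (what a MIDI file encodes), while a raw-audio approach operates on the sampled waveform itself. Both snippets below are illustrative only, not drawn from Stable Audio's code.

```python
import numpy as np

# Symbolic representation: music as discrete note events, the kind of
# data a MIDI file encodes (pitch, start time, duration).
midi_like_track = [
    {"pitch": 60, "start": 0.0, "duration": 0.5},   # middle C
    {"pitch": 64, "start": 0.5, "duration": 0.5},   # E
    {"pitch": 67, "start": 1.0, "duration": 0.5},   # G
]

# Raw-audio representation: the same middle C as an actual waveform,
# 44,100 samples per second -- the kind of data a raw-audio model sees.
sample_rate = 44_100
t = np.linspace(0.0, 0.5, int(sample_rate * 0.5), endpoint=False)
middle_c_wave = 0.5 * np.sin(2 * np.pi * 261.63 * t)  # 261.63 Hz sine
```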
Unlike image generation models, which often focus on replicating a specific artist’s style, Stable Audio encourages users to explore their own creativity. The model is not designed for prompts that ask it to imitate a specific musical group, such as generating music that sounds like a classic Beatles tune. According to Newton-Rex, most musicians prefer to start from scratch and develop their own unique sound rather than imitate existing artists.
The Power of Parameters
As a diffusion model, Stable Audio has approximately 1.2 billion parameters, comparable to the original Stable Diffusion model for image generation. The text conditioning that drives the audio generation process was built and trained by Stability AI using Contrastive Language-Audio Pretraining (CLAP) techniques. To help users generate the audio files they want, Stability AI is also releasing a prompt guide with helpful suggestions.
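Stability AI has not released the training code referenced here, but the CLAP idea mirrors CLIP: encode text and audio into a shared embedding space and train so that matching pairs score higher than mismatched ones. The embeddings below are random stand-ins; real ones would come from trained text and audio encoders.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(text_emb, audio_emb, temperature=0.07):
    """CLIP/CLAP-style contrastive loss over a batch of matched
    (text, audio) embedding pairs; row i of each tensor is assumed
    to describe the same clip."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.T / temperature    # pairwise similarities
    targets = torch.arange(len(logits))              # diagonal = true matches
    # Symmetric cross-entropy: text-to-audio and audio-to-text directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random stand-in embeddings of dimension 512.
loss = clap_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```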
Accessible to All
Stable Audio will be available to users in both free and Pro versions. The free version allows for 20 audio generations per month, with each track limited to 20 seconds. In contrast, the Pro version, available for $12 per month, increases these limits to 500 audio generations and 90-second tracks. Stability AI aims to make Stable Audio accessible to everyone, encouraging experimentation and exploration.
With the launch of Stable Audio, Stability AI brings its generative AI expertise to music and audio generation. By letting users generate unique audio clips from simple text prompts, Stable Audio empowers musicians and creators to explore their creativity and develop their own distinct sound. Whether for experimentation or professional music production, it offers a powerful tool that opens new possibilities in audio generation.