Stability AI · Open Weights Available

Generative AI for Music & Sound Effects

Stable Audio creates original music and sound effects from natural-language text prompts. Generate full tracks in seconds, edit with precision, and build on open-weights models trained on licensed data.

4 · Model Sizes

6m 20s · Max Track Length

< 2s · Generation Time (on an H200 GPU)

Open · Weights Available

Overview

What makes Stable Audio different from other AI music tools

Stable Audio is built by Stability AI on fully licensed data. It combines fast inference, artist-first controllability, and open-weights availability into a single platform for music and sound design.

Text-to-audio generation

Describe the music or sound effect you want in natural language. The model outputs full tracks with coherent musical structure at 44.1 kHz stereo.

Audio-to-audio editing

Upload existing audio and pair it with a text prompt to change style, genre, or mood. The input audio guides the model toward your target output.

Inpainting & continuation

Edit specific segments of a track, replace sections, or extend audio beyond its original endpoint. Targeted control without regenerating the whole file.

Open weights & API access

Download Small and Medium model weights for self-hosting, or use the Stability AI API for managed hosting. Enterprise plans include customization and indemnification.

How It Works

From prompt to production audio in seconds

Stable Audio uses a fast latent diffusion architecture. A semantic-acoustic autoencoder projects audio into a compact latent space, enabling efficient generation while preserving fidelity.

Write your prompt

Describe the audio you want: genre, mood, tempo, key instruments. The more detail you provide, the closer the output matches your intent.
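As a rough illustration of the fields listed above, the sketch below joins genre, mood, tempo, and instruments into one text prompt. The `build_prompt` helper and its parameter names are hypothetical and not part of any Stable Audio API; the model itself simply takes free text.

```python
def build_prompt(genre, mood, tempo_bpm=None, instruments=()):
    """Join prompt fields into one comma-separated text prompt.

    Hypothetical helper for illustration only.
    """
    parts = [genre, mood]
    if tempo_bpm is not None:
        parts.append(f"{tempo_bpm} BPM")
    parts.extend(instruments)
    # Drop empty fields so the prompt has no stray commas.
    return ", ".join(p.strip() for p in parts if p and p.strip())


prompt = build_prompt(
    "lo-fi hip hop", "relaxed", 85, ["electric piano", "vinyl crackle"]
)
print(prompt)  # lo-fi hip hop, relaxed, 85 BPM, electric piano, vinyl crackle
```

Keeping the fields separate makes it easy to vary one attribute, such as tempo, while holding the rest of the prompt fixed.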

Model generates audio

The diffusion model produces full-length audio in less than two seconds on an H200 GPU. Small models run on consumer hardware, including a MacBook Pro M4.

Edit and refine

Use inpainting to modify specific segments, change style via audio-to-audio, or extend the track with causal continuation.

Export and use

Download in WAV or MP3 format. Pro users get full commercial rights. Enterprise customers receive legal indemnification.

Models

Meet the Stable Audio family

Four model variants cover everything from on-device sound effects to enterprise-grade production. Small and Medium are available as open weights on Hugging Face.

Small SFX

459M params · Sound effects

Optimized for mobile devices. Generates up to 2 minutes of sound effects. Runs offline on consumer laptops.

Small

459M params · Short music

Full music composition on-device with open-weights availability. Generates tracks up to 2 minutes.

Medium

1.4B params · Full tracks

Generates up to 6m 20s of music with complex dynamic structure. Open weights are available on Hugging Face. LoRA fine-tuning is supported.

Large

Enterprise-grade

Designed for enterprise sound production. Access via Stability AI API with customization and white-glove support.

Why It Matters

Built for professionals who need reliable AI audio

Stable Audio combines open innovation with commercial safety. Every model is trained on licensed and Creative Commons data, so you can use the outputs with confidence.

Built open

Experiment with open-weights models. See what is under the hood. Build what comes next. Small and Medium are freely available to download and customize.

Built to customize

Fine-tune models on your own audio library using LoRA. Enterprise customers get guided fine-tuning support from the Stability AI Audio Research team.

Built to own

Commercially safe models trained on fully licensed datasets. Legal indemnification provided under the Enterprise license. Use outputs in your commercial projects.

Deep Dive

How Stable Audio 3.0 achieves fast, high-fidelity generation

Stable Audio 3.0 is a family of fast latent diffusion models for variable-length audio generation and editing. At its core is a novel semantic-acoustic autoencoder that projects audio into a compact latent space, preserving fidelity while encouraging semantic structure. This representation makes diffusion efficient enough to generate up to six minutes of audio in under two seconds on an H200 GPU.
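To put those figures in perspective, here is a small back-of-the-envelope calculation. It uses only numbers quoted on this page (six minutes of 44.1 kHz, 16-bit stereo audio, generated in under two seconds) and says nothing about the size of the latent representation, which is not given here.

```python
SAMPLE_RATE = 44_100       # Hz, as quoted for Stable Audio output
CHANNELS = 2               # stereo
BYTES_PER_SAMPLE = 2       # 16-bit PCM
TRACK_SECONDS = 6 * 60     # "up to six minutes"
GENERATION_SECONDS = 2.0   # upper bound quoted for an H200 GPU


def pcm_frames(seconds, sample_rate=SAMPLE_RATE):
    """Number of sample frames (one per channel set) in a track."""
    return seconds * sample_rate


def pcm_bytes(seconds):
    """Uncompressed PCM size in bytes for the track."""
    return pcm_frames(seconds) * CHANNELS * BYTES_PER_SAMPLE


frames = pcm_frames(TRACK_SECONDS)
size_mb = pcm_bytes(TRACK_SECONDS) / 1_000_000
speedup = TRACK_SECONDS / GENERATION_SECONDS

print(frames)    # 15876000 frames per channel
print(size_mb)   # 63.504 MB of raw PCM
print(speedup)   # 180.0, i.e. at least 180x faster than real time
```

Because the quoted generation time is an upper bound, the real-time factor of 180 is a lower bound.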

The model family supports three key interaction modes. Text-to-audio generates full tracks from natural-language descriptions. Audio-to-audio transforms uploaded audio by changing style, genre, or instrumentation. Inpainting enables targeted editing — modify a single segment, perform multi-segment edits, or extend audio coherently beyond its original endpoint via causal continuation.
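One way to picture multi-segment editing and continuation is as a list of time regions to regenerate. The sketch below turns edit segments given in seconds into merged sample-frame regions and works out the output length when a segment runs past the original endpoint. It is a conceptual illustration only: `plan_edits` is a hypothetical helper, and how the models actually represent edit regions is not described on this page.

```python
SAMPLE_RATE = 44_100  # Hz, as quoted for Stable Audio output


def plan_edits(track_seconds, segments, sample_rate=SAMPLE_RATE):
    """Merge edit segments and compute the resulting track length.

    segments: iterable of (start_seconds, end_seconds) to regenerate.
    Returns (output_frames, regions) where regions are (start, end)
    frame indices, sorted and non-overlapping.
    """
    track_frames = round(track_seconds * sample_rate)
    regions = []
    for start_s, end_s in sorted(segments):
        start = max(0, round(start_s * sample_rate))
        end = round(end_s * sample_rate)
        if end <= start:
            continue  # skip empty or inverted segments
        if regions and start <= regions[-1][1]:
            # Overlapping or touching segments become one region.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    # A region that ends after the track extends it (continuation).
    output_frames = max([track_frames] + [end for _, end in regions])
    return output_frames, regions


# A 60 s track: two overlapping edits, plus a 15 s extension at the end.
length, regions = plan_edits(60, [(10, 15), (14, 20), (55, 75)])
print(length)   # 3307500 frames, i.e. 75 s
print(regions)  # [(441000, 882000), (2425500, 3307500)]
```

Everything outside the returned regions would be left untouched, which is the point of targeted editing.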

Adversarial post-training reduces the number of inference steps needed while improving fidelity and prompt adherence. The result is a model that runs on consumer hardware — a MacBook Pro M4 generates audio in a few seconds — while matching the quality of much larger systems.

Quick Start

Go from reading to generating in minutes

Whether you use the Stability AI API or run models locally, getting started takes only a few steps.

Choose your path

Use the Stability AI API for managed hosting, or download open weights from Hugging Face for self-hosted inference.

Install dependencies

Python 3.10+ and PyTorch. The open-weights models run on any CUDA-capable GPU or Apple Silicon Mac.
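A quick environment check can confirm these requirements before you download any weights. The sketch below uses standard PyTorch device queries; the `pick_device` and `python_ok` helpers are illustrative names, and the fallback to `"cpu"` when PyTorch is missing is an assumption made so the script runs anywhere.

```python
import sys


def python_ok(version=sys.version_info):
    """True if the interpreter is Python 3.10 or newer."""
    return tuple(version[:2]) >= (3, 10)


def pick_device():
    """Return 'cuda', 'mps', or 'cpu' based on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed yet
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon
    return "cpu"


print("Python 3.10+:", python_ok())
print("Device:", pick_device())
```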

Run inference

Load the model and pass a text prompt. A few lines of code are all it takes to generate your first track.

Fine-tune (optional)

Use LoRA to adapt the model to your own audio library. Documentation is available alongside the weights.
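For a sense of why LoRA is lightweight, the sketch below compares trainable parameter counts for one weight matrix. It reflects the general LoRA technique, in which a frozen `d_out x d_in` weight is adapted by two small trainable matrices of rank `r`. The layer size and rank here are hypothetical and are not taken from the Stable Audio configuration.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable parameter counts for one layer.

    full: fine-tuning the whole d_out x d_in weight matrix.
    lora: training only B (d_out x rank) and A (rank x d_in).
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


# Hypothetical 1024 x 1024 projection with rank 16.
full, lora = lora_param_counts(1024, 1024, 16)
print(full)  # 1048576
print(lora)  # 32768
print(f"LoRA trains {lora / full:.1%} of this layer's parameters")
```

The saving grows with layer width, since the full count scales with `d_in * d_out` while the LoRA count scales with `d_in + d_out`.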

FAQ

The fastest answers to the questions people ask first

Start here if you want to know about licensing, hardware requirements, or what makes Stable Audio different from other AI music tools.

What is Stable Audio?

Stable Audio is a generative AI platform by Stability AI that creates original music and sound effects from text prompts. It uses latent diffusion models trained on licensed and Creative Commons data to produce high-quality audio at 44.1 kHz stereo.

Who created Stable Audio?

Stable Audio is built by Stability AI, the company behind Stable Diffusion. The research team published the Stable Audio 3 paper on arXiv in May 2026, detailing the architecture and training methodology.

Can I use the generated audio commercially?

Yes, with the appropriate plan. Pro users get full commercial rights to generated outputs. Enterprise customers receive legal indemnification. All models are trained on fully licensed datasets for commercial safety.

What hardware do I need to run it?

The Small models run on consumer hardware, including a MacBook Pro M4. For Medium and Large models, an NVIDIA GPU is recommended. The Small SFX model is optimized for mobile devices and can generate sound effects offline.

Are the model weights open-source?

Yes. Stable Audio 3.0 Small and Medium weights are available on Hugging Face under an open license. The Large model is available through the Stability AI API and Enterprise plans.

What audio formats are supported?

The models generate 44.1 kHz 16-bit audio. Outputs can be exported as WAV or MP3 files. The web product also supports MIDI export for further editing in DAWs.
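If you are post-processing output yourself, the format above maps directly onto Python's standard `wave` module. The sketch below writes a short test tone as 44.1 kHz, 16-bit stereo WAV and reads the header back. The sine tone is a stand-in for model output, and `sine_wav_bytes` is an illustrative helper, not part of any Stable Audio tooling.

```python
import io
import math
import struct
import wave


def sine_wav_bytes(freq_hz=440.0, seconds=0.5, sample_rate=44_100):
    """Return a 16-bit stereo WAV file (as bytes) containing a sine tone."""
    n_frames = int(seconds * sample_rate)
    frames = bytearray()
    for i in range(n_frames):
        sample = math.sin(2 * math.pi * freq_hz * i / sample_rate)
        value = int(0.5 * 32767 * sample)  # half of full scale
        # Little-endian signed 16-bit, same value on left and right.
        frames += struct.pack("<hh", value, value)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(2)      # stereo
        wav.setsampwidth(2)      # 2 bytes = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return buf.getvalue()


data = sine_wav_bytes()
with wave.open(io.BytesIO(data), "rb") as wav:
    print(wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate())
    # prints: 2 16 44100
```

To write to disk instead, pass a file path to `wave.open` in place of the in-memory buffer.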