How long can MusicGen generate in one pass?

MusicGen reliably produces up to about 30 seconds of audio per generation. The model can technically be pushed longer, but coherence degrades, so for longer pieces you chain generations using continuation mode.

What is continuation mode?

Continuation feeds the model an existing audio segment and asks it to keep going in the same style. It is how you build coherent tracks longer than 30 seconds — generate a base clip, then continue from its final seconds repeatedly.

What is melody conditioning?

The melody-conditioned MusicGen variant accepts a reference melody (a hummed or played line) alongside the text prompt. The model then generates accompaniment in your described style while following that melodic contour, which is useful for arranging a tune you already have.

Does MusicGen understand BPM and key?

It responds well to tempo cues like "120 BPM" or "slow" and to mood words that imply a tonality ("melancholic", "uplifting"), but it does not take an explicit musical-key parameter. Describe the feel rather than naming a specific key signature.

What is the MusicGen Prompt Guide?

The MusicGen Prompt Guide is a guide to writing prompts for Meta MusicGen. It covers text conditioning, continuation and melody-conditioning modes, the 30-second duration limit per generation, and a style descriptor vocabulary for genre, instrumentation, and mood. It runs free in your browser on Gera Tools, with nothing uploaded.

MusicGen Prompt Guide

Name: MusicGen Prompt Guide
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

The difference between a muddy MusicGen clip and a usable one is usually not the model — it is whether the prompt names actual instruments. Meta’s open text-to-music model (released as part of the AudioCraft library) is text-conditioned: it reads your description holistically and renders audio from it. It is fast and runs locally, but it has two quirks worth planning around — a practical 30-second per-generation limit and three distinct modes (plain generation, continuation, and melody conditioning) that each call for a slightly different prompt approach. This guide builds the conditioning string and tells you which mode fits your goal.

The mode decision, in one paragraph

In generation mode, MusicGen reads a short descriptive phrase — genre, instrumentation, tempo, mood — and produces a clip up to roughly 30 seconds. For longer tracks you switch to continuation: you pass in an existing audio segment and the model extends it in the same style, which keeps a multi-minute piece coherent. The melody-conditioned variant additionally takes a reference melody and arranges your described style around that contour.
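The decision above can be sketched as a small function. This is only an illustration of the rule of thumb in this guide: the name choose_mode and the constant MAX_SECONDS_PER_CALL are hypothetical, and the sketch assumes that having a reference melody takes priority over track length.

```python
from typing import Literal

Mode = Literal["generation", "continuation", "melody"]

# Practical per-call ceiling described in this guide, in seconds.
MAX_SECONDS_PER_CALL = 30.0


def choose_mode(target_seconds: float, has_reference_melody: bool = False) -> Mode:
    """Pick a MusicGen mode from target length and available inputs."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    # A tune you already have is the case melody conditioning is meant for.
    if has_reference_melody:
        return "melody"
    # Anything past the per-call ceiling needs chained continuations.
    if target_seconds > MAX_SECONDS_PER_CALL:
        return "continuation"
    return "generation"


print(choose_mode(20))        # short clip, no melody
print(choose_mode(90))        # longer than one call
print(choose_mode(25, True))  # reference melody available
```

A long piece built around a reference melody would combine two modes (arrange first, then continue), which this simplified sketch does not model.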

Model sizes and what they cost you

MusicGen ships in several checkpoints on Hugging Face, published under the facebook/ namespace (for example facebook/musicgen-small and facebook/musicgen-large):

Checkpoint	Parameters	Practical note
musicgen-small	~300M	Fast drafts on modest GPUs; weakest prompt adherence
musicgen-medium	~1.5B	The sweet spot for local generation
musicgen-melody	~1.5B	The one that accepts a reference melody
musicgen-large	~3.3B	Best fidelity and adherence; needs serious VRAM

A sensible workflow is to iterate prompts on small, then re-render the winning prompt on large — prompt quality transfers across checkpoints far better than seed luck does.

MusicGen’s three modes in practice

Text-to-music (generation) is the default. You write a descriptive phrase and the model produces a clip from scratch. Because MusicGen reads the full string holistically rather than treating it as comma-separated tags, write it more like a sentence: “a warm jazz trio with Rhodes piano, upright bass, and brushed drums, 90 BPM, relaxed late-night feel” works better than “Rhodes, bass, drums, jazz, warm, 90 BPM”.
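A minimal sketch of plain generation, assuming the audiocraft Python package is installed and using its MusicGen class. The helper names clamp_duration and generate_clip, the default checkpoint, and the output file stem are illustrative choices, not part of the library.

```python
# Practical per-call ceiling from this guide, in seconds.
MAX_SECONDS = 30.0


def clamp_duration(seconds: float, limit: float = MAX_SECONDS) -> float:
    """Keep a requested duration inside the per-call limit."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return min(float(seconds), limit)


def generate_clip(prompt, seconds=20.0,
                  checkpoint="facebook/musicgen-small", out_stem="clip"):
    """Render one clip from a text prompt and save it as <out_stem>.wav."""
    # Imported lazily: audiocraft is a heavy dependency, assumed installed.
    from audiocraft.models import MusicGen
    from audiocraft.data.audio import audio_write

    model = MusicGen.get_pretrained(checkpoint)
    model.set_generation_params(duration=clamp_duration(seconds))
    wav = model.generate([prompt])  # one prompt in, a batch of one clip out
    # Write the first (only) clip with loudness normalization.
    return audio_write(out_stem, wav[0].cpu(), model.sample_rate,
                       strategy="loudness")
```

Calling generate_clip with the sentence-style jazz trio prompt above would draft a clip on the small checkpoint; passing checkpoint="facebook/musicgen-large" re-renders the same prompt on the larger model.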

Continuation is how you build longer tracks. You generate a 20-second base clip, then pass the last 5-10 seconds of that clip back to MusicGen as a prompt-audio input alongside the same text description. The model picks up from where the audio left off. Chaining three or four continuations gives you a coherent 60-90 second track. Keep the text prompt consistent across continuations so the style does not drift.
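The chaining arithmetic and one continuation step can be sketched as follows. The helper names plan_chain and extend_track are hypothetical. The sketch assumes audiocraft's generate_continuation returns audio that begins with the prompt segment, so each call adds only the call duration minus the tail length in new material; the cut point is approximate.

```python
def plan_chain(base_seconds, tail_seconds, call_seconds, continuations):
    """Total track length in seconds after chaining continuations."""
    if not 0 < tail_seconds < call_seconds:
        raise ValueError("tail must be shorter than one call")
    # Each call re-renders the tail, so only the remainder is new audio.
    new_audio_per_call = call_seconds - tail_seconds
    return base_seconds + continuations * new_audio_per_call


def extend_track(model, track, prompt, tail_seconds=10.0, call_seconds=30.0):
    """Extend `track` (tensor [channels, samples]) by one continuation call."""
    import torch  # lazy import; torch and audiocraft are assumed installed

    sr = model.sample_rate
    tail_len = int(tail_seconds * sr)
    tail = track[..., -tail_len:]  # last few seconds act as the audio prompt
    model.set_generation_params(duration=call_seconds)
    out = model.generate_continuation(
        tail[None], prompt_sample_rate=sr, descriptions=[prompt]
    )
    # Assumption: the output starts with the prompt audio, so drop that part.
    new_part = out[0, ..., tail_len:].cpu()
    return torch.cat([track.cpu(), new_part], dim=-1)


print(plan_chain(20, 10, 30, 3))
```

Under these assumptions, a 20-second base with a 10-second tail and three 30-second calls comes to 80 seconds. Passing the same prompt string to every extend_track call is what keeps the style from drifting.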

Melody conditioning requires a MusicGen variant trained for this purpose. You provide a reference audio clip — a hummed melody, a sung tune, or any single melodic line — and the text description. The model generates harmonic accompaniment that follows the melodic contour of your reference in the style you describe. This is the fastest way to turn a melody you have in your head into a produced arrangement.
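A hedged sketch of melody conditioning, assuming audiocraft and torchaudio are installed. It uses audiocraft's generate_with_chroma call. The helpers is_melody_checkpoint and arrange_melody are hypothetical, and the name check relies on the assumption that melody-capable checkpoints carry "melody" in their model name.

```python
def is_melody_checkpoint(checkpoint: str) -> bool:
    """Heuristic: melody-capable checkpoints have 'melody' in the model name."""
    return "melody" in checkpoint.rsplit("/", 1)[-1]


def arrange_melody(melody_path, prompt, seconds=20.0,
                   checkpoint="facebook/musicgen-melody"):
    """Generate an arrangement that follows the melody in `melody_path`."""
    if not is_melody_checkpoint(checkpoint):
        raise ValueError("melody conditioning needs a melody checkpoint")
    # Lazy imports: heavy dependencies, assumed installed.
    import torchaudio
    from audiocraft.models import MusicGen

    melody, sample_rate = torchaudio.load(melody_path)  # [channels, samples]
    model = MusicGen.get_pretrained(checkpoint)
    model.set_generation_params(duration=seconds)
    # melody[None] adds the batch dimension: one melody for one description.
    wav = model.generate_with_chroma([prompt], melody[None], sample_rate)
    return wav[0].cpu(), model.sample_rate
```

The returned tensor and sample rate can be written to disk with any audio writer. The text prompt still decides genre and instrumentation, while the reference clip only supplies the contour.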

Prompt anatomy for MusicGen

A strong MusicGen prompt has four components in this order:

Genre or style anchor — “lo-fi hip hop”, “80s synthwave”, “acoustic folk”
Specific instruments — “Rhodes piano, upright bass, brushed snare” (not just “jazz”)
Tempo or energy level — “80 BPM”, “slow and meditative”, “driving and upbeat”
Mood or texture — “warm, late-night, melancholy”, “bright and energetic”

For example: “lo-fi hip hop with a warm Rhodes piano, gentle upright bass, and soft brushed drums, 85 BPM, nostalgic and relaxed, late-night studying vibe.”
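The four-part anatomy can be assembled mechanically. The sketch below is one possible way to do it; build_prompt is a hypothetical helper, not the guide's actual prompt builder, and it simply joins the components in the order listed above.

```python
from typing import Sequence


def build_prompt(genre: str, instruments: Sequence[str],
                 tempo: str, mood: str) -> str:
    """Join genre, instruments, tempo, and mood into one descriptive phrase."""
    if not instruments:
        raise ValueError("name at least one instrument")
    items = list(instruments)
    if len(items) == 1:
        instrument_text = items[0]
    elif len(items) == 2:
        instrument_text = " and ".join(items)
    else:
        # Serial comma before the final instrument.
        instrument_text = ", ".join(items[:-1]) + ", and " + items[-1]
    return f"{genre} with {instrument_text}, {tempo}, {mood}"


prompt = build_prompt(
    "lo-fi hip hop",
    ["a warm Rhodes piano", "gentle upright bass", "soft brushed drums"],
    "85 BPM",
    "nostalgic and relaxed, late-night studying vibe",
)
print(prompt)
```

With these inputs the function reproduces the example prompt above, minus the final period. Requiring at least one instrument enforces the guide's main advice: name actual instruments rather than relying on a genre word alone.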

Tips for better MusicGen output

Be concrete about instruments. “warm Rhodes piano, soft brushed drums, upright bass” gives a far more controlled result than “jazzy”.
State tempo, not key. “90 BPM, relaxed” works; an explicit key signature does not, so describe mood instead.
Chain with continuation for length. Don’t push a single call past 30 seconds — generate a base and continue from its tail to stay coherent.
Use melody conditioning to arrange. If you already have a tune, the melody-conditioned model will dress it in your described genre.
Generate multiple seeds. MusicGen varies considerably across seeds; generating 3-4 variations and picking the best one is standard practice.

What MusicGen will not do

Knowing the model’s hard edges saves prompt-fiddling time. It does not take an explicit key signature or chord chart (describe mood instead), it does not generate vocals with intelligible lyrics (it was trained on instrumental-leaning data; vocal-sounding textures come out as wordless syllables), and it cannot hold structure — verse/chorus form — across a single long generation, which is precisely why continuation-chaining exists. If you need lyric-bearing songs, you want a different model family; if you need arrangement of an existing tune, melody conditioning is the strongest feature in the AudioCraft toolbox.

Sources

Meta AudioCraft — official repository and documentation
facebook/musicgen-large on Hugging Face — model card with checkpoints, license, and usage

Model behavior described here reflects the released AudioCraft MusicGen checkpoints; parameter counts are from the official model cards. The prompt builder runs entirely in your browser.