2024:Music Audio Generation
Task Description
The MIREX 2024 Music Audio Generation Task challenges participants to develop models capable of generating high-quality, original music audio clips. This task aims to advance the state-of-the-art in music generation by encouraging the creation of systems that can produce coherent, aesthetically pleasing, and musically diverse outputs across various genres and styles.
Participants will be required to generate music clips based on textual prompts or other conditioning information provided in the dataset. The generated audio will be evaluated based on its musical quality, creativity, adherence to the provided prompt, and overall listenability.
Dataset
Description
An in-house music generation dataset, MirexGen2024, will serve as the evaluation benchmark for this task. The dataset is curated specifically for assessing music generated in response to text prompts. It includes:
- Audio Clips: A collection of diverse music clips across various genres, ranging from classical to electronic music, to support evaluation.
- Textual Prompts: Detailed prompts associated with each music clip, describing the desired musical characteristics such as mood, genre, instrumentation, and tempo.
For training, participants may use any data other than the MirexGen2024 dataset itself (see Rules).
Description of Audio Files
The audio files in the MirexGen2024 dataset are selected to represent a broad spectrum of musical genres and styles. Each clip is provided in a high-quality format, ensuring that the nuances of musical elements are preserved. The dataset includes clips of varying lengths, with a focus on short to medium-length excerpts (10 to 30 seconds).
Description of Text
The textual prompts provided in the dataset are carefully crafted to guide the generation process. These prompts include specific instructions regarding the desired genre, mood, instrumentation, and other musical characteristics. They are designed to challenge the generative models to produce music that is not only coherent but also closely aligned with the given descriptions.
Description of Split
The MirexGen2024 dataset is only used for testing.
Baseline
MusicGen
MusicGen, developed by Meta, is a single-stage transformer-based language model (LM) designed for conditional music generation. It operates over multiple streams of compressed discrete music tokens, which removes the need to cascade several models, for example hierarchically or through upsampling. MusicGen generates high-quality mono and stereo music samples conditioned on text descriptions or melodic features, giving finer control over the output. Both automatic and human evaluations show that MusicGen outperforms the baseline models it was compared against on text-to-music generation benchmarks, and ablation studies highlight the contribution of its key components.
MusicGen-large and MusicGen-medium will be used as baselines.
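As a rough illustration of how a baseline clip could be produced, the following sketch uses the open-source `audiocraft` package to run a MusicGen checkpoint on a list of text prompts. The output directory, file naming scheme, clip duration, and the helper names `output_stem` and `generate_baseline_clips` are assumptions made for this example. They are not part of the task specification, and the organizers' exact baseline settings are not stated here.

```python
from pathlib import Path
from typing import List


def output_stem(out_dir: str, index: int) -> str:
    """Path stem (no extension) for the clip generated from prompt `index`.

    The naming scheme is hypothetical; the task does not prescribe one here.
    """
    return str(Path(out_dir) / f"clip_{index:04d}")


def generate_baseline_clips(
    prompts: List[str],
    out_dir: str = "generated",
    checkpoint: str = "facebook/musicgen-medium",
    duration_s: float = 10.0,  # assumed; dataset clips span 10 to 30 seconds
) -> List[str]:
    """Generate one clip per prompt with MusicGen and write it to disk."""
    # Imported lazily so the helper above works without audiocraft installed.
    from audiocraft.models import MusicGen
    from audiocraft.data.audio import audio_write

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    model = MusicGen.get_pretrained(checkpoint)
    model.set_generation_params(duration=duration_s)

    # Returns a tensor of shape (batch, channels, samples).
    wavs = model.generate(prompts)

    stems = []
    for index, wav in enumerate(wavs):
        stem = output_stem(out_dir, index)
        # audio_write appends the file extension (".wav" by default).
        audio_write(stem, wav.cpu(), model.sample_rate, strategy="loudness")
        stems.append(stem)
    return stems
```

Calling `generate_baseline_clips(["calm piano piece with soft strings"])` (a made-up prompt) would download the checkpoint on first use and write `generated/clip_0000.wav`. Passing `checkpoint="facebook/musicgen-large"` selects the larger baseline.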
Metrics
The evaluation of the generated music will be based on a combination of objective and subjective metrics:
- MOS (Mean Opinion Score): A subjective evaluation metric where human listeners rate the overall quality and aesthetic appeal of the generated music.
- Inception Score (IS): An objective metric that evaluates the diversity and quality of the generated music, based on a pre-trained music classification model.
- FAD (Fréchet Audio Distance): Measures the similarity between the distribution of generated music and real music, capturing both quality and diversity.
- Prompt Adherence Score: A metric designed to assess how well the generated music aligns with the provided textual prompts.
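The two objective metrics in the list above can be sketched with NumPy as follows. This is a minimal illustration of the standard definitions, not the organizers' evaluation code. It assumes that class posteriors (for IS) and fixed-length embeddings (for FAD) have already been extracted with some pre-trained audio model. The task description does not say which classifier or embedding model is used, so the inputs here are plain arrays.

```python
import numpy as np


def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """IS = exp(mean over clips of KL(p(y|x) || p(y))).

    `probs` has shape (num_clips, num_classes); each row sums to 1.
    """
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), averaged over clips
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))


def frechet_audio_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    Both inputs have shape (num_clips, dim) with dim >= 2.
    """
    real_emb = np.asarray(real_emb, dtype=np.float64)
    gen_emb = np.asarray(gen_emb, dtype=np.float64)
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)

    # tr(sqrt(cov_r @ cov_g)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)
```

A higher Inception Score is better, while a lower FAD indicates that the generated distribution is closer to the reference one. Identical embedding sets give a FAD of approximately zero.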
Each metric will contribute to the final ranking, with MOS and Prompt Adherence Score being given the highest weight.
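One possible way to combine the metrics into a single ordering is a weighted mean of per-metric ranks, sketched below. The weight values, the metric keys, and the function name `rank_systems` are purely hypothetical. The task states only that MOS and Prompt Adherence Score carry the highest weight, not the actual numbers or the aggregation rule.

```python
from typing import Dict, List

# Hypothetical weights: the task only states that MOS and prompt adherence
# weigh most; the actual values and aggregation rule are not given.
WEIGHTS = {"mos": 0.35, "prompt_adherence": 0.35, "is": 0.15, "fad": 0.15}

# FAD is a distance, so smaller values rank higher.
LOWER_IS_BETTER = {"fad"}


def rank_systems(scores: Dict[str, Dict[str, float]]) -> List[str]:
    """Return system names ordered best-first by weighted mean rank.

    `scores` maps system name -> metric name -> value. Ties within a
    metric are broken by input order (no tie handling in this sketch).
    """
    systems = list(scores)
    weighted_rank = {name: 0.0 for name in systems}
    for metric, weight in WEIGHTS.items():
        best_first = sorted(
            systems,
            key=lambda s: scores[s][metric],
            reverse=metric not in LOWER_IS_BETTER,
        )
        for position, name in enumerate(best_first, start=1):
            weighted_rank[name] += weight * position
    return sorted(systems, key=lambda s: weighted_rank[s])
```

Rank-based aggregation is used in this sketch because the four metrics live on different scales, so their raw values cannot be averaged directly.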
Download
The MirexGen2024 dataset, including both the audio clips and the corresponding textual prompts, will be made available for download. A link to the dataset will be posted here.
Rules
Participants may use external datasets and pre-trained models to develop their systems. However, using the MirexGen2024 evaluation data for training or validation is strictly prohibited. Participants must ensure that their submissions are original and do not overlap with the evaluation data.
Submission
Submissions will be assessed automatically through CodaBench.
Participants are required to submit the following:
- Audio Files: A set of generated music clips corresponding to the prompts in the evaluation dataset.
- PDF File: A detailed report describing the system architecture, training process, and any external data or models used.
Each participant or team may submit up to three versions of their system. The final ranking will be based on the metrics outlined above.