2024:Music Audio Generation

Task Description

The MIREX 2024 Music Audio Generation Task challenges participants to develop models capable of generating high-quality, original music audio clips. The task aims to advance the state of the art in music generation by encouraging systems that produce coherent, aesthetically pleasing, and musically diverse outputs across a range of genres and styles.

Participants will be required to generate music clips based on textual prompts or other conditioning information provided in the dataset. The generated audio will be evaluated based on its musical quality, creativity, adherence to the provided prompt, and overall listenability.

Dataset

Description

An in-house music generation dataset, MirexGen2024, will serve as this task's evaluation benchmark. This dataset is specially curated to facilitate the generation of music in response to specific prompts. It includes:

  • Audio Clips: A collection of diverse music clips across various genres, ranging from classical to electronic music, used as reference audio for evaluation.
  • Textual Prompts: Detailed prompts associated with each music clip, describing the desired musical characteristics such as mood, genre, instrumentation, and tempo.

Description of Audio Files

The audio files in the MirexGen2024 dataset are selected to represent a broad spectrum of musical genres and styles. Each clip is provided in a high-quality format, ensuring that the nuances of musical elements are preserved. The dataset includes clips of varying lengths, with a focus on short to medium-length excerpts (10 to 30 seconds).

Description of Text

The textual prompts provided in the dataset are carefully crafted to guide the generation process. These prompts include specific instructions regarding the desired genre, mood, instrumentation, and other musical characteristics. They are designed to challenge the generative models to produce music that is not only coherent but also closely aligned with the given descriptions.

Description of Split

The MirexGen2024 dataset is used only for testing. For training, participants may use any open-source data that does not belong to a test split.

Baseline

MusicGen

MusicGen, developed by Meta, is a single-stage, transformer-based language model (LM) for conditional music generation. It operates over multiple streams of compressed discrete music tokens, which removes the need to cascade several models as hierarchical or upsampling approaches do. MusicGen generates high-quality mono and stereo music samples conditioned on text descriptions or melodic features, giving finer control over the output. Its authors report that, in both automatic and human evaluations, MusicGen outperforms the baseline models they compared against on text-to-music generation benchmarks, and their ablation studies show the contribution of each key component.

MusicGen-large [1] and MusicGen-medium [2] will be used as baselines.
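For orientation, the following is a minimal sketch of how one of these baselines might be run on a list of text prompts using Meta's audiocraft package. It is not the organizers' evaluation pipeline. The output naming scheme (`output_stem`), the 10-second duration, and the `generated` directory are assumptions made for illustration; the task page does not specify them.

```python
from pathlib import Path

# Baseline checkpoints as published on the Hugging Face Hub.
BASELINES = {
    "medium": "facebook/musicgen-medium",
    "large": "facebook/musicgen-large",
}


def output_stem(out_dir, index):
    """Hypothetical naming scheme: one zero-padded file stem per prompt."""
    return str(Path(out_dir) / f"prompt_{index:04d}")


def generate_clips(prompts, size="medium", duration=10.0, out_dir="generated"):
    """Generate one clip per text prompt with a MusicGen baseline."""
    # Imported lazily so the helpers above work without audiocraft installed.
    from audiocraft.models import MusicGen
    from audiocraft.data.audio import audio_write

    model = MusicGen.get_pretrained(BASELINES[size])
    model.set_generation_params(duration=duration)  # clip length in seconds (assumed)
    wavs = model.generate(prompts)  # tensor of shape [batch, channels, samples]

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for i, wav in enumerate(wavs):
        # audio_write appends the ".wav" extension to the stem itself.
        audio_write(output_stem(out_dir, i), wav.cpu(), model.sample_rate,
                    strategy="loudness")
```

A call such as `generate_clips(["calm solo piano, slow tempo"], size="medium")` would download the checkpoint on first use and write `generated/prompt_0000.wav`.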

Metrics

The evaluation of the generated music will be based on a combination of objective and subjective metrics:

  • Inception Score (IS): An objective metric that evaluates the quality and diversity of the generated music from the class predictions of a pre-trained music classification model. Higher is better.
  • Fréchet Audio Distance (FAD): Measures the distance between the distribution of generated music and that of real music, capturing both quality and diversity. Lower is better.
  • CLAP-Score: Assesses how well the generated music aligns with the provided textual prompt. Higher is better.
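The three metrics above can be sketched with their commonly used definitions. The task page does not state which classifier or embedding models the organizers use, nor their exact formulas, so this sketch assumes the class probabilities and the audio and text embeddings have already been computed by some model; all function names are illustrative.

```python
import numpy as np


def inception_score(class_probs, eps=1e-12):
    """exp of the mean KL divergence between per-clip and marginal class distributions."""
    p = np.asarray(class_probs, dtype=np.float64)  # shape (n_clips, n_classes)
    marginal = p.mean(axis=0, keepdims=True)
    kl = np.sum(p * (np.log(p + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))


def frechet_audio_distance(real_emb, gen_emb):
    """Frechet distance between Gaussians fitted to two sets of audio embeddings."""
    real = np.asarray(real_emb, dtype=np.float64)  # shape (n_real, dim), dim >= 2
    gen = np.asarray(gen_emb, dtype=np.float64)    # shape (n_gen, dim)
    mu_r, mu_g = real.mean(axis=0), gen.mean(axis=0)
    cov_r = np.cov(real, rowvar=False)
    cov_g = np.cov(gen, rowvar=False)
    # tr(sqrt(cov_r @ cov_g)) is the sum of the square roots of the product's eigenvalues.
    eigvals = np.linalg.eigvals(cov_r @ cov_g).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


def clap_score(audio_emb, text_emb):
    """Mean cosine similarity between paired audio and text embeddings."""
    a = np.asarray(audio_emb, dtype=np.float64)
    t = np.asarray(text_emb, dtype=np.float64)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * t, axis=1)))
```

For the numbers to be comparable across systems, the real and generated embeddings passed to `frechet_audio_distance` must come from the same embedding model, and the audio and text embeddings passed to `clap_score` must share one joint embedding space.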

Each metric will contribute to the final ranking.

Download

The dataset is not currently available for download.

Rules

Participants may use external datasets and pre-trained models to develop their systems. However, participants must not use test-split data from any open-source dataset for training or validation.

Submission

Submissions will be assessed automatically on CodaBench.

Participants are required to submit the following:

  • Audio Files: A set of generated music clips corresponding to the prompts in the evaluation dataset.
  • PDF File: A detailed report describing the system architecture, training process, and any external data or models used.
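Before uploading, it may help to confirm that every prompt has a corresponding audio file. The sketch below assumes a hypothetical layout in which each clip is named after its prompt ID with a `.wav` extension; the task page does not define the required file naming or format, so adjust both to whatever the organizers specify.

```python
from pathlib import Path


def missing_clips(prompt_ids, audio_dir, ext=".wav"):
    """Return the prompt IDs that have no matching audio file in audio_dir."""
    # Assumed convention: <prompt_id><ext>, e.g. "0001.wav" for prompt "0001".
    present = {path.stem for path in Path(audio_dir).glob(f"*{ext}")}
    return [pid for pid in prompt_ids if pid not in present]
```

An empty list means the set of clips covers every prompt under the assumed naming convention.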

Each participant or team may submit up to three versions of their system. The final ranking will be based on the metrics outlined above.