Difference between revisions of "2024:Music Audio Generation"

From MIREX Wiki
(Description)
(Rules)
 
(11 intermediate revisions by the same user not shown)
Line 9: Line 9:
  === Description ===
- An in-house music generation dataset, MirexMusicGen2024 dataset will serve as the evaluation benchmark for this task. This dataset is specially curated to facilitate the generation of music in response to specific prompts. It includes:
+ For training, any non-test-set data from the open-source world can be used.
+
+ An in-house music generation dataset, MirexGen2024, will serve as this task's evaluation benchmark. This dataset is specially curated to facilitate the generation of music in response to specific prompts. It includes:

  * '''Audio Clips''': A collection of diverse music clips across various genres, ranging from classical to electronic music, to help in training and evaluation.
  * '''Textual Prompts''': Detailed prompts associated with each music clip, describing the desired musical characteristics such as mood, genre, instrumentation, and tempo.
- For training, any data can be used.

  === Description of Audio Files ===
- The audio files in the MusicGen2024 dataset are selected to represent a broad spectrum of musical genres and styles. Each clip is provided in a high-quality format, ensuring that the nuances of musical elements are preserved. The dataset includes clips of varying lengths, with a focus on short to medium-length excerpts (10 to 30 seconds).
+ The audio files in the MirexGen2024 dataset are selected to represent a broad spectrum of musical genres and styles. Each clip is provided in a high-quality format, ensuring that the nuances of musical elements are preserved. The dataset includes clips of varying lengths, with a focus on short to medium-length excerpts (10 to 30 seconds).

  === Description of Text ===
Line 26: Line 26:
  === Description of Split ===
- The MusicGen2024 dataset is only used for testing.
+ The MirexGen2024 dataset is only used for testing.
+ For training, any non-test-set data from the open-source world can be used.

  == Baseline ==
- '''Gen-MusicTransformer: Model Architecture'''
+ '''MusicGen'''

- Gen-MusicTransformer employs a transformer-based architecture tailored for music generation tasks. The model is designed to handle sequential data, making it well-suited for generating coherent and contextually rich music clips.
-
- * '''Encoder''': The encoder processes the input textual prompt, transforming it into a series of embeddings that capture the key aspects of the prompt, such as mood, genre, and instrumentation.
+ MusicGen, developed by Meta, is a single-stage transformer-based Language Model (LM) designed for conditional music generation. It operates over multiple streams of compressed discrete music tokens, eliminating the need for multi-stage models like hierarchical or upsampling methods. MusicGen efficiently generates high-quality mono and stereo music samples conditioned on text descriptions or melodic features, providing enhanced control over the output. Extensive evaluations, including both automatic and human studies, demonstrate that MusicGen outperforms baseline models in text-to-music generation benchmarks. Ablation studies further highlight the significance of its key components.
- * '''Decoder''': The decoder is responsible for generating the music audio. It utilizes a series of transformer blocks to predict the next audio feature based on the previous context, producing a continuous stream of audio data. The model generates log-mel spectrograms, which are subsequently converted into audio waveforms using a vocoder.
- * '''Conditioning''': The model can be conditioned on additional inputs, such as specific musical motifs or rhythms, allowing for more controlled generation outputs.

- Gen-MusicTransformer is pre-trained on a large corpus of music data and fine-tuned on the MusicGen2024 dataset to optimize its performance on the specific task of prompt-based music generation.
+ (MusicGen-large)[https://huggingface.co/facebook/musicgen-large] and (MusicGen-medium)[https://huggingface.co/facebook/musicgen-medium] will be used as baselines.

  == Metrics ==
Line 44: Line 41:
  The evaluation of the generated music will be based on a combination of objective and subjective metrics:

- * '''MOS (Mean Opinion Score)''': A subjective evaluation metric where human listeners rate the overall quality and aesthetic appeal of the generated music.
  * '''Inception Score (IS)''': An objective metric that evaluates the diversity and quality of the generated music, based on a pre-trained music classification model.
  * '''FAD (Fréchet Audio Distance)''': Measures the similarity between the distribution of generated music and real music, capturing both quality and diversity.
- * '''Prompt Adherence Score''': A metric designed to assess how well the generated music aligns with the provided textual prompts.
+ * '''CLAP-Score''': A metric designed to assess how well the generated music aligns with the provided textual prompts.

- Each metric will contribute to the final ranking, with MOS and Prompt Adherence Score being given the highest weight.
+ Each metric will contribute to the final ranking.

  == Download ==
- The MusicGen2024 dataset, including both the audio clips and corresponding textual prompts, will be made available for download. Participants can access the dataset via a link that will be posted here.
+ We do not provide the download of the dataset for now.

  == Rules ==
- Participants are allowed to utilize external datasets and pre-trained models to develop their systems. However, the use of the MusicGen2024 evaluation split for training or validation is strictly prohibited. Participants must ensure that their submissions are original and do not overlap with the evaluation data.
+ Participants are allowed to utilize external datasets and pre-trained models to develop their systems. However, Participants should not use any test-split data from any open-source dataset for training or validation.

  == Submission ==

Latest revision as of 21:31, 11 October 2024

Task Description

The MIREX 2024 Music Audio Generation Task challenges participants to develop models capable of generating high-quality, original music audio clips. This task aims to advance the state-of-the-art in music generation by encouraging the creation of systems that can produce coherent, aesthetically pleasing, and musically diverse outputs across various genres and styles.

Participants will be required to generate music clips based on textual prompts or other conditioning information provided in the dataset. The generated audio will be evaluated based on its musical quality, creativity, adherence to the provided prompt, and overall listenability.

Dataset

Description

For training, any openly available data that does not belong to a test set may be used.

An in-house music generation dataset, MirexGen2024, will serve as this task's evaluation benchmark. This dataset is specially curated to facilitate the generation of music in response to specific prompts. It includes:

  • Audio Clips: A collection of diverse music clips across various genres, ranging from classical to electronic music, to help in training and evaluation.
  • Textual Prompts: Detailed prompts associated with each music clip, describing the desired musical characteristics such as mood, genre, instrumentation, and tempo.

Description of Audio Files

The audio files in the MirexGen2024 dataset are selected to represent a broad spectrum of musical genres and styles. Each clip is provided in a high-quality format, ensuring that the nuances of musical elements are preserved. The dataset includes clips of varying lengths, with a focus on short to medium-length excerpts (10 to 30 seconds).

Description of Text

The textual prompts provided in the dataset are carefully crafted to guide the generation process. These prompts include specific instructions regarding the desired genre, mood, instrumentation, and other musical characteristics. They are designed to challenge the generative models to produce music that is not only coherent but also closely aligned with the given descriptions.

Description of Split

The MirexGen2024 dataset is used only for testing. For training, any openly available data that does not belong to a test set may be used.

Baseline

MusicGen

MusicGen, developed by Meta, is a single-stage transformer-based language model (LM) for conditional music generation. It operates over multiple streams of compressed discrete music tokens, which removes the need to cascade several models, as hierarchical or upsampling approaches do. MusicGen generates high-quality mono and stereo music samples conditioned on text descriptions or melodic features, giving finer control over the output. According to its authors, automatic and human evaluations show that MusicGen outperforms the baseline models it was compared against on text-to-music generation benchmarks, and ablation studies indicate the importance of its key components.

MusicGen-large (https://huggingface.co/facebook/musicgen-large) and MusicGen-medium (https://huggingface.co/facebook/musicgen-medium) will be used as baselines.
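As a rough illustration of how one of these baseline checkpoints could be run on a text prompt, the sketch below uses the MusicGen classes from the Hugging Face transformers library. It is not the organizers' evaluation script. The function names, the 10-second default, and the sampling settings (do_sample, guidance_scale=3.0) are assumptions chosen for illustration, and the 50 frames-per-second token rate is taken from the public model documentation rather than from this page.

```python
import math

# Assumption: MusicGen's audio tokenizer produces about 50 token frames per
# second of audio, as described in the public model documentation.
MUSICGEN_FRAME_RATE_HZ = 50


def tokens_for_duration(seconds, frame_rate_hz=MUSICGEN_FRAME_RATE_HZ):
    """Return the number of decoding steps needed to cover `seconds` of audio."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return math.ceil(seconds * frame_rate_hz)


def generate_clips(prompts, seconds=10.0, checkpoint="facebook/musicgen-medium"):
    """Generate one clip per prompt (hypothetical helper, not an official script).

    Returns a NumPy array of shape (batch, channels, samples) and the sample rate.
    """
    # Imported lazily so tokens_for_duration works without the heavy dependencies.
    from transformers import AutoProcessor, MusicgenForConditionalGeneration

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = MusicgenForConditionalGeneration.from_pretrained(checkpoint)

    # Tokenize the text prompts; padding lets prompts of different lengths share a batch.
    inputs = processor(text=list(prompts), padding=True, return_tensors="pt")

    audio = model.generate(
        **inputs,
        do_sample=True,          # sampling settings are illustrative assumptions
        guidance_scale=3.0,
        max_new_tokens=tokens_for_duration(seconds),
    )
    sample_rate = model.config.audio_encoder.sampling_rate
    return audio.cpu().numpy(), sample_rate
```

Calling generate_clips(["calm solo piano, slow tempo"], seconds=10.0) would download the checkpoint on first use and return the waveform together with its sample rate, which could then be written to disk with any WAV writer. The dataset's 10 to 30 second clips correspond to roughly 500 to 1500 decoding steps under the assumed frame rate.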

Metrics

The evaluation of the generated music will be based on a combination of objective and subjective metrics:

  • Inception Score (IS): An objective metric that evaluates the diversity and quality of the generated music, based on a pre-trained music classification model.
  • FAD (Fréchet Audio Distance): Measures the similarity between the distribution of generated music and real music, capturing both quality and diversity.
  • CLAP-Score: A metric designed to assess how well the generated music aligns with the provided textual prompts.

Each metric will contribute to the final ranking.
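To make the three metrics concrete, here is a minimal NumPy sketch of their usual textbook definitions. It assumes that embeddings and class probabilities have already been extracted by some pre-trained models. The page does not say which embedding or classification models the organizers use, nor whether their scores are rescaled, so the function names and input conventions below are illustrative assumptions.

```python
import numpy as np


def frechet_distance(real_emb, gen_emb):
    """Frechet distance between Gaussians fitted to two (N, D) embedding sets, D >= 2."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(real_emb, rowvar=False))
    cov_g = np.atleast_2d(np.cov(gen_emb, rowvar=False))
    # Tr(sqrt(cov_r @ cov_g)) equals the sum of the square roots of the
    # eigenvalues of the product; tiny negative values are numerical noise.
    eigvals = np.linalg.eigvals(cov_r @ cov_g).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


def inception_score(probs, eps=1e-12):
    """exp of the mean KL divergence between per-clip and marginal class distributions.

    `probs` is an (N, C) array of classifier probabilities, one row per clip.
    """
    marginal = probs.mean(axis=0, keepdims=True)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))


def clap_score(text_emb, audio_emb):
    """Mean cosine similarity between paired (N, D) text and audio embeddings."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    return float((t * a).sum(axis=1).mean())
```

Lower Fréchet distance means the generated distribution is closer to the reference one, while higher Inception Score and higher text-audio similarity are better. Published implementations differ in details such as the embedding model and score scaling, so values from this sketch are not directly comparable to leaderboard numbers.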

Download

The dataset is not currently available for download.

Rules

Participants may use external datasets and pre-trained models to develop their systems. However, participants must not use test-split data from any open-source dataset for training or validation.

Submission

Submissions will be evaluated using CodaBench for automated assessment.

Participants are required to submit the following:

  • Audio Files: A set of generated music clips corresponding to the prompts in the evaluation dataset.
  • PDF File: A detailed report describing the system architecture, training process, and any external data or models used.

Each participant or team may submit up to three versions of their system. The final ranking will be based on the metrics outlined above.
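Because the audio files must correspond to the evaluation prompts, a quick completeness check before uploading can catch missing clips. The sketch below assumes a hypothetical convention in which each clip is named after its prompt identifier with a .wav extension. The page does not specify a file-naming scheme or audio format, so both are placeholders.

```python
from pathlib import PurePath


def missing_clips(prompt_ids, filenames, extension=".wav"):
    """Return the prompt IDs that have no matching audio file, sorted.

    Assumes the hypothetical naming scheme "<prompt_id><extension>"; other
    files, such as the PDF report, are ignored.
    """
    stems = {
        PurePath(name).stem
        for name in filenames
        if PurePath(name).suffix.lower() == extension
    }
    return sorted(pid for pid in set(prompt_ids) if pid not in stems)
```

For example, passing the prompt IDs together with os.listdir() of the submission folder returns an empty list when every prompt has a clip.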