2024:Music Audio Generation
Task Description
The MIREX 2024 Music Audio Generation Task challenges participants to develop models capable of generating high-quality, original music audio clips. This task aims to advance the state-of-the-art in music generation by encouraging the creation of systems that can produce coherent, aesthetically pleasing, and musically diverse outputs across various genres and styles.
Participants will be required to generate music clips based on textual prompts or other conditioning information provided in the dataset. The generated audio will be evaluated based on its musical quality, creativity, adherence to the provided prompt, and overall listenability.
Dataset
Description
For training, any openly available data outside the test set may be used.
An in-house music generation dataset, MirexGen2024, will serve as this task's evaluation benchmark. This dataset is specially curated to facilitate the generation of music in response to specific prompts. It includes:
- Audio Clips: A collection of diverse music clips across various genres, ranging from classical to electronic music, used for evaluation.
- Textual Prompts: Detailed prompts associated with each music clip, describing the desired musical characteristics such as mood, genre, instrumentation, and tempo.
Description of Audio Files
The audio files in the MirexGen2024 dataset are selected to represent a broad spectrum of musical genres and styles. Each clip is provided in a high-quality format, ensuring that the nuances of musical elements are preserved. The dataset includes clips of varying lengths, with a focus on short to medium-length excerpts (10 to 30 seconds).
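Participants who want to match these specifications can sanity-check their own generated files. A minimal sketch using the soundfile library; the filename is a placeholder, and the 10 to 30 second target range is assumed to apply to submitted clips as well:

```python
# Sanity-check a generated clip against the 10-30 s excerpt range
# described above; "clip.wav" is a placeholder filename.
import soundfile as sf

info = sf.info("clip.wav")
print(f"{info.samplerate} Hz, {info.channels} channel(s), {info.duration:.1f} s")
assert 10.0 <= info.duration <= 30.0, "duration outside the 10-30 s range"
```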
Description of Text
The textual prompts provided in the dataset are carefully crafted to guide the generation process. These prompts include specific instructions regarding the desired genre, mood, instrumentation, and other musical characteristics. They are designed to challenge the generative models to produce music that is not only coherent but also closely aligned with the given descriptions.
Description of Split
The MirexGen2024 dataset is used exclusively for testing. For training, any openly available data outside the test set may be used.
Baseline
MusicGen
MusicGen, developed by Meta, is a single-stage, transformer-based language model (LM) for conditional music generation. It operates over several parallel streams of compressed discrete music tokens, removing the need for multi-stage approaches such as hierarchical modeling or cascaded upsampling. MusicGen efficiently generates high-quality mono and stereo music samples conditioned on text descriptions or melodic features, providing enhanced control over the output. Extensive evaluations, including both automatic and human studies, show that MusicGen outperforms baseline models on text-to-music generation benchmarks, and ablation studies highlight the importance of its key components.
MusicGen-large (https://huggingface.co/facebook/musicgen-large) and MusicGen-medium (https://huggingface.co/facebook/musicgen-medium) will be used as baselines.
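For orientation, the sketch below shows one way to run the MusicGen-medium baseline through its Hugging Face transformers port; the prompt text and token budget are illustrative assumptions, not task requirements.

```python
# Minimal text-to-music sketch with the MusicGen-medium baseline.
# Assumes transformers, torch, and scipy are installed; the prompt
# and max_new_tokens value are illustrative, not task requirements.
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

inputs = processor(
    text=["an upbeat electronic track with a driving synth bassline"],
    padding=True,
    return_tensors="pt",
)

# MusicGen emits roughly 50 audio tokens per second, so ~1500 tokens
# yields a clip near the 30-second upper bound of the evaluation excerpts.
audio = model.generate(**inputs, do_sample=True, max_new_tokens=1500)

rate = model.config.audio_encoder.sampling_rate  # 32 kHz for MusicGen
scipy.io.wavfile.write("sample.wav", rate=rate, data=audio[0, 0].numpy())
```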
Metrics
The evaluation of the generated music will be based on a combination of objective and subjective metrics:
- Inception Score (IS): An objective metric that evaluates the diversity and quality of the generated music, based on a pre-trained music classification model.
- FAD (Fréchet Audio Distance): Measures the similarity between the distribution of generated music and real music, capturing both quality and diversity.
- CLAP-Score: A metric designed to assess how well the generated music aligns with the provided textual prompts.
Each metric will contribute to the final ranking; a sketch of the FAD and CLAP-score computations is given below.
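The following is a hedged sketch of two of these computations: the Fréchet distance between Gaussian fits of real and generated embedding sets (classic FAD derives the embeddings from a pre-trained VGGish encoder), and a CLAP-style text-audio cosine similarity using the laion/clap-htsat-unfused checkpoint. The exact encoders and weighting used for official scoring are not specified here.

```python
# Sketch of FAD's core computation and a CLAP-style alignment score.
# Assumptions: audio embeddings come from some pre-trained encoder
# (classic FAD uses VGGish); the CLAP checkpoint named below is one
# common choice, not necessarily the official evaluation model.
import numpy as np
import torch
from scipy import linalg
from transformers import ClapModel, ClapProcessor

def frechet_audio_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """FAD = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    s_r = np.cov(real_emb, rowvar=False)
    s_g = np.cov(gen_emb, rowvar=False)
    covmean = linalg.sqrtm(s_r @ s_g)
    if np.iscomplexobj(covmean):  # discard tiny imaginary numerical noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(s_r + s_g - 2.0 * covmean))

def clap_score(audio: np.ndarray, prompt: str) -> float:
    """Cosine similarity between CLAP text and audio embeddings.
    `audio` must be mono and resampled to 48 kHz for this checkpoint."""
    name = "laion/clap-htsat-unfused"
    model = ClapModel.from_pretrained(name)
    proc = ClapProcessor.from_pretrained(name)
    inputs = proc(text=[prompt], audios=[audio],
                  sampling_rate=48000, return_tensors="pt")
    with torch.no_grad():
        t = model.get_text_features(input_ids=inputs["input_ids"],
                                    attention_mask=inputs["attention_mask"])
        a = model.get_audio_features(input_features=inputs["input_features"])
    return torch.nn.functional.cosine_similarity(t, a).item()
```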
Download
The dataset is not available for download at this time.
Rules
Participants may use external datasets and pre-trained models to develop their systems. However, they must not use the test split of any open-source dataset for training or validation.
Submission
Submissions will be automatically evaluated on CodaBench.
Participants are required to submit the following:
- Audio Files: A set of generated music clips corresponding to the prompts in the evaluation dataset.
- PDF File: A detailed report describing the system architecture, training process, and any external data or models used.
Each participant or team may submit up to three versions of their system. The final ranking will be based on the metrics outlined above.