2024:Music Description & Captioning
Task Description
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.
Participants are tasked with creating systems that generate captions for a collection of music clips from the MusicCaps dataset. The generated captions will be compared against the reference captions using several automatic evaluation metrics.
Dataset
Description
The MusicCaps dataset serves as the benchmark for this task. MusicCaps is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by musicians. This dataset provides a robust foundation for training and evaluating music captioning models.
MusicCaps comprises 5,521 music examples, each a 10-second clip extracted from the larger AudioSet dataset. The clips cover a wide range of genres and styles. The annotations emphasize capturing the details of the audio through precise textual descriptions.
Description of Audio Files
- The audio clips in MusicCaps are carefully selected and processed to reflect a broad spectrum of musical experiences. Each clip is 10 seconds long and is sourced from AudioSet, providing a rich diversity of musical genres and styles.
Description of Text
- Each clip in the MusicCaps dataset is accompanied by a free-text caption written by professional musicians. These captions focus on describing the musical elements such as genre, instrumentation, mood, and other relevant characteristics. Importantly, the captions exclude metadata like artist names and concentrate solely on the auditory content.
Description of Split
- The MusicCaps dataset is divided into training and evaluation subsets. Participants must not use the evaluation split for training or validation purposes. This split is designed to ensure a fair and balanced assessment of the model's performance, with a diverse set of music clips and captions.
Baseline
LP-MusicCaps: Model Architecture
- LP-MusicCaps uses a cross-modal encoder-decoder transformer architecture to generate captions for music clips. The encoder converts each 10-second audio signal into a log-mel spectrogram, passes it through convolutional layers with GELU activation to extract audio features, adds positional encoding, and then applies transformer blocks that model the sequence and context of the audio.
- The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.
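- A key feature of LP-MusicCaps is its use of a large language model (LLM) to augment the training data: the LLM generates pseudo-captions from existing tag annotations, and the captioning model is trained on this larger caption corpus, which helps it produce richer and more contextually appropriate captions.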
Metrics
- The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:
- ROUGE-L
- Measures the longest common subsequence (LCS) of words shared by the generated and reference captions, rewarding words that appear in the same order even when they are not contiguous.
- BLEU (B1 to B4)
- Evaluates the n-gram precision of the generated caption against the reference, with B1 to B4 covering n-gram orders from unigrams to 4-grams.
- METEOR
- Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.
- While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.
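As a rough illustration of the quantities behind these metrics, the sketch below computes a ROUGE-L F-score from the longest common subsequence and the clipped n-gram precision that BLEU builds on. It is not the official scoring code: the whitespace tokenizer, the `beta=1.0` weighting, the single-reference setup, and all function names are assumptions made for this example, and the evaluation on CodaBench may tokenize and weight differently.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; official scorers may differ.
    return text.lower().split()


def lcs_length(a, b):
    # Dynamic programming over two token lists, keeping only one previous row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    # F-score of LCS-based precision and recall (beta=1.0 is an assumption).
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


def clipped_ngram_precision(candidate, reference, n):
    # Fraction of candidate n-grams found in the reference, with each n-gram
    # counted at most as often as it occurs in the reference.
    cand, ref = tokenize(candidate), tokenize(reference)
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total
```

For the candidate "a slow piano melody" and the reference "a slow and sad piano melody", the LCS has four words, so precision is 4/4, recall is 4/6, and the ROUGE-L F-score is 0.8. Full BLEU additionally combines the precisions for several n-gram orders and applies a brevity penalty, which this sketch omits.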
Download
The dataset, including both the audio clips and their corresponding captions, will be made available for download. The download link will be posted here.
Rules
- Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the MusicCaps evaluation split for training or validation is strictly prohibited, and any overlap with the evaluation data must be avoided.
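One way to guard against accidental overlap when mixing in external data is to filter training examples by clip identifier before training. The sketch below assumes each example is a dictionary with a `clip_id` key and that the evaluation clip identifiers are available as a list; both the key name and the function name are hypothetical, chosen only for illustration.

```python
def drop_evaluation_clips(training_examples, evaluation_ids):
    # Assumption: every example carries a "clip_id" key (hypothetical name)
    # that is directly comparable with the evaluation identifiers.
    eval_set = set(evaluation_ids)
    return [ex for ex in training_examples if ex["clip_id"] not in eval_set]
```

Matching on identifiers only catches exact duplicates; audio that reappears under a different identifier in an external dataset would need a separate check.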
Submission
- Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.
- Participants are required to submit the following:
- JSON file
- A JSON file containing the generated captions for the evaluation dataset.
- PDF file
- A PDF file detailing the system architecture, training process, and any external data or models used.
- Each participant or team may submit up to four versions of their system. The final ranking will be based on the metrics outlined above.
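The exact layout of the JSON file is not specified here, so the following is only a sketch under an assumed schema: a single JSON object mapping each clip identifier to its generated caption. The schema, the identifiers, and the `build_submission` helper are all hypothetical; the format required by the CodaBench competition page takes precedence.

```python
import json


def build_submission(captions_by_clip):
    # Assumed schema (hypothetical): {"<clip id>": "<generated caption>", ...}
    for clip_id, caption in captions_by_clip.items():
        if not isinstance(caption, str) or not caption.strip():
            raise ValueError(f"Missing or empty caption for clip {clip_id!r}")
    # sort_keys keeps the file stable across runs, which makes diffs easier.
    return json.dumps(captions_by_clip, ensure_ascii=False, indent=2, sort_keys=True)
```

Rejecting empty captions before serializing helps surface a clip the model silently skipped, rather than discovering it only after the automated scoring run.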

