Difference between revisions of "2024:Music Description & Captioning"
Revision as of 05:38, 10 September 2024
Task Description
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.
Dataset
Description
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.
Description of Audio Files
- The audio clips in SDD are selected from the MTG-Jamendo dataset and cover a range of musical genres and styles. Each clip is up to 2 minutes long (95% are 2 minutes long).
- Audio files are provided as MP3 encoded at 320 kbps with a 44.1 kHz sample rate.
Description of Text
- Each clip in SDD is accompanied by one to five free-text captions written by volunteers. The captions describe musical elements such as genre, instrumentation, mood, and other relevant characteristics.
- Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.
Description of Split
- SDD defines no recommended split for training and evaluation; the dataset is intended solely for evaluation. Participants should not use any part of SDD for training or validation.
- A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.
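As a rough illustration of how the validated subset might be separated from the full caption set, here is a minimal sketch. It assumes the captions ship as a CSV file and that the columns are named `track_id`, `caption`, and `is_valid_subset`; these names, the inline sample rows, and the helper functions are assumptions made for the example and should be checked against the downloaded files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the column names are assumptions, not taken from this page.
SAMPLE = """track_id,caption,is_valid_subset
101,A calm solo piano piece.,True
101,Slow piano with a sad mood.,True
202,nice,False
"""


def load_references(csv_file, validated_only=True):
    """Group reference captions by track, optionally keeping only validated rows."""
    refs = defaultdict(list)
    for row in csv.DictReader(csv_file):
        is_valid = row["is_valid_subset"].strip().lower() == "true"
        if validated_only and not is_valid:
            continue
        refs[row["track_id"]].append(row["caption"].strip())
    return dict(refs)


def mean_caption_length(refs):
    """Average caption length in whitespace-separated words."""
    lengths = [len(c.split()) for caps in refs.values() for c in caps]
    return sum(lengths) / len(lengths) if lengths else 0.0


if __name__ == "__main__":
    validated = load_references(io.StringIO(SAMPLE))
    print(len(validated), mean_caption_length(validated))
```

Grouping by track keeps all references for a clip together, which is convenient because each clip can have between one and five captions.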
Baseline
SD-MusicCaps: Model Architecture
[Keep the existing description of the model architecture]
Metrics
- The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:
- ROUGE-L: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.
- BLEU (B1~B4): Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.
- METEOR: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.
- BERT-Score: Computes token similarity using contextual embeddings from BERT.
- While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.
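To make the primary metric concrete, the following is a minimal sketch of a sentence-level ROUGE-L F-score built on the longest common subsequence (LCS). The lowercased whitespace tokenization, the `beta` weighting parameter, and taking the best score over multiple references are assumptions of this example; the official scorer used for the task may tokenize and aggregate differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score between one candidate and one reference caption."""
    # Assumption: simple lowercased whitespace tokenization.
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


def best_rouge_l(candidate, references, beta=1.0):
    """Score against several references and keep the best match (an assumed aggregation)."""
    return max((rouge_l(candidate, r, beta) for r in references), default=0.0)


if __name__ == "__main__":
    print(rouge_l("a calm piano piece", "a calm solo piano piece"))
```

With `beta=1.0` the score is the harmonic mean of LCS precision and recall; larger values of `beta` weight recall more heavily.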
Download
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.
Rules
- Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.
- Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.
- All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.
Submission
- Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.
- Participants are required to submit the following:
- JSON file: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.
- PDF file: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation.
- Each participant or team may submit up to four versions of their system. The final ranking will be based on the metrics outlined above.
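This page does not spell out the JSON schema, so the sketch below only illustrates one plausible way to serialize generated captions. The `track_id` and `caption` keys, the list-of-objects layout, and the `build_submission` helper are hypothetical; the actual field names should follow the structure of the Song Describer dataset and any instructions given on CodaBench.

```python
import json


def build_submission(captions):
    """Serialize a {track_id: caption} mapping as a JSON list (hypothetical layout)."""
    entries = []
    for track_id, caption in sorted(captions.items()):
        # Collapse stray whitespace and newlines so each caption is a single line.
        text = " ".join(caption.split())
        if not text:
            raise ValueError(f"empty caption for track {track_id}")
        entries.append({"track_id": track_id, "caption": text})
    return json.dumps(entries, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    generated = {
        "202": "An upbeat rock song with\n distorted guitars.",
        "101": "A calm solo piano piece.",
    }
    print(build_submission(generated))
```

Rejecting empty captions before writing the file catches a common failure mode early, since a missing caption would otherwise only surface when the submission is scored.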

