<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://music-ir.org/mirex/w/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Ldzhangyx</id>
	<title>MIREX Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://music-ir.org/mirex/w/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Ldzhangyx"/>
	<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/wiki/Special:Contributions/Ldzhangyx"/>
	<updated>2026-05-01T01:01:24Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.31.1</generator>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis_Results&amp;diff=14821</id>
		<title>2025:Music Structure Analysis Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis_Results&amp;diff=14821"/>
		<updated>2025-09-16T03:29:13Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Results */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Results=&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;vertical-align:bottom;&amp;quot;&lt;br /&gt;
|- style=&amp;quot;font-weight:bold;&amp;quot;&lt;br /&gt;
! System&lt;br /&gt;
! Methods Used&lt;br /&gt;
! Training data&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | ACC&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | HR.5&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | HR3&lt;br /&gt;
|-&lt;br /&gt;
| Baseline 1&lt;br /&gt;
| MusicFM&lt;br /&gt;
| Harmonix Set&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.705&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | '''0.644'''&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.710&lt;br /&gt;
|-&lt;br /&gt;
| kgstruct&lt;br /&gt;
| All-in-One variant&lt;br /&gt;
| External data (6k songs)&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | '''0.720'''&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.590&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | '''0.762'''&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
* Results as reported in the original papers.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis_Results&amp;diff=14820</id>
		<title>2025:Music Structure Analysis Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis_Results&amp;diff=14820"/>
		<updated>2025-09-16T03:28:52Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Results=&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;vertical-align:bottom;&amp;quot;&lt;br /&gt;
|- style=&amp;quot;font-weight:bold;&amp;quot;&lt;br /&gt;
! System&lt;br /&gt;
! Methods Used&lt;br /&gt;
! Training data&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | ACC&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | HR.5&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | HR3&lt;br /&gt;
|-&lt;br /&gt;
| Baseline 1&lt;br /&gt;
| MusicFM&lt;br /&gt;
| Harmonix Set&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.705&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.644&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.710&lt;br /&gt;
|-&lt;br /&gt;
| kgstruct&lt;br /&gt;
| All-in-One variant&lt;br /&gt;
| External data (6k songs)&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.720&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.590&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.762&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
* Results as reported in the original papers.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis_Results&amp;diff=14793</id>
		<title>2025:Music Structure Analysis Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis_Results&amp;diff=14793"/>
		<updated>2025-09-12T06:54:16Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: Created page with &amp;quot;=Results=  {| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;vertical-align:bottom;&amp;quot; |- style=&amp;quot;font-weight:bold;&amp;quot; ! System ! Methods Used ! Trained on the training set of ! style=&amp;quot;text-align:right;...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Results=&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;vertical-align:bottom;&amp;quot;&lt;br /&gt;
|- style=&amp;quot;font-weight:bold;&amp;quot;&lt;br /&gt;
! System&lt;br /&gt;
! Methods Used&lt;br /&gt;
! Training data&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | ACC&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | HR.5&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | HR3&lt;br /&gt;
|-&lt;br /&gt;
| Baseline 1&lt;br /&gt;
| MusicFM&lt;br /&gt;
| Harmonix Set&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.705&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.644&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.710&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
* Results as reported in the original papers.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2025:Music_Description_%26_Captioning_Results&amp;diff=14787</id>
		<title>2025:Music Description &amp; Captioning Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2025:Music_Description_%26_Captioning_Results&amp;diff=14787"/>
		<updated>2025-09-11T10:20:07Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Results */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Results=&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;vertical-align:bottom;&amp;quot;&lt;br /&gt;
|- style=&amp;quot;font-weight:bold;&amp;quot;&lt;br /&gt;
! System&lt;br /&gt;
! Methods Used&lt;br /&gt;
! Training data&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | ROUGE-L&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | BLEU (Avg. 1-4)&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | METEOR&lt;br /&gt;
|-&lt;br /&gt;
| Baseline 1&lt;br /&gt;
| LP-MusicCaps&lt;br /&gt;
| LP-MusicCaps Dataset&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.119&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.033&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.167&lt;br /&gt;
|- &lt;br /&gt;
| Baseline 2&lt;br /&gt;
| MusiLingo&lt;br /&gt;
| MusicInstruct Dataset&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | '''0.302'''*&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.081*&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.143*&lt;br /&gt;
|- &lt;br /&gt;
| Baseline 3&lt;br /&gt;
| Qwen2-Audio&lt;br /&gt;
| Multiple datasets&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.285*&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | '''0.234'''*&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | '''0.285'''*&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
* Values marked with an asterisk are reported in the original papers.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2025:Music_Description_%26_Captioning_Results&amp;diff=14786</id>
		<title>2025:Music Description &amp; Captioning Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2025:Music_Description_%26_Captioning_Results&amp;diff=14786"/>
		<updated>2025-09-11T10:19:22Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: Created page with &amp;quot;=Results=  {| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;vertical-align:bottom;&amp;quot; |- style=&amp;quot;font-weight:bold;&amp;quot; ! System ! Methods Used ! Trained on the training set of ! style=&amp;quot;text-align:right;...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Results=&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;vertical-align:bottom;&amp;quot;&lt;br /&gt;
|- style=&amp;quot;font-weight:bold;&amp;quot;&lt;br /&gt;
! System&lt;br /&gt;
! Methods Used&lt;br /&gt;
! Training data&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | ROUGE-L&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | BLEU (Avg. 1-4)&lt;br /&gt;
! style=&amp;quot;text-align:right;&amp;quot; | METEOR&lt;br /&gt;
|-&lt;br /&gt;
| Baseline 1&lt;br /&gt;
| LP-MusicCaps&lt;br /&gt;
| LP-MusicCaps Dataset&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.119&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.033&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.167&lt;br /&gt;
|- &lt;br /&gt;
| Baseline 2&lt;br /&gt;
| MusiLingo&lt;br /&gt;
| MusicInstruct Dataset&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.302*&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.081*&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.143*&lt;br /&gt;
|- style=&amp;quot;background-color:#FF0;&amp;quot;&lt;br /&gt;
| Baseline 3&lt;br /&gt;
| Qwen2-Audio&lt;br /&gt;
| Multiple datasets&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.285*&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.234*&lt;br /&gt;
| style=&amp;quot;text-align:right;&amp;quot; | 0.285*&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
* Values marked with an asterisk are reported in the original papers.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis&amp;diff=14749</id>
		<title>2025:Music Structure Analysis</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis&amp;diff=14749"/>
		<updated>2025-09-08T01:54:23Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Music Structure Analysis (MIREX 2025) ==&lt;br /&gt;
&lt;br /&gt;
'''Important Note: MIREX 2025 will be held as a workshop of ISMIR 2025. Papers accepted and presented at MIREX 2025 will have the opportunity to be showcased in the ISMIR 2025 Late Breaking Demo Track.'''&lt;br /&gt;
&lt;br /&gt;
=== Description ===&lt;br /&gt;
&lt;br /&gt;
The aim of the MIREX Music Structure Analysis task is to identify and label key structural sections in musical audio. Understanding the musical form (e.g., intro, verse, chorus) is fundamental to music understanding and a crucial component in many music information retrieval applications. While traditional approaches focused on segmenting music into internally consistent, but arbitrarily labeled, sections (e.g., A, B, C), this task has evolved.&lt;br /&gt;
&lt;br /&gt;
Since 2020, a new paradigm has emerged, focusing on '''functional structure analysis'''. The goal is to segment the audio and assign a specific functional label to each segment from a predefined set of common musical functions. This task challenges systems to perform both accurate boundary detection and correct functional classification.&lt;br /&gt;
&lt;br /&gt;
This task builds upon a history of structural segmentation evaluations, first run in MIREX 2009. Recent works driving this updated focus include:&lt;br /&gt;
* Wang, J. C., Hung, Y. N., &amp;amp; Smith, J. B. (2022, May). To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions. In ''ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)'' (pp. 416-420). IEEE.&lt;br /&gt;
* Kim, T., &amp;amp; Nam, J. (2023, October). All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio. In ''2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)'' (pp. 1-5). IEEE.&lt;br /&gt;
* Buisson, M., McFee, B., Essid, S., &amp;amp; Crayencour, H. C. (2024). Self-supervised learning of multi-level audio representations for music segmentation. ''IEEE/ACM Transactions on Audio, Speech, and Language Processing''.&lt;br /&gt;
&lt;br /&gt;
For MIREX 2025, participants are required to segment musical audio and classify each segment into one of seven functional categories: '''‘intro’, ‘verse’, ‘chorus’, ‘bridge’, ‘inst’ (instrumental), ‘outro’, or ‘other’'''. The 'other' category can be used for segments that do not fit into the primary six functional labels or for non-musical content if explicitly defined by the dataset annotations being mapped.&lt;br /&gt;
&lt;br /&gt;
=== Data ===&lt;br /&gt;
&lt;br /&gt;
We use the relabeled Harmonix Set for evaluation. The test split is reserved for evaluation; participants may use the train and validation splits to train their models.&lt;br /&gt;
&lt;br /&gt;
https://huggingface.co/datasets/m-a-p/harmonixset_bigvgan/tree/main&lt;br /&gt;
&lt;br /&gt;
==== Collections ====&lt;br /&gt;
The evaluation will utilize datasets previously established in MIREX. Annotations from these diverse collections will be mapped to the seven target functional labels for consistent evaluation.&lt;br /&gt;
* '''The MIREX 2009 Collection''': 297 pieces, largely derived from the work of the Beatles.&lt;br /&gt;
* '''MIREX 2010 RWC collection''': 100 pieces of popular music. This collection has two sets of ground truths. The first was originally included with the RWC dataset. The second set provides segment boundary annotations (see [http://hal.inria.fr/docs/00/47/34/79/PDF/PI-1948.pdf Bimbot et al., 2010] for details).&lt;br /&gt;
* '''MIREX 2012 dataset''': Over 1,000 annotated pieces covering a range of musical styles, with the majority annotated by two independent annotators.&lt;br /&gt;
&lt;br /&gt;
Participants should be aware that original labels in these datasets (e.g., 'verse1', 'solo', 'fade-out') will need to be mapped to the seven specified functional categories for evaluation. Guidelines for this mapping will be provided, or a standard mapping will be applied during evaluation.&lt;br /&gt;
&lt;br /&gt;
==== Audio Formats (Input to Algorithms) ====&lt;br /&gt;
Algorithms should be prepared to process audio with the following characteristics:&lt;br /&gt;
* Sample rate: 44.1 kHz&lt;br /&gt;
* Bit depth: 16 bit&lt;br /&gt;
* Number of channels: 1 (mono)&lt;br /&gt;
* Encoding: WAV&lt;br /&gt;
&lt;br /&gt;
=== Submission Format ===&lt;br /&gt;
&lt;br /&gt;
Submissions will be handled via '''Codabench'''. Participants are required to submit their results in a specific format, as detailed below. You will upload a single file containing the segmentation results for all test audio files.&lt;br /&gt;
&lt;br /&gt;
==== Output Data Format ====&lt;br /&gt;
The output must be a '''list of dictionaries''' in a text-based format (e.g., JSON parsable). Each dictionary in the list corresponds to one audio file and must contain two keys: &amp;lt;tt&amp;gt;'id'&amp;lt;/tt&amp;gt; (the identifier of the audio file, e.g., '1.wav') and &amp;lt;tt&amp;gt;'result'&amp;lt;/tt&amp;gt; (a list of segment predictions). Each segment prediction is a list containing two elements: a two-element list with the &amp;lt;tt&amp;gt;[start_time, end_time]&amp;lt;/tt&amp;gt; of the segment in seconds, and the &amp;lt;tt&amp;gt;label&amp;lt;/tt&amp;gt; string for that segment.&lt;br /&gt;
&lt;br /&gt;
The labels must be one of the seven target functional categories: &amp;lt;tt&amp;gt;'intro'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'verse'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'chorus'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'bridge'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'inst'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'outro'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'silence'&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Example of the content of the submitted file:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[&lt;br /&gt;
  {&lt;br /&gt;
    'id': 'track01.wav',&lt;br /&gt;
    'result': [&lt;br /&gt;
      [[0.000, 15.500], 'intro'],&lt;br /&gt;
      [[15.500, 45.230], 'verse'],&lt;br /&gt;
      [[45.230, 75.800], 'chorus'],&lt;br /&gt;
      [[75.800, 90.000], 'outro']&lt;br /&gt;
    ]&lt;br /&gt;
  },&lt;br /&gt;
  {&lt;br /&gt;
    'id': 'track02.wav',&lt;br /&gt;
    'result': [&lt;br /&gt;
      [[0.000, 20.100], 'verse'],&lt;br /&gt;
      [[20.100, 38.500], 'chorus'],&lt;br /&gt;
      [[38.500, 55.000], 'verse'],&lt;br /&gt;
      [[55.000, 72.600], 'chorus'],&lt;br /&gt;
      [[72.600, 89.000], 'bridge'],&lt;br /&gt;
      [[89.000, 105.000], 'chorus'],&lt;br /&gt;
      [[105.000, 115.500], 'outro']&lt;br /&gt;
    ]&lt;br /&gt;
  }&lt;br /&gt;
]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Ensure that &amp;lt;tt&amp;gt;offset_time&amp;lt;/tt&amp;gt; of one segment is the &amp;lt;tt&amp;gt;onset_time&amp;lt;/tt&amp;gt; of the next segment, and segments cover the entire duration of the piece analyzed. The first segment must start at &amp;lt;tt&amp;gt;0.0&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
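Below is a minimal, illustrative validation sketch (in Python) for the format above. The file name &amp;lt;tt&amp;gt;submission.txt&amp;lt;/tt&amp;gt; is an assumption, not part of the official tooling; because the example uses Python-style single quotes, &amp;lt;tt&amp;gt;ast.literal_eval&amp;lt;/tt&amp;gt; is used instead of a strict JSON parser.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import ast&lt;br /&gt;
&lt;br /&gt;
# Hypothetical file name; adjust to your own submission.&lt;br /&gt;
with open('submission.txt') as f:&lt;br /&gt;
    data = ast.literal_eval(f.read())  # accepts the single-quoted example above&lt;br /&gt;
&lt;br /&gt;
for entry in data:&lt;br /&gt;
    segments = entry['result']&lt;br /&gt;
    # The first segment must start at 0.0.&lt;br /&gt;
    assert segments[0][0][0] == 0.0, entry['id']&lt;br /&gt;
    for prev, nxt in zip(segments, segments[1:]):&lt;br /&gt;
        # The offset of one segment must equal the onset of the next.&lt;br /&gt;
        assert abs(prev[0][1] - nxt[0][0]) &amp;lt; 1e-6, entry['id']&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;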
=== Evaluation Procedures ===&lt;br /&gt;
&lt;br /&gt;
Evaluation will focus on both the accuracy of the detected segment boundaries and the correctness of the assigned functional labels. The primary metrics are:&lt;br /&gt;
&lt;br /&gt;
# '''Frame-Level Accuracy (ACC)''': Both the system output and the ground truth will be converted into time series of labels at a fine temporal resolution (e.g., 10 ms or 100 ms frames). Accuracy is calculated as the proportion of frames that are correctly labeled by the system compared to the ground truth across the entire dataset. This metric evaluates the overall correctness of segment labels and their temporal extents.&lt;br /&gt;
# '''Boundary Retrieval Hit Rate F-Measures (HR.5F and HR3F)''': This metric assesses the system's ability to correctly identify segment boundaries.&lt;br /&gt;
#* A predicted boundary is considered a '''hit''' if it falls within a certain tolerance window of a ground truth boundary.&lt;br /&gt;
#* Two tolerance windows will be used:&lt;br /&gt;
#** 0.5 seconds: for finer precision.&lt;br /&gt;
#** 3.0 seconds: for coarser, more perceptually relevant boundaries.&lt;br /&gt;
#* Based on these hits, '''Precision (P)''', '''Recall (R)''', and '''F-measure (F1-score)''' will be calculated for boundary detection at both tolerance levels:&lt;br /&gt;
#*: &amp;lt;math&amp;gt;P = \frac{\text{Number of correctly retrieved boundaries}}{\text{Total number of retrieved boundaries}}&amp;lt;/math&amp;gt;&lt;br /&gt;
#*: &amp;lt;math&amp;gt;R = \frac{\text{Number of correctly retrieved boundaries}}{\text{Total number of ground truth boundaries}}&amp;lt;/math&amp;gt;&lt;br /&gt;
#*: &amp;lt;math&amp;gt;F = \frac{2 \times P \times R}{P + R}&amp;lt;/math&amp;gt;&lt;br /&gt;
#* The reported metrics will be '''HR.5F''' (F-measure with 0.5 s tolerance) and '''HR3F''' (F-measure with 3 s tolerance).&lt;br /&gt;
&lt;br /&gt;
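As a hedged illustration of how these metrics can be computed, the sketch below uses the open-source &amp;lt;tt&amp;gt;mir_eval&amp;lt;/tt&amp;gt; library for the boundary F-measures and a simple frame loop for ACC. The intervals, labels, and 100 ms frame rate are made-up examples; the official evaluation code may differ.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
import mir_eval&lt;br /&gt;
&lt;br /&gt;
# Toy reference and estimated segmentations (intervals in seconds plus labels).&lt;br /&gt;
ref_int = np.array([[0.0, 15.5], [15.5, 45.2], [45.2, 90.0]])&lt;br /&gt;
ref_lab = ['intro', 'verse', 'chorus']&lt;br /&gt;
est_int = np.array([[0.0, 16.0], [16.0, 44.0], [44.0, 90.0]])&lt;br /&gt;
est_lab = ['intro', 'verse', 'chorus']&lt;br /&gt;
&lt;br /&gt;
# Boundary hit-rate F-measures at the two tolerance windows.&lt;br /&gt;
p05, r05, hr05f = mir_eval.segment.detection(ref_int, est_int, window=0.5)&lt;br /&gt;
p3, r3, hr3f = mir_eval.segment.detection(ref_int, est_int, window=3.0)&lt;br /&gt;
&lt;br /&gt;
# Frame-level accuracy at a 100 ms hop.&lt;br /&gt;
def label_at(intervals, labels, t):&lt;br /&gt;
    for (start, end), lab in zip(intervals, labels):&lt;br /&gt;
        if start &amp;lt;= t &amp;lt; end:&lt;br /&gt;
            return lab&lt;br /&gt;
    return labels[-1]&lt;br /&gt;
&lt;br /&gt;
frames = np.arange(0.0, ref_int[-1, 1], 0.1)&lt;br /&gt;
acc = np.mean([label_at(ref_int, ref_lab, t) == label_at(est_int, est_lab, t)&lt;br /&gt;
               for t in frames])&lt;br /&gt;
print(f'ACC={acc:.3f}  HR.5F={hr05f:.3f}  HR3F={hr3f:.3f}')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;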
==== Baseline ====&lt;br /&gt;
The performance of the method described in '''Kim, T., &amp;amp; Nam, J. (2023). All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio.''' will serve as a baseline for this task. Participants are encouraged to develop systems that surpass this baseline.&lt;br /&gt;
&lt;br /&gt;
=== Relevant Development Collections ===&lt;br /&gt;
While the MIREX datasets will be used for evaluation, participants may find the following publicly available annotated corpora useful for development. Please note that the annotations in these corpora will also need to be mapped to the 7-class functional labeling scheme if used for training models for this task.&lt;br /&gt;
&lt;br /&gt;
* Jouni Paulus's [http://www.cs.tut.fi/sgn/arg/paulus/structure.html structure analysis page] links to a corpus of 177 Beatles songs ([http://www.cs.tut.fi/sgn/arg/paulus/beatles_sections_TUT.zip zip file]). The TUTstructure07 dataset, containing 557 songs, is also listed [http://www.cs.tut.fi/sgn/arg/paulus/TUTstructure07_files.html here].&lt;br /&gt;
* Ewald Peiszer's [http://www.ifs.tuwien.ac.at/mir/audiosegmentation.html thesis page] links to a portion of his corpus: 43 non-Beatles pop songs (including 10 J-pop songs) ([http://www.ifs.tuwien.ac.at/mir/audiosegmentation/dl/ep_groundtruth_excl_Paulus.zip zip file]).&lt;br /&gt;
&lt;br /&gt;
These public corpora offer over 200 songs that can be adapted for development purposes.&lt;br /&gt;
&lt;br /&gt;
=== Time and Hardware Limits ===&lt;br /&gt;
Due to the nature of the Codabench platform and the potentially high number of participants, limits on the runtime and computational resources for submissions may be imposed. Specific details regarding these limits will be provided closer to the submission deadline. A general guideline is that analysis should be computationally feasible. For reference, a hard limit of '''24 hours''' for total analysis time over the evaluation dataset was imposed in previous iterations, and a similar constraint might apply.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2025:Music_Description_%26_Captioning&amp;diff=14738</id>
		<title>2025:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2025:Music_Description_%26_Captioning&amp;diff=14738"/>
		<updated>2025-08-27T08:00:26Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Task Description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2025 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
'''Submission''': https://www.codabench.org/competitions/10282/&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Test Dataset: The Song Describer dataset (SDD) ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320kbps 44.1 kHz MP3 audio encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* The dataset has no train/validation split because it is intended solely for evaluation. Participants must not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
== Additional Datasets: JamendoMaxCaps ==&lt;br /&gt;
&lt;br /&gt;
* Roy, A., Liu, R., Lu, T., &amp;amp; Herremans, D. (2025). JamendoMaxCaps: A Large Scale Music-caption Dataset with Imputed Metadata. arXiv preprint arXiv:2502.07461.&lt;br /&gt;
&lt;br /&gt;
= Baselines =&lt;br /&gt;
&lt;br /&gt;
== LP-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
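The following is a minimal PyTorch sketch of a cross-modal encoder-decoder captioner in the spirit of the description above. All layer sizes, the 16 kHz sample rate, and the vocabulary size are illustrative assumptions; this is not the published LP-MusicCaps implementation.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import torch&lt;br /&gt;
import torch.nn as nn&lt;br /&gt;
import torchaudio&lt;br /&gt;
&lt;br /&gt;
class AudioCaptioner(nn.Module):&lt;br /&gt;
    def __init__(self, vocab_size=10000, d_model=256):&lt;br /&gt;
        super().__init__()&lt;br /&gt;
        # Log-mel front end (sample rate and mel count are assumptions).&lt;br /&gt;
        self.melspec = torchaudio.transforms.MelSpectrogram(&lt;br /&gt;
            sample_rate=16000, n_mels=128)&lt;br /&gt;
        # Convolutional feature extractor with GELU activation.&lt;br /&gt;
        self.conv = nn.Sequential(&lt;br /&gt;
            nn.Conv1d(128, d_model, kernel_size=3, stride=2, padding=1),&lt;br /&gt;
            nn.GELU())&lt;br /&gt;
        # Learned positional encoding added to the audio features.&lt;br /&gt;
        self.pos = nn.Parameter(torch.zeros(1, 2048, d_model))&lt;br /&gt;
        enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)&lt;br /&gt;
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)&lt;br /&gt;
        dec = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)&lt;br /&gt;
        self.decoder = nn.TransformerDecoder(dec, num_layers=4)&lt;br /&gt;
        self.embed = nn.Embedding(vocab_size, d_model)&lt;br /&gt;
        self.out = nn.Linear(d_model, vocab_size)&lt;br /&gt;
&lt;br /&gt;
    def forward(self, audio, tokens):&lt;br /&gt;
        # audio: (batch, samples); tokens: (batch, caption_len) of token ids.&lt;br /&gt;
        x = self.melspec(audio).clamp(min=1e-5).log()   # (batch, mel, frames)&lt;br /&gt;
        x = self.conv(x).transpose(1, 2)                # (batch, frames, d_model)&lt;br /&gt;
        x = x + self.pos[:, :x.size(1)]&lt;br /&gt;
        memory = self.encoder(x)&lt;br /&gt;
        # Cross-attention from text tokens to audio features; a causal mask&lt;br /&gt;
        # would be added for training but is omitted in this sketch.&lt;br /&gt;
        y = self.decoder(self.embed(tokens), memory)&lt;br /&gt;
        return self.out(y)                              # (batch, len, vocab)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;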
== Additional Baselines ==&lt;br /&gt;
&lt;br /&gt;
* Qwen2-Audio. Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., ... &amp;amp; Zhou, J. (2024). Qwen2-audio technical report. arXiv preprint arXiv:2407.10759.&lt;br /&gt;
&lt;br /&gt;
* Gemini 2.5 Pro. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#enhanced-reasoning&lt;br /&gt;
&lt;br /&gt;
* MuMu-LLaMA. Liu, S., Hussain, A. S., Wu, Q., Sun, C., &amp;amp; Shan, Y. (2024). MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models. arXiv preprint arXiv:2412.06660.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B2~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B2 to B4 representing bigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
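For orientation, the sketch below computes ROUGE-L and an averaged BLEU for one clip with multiple references, using the &amp;lt;tt&amp;gt;rouge-score&amp;lt;/tt&amp;gt; and NLTK packages. Taking the best-matching reference for ROUGE-L and averaging BLEU-1 through BLEU-4 are assumptions about the evaluation protocol, not its official definition.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from rouge_score import rouge_scorer&lt;br /&gt;
from nltk.translate.bleu_score import sentence_bleu&lt;br /&gt;
&lt;br /&gt;
prediction = 'Gentle acoustic guitar instrumental with a soothing melody.'&lt;br /&gt;
references = ['A calm fingerpicked acoustic guitar piece.',&lt;br /&gt;
              'Soothing solo guitar, slow and reflective.']&lt;br /&gt;
&lt;br /&gt;
# ROUGE-L: rouge-score is single-reference, so take the best reference.&lt;br /&gt;
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)&lt;br /&gt;
rouge_l = max(scorer.score(ref, prediction)['rougeL'].fmeasure&lt;br /&gt;
              for ref in references)&lt;br /&gt;
&lt;br /&gt;
# BLEU: NLTK's sentence_bleu is natively multi-reference; average BLEU-1..4.&lt;br /&gt;
refs_tok = [r.lower().split() for r in references]&lt;br /&gt;
pred_tok = prediction.lower().split()&lt;br /&gt;
bleus = [sentence_bleu(refs_tok, pred_tok, weights=tuple([1.0 / n] * n))&lt;br /&gt;
         for n in range(1, 5)]&lt;br /&gt;
bleu_avg = sum(bleus) / len(bleus)&lt;br /&gt;
print(f'ROUGE-L={rouge_l:.3f}  BLEU(avg 1-4)={bleu_avg:.3f}')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;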
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using Codabench (TBD) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* '''Submission Deadline: September 1, AoE (Anywhere on Earth)'''&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation. &lt;br /&gt;
&lt;br /&gt;
; Example&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;1004034&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;,&lt;br /&gt;
  &amp;quot;1007274&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;,&lt;br /&gt;
  &amp;quot;1009321&amp;quot;: &amp;quot;Energetic rock song with distorted electric guitars, powerful drumming, and passionate vocals, ideal for an intense workout session.&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: Although an audio file in the dataset may correspond to multiple captions, participants only need to submit one generated caption for each audio file (identified by track_id). During the evaluation phase, multiple reference captions will be reflected in the calculation of metrics through the multi-reference evaluation method. The number of entries in the submitted JSON file should '''match the number of audio files''' in the assessment dataset, not the total number of original description texts.&lt;br /&gt;
&lt;br /&gt;
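As a sanity check before uploading, a short script can verify that the JSON has exactly one string caption per audio file. The sketch below is illustrative; &amp;lt;tt&amp;gt;submission.json&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;track_ids.txt&amp;lt;/tt&amp;gt; (one track_id per line) are assumed stand-ins for your own files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import json&lt;br /&gt;
&lt;br /&gt;
with open('submission.json') as f:&lt;br /&gt;
    captions = json.load(f)&lt;br /&gt;
&lt;br /&gt;
with open('track_ids.txt') as f:&lt;br /&gt;
    expected = set(line.strip() for line in f if line.strip())&lt;br /&gt;
&lt;br /&gt;
# One entry per audio file, keyed by track_id, each a single caption string.&lt;br /&gt;
assert set(captions) == expected, 'entry count must match the audio files'&lt;br /&gt;
assert all(isinstance(c, str) and c for c in captions.values())&lt;br /&gt;
print(f'{len(captions)} captions ready for upload')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;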
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Paper =&lt;br /&gt;
&lt;br /&gt;
;Research paper submission&lt;br /&gt;
:Participants are encouraged to submit the technical report to the '''LBD track''' or the '''LLM4Music Satellite Event''' at ISMIR 2025. &lt;br /&gt;
&lt;br /&gt;
;Workshop presentation&lt;br /&gt;
:We will invite top-ranked participants to present their work during the workshop session. The format will be hybrid to accommodate remote participation.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. (2023). The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=MIREX_HOME&amp;diff=14626</id>
		<title>MIREX HOME</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=MIREX_HOME&amp;diff=14626"/>
		<updated>2025-05-20T01:58:06Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Contact Us */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Welcome to MIREX 2025==&lt;br /&gt;
&lt;br /&gt;
After a three-year break, we are bringing back the MIREX (Music Information Retrieval Evaluation eXchange) competition, starting in 2024, with new tasks, benchmarks, and datasets in response to the rapid development of computer music research.&lt;br /&gt;
&lt;br /&gt;
The MIREX community will hold its annual meeting as part of [https://ismir.net/ The International Society for Music Information Retrieval Conference]. This year, the conference will be held in [https://ismir2025.ismir.net/ Daejeon, South Korea] from September 21-25, 2025.&lt;br /&gt;
&lt;br /&gt;
In the long run, we want to make MIREX a platform for researchers to share their latest research results, compare their systems with others, and promote the development of the field.&lt;br /&gt;
&lt;br /&gt;
==MIREX 2025 Task Descriptions==&lt;br /&gt;
&lt;br /&gt;
To be announced.&lt;br /&gt;
&lt;br /&gt;
==Call for Challenges==&lt;br /&gt;
&lt;br /&gt;
Starting with MIREX 2024, we invite the ISMIR community to propose new research challenges that address cutting-edge problems in Music Information Retrieval (MIR). These challenges should aim to push the boundaries of current research and foster innovation in the field.&lt;br /&gt;
&lt;br /&gt;
We also welcome challenge sponsors from both industry and research institutions, particularly those willing to contribute datasets and computational resources to support the competition.&lt;br /&gt;
&lt;br /&gt;
For the format and requirements for the challenge proposal, please go to [[2025:Call for Challenges]].&lt;br /&gt;
&lt;br /&gt;
===What's new:===&lt;br /&gt;
&lt;br /&gt;
Starting with MIREX 2025, we invite the ISMIR community to participate in shaping the future of Music Information Retrieval (MIR) by either '''proposing new research challenges''' or '''volunteering as task captains''' for existing ones. &lt;br /&gt;
&lt;br /&gt;
* '''New challenge proposals''' should aim to address cutting-edge problems and push the boundaries of current MIR research. &lt;br /&gt;
* '''Task captains for established tasks''' are encouraged to help revitalize previous tasks—potentially by updating evaluation methodologies, datasets, or other aspects to reflect recent advances in the field.&lt;br /&gt;
&lt;br /&gt;
Task Captain Responsibilities:&lt;br /&gt;
&lt;br /&gt;
* Register on the [https://www.music-ir.org/mirex MIREX Wiki] and maintain a task description page.&lt;br /&gt;
* Collect submissions via the MIREX submission server (or provide customized submission instructions).&lt;br /&gt;
* Execute and evaluate the submissions.&lt;br /&gt;
* Report results to MIREX and create a results page on the MIREX Wiki.&lt;br /&gt;
* (Optional) Present a MIREX task captain poster at the Late-Breaking and Demo (LBD) session at ISMIR 2025.&lt;br /&gt;
&lt;br /&gt;
==How to Participate==&lt;br /&gt;
&lt;br /&gt;
* Read the [[Participant Agreement]] and task description carefully.&lt;br /&gt;
* Program your system. For some tasks, a Docker image is required for submission. See the [[Submission Guidelines]].&lt;br /&gt;
* Write a 2-4 page extended abstract PDF describing your system.&lt;br /&gt;
* Submit your system and extended abstract through the submission portal (to be announced). Check individual task forums for updates.&lt;br /&gt;
* Top-performing teams will have the opportunity to present their MIREX posters at the LBD session at ISMIR 2025.&lt;br /&gt;
&lt;br /&gt;
==Important Dates==&lt;br /&gt;
&lt;br /&gt;
* Challenge proposals due: May 9, 2025&lt;br /&gt;
* Notification of acceptance: May 16, 2025&lt;br /&gt;
* Submission open: May 30, 2025&lt;br /&gt;
* Submission close: See task descriptions for details&lt;br /&gt;
* Results published: See task descriptions for details&lt;br /&gt;
&lt;br /&gt;
==Contact Us==&lt;br /&gt;
&lt;br /&gt;
For general questions, feedback, and suggestions, please send a message to our mailing list, future-mirex@googlegroups.com. For task-specific inquiries, please email the individual task captains or post a question in the [http://futuremirex.com/portal/ submission portal] forum.&lt;br /&gt;
&lt;br /&gt;
We are looking forward to seeing you at MIREX 2025!&lt;br /&gt;
&lt;br /&gt;
Future MIREX Team, 2025&lt;br /&gt;
&lt;br /&gt;
MIREX 2025 Organizers:&lt;br /&gt;
* Gus Xia, MBZUAI&lt;br /&gt;
* Junyan Jiang, New York University&lt;br /&gt;
* Akira Maezawa, Yamaha &lt;br /&gt;
* Ziyu Wang, New York University&lt;br /&gt;
* Yixiao Zhang, ByteDance Inc.&lt;br /&gt;
* Ruibin Yuan, Hong Kong University of Science and Technology&lt;br /&gt;
* J. Stephen Downie, University of Illinois&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis&amp;diff=14625</id>
		<title>2025:Music Structure Analysis</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis&amp;diff=14625"/>
		<updated>2025-05-20T00:27:59Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Output Data Format */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Music Structure Analysis (MIREX 2025) ==&lt;br /&gt;
&lt;br /&gt;
'''Important Note: MIREX 2025 will be held as a workshop of ISMIR 2025. Papers accepted and presented at MIREX 2025 will have the opportunity to be showcased in the ISMIR 2025 Late Breaking Demo Track.'''&lt;br /&gt;
&lt;br /&gt;
=== Description ===&lt;br /&gt;
&lt;br /&gt;
The aim of the MIREX Music Structure Analysis task is to identify and label key structural sections in musical audio. Understanding the musical form (e.g., intro, verse, chorus) is fundamental to music understanding and a crucial component in many music information retrieval applications. While traditional approaches focused on segmenting music into internally consistent, but arbitrarily labeled, sections (e.g., A, B, C), this task has evolved.&lt;br /&gt;
&lt;br /&gt;
Since 2020, a new paradigm has emerged, focusing on '''functional structure analysis'''. The goal is to segment the audio and assign a specific functional label to each segment from a predefined set of common musical functions. This task challenges systems to perform both accurate boundary detection and correct functional classification.&lt;br /&gt;
&lt;br /&gt;
This task builds upon a history of structural segmentation evaluations, first run in MIREX 2009. Recent works driving this updated focus include:&lt;br /&gt;
* Wang, J. C., Hung, Y. N., &amp;amp; Smith, J. B. (2022, May). To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions. In ''ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)'' (pp. 416-420). IEEE.&lt;br /&gt;
* Kim, T., &amp;amp; Nam, J. (2023, October). All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio. In ''2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)'' (pp. 1-5). IEEE.&lt;br /&gt;
* Buisson, M., McFee, B., Essid, S., &amp;amp; Crayencour, H. C. (2024). Self-supervised learning of multi-level audio representations for music segmentation. ''IEEE/ACM Transactions on Audio, Speech, and Language Processing''.&lt;br /&gt;
&lt;br /&gt;
For MIREX 2025, participants are required to segment musical audio and classify each segment into one of seven functional categories: '''‘intro’, ‘verse’, ‘chorus’, ‘bridge’, ‘inst’ (instrumental), ‘outro’, or ‘other’'''. The 'other' category can be used for segments that do not fit into the primary six functional labels or for non-musical content if explicitly defined by the dataset annotations being mapped.&lt;br /&gt;
&lt;br /&gt;
=== Data ===&lt;br /&gt;
&lt;br /&gt;
==== Collections ====&lt;br /&gt;
The evaluation will utilize datasets previously established in MIREX. Annotations from these diverse collections will be mapped to the seven target functional labels for consistent evaluation.&lt;br /&gt;
* '''The MIREX 2009 Collection''': 297 pieces, largely derived from the work of the Beatles.&lt;br /&gt;
* '''MIREX 2010 RWC collection''': 100 pieces of popular music. This collection has two sets of ground truths. The first was originally included with the RWC dataset. The second set provides segment boundary annotations (see [http://hal.inria.fr/docs/00/47/34/79/PDF/PI-1948.pdf Bimbot et al., 2010] for details).&lt;br /&gt;
* '''MIREX 2012 dataset''': Over 1,000 annotated pieces covering a range of musical styles, with the majority annotated by two independent annotators.&lt;br /&gt;
&lt;br /&gt;
Participants should be aware that original labels in these datasets (e.g., 'verse1', 'solo', 'fade-out') will need to be mapped to the seven specified functional categories for evaluation. Guidelines for this mapping will be provided, or a standard mapping will be applied during evaluation.&lt;br /&gt;
&lt;br /&gt;
==== Audio Formats (Input to Algorithms) ====&lt;br /&gt;
Algorithms should be prepared to process audio with the following characteristics:&lt;br /&gt;
* Sample rate: 44.1 kHz&lt;br /&gt;
* Bit depth: 16 bit&lt;br /&gt;
* Number of channels: 1 (mono)&lt;br /&gt;
* Encoding: WAV&lt;br /&gt;
&lt;br /&gt;
=== Submission Format ===&lt;br /&gt;
&lt;br /&gt;
Submissions will be handled via '''Codabench'''. Participants are required to submit their results in a specific format, as detailed below. You will upload a single file containing the segmentation results for all test audio files.&lt;br /&gt;
&lt;br /&gt;
==== Output Data Format ====&lt;br /&gt;
The output must be a '''list of dictionaries''' in a text-based format (e.g., JSON parsable). Each dictionary in the list corresponds to one audio file and must contain two keys: &amp;lt;tt&amp;gt;'id'&amp;lt;/tt&amp;gt; (the identifier of the audio file, e.g., '1.wav') and &amp;lt;tt&amp;gt;'result'&amp;lt;/tt&amp;gt; (a list of segment predictions). Each segment prediction is a list containing two elements: a two-element list with the &amp;lt;tt&amp;gt;[start_time, end_time]&amp;lt;/tt&amp;gt; of the segment in seconds, and the &amp;lt;tt&amp;gt;label&amp;lt;/tt&amp;gt; string for that segment.&lt;br /&gt;
&lt;br /&gt;
The labels must be one of the seven target functional categories: &amp;lt;tt&amp;gt;'intro'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'verse'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'chorus'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'bridge'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'inst'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'outro'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'silence'&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Example of the content of the submitted file:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[&lt;br /&gt;
  {&lt;br /&gt;
    'id': 'track01.wav',&lt;br /&gt;
    'result': [&lt;br /&gt;
      [[0.000, 15.500], 'intro'],&lt;br /&gt;
      [[15.500, 45.230], 'verse'],&lt;br /&gt;
      [[45.230, 75.800], 'chorus'],&lt;br /&gt;
      [[75.800, 90.000], 'outro']&lt;br /&gt;
    ]&lt;br /&gt;
  },&lt;br /&gt;
  {&lt;br /&gt;
    'id': 'track02.wav',&lt;br /&gt;
    'result': [&lt;br /&gt;
      [[0.000, 20.100], 'verse'],&lt;br /&gt;
      [[20.100, 38.500], 'chorus'],&lt;br /&gt;
      [[38.500, 55.000], 'verse'],&lt;br /&gt;
      [[55.000, 72.600], 'chorus'],&lt;br /&gt;
      [[72.600, 89.000], 'bridge'],&lt;br /&gt;
      [[89.000, 105.000], 'chorus'],&lt;br /&gt;
      [[105.000, 115.500], 'outro']&lt;br /&gt;
    ]&lt;br /&gt;
  }&lt;br /&gt;
]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Ensure that &amp;lt;tt&amp;gt;offset_time&amp;lt;/tt&amp;gt; of one segment is the &amp;lt;tt&amp;gt;onset_time&amp;lt;/tt&amp;gt; of the next segment, and segments cover the entire duration of the piece analyzed. The first segment must start at &amp;lt;tt&amp;gt;0.0&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Evaluation Procedures ===&lt;br /&gt;
&lt;br /&gt;
Evaluation will focus on both the accuracy of the detected segment boundaries and the correctness of the assigned functional labels. The primary metrics are:&lt;br /&gt;
&lt;br /&gt;
# '''Frame-Level Accuracy (ACC)''': Both the system output and the ground truth will be converted into time series of labels at a fine temporal resolution (e.g., 10 ms or 100 ms frames). Accuracy is calculated as the proportion of frames that are correctly labeled by the system compared to the ground truth across the entire dataset. This metric evaluates the overall correctness of segment labels and their temporal extents.&lt;br /&gt;
# '''Boundary Retrieval Hit Rate F-Measures (HR.5F and HR3F)''': This metric assesses the system's ability to correctly identify segment boundaries.&lt;br /&gt;
#* A predicted boundary is considered a '''hit''' if it falls within a certain tolerance window of a ground truth boundary.&lt;br /&gt;
#* Two tolerance windows will be used:&lt;br /&gt;
#** 0.5 seconds: for finer precision.&lt;br /&gt;
#** 3.0 seconds: for coarser, more perceptually relevant boundaries.&lt;br /&gt;
#* Based on these hits, '''Precision (P)''', '''Recall (R)''', and '''F-measure (F1-score)''' will be calculated for boundary detection at both tolerance levels:&lt;br /&gt;
#*: &amp;lt;math&amp;gt;P = \frac{\text{Number of correctly retrieved boundaries}}{\text{Total number of retrieved boundaries}}&amp;lt;/math&amp;gt;&lt;br /&gt;
#*: &amp;lt;math&amp;gt;R = \frac{\text{Number of correctly retrieved boundaries}}{\text{Total number of ground truth boundaries}}&amp;lt;/math&amp;gt;&lt;br /&gt;
#*: &amp;lt;math&amp;gt;F = \frac{2 \times P \times R}{P + R}&amp;lt;/math&amp;gt;&lt;br /&gt;
#* The reported metrics will be '''HR.5F''' (F-measure with 0.5 s tolerance) and '''HR3F''' (F-measure with 3 s tolerance).&lt;br /&gt;
&lt;br /&gt;
==== Baseline ====&lt;br /&gt;
The performance of the method described in '''Kim, T., &amp;amp; Nam, J. (2023). All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio.''' will serve as a baseline for this task. Participants are encouraged to develop systems that surpass this baseline.&lt;br /&gt;
&lt;br /&gt;
=== Relevant Development Collections ===&lt;br /&gt;
While the MIREX datasets will be used for evaluation, participants may find the following publicly available annotated corpora useful for development. Please note that the annotations in these corpora will also need to be mapped to the 7-class functional labeling scheme if used for training models for this task.&lt;br /&gt;
&lt;br /&gt;
* Jouni Paulus's [http://www.cs.tut.fi/sgn/arg/paulus/structure.html structure analysis page] links to a corpus of 177 Beatles songs ([http://www.cs.tut.fi/sgn/arg/paulus/beatles_sections_TUT.zip zip file]). The TUTstructure07 dataset, containing 557 songs, is also listed [http://www.cs.tut.fi/sgn/arg/paulus/TUTstructure07_files.html here].&lt;br /&gt;
* Ewald Peiszer's [http://www.ifs.tuwien.ac.at/mir/audiosegmentation.html thesis page] links to a portion of his corpus: 43 non-Beatles pop songs (including 10 J-pop songs) ([http://www.ifs.tuwien.ac.at/mir/audiosegmentation/dl/ep_groundtruth_excl_Paulus.zip zip file]).&lt;br /&gt;
&lt;br /&gt;
These public corpora offer over 200 songs that can be adapted for development purposes.&lt;br /&gt;
&lt;br /&gt;
=== Time and Hardware Limits ===&lt;br /&gt;
Due to the nature of the Codabench platform and the potentially high number of participants, limits on the runtime and computational resources for submissions may be imposed. Specific details regarding these limits will be provided closer to the submission deadline. A general guideline is that analysis should be computationally feasible. For reference, a hard limit of '''24 hours''' for total analysis time over the evaluation dataset was imposed in previous iterations, and a similar constraint might apply.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis&amp;diff=14624</id>
		<title>2025:Music Structure Analysis</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis&amp;diff=14624"/>
		<updated>2025-05-20T00:27:08Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: Created page with &amp;quot;== Music Structure Analysis (MIREX 2025) ==  '''Important Note: MIREX 2025 will be held as a workshop of ISMIR 2025. Papers accepted and presented at MIREX 2025 will have the...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Music Structure Analysis (MIREX 2025) ==&lt;br /&gt;
&lt;br /&gt;
'''Important Note: MIREX 2025 will be held as a workshop of ISMIR 2025. Papers accepted and presented at MIREX 2025 will have the opportunity to be showcased in the ISMIR 2025 Late Breaking Demo Track.'''&lt;br /&gt;
&lt;br /&gt;
=== Description ===&lt;br /&gt;
&lt;br /&gt;
The aim of the MIREX Music Structure Analysis task is to identify and label key structural sections in musical audio. Understanding the musical form (e.g., intro, verse, chorus) is fundamental to music understanding and a crucial component in many music information retrieval applications. While traditional approaches focused on segmenting music into internally consistent, but arbitrarily labeled, sections (e.g., A, B, C), this task has evolved.&lt;br /&gt;
&lt;br /&gt;
Since 2020, a new paradigm has emerged, focusing on '''functional structure analysis'''. The goal is to segment the audio and assign a specific functional label to each segment from a predefined set of common musical functions. This task challenges systems to perform both accurate boundary detection and correct functional classification.&lt;br /&gt;
&lt;br /&gt;
This task builds upon a history of structural segmentation evaluations, first run in MIREX 2009. Recent works driving this updated focus include:&lt;br /&gt;
* Wang, J. C., Hung, Y. N., &amp;amp; Smith, J. B. (2022, May). To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions. In ''ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)'' (pp. 416-420). IEEE.&lt;br /&gt;
* Kim, T., &amp;amp; Nam, J. (2023, October). All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio. In ''2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)'' (pp. 1-5). IEEE.&lt;br /&gt;
* Buisson, M., McFee, B., Essid, S., &amp;amp; Crayencour, H. C. (2024). Self-supervised learning of multi-level audio representations for music segmentation. ''IEEE/ACM Transactions on Audio, Speech, and Language Processing''.&lt;br /&gt;
&lt;br /&gt;
For MIREX 2025, participants are required to segment musical audio and classify each segment into one of seven functional categories: '''‘intro’, ‘verse’, ‘chorus’, ‘bridge’, ‘inst’ (instrumental), ‘outro’, or ‘other’'''. The 'other' category can be used for segments that do not fit into the primary six functional labels or for non-musical content if explicitly defined by the dataset annotations being mapped.&lt;br /&gt;
&lt;br /&gt;
=== Data ===&lt;br /&gt;
&lt;br /&gt;
==== Collections ====&lt;br /&gt;
The evaluation will utilize datasets previously established in MIREX. Annotations from these diverse collections will be mapped to the seven target functional labels for consistent evaluation.&lt;br /&gt;
* '''The MIREX 2009 Collection''': 297 pieces, largely derived from the work of the Beatles.&lt;br /&gt;
* '''MIREX 2010 RWC collection''': 100 pieces of popular music. This collection has two sets of ground truths. The first was originally included with the RWC dataset. The second set provides segment boundary annotations (see [http://hal.inria.fr/docs/00/47/34/79/PDF/PI-1948.pdf Bimbot et al., 2010] for details).&lt;br /&gt;
* '''MIREX 2012 dataset''': Over 1,000 annotated pieces covering a range of musical styles, with the majority annotated by two independent annotators.&lt;br /&gt;
&lt;br /&gt;
Participants should be aware that original labels in these datasets (e.g., 'verse1', 'solo', 'fade-out') will need to be mapped to the seven specified functional categories for evaluation. Guidelines for this mapping will be provided, or a standard mapping will be applied during evaluation.&lt;br /&gt;
&lt;br /&gt;
==== Audio Formats (Input to Algorithms) ====&lt;br /&gt;
Algorithms should be prepared to process audio with the following characteristics:&lt;br /&gt;
* Sample rate: 44.1 kHz&lt;br /&gt;
* Bit depth: 16 bit&lt;br /&gt;
* Number of channels: 1 (mono)&lt;br /&gt;
* Encoding: WAV&lt;br /&gt;
&lt;br /&gt;
=== Submission Format ===&lt;br /&gt;
&lt;br /&gt;
Submissions will be handled via '''Codabench'''. Participants are required to submit their results in a specific format, as detailed below. You will upload a single file containing the segmentation results for all test audio files.&lt;br /&gt;
&lt;br /&gt;
==== Output Data Format ====&lt;br /&gt;
The output must be a '''list of dictionaries''' in a text-based format (e.g., JSON parsable). Each dictionary in the list corresponds to one audio file and must contain two keys: &amp;lt;tt&amp;gt;'id'&amp;lt;/tt&amp;gt; (the identifier of the audio file, e.g., '1.wav') and &amp;lt;tt&amp;gt;'result'&amp;lt;/tt&amp;gt; (a list of segment predictions). Each segment prediction is a list containing two elements: a two-element list with the &amp;lt;tt&amp;gt;[start_time, end_time]&amp;lt;/tt&amp;gt; of the segment in seconds, and the &amp;lt;tt&amp;gt;label&amp;lt;/tt&amp;gt; string for that segment.&lt;br /&gt;
&lt;br /&gt;
The labels must be one of the seven target functional categories: &amp;lt;tt&amp;gt;'intro'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'verse'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'chorus'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'bridge'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'inst'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'outro'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'other'&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Example of the content of the submitted file:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[&lt;br /&gt;
  {&lt;br /&gt;
    'id': 'track01.wav',&lt;br /&gt;
    'result': [&lt;br /&gt;
      [[0.000, 15.500], 'intro'],&lt;br /&gt;
      [[15.500, 45.230], 'verse'],&lt;br /&gt;
      [[45.230, 75.800], 'chorus'],&lt;br /&gt;
      [[75.800, 90.000], 'outro']&lt;br /&gt;
    ]&lt;br /&gt;
  },&lt;br /&gt;
  {&lt;br /&gt;
    'id': 'track02.wav',&lt;br /&gt;
    'result': [&lt;br /&gt;
      [[0.000, 20.100], 'verse'],&lt;br /&gt;
      [[20.100, 38.500], 'chorus'],&lt;br /&gt;
      [[38.500, 55.000], 'verse'],&lt;br /&gt;
      [[55.000, 72.600], 'chorus'],&lt;br /&gt;
      [[72.600, 89.000], 'bridge'],&lt;br /&gt;
      [[89.000, 105.000], 'chorus'],&lt;br /&gt;
      [[105.000, 115.500], 'outro']&lt;br /&gt;
    ]&lt;br /&gt;
  }&lt;br /&gt;
]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Ensure that &amp;lt;tt&amp;gt;offset_time&amp;lt;/tt&amp;gt; of one segment is the &amp;lt;tt&amp;gt;onset_time&amp;lt;/tt&amp;gt; of the next segment, and segments cover the entire duration of the piece analyzed. The first segment must start at &amp;lt;tt&amp;gt;0.0&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Evaluation Procedures ===&lt;br /&gt;
&lt;br /&gt;
Evaluation will focus on both the accuracy of the detected segment boundaries and the correctness of the assigned functional labels. The primary metrics are:&lt;br /&gt;
&lt;br /&gt;
# '''Frame-Level Accuracy (ACC)''': Both the system output and the ground truth will be converted into time series of labels at a fine temporal resolution (e.g., 10 ms or 100 ms frames). Accuracy is calculated as the proportion of frames that are correctly labeled by the system compared to the ground truth across the entire dataset. This metric evaluates the overall correctness of segment labels and their temporal extents.&lt;br /&gt;
# '''Boundary Retrieval Hit Rate F-Measures (HR.5F and HR3F)''': This metric assesses the system's ability to correctly identify segment boundaries.&lt;br /&gt;
#* A predicted boundary is considered a '''hit''' if it falls within a certain tolerance window of a ground-truth boundary.&lt;br /&gt;
#* Two tolerance windows will be used:&lt;br /&gt;
#** 0.5 seconds: for finer precision.&lt;br /&gt;
#** 3.0 seconds: for coarser, more perceptually relevant boundaries.&lt;br /&gt;
#* Based on these hits, '''Precision (P)''', '''Recall (R)''', and '''F-measure (F1-score)''' will be calculated for boundary detection at both tolerance levels:&lt;br /&gt;
#* &amp;lt;math&amp;gt;P = \frac{\text{Number of correctly retrieved boundaries}}{\text{Total number of retrieved boundaries}}&amp;lt;/math&amp;gt;&lt;br /&gt;
#* &amp;lt;math&amp;gt;R = \frac{\text{Number of correctly retrieved boundaries}}{\text{Total number of ground truth boundaries}}&amp;lt;/math&amp;gt;&lt;br /&gt;
#* &amp;lt;math&amp;gt;F = \frac{2 \times P \times R}{P + R}&amp;lt;/math&amp;gt;&lt;br /&gt;
#* The reported metrics will be '''HR.5F''' (F-measure with 0.5 s tolerance) and '''HR3F''' (F-measure with 3 s tolerance). A rough implementation sketch of both metrics follows this list.&lt;br /&gt;
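&lt;br /&gt;
For illustration only, the following sketch implements both metrics from scratch, assuming segments are given as ([start, end], label) pairs in the submission format above. It uses a greedy one-to-one boundary matching; the official scorer (e.g., one built on a package such as mir_eval) may differ in details such as the matching strategy or frame quantization.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def frame_accuracy(ref, est, hop=0.1):&lt;br /&gt;
    # ref, est: lists of ([start, end], label); hop: frame size in seconds&lt;br /&gt;
    duration = min(ref[-1][0][1], est[-1][0][1])&lt;br /&gt;
    frames = np.arange(0.0, duration, hop)&lt;br /&gt;
    def label_at(segments, t):&lt;br /&gt;
        for (s, e), lab in segments:&lt;br /&gt;
            if s &amp;lt;= t &amp;lt; e:&lt;br /&gt;
                return lab&lt;br /&gt;
        return segments[-1][1]  # fall back to the final label&lt;br /&gt;
    hits = sum(label_at(ref, t) == label_at(est, t) for t in frames)&lt;br /&gt;
    return hits / len(frames)&lt;br /&gt;
&lt;br /&gt;
def boundary_f_measure(ref, est, window=0.5):&lt;br /&gt;
    # boundaries are the internal segment start times (the 0.0 start is excluded)&lt;br /&gt;
    ref_b = [seg[0][0] for seg in ref[1:]]&lt;br /&gt;
    est_b = [seg[0][0] for seg in est[1:]]&lt;br /&gt;
    matched, used = 0, set()&lt;br /&gt;
    for b in est_b:  # greedy one-to-one matching within the tolerance window&lt;br /&gt;
        candidates = [i for i, r in enumerate(ref_b)&lt;br /&gt;
                      if i not in used and abs(r - b) &amp;lt;= window]&lt;br /&gt;
        if candidates:&lt;br /&gt;
            used.add(min(candidates, key=lambda i: abs(ref_b[i] - b)))&lt;br /&gt;
            matched += 1&lt;br /&gt;
    p = matched / len(est_b) if est_b else 0.0&lt;br /&gt;
    r = matched / len(ref_b) if ref_b else 0.0&lt;br /&gt;
    return 2 * p * r / (p + r) if p + r &amp;gt; 0 else 0.0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;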
&lt;br /&gt;
==== Baseline ====&lt;br /&gt;
The performance of the method described in '''Kim, T., &amp;amp; Nam, J. (2023). All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio.''' will serve as a baseline for this task. Participants are encouraged to develop systems that surpass this baseline.&lt;br /&gt;
&lt;br /&gt;
=== Relevant Development Collections ===&lt;br /&gt;
While the MIREX datasets will be used for evaluation, participants may find the following publicly available annotated corpora useful for development. Please note that the annotations in these corpora will also need to be mapped to the 7-class functional labeling scheme if used for training models for this task.&lt;br /&gt;
&lt;br /&gt;
* Jouni Paulus's [http://www.cs.tut.fi/sgn/arg/paulus/structure.html structure analysis page] links to a corpus of 177 Beatles songs ([http://www.cs.tut.fi/sgn/arg/paulus/beatles_sections_TUT.zip zip file]). The TUTstructure07 dataset, containing 557 songs, is also listed [http://www.cs.tut.fi/sgn/arg/paulus/TUTstructure07_files.html here].&lt;br /&gt;
* Ewald Peiszer's [http://www.ifs.tuwien.ac.at/mir/audiosegmentation.html thesis page] links to a portion of his corpus: 43 non-Beatles pop songs (including 10 J-pop songs) ([http://www.ifs.tuwien.ac.at/mir/audiosegmentation/dl/ep_groundtruth_excl_Paulus.zip zip file]).&lt;br /&gt;
&lt;br /&gt;
These public corpora offer over 200 songs that can be adapted for development purposes.&lt;br /&gt;
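&lt;br /&gt;
As an illustration of the label mapping mentioned above, a minimal sketch is shown below. The source label names and their targets are hypothetical; each corpus's actual annotation vocabulary must be inspected before defining such a table.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Hypothetical mapping from corpus-specific section names to the 7 target labels.&lt;br /&gt;
TO_SEVEN_CLASS = {&lt;br /&gt;
    'intro': 'intro',&lt;br /&gt;
    'verse': 'verse',&lt;br /&gt;
    'refrain': 'chorus',&lt;br /&gt;
    'chorus': 'chorus',&lt;br /&gt;
    'bridge': 'bridge',&lt;br /&gt;
    'solo': 'inst',&lt;br /&gt;
    'interlude': 'inst',&lt;br /&gt;
    'outro': 'outro',&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
def map_label(raw):&lt;br /&gt;
    # unknown section names fall back to 'other'&lt;br /&gt;
    return TO_SEVEN_CLASS.get(raw.strip().lower(), 'other')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;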
&lt;br /&gt;
=== Time and Hardware Limits ===&lt;br /&gt;
Due to the nature of the CodaBench platform and the potentially high number of participants, limits on the runtime and computational resources for submissions may be imposed. Specific details will be provided closer to the submission deadline. As a general guideline, analysis should be computationally feasible. For reference, a hard limit of '''24 hours''' of total analysis time over the evaluation dataset was imposed in previous iterations, and a similar constraint might apply.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2025:Music_Description_%26_Captioning&amp;diff=14623</id>
		<title>2025:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2025:Music_Description_%26_Captioning&amp;diff=14623"/>
		<updated>2025-05-20T00:01:26Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* SD-MusicCaps: Model Architecture */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2025 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
'''Submission''': TBD&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Test Dataset: The Song Describer dataset (SDD) ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided as 320 kbps, 44.1 kHz MP3 encodings.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* While there is no recommended split for training and evaluation, the dataset is intended to be used solely for evaluation purposes. Participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
== Additional Datasets: JamendoMaxCaps ==&lt;br /&gt;
&lt;br /&gt;
* Roy, A., Liu, R., Lu, T., &amp;amp; Herremans, D. (2025). JamendoMaxCaps: A Large Scale Music-caption Dataset with Imputed Metadata. arXiv preprint arXiv:2502.07461.&lt;br /&gt;
&lt;br /&gt;
= Baselines =&lt;br /&gt;
&lt;br /&gt;
== LP-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that model the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
== Additional Baselines ==&lt;br /&gt;
&lt;br /&gt;
* Qwen2-Audio. Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., ... &amp;amp; Zhou, J. (2024). Qwen2-audio technical report. arXiv preprint arXiv:2407.10759.&lt;br /&gt;
&lt;br /&gt;
* Gemini 2.5 Pro. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#enhanced-reasoning&lt;br /&gt;
&lt;br /&gt;
* MuMu-LLaMA. Liu, S., Hussain, A. S., Wu, Q., Sun, C., &amp;amp; Shan, Y. (2024). MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models. arXiv preprint arXiv:2412.06660.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B2~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B2 to B4 representing bigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
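&lt;br /&gt;
To make the primary metric concrete, below is a minimal, unofficial sketch of sentence-level ROUGE-L (longest-common-subsequence F-measure over whitespace tokens). The official scorer may differ in tokenization, stemming, the beta weighting, and multi-reference aggregation.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def lcs_len(a, b):&lt;br /&gt;
    # classic dynamic-programming longest-common-subsequence length&lt;br /&gt;
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]&lt;br /&gt;
    for i, x in enumerate(a):&lt;br /&gt;
        for j, y in enumerate(b):&lt;br /&gt;
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])&lt;br /&gt;
    return dp[-1][-1]&lt;br /&gt;
&lt;br /&gt;
def rouge_l(candidate, reference, beta=1.2):&lt;br /&gt;
    # recall-weighted F-measure, as in common captioning evaluators&lt;br /&gt;
    c, r = candidate.lower().split(), reference.lower().split()&lt;br /&gt;
    lcs = lcs_len(c, r)&lt;br /&gt;
    if lcs == 0:&lt;br /&gt;
        return 0.0&lt;br /&gt;
    prec, rec = lcs / len(c), lcs / len(r)&lt;br /&gt;
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;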
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (TBD) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* '''Submission Deadline: October 30, Anywhere on Earth (AoE)'''&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation. &lt;br /&gt;
&lt;br /&gt;
; Example&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;1004034&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;,&lt;br /&gt;
  &amp;quot;1007274&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;,&lt;br /&gt;
  &amp;quot;1009321&amp;quot;: &amp;quot;Energetic rock song with distorted electric guitars, powerful drumming, and passionate vocals, ideal for an intense workout session.&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: Although an audio file in the dataset may correspond to multiple captions, participants need to submit only one generated caption per audio file (identified by track_id). During evaluation, the multiple reference captions are accounted for via multi-reference computation of the metrics. The number of entries in the submitted JSON file should therefore '''match the number of audio files''' in the assessment dataset, not the total number of reference captions. A sketch of a simple pre-upload check follows.&lt;br /&gt;
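&lt;br /&gt;
The following is a small, unofficial sketch for checking a submission file before upload. The function name and the &amp;lt;tt&amp;gt;track_ids&amp;lt;/tt&amp;gt; argument (the official list of audio-file IDs) are placeholders.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import json&lt;br /&gt;
&lt;br /&gt;
def check_captions(submission_path, track_ids):&lt;br /&gt;
    with open(submission_path) as f:&lt;br /&gt;
        captions = json.load(f)  # {track_id: caption}&lt;br /&gt;
    missing = set(track_ids) - set(captions)&lt;br /&gt;
    extra = set(captions) - set(track_ids)&lt;br /&gt;
    assert not missing, 'missing captions for: %s' % sorted(missing)&lt;br /&gt;
    assert not extra, 'unknown track ids: %s' % sorted(extra)&lt;br /&gt;
    # exactly one non-empty string caption per audio file&lt;br /&gt;
    assert all(isinstance(c, str) and c.strip() for c in captions.values())&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;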
&lt;br /&gt;
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Paper =&lt;br /&gt;
&lt;br /&gt;
;Research paper submission&lt;br /&gt;
:Participants are encouraged to submit the technical report to the '''LBD track''' at ISMIR 2025. &lt;br /&gt;
&lt;br /&gt;
;Workshop presentation&lt;br /&gt;
:We will invite top-ranked participants to present their work during the workshop session. The format will be hybrid to accommodate remote participation.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. (2023). The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=MIREX_HOME&amp;diff=14619</id>
		<title>MIREX HOME</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=MIREX_HOME&amp;diff=14619"/>
		<updated>2025-05-19T23:49:38Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Contact Us */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Welcome to MIREX 2025==&lt;br /&gt;
&lt;br /&gt;
After a break of three years, we brought back the MIREX (Music Information Retrieval Evaluation eXchange) competition in 2024, introducing new tasks, benchmarks, and datasets in response to the rapid development of computer music research.&lt;br /&gt;
&lt;br /&gt;
The MIREX community will hold its annual meeting as part of [https://ismir.net/ The International Society for Music Information Retrieval Conference]. This year, the conference will be held in [https://ismir2025.ismir.net/ Daejeon, South Korea] from September 21-25, 2025.&lt;br /&gt;
&lt;br /&gt;
In the long run, we want to make MIREX a platform for researchers to share their latest research results, compare their systems with others, and promote the development of the field.&lt;br /&gt;
&lt;br /&gt;
==MIREX 2025 Task Descriptions==&lt;br /&gt;
&lt;br /&gt;
To be announced.&lt;br /&gt;
&lt;br /&gt;
==Call for Challenges==&lt;br /&gt;
&lt;br /&gt;
Starting with MIREX 2024, we invite the ISMIR community to propose new research challenges that address cutting-edge problems in Music Information Retrieval (MIR). These challenges should aim to push the boundaries of current research and foster innovation in the field.&lt;br /&gt;
&lt;br /&gt;
We also welcome challenge sponsors from both industry and research institutions, particularly those willing to contribute datasets and computational resources to support the competition.&lt;br /&gt;
&lt;br /&gt;
For the format and requirements for the challenge proposal, please go to [[2025:Call for Challenges]].&lt;br /&gt;
&lt;br /&gt;
===What's New===&lt;br /&gt;
&lt;br /&gt;
Starting with MIREX 2025, we invite the ISMIR community to participate in shaping the future of Music Information Retrieval (MIR) by either '''proposing new research challenges''' or '''volunteering as task captains''' for existing ones. &lt;br /&gt;
&lt;br /&gt;
* '''New challenge proposals''' should aim to address cutting-edge problems and push the boundaries of current MIR research. &lt;br /&gt;
* '''Task captains for established tasks''' are encouraged to help revitalize previous tasks—potentially by updating evaluation methodologies, datasets, or other aspects to reflect recent advances in the field.&lt;br /&gt;
&lt;br /&gt;
Task Captain Responsibilities:&lt;br /&gt;
&lt;br /&gt;
* Register on the [https://www.music-ir.org/mirex MIREX Wiki] and maintain a task description page.&lt;br /&gt;
* Collect submissions via the MIREX submission server (or provide customized submission instructions).&lt;br /&gt;
* Execute and evaluate the submissions.&lt;br /&gt;
* Report results to MIREX and create a results page on the MIREX Wiki.&lt;br /&gt;
* (Optional) Present a MIREX task captain poster at the Late-Breaking and Demo (LBD) session at ISMIR 2025.&lt;br /&gt;
&lt;br /&gt;
==How to Participate==&lt;br /&gt;
&lt;br /&gt;
* Read the [[Participant Agreement]] and task description carefully.&lt;br /&gt;
* Program your system. For some tasks, a docker image is required for submission. See the [[Submission Guidelines]].&lt;br /&gt;
* Write a 2-4 page extended abstract PDF describing your system.&lt;br /&gt;
* Submit your system and extended abstract through the submission portal (to be announced). Check individual task forums for updates.&lt;br /&gt;
* Top-performing teams will have the opportunity to present their MIREX posters at the LBD session at ISMIR 2025.&lt;br /&gt;
&lt;br /&gt;
==Important Dates==&lt;br /&gt;
&lt;br /&gt;
* Challenge proposals due: May 9, 2025&lt;br /&gt;
* Notification of acceptance: May 16, 2025&lt;br /&gt;
* Submission open: May 30, 2025&lt;br /&gt;
* Submission close: See task descriptions for details&lt;br /&gt;
* Result published: See task descriptions for details&lt;br /&gt;
&lt;br /&gt;
==Contact Us==&lt;br /&gt;
&lt;br /&gt;
For general questions, feedback, and suggestions, please write to our mailing list at future-mirex@googlegroups.com. For task-specific inquiries, please email the individual task captains, or post a question in the [http://futuremirex.com/portal/ submission portal] forum.&lt;br /&gt;
&lt;br /&gt;
We are looking forward to seeing you at MIREX 2025!&lt;br /&gt;
&lt;br /&gt;
Future MIREX Team, 2025&lt;br /&gt;
&lt;br /&gt;
MIREX 2025 Organizers:&lt;br /&gt;
* Gus Xia, MBZUAI&lt;br /&gt;
* Junyan Jiang, New York University&lt;br /&gt;
* Akira Maezawa, Yamaha &lt;br /&gt;
* Ziyu Wang, New York University&lt;br /&gt;
* Yixiao Zhang, ByteDance Inc&lt;br /&gt;
* Ruibin Yuan, Hong Kong University of Science and Technology&lt;br /&gt;
* J. Stephen Downie, University of Illinois&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13931</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13931"/>
		<updated>2024-10-18T12:48:44Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
'''Submission''': https://www.codabench.org/competitions/3847/&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided as 320 kbps, 44.1 kHz MP3 encodings.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* While there is no recommended split for training and evaluation, the dataset is intended to be used solely for evaluation purposes. Participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== LP-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that model the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/competitions/3847/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* '''Submission Deadline: October 30, Anywhere on Earth (AoE)'''&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation. &lt;br /&gt;
&lt;br /&gt;
; Example&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;1004034&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;,&lt;br /&gt;
  &amp;quot;1007274&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;,&lt;br /&gt;
  &amp;quot;1009321&amp;quot;: &amp;quot;Energetic rock song with distorted electric guitars, powerful drumming, and passionate vocals, ideal for an intense workout session.&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: Although an audio file in the dataset may correspond to multiple captions, participants need to submit only one generated caption per audio file (identified by track_id). During evaluation, the multiple reference captions are accounted for via multi-reference computation of the metrics. The number of entries in the submitted JSON file should therefore '''match the number of audio files''' in the assessment dataset, not the total number of reference captions.&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Paper =&lt;br /&gt;
&lt;br /&gt;
;Research paper submission&lt;br /&gt;
:Participants are encouraged to submit a technical report to the '''MIREX track''' at ISMIR 2024.&lt;br /&gt;
&lt;br /&gt;
;Workshop presentation&lt;br /&gt;
:We will invite top-ranked participants to present their work during the workshop session. The format will be hybrid to accommodate remote participation.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Main_Page&amp;diff=13913</id>
		<title>2024:Main Page</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Main_Page&amp;diff=13913"/>
		<updated>2024-10-11T12:30:45Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Contact Us */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Welcome to MIREX 2024==&lt;br /&gt;
&lt;br /&gt;
After a three-year break, we are bringing back the MIREX (Music Information Retrieval Evaluation eXchange) competition starting in 2024, introducing new tasks, benchmarks, and datasets in response to the rapid development of computer music research.&lt;br /&gt;
&lt;br /&gt;
The MIREX community will hold its annual meeting as part of [https://ismir.net/ The International Society for Music Information Retrieval Conference]. This year, the conference will be held in [https://ismir2024.ismir.net/ San Francisco, CA, USA and online] from November 10–14, 2024.&lt;br /&gt;
&lt;br /&gt;
In the long run, we want to make MIREX a platform for researchers to share their latest results, compare their systems with others, and promote the development of the field.&lt;br /&gt;
&lt;br /&gt;
==MIREX 2024 Task Descriptions==&lt;br /&gt;
&lt;br /&gt;
We will start with a small set of tasks and expand the list based on the community's feedback. We also welcome volunteers to serve as task captains (TC). The following tasks are currently planned for MIREX 2024:&lt;br /&gt;
&lt;br /&gt;
Traditional MIR tasks&lt;br /&gt;
&lt;br /&gt;
* [[2024:Audio Chord Estimation]] &amp;lt;TC: [mailto:jj2731@nyu.edu Junyan Jiang]&amp;gt;&lt;br /&gt;
* [[2024:Lyrics-to-Audio Alignment]] &amp;lt;TC: [mailto:jj2731@nyu.edu Junyan Jiang]&amp;gt;&lt;br /&gt;
* [[2024:Cover Song Identification]] &amp;lt;TC: [mailto:x.du@rochester.edu Xingjian Du] &amp;amp; [mailto:ruibiny@alumni.cmu.edu Ruibin Yuan]&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Modern MIR tasks&lt;br /&gt;
&lt;br /&gt;
* [[2024:Symbolic Music Generation]] &amp;lt;TC: [mailto:ziyu.wang@nyu.edu Ziyu Wang]&amp;gt;&lt;br /&gt;
* [[2024:Music Audio Generation]] &amp;lt;TC: [mailto:ruibiny@alumni.cmu.edu Ruibin Yuan]&amp;gt;&lt;br /&gt;
* [[2024:Music Description &amp;amp; Captioning]] &amp;lt;TC: [mailto:yixiao.zhang@qmul.ac.uk Yixiao Zhang]&amp;gt;&lt;br /&gt;
* [[2024:Polyphonic Transcription]] &amp;lt;TC: [mailto:yujia.yan@rochester.edu Yujia Yan] &amp;amp; [mailto:ziyu.wang@nyu.edu Ziyu Wang]&amp;gt;&lt;br /&gt;
* [[2024:Singing Voice Deepfake Detection]] &amp;lt;TC: [mailto:you.zhang@rochester.edu Neil Zhang] &amp;amp; [mailto:yixiao.zhang@qmul.ac.uk Yixiao Zhang]&amp;gt; &lt;br /&gt;
&lt;br /&gt;
==Call for Challenges==&lt;br /&gt;
&lt;br /&gt;
Starting from MIREX 2024, we invite proposals for challenges addressing new research problems from the ISMIR community. These challenges should aim to push the boundaries of current research and foster innovation within the field of music information retrieval.&lt;br /&gt;
&lt;br /&gt;
For the format and requirements of a challenge proposal, please see [[2024:Call for Challenges]]. This year's winning challenge proposal is singing voice deepfake detection, which has been added to the task list.&lt;br /&gt;
&lt;br /&gt;
==How to Participate==&lt;br /&gt;
&lt;br /&gt;
* Read the [[Participant Agreement]] and task description carefully.&lt;br /&gt;
* Program your system. For some tasks, a Docker image is required for submission. See the [[Submission Guidelines]].&lt;br /&gt;
* Write a 2-4 page extended abstract PDF describing your system.&lt;br /&gt;
* Submit your system and the extended abstract to the [http://futuremirex.com/portal/ new submission portal]. Be sure to also check the announcements in the forum of each task.&lt;br /&gt;
* Top-rated teams will present their MIREX posters alongside the LBD session at ISMIR 2024 (details will follow).&lt;br /&gt;
&lt;br /&gt;
==Important Dates==&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;del&amp;gt;Challenge proposals due: August 7, 2024&amp;lt;/del&amp;gt;&lt;br /&gt;
* &amp;lt;del&amp;gt;Notification of acceptance for challenge proposals: August 14, 2024&amp;lt;/del&amp;gt;&lt;br /&gt;
* Submission open: September 15, 2024&lt;br /&gt;
* Submission close: around October 15, 2024 (deadlines vary by task; see the task descriptions for details)&lt;br /&gt;
* Results published: around October 21, 2024 (dates vary by task; see the task descriptions for details)&lt;br /&gt;
&lt;br /&gt;
==Contact Us==&lt;br /&gt;
&lt;br /&gt;
For general questions, feedback, and suggestions, please write to our mailing list at future-mirex@googlegroups.com. For task-specific inquiries, please email the individual task captains, or post a question in the [http://futuremirex.com/portal/ submission portal] forum.&lt;br /&gt;
&lt;br /&gt;
We are looking forward to seeing you at MIREX 2024!&lt;br /&gt;
&lt;br /&gt;
Future MIREX Team, 2024&lt;br /&gt;
&lt;br /&gt;
MIREX 2024 Organizers:&lt;br /&gt;
* Junyan Jiang, New York University&lt;br /&gt;
* Akira Maezawa, Yamaha&lt;br /&gt;
* Ziyu Wang, New York University&lt;br /&gt;
* Yixiao Zhang, Queen Mary University of London&lt;br /&gt;
* Ruibin Yuan, Hong Kong University of Science and Technology&lt;br /&gt;
* J. Stephen Downie, University of Illinois&lt;br /&gt;
* Gus Xia, MBZUAI&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Singing_Voice_Deepfake_Detection&amp;diff=13912</id>
		<title>2024:Singing Voice Deepfake Detection</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Singing_Voice_Deepfake_Detection&amp;diff=13912"/>
		<updated>2024-10-10T16:22:00Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The WildSVDD challenge aims to detect AI-generated singing voices in real-world scenarios. The task involves distinguishing authentic human-sung songs from AI-generated deepfake songs at the clip level. Participants are required to identify whether each segmented clip contains a genuine singer or an AI-generated fake singer. The developed systems are expected to account for the complexities introduced by background music and various musical contexts. For more information about our prior work, please visit: https://main.singfake.org/&lt;br /&gt;
&lt;br /&gt;
;Background&lt;br /&gt;
:With the advancement of AI technology, singing voices generated by AI are becoming increasingly indistinguishable from human performances. These synthesized voices can now emulate the vocal characteristics of any singer with minimal training data. While this technological advancement is impressive, it has sparked widespread concerns among artists, record labels, and publishing houses. The potential for unauthorized synthetic reproductions that mimic well-known singers poses a real threat to original artists' commercial value and intellectual property rights, igniting urgent calls for efficient and accurate methods to detect these deepfake singing voices.&lt;br /&gt;
&lt;br /&gt;
:This challenge is an extension of our previous work SingFake [1] and was initially introduced at the 2024 IEEE Spoken Language Technology Workshop (SLT 2024) [2] with the CtrSVDD and WildSVDD tracks. The CtrSVDD track [3] garnered significant attention from the speech community. We aim to raise awareness of WildSVDD within the ISMIR community and draw on the expertise of the music research community.&lt;br /&gt;
&lt;br /&gt;
:[1] Zang, Yongyi, You Zhang, Mojtaba Heydari, and Zhiyao Duan. &amp;quot;SingFake: Singing voice deepfake detection.&amp;quot; In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 12156-12160. IEEE, 2024. https://ieeexplore.ieee.org/document/10448184 &lt;br /&gt;
&lt;br /&gt;
:[2] Zhang, You, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, and Zhiyao Duan. &amp;quot;SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge.&amp;quot; In Proc. IEEE Spoken Language Technology (SLT), 2024. https://arxiv.org/abs/2408.16132&lt;br /&gt;
&lt;br /&gt;
:[3] Zang, Yongyi, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu et al. &amp;quot;CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection.&amp;quot; In Proc. Interspeech, pp. 4783-4787, 2024. https://doi.org/10.21437/Interspeech.2024-2242&lt;br /&gt;
&lt;br /&gt;
Contact: [mailto:you.zhang@rochester.edu Neil Zhang] &amp;amp; [mailto:yixiao.zhang@qmul.ac.uk Yixiao Zhang]&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
;Description&lt;br /&gt;
:The WildSVDD dataset is an extension of the SingFake dataset, now expanded to include a more diverse and comprehensive collection of real and AI-generated singing voice clips. We gathered data annotations from social media platforms. The annotators, who were familiar with the singers they covered, manually verified the user-specified labels during the annotation process to ensure accuracy, especially in cases where the singer(s) did not actually perform certain songs. We cross-checked the annotations against song titles and descriptions and manually reviewed any discrepancies for further verification. See the &amp;quot;Download&amp;quot; section for details.&lt;br /&gt;
&lt;br /&gt;
;Description of Audio Files&lt;br /&gt;
:The audio files in the WildSVDD dataset represent a broad range of languages and singers. These clips include strong background music, simulating real-world conditions that challenge the distinction between real and AI-generated voices. The dataset ensures diversity in the source material, with varying levels of complexity in the musical contexts.&lt;br /&gt;
&lt;br /&gt;
;Description of Split&lt;br /&gt;
:The dataset is divided into training and evaluation subsets. Test Set A includes new samples, while Test Set B represents the most challenging subset of the SingFake dataset. Participants are permitted to use the training data to create validation sets but must adhere to restrictions on the usage of the evaluation data.&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
;Model Architecture&lt;br /&gt;
:Participants are referred to baseline systems from the SingFake [1] and SingGraph [2] projects. SingGraph includes state-of-the-art components for detecting AI-generated singing voices, incorporating advanced techniques like graph modeling. The key features of these baselines include robust handling of background music and adaptation to different musical styles. Some results of how baseline systems in SingFake perform on the WildSVDD test data can be found in our SVDD@SLT challenge overview paper [3]. &lt;br /&gt;
&lt;br /&gt;
:[1] SingFake: https://github.com/yongyizang/SingFake&lt;br /&gt;
&lt;br /&gt;
:[2] SingGraph: https://github.com/xjchenGit/SingGraph&lt;br /&gt;
&lt;br /&gt;
:[3] SVDD 2024@SLT: https://arxiv.org/abs/2408.16132&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
The primary metric for evaluation is Equal Error Rate (EER), which reflects the system's ability to distinguish between bonafide and deepfake singing voices regardless of the threshold set. EER is preferred over accuracy as it does not depend on a fixed threshold, providing a more reliable assessment of system performance. A lower EER indicates a better distinction between real and AI-generated voices.&lt;br /&gt;
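&lt;br /&gt;
For reference, below is a minimal NumPy sketch of how EER can be computed from system scores. It assumes that higher scores indicate bonafide singing; the organizers' scoring script may locate the crossing point differently (e.g., with interpolation).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def compute_eer(bonafide_scores, spoof_scores):&lt;br /&gt;
    # Assumes a higher score means the clip is more likely bonafide.&lt;br /&gt;
    scores = np.concatenate([bonafide_scores, spoof_scores])&lt;br /&gt;
    labels = np.concatenate([np.ones(len(bonafide_scores)),&lt;br /&gt;
                             np.zeros(len(spoof_scores))])&lt;br /&gt;
    order = np.argsort(scores)  # sweep the decision threshold upward&lt;br /&gt;
    labels = labels[order]&lt;br /&gt;
    # False rejections: bonafide clips at or below the threshold.&lt;br /&gt;
    frr = np.cumsum(labels) / labels.sum()&lt;br /&gt;
    # False acceptances: spoof clips above the threshold.&lt;br /&gt;
    far = 1.0 - np.cumsum(1.0 - labels) / (1.0 - labels).sum()&lt;br /&gt;
    idx = np.argmin(np.abs(frr - far))  # where the two error rates cross&lt;br /&gt;
    return float((frr[idx] + far[idx]) / 2)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;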
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The dataset and necessary resources can be accessed via the following links:&lt;br /&gt;
&lt;br /&gt;
* Dataset download: [https://zenodo.org/records/10893604 Zenodo WildSVDD]&lt;br /&gt;
* Download tools: https://pastebin.com/bFeruNA0, https://cobalt.tools/, https://github.com/ytdl-org/youtube-dl, https://github.com/yt-dlp/yt-dlp, https://www.locoloader.com/bilibili-video-downloader/&lt;br /&gt;
* Segmentation tool: [https://github.com/yongyizang/SingFake/tree/main/dataset SingFake GitHub]&lt;br /&gt;
&lt;br /&gt;
Participants are encouraged to use the provided tools to download and segment song clips to ensure consistency in evaluation. If you have concerns about downloading data, please reach out to [mailto:svddchallenge@gmail.com svddchallenge@gmail.com].&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
Participants are allowed to use any publicly available datasets for training, excluding those used in the test set. Any additional data sources or pre-trained models must be clearly documented in the system descriptions. Private data or models are strictly prohibited to maintain fairness. All submissions should focus on segment-level evaluation, with results presented in a score file format.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* '''Submission Deadline: October 20, AOE'''&lt;br /&gt;
&lt;br /&gt;
;Results submission&lt;br /&gt;
&lt;br /&gt;
:Participants should submit a score TXT file that includes the URLs, segment start and end timestamps, and the corresponding scores indicating the system's confidence in identifying bonafide or deepfake clips. Submissions will be evaluated based on EER, and the results will be ranked accordingly.&lt;br /&gt;
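&lt;br /&gt;
:A minimal sketch of writing such a score file is given below. The space-delimited column order (URL, start, end, score) is an assumption for illustration; please follow any exact format announced by the organizers.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def write_score_file(path, predictions):&lt;br /&gt;
    # predictions: iterable of (url, start_sec, end_sec, score) tuples,&lt;br /&gt;
    # where score is the system's confidence that the clip is bonafide.&lt;br /&gt;
    with open(path, 'w') as f:&lt;br /&gt;
        for url, start, end, score in predictions:&lt;br /&gt;
            f.write(f'{url} {start:.2f} {end:.2f} {score:.6f}\n')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;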
&lt;br /&gt;
;System description submission&lt;br /&gt;
:Participants are required to describe their system, including the data preprocessing, model architecture, training details, post-processing, etc.&lt;br /&gt;
&lt;br /&gt;
;Research paper submission&lt;br /&gt;
:Participants are encouraged to submit a research paper to the '''MIREX track''' at ISMIR 2024. &lt;br /&gt;
&lt;br /&gt;
;Workshop presentation&lt;br /&gt;
:We will invite top-ranked participants to present their work during the workshop session. The format will be hybrid to accommodate remote participation.&lt;br /&gt;
&lt;br /&gt;
Please send your submission to [mailto:you.zhang@rochester.edu Neil Zhang].&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13889</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13889"/>
		<updated>2024-09-16T14:33:00Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Task Description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
'''Submission''': https://www.codabench.org/competitions/3847/&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320kbps 44.1 kHz MP3 audio encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* While there is no recommended split for training and evaluation, the dataset is intended to be used solely for evaluation purposes. Participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== LP-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms (see the sketch after this list), which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that model the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
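As a concrete illustration of the encoder front-end described in the first bullet above, the snippet below converts a 10-second excerpt into a log-mel spectrogram with librosa. The sample rate, window, hop, and mel settings are illustrative assumptions, not the baseline's published configuration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import librosa&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def log_mel_spectrogram(path, sr=16000, duration=10.0, n_mels=128):&lt;br /&gt;
    # Load a fixed-length excerpt and compute a log-scaled mel spectrogram.&lt;br /&gt;
    y, sr = librosa.load(path, sr=sr, duration=duration)&lt;br /&gt;
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,&lt;br /&gt;
                                         hop_length=256, n_mels=n_mels)&lt;br /&gt;
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, frames)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;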
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
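For participants who want to sanity-check their captions locally, the sketch below computes BLEU-1 to BLEU-4 and BERT-Score with the widely used nltk and bert-score packages. The choice of these packages is an assumption, not the official scoring stack, so local numbers may differ slightly from the leaderboard.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction&lt;br /&gt;
from bert_score import score as bert_score&lt;br /&gt;
&lt;br /&gt;
def local_metrics(candidate, references):&lt;br /&gt;
    # references: list of reference caption strings for one clip.&lt;br /&gt;
    hyp = candidate.lower().split()&lt;br /&gt;
    refs = [r.lower().split() for r in references]&lt;br /&gt;
    smooth = SmoothingFunction().method1&lt;br /&gt;
    bleu = {}&lt;br /&gt;
    for n in (1, 2, 3, 4):&lt;br /&gt;
        weights = tuple(1.0 / n for _ in range(n))&lt;br /&gt;
        bleu[f'B{n}'] = sentence_bleu(refs, hyp, weights=weights,&lt;br /&gt;
                                      smoothing_function=smooth)&lt;br /&gt;
    # BERT-Score against each reference; keep the best F1 for the clip.&lt;br /&gt;
    _, _, f1 = bert_score([candidate] * len(references), references, lang='en')&lt;br /&gt;
    return bleu, float(f1.max())&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;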
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/competitions/3847/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* '''Submission Deadline: October 15, AOE'''&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation. &lt;br /&gt;
&lt;br /&gt;
; Example&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;1004034&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;,&lt;br /&gt;
  &amp;quot;1007274&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;,&lt;br /&gt;
  &amp;quot;1009321&amp;quot;: &amp;quot;Energetic rock song with distorted electric guitars, powerful drumming, and passionate vocals, ideal for an intense workout session.&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: Although an audio file in the dataset may correspond to multiple captions, participants only need to submit one generated caption for each audio file (identified by track_id). During the evaluation phase, multiple reference captions will be reflected in the calculation of metrics through the multi-reference evaluation method. The number of entries in the submitted JSON file should '''match the number of audio files''' in the assessment dataset, not the total number of original description texts.&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Paper =&lt;br /&gt;
&lt;br /&gt;
;Research paper submission&lt;br /&gt;
:Participants are encouraged to submit a technical report to the '''MIREX track''' at ISMIR 2024.&lt;br /&gt;
&lt;br /&gt;
;Workshop presentation&lt;br /&gt;
:We will invite top-ranked participants to present their work during the workshop session. The format will be hybrid to accommodate remote participation.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13888</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13888"/>
		<updated>2024-09-16T14:32:47Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320kbps 44.1 kHz MP3 audio encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* While there is no recommended split for training and evaluation, the dataset is intended to be used solely for evaluation purposes. Participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== LP-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/competitions/3847/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* '''Submission Deadline: October 15, AOE'''&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation. &lt;br /&gt;
&lt;br /&gt;
; Example&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;1004034&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;,&lt;br /&gt;
  &amp;quot;1007274&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;,&lt;br /&gt;
  &amp;quot;1009321&amp;quot;: &amp;quot;Energetic rock song with distorted electric guitars, powerful drumming, and passionate vocals, ideal for an intense workout session.&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: Although an audio file in the dataset may correspond to multiple captions, participants only need to submit one generated caption for each audio file (identified by track_id). During the evaluation phase, multiple reference captions will be reflected in the calculation of metrics through the multi-reference evaluation method. The number of entries in the submitted JSON file should '''match the number of audio files''' in the assessment dataset, not the total number of original description texts.&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Paper =&lt;br /&gt;
&lt;br /&gt;
;Research paper submission&lt;br /&gt;
:Participants are encouraged to submit a technical report to the '''MIREX track''' at ISMIR 2024.&lt;br /&gt;
&lt;br /&gt;
;Workshop presentation&lt;br /&gt;
:We will invite top-ranked participants to present their work during the workshop session. The format will be hybrid to accommodate remote participation.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Singing_Voice_Deepfake_Detection&amp;diff=13746</id>
		<title>2024:Singing Voice Deepfake Detection</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Singing_Voice_Deepfake_Detection&amp;diff=13746"/>
		<updated>2024-09-10T12:59:46Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The WildSVDD challenge aims to detect AI-generated singing voices in real-world scenarios. The task involves distinguishing authentic human-sung songs from AI-generated deepfake songs at the clip level. Participants are required to identify whether each segmented clip contains a genuine singer or an AI-generated fake singer. The developed systems are expected to account for the complexities introduced by background music and various musical contexts. &lt;br /&gt;
&lt;br /&gt;
;Background&lt;br /&gt;
:With the advancement of AI technology, singing voices generated by AI are becoming increasingly indistinguishable from human performances. These synthesized voices can now emulate the vocal characteristics of any singer with minimal training data. While this technological advancement is impressive, it has sparked widespread concerns among artists, record labels, and publishing houses. The potential for unauthorized synthetic reproductions that mimic well-known singers poses a real threat to original artists' commercial value and intellectual property rights, igniting urgent calls for efficient and accurate methods to detect these deepfake singing voices.&lt;br /&gt;
&lt;br /&gt;
:This challenge is an extension of our previous work SingFake [1] and was initially introduced at the 2024 IEEE Spoken Language Technology Workshop (SLT 2024) [2] with the CtrSVDD and WildSVDD tracks. The CtrSVDD track [3] garnered significant attention from the speech community. We aim to raise awareness of WildSVDD within the ISMIR community and draw on the expertise of the music research community.&lt;br /&gt;
&lt;br /&gt;
:[1] Zang, Yongyi, You Zhang, Mojtaba Heydari, and Zhiyao Duan. &amp;quot;SingFake: Singing voice deepfake detection.&amp;quot; In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 12156-12160. IEEE, 2024. https://ieeexplore.ieee.org/document/10448184 &lt;br /&gt;
&lt;br /&gt;
:[2] Zhang, You, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, and Zhiyao Duan. &amp;quot;SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge.&amp;quot; In Proc. IEEE Spoken Language Technology (SLT), 2024. https://arxiv.org/abs/2408.16132&lt;br /&gt;
&lt;br /&gt;
:[3] Zang, Yongyi, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu et al. &amp;quot;CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection.&amp;quot; In Proc. Interspeech, pp. 4783-4787, 2024. https://doi.org/10.21437/Interspeech.2024-2242&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
;Description&lt;br /&gt;
:The WildSVDD dataset is an extension of the SingFake dataset, now expanded to include a more diverse and comprehensive collection of real and AI-generated singing voice clips. We gathered data annotations from social media platforms. The annotators, who were familiar with the singers they covered, manually verified the user-specified labels during the annotation process to ensure accuracy, especially in cases where the singer(s) did not actually perform certain songs. We cross-checked the annotations against song titles and descriptions and manually reviewed any discrepancies for further verification.&lt;br /&gt;
&lt;br /&gt;
;Description of Audio Files&lt;br /&gt;
:The audio files in the WildSVDD dataset represent a broad range of languages and singers. These clips include strong background music, simulating real-world conditions that challenge the distinction between real and AI-generated voices. The dataset ensures diversity in the source material, with varying levels of complexity in the musical contexts.&lt;br /&gt;
&lt;br /&gt;
;Description of Split&lt;br /&gt;
:The dataset is divided into training and evaluation subsets. Test Set A includes new samples, while Test Set B represents the most challenging subset of the SingFake dataset. Participants are permitted to use the training data to create validation sets but must adhere to restrictions on the usage of the evaluation data.&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
;Model Architecture&lt;br /&gt;
:Participants are referred to baseline systems from the SingFake [1] and SingGraph [2] projects. SingGraph includes state-of-the-art components for detecting AI-generated singing voices, incorporating advanced techniques like graph modeling. The key features of these baselines include robust handling of background music and adaptation to different musical styles. Some results of how baseline systems in SingFake perform on the WildSVDD test data can be found in our SVDD@SLT challenge overview paper [3]. &lt;br /&gt;
&lt;br /&gt;
:[1] SingFake: https://github.com/yongyizang/SingFake&lt;br /&gt;
&lt;br /&gt;
:[2] SingGraph: https://github.com/xjchenGit/SingGraph&lt;br /&gt;
&lt;br /&gt;
:[3] SVDD 2024@SLT: https://arxiv.org/abs/2408.16132&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
The primary metric for evaluation is Equal Error Rate (EER), which reflects the system's ability to distinguish between bonafide and deepfake singing voices regardless of the threshold set. EER is preferred over accuracy as it does not depend on a fixed threshold, providing a more reliable assessment of system performance. A lower EER indicates a better distinction between real and AI-generated voices.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The dataset and necessary resources can be accessed via the following links:&lt;br /&gt;
&lt;br /&gt;
* Dataset download: [https://zenodo.org/records/10893604 Zenodo WildSVDD]&lt;br /&gt;
* Download tools: https://pastebin.com/YhpYXT9z, https://cobalt.tools/, https://github.com/ytdl-org/youtube-dl, https://github.com/yt-dlp/yt-dlp, https://www.locoloader.com/bilibili-video-downloader/&lt;br /&gt;
* Segmentation tool: [https://github.com/yongyizang/SingFake/tree/main/dataset SingFake GitHub]&lt;br /&gt;
&lt;br /&gt;
Participants are encouraged to use the provided tools to download and segment song clips to ensure consistency in evaluation.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
Participants are allowed to use any publicly available datasets for training, excluding those used in the test set. Any additional data sources or pre-trained models must be clearly documented in the system descriptions. Private data or models are strictly prohibited to maintain fairness. All submissions should focus on segment-level evaluation, with results presented in a score file format.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* '''Submission Deadline: October 15, AOE'''&lt;br /&gt;
&lt;br /&gt;
;Results submission&lt;br /&gt;
&lt;br /&gt;
:Participants should submit a score TXT file that includes the URLs, segment start and end timestamps, and the corresponding scores indicating the system's confidence in identifying bonafide or deepfake clips. Submissions will be evaluated based on EER, and the results will be ranked accordingly.&lt;br /&gt;
&lt;br /&gt;
;System description submission&lt;br /&gt;
:Participants are required to describe their system, including the data preprocessing, model architecture, training details, post-processing, etc.&lt;br /&gt;
&lt;br /&gt;
;Research paper submission&lt;br /&gt;
:Participants are encouraged to submit a research paper to the '''MIREX track''' at ISMIR 2024. &lt;br /&gt;
&lt;br /&gt;
;Workshop presentation&lt;br /&gt;
:We will invite top-ranked participants to present their work during the workshop session. The format will be hybrid to accommodate remote participation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13745</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13745"/>
		<updated>2024-09-10T12:59:25Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320kbps 44.1 kHz MP3 audio encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* While there is no recommended split for training and evaluation, the dataset is intended to be used solely for evaluation purposes. Participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== LP-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* '''Submission Deadline: October 15, AOE'''&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation. &lt;br /&gt;
&lt;br /&gt;
; Example&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;1004034&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;,&lt;br /&gt;
  &amp;quot;1007274&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;,&lt;br /&gt;
  &amp;quot;1009321&amp;quot;: &amp;quot;Energetic rock song with distorted electric guitars, powerful drumming, and passionate vocals, ideal for an intense workout session.&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: Although an audio file in the dataset may correspond to multiple captions, participants only need to submit one generated caption for each audio file (identified by track_id). During the evaluation phase, multiple reference captions will be reflected in the calculation of metrics through the multi-reference evaluation method. The number of entries in the submitted JSON file should '''match the number of audio files''' in the assessment dataset, not the total number of original description texts.&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Paper =&lt;br /&gt;
&lt;br /&gt;
;Research paper submission&lt;br /&gt;
:Participants are encouraged to submit a technical report to the '''MIREX track''' at ISMIR 2024.&lt;br /&gt;
&lt;br /&gt;
;Workshop presentation&lt;br /&gt;
:We will invite top-ranked participants to present their work during the workshop session. The format will be hybrid to accommodate remote participation.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13744</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13744"/>
		<updated>2024-09-10T12:59:08Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320kbps 44.1 kHz MP3 audio encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* While there is no recommended split for training and evaluation, the dataset is intended to be used solely for evaluation purposes. Participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== LP-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* Submission Deadline: October 15, Anywhere on Earth (AoE)&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation. &lt;br /&gt;
&lt;br /&gt;
; Example&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;1004034&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;,&lt;br /&gt;
  &amp;quot;1007274&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;,&lt;br /&gt;
  &amp;quot;1009321&amp;quot;: &amp;quot;Energetic rock song with distorted electric guitars, powerful drumming, and passionate vocals, ideal for an intense workout session.&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: Although an audio file in the dataset may correspond to multiple captions, participants submit exactly one generated caption per audio file (identified by track_id). During evaluation, multiple reference captions are handled through multi-reference metric computation. The number of entries in the submitted JSON file should '''match the number of audio files''' in the assessment dataset, not the total number of reference captions.&lt;br /&gt;
&lt;br /&gt;
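Before uploading, a submission can be sanity-checked with a short Python script such as the hedged sketch below; the file name submission.json and the track_id list are hypothetical placeholders, since the actual evaluation file list is defined by the organizers.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import json&lt;br /&gt;
&lt;br /&gt;
# Hypothetical inputs: the real evaluation id list comes from the organizers.&lt;br /&gt;
with open('submission.json') as f:&lt;br /&gt;
    captions = json.load(f)          # {track_id: generated caption}&lt;br /&gt;
track_ids = ['1004034', '1007274', '1009321']&lt;br /&gt;
&lt;br /&gt;
# Exactly one caption per audio file, keyed by track_id.&lt;br /&gt;
assert set(captions) == set(track_ids)&lt;br /&gt;
assert all(isinstance(c, str) and c.strip() for c in captions.values())&lt;br /&gt;
print(f'OK: {len(captions)} captions for {len(track_ids)} audio files')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;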
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Paper =&lt;br /&gt;
&lt;br /&gt;
;Research paper submission&lt;br /&gt;
:Participants are encouraged to submit a technical report to the '''MIREX track''' at ISMIR 2024.&lt;br /&gt;
&lt;br /&gt;
;Workshop presentation&lt;br /&gt;
:We will invite top-ranked participants to present their work during the workshop session. The format will be hybrid to accommodate remote participation.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. (2023). The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13742</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13742"/>
		<updated>2024-09-10T12:28:54Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320kbps 44.1 kHz MP3 audio encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* The dataset has no recommended training/validation split because it is intended solely for evaluation: participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== LP-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation. &lt;br /&gt;
&lt;br /&gt;
; Example&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;1004034&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;,&lt;br /&gt;
  &amp;quot;1007274&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;,&lt;br /&gt;
  &amp;quot;1009321&amp;quot;: &amp;quot;Energetic rock song with distorted electric guitars, powerful drumming, and passionate vocals, ideal for an intense workout session.&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: Although an audio file in the dataset may correspond to multiple captions, participants submit exactly one generated caption per audio file (identified by track_id). During evaluation, multiple reference captions are handled through multi-reference metric computation. The number of entries in the submitted JSON file should '''match the number of audio files''' in the assessment dataset, not the total number of reference captions.&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Paper =&lt;br /&gt;
&lt;br /&gt;
;Research paper submission&lt;br /&gt;
:Participants are encouraged to submit a technical report to the '''MIREX track''' at ISMIR 2024.&lt;br /&gt;
&lt;br /&gt;
;Workshop presentation&lt;br /&gt;
:We will invite top-ranked participants to present their work during the workshop session. The format will be hybrid to accommodate remote participation.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. (2023). The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Main_Page&amp;diff=13741</id>
		<title>2024:Main Page</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Main_Page&amp;diff=13741"/>
		<updated>2024-09-10T12:27:54Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Important Dates */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Welcome to MIREX 2024==&lt;br /&gt;
&lt;br /&gt;
After a three-year break, we are bringing back the MIREX (Music Information Retrieval Evaluation eXchange) competition in 2024, introducing new tasks, benchmarks, and datasets in response to the rapid development of computer music research.&lt;br /&gt;
&lt;br /&gt;
The MIREX community will hold its annual meeting as part of [https://ismir.net/ The International Society for Music Information Retrieval Conference]. This year, the conference will be held in [https://ismir2024.ismir.net/ San Francisco, CA, USA and online] from November 10–14, 2024.&lt;br /&gt;
&lt;br /&gt;
In the long run, we want to make MIREX a platform where researchers share their latest results, compare their systems with others, and help advance the field.&lt;br /&gt;
&lt;br /&gt;
==MIREX 2024 Evaluation Tasks==&lt;br /&gt;
&lt;br /&gt;
We will start with a small set of tasks and expand the list based on the community's feedback. We also welcome volunteers to serve as task leaders. The following tasks are currently planned for MIREX 2024:&lt;br /&gt;
&lt;br /&gt;
Traditional MIR tasks&lt;br /&gt;
&lt;br /&gt;
* [[2024:Audio Chord Estimation]] &amp;lt;TC: Junyan Jiang&amp;gt;&lt;br /&gt;
* [[2024:Lyrics-to-Audio Alignment]] &amp;lt;TC: Junyan Jiang&amp;gt;&lt;br /&gt;
* [[2024:Cover Song Identification]] &amp;lt;TC: Ruibin Yuan&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Modern MIR tasks&lt;br /&gt;
&lt;br /&gt;
* [[2024:Symbolic Music Generation]] &amp;lt;TC: Ziyu Wang&amp;gt;&lt;br /&gt;
* [[2024:Music Audio Generation]] &amp;lt;TC: Ruibin Yuan&amp;gt;&lt;br /&gt;
* [[2024:Music Description &amp;amp; Captioning]] &amp;lt;TC: Yixiao Zhang&amp;gt;&lt;br /&gt;
* [[2024:Polyphonic Transcription]] &amp;lt;TC: Yujia Yan &amp;amp; Ziyu Wang&amp;gt;&lt;br /&gt;
* [[2024:Singing Voice Deepfake Detection]] &amp;lt;TC: Neil Zhang &amp;amp; Yixiao Zhang&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We also invite proposals for new challenges in MIR. For details, see [[2024:Call for Challenges]].&lt;br /&gt;
&lt;br /&gt;
==Call for Challenges==&lt;br /&gt;
&lt;br /&gt;
Starting from MIREX 2024, we invite proposals for challenges addressing new research problems from the ISMIR community. These challenges should aim to push the boundaries of current research and foster innovation within the field of music information retrieval.&lt;br /&gt;
&lt;br /&gt;
For the format and requirements for the challenge proposal, please go to [[2024:Call for Challenges]].&lt;br /&gt;
&lt;br /&gt;
==Call for Task Captains==&lt;br /&gt;
&lt;br /&gt;
The call for task captains is not open this year, as MIREX is running at a limited scale; it will open next year.&lt;br /&gt;
&lt;br /&gt;
==How to Participate==&lt;br /&gt;
&lt;br /&gt;
* Read the task descriptions and the submission guidelines (coming soon).&lt;br /&gt;
* Wrap your system in a Docker container. Required dependencies should be specified in the Dockerfile. Make sure to keep the I/O format consistent with the task description.&lt;br /&gt;
* Write a 2-3 page extended abstract PDF describing your system.&lt;br /&gt;
* Submit your system and the extended abstract to the new submission portal (coming soon).&lt;br /&gt;
* Note that some tasks will use external submission systems (e.g., Kaggle) or email submission this year; please refer to the task description for details.&lt;br /&gt;
&lt;br /&gt;
==Important Dates==&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;del&amp;gt;Challenge proposals due: August 7, 2024&amp;lt;/del&amp;gt;&lt;br /&gt;
* &amp;lt;del&amp;gt;Notification of acceptance for challenge proposals: August 14, 2024&amp;lt;/del&amp;gt;&lt;br /&gt;
* '''Submission open: September 15, 2024'''&lt;br /&gt;
* Submission close: starting from October 15, 2024 (deadlines vary by task; see the task descriptions for details)&lt;br /&gt;
* Results published: starting from October 21, 2024 (dates vary by task; see the task descriptions for details)&lt;br /&gt;
&lt;br /&gt;
==Contact Us==&lt;br /&gt;
&lt;br /&gt;
Since we are adapting MIREX to a new format, we welcome any feedback, suggestions, and task leader volunteers. Please send your email to [mailto:future-mirex@googlegroups.com future-mirex@googlegroups.com].&lt;br /&gt;
&lt;br /&gt;
We are looking forward to seeing you at MIREX 2024!&lt;br /&gt;
&lt;br /&gt;
Future MIREX Team, 2024&lt;br /&gt;
&lt;br /&gt;
MIREX 2024 Organizers:&lt;br /&gt;
* Junyan Jiang, New York University&lt;br /&gt;
* Akira Maezawa, Yamaha&lt;br /&gt;
* Ziyu Wang, New York University&lt;br /&gt;
* Yixiao Zhang, Queen Mary University of London&lt;br /&gt;
* Ruibin Yuan, Hong Kong University of Science and Technology&lt;br /&gt;
* J. Stephen Downie, University of Illinois&lt;br /&gt;
* Gus Xia, MBZUAI&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Main_Page&amp;diff=13740</id>
		<title>2024:Main Page</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Main_Page&amp;diff=13740"/>
		<updated>2024-09-10T12:27:47Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Contact Us */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Welcome to MIREX 2024==&lt;br /&gt;
&lt;br /&gt;
After a three-year break, we are bringing back the MIREX (Music Information Retrieval Evaluation eXchange) competition in 2024, introducing new tasks, benchmarks, and datasets in response to the rapid development of computer music research.&lt;br /&gt;
&lt;br /&gt;
The MIREX community will hold its annual meeting as part of [https://ismir.net/ The International Society for Music Information Retrieval Conference]. This year, the conference will be held in [https://ismir2024.ismir.net/ San Francisco, CA, USA and online] from November 10–14, 2024.&lt;br /&gt;
&lt;br /&gt;
In the long run, we want to make MIREX a platform where researchers share their latest results, compare their systems with others, and help advance the field.&lt;br /&gt;
&lt;br /&gt;
==MIREX 2024 Evaluation Tasks==&lt;br /&gt;
&lt;br /&gt;
We will start with a small set of tasks and expand the list based on the community's feedback. We also welcome volunteers to serve as task leaders. The following tasks are currently planned for MIREX 2024:&lt;br /&gt;
&lt;br /&gt;
Traditional MIR tasks&lt;br /&gt;
&lt;br /&gt;
* [[2024:Audio Chord Estimation]] &amp;lt;TC: Junyan Jiang&amp;gt;&lt;br /&gt;
* [[2024:Lyrics-to-Audio Alignment]] &amp;lt;TC: Junyan Jiang&amp;gt;&lt;br /&gt;
* [[2024:Cover Song Identification]] &amp;lt;TC: Ruibin Yuan&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Modern MIR tasks&lt;br /&gt;
&lt;br /&gt;
* [[2024:Symbolic Music Generation]] &amp;lt;TC: Ziyu Wang&amp;gt;&lt;br /&gt;
* [[2024:Music Audio Generation]] &amp;lt;TC: Ruibin Yuan&amp;gt;&lt;br /&gt;
* [[2024:Music Description &amp;amp; Captioning]] &amp;lt;TC: Yixiao Zhang&amp;gt;&lt;br /&gt;
* [[2024:Polyphonic Transcription]] &amp;lt;TC: Yujia Yan &amp;amp; Ziyu Wang&amp;gt;&lt;br /&gt;
* [[2024:Singing Voice Deepfake Detection]] &amp;lt;TC: Neil Zhang &amp;amp; Yixiao Zhang&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We also invite proposals for new challenges in MIR. For details, see [[2024:Call for Challenges]].&lt;br /&gt;
&lt;br /&gt;
==Call for Challenges==&lt;br /&gt;
&lt;br /&gt;
Starting from MIREX 2024, we invite proposals for challenges addressing new research problems from the ISMIR community. These challenges should aim to push the boundaries of current research and foster innovation within the field of music information retrieval.&lt;br /&gt;
&lt;br /&gt;
For the format and requirements for the challenge proposal, please go to [[2024:Call for Challenges]].&lt;br /&gt;
&lt;br /&gt;
==Call for Task Captains==&lt;br /&gt;
&lt;br /&gt;
The call for task captains is not open this year, as MIREX is running at a limited scale; it will open next year.&lt;br /&gt;
&lt;br /&gt;
==How to Participate==&lt;br /&gt;
&lt;br /&gt;
* Read the task descriptions and the submission guidelines (coming soon).&lt;br /&gt;
* Wrap your system in a Docker container. Required dependencies should be specified in the Dockerfile. Make sure to keep the I/O format consistent with the task description.&lt;br /&gt;
* Write a 2-3 page extended abstract PDF describing your system.&lt;br /&gt;
* Submit your system and the extended abstract to the new submission portal (coming soon).&lt;br /&gt;
* Note that some tasks will use external submission systems (e.g., Kaggle) or email submission this year; please refer to the task description for details.&lt;br /&gt;
&lt;br /&gt;
==Important Dates==&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;del&amp;gt;Challenge proposals due: August 7, 2024&amp;lt;/del&amp;gt;&lt;br /&gt;
* &amp;lt;del&amp;gt;Notification of acceptance for challenge proposals: August 14, 2024&amp;lt;/del&amp;gt;&lt;br /&gt;
* Submission open: September 15, 2024&lt;br /&gt;
* Submission close: starting from October 15, 2024 (deadlines vary by task; see the task descriptions for details)&lt;br /&gt;
* Results published: starting from October 21, 2024 (dates vary by task; see the task descriptions for details)&lt;br /&gt;
&lt;br /&gt;
==Contact Us==&lt;br /&gt;
&lt;br /&gt;
Since we are adapting MIREX to a new format, we welcome any feedback, suggestions, and task leader volunteers. Please send your email to [mailto:future-mirex@googlegroups.com future-mirex@googlegroups.com].&lt;br /&gt;
&lt;br /&gt;
We are looking forward to seeing you at MIREX 2024!&lt;br /&gt;
&lt;br /&gt;
Future MIREX Team, 2024&lt;br /&gt;
&lt;br /&gt;
MIREX 2024 Organizers:&lt;br /&gt;
* Junyan Jiang, New York University&lt;br /&gt;
* Akira Maezawa, Yamaha&lt;br /&gt;
* Ziyu Wang, New York University&lt;br /&gt;
* Yixiao Zhang, Queen Mary University of London&lt;br /&gt;
* Ruibin Yuan, Hong Kong University of Science and Technology&lt;br /&gt;
* J. Stephen Downie, University of Illinois&lt;br /&gt;
* Gus Xia, MBZUAI&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Singing_Voice_Deepfake_Detection&amp;diff=13739</id>
		<title>2024:Singing Voice Deepfake Detection</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Singing_Voice_Deepfake_Detection&amp;diff=13739"/>
		<updated>2024-09-10T12:27:28Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The WildSVDD challenge aims to detect AI-generated singing voices in real-world scenarios. The task involves distinguishing authentic human-sung songs from AI-generated deepfake songs at the clip level. Participants are required to identify whether each segmented clip contains a genuine singer or an AI-generated fake singer. The developed systems are expected to account for the complexities introduced by background music and various musical contexts. &lt;br /&gt;
&lt;br /&gt;
;Background&lt;br /&gt;
:With the advancement of AI technology, singing voices generated by AI are becoming increasingly indistinguishable from human performances. These synthesized voices can now emulate the vocal characteristics of any singer with minimal training data. While this technological advancement is impressive, it has sparked widespread concerns among artists, record labels, and publishing houses. The potential for unauthorized synthetic reproductions that mimic well-known singers poses a real threat to original artists' commercial value and intellectual property rights, igniting urgent calls for efficient and accurate methods to detect these deepfake singing voices.&lt;br /&gt;
&lt;br /&gt;
:This challenge is an extension of our previous work SingFake [1] and was initially introduced at the 2024 IEEE Spoken Language Technology Workshop (SLT 2024) [2] with a CtrSVDD track and a WildSVDD track. The CtrSVDD track [3] garnered significant attention from the speech community. We aim to raise awareness of WildSVDD within the ISMIR community and leverage the expertise of music experts.&lt;br /&gt;
&lt;br /&gt;
:[1] Zang, Yongyi, You Zhang, Mojtaba Heydari, and Zhiyao Duan. &amp;quot;SingFake: Singing voice deepfake detection.&amp;quot; In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 12156-12160. IEEE, 2024. https://ieeexplore.ieee.org/document/10448184 &lt;br /&gt;
&lt;br /&gt;
:[2] Zhang, You, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, and Zhiyao Duan. &amp;quot;SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge.&amp;quot; In Proc. IEEE Spoken Language Technology (SLT), 2024. https://arxiv.org/abs/2408.16132&lt;br /&gt;
&lt;br /&gt;
:[3] Zang, Yongyi, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu et al. &amp;quot;CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection.&amp;quot; In Proc. Interspeech, pp. 4783-4787, 2024. https://doi.org/10.21437/Interspeech.2024-2242&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
;Description&lt;br /&gt;
:The WildSVDD dataset is an extension of the SingFake dataset, now expanded to include a more diverse and comprehensive collection of real and AI-generated singing voice clips. We gathered data annotations from social media platforms. The annotators, who were familiar with the singers they covered, manually verified the user-specified labels during the annotation process to ensure accuracy, especially in cases where the singer(s) did not actually perform certain songs. We cross-checked the annotations against song titles and descriptions and manually reviewed any discrepancies for further verification.&lt;br /&gt;
&lt;br /&gt;
;Description of Audio Files&lt;br /&gt;
:The audio files in the WildSVDD dataset represent a broad range of languages and singers. These clips include strong background music, simulating real-world conditions that challenge the distinction between real and AI-generated voices. The dataset ensures diversity in the source material, with varying levels of complexity in the musical contexts.&lt;br /&gt;
&lt;br /&gt;
;Description of Split&lt;br /&gt;
:The dataset is divided into training and evaluation subsets. Test Set A includes new samples, while Test Set B represents the most challenging subset of the SingFake dataset. Participants are permitted to use the training data to create validation sets but must adhere to restrictions on the usage of the evaluation data.&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
;Model Architecture&lt;br /&gt;
:Participants are referred to baseline systems from the SingFake [1] and SingGraph [2] projects. SingGraph includes state-of-the-art components for detecting AI-generated singing voices, incorporating advanced techniques like graph modeling. The key features of these baselines include robust handling of background music and adaptation to different musical styles. Results showing how the SingFake baseline systems perform on the WildSVDD test data can be found in our SVDD@SLT challenge overview paper [3].&lt;br /&gt;
&lt;br /&gt;
:[1] SingFake: https://github.com/yongyizang/SingFake&lt;br /&gt;
&lt;br /&gt;
:[2] SingGraph: https://github.com/xjchenGit/SingGraph&lt;br /&gt;
&lt;br /&gt;
:[3] SVDD 2024@SLT: https://arxiv.org/abs/2408.16132&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
The primary metric for evaluation is Equal Error Rate (EER), which reflects the system's ability to distinguish between bonafide and deepfake singing voices regardless of the threshold set. EER is preferred over accuracy as it does not depend on a fixed threshold, providing a more reliable assessment of system performance. A lower EER indicates a better distinction between real and AI-generated voices.&lt;br /&gt;
&lt;br /&gt;
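The sketch below shows a standard ROC-based EER estimate in Python, assuming higher scores indicate bonafide clips; the labels and scores are placeholder values, not challenge data.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
from sklearn.metrics import roc_curve&lt;br /&gt;
&lt;br /&gt;
labels = np.array([1, 1, 0, 0, 1, 0])              # 1 = bonafide, 0 = deepfake&lt;br /&gt;
scores = np.array([0.9, 0.8, 0.3, 0.4, 0.7, 0.2])  # system confidence scores&lt;br /&gt;
&lt;br /&gt;
fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)&lt;br /&gt;
fnr = 1 - tpr&lt;br /&gt;
idx = np.nanargmin(np.abs(fnr - fpr))  # operating point where FPR equals FNR&lt;br /&gt;
eer = (fpr[idx] + fnr[idx]) / 2&lt;br /&gt;
print(f'EER = {eer:.3f}')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;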
= Download =&lt;br /&gt;
&lt;br /&gt;
The dataset and necessary resources can be accessed via the following links:&lt;br /&gt;
&lt;br /&gt;
* Dataset download: [https://zenodo.org/records/10893604 Zenodo WildSVDD]&lt;br /&gt;
* Download tools: https://pastebin.com/YhpYXT9z, https://cobalt.tools/, https://github.com/ytdl-org/youtube-dl, https://github.com/yt-dlp/yt-dlp, https://www.locoloader.com/bilibili-video-downloader/&lt;br /&gt;
* Segmentation tool: [https://github.com/yongyizang/SingFake/tree/main/dataset SingFake GitHub]&lt;br /&gt;
&lt;br /&gt;
Participants are encouraged to use the provided tools to download and segment song clips to ensure consistency in evaluation.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
Participants are allowed to use any publicly available datasets for training, excluding those used in the test set. Any additional data sources or pre-trained models must be clearly documented in the system descriptions. Private data or models are strictly prohibited to maintain fairness. All submissions should focus on segment-level evaluation, with results presented in a score file format.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
;Results submission&lt;br /&gt;
&lt;br /&gt;
:Participants should submit a score TXT file that includes the URLs, the segment start and end timestamps, and scores indicating the system's confidence that each clip is bonafide or deepfake. Submissions will be evaluated based on EER, and the results will be ranked accordingly; a hedged writer sketch follows below.&lt;br /&gt;
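:A minimal Python sketch for writing such a score file is given below. The exact column layout is defined by the organizers; the whitespace-separated &amp;quot;URL start end score&amp;quot; format and the example rows are assumptions for illustration only.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Assumed format: whitespace-separated URL, start, end, score per line.&lt;br /&gt;
rows = [  # placeholder predictions, not real dataset entries&lt;br /&gt;
    ('https://example.com/clip1', 12.0, 18.5, 0.93),&lt;br /&gt;
    ('https://example.com/clip2', 40.2, 47.0, 0.08),&lt;br /&gt;
]&lt;br /&gt;
with open('scores.txt', 'w') as f:&lt;br /&gt;
    for url, start, end, score in rows:&lt;br /&gt;
        f.write(f'{url} {start:.2f} {end:.2f} {score:.4f}\n')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;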
&lt;br /&gt;
;System description submission&lt;br /&gt;
:Participants are required to describe their system, including the data preprocessing, model architecture, training details, post-processing, etc.&lt;br /&gt;
&lt;br /&gt;
;Research paper submission&lt;br /&gt;
:Participants are encouraged to submit a research paper to the '''MIREX track''' at ISMIR 2024. &lt;br /&gt;
&lt;br /&gt;
;Workshop presentation&lt;br /&gt;
:We will invite top-ranked participants to present their work during the workshop session. The format will be hybrid to accommodate remote participation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13738</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13738"/>
		<updated>2024-09-10T12:22:22Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Bibliography */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320kbps 44.1 kHz MP3 audio encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* The dataset has no recommended training/validation split because it is intended solely for evaluation: participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== LP-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation.&lt;br /&gt;
&lt;br /&gt;
; Example&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;1004034&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;,&lt;br /&gt;
  &amp;quot;1007274&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;,&lt;br /&gt;
  &amp;quot;1009321&amp;quot;: &amp;quot;Energetic rock song with distorted electric guitars, powerful drumming, and passionate vocals, ideal for an intense workout session.&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: Although an audio file in the dataset may correspond to multiple captions, participants submit exactly one generated caption per audio file (identified by track_id). During evaluation, multiple reference captions are handled through multi-reference metric computation. The number of entries in the submitted JSON file should '''match the number of audio files''' in the assessment dataset, not the total number of reference captions.&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. (2023). The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13737</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13737"/>
		<updated>2024-09-10T12:21:14Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320kbps 44.1 kHz MP3 audio encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* The dataset has no recommended training/validation split because it is intended solely for evaluation: participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== LP-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation.&lt;br /&gt;
&lt;br /&gt;
; Example&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;1004034&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;,&lt;br /&gt;
  &amp;quot;1007274&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;,&lt;br /&gt;
  &amp;quot;1009321&amp;quot;: &amp;quot;Energetic rock song with distorted electric guitars, powerful drumming, and passionate vocals, ideal for an intense workout session.&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: Although an audio file in the dataset may correspond to multiple captions, participants submit exactly one generated caption per audio file (identified by track_id). During evaluation, multiple reference captions are handled through multi-reference metric computation. The number of entries in the submitted JSON file should '''match the number of audio files''' in the assessment dataset, not the total number of reference captions.&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. (2023). The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13736</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13736"/>
		<updated>2024-09-10T12:17:11Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320kbps 44.1 kHz MP3 audio encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* The dataset has no recommended training/validation split because it is intended solely for evaluation: participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== LP-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
; Example&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;1004034&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;,&lt;br /&gt;
  &amp;quot;1007274&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;,&lt;br /&gt;
  &amp;quot;1009321&amp;quot;: &amp;quot;Energetic rock song with distorted electric guitars, powerful drumming, and passionate vocals, ideal for an intense workout session.&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. (2023). The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13735</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13735"/>
		<updated>2024-09-10T12:15:33Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320 kbps, 44.1 kHz MP3 encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* While there is no recommended split for training and evaluation, the dataset is intended to be used solely for evaluation purposes. Participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== SD-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
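To make the data flow concrete, here is a minimal PyTorch sketch of a cross-modal encoder-decoder captioner in the spirit of the description above. The layer sizes, vocabulary size and learned positional encoding are illustrative assumptions, not the LP-MusicCaps configuration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import torch&lt;br /&gt;
import torch.nn as nn&lt;br /&gt;
&lt;br /&gt;
class ToyMusicCaptioner(nn.Module):&lt;br /&gt;
    def __init__(self, n_mels=128, d_model=256, vocab_size=10000, max_frames=1000):&lt;br /&gt;
        super().__init__()&lt;br /&gt;
        # Convolutional front-end with GELU over log-mel frames.&lt;br /&gt;
        self.frontend = nn.Sequential(&lt;br /&gt;
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1), nn.GELU())&lt;br /&gt;
        # Learned positional encoding (an illustrative choice).&lt;br /&gt;
        self.pos = nn.Parameter(torch.zeros(1, max_frames, d_model))&lt;br /&gt;
        enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)&lt;br /&gt;
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)&lt;br /&gt;
        self.embed = nn.Embedding(vocab_size, d_model)&lt;br /&gt;
        dec = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)&lt;br /&gt;
        self.decoder = nn.TransformerDecoder(dec, num_layers=4)&lt;br /&gt;
        self.lm_head = nn.Linear(d_model, vocab_size)&lt;br /&gt;
&lt;br /&gt;
    def forward(self, logmel, tokens):&lt;br /&gt;
        # logmel: (batch, n_mels, frames); tokens: (batch, seq) of caption ids.&lt;br /&gt;
        x = self.frontend(logmel).transpose(1, 2)&lt;br /&gt;
        x = x + self.pos[:, :x.size(1)]&lt;br /&gt;
        memory = self.encoder(x)  # encoded audio features&lt;br /&gt;
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))&lt;br /&gt;
        out = self.decoder(self.embed(tokens), memory, tgt_mask=mask)&lt;br /&gt;
        return self.lm_head(out)  # next-token logits&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;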
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Example&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;1004034&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;,&lt;br /&gt;
  &amp;quot;1007274&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;,&lt;br /&gt;
  &amp;quot;1009321&amp;quot;: &amp;quot;Energetic rock song with distorted electric guitars, powerful drumming, and passionate vocals, ideal for an intense workout session.&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. (2023). The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13734</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13734"/>
		<updated>2024-09-10T12:14:19Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320 kbps, 44.1 kHz MP3 encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* While there is no recommended split for training and evaluation, the dataset is intended to be used solely for evaluation purposes. Participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== SD-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Example&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;captions&amp;quot;: [&lt;br /&gt;
    {&lt;br /&gt;
      &amp;quot;track_id&amp;quot;: &amp;quot;1004034&amp;quot;,&lt;br /&gt;
      &amp;quot;caption&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;&lt;br /&gt;
    },&lt;br /&gt;
    {&lt;br /&gt;
      &amp;quot;track_id&amp;quot;: &amp;quot;1007274&amp;quot;,&lt;br /&gt;
      &amp;quot;caption&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;&lt;br /&gt;
    },&lt;br /&gt;
    // ... more captions for other tracks ...&lt;br /&gt;
  ]&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. (2023). The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13733</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13733"/>
		<updated>2024-09-10T12:14:09Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320 kbps, 44.1 kHz MP3 encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* While there is no recommended split for training and evaluation, the dataset is intended to be used solely for evaluation purposes. Participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== SD-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Example&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;captions&amp;quot;: [&lt;br /&gt;
    {&lt;br /&gt;
      &amp;quot;track_id&amp;quot;: &amp;quot;1004034&amp;quot;,&lt;br /&gt;
      &amp;quot;caption&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;&lt;br /&gt;
    },&lt;br /&gt;
    {&lt;br /&gt;
      &amp;quot;track_id&amp;quot;: &amp;quot;1007274&amp;quot;,&lt;br /&gt;
      &amp;quot;caption&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;&lt;br /&gt;
    },&lt;br /&gt;
    // ... more captions for other tracks ...&lt;br /&gt;
  ]&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. (2023). The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13732</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13732"/>
		<updated>2024-09-10T12:13:48Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: /* Submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320 kbps, 44.1 kHz MP3 encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* While there is no recommended split for training and evaluation, the dataset is intended to be used solely for evaluation purposes. Participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== SD-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;captions&amp;quot;: [&lt;br /&gt;
    {&lt;br /&gt;
      &amp;quot;track_id&amp;quot;: &amp;quot;1004034&amp;quot;,&lt;br /&gt;
      &amp;quot;caption&amp;quot;: &amp;quot;Upbeat electronic dance music with a pulsing synthesizer melody and rhythmic drum patterns, suitable for a lively party atmosphere.&amp;quot;&lt;br /&gt;
    },&lt;br /&gt;
    {&lt;br /&gt;
      &amp;quot;track_id&amp;quot;: &amp;quot;1007274&amp;quot;,&lt;br /&gt;
      &amp;quot;caption&amp;quot;: &amp;quot;Gentle acoustic guitar instrumental featuring intricate fingerpicking and a soothing melody, perfect for a calm and reflective mood.&amp;quot;&lt;br /&gt;
    },&lt;br /&gt;
    // ... more captions for other tracks ...&lt;br /&gt;
  ]&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to '''four versions''' of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., ... &amp;amp; Nam, J. (2023). The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13703</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13703"/>
		<updated>2024-09-10T10:49:25Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320 kbps, 44.1 kHz MP3 encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* While there is no recommended split for training and evaluation, the dataset is intended to be used solely for evaluation purposes. Participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== SD-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation.&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to four versions of their system. The final ranking will be based on the metrics outlined above.&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
[1] Doh, S., Choi, K., Lee, J., &amp;amp; Nam, J. (2023, November). LP-MusicCaps: LLM-Based Pseudo Music Captioning. In ISMIR 2023 Hybrid Conference.&lt;br /&gt;
[2] Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bodganov, D., ... &amp;amp; Nam, J. The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13702</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13702"/>
		<updated>2024-09-10T10:39:35Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320 kbps, 44.1 kHz MP3 encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* While there is no recommended split for training and evaluation, the dataset is intended to be used solely for evaluation purposes. Participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== SD-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation.&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to four versions of their system. The final ranking will be based on the metrics outlined above.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13701</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13701"/>
		<updated>2024-09-10T10:38:27Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the Song Describer dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset (SDD) serves as the benchmark for this task. SDD is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by volunteers. This dataset provides a robust foundation for evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
SDD comprises 1,106 captions for 706 music recordings, with a validated subset of 746 captions for 547 recordings. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in SDD are carefully selected from the MTG-Jamendo dataset. Each clip is up to 2 minutes long (95% are 2 minutes), providing a rich diversity of musical genres and styles.&lt;br /&gt;
* Audio files are provided in 320 kbps, 44.1 kHz MP3 encoding.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the SDD is accompanied by one to five free-text captions written by volunteers. These captions focus on describing musical elements such as genre, instrumentation, mood, and other relevant characteristics.&lt;br /&gt;
* Captions are single-sentence descriptions, with an average length of 21.7 words in the full dataset and 18.2 words in the validated subset.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* While there is no recommended split for training and evaluation, the dataset is intended to be used solely for evaluation purposes. Participants should not use any part of SDD for training or validation.&lt;br /&gt;
* A validated subset is provided, containing manually reviewed captions that adhere strictly to the annotation guidelines.&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== SD-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps utilizes a cross-modal encoder-decoder transformer architecture, designed to generate high-quality captions for music clips. The encoder processes 10-second audio signals by converting them into log-mel spectrograms, which are then refined through convolutional layers with GELU activation to extract critical audio features. These features, combined with positional encoding, are further processed by transformer blocks that understand the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
; BERT-Score&lt;br /&gt;
: Computes token similarity using contextual embeddings from BERT.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The Song Describer dataset, including both the audio clips and their corresponding captions, is available for download from Zenodo (DOI: https://doi.org/10.5281/zenodo.10072001). Participants should download the dataset from this source to ensure they are using the correct version for the challenge.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants are allowed to utilize external datasets and pre-trained models in developing their systems. However, the use of the Song Describer dataset for training or validation is strictly prohibited.&lt;br /&gt;
* Participants must ensure that their models do not use any information from the MTG-Jamendo dataset beyond what is provided in the Song Describer dataset.&lt;br /&gt;
* All submissions must respect the CC BY-SA 4.0 license under which the Song Describer dataset is released.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset. The format should match the structure provided in the Song Describer dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the Song Describer dataset was not used for training or validation.&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to four versions of their system. The final ranking will be based on the metrics outlined above.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Singing_Voice_Deepfake_Detection&amp;diff=13664</id>
		<title>2024:Singing Voice Deepfake Detection</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Singing_Voice_Deepfake_Detection&amp;diff=13664"/>
		<updated>2024-08-25T19:08:59Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The WildSVDD challenge focuses on the detection of AI-generated singing voices in the wild. With the advancement of AI technology, singing voices generated by AI are becoming increasingly indistinguishable from human performances. This task challenges participants to develop systems capable of accurately distinguishing real singing voices from AI-generated ones, especially within the complex context of background music and diverse musical environments. Participants will leverage the WildSVDD dataset, which includes a wide variety of song clips, both bonafide and deepfake, to develop and evaluate their systems.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
;Description&lt;br /&gt;
:The WildSVDD dataset extends the SingFake dataset with a more diverse and comprehensive collection of real and AI-generated singing voice clips. It comprises 97 singers with 2,007 deepfake and 1,216 bonafide song clips, with annotations checked for accuracy.&lt;br /&gt;
&lt;br /&gt;
;Description of Audio Files&lt;br /&gt;
:The audio files in the WildSVDD dataset represent a broad range of languages and singers. These clips include strong background music, simulating real-world conditions that challenge the distinction between real and AI-generated voices. The dataset ensures diversity in the source material, with varying levels of complexity in the musical contexts.&lt;br /&gt;
&lt;br /&gt;
;Description of Split&lt;br /&gt;
:The dataset is divided into training and evaluation subsets. Test Set A includes new samples, while Test Set B represents the most challenging subset from the SingFake dataset. Participants are permitted to use the training data to create validation sets but must adhere to restrictions on the usage of the evaluation data.&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
;Model Architecture&lt;br /&gt;
:Participants are referred to baseline systems from the SingFake and SingGraph projects. These baselines include state-of-the-art components for detecting AI-generated singing voices, incorporating advanced techniques like graph modeling and controlled SVDD analysis. The key features of these baselines include robust handling of background music and adaptation to different musical styles.&lt;br /&gt;
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
The primary metric for evaluation is Equal Error Rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. EER is preferred over accuracy because it does not depend on a fixed decision threshold, giving a threshold-independent assessment of how well a system separates bonafide from deepfake singing voices. A lower EER indicates better discrimination between real and AI-generated voices.&lt;br /&gt;
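&lt;br /&gt;
To make the metric concrete, here is a minimal sketch of computing EER from per-segment scores with numpy. It assumes scores are higher for bonafide clips and labels are 1 for bonafide, 0 for deepfake; these conventions are illustrative, not prescribed by the challenge:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def compute_eer(scores, labels):&lt;br /&gt;
    # scores: higher means more likely bonafide (assumed convention).&lt;br /&gt;
    # labels: 1 for bonafide, 0 for deepfake.&lt;br /&gt;
    scores = np.asarray(scores, dtype=float)&lt;br /&gt;
    labels = np.asarray(labels, dtype=int)&lt;br /&gt;
    fars, frrs = [], []&lt;br /&gt;
    # Sweep every observed score as a candidate decision threshold.&lt;br /&gt;
    for t in np.unique(scores):&lt;br /&gt;
        accept = np.greater_equal(scores, t)  # predicted bonafide&lt;br /&gt;
        fars.append(np.mean(accept[labels == 0]))  # deepfakes accepted&lt;br /&gt;
        frrs.append(np.mean(np.logical_not(accept)[labels == 1]))  # bonafide rejected&lt;br /&gt;
    fars, frrs = np.array(fars), np.array(frrs)&lt;br /&gt;
    # EER sits where FAR and FRR cross; take the closest sweep point.&lt;br /&gt;
    i = np.argmin(np.abs(fars - frrs))&lt;br /&gt;
    return (fars[i] + frrs[i]) / 2.0&lt;br /&gt;
&lt;br /&gt;
print(compute_eer([0.9, 0.8, 0.4, 0.1], [1, 1, 0, 0]))  # 0.0 for a perfect ranking&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;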
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The dataset and necessary resources can be accessed via the following links:&lt;br /&gt;
&lt;br /&gt;
* Dataset download: [https://zenodo.org/records/10893604 Zenodo WildSVDD]&lt;br /&gt;
* Segmentation tool: [https://github.com/yongyizang/SingFake/tree/main/dataset SingFake GitHub]&lt;br /&gt;
&lt;br /&gt;
Participants are encouraged to use the provided tools for segmenting song clips to ensure consistency in evaluation.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
Participants are allowed to use any publicly available datasets for training, excluding those used in the test set. Any additional data sources or pre-trained models must be clearly documented in the system descriptions. Private data or models are strictly prohibited to maintain fairness. All submissions should focus on segment-level evaluation, with results presented in a score file format.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
Participants should submit a score TXT file that includes the URLs, segment start and end timestamps, and the corresponding scores indicating the system's confidence that each clip is bonafide or deepfake. Submissions will be evaluated on EER and ranked accordingly.&lt;br /&gt;
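&lt;br /&gt;
A hedged sketch of emitting such a score file follows; the column order (URL, start, end, score) and whitespace separation are assumptions for illustration, so follow the organizers' exact format in a real submission:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
# Hypothetical layout: one segment per line, whitespace-separated:&lt;br /&gt;
# URL  start_seconds  end_seconds  score (higher means more likely bonafide)&lt;br /&gt;
segments = [&lt;br /&gt;
    ('https://example.com/clip1', 12.0, 18.5, 0.91),&lt;br /&gt;
    ('https://example.com/clip2', 0.0, 6.2, 0.13),&lt;br /&gt;
]&lt;br /&gt;
&lt;br /&gt;
with open('scores.txt', 'w', encoding='utf-8') as fh:&lt;br /&gt;
    for url, start, end, score in segments:&lt;br /&gt;
        fh.write(f'{url} {start:.2f} {end:.2f} {score:.4f}\n')&lt;br /&gt;
&lt;/pre&gt;&lt;/div&gt;</summary>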
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13662</id>
		<title>2024:Music Description &amp; Captioning</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2024:Music_Description_%26_Captioning&amp;diff=13662"/>
		<updated>2024-08-25T18:39:57Z</updated>

		<summary type="html">&lt;p&gt;Ldzhangyx: Created page with &amp;quot;= Task Description =  The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Task Description =&lt;br /&gt;
&lt;br /&gt;
The MIREX 2024 Music Captioning Task invites participants to develop models capable of generating accurate and descriptive captions for music clips. This task aims to push the boundaries of music understanding by advancing models that can interpret and describe musical content in natural language, thereby enhancing accessibility and comprehension of music.&lt;br /&gt;
&lt;br /&gt;
Participants are tasked with creating systems that generate captions for a collection of music clips from the MusicCaps dataset. The generated captions will be assessed using several evaluation metrics to gauge the effectiveness and performance of the models.&lt;br /&gt;
&lt;br /&gt;
= Dataset =&lt;br /&gt;
&lt;br /&gt;
== Description ==&lt;br /&gt;
&lt;br /&gt;
The MusicCaps dataset serves as the benchmark for this task. MusicCaps is a meticulously curated collection of music clips, each paired with detailed textual descriptions crafted by musicians. This dataset provides a robust foundation for training and evaluating music captioning models.&lt;br /&gt;
&lt;br /&gt;
MusicCaps comprises 5,521 music examples, each a 10-second clip extracted from the larger AudioSet dataset. The clips represent a wide array of genres and styles, ensuring a comprehensive representation of musical content. The dataset is annotated with an emphasis on capturing the intricate details of the audio through precise textual descriptions.&lt;br /&gt;
&lt;br /&gt;
=== Description of Audio Files ===&lt;br /&gt;
&lt;br /&gt;
* The audio clips in MusicCaps are carefully selected and processed to reflect a broad spectrum of musical experiences. Each clip is 10 seconds long and is sourced from AudioSet, providing a rich diversity of musical genres and styles.&lt;br /&gt;
&lt;br /&gt;
=== Description of Text ===&lt;br /&gt;
&lt;br /&gt;
* Each clip in the MusicCaps dataset is accompanied by a free-text caption written by professional musicians. These captions focus on describing the musical elements such as genre, instrumentation, mood, and other relevant characteristics. Importantly, the captions exclude metadata like artist names and concentrate solely on the auditory content.&lt;br /&gt;
&lt;br /&gt;
=== Description of Split ===&lt;br /&gt;
&lt;br /&gt;
* The MusicCaps dataset is divided into training and evaluation subsets. Participants must not use the evaluation split for training or validation purposes. This split is designed to ensure a fair and balanced assessment of the model's performance, with a diverse set of music clips and captions.&lt;br /&gt;
&lt;br /&gt;
= Baseline =&lt;br /&gt;
&lt;br /&gt;
== LP-MusicCaps: Model Architecture ==&lt;br /&gt;
&lt;br /&gt;
* LP-MusicCaps uses a cross-modal encoder-decoder transformer architecture designed to generate high-quality captions for music clips. The encoder converts 10-second audio signals into log-mel spectrograms, which are refined through convolutional layers with GELU activation to extract salient audio features. These features, combined with positional encoding, are then processed by transformer blocks that model the sequence and context of the audio data.&lt;br /&gt;
&lt;br /&gt;
* The decoder is responsible for generating text captions from these encoded audio features. It uses transformer blocks similar to those in the encoder, processes tokenized text, and employs multi-head attention to ensure that the generated captions are contextually relevant.&lt;br /&gt;
&lt;br /&gt;
* A key feature of LP-MusicCaps is the augmentation with a large language model (LLM), which enhances the model's ability to generate sophisticated and contextually rich captions.&lt;br /&gt;
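&lt;br /&gt;
To make the encoder pipeline concrete, here is a minimal PyTorch-style sketch of the stages described above (log-mel front end, GELU-activated convolution, positional encoding, transformer blocks). All layer sizes are illustrative assumptions, not the published LP-MusicCaps configuration:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
import torch&lt;br /&gt;
import torch.nn as nn&lt;br /&gt;
import torchaudio&lt;br /&gt;
&lt;br /&gt;
class AudioEncoderSketch(nn.Module):&lt;br /&gt;
    # Illustrative sizes only; not the published LP-MusicCaps config.&lt;br /&gt;
    def __init__(self, n_mels=128, d_model=256, n_layers=4):&lt;br /&gt;
        super().__init__()&lt;br /&gt;
        self.melspec = torchaudio.transforms.MelSpectrogram(&lt;br /&gt;
            sample_rate=16000, n_mels=n_mels)&lt;br /&gt;
        # Convolution with GELU refines the spectrogram frames.&lt;br /&gt;
        self.conv = nn.Sequential(&lt;br /&gt;
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1),&lt;br /&gt;
            nn.GELU())&lt;br /&gt;
        # Learned positional embedding added before the transformer.&lt;br /&gt;
        self.pos = nn.Parameter(torch.zeros(1, 1024, d_model))&lt;br /&gt;
        layer = nn.TransformerEncoderLayer(&lt;br /&gt;
            d_model=d_model, nhead=4, batch_first=True)&lt;br /&gt;
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)&lt;br /&gt;
&lt;br /&gt;
    def forward(self, wav):&lt;br /&gt;
        # wav: (batch, samples), e.g. a 10-second clip at 16 kHz.&lt;br /&gt;
        x = self.melspec(wav).clamp(min=1e-5).log()  # log-mel frames&lt;br /&gt;
        x = self.conv(x).transpose(1, 2)  # (batch, time, d_model)&lt;br /&gt;
        x = x + self.pos[:, : x.size(1)]&lt;br /&gt;
        return self.encoder(x)  # encoded audio features for the decoder&lt;br /&gt;
&lt;br /&gt;
feats = AudioEncoderSketch()(torch.randn(2, 160000))  # two 10 s clips&lt;br /&gt;
print(feats.shape)  # torch.Size([2, 801, 256])&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;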
&lt;br /&gt;
= Metrics =&lt;br /&gt;
&lt;br /&gt;
* The evaluation of submitted systems will be based on multiple metrics, with ROUGE-L serving as the primary metric for determining the final ranking. The metrics include:&lt;br /&gt;
&lt;br /&gt;
; ROUGE-L&lt;br /&gt;
: Measures the overlap of the longest common subsequence between the generated and reference captions, serving as the main determinant of the final ranking.&lt;br /&gt;
&lt;br /&gt;
; BLEU (B1~B4)&lt;br /&gt;
: Evaluates n-gram precision, with B1 to B4 representing unigram to 4-gram matches.&lt;br /&gt;
&lt;br /&gt;
; METEOR&lt;br /&gt;
: Incorporates precision, recall, and synonymy matching to improve alignment with human judgment.&lt;br /&gt;
&lt;br /&gt;
* While each metric will contribute to a ranking, ROUGE-L will primarily determine the final standings.&lt;br /&gt;
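&lt;br /&gt;
For reference, here is a minimal sketch of the longest-common-subsequence F-measure behind ROUGE-L, computed over whitespace tokens. The official evaluation presumably relies on a standard scoring package, so this is illustrative only (beta=1.2 is the weight commonly used in captioning toolkits, assumed here):&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
def lcs_length(a, b):&lt;br /&gt;
    # Classic dynamic program for the longest common subsequence.&lt;br /&gt;
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]&lt;br /&gt;
    for i, x in enumerate(a, 1):&lt;br /&gt;
        for j, y in enumerate(b, 1):&lt;br /&gt;
            if x == y:&lt;br /&gt;
                table[i][j] = table[i - 1][j - 1] + 1&lt;br /&gt;
            else:&lt;br /&gt;
                table[i][j] = max(table[i - 1][j], table[i][j - 1])&lt;br /&gt;
    return table[len(a)][len(b)]&lt;br /&gt;
&lt;br /&gt;
def rouge_l(candidate, reference, beta=1.2):&lt;br /&gt;
    # ROUGE-L F-measure over whitespace tokens (illustrative only).&lt;br /&gt;
    c, r = candidate.split(), reference.split()&lt;br /&gt;
    lcs = lcs_length(c, r)&lt;br /&gt;
    if lcs == 0:&lt;br /&gt;
        return 0.0&lt;br /&gt;
    prec, rec = lcs / len(c), lcs / len(r)&lt;br /&gt;
    return ((1 + beta ** 2) * prec * rec) / (rec + beta ** 2 * prec)&lt;br /&gt;
&lt;br /&gt;
print(rouge_l('a calm piano melody', 'calm solo piano melody'))  # 0.75&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;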
&lt;br /&gt;
= Download =&lt;br /&gt;
&lt;br /&gt;
The dataset, including both the audio clips and their corresponding captions, will be made available for download; the access link will be posted here.&lt;br /&gt;
&lt;br /&gt;
= Rules =&lt;br /&gt;
&lt;br /&gt;
* Participants may use external datasets and pre-trained models when developing their systems. However, using the MusicCaps evaluation split for training or validation is strictly prohibited, and any overlap with the evaluation data must be avoided.&lt;br /&gt;
&lt;br /&gt;
= Submission =&lt;br /&gt;
&lt;br /&gt;
* Submissions will be evaluated using CodaBench (https://www.codabench.org/) for automated assessment.&lt;br /&gt;
&lt;br /&gt;
* Participants are required to submit the following:&lt;br /&gt;
&lt;br /&gt;
; JSON file&lt;br /&gt;
: A JSON file containing the generated captions for the evaluation dataset.&lt;br /&gt;
&lt;br /&gt;
; PDF file&lt;br /&gt;
: A PDF file detailing the system architecture, training process, and any external data or models used.&lt;br /&gt;
&lt;br /&gt;
* Each participant or team may submit up to four versions of their system. The final ranking will be based on the metrics outlined above.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
</feed>