<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://music-ir.org/mirex/w/index.php?action=history&amp;feed=atom&amp;title=2025%3AMusic_Structure_Analysis</id>
	<title>2025:Music Structure Analysis - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://music-ir.org/mirex/w/index.php?action=history&amp;feed=atom&amp;title=2025%3AMusic_Structure_Analysis"/>
	<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis&amp;action=history"/>
	<updated>2026-04-15T16:05:25Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.31.1</generator>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis&amp;diff=14749&amp;oldid=prev</id>
		<title>Ldzhangyx: /* Data */</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis&amp;diff=14749&amp;oldid=prev"/>
		<updated>2025-09-08T01:54:23Z</updated>

		<summary type="html">&lt;p&gt;‎&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Data&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 01:54, 8 September 2025&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l17&quot; &gt;Line 17:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 17:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Data ===&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Data ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;We use the relabeled Harmonix Dataset for evaluation. The test split is used for evaluation, and participants may use the train and validation splits for training their models.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;https://huggingface.co/datasets/m-a-p/harmonixset_bigvgan/tree/main&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;==== Collections ====&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;==== Collections ====&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis&amp;diff=14625&amp;oldid=prev</id>
		<title>Ldzhangyx: /* Output Data Format */</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis&amp;diff=14625&amp;oldid=prev"/>
		<updated>2025-05-20T00:27:59Z</updated>

		<summary type="html">&lt;p&gt;‎&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Output Data Format&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 00:27, 20 May 2025&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l40&quot; &gt;Line 40:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 40:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The output must be a '''list of dictionaries''' in a text-based format (e.g., JSON parsable). Each dictionary in the list corresponds to one audio file and must contain two keys: &amp;lt;tt&amp;gt;'id'&amp;lt;/tt&amp;gt; (the identifier of the audio file, e.g., '1.wav') and &amp;lt;tt&amp;gt;'result'&amp;lt;/tt&amp;gt; (a list of segment predictions). Each segment prediction is a list containing two elements: a two-element list with the &amp;lt;tt&amp;gt;[start_time, end_time]&amp;lt;/tt&amp;gt; of the segment in seconds, and the &amp;lt;tt&amp;gt;label&amp;lt;/tt&amp;gt; string for that segment.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The output must be a '''list of dictionaries''' in a text-based format (e.g., JSON parsable). Each dictionary in the list corresponds to one audio file and must contain two keys: &amp;lt;tt&amp;gt;'id'&amp;lt;/tt&amp;gt; (the identifier of the audio file, e.g., '1.wav') and &amp;lt;tt&amp;gt;'result'&amp;lt;/tt&amp;gt; (a list of segment predictions). Each segment prediction is a list containing two elements: a two-element list with the &amp;lt;tt&amp;gt;[start_time, end_time]&amp;lt;/tt&amp;gt; of the segment in seconds, and the &amp;lt;tt&amp;gt;label&amp;lt;/tt&amp;gt; string for that segment.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The labels must be one of the seven target functional categories: &amp;lt;tt&amp;gt;'intro'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'verse'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'chorus'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'bridge'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'inst'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'outro'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;other&lt;/del&gt;'&amp;lt;/tt&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The labels must be one of the seven target functional categories: &amp;lt;tt&amp;gt;'intro'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'verse'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'chorus'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'bridge'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'inst'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'outro'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;silence&lt;/ins&gt;'&amp;lt;/tt&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Example of the content of the submitted file:&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Example of the content of the submitted file:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis&amp;diff=14624&amp;oldid=prev</id>
		<title>Ldzhangyx: Created page with &quot;== Music Structure Analysis (MIREX 2025) ==  '''Important Note: MIREX 2025 will be held as a workshop of ISMIR 2025. Papers accepted and presented at MIREX 2025 will have the...&quot;</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2025:Music_Structure_Analysis&amp;diff=14624&amp;oldid=prev"/>
		<updated>2025-05-20T00:27:08Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;== Music Structure Analysis (MIREX 2025) ==  &amp;#039;&amp;#039;&amp;#039;Important Note: MIREX 2025 will be held as a workshop of ISMIR 2025. Papers accepted and presented at MIREX 2025 will have the...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;== Music Structure Analysis (MIREX 2025) ==&lt;br /&gt;
&lt;br /&gt;
'''Important Note: MIREX 2025 will be held as a workshop of ISMIR 2025. Papers accepted and presented at MIREX 2025 will have the opportunity to be showcased in the ISMIR 2025 Late Breaking Demo Track.'''&lt;br /&gt;
&lt;br /&gt;
=== Description ===&lt;br /&gt;
&lt;br /&gt;
The aim of the MIREX Music Structure Analysis task is to identify and label key structural sections in musical audio. Understanding the musical form (e.g., intro, verse, chorus) is fundamental to music understanding and a crucial component in many music information retrieval applications. While traditional approaches focused on segmenting music into internally consistent, but arbitrarily labeled, sections (e.g., A, B, C), this task has evolved.&lt;br /&gt;
&lt;br /&gt;
Since 2020, a new paradigm has emerged, focusing on '''functional structure analysis'''. The goal is to segment the audio and assign a specific functional label to each segment from a predefined set of common musical functions. This task challenges systems to perform both accurate boundary detection and correct functional classification.&lt;br /&gt;
&lt;br /&gt;
This task builds upon a history of structural segmentation evaluations, first run in MIREX 2009. Recent works driving this updated focus include:&lt;br /&gt;
* Wang, J. C., Hung, Y. N., &amp;amp; Smith, J. B. (2022, May). To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions. In ''ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)'' (pp. 416-420). IEEE.&lt;br /&gt;
* Kim, T., &amp;amp; Nam, J. (2023, October). All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio. In ''2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)'' (pp. 1-5). IEEE.&lt;br /&gt;
* Buisson, M., McFee, B., Essid, S., &amp;amp; Crayencour, H. C. (2024). Self-supervised learning of multi-level audio representations for music segmentation. ''IEEE/ACM Transactions on Audio, Speech, and Language Processing''.&lt;br /&gt;
&lt;br /&gt;
For MIREX 2025, participants are required to segment musical audio and classify each segment into one of seven functional categories: '''‘intro’, ‘verse’, ‘chorus’, ‘bridge’, ‘inst’ (instrumental), ‘outro’, or ‘other’'''. The 'other' category can be used for segments that do not fit any of the primary six functional labels, or for non-musical content where the dataset annotations being mapped explicitly define such segments.&lt;br /&gt;
&lt;br /&gt;
=== Data ===&lt;br /&gt;
&lt;br /&gt;
==== Collections ====&lt;br /&gt;
The evaluation will utilize datasets previously established in MIREX. Annotations from these diverse collections will be mapped to the seven target functional labels for consistent evaluation.&lt;br /&gt;
* '''The MIREX 2009 Collection''': 297 pieces, largely derived from the work of the Beatles.&lt;br /&gt;
* '''MIREX 2010 RWC collection''': 100 pieces of popular music. This collection has two sets of ground truths. The first was originally included with the RWC dataset. The second set provides segment boundary annotations (see [http://hal.inria.fr/docs/00/47/34/79/PDF/PI-1948.pdf Bimbot et al., 2010] for details).&lt;br /&gt;
* '''MIREX 2012 dataset''': Over 1,000 annotated pieces covering a range of musical styles, with the majority annotated by two independent annotators.&lt;br /&gt;
&lt;br /&gt;
Participants should be aware that original labels in these datasets (e.g., 'verse1', 'solo', 'fade-out') will need to be mapped to the seven specified functional categories for evaluation. Guidelines for this mapping will be provided, or a standard mapping will be applied during evaluation.&lt;br /&gt;
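For illustration only, such a mapping can be expressed as a lookup table plus a fallback rule. The specific correspondences below are hypothetical assumptions for sketching the idea, not the official mapping:&lt;br /&gt;

```python
# Hypothetical mapping from raw dataset labels to the seven target categories.
# The official MIREX mapping may differ; this only illustrates the approach.
RAW_TO_FUNCTIONAL = {
    "refrain": "chorus",
    "solo": "inst",
    "fade-out": "outro",
}

TARGET_LABELS = {"intro", "verse", "chorus", "bridge", "inst", "outro"}

def map_label(raw, default="other"):
    """Map a raw annotation label to one of the seven functional categories."""
    if raw in RAW_TO_FUNCTIONAL:
        return RAW_TO_FUNCTIONAL[raw]
    # Strip trailing digits so e.g. "verse1" becomes "verse" before lookup.
    base = raw.rstrip("0123456789")
    if base in TARGET_LABELS:
        return base
    return RAW_TO_FUNCTIONAL.get(base, default)
```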
&lt;br /&gt;
==== Audio Formats (Input to Algorithms) ====&lt;br /&gt;
Algorithms should be prepared to process audio with the following characteristics:&lt;br /&gt;
* Sample rate: 44.1 kHz&lt;br /&gt;
* Bit depth: 16 bit&lt;br /&gt;
* Number of channels: 1 (mono)&lt;br /&gt;
* Encoding: WAV&lt;br /&gt;
&lt;br /&gt;
=== Submission Format ===&lt;br /&gt;
&lt;br /&gt;
Submissions will be handled via '''CodeBench'''. Participants are required to submit their results in a specific format, as detailed below. You will upload a single file containing the segmentation results for all test audio files.&lt;br /&gt;
&lt;br /&gt;
==== Output Data Format ====&lt;br /&gt;
The output must be a '''list of dictionaries''' in a text-based format (e.g., JSON parsable). Each dictionary in the list corresponds to one audio file and must contain two keys: &amp;lt;tt&amp;gt;'id'&amp;lt;/tt&amp;gt; (the identifier of the audio file, e.g., '1.wav') and &amp;lt;tt&amp;gt;'result'&amp;lt;/tt&amp;gt; (a list of segment predictions). Each segment prediction is a list containing two elements: a two-element list with the &amp;lt;tt&amp;gt;[start_time, end_time]&amp;lt;/tt&amp;gt; of the segment in seconds, and the &amp;lt;tt&amp;gt;label&amp;lt;/tt&amp;gt; string for that segment.&lt;br /&gt;
&lt;br /&gt;
The labels must be one of the seven target functional categories: &amp;lt;tt&amp;gt;'intro'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'verse'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'chorus'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'bridge'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'inst'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'outro'&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;'other'&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Example of the content of the submitted file:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[&lt;br /&gt;
  {&lt;br /&gt;
    &quot;id&quot;: &quot;track01.wav&quot;,&lt;br /&gt;
    &quot;result&quot;: [&lt;br /&gt;
      [[0.000, 15.500], &quot;intro&quot;],&lt;br /&gt;
      [[15.500, 45.230], &quot;verse&quot;],&lt;br /&gt;
      [[45.230, 75.800], &quot;chorus&quot;],&lt;br /&gt;
      [[75.800, 90.000], &quot;outro&quot;]&lt;br /&gt;
    ]&lt;br /&gt;
  },&lt;br /&gt;
  {&lt;br /&gt;
    &quot;id&quot;: &quot;track02.wav&quot;,&lt;br /&gt;
    &quot;result&quot;: [&lt;br /&gt;
      [[0.000, 20.100], &quot;verse&quot;],&lt;br /&gt;
      [[20.100, 38.500], &quot;chorus&quot;],&lt;br /&gt;
      [[38.500, 55.000], &quot;verse&quot;],&lt;br /&gt;
      [[55.000, 72.600], &quot;chorus&quot;],&lt;br /&gt;
      [[72.600, 89.000], &quot;bridge&quot;],&lt;br /&gt;
      [[89.000, 105.000], &quot;chorus&quot;],&lt;br /&gt;
      [[105.000, 115.500], &quot;outro&quot;]&lt;br /&gt;
    ]&lt;br /&gt;
  }&lt;br /&gt;
]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Ensure that the &amp;lt;tt&amp;gt;end_time&amp;lt;/tt&amp;gt; of one segment equals the &amp;lt;tt&amp;gt;start_time&amp;lt;/tt&amp;gt; of the next segment, and that the segments cover the entire duration of the piece analyzed. The first segment must start at &amp;lt;tt&amp;gt;0.0&amp;lt;/tt&amp;gt;.&lt;br /&gt;
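These structural constraints can be checked before uploading. The following is an illustrative sketch, not an official validator; it assumes the predictions have already been parsed into a Python list of dictionaries as described above:&lt;br /&gt;

```python
# Illustrative sanity check for the submission format described above.
# Not an official tool; the tolerance value is an assumption.
def validate(entries, tol=1e-3):
    """Check labels, ordering, and that segments tile the piece from 0.0."""
    labels = {"intro", "verse", "chorus", "bridge", "inst", "outro", "other"}
    for entry in entries:
        segs = entry["result"]
        assert segs, "empty result for " + entry["id"]
        # The first segment must start at time 0.0.
        assert tol >= abs(segs[0][0][0]), "first segment must start at 0.0"
        for span, label in segs:
            assert label in labels, "unknown label: " + label
            assert span[1] > span[0], "segment must have positive duration"
        # Each segment's end_time must equal the next segment's start_time.
        for prev, nxt in zip(segs, segs[1:]):
            assert tol >= abs(prev[0][1] - nxt[0][0]), "segments must be contiguous"
    return True
```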
&lt;br /&gt;
=== Evaluation Procedures ===&lt;br /&gt;
&lt;br /&gt;
Evaluation will focus on both the accuracy of the detected segment boundaries and the correctness of the assigned functional labels. The primary metrics are:&lt;br /&gt;
&lt;br /&gt;
# '''Frame-Level Accuracy (ACC)''': Both the system output and the ground truth will be converted into time series of labels at a fine temporal resolution (e.g., 10 ms or 100 ms frames). Accuracy is calculated as the proportion of frames that are correctly labeled by the system compared to the ground truth across the entire dataset. This metric evaluates the overall correctness of segment labels and their temporal extents.&lt;br /&gt;
# '''Boundary Retrieval Hit Rate F-Measures (HR.5F and HR3F)''': This metric assesses the system's ability to correctly identify segment boundaries.&lt;br /&gt;
#* A predicted boundary is considered a '''hit''' if it falls within a certain tolerance window of a ground-truth boundary.&lt;br /&gt;
#* Two tolerance windows will be used:&lt;br /&gt;
#** 0.5 seconds: for finer precision.&lt;br /&gt;
#** 3.0 seconds: for coarser, more perceptually relevant boundaries.&lt;br /&gt;
#* Based on these hits, '''Precision (P)''', '''Recall (R)''', and '''F-measure (F1-score)''' will be calculated for boundary detection at both tolerance levels:&lt;br /&gt;
#: &amp;lt;math&amp;gt;P = \frac{\text{Number of correctly retrieved boundaries}}{\text{Total number of retrieved boundaries}}&amp;lt;/math&amp;gt;&lt;br /&gt;
#: &amp;lt;math&amp;gt;R = \frac{\text{Number of correctly retrieved boundaries}}{\text{Total number of ground truth boundaries}}&amp;lt;/math&amp;gt;&lt;br /&gt;
#: &amp;lt;math&amp;gt;F = \frac{2 \times P \times R}{P + R}&amp;lt;/math&amp;gt;&lt;br /&gt;
#* The reported metrics will be '''HR.5F''' (F-measure with a 0.5 s tolerance) and '''HR3F''' (F-measure with a 3 s tolerance).&lt;br /&gt;
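As an illustration of the boundary metric, the hit-rate F-measure can be sketched in a few lines. This is a simplified greedy matcher, not the official scorer; evaluation toolkits such as mir_eval provide reference implementations with stricter matching:&lt;br /&gt;

```python
# Illustrative sketch of the boundary hit-rate F-measure (HR.5F / HR3F).
# Greedy one-to-one matching within a tolerance window; not the official scorer.
def boundary_f_measure(reference, estimated, tolerance=0.5):
    """F-measure of estimated vs. reference boundary times (in seconds)."""
    used = set()
    matched = 0
    for est in estimated:
        for i, ref in enumerate(reference):
            # A hit: the estimate falls within the tolerance of an unused
            # reference boundary.
            if i not in used and tolerance >= abs(est - ref):
                used.add(i)
                matched += 1
                break
    if not reference or not estimated or matched == 0:
        return 0.0
    precision = matched / len(estimated)   # hits / retrieved boundaries
    recall = matched / len(reference)      # hits / ground-truth boundaries
    return 2 * precision * recall / (precision + recall)
```

Running the same function with `tolerance=0.5` and `tolerance=3.0` yields the two reported figures.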
&lt;br /&gt;
==== Baseline ====&lt;br /&gt;
The performance of the method described in '''Kim, T., &amp;amp; Nam, J. (2023). All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio.''' will serve as a baseline for this task. Participants are encouraged to develop systems that surpass this baseline.&lt;br /&gt;
&lt;br /&gt;
=== Relevant Development Collections ===&lt;br /&gt;
While the MIREX datasets will be used for evaluation, participants may find the following publicly available annotated corpora useful for development. Please note that the annotations in these corpora will also need to be mapped to the 7-class functional labeling scheme if used for training models for this task.&lt;br /&gt;
&lt;br /&gt;
* Jouni Paulus's [http://www.cs.tut.fi/sgn/arg/paulus/structure.html structure analysis page] links to a corpus of 177 Beatles songs ([http://www.cs.tut.fi/sgn/arg/paulus/beatles_sections_TUT.zip zip file]). The TUTstructure07 dataset, containing 557 songs, is also listed [http://www.cs.tut.fi/sgn/arg/paulus/TUTstructure07_files.html here].&lt;br /&gt;
* Ewald Peiszer's [http://www.ifs.tuwien.ac.at/mir/audiosegmentation.html thesis page] links to a portion of his corpus: 43 non-Beatles pop songs (including 10 J-pop songs) ([http://www.ifs.tuwien.ac.at/mir/audiosegmentation/dl/ep_groundtruth_excl_Paulus.zip zip file]).&lt;br /&gt;
&lt;br /&gt;
These public corpora offer over 200 songs that can be adapted for development purposes.&lt;br /&gt;
&lt;br /&gt;
=== Time and Hardware Limits ===&lt;br /&gt;
Due to the nature of the CodeBench platform and the potentially high number of participants, limits on the runtime and computational resources for submissions may be imposed. Specific details regarding these limits will be provided closer to the submission deadline. A general guideline is that analysis should be computationally feasible. For reference, a hard limit of '''24 hours''' for total analysis time over the evaluation dataset was imposed in previous iterations, and a similar constraint might apply.&lt;/div&gt;</summary>
		<author><name>Ldzhangyx</name></author>
		
	</entry>
</feed>