2010:Structural Segmentation

Description

The aim of the MIREX structural segmentation evaluation is to identify the key structural sections in musical audio. The segment structure (or form) is one of the most important musical parameters. It is furthermore special because musical structure -- especially in popular music genres (e.g. verse, chorus, etc.) -- is accessible to everybody: it needs no particular musical knowledge. This task was first run in 2009.

Data

Collections

The final MIREX data set for structural segmentation is comprised of 297 songs. The majority come from the Beatles collection. Works from other artists round out the evaluation dataset.

Audio Formats

CD-quality (PCM, 16-bit, 44100 Hz)
single channel (mono)

Submission Format

Submissions to this task will have to conform to a specified format detailed below. Submissions should be packaged and contain at least two files: The algorithm itself and a README containing contact information and detailing, in full, the use of the algorithm.

Input Data

Participating algorithms will have to read audio in the following format:

Sample rate: 44.1 KHz
Sample size: 16 bit
Number of channels: 1 (mono)
Encoding: WAV

Output Data

The structural segmentation algorithms will return the segmentation in an ASCII text file for each input .wav audio file. The specification of this output file is immediately below.

Output File Format (Structural Segmentation)

The Structural Segmentation output file format is a tab-delimited ASCII text format. This is the same as Chris Harte's chord labelling files (.lab), and so is the same format as the ground truth as well. Onset and offset times are given in seconds, and the labels are simply letters: 'A', 'B', ... with segments referring to the same structural element having the same label.

Three column text file of the format

<onset_time(sec)>\t<offset_time(sec)>\t<label>\n
<onset_time(sec)>\t<offset_time(sec)>\t<label>\n
...

where \t denotes a tab, \n denotes the end of line. The < and > characters are not included. An example output file would look something like:

0.000    5.223    A
5.223    15.101   B
15.101   20.334   A

Algorithm Calling Format

The submitted algorithm must take as arguments a SINGLE .wav file to perform the structural segmentation on as well as the full output path and filename of the output file. The ability to specify the output path and file name is essential. Denoting the input .wav file path and name as %input and the output file path and name as %output, a program called foobar could be called from the command-line as follows:

foobar %input %output
foobar -i %input -o %output

Moreover, if your submission takes additional parameters, foobar could be called like:

foobar .1 %input %output
foobar -param1 .1 -i %input -o %output

If your submission is in MATLAB, it should be submitted as a function. Once again, the function must contain String inputs for the full path and names of the input and output files. Parameters could also be specified as input arguments of the function. For example:

foobar('%input','%output')
foobar(.1,'%input','%output')

README File

A README file accompanying each submission should contain explicit instructions on how to to run the program (as well as contact information, etc.). In particular, each command line to run should be specified, using %input for the input sound file and %output for the resulting text file.

For instance, to test the program foobar with a specific value for parameter param1, the README file would look like:

foobar -param1 .1 -i %input -o %output

For a submission using MATLAB, the README file could look like:

matlab -r "foobar(.1,'%input','%output');quit;"

Evaluation Procedures

At the last ISMIR conference Lukashevich proposed a measure for segmentation evaluation. Because of the complexity of the structural segmentation task definition, several different evaluation measures will be employed to address different aspects. It should be noted that none of the evaluation measures cares about the true labels of the sections: they only denote the clustering. This means that it does not matter if the systems produce true labels such as "chorus" and "verse", or arbitrary labels such as "A" and "B".

Boundary retrieval

Hit rate Found segment boundaries are accepted to be correct if they are within 0.5s (Turnbull et al. ISMIR2007) or 3s (Levy & Sandler TASLP2008) from a border in the ground truth. Based on the matched hits, boundary retrieval recall rate, boundary retrieval precision rate, and boundary retrieval F-measure are be calculated.

Median deviation Two median deviation measure between boundaries in the result and ground truth are calculated: median true-to-guess is the median time from boundaries in ground truth to the closest boundaries in the result, and median guess-to-true is similarly the median time from boundaries in the result to boundaries in ground truth. (Turnbull et al. ISMIR2007)

Frame clustering

Both the result and the ground truth are handled in short frames (e.g., beat or fixed 100ms). All frame pairs in a structure description are handled. The pairs in which both frames are assigned to the same cluster (i.e., have the same label) form the sets $P_{E}$ (for the system result) and $P_{A}$ (for the ground truth). The pairwise precision rate can be calculated by $P={\frac {|P_{E}\cap P_{A}|}{|P_{E}|}}$ , pairwise recall rate by $R={\frac {|P_{E}\cap P_{A}|}{|P_{A}|}}$ , and pairwise F-measure by $F={\frac {2PR}{P+R}}$ . (Levy & Sandler TASLP2008)

Normalised conditional entropies

Over- and under segmentation based evaluation measures proposed in Lukashevich ISMIR2008. Structure descriptions are represented as frame sequences with the associated cluster information (similar to the Frame clustering measure). Confusion matrix between the labels in ground truth and the result is calculated. The matrix C is of size |L_A| * |L_E|, i.e., number of unique labels in the ground truth times number of unique labels in the result. From the confusion matrix, the joint distribution is calculated by normalising the values with the total number of frames F:

$p_{i,j}=C_{i,j}/F$

Similarly, the two marginals are calculated:

$p_{i}^{a}=\sum _{j=1}^{|L_{E}|}C{i,j}/F$ , and

$p_{j}^{e}=\sum _{i=1}^{|L_{A}|}C{i,j}/F$

Conditional distributions:

$p_{i,j}^{a|e}=C_{i,j}/\sum _{i=1}^{|L_{A}|}C{i,j}$ , and

$p_{i,j}^{e|a}=C_{i,j}/\sum _{j=1}^{|L_{E}|}C{i,j}$

The conditional entropies will then be

$H(E|A)=-\sum _{i=1}^{|L_{A}|}p_{i}^{a}\sum _{j=1}^{|L_{E}|}p_{i,j}^{e|a}\log _{2}(p_{i,j}^{e|a})$ , and

$H(A|E)=-\sum _{j=1}^{|L_{E}|}p_{j}^{e}\sum _{i=1}^{|L_{A}|}p_{i,j}^{a|e}\log _{2}(p_{i,j}^{a|e})$

The final evaluation measures will then be the oversegmentation score

$S_{O}=1-{\frac {H(E|A)}{\log _{2}(|L_{E}|)}}$ , and the undersegmentation score

$S_{U}=1-{\frac {H(A|E)}{\log _{2}(|L_{A}|)}}$

Relevant Development Collections

Jouni Paulus's structure analysis page links to a corpus of 177 Beatles songs (zip file). The Beatles annotations are not a part of the TUTstructure07 dataset. That dataset contains 557 songs, a list of which is available here.

Ewald Peiszer's thesis page links to a portion of the corpus he used: 43 non-Beatles pop songs (including 10 J-pop songs) (zip file).

These public corpora give a combined 220 songs.

2010:Structural Segmentation

Contents

Description

Data

Collections

Audio Formats

Submission Format

Input Data

Output Data

Output File Format (Structural Segmentation)

Algorithm Calling Format

README File

Evaluation Procedures

Boundary retrieval

Frame clustering

Normalised conditional entropies

Relevant Development Collections

Navigation menu

Views

Personal tools

MIREX by Year

Results by Year

Account Request

Search

Navigation

Tools