2010:Structural Segmentation

From MIREX Wiki

Latest revision as of 04:27, 5 June 2010

Description

The aim of the MIREX structural segmentation evaluation is to identify the key structural sections in musical audio. The segment structure (or form) is one of the most important musical parameters. It is furthermore special because musical structure -- especially in popular music genres, with sections such as verse and chorus -- is accessible to everybody: recognising it requires no particular musical knowledge. This task was first run in 2009.

Data

Collections

The final MIREX data set for structural segmentation comprises 297 songs. The majority come from the Beatles collection; works from other artists round out the evaluation dataset.

A second dataset, donated by the QUERO project, is likely to be included. It provides segment boundaries for 100 songs from the "Popular Music" section of the RWC dataset. Since these annotations carry no grouping information for the segments, only boundary retrieval metrics will be calculated on them. More information about the annotations can be found at http://hal.inria.fr/docs/00/47/34/79/PDF/PI-1948.pdf .

Audio Formats

  • CD-quality (PCM, 16-bit, 44100 Hz)
  • single channel (mono)

Submission Format

Submissions to this task will have to conform to the format detailed below. Submissions should be packaged and contain at least two files: the algorithm itself and a README containing contact information and detailing, in full, the use of the algorithm.

Input Data

Participating algorithms will have to read audio in the following format:

  • Sample rate: 44.1 kHz
  • Sample size: 16 bit
  • Number of channels: 1 (mono)
  • Encoding: WAV

Output Data

The structural segmentation algorithms will return the segmentation in an ASCII text file for each input .wav audio file. The specification of this output file is immediately below.

Output File Format (Structural Segmentation)

The Structural Segmentation output file format is a tab-delimited ASCII text format. This is the same as Chris Harte's chord labelling files (.lab), and so is the same format as the ground truth as well. Onset and offset times are given in seconds, and the labels are simply letters: 'A', 'B', ... with segments referring to the same structural element having the same label.

Three column text file of the format

<onset_time(sec)>\t<offset_time(sec)>\t<label>\n
<onset_time(sec)>\t<offset_time(sec)>\t<label>\n
...

where \t denotes a tab, \n denotes the end of line. The < and > characters are not included. An example output file would look something like:

0.000    5.223    A
5.223    15.101   B
15.101   20.334   A
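As a concreteness check, this plain three-column format is easy to read programmatically. Below is a minimal Python sketch (illustrative only; the function name `parse_segments` is not part of the task specification):

```python
def parse_segments(text):
    """Parse <onset> <offset> <label> lines into (float, float, str) tuples."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        onset, offset, label = line.split()[:3]  # tolerates tabs or runs of spaces
        segments.append((float(onset), float(offset), label))
    return segments

example = "0.000\t5.223\tA\n5.223\t15.101\tB\n15.101\t20.334\tA\n"
segments = parse_segments(example)
```

A well-formed description should also satisfy the convention that each segment's offset equals the onset of the following segment.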

Algorithm Calling Format

The submitted algorithm must take as arguments a SINGLE .wav file to perform the structural segmentation on as well as the full output path and filename of the output file. The ability to specify the output path and file name is essential. Denoting the input .wav file path and name as %input and the output file path and name as %output, a program called foobar could be called from the command-line as follows:

foobar %input %output
foobar -i %input -o %output

Moreover, if your submission takes additional parameters, foobar could be called like:

foobar .1 %input %output
foobar -param1 .1 -i %input -o %output  

If your submission is in MATLAB, it should be submitted as a function. Once again, the function must accept string arguments for the full paths and names of the input and output files. Parameters can also be specified as input arguments of the function. For example:

foobar('%input','%output')
foobar(.1,'%input','%output')

README File

A README file accompanying each submission should contain explicit instructions on how to run the program (as well as contact information, etc.). In particular, each command line to run should be specified, using %input for the input sound file and %output for the resulting text file.

For instance, to test the program foobar with a specific value for parameter param1, the README file would look like:

foobar -param1 .1 -i %input -o %output

For a submission using MATLAB, the README file could look like:

matlab -r "foobar(.1,'%input','%output');quit;"

Evaluation Procedures

Lukashevich (ISMIR 2008) proposed a measure for segmentation evaluation. Because of the complexity of the structural segmentation task definition, several different evaluation measures will be employed to address its different aspects. Note that none of the evaluation measures cares about the actual names of the section labels: labels only indicate the grouping of segments. This means that it does not matter whether a system produces descriptive labels such as "chorus" and "verse", or arbitrary labels such as "A" and "B".

Boundary retrieval

Hit rate: found segment boundaries are accepted as correct if they are within 0.5 s (Turnbull et al., ISMIR 2007) or 3 s (Levy & Sandler, TASLP 2008) of a boundary in the ground truth. Based on the matched hits, the boundary retrieval recall rate, boundary retrieval precision rate, and boundary retrieval F-measure are calculated.
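The hit-rate computation can be sketched in Python as follows. This is an illustrative implementation, not the official MIREX evaluator; in particular, the greedy one-to-one matching of boundaries is an assumption here:

```python
def boundary_prf(reference, estimated, window=0.5):
    """Boundary retrieval precision, recall, and F-measure.

    An estimated boundary (in seconds) counts as a hit if it lies within
    `window` seconds of a ground-truth boundary; each ground-truth boundary
    may be matched at most once (greedy nearest-match assumption).
    """
    matched = set()  # indices of already-matched reference boundaries
    hits = 0
    for b in estimated:
        candidates = [(abs(b - r), i) for i, r in enumerate(reference)
                      if i not in matched and abs(b - r) <= window]
        if candidates:
            matched.add(min(candidates)[1])  # take the closest unmatched boundary
            hits += 1
    precision = hits / len(estimated) if estimated else 0.0
    recall = hits / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

Running it with a 0.5 s window on, e.g., ground truth [0.0, 5.2, 15.1, 20.3] and estimate [0.1, 5.0, 14.0] matches two of three estimated boundaries.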

Median deviation: two median deviation measures between boundaries in the result and the ground truth are calculated: the median true-to-guess deviation is the median time from each boundary in the ground truth to the closest boundary in the result, and the median guess-to-true deviation is, similarly, the median time from each boundary in the result to the closest boundary in the ground truth. (Turnbull et al., ISMIR 2007)
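The two median deviation measures can be sketched as below (again illustrative, not the official evaluator):

```python
import statistics

def median_deviations(reference, estimated):
    """Median true-to-guess and guess-to-true boundary deviations, in seconds."""
    # for each ground-truth boundary, distance to the closest estimated boundary
    true_to_guess = statistics.median(
        min(abs(r - e) for e in estimated) for r in reference)
    # for each estimated boundary, distance to the closest ground-truth boundary
    guess_to_true = statistics.median(
        min(abs(e - r) for r in reference) for e in estimated)
    return true_to_guess, guess_to_true
```

Note that the two measures differ whenever the boundary counts differ: a result with very many boundaries can achieve a small true-to-guess deviation while its guess-to-true deviation grows.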

Frame clustering

Both the result and the ground truth are handled in short frames (e.g., beat-length or fixed 100 ms). All frame pairs in a structure description are considered. The pairs in which both frames are assigned to the same cluster (i.e., have the same label) form the sets P_E (for the system result) and P_A (for the ground truth). The pairwise precision rate is then P = |P_E ∩ P_A| / |P_E|, the pairwise recall rate R = |P_E ∩ P_A| / |P_A|, and the pairwise F-measure F = 2PR / (P + R). (Levy & Sandler, TASLP 2008)
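These pairwise measures can be sketched in Python as follows (illustrative; the conversion of the segmentations into equal-length frame label sequences is assumed to have happened already):

```python
from itertools import combinations

def pairwise_prf(reference, estimated):
    """Pairwise precision/recall/F over equal-length frame label sequences."""
    assert len(reference) == len(estimated)
    pairs = list(combinations(range(len(reference)), 2))
    # P_A: frame pairs with the same label in the ground truth
    p_a = {(i, j) for i, j in pairs if reference[i] == reference[j]}
    # P_E: frame pairs with the same label in the system result
    p_e = {(i, j) for i, j in pairs if estimated[i] == estimated[j]}
    inter = len(p_a & p_e)
    precision = inter / len(p_e) if p_e else 0.0
    recall = inter / len(p_a) if p_a else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

Relabelling the estimate (say, every 'A' to 'X') leaves the scores unchanged, matching the label-invariance of the evaluation measures.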

Normalised conditional entropies

Over- and undersegmentation based evaluation measures were proposed in Lukashevich (ISMIR 2008). Structure descriptions are represented as frame sequences with the associated cluster information (as in the frame clustering measure). A confusion matrix between the labels in the ground truth and the result is calculated. The matrix C is of size |L_A| x |L_E|, i.e., the number of unique labels in the ground truth times the number of unique labels in the result. From the confusion matrix, the joint distribution is calculated by normalising the values with the total number of frames F:

p_{i,j} = C_{i,j} / F

Similarly, the two marginals are calculated:

p_i^a = \sum_{j=1}^{|L_E|} C_{i,j} / F, and
p_j^e = \sum_{i=1}^{|L_A|} C_{i,j} / F

Conditional distributions:

p_{i,j}^{a|e} = C_{i,j} / \sum_{i=1}^{|L_A|} C_{i,j}, and
p_{i,j}^{e|a} = C_{i,j} / \sum_{j=1}^{|L_E|} C_{i,j}

The conditional entropies will then be

H(E|A) = - \sum_{i=1}^{|L_A|} p_i^a \sum_{j=1}^{|L_E|} p_{i,j}^{e|a} \log_2(p_{i,j}^{e|a}), and
H(A|E) = - \sum_{j=1}^{|L_E|} p_j^e \sum_{i=1}^{|L_A|} p_{i,j}^{a|e} \log_2(p_{i,j}^{a|e})

The final evaluation measures will then be the oversegmentation score

S_O = 1 - H(E|A) / \log_2(|L_E|)

and the undersegmentation score

S_U = 1 - H(A|E) / \log_2(|L_A|)
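The scores above can be sketched in Python as follows. This is an illustrative implementation, not the official evaluator; in particular, returning 1.0 when a description uses only a single label (where the normaliser log2(1) would be zero) is a convention assumed here:

```python
import math
from collections import Counter

def over_under_segmentation(reference, estimated):
    """Oversegmentation S_O and undersegmentation S_U scores for
    equal-length frame label sequences (normalised conditional entropies)."""
    n = len(reference)
    joint = Counter(zip(reference, estimated))  # confusion counts C_{i,j}
    labels_a, labels_e = sorted(set(reference)), sorted(set(estimated))
    count_a, count_e = Counter(reference), Counter(estimated)

    h_e_given_a = 0.0  # H(E|A)
    h_a_given_e = 0.0  # H(A|E)
    for a in labels_a:
        for e in labels_e:
            c = joint[(a, e)]
            if not c:
                continue  # 0 * log(0) terms contribute nothing
            p_e_cond = c / count_a[a]  # p^{e|a}
            p_a_cond = c / count_e[e]  # p^{a|e}
            h_e_given_a -= (count_a[a] / n) * p_e_cond * math.log2(p_e_cond)
            h_a_given_e -= (count_e[e] / n) * p_a_cond * math.log2(p_a_cond)

    s_o = 1 - h_e_given_a / math.log2(len(labels_e)) if len(labels_e) > 1 else 1.0
    s_u = 1 - h_a_given_e / math.log2(len(labels_a)) if len(labels_a) > 1 else 1.0
    return s_o, s_u
```

A result identical to the ground truth scores (1.0, 1.0), while a result whose labels carry no information about the ground-truth labels scores (0.0, 0.0).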

Relevant Development Collections

  • Jouni Paulus's structure analysis page links to a corpus of 177 Beatles songs (zip file). The Beatles annotations are not a part of the TUTstructure07 dataset. That dataset contains 557 songs, a list of which is available here.
  • Ewald Peiszer's thesis page links to a portion of the corpus he used: 43 non-Beatles pop songs (including 10 J-pop songs) (zip file).

These public corpora give a combined 220 songs.

Time and hardware limits

Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be imposed.

A hard limit of 24 hours will be imposed on analysis times. Submissions exceeding this limit may not receive a result.

Submission opening date

Friday 4th June 2010

Submission closing date

TBA