Difference between revisions of "2009:Structural Segmentation"
Revision as of 10:27, 17 August 2009
The segment structure (or form) is one of the most important musical parameters. It is furthermore special because musical structure -- especially in popular music genres -- is accessible to everybody: it needs no particular musical knowledge.
Input: wave audio
Output: three-column text file of the format <onset_time> <offset_time> <label>, i.e. like Chris Harte's chord labelling files (.lab). onset_time and offset_time are given in seconds; labels are 'A', 'B', ..., where segments referring to the same structural element share the same label.
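As a rough illustration of the proposed output format, the following Python sketch reads such a three-column file into a list of segments. The names `Segment` and `parse_segments` and the example times are hypothetical, and whitespace-separated columns are an assumption, since the exact delimiter is not fixed above.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    onset: float   # segment start in seconds
    offset: float  # segment end in seconds
    label: str     # structural label, e.g. 'A' or 'B'


def parse_segments(text: str) -> List[Segment]:
    """Parse '<onset_time> <offset_time> <label>' lines (assumed whitespace-separated)."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        onset, offset, label = line.split(None, 2)
        segments.append(Segment(float(onset), float(offset), label))
    return segments


# Made-up example: an A-B-A structure.
example = """0.000 12.500 A
12.500 30.250 B
30.250 42.750 A
"""
segments = parse_segments(example)
```

Segments that share a label (here the first and third) refer to the same structural element.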
Ground truth data on audio is available for more than 200 songs, so given a quality measure everyone agrees on, evaluation wouldn't be harder than on other MIREX tasks. At the last ISMIR conference Lukashevich proposed a measure for segmentation evaluation.
Potential Participants
Matthias Mauch, Queen Mary, University of London --Matthias 08:49, 30 June 2009 (UTC)
Maarten Grachten, Johannes Kepler University, Linz, Austria -- Maarten
Geoffroy Peeters, IRCAM, Paris, France (depending on the kind of annotations)
Jouni Paulus, Tampere University of Technology, Finland
Stephan Huebler, Technical University of Dresden, Germany -- Stephan
Jordan Smith, McGill University, Montreal, Canada
Issues and Discussion
Thanks for the initiative! I might be interested in participating. Are you referring to segmentation of audio, or symbolic data? What set of annotated data did you refer to? [Maarten Grachten]
Yes, sorry, forgot to specify that. I'm mainly interested in audio, so I changed that above. --Matthias 11:04, 30 June 2009 (UTC)
The more the merrier: I could as well throw in the algo I implemented 2 years ago for my thesis [1]. I'm also curious about the annotated data mentioned. Thanks for your effort! --Ewald 17:33, 1 July 2009 (UTC)
Regarding ground truth: at Queen Mary we have the complete Beatles segmentations (with starts at bar beginnings), plus tens of other songs by Carole King, Queen, and Zweieck. We could leave the latter three untouched (i.e. I would not train my own algorithm on them), or publish them soon, so everyone can train their method on them. --Matthias 16:07, 7 August 2009 (UTC)
Defining the segment: In my opinion a segment would be a state with similar acoustical content (like in Lukashevich). I just want to make clear what the algo should do. --Stephan 10:04, 10 August 2009 (UTC)
Some notes: The proposed output with a Wavesurfer-like format is probably the best for this first go at the task. For the evaluation metric: I'd propose using both the F-measure for frame pairs (as per Levy&Sandler, doi:10.1109/TASL.2007.910781) and the over/under segmentation measure by Lukashevich, because they provide slightly different information. Both of these assume a "state" based description of the structure, so hierarchical differences will not be handled very gracefully (hierarchical differences do exist if different persons annotate the same piece, and a better metric should perhaps be developed at some point). Still, for the sake of simplicity these would be adequate for the task. The question of the data is a bit more interesting. We used three different data sets in a recent publication (doi:10.1109/TASL.2009.2020533): a large in-house set that can't be distributed even for MIREX, 174 songs by The Beatles from UPF and from TUT, and RWC Pop. The last two of these are publicly available, so basically anybody could train the system with them (if there is something to train). --Paulus 13:21, 10 August 2009 (UTC)
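The frame-pair F-measure mentioned above can be sketched roughly as follows. This is a minimal illustration, not the reference implementation from Levy&Sandler: it assumes both descriptions have already been sampled to per-frame label sequences of equal length, and the function name `pairwise_f_measure` is hypothetical. A pair of frames counts as "similar" in a description if both frames carry the same label. Precision is the fraction of similar pairs in the estimate that are also similar in the annotation, and recall is the converse.

```python
from collections import Counter
from typing import Sequence, Tuple


def _num_pairs(n: int) -> int:
    """Number of unordered pairs among n frames."""
    return n * (n - 1) // 2


def pairwise_f_measure(ref: Sequence[str], est: Sequence[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F) over frame pairs sharing a label."""
    if len(ref) != len(est):
        raise ValueError("reference and estimate must have the same number of frames")
    pairs_ref = sum(_num_pairs(c) for c in Counter(ref).values())
    pairs_est = sum(_num_pairs(c) for c in Counter(est).values())
    # Pairs similar in both descriptions: frames agreeing on the (ref, est) label combination.
    pairs_both = sum(_num_pairs(c) for c in Counter(zip(ref, est)).values())
    precision = pairs_both / pairs_est if pairs_est else 0.0
    recall = pairs_both / pairs_ref if pairs_ref else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Made-up example: the estimate merges two annotated parts into one (under-segmentation).
reference = list("AABB")
estimate = list("AAAA")
precision, recall, f_measure = pairwise_f_measure(reference, estimate)
```

In this toy case recall is perfect while precision drops, which is the kind of asymmetry the over/under segmentation measure is meant to make explicit.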
Some comments:
1) Using acoustical similarities would be best. We must therefore be careful with test sets that mix acoustical-similarity descriptions with timeline-based descriptions such as "intro" or "outro"; how do we deal with these timeline-based descriptions? A deep analysis of the content of each test set will be necessary for that. We could share this work among those interested.
2) Concerning evaluation, I would be in favor of having:
- A) a measure of the segmentation precision for a set of precision windows (using Recall, Precision, F-measure curves versus Precision Window)
- B) a labeling/segmentation measure: for this the Normalized Conditional Entropy [Lukashevich2008] is OK, or the more demanding "modeling error" obtained by aligning (without re-use) the annotated and estimated labels [Peeters2007]
3) Defining the number of labels would help a lot (some test sets use a very restricted vocabulary, others a very large one), as would the possibility of outputting an estimated hierarchical structure.
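The boundary measure in point 2A could look roughly like the sketch below: precision, recall and F-measure of the estimated segment boundaries, computed for several precision windows. This is only an illustration under stated assumptions. The function name `boundary_prf`, the window values and the boundary times are made up, and the greedy one-to-one matching of sorted boundaries is one possible matching rule, not one agreed on in this discussion.

```python
from typing import Sequence, Tuple


def boundary_prf(ref: Sequence[float], est: Sequence[float], window: float) -> Tuple[float, float, float]:
    """Precision, recall and F-measure of estimated boundaries.

    An estimated boundary is a hit if it lies within +/- window seconds of an
    annotated boundary; each boundary is used in at most one match.
    """
    ref_sorted, est_sorted = sorted(ref), sorted(est)
    i = j = hits = 0
    while i < len(ref_sorted) and j < len(est_sorted):
        if abs(ref_sorted[i] - est_sorted[j]) <= window:
            hits += 1  # match found: consume both boundaries
            i += 1
            j += 1
        elif est_sorted[j] < ref_sorted[i]:
            j += 1     # estimated boundary too early to match anything later
        else:
            i += 1     # annotated boundary has no estimate close enough
    precision = hits / len(est_sorted) if est_sorted else 0.0
    recall = hits / len(ref_sorted) if ref_sorted else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Made-up boundary times in seconds.
annotated = [0.0, 12.5, 30.25, 42.75]
estimated = [0.2, 13.4, 29.0, 36.0, 42.5]

# One (precision, recall, F) triple per precision window gives the curves.
curves = {w: boundary_prf(annotated, estimated, w) for w in (0.5, 1.0, 3.0)}
```

Plotting the three values against the window width gives the curves suggested in 2A; a wider window can only keep or increase the number of hits.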

