2007:Audio Music Mood Classification

Introduction

In music psychology and music education, the emotion component of music has been recognized as the most strongly associated with music expressivity (e.g. Juslin et al. 2006, see #Related Papers). Music information behavior studies (e.g. Cunningham, Jones and Jones 2004; Vignoli 2004; Cunningham, Bainbridge and Falconer 2006, see #Related Papers) have also identified music mood/emotion as an important criterion used by people in music seeking and organization. Several experiments have been conducted in the MIR community to classify music by mood (e.g. Lu, Liu and Zhang 2006; Pohle, Pampalk and Widmer 2005; Mandel, Poliner and Ellis 2006; Feng, Zhuang and Pan 2003, see #Related Papers). Please note: the MIR community tends to use the word "mood" while music psychologists prefer "emotion". We follow the MIR tradition and use "mood" hereafter.

However, evaluating music mood classification is difficult because music mood is a very subjective notion. Each of the aforementioned experiments used different mood categories and different datasets, making comparison with previous work virtually impossible. A contest on music mood classification in MIREX will help build the first community-available test set and valuable ground truth for this task.

This is the first time MIREX has attempted a music mood classification evaluation. There are many issues involved in this evaluation task, so let us start discussing them on this wiki. If needed, we will set up a mailing list devoted to the discussion.

Mood Categories

The IMIRSEL has derived a set of 5 mood clusters from the AMG mood repository (Hu & Downie 2007, see #Related Papers). The mood clusters effectively reduce the diverse mood space to a tangible set of categories, yet remain rooted in the social-cultural context of pop music. Therefore, we propose to use the 5 mood clusters as the categories in this year's audio mood classification contest. Each cluster is a collection of AMG mood labels which collectively define the cluster:

  • Cluster_1: passionate, rousing, confident, boisterous, rowdy
  • Cluster_2: rollicking, cheerful, fun, sweet, amiable/good natured
  • Cluster_3: literate, poignant, wistful, bittersweet, autumnal, brooding
  • Cluster_4: humorous, silly, campy, quirky, whimsical, witty, wry
  • Cluster_5: aggressive, fiery, tense/anxious, intense, volatile, visceral
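
For concreteness, here is a minimal Python sketch of how the clusters could be represented in an evaluation or training script. The variable names and the dict-of-lists layout are our own illustration, not a prescribed format.

  # Hypothetical representation of the five mood clusters as a Python dict,
  # mapping each cluster name to the AMG mood labels that define it.
  MOOD_CLUSTERS = {
      "Cluster_1": ["passionate", "rousing", "confident", "boisterous", "rowdy"],
      "Cluster_2": ["rollicking", "cheerful", "fun", "sweet", "amiable/good natured"],
      "Cluster_3": ["literate", "poignant", "wistful", "bittersweet", "autumnal", "brooding"],
      "Cluster_4": ["humorous", "silly", "campy", "quirky", "whimsical", "witty", "wry"],
      "Cluster_5": ["aggressive", "fiery", "tense/anxious", "intense", "volatile", "visceral"],
  }
  
  # Example: reverse lookup from an AMG mood label to its cluster.
  LABEL_TO_CLUSTER = {label: cluster
                      for cluster, labels in MOOD_CLUSTERS.items()
                      for label in labels}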

So far, the IMIRSEL and Cyril Laurier at the Music Technology Group in Barcelona have manually validated the mood clusters and the exemplar songs in each cluster. Please see #Exemplar Songs in Each Category for details.


Previous Discussion on Mood Taxonomy

Discussion on Mood Categories

Exemplar Songs in Each Category

Exemplar songs for each mood cluster are manually selected by multiple human assessors. The purpose is to further clarify the perceptual identities of the mood clusters.

There are 190 candidate songs in the intersection of the AMG mood repository and the USPOP collection held by the IMIRSEL, and each of these songs has a single, unanimous mood cluster label assigned by AMG editors. The labels by AMG editors are an important benchmark that can help us reach cross-listener consistency on such a subjective task. So far, 6 human assessors have listened to the 190 songs and assigned cluster labels to them. 49 songs were labeled unanimously by the 6 human assessors and AMG, and another 42 songs were labeled unanimously by the 6 human assessors. The song titles are listed on the exemplar songs page.

The advantages of the exemplar songs are twofold: 1. they help people better understand what kind of mood each cluster refers to; 2. they can possibly be taken as training data for the algorithms (see the #Training Set section).

Note on lyrics: when labeling the songs, the human assessors were asked to ignore lyrics. As this contest focuses on music audio, lyrics should not be taken into consideration.


Previous Discussion on Ground Truth

Training Set

Some potential participants have requested that a training set be provided, and the exemplar songs described above can serve as (seeds of) training data. However, due to copyright issues, we cannot distribute the audio files of the exemplar songs. There are two ways to address this issue:

1) the IMIRSEL announces the bibliographic information of the exemplar songs (e.g. title, artist), and the participants locate the audio files themselves and/or find other audio clips guided by the exemplar songs (i.e. seeds) for training purposes. Participants train their models in house and submit the trained models.

2) the IMIRSEL announces the bibliographic information of the exemplar songs (e.g. title, artist) to help participants understand the mood categories. The IMIRSEL prepares a certain number (e.g. 30) of short audio clips (e.g. 30 seconds) for each mood cluster. Participating algorithms/models are trained and tested within IMIRSEL.


Song Pool

The pool of songs to be classified comes from the same collection as the exemplar songs. Currently, the contest organizers are seeking additional songs in genres other than pop music to supplement the USPOP collection. Having songs from a variety of genres in each mood cluster will make the contest harder and more interesting. However, due to time and resource constraints, the song pool may still end up being dominated by pop music, which hopefully is still of interest to most participants.

Proposed audio format: 30-second clips, 22.05 kHz, mono, 16-bit WAV files
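
As an illustration only (not part of any official submission requirement), a short Python sketch using the standard-library wave module could check whether a clip matches the proposed format; the file path and duration tolerance below are hypothetical.

  import wave
  
  EXPECTED_RATE = 22050      # 22.05 kHz sampling rate
  EXPECTED_CHANNELS = 1      # mono
  EXPECTED_SAMPWIDTH = 2     # 16-bit samples = 2 bytes
  EXPECTED_SECONDS = 30.0    # 30-second clips
  
  def check_clip(path, tolerance=0.5):
      """Return a list of problems found in a WAV clip (empty list means OK)."""
      problems = []
      with wave.open(path, "rb") as w:
          if w.getframerate() != EXPECTED_RATE:
              problems.append("sample rate is %d Hz" % w.getframerate())
          if w.getnchannels() != EXPECTED_CHANNELS:
              problems.append("%d channels (expected mono)" % w.getnchannels())
          if w.getsampwidth() != EXPECTED_SAMPWIDTH:
              problems.append("%d-bit samples" % (8 * w.getsampwidth()))
          duration = w.getnframes() / float(w.getframerate())
          if abs(duration - EXPECTED_SECONDS) > tolerance:
              problems.append("duration is %.1f s" % duration)
      return problems
  
  # Example with a hypothetical path:
  # print(check_clip("clips/song_001.wav"))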

We will randomly select a certain number of songs from the USPOP and other (to-be-decided) collections as the audio pool. This number should make the contest interesting enough but not too hard, and the songs need to cover all 5 mood clusters.

Classification Results

Each algorithm will return the top X songs in each cluster.

This is a single-label classification contest, and thus each song can only be classified into one mood cluster.

Note: unlike traditional classification problems where all testing samples have ground truth available, this contest does not have a well-labeled testing set, and we are unlikely to be able to make such a set this year. (This year's contest will, however, produce a human-assessed ground truth set for future use.) Instead, we use a "pooling" approach as in TREC and last year's audio similarity and retrieval contest. This approach collects the top X results from each algorithm and asks human assessors to make judgments on this set of collected results, while assuming all other samples are irrelevant or incorrect. This approach cannot measure absolute "recall" metrics, but it is valid for comparing relative performance among participating algorithms.
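
A minimal sketch of the pooling step, assuming each algorithm's output is a mapping from cluster name to its list of top-X song IDs; the data layout and function name are our own illustration.

  # Build the judgment pool: the union of every algorithm's top-X candidates,
  # so each candidate song is judged once no matter how many algorithms return it.
  def build_judgment_pool(submissions):
      """submissions: dict algorithm_name -> dict cluster_name -> list of song IDs."""
      pool = set()
      for clusters in submissions.values():
          for songs in clusters.values():
              pool.update(songs)
      return pool
  
  # Hypothetical example with X = 2 and two algorithms:
  submissions = {
      "algo_A": {"Cluster_1": ["s01", "s02"], "Cluster_2": ["s03", "s04"]},
      "algo_B": {"Cluster_1": ["s02", "s05"], "Cluster_2": ["s03", "s06"]},
  }
  print(sorted(build_judgment_pool(submissions)))
  # ['s01', 's02', 's03', 's04', 's05', 's06'] -- 6 unique songs instead of 8 judgments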

The actual value of X depends on the human assessment protocol and the number of available human assessors (see the next section, #Human Assessment).

Human Assessment

Subjective judgments by human assessors will be collected for the pooled results using a web-based system, Evalutron 7000, to be developed by the IMIRSEL. (An introduction to a similar system, Evalutron 6000, is available here [1].)

How many judgments and assessors

Each algorithm returns X songs for each of the 5 mood clusters. Suppose there are Y algorithms; in the worst case (no overlap among the submissions), there will be X*Y songs per cluster, or 5*X*Y songs in total, to be judged. Suppose each song needs Z sets of ears; there will then be 5*X*Y*Z judgments in total. When making a judgment, a human assessor will listen to the 30-second clip of a song and label it with one of the 5 mood clusters.

Human evaluators will be drawn from the participating labs and from volunteers at IMIRSEL or on the MIREX lists. Suppose we can get W evaluators; then each evaluator will make S = (5*X*Y*Z) / W judgments.

At this moment, there are 10 potential participants on the Wiki; assuming not all of them submit, let's say Y = 6. Suppose each candidate song will be evaluated by 3 judges, i.e. Z = 3, and suppose we can get 20 assessors, W = 20:

  • If X = 20, number of judgments for each assessor: S = 90
  • If X = 10, S = 45
  • If X = 30, S = 135
  • If X = 50, S = 225
  • If X = 15, S = 67.5
  • …
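
The sketch below simply reproduces these figures from the formula S = (5*X*Y*Z) / W; the values of Y, Z and W are the working assumptions stated above and may change.

  # Judgments per assessor: S = (5 * X * Y * Z) / W
  Y = 6    # assumed number of participating algorithms
  Z = 3    # judges per candidate song
  W = 20   # available human assessors
  
  def judgments_per_assessor(X, Y=Y, Z=Z, W=W):
      return (5 * X * Y * Z) / W
  
  for X in (10, 15, 20, 30, 50):
      print("X = %2d  ->  S = %.1f" % (X, judgments_per_assessor(X)))
  # X = 10  ->  S = 45.0
  # X = 15  ->  S = 67.5
  # X = 20  ->  S = 90.0
  # X = 30  ->  S = 135.0
  # X = 50  ->  S = 225.0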

In last year's audio similarity contest, each assessor made 205 judgments on average. As judging mood is trickier, we may need to place less of a burden on our assessors.

To eliminate possible bias, we will try to distribute the candidates returned by each algorithm equally among the human assessors.

Scoring

Each algorithm is graded by the number of votes its candidate songs win from the judges. For example, if a song A is judged to be in Cluster_1 by 2 assessors and in Cluster_2 by 1 assessor, then the algorithm classifying A into Cluster_1 scores 2 on this song, while the algorithm classifying A into Cluster_2 scores 1. An algorithm's final score is the sum of its scores over all the songs it submits. Since each algorithm can only submit 100 songs, the one that wins the most votes from the judges wins the contest.
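
A minimal sketch of this tally, assuming each algorithm's submission is a mapping from song ID to predicted cluster and the assessors' judgments are stored per song; the data layout is our own illustration, not a defined file format.

  from collections import Counter
  
  def score_algorithm(predictions, judgments):
      """predictions: dict song_id -> predicted cluster name.
      judgments: dict song_id -> list of cluster labels given by the assessors.
      For each song, the algorithm earns one point per assessor who agrees
      with its predicted cluster; the final score is the sum over all songs."""
      total = 0
      for song, predicted_cluster in predictions.items():
          votes = Counter(judgments.get(song, []))
          total += votes[predicted_cluster]
      return total
  
  # Hypothetical example matching the one in the text: song "A" is judged
  # Cluster_1 by 2 assessors and Cluster_2 by 1 assessor.
  judgments = {"A": ["Cluster_1", "Cluster_1", "Cluster_2"]}
  print(score_algorithm({"A": "Cluster_1"}, judgments))  # 2
  print(score_algorithm({"A": "Cluster_2"}, judgments))  # 1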

File Format

TBD

Submission Format

TBD

Challenging Issues

  1. Mood-changeable pieces: some pieces may start in one mood but end up in another. For each of those, we can either label it with the most salient mood or simply let inconsistent judgments rule it out.
  2. Multiple-label classification: it is possible that one piece has two or more correct mood labels, but as a start we strongly suggest holding a less confusing contest and leaving this challenge to future MIREXs.

Participants

If you think there is a slight chance that you might consider participating, please add your name and email address here.

  • Kris West (kw at cmp dot uea dot ac dot uk)
  • Cyril Laurier (claurier at iua dot upf dot edu)
  • Elias Pampalk (firstname.lastname@gmail.com)
  • Yuriy Molchanyuk (molchanyuk at onu.edu.ua)
  • Shigeki Sagayama (sagayama at hil dot t.u-tokyo.ac.jp)
  • Guillaume Nargeot (killy971 at gmail dot com)
  • Zhongzhe Xiao (zhongzhe dot xiao at ec-lyon dot fr)
  • Kyogu Lee (kglee at ccrma.stanford.edu)
  • Vitor Soares (firstname.lastname@clustermedialabs.com)
  • Wai Cheung (wlche1@infotech.monash.edu.au)

Moderators

  • J. Stephen Downie (IMIRSEL, University of Illinois, USA) - [2]
  • Xiao Hu (IMIRSEL, University of Illinois, USA) - [3]
  • Cyril Laurier (Music Technology Group, Barcelona, Spain) - [4]

Related Papers

  1. Dietterich, T. (1997). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895-1924.
  2. Hu, Xiao and J. Stephen Downie (2007). Exploring mood metadata: Relationships with genre, artist and usage metadata. Accepted in the Eighth International Conference on Music Information Retrieval (ISMIR 2007),Vienna, September 23-27, 2007.
  3. Juslin, P. N., Karlsson, J., Lindström, E., Friberg, A. and Schoonderwaldt, E. (2006). Play It Again With Feeling: Computer Feedback in Musical Communication of Emotions. Journal of Experimental Psychology: Applied, Vol. 12, No. 2, 79-95.
  4. Vignoli (ISMIR 2004) Digital Music Interaction Concepts: A User Study
  5. Cunningham, Jones and Jones (ISMIR 2004) Organizing Digital Music For Use: An Examination of Personal Music Collections.
  6. Cunningham, Bainbridge and Falconer (ISMIR 2006) 'More of an Art than a Science': Supporting the Creation of Playlists and Mixes.
  7. Lu, Liu and Zhang (2006), Automatic Mood Detection and Tracking of Music Audio Signals. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 1, January 2006.
    Part of this paper appeared in ISMIR 2003 http://ismir2003.ismir.net/papers/Liu.PDF
  8. Pohle, Pampalk, and Widmer (CBMI 2005) Evaluation of Frequently Used Audio Features for Classification of Music into Perceptual Categories.
    It separates "mood" and "emotion" as two classification dimensions, which are mostly combined in other studies.
  9. Mandel, Poliner and Ellis (2006) Support vector machine active learning for music retrieval. Multimedia Systems, Vol.12(1). Aug.2006.
  10. Feng, Zhuang and Pan (SIGIR 2003) Popular music retrieval by detecting mood
  11. Li and Ogihara (ISMIR 2003) Detecting emotion in music
  12. Hilliges, Holzer, Klüber and Butz (2006) AudioRadar: A metaphorical visualization for the navigation of large music collections. In Proceedings of the International Symposium on Smart Graphics 2006, Vancouver, Canada.
    It summarizes implicit problems in traditional genre/artist-based music organization.
  13. Juslin, P. N., & Laukka, P. (2004). Expression, perception, and induction of musical emotions: A review and a questionnaire study of everyday listening. Journal of New Music Research, 33(3), 217-238.