2007:Audio Music Mood Classification
Introduction
In music psychology and music education, the emotional component of music has been recognized as the one most strongly associated with music expressivity (e.g. Juslin et al. 2006 #Related Papers). Music information behavior studies (e.g. Cunningham, Jones and Jones 2004; Vignoli 2004; Cunningham, Bainbridge and Falconer 2006 #Related Papers) have also identified music mood/emotion as an important criterion used by people in music seeking and organization. Several experiments have been conducted in the MIR community to classify music by mood (e.g. Lu, Liu and Zhang 2006; Pohle, Pampalk and Widmer 2005; Mandel, Poliner and Ellis 2006; Feng, Zhuang and Pan 2003 #Related Papers). Please note: the MIR community tends to use the word "mood", while music psychologists prefer "emotion". We follow the MIR tradition and use "mood" hereafter.
However, evaluating music mood classification is difficult because music mood is a very subjective notion. Each of the aforementioned experiments used different mood categories and different datasets, making comparison with previous work virtually impossible. A contest on music mood classification in MIREX will help build the first community-available test set and valuable ground truth.
This is the first time MIREX has attempted a music mood classification evaluation. There are many issues involved in this evaluation task, so let us start discussing them on this wiki. If needed, we will set up a mailing list devoted to the discussion.
Mood Categories
The IMIRSEL has derived a set of 5 mood clusters from the AMG mood repository (Hu & Downie 2007 #Related Papers). The mood clusters effectively reduce the diverse mood space to a tangible set of categories, and yet remain rooted in the social-cultural context of pop music. Therefore, we propose to use the 5 mood clusters as the categories in this year's audio mood classification contest. Each cluster is a collection of AMG mood labels which collectively define the cluster:
- Cluster_1: passionate, rousing, confident, boisterous, rowdy
- Cluster_2: rollicking, cheerful, fun, sweet, amiable/good natured
- Cluster_3: literate, poignant, wistful, bittersweet, autumnal, brooding
- Cluster_4: humorous, silly, campy, quirky, whimsical, witty, wry
- Cluster_5: aggressive, fiery, tense/anxious, intense, volatile, visceral
At this moment, the IMIRSEL and Cyril Laurier at the Music Technology Group of Barcelona have manually validated the mood clusters and exemplar songs in each cluster. Please see #Exemplar Songs in Each Category for details.
Previous Discussion on Mood Taxonomy
Exemplar Songs in Each Category
Exemplar songs for each mood cluster are manually selected by multiple human assessors. The purpose is to further clarify the perceptual identities of the mood clusters.
There are 190 candidate songs in the intersection of the AMG mood repository and the USPOP collection at IMIRSEL, and each of these songs has a single, unanimous mood cluster label assigned by AMG editors. The mood labels by AMG editors are an important benchmark that can help us reach cross-listener consistency on such a subjective task. So far, 6 human assessors have listened to the 190 songs and assigned cluster labels to them. 49 songs are unanimously labeled by the 6 human assessors and AMG, and another 42 songs are unanimously labeled by the 6 human assessors. The song titles are listed in exemplar songs.
The advantages of the exemplar songs are twofold: 1. they will help people better understand what kind of mood each cluster refers to; 2. they can possibly be taken as training data for the algorithms (see the section #Training Set).
Note on lyrics: when labeling the songs, the human assessors were asked to ignore lyrics. As this contest focuses on music audio, lyrics should not be taken into consideration.
Previous Discussion on Ground Truth
Training Set
Some potential participants have requested that a training set be provided, and the exemplar songs described above can serve as (seeds of) training data. However, due to copyright issues, we cannot distribute the audio files of the exemplar songs. There are two ways to address this issue:
1) the IMIRSEL announces the bibliographic information of the exemplar songs (e.g. title, artist), and the participants locate the audio files by themselves and/or possibly find other audio clips guided by the exemplar songs (i.e. seeds) for training purposes. Participants train their models in house and submit trained models.
2) the IMIRSEL announces the bibliographic information of the exemplar songs (e.g. title, artist) to help participants understand the mood categories. The IMIRSEL prepares a certain number (e.g. 30) of short audio clips (e.g. 30 seconds) for each mood cluster. Participating algorithms/models are trained and tested within IMIRSEL.
Song Pool
The pool of songs to be classified comes from the same collection as the exemplar songs. Currently, the contest organizers are seeking additional songs in genres other than pop music to supplement the USPOP collection. Having songs in a variety of genres in each mood cluster will make the contest harder and more interesting. However, due to time and resource constraints, the song pool may still end up being dominated by pop music, which hopefully is still of interest to most participants.
Proposed audio format: 30 second clips, 22.05 kHz, mono, 16-bit, WAV files
We will randomly select a certain number of songs from the USPOP and other (to-be-decided) collections as the audio pool. This number should make the contest interesting enough, but not too hard. And the songs need to cover all 5 mood clusters.
Classification Results
Each algorithm will return the top X songs in each cluster.
This is a single-label classification contest, and thus each song can only be classified into one mood cluster.
Note: unlike traditional classification problems where all testing samples have ground truth available, this contest does not have a well-labeled testing set, and we are unlikely to be able to make such a set this year. (But this year's contest will produce a human-assessed ground truth set for future use.) Instead, we use a "pooling" approach like the one used in TREC and in last year's audio similarity and retrieval contest. This approach collects the top X results from each algorithm and asks human assessors to make judgments on this set of collected results, while assuming all other samples are irrelevant or incorrect. This approach cannot measure absolute "recall" metrics, but it is valid for comparing relative performance among participating algorithms.
The actual value of X depends on the human assessment protocol and the number of available human assessors (see the next section, #Human Assessment).
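The pooling step described above can be sketched as follows. This is only a minimal illustration, not the organizers' implementation: the function name `build_pool`, the input layout (a mapping from algorithm name to ranked song lists per cluster) and the song IDs are all hypothetical.

```python
from collections import defaultdict


def build_pool(results, top_x):
    """Union the top-X songs returned by every algorithm for each cluster.

    results: {algorithm_name: {cluster_name: [song_id, ...] ranked best-first}}
    Returns {cluster_name: set of song_ids} to be shown to human assessors.
    """
    pool = defaultdict(set)
    for per_cluster in results.values():
        for cluster, ranked_songs in per_cluster.items():
            # Keep only the X best candidates from this algorithm.
            pool[cluster].update(ranked_songs[:top_x])
    return dict(pool)


# Hypothetical example: two algorithms, one cluster, X = 2.
example = {
    "algo_a": {"Cluster_1": ["s1", "s2", "s3"]},
    "algo_b": {"Cluster_1": ["s2", "s4", "s5"]},
}
pool = build_pool(example, top_x=2)
```

Because algorithms may return the same song, the pool can be smaller than the worst case in which no results overlap.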
Human Assessment
Subjective judgments by human assessors will be collected for the pooled results using a web-based system, Evalutron 7000, to be developed by the IMIRSEL. (An introduction to a similar system, Evalutron 6000, is shown here [1].)
How many judgments and assessors
Each algorithm returns X songs for each of the 5 mood clusters. Suppose there are Y algorithms; in the worst case, where no results overlap, there will be X*Y songs to be judged in each cluster and 5*X*Y songs overall. Suppose each song needs Z sets of ears; there will then be 5*X*Y*Z judgments in total. When making a judgment, a human assessor will listen to the 30 second clip of a song and label it with one of the 5 mood clusters.
Human evaluators will be drawn from the participating labs and from volunteers at IMIRSEL or on the MIREX lists. Suppose we can get W evaluators; each evaluator will then make S = (5*X*Y*Z) / W judgments.
At this moment, there are 10 potential participants on the Wiki, so let's say Y = 6. Suppose each candidate song will be evaluated by 3 judges (Z = 3), and suppose we can get 20 assessors (W = 20):
- If X = 20, number of judgments for each assessor: S = 90
- If X = 10, S = 45
- If X = 30, S = 135
- If X = 50, S = 225
- If X = 15, S = 67.5
- …
In the audio similarity contest last year, each assessor made 205 judgments on average. As judging mood is trickier, we may need to give our assessors a lighter burden.
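The workload figures listed above follow directly from S = (5*X*Y*Z) / W. The short sketch below reproduces them; the function name `judgments_per_assessor` is hypothetical, and the values Y = 6, Z = 3 and W = 20 are the assumptions stated in the text.

```python
def judgments_per_assessor(x, y, z, w, clusters=5):
    """Worst-case judgments per assessor: S = (clusters * X * Y * Z) / W."""
    return clusters * x * y * z / w


# Reproduce the scenarios from the text (Y = 6 algorithms, Z = 3 judges
# per song, W = 20 assessors) for several values of X.
for x in (10, 15, 20, 30, 50):
    print(f"X = {x}: S = {judgments_per_assessor(x, y=6, z=3, w=20)}")
```

A fractional result such as S = 67.5 simply means that some assessors would receive one more song than others.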
To eliminate possible bias, we will try to equally distribute candidates returned by each algorithm among human assessors.
Scoring
Each algorithm is graded by the number of votes its candidate songs win from the judges. For example, if a song, A, is judged as in Cluster_1 by 2 assessors and as in Cluster_2 by 1 assessor, then an algorithm classifying A as Cluster_1 will score 2 on this song, while an algorithm classifying A as Cluster_2 will score 1 on this song. An algorithm's final score is the sum of its scores on all the songs it submits. Since every algorithm submits the same number of songs (5*X, i.e. 100 if X = 20), the one that wins the most votes from the judges wins the contest.
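The vote-counting rule can be sketched as below. This is a minimal illustration under assumed data structures (dictionaries keyed by a hypothetical song ID); the function name `algorithm_score` is not part of any MIREX tooling.

```python
from collections import Counter


def algorithm_score(predictions, judgments):
    """Sum, over submitted songs, the number of judges who agree with the label.

    predictions: {song_id: cluster predicted by the algorithm}
    judgments:   {song_id: [cluster chosen by each judge]}
    """
    score = 0
    for song, predicted in predictions.items():
        votes = Counter(judgments.get(song, []))
        score += votes[predicted]  # Counter returns 0 for labels nobody chose
    return score


# Song A from the text: two judges say Cluster_1, one says Cluster_2.
judgments = {"A": ["Cluster_1", "Cluster_1", "Cluster_2"]}
score_1 = algorithm_score({"A": "Cluster_1"}, judgments)
score_2 = algorithm_score({"A": "Cluster_2"}, judgments)
```

Here `score_1` is 2 and `score_2` is 1, matching the example in the text.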
Evaluation Metrics
The algorithm score described in the previous section is a metric that facilitates direct comparison.
In addition, metrics frequently used in classification problems include accuracy, precision, recall and F-measures (which combine precision and recall). As mentioned above, the pooling approach yields only a relative recall measure; therefore, the single most important metric would be accuracy.
The original definition of accuracy is: Accuracy = # of correctly classified songs / # of all songs.
According to the above human assessment method, "correctly classified songs" in this contest can be defined as songs classified as the majority vote of the judges and, in the case of ties, songs classified as any of the tied votes. For example, suppose each song has 3 judges. If a song is labeled as Cluster_1 by at least 2 judges, then this song will be counted as correct for algorithms classifying it as Cluster_1; if a song is labeled as Cluster_1, Cluster_2 and Cluster_3 once by each of the judges, then this song will be counted as correct for algorithms classifying it as Cluster_1, Cluster_2 or Cluster_3.
Accuracy can be calculated over all clusters as a whole (micro average), or for each cluster separately and then averaged over the clusters (macro average).
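A minimal sketch of this accuracy computation follows. It rests on assumptions not fixed by the text: songs are grouped by the cluster the algorithm assigned them to when computing per-cluster accuracy, and the function names and song IDs are hypothetical.

```python
from collections import Counter


def accepted_labels(votes):
    """Labels counted as correct: the majority vote, or every tied top label."""
    counts = Counter(votes)
    top = max(counts.values())
    return {label for label, n in counts.items() if n == top}


def accuracies(predictions, judgments):
    """Return (overall accuracy, mean of per-cluster accuracies).

    predictions: {song_id: predicted cluster}, assumed non-empty
    judgments:   {song_id: [cluster chosen by each judge]}
    """
    per_cluster = {}  # predicted cluster -> [number correct, number submitted]
    for song, predicted in predictions.items():
        tally = per_cluster.setdefault(predicted, [0, 0])
        tally[0] += int(predicted in accepted_labels(judgments[song]))
        tally[1] += 1
    overall = (sum(c for c, _ in per_cluster.values())
               / sum(t for _, t in per_cluster.values()))
    cluster_mean = sum(c / t for c, t in per_cluster.values()) / len(per_cluster)
    return overall, cluster_mean


judgments = {
    "s1": ["Cluster_1", "Cluster_1", "Cluster_2"],  # majority: Cluster_1
    "s2": ["Cluster_1", "Cluster_2", "Cluster_3"],  # three-way tie
    "s3": ["Cluster_2", "Cluster_2", "Cluster_2"],  # unanimous: Cluster_2
}
predictions = {"s1": "Cluster_1", "s2": "Cluster_3", "s3": "Cluster_1"}
overall, cluster_mean = accuracies(predictions, judgments)
```

In this toy example two of three songs are correct overall, while the per-cluster accuracies are 1/2 for Cluster_1 and 1/1 for Cluster_3, so the two ways of averaging give different values.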
The significance of differences among systems will be tested, possibly using:
- a) McNemar's test
McNemar's test (Dietterich, 1997) is a statistical procedure that can validate the significance of differences between two classifiers. It was used in the Audio Genre Classification and Audio Artist Identification contests in MIREX 2005.
- b) Friedman's test
Friedman's test is used to detect differences in treatments across multiple test attempts (http://en.wikipedia.org/wiki/Friedman_test). It was used in the Audio Similarity, Audio Cover Song, and Query by Singing/Humming contests in MIREX 2006.
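As an illustration of option a), the sketch below computes McNemar's statistic with the usual continuity correction for two classifiers scored on the same songs. It is a sketch under stated assumptions rather than the procedure MIREX will run: the function name `mcnemar` and the example counts are made up, and the p-value uses the chi-square approximation with one degree of freedom.

```python
import math


def mcnemar(correct_a, correct_b):
    """McNemar's test with continuity correction for two classifiers.

    correct_a, correct_b: equal-length lists of booleans, one entry per song,
    saying whether each classifier labeled that song correctly.
    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both classifiers must be scored on the same songs")
    # Only discordant songs matter: one classifier right, the other wrong.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    if only_a + only_b == 0:
        return 0.0, 1.0  # the classifiers never disagree
    chi2 = (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
    # Chi-square survival function for 1 d.o.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up example: A alone is right on 15 songs, B alone on 5, both on 30.
correct_a = [True] * 15 + [False] * 5 + [True] * 30
correct_b = [False] * 15 + [True] * 5 + [True] * 30
chi2, p_value = mcnemar(correct_a, correct_b)
```

Songs that both classifiers get right, or both get wrong, do not affect the statistic; only the disagreements do.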
In addition, run time can be recorded and compared.
Submission Format
TBD
Challenging Issues
- Mood-changing pieces: some pieces may start in one mood but end in another.
We will use 30 second clips instead of whole songs. For the training set, the clips will be manually checked to be representative of the songs; for the test set, the clips will be extracted automatically from the middle of the songs, which is more likely to be representative.
- Multiple-label classification: it is possible that one piece can have two or more correct mood labels, but as a start, we strongly suggest holding a less confusing contest and leaving this challenge to future MIREXs.
So, for this year, this is a single-label classification problem.
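For the test set, taking a 30 second clip from the middle of a song amounts to a simple index calculation. The sketch below assumes the proposed 22.05 kHz sample rate; the function name `middle_clip_bounds` and the fallback for songs shorter than the clip are assumptions, not part of the proposal.

```python
def middle_clip_bounds(total_frames, sample_rate=22050, clip_seconds=30):
    """Return (start, end) frame indices of a clip centered in the song.

    If the song is shorter than the requested clip, the whole song is used.
    """
    clip_frames = min(total_frames, sample_rate * clip_seconds)
    start = (total_frames - clip_frames) // 2
    return start, start + clip_frames


# A 3-minute song at 22.05 kHz: the clip spans seconds 75 to 105.
start, end = middle_clip_bounds(180 * 22050)
```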
Participants
If you think there is a slight chance that you might consider participating, please add your name and email address here.
- Kris West (kw at cmp dot uea dot ac dot uk)
- Cyril Laurier (claurier at iua dot upf dot edu)
- Elias Pampalk (firstname.lastname@gmail.com)
- Yuriy Molchanyuk (molchanyuk at onu.edu.ua)
- Shigeki Sagayama (sagayama at hil dot t.u-tokyo.ac.jp)
- Guillaume Nargeot (killy971 at gmail dot com)
- Zhongzhe Xiao (zhongzhe dot xiao at ec-lyon dot fr)
- Kyogu Lee (kglee at ccrma.stanford.edu)
- Vitor Soares (firstname.lastname@clustermedialabs.com)
- Wai Cheung (wlche1@infotech.monash.edu.au)
Moderators
- J. Stephen Downie (IMIRSEL, University of Illinois, USA) - [2]
- Xiao Hu (IMIRSEL, University of Illinois, USA) - [3]
- Cyril Laurier (Music Technology Group, Barcelona, Spain) - [4]
Related Papers
- Dietterich, T. (1997). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895-1924.
- Hu, Xiao and J. Stephen Downie (2007). Exploring mood metadata: Relationships with genre, artist and usage metadata. Accepted in the Eighth International Conference on Music Information Retrieval (ISMIR 2007), Vienna, September 23-27, 2007.
- Juslin, P.N., Karlsson, J., Lindström, E., Friberg, A. and Schoonderwaldt, E. (2006). Play It Again With Feeling: Computer Feedback in Musical Communication of Emotions. Journal of Experimental Psychology: Applied, Vol. 12, No. 2, 79-95.
- Vignoli (ISMIR 2004) Digital Music Interaction Concepts: A User Study
- Cunningham, Jones and Jones (ISMIR 2004) Organizing Digital Music For Use: An Examination of Personal Music Collections.
- Cunningham, Bainbridge and Falconer (ISMIR 2006) 'More of an Art than a Science': Supporting the Creation of Playlists and Mixes.
- Lu, Liu and Zhang (2006). Automatic Mood Detection and Tracking of Music Audio Signals. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 1, January 2006.
Part of this paper appeared in ISMIR 2003: http://ismir2003.ismir.net/papers/Liu.PDF
- Pohle, Pampalk, and Widmer (CBMI 2005) Evaluation of Frequently Used Audio Features for Classification of Music into Perceptual Categories.
It separates "mood" and "emotion" as two classification dimensions, which are mostly combined in other studies.
- Mandel, Poliner and Ellis (2006). Support vector machine active learning for music retrieval. Multimedia Systems, Vol. 12(1), Aug. 2006.
- Feng, Zhuang and Pan (SIGIR 2003) Popular music retrieval by detecting mood
- Li and Ogihara (ISMIR 2003) Detecting emotion in music
- Hilliges, Holzer, Klüber and Butz (2006). AudioRadar: A metaphorical visualization for the navigation of large music collections. In Proceedings of the International Symposium on Smart Graphics 2006, Vancouver, Canada.
It summarized implicit problems in traditional genre/artist based music organization.
- Juslin, P. N., & Laukka, P. (2004). Expression, perception, and induction of musical emotions: A review and a questionnaire study of everyday listening. Journal of New Music Research, 33(3), 217-238.