2007:Audio Music Mood Classification

FINAL 2007 AMC EVALUATION SCENARIO OVERVIEW

This section clarifies what will happen for this year's "beta" run of the Audio Mood Classification (AMC) task.

  1. We will operate the AMC task as a classic train-test classification task.
  2. We will n-fold the runs with n to be determined by the size of the final data set, number of participants, etc.
  3. We will hand-craft the n-fold test-train split lists.
  4. We will NOT be doing post-run human mood judgments this year using the Evalutron 6000.
  5. Audio files: 30 sec., 22kHz, mono, 16 bit

Do take a look at the 2007:Audio Genre Classification task wiki as we are basing the underlying structure of this task on Audio Genre. In fact, an Audio Genre submission should work out of the box with Audio Mood Classification. Note: we really want folks to do a FEATURE EXTRACTION phase first against all the files and then have these features cached somewhere for re-use during the TRAIN-TEST phase. This way we can really speed up the n-fold processing. Thus, like GENRE, we need to pass three input files to your algorithms:

1. Feature extraction list file

The list file passed for feature extraction will be a simple ASCII list file. This file will contain one path per line with no header line.

2. Training list file

The list file passed for model training will be a simple ASCII list file. This file will contain one path per line, followed by a tab character and the mood label, again with no header line.

E.g. <example path and filename>\t<mood classification>

3. Test (classification) list file

The list file passed for testing classification will be a simple ASCII list file identical in format to the Feature extraction list file. This file will contain one path per line with no header line.

Classification output files

Participating algorithms should produce a simple ASCII list file identical in format to the Training list file. This file will contain one path per line, followed by a tab character and the MOOD label, again with no header line. E.g.:

<example path and filename>\t<mood classification>

The path to which this list file should be written must be accepted as a parameter on the command line.
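
To make the feature-caching idea concrete, here is a minimal Python sketch (not part of any official harness) of one way a submission could structure this: features are extracted once for every file in the feature extraction list and cached on disk, then re-used in each train-test fold. The function extract_features and the file names in the comments are hypothetical placeholders.

# Minimal sketch (not the official harness) of the recommended structure:
# extract features once for every file in the feature extraction list,
# cache them on disk, and re-use the cache in every train-test fold.
# extract_features() is a hypothetical placeholder for your own front end.
import os
import pickle

def extract_features(wav_path):
    raise NotImplementedError("replace with your own feature extractor")

def read_list(list_path):
    # one path per line, optionally followed by a tab and a mood label
    entries = []
    with open(list_path) as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if parts[0]:
                entries.append((parts[0], parts[1] if len(parts) > 1 else None))
    return entries

def cached_features(wav_path, cache_dir):
    cache_file = os.path.join(cache_dir, os.path.basename(wav_path) + ".pkl")
    if os.path.exists(cache_file):              # already extracted in an earlier fold
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    feats = extract_features(wav_path)
    with open(cache_file, "wb") as f:
        pickle.dump(feats, f)
    return feats

# feature extraction phase: warm the cache for every file in the list, e.g.
# for path, _ in read_list("featExtractionListFile.txt"):
#     cached_features(path, "path/to/cacheDir")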

Audio collection poll

Would you like to use 30 sec clips from tracks for analysis, to avoid mood change within tracks and reduce processing load?
  • Yes
  • No, I like 60 sec clips
  • No, I like the whole track

How important do you think cross-validation is?
  • Very important
  • Important
  • Not important

Would you like your algorithm(s) to be evaluated on a closed groundtruth set (as in traditional classification problems, both training and testing data are labeled well before the contest) or on an unlabeled audio pool (in the way described in the Evaluation Scenario 2 sections of this page)?
  • On a closed groundtruth set (the size of the set is smaller, but evaluation metrics are more rigorous and support cross-validation)
  • On an unlabeled audio pool (the size of the pool can be very big, but only a small portion will be judged by humans)
  • Both

If you would like a closed groundtruth set, what is the MINIMUM size of the set you can accept (including training and testing)?
  • 400 clips in total (~80 clips in each category)
  • 600 clips in total (~120 clips in each category)
  • 800 clips in total (~160 clips in each category)
  • 1000 clips in total (~200 clips in each category)
  • more than 1000 clips

If you would like an unlabeled audio pool, what is the MINIMUM size of training audio you can accept?
  • 30 clips in each category
  • 50 clips in each category
  • 80 clips in each category
  • 100 clips in each category
  • more than 100 clips in each category

What is your preferred audio format? (The less audio data to process, the larger the dataset can be.)
  • 22 kHz mono WAV
  • 22 kHz stereo WAV
  • 44 kHz mono WAV
  • 44 kHz stereo WAV
  • 22 kHz mono MP3 128 kb
  • 22 kHz stereo MP3 128 kb
  • 44 kHz mono MP3 128 kb
  • 44 kHz stereo MP3 128 kb

How many algorithms are you likely to submit? (For estimating the number of human assessors needed.)
  • 0
  • 1
  • 2
  • 3

Introduction

In music psychology and music education, the emotion component of music has been recognized as the one most strongly associated with music expressivity (e.g. Juslin et al. 2006, see #Related Papers). Music information behavior studies (e.g. Cunningham, Jones and Jones 2004; Vignoli 2004; Cunningham, Bainbridge and Falconer 2006, see #Related Papers) have also identified music mood/emotion as an important criterion used by people in music seeking and organization. Several experiments have been conducted in the MIR community to classify music by mood (e.g. Lu, Liu and Zhang 2006; Pohle, Pampalk, and Widmer 2005; Mandel, Poliner and Ellis 2006; Feng, Zhuang and Pan 2003, see #Related Papers). Please note: the MIR community tends to use the word "mood" while music psychologists prefer "emotion". We follow the MIR tradition and use "mood" hereafter.

However, evaluation of music mood classification is difficult because music mood is a very subjective notion. Each of the aforementioned experiments used different mood categories and different datasets, making comparisons across previous work virtually impossible. A contest on music mood classification in MIREX will help build the first community-available test set and valuable ground truth.

This is the first time MIREX has attempted a music mood classification evaluation. There are many issues involved in this evaluation task, so let us start discussing them on this wiki. If needed, we will set up a mailing list devoted to the discussion.

Mood Categories

The IMIRSEL has derived a set of 5 mood clusters from the AMG mood repository (Hu & Downie 2007, see #Related Papers). The mood clusters effectively reduce the diverse mood space to a tangible set of categories, yet remain rooted in the social-cultural context of pop music. Therefore, we propose to use the 5 mood clusters as the categories in this year's audio mood classification contest. Each cluster is a collection of the AMG mood labels which collectively define the cluster:

  • Cluster_1: passionate, rousing, confident, boisterous, rowdy
  • Cluster_2: rollicking, cheerful, fun, sweet, amiable/good natured
  • Cluster_3: literate, poignant, wistful, bittersweet, autumnal, brooding
  • Cluster_4: humorous, silly, campy, quirky, whimsical, witty, wry
  • Cluster_5: aggressive, fiery, tense/anxious, intense, volatile, visceral
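
For convenience, here is the same vocabulary as a simple Python lookup table; this is purely illustrative and contains exactly the labels listed above.

# Illustrative only: the five clusters as a Python lookup table,
# using the AMG mood labels listed above.
MOOD_CLUSTERS = {
    "Cluster_1": ["passionate", "rousing", "confident", "boisterous", "rowdy"],
    "Cluster_2": ["rollicking", "cheerful", "fun", "sweet", "amiable/good natured"],
    "Cluster_3": ["literate", "poignant", "wistful", "bittersweet", "autumnal", "brooding"],
    "Cluster_4": ["humorous", "silly", "campy", "quirky", "whimsical", "witty", "wry"],
    "Cluster_5": ["aggressive", "fiery", "tense/anxious", "intense", "volatile", "visceral"],
}

# e.g. map an individual AMG label back to its cluster
LABEL_TO_CLUSTER = {label: c for c, labels in MOOD_CLUSTERS.items() for label in labels}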

At this moment, the IMIRSEL and Cyril Laurier at the Music Technology Group of Barcelona have manually validated the mood clusters and exemplar songs in each cluster. Please see #Exemplar Songs in Each Category for details.

We are still seeking additional songs across different genres to enrich this set. During this process, the cluster with the least cross-listener consistency may be dropped, or two clusters that are often confused with each other may be combined.


2007:Previous Discussion on Mood Taxonomy

2007:Discussion on Mood Categories

Exemplar Songs in Each Category

Exemplar songs for each mood cluster are manually selected by multiple human assessors. The purpose is to further clarify the perceptual identities of the mood clusters.

There are 190 candidate songs in the intersection of the AMG mood repository and the USPOP collection in IMIRSEL, and each of these songs has only one unanimous mood cluster label assigned by AMG editors. The mood labels by AMG editors are an important benchmark that can help us reach cross-listener consistency on such a subjective task. So far, 6 human assessors have listened to the 190 songs and assigned cluster labels to them. 50 songs are labeled identically by all 6 human assessors, 42 songs by 5 of the 6 assessors, and another 40 songs by 4 of the 6. The song titles are listed in 2007:exemplar songs.

The advantages of the exemplar songs are twofold: 1. they help people better understand what kind of mood each cluster refers to; 2. they can possibly be taken as training data for the algorithms (see #Training Set).

Note on lyrics: when labeling the songs, the human assessors were asked to ignore lyrics. As this contest focuses on music audio, lyrics should not be taken into consideration.

2007:Previous Discussion on Ground Truth

Two Evaluation Scenarios

1. Evaluation on a closed groundtruth set. As in traditional classification problems, both training and testing data are labeled well before the contest.
   Pros: evaluation metrics are more rigorous and support cross-validation.
   Cons: the training/testing set is limited.

2. Training on a labeled set, but testing on an unlabeled audio pool. As in the audio similarity and retrieval contest, each algorithm returns a list of candidates in each mood category, then human assessors make judgments on the returned candidates.
   Pros: the testing pool can be arbitrarily big; the training set is bigger as well (it can be the whole groundtruth set of scenario 1).
   Cons: innovative but limited evaluation metrics (see below).

For both scenarios, this is a single-label classification contest, and thus each song can only be classified into one mood cluster.

We will go for scenario 1

Groundtruth Set

The IMIRSEL is preparing a ground-truth set of audio clips selected from the USPOP collection described above and the APM collection (www.apmmusic.com). The bibliographic information of the exemplar songs has been released above to help participants reach agreement on the meanings of the mood categories.

The APM audio set has been pre-labeled with the 5 mood clusters according to the metadata provided by APM, and covers a variety of genres: each category covers about 7 major genres (with 20-30 tracks each) and a few minor genres. To make the problem more interesting, the distribution among major genres within each category is made as even as possible.

To make sure the mood labels are correct, this APM audio collection will be subject to human validation before the contest. We prepared a set of 1250 audio clips (250 per category). The audio clips whose mood category assignments reach agreement among at least 2 of the 3 human assessors will form the ground truth set. We are aiming for at least 120 audio clips in each mood category.

After the human validation on this audio set, participating algorithms/ models will be trained and tested within IMIRSEL.

Audio format: 30 second clips, 22.05kHz, mono, 16bit, WAV files

Human Validation

Subjective judgments by human assessors will be collected for the above-mentioned APM audio set using a web-based system, Evalutron 6000, developed by the IMIRSEL. (An introduction to Evalutron 6000 is given at 2007:Evalutron6000_Walkthrough_For_Audio_Mood_Classification.)

Each audio clip is 30 seconds long, and 3 human judges will listen to it and choose which mood category it belongs to. If 2 of the 3 judges agree on its category, the clip will be selected into the groundtruth set.
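
A minimal sketch of this selection rule, assuming each clip's three judgments have been collected into a list of cluster labels (function and variable names are illustrative):

from collections import Counter

def groundtruth_label(judgments):
    # judgments: list of 3 cluster labels, one per judge.
    # A clip enters the ground truth set only if at least 2 of the 3 agree.
    label, count = Counter(judgments).most_common(1)[0]
    return label if count >= 2 else None   # None = clip rejected

# e.g. groundtruth_label(["Cluster_1", "Cluster_1", "Cluster_4"]) -> "Cluster_1"
# e.g. groundtruth_label(["Cluster_1", "Cluster_2", "Cluster_4"]) -> None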

Evaluation Metrics

Metrics frequently used in classification problems include accuracy, precision, recall and F-measures (combining precision and recall). The single most important metric is accuracy, which allows direct system comparison:

Accuracy = # of correctly classified songs / # of all songs.

Accuracy can be calculated for all clusters as a whole (macro average) or for each cluster separately and then averaged over clusters (micro average).
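
A small illustrative sketch of both variants, assuming ground-truth and predicted labels are available as parallel Python lists; it is not the official scoring code.

# Illustrative only: accuracy over all clips pooled together, and per-cluster
# accuracies averaged across clusters, from parallel lists of labels.
from collections import defaultdict

def accuracies(true_labels, predicted_labels):
    n = len(true_labels)
    overall = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n

    per_cluster = defaultdict(lambda: [0, 0])          # cluster -> [correct, total]
    for t, p in zip(true_labels, predicted_labels):
        per_cluster[t][1] += 1
        per_cluster[t][0] += (t == p)
    cluster_acc = {c: ok / total for c, (ok, total) in per_cluster.items()}
    averaged = sum(cluster_acc.values()) / len(cluster_acc)
    return overall, cluster_acc, averaged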

The significance of differences among systems can be tested, possibly using:

  • a) McNemar's test

McNemar's test (Dietterich, 1997) is a statistical process that can validate the significance of differences between two classifiers. It was used in the Audio Genre Classification and Audio Artist Identification contests in MIREX 2005.
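
As an illustration only, McNemar's test can be computed from two systems' per-clip correctness flags on the same test set; SciPy is assumed to be available for the chi-square tail probability.

# Sketch of McNemar's test between two classifiers, from per-clip
# correct/incorrect flags on the same test set (SciPy assumed available).
from scipy.stats import chi2

def mcnemar(correct_a, correct_b):
    # b: clips A got right and B got wrong; c: the reverse
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if not a and bb)
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)   # chi-square with continuity correction
    p_value = chi2.sf(stat, df=1)
    return stat, p_value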

  • b) Friedman's test

Friedman's test is used to detect differences in treatments across multiple test attempts (http://en.wikipedia.org/wiki/Friedman_test). It was used in the Audio Similarity, Audio Cover Song, and Query-by-Singing/Humming contests in MIREX 2006.
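
A hedged sketch of how such a test might be run, for example over per-fold accuracies, using SciPy's friedmanchisquare; the numbers below are made up purely for illustration.

# Sketch of a Friedman test across several systems, e.g. using each
# system's per-fold accuracies as the repeated measurements (SciPy assumed).
from scipy.stats import friedmanchisquare

# hypothetical per-fold accuracies for three systems over the same 5 folds
sys_a = [0.52, 0.55, 0.49, 0.58, 0.54]
sys_b = [0.47, 0.50, 0.46, 0.51, 0.48]
sys_c = [0.53, 0.54, 0.50, 0.57, 0.55]

stat, p_value = friedmanchisquare(sys_a, sys_b, sys_c)
print("Friedman chi-square = %.3f, p = %.3f" % (stat, p_value))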

In addition, run time can be recorded and compared.

Important Dates

  • Human Validation for Groundtruth Set: August 1 - August 15
  • Algorithm Submission Deadline: August 25

Packaging your Submission

  • Be sure that your submission follows the #Submission_Format outlined below.
  • Be sure that your submission accepts the proper #Input_File format
  • Be sure that your submission produces the proper #Output_File format
  • Be sure to follow the 2006:Best_Coding_Practices_for_MIREX
  • Be sure to follow the 2007:MIREX 2007 Submission Instructions
  • In the README file that is included with your submission, please answer the following additional questions:
    • Approximately how long will the submission take to process ~1000 wav files?
    • Approximately how much scratch disk space will the submission need to store any feature/cache files?
    • Any special notes regarding running your algorithm
  • Submit your system via the URL located at the bottom of 2007:MIREX 2007 Submission Instructions page

Note that the information that you place in the README file is extremely important in ensuring that your submission is evaluated properly.

Submission Format

A submission to the Audio Music Mood Classification evaluation is expected to follow the 2006:Best_Coding_Practices_for_MIREX and must conform to the following for execution:

One Call Format

The one call format is appropriate for systems that perform all phases of the classification (typically feature extraction, training and testing) in one step. A submission should be an executable program that takes 4 arguments:

  • path/to/fileContainingListOfTrainingAudioClips - the path to the list of training audio clips (see #File Formats below)
  • path/to/fileContainingListOfTestingAudioClips - the path to the list of testing audio clips (see #File Formats below)
  • path/to/cacheDir - a directory where the submission can place temporary or scratch files. Note that the contents of this directory can be retained across runs, so if, for whatever reason, the submission needs to be restarted, the submission could make use of the contents of this directory to eliminate the need for reprocessing some inputs.
  • path/to/output/Results - the file where the output classification results should be placed. (see #File Formats below)

Example:


doAMC "path/to/fileContainingListOfTrainingAudioClips" "path/to/fileContainingListOfTestingAudioClips" "path/to/cacheDir" "path/to/output/Results" 
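
For concreteness, a minimal Python sketch of what a one-call submission entry point might look like; the training and prediction steps are placeholders, and the system name written on the header line is illustrative.

# Sketch of a one-call submission entry point (doAMC); the actual feature
# extraction, training and classification are placeholders for your system.
import sys

def main():
    train_list, test_list, cache_dir, results_path = sys.argv[1:5]

    with open(train_list) as f:
        train_entries = [line.rstrip("\n").split("\t") for line in f if line.strip()]
    with open(test_list) as f:
        test_paths = [line.strip() for line in f if line.strip()]

    # placeholder: train a classifier on train_entries, caching features in cache_dir
    model = None

    with open(results_path, "w") as out:
        out.write("My AMC system 0.1\n")        # first line: system name
        for path in test_paths:
            label = "Cluster_1"                 # placeholder prediction from model
            out.write("%s\t%s\n" % (path, label))

if __name__ == "__main__":
    main()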


Two Call Format

The two call format is appropriate for systems that perform the training and testing separately. A submission should consist of two executable programs:

  • trainAMC - this takes 3 arguments:
    • path/to/fileContainingListOfTrainingAudioClips - the path to the list of training audio clips (see #File Formats below)
    • path/to/trainingCacheDir - a directory where the submission can place temporary or scratch files. Note that the contents of this directory can be retained across runs, so if, for whatever reason, the submission needs to be restarted, the submission could make use of the contents of this directory to eliminate the need for reprocessing some inputs.
    • path/to/trainedClassificationModel - the file where the classification model should be placed
  • testAMC - this takes 4 arguments:
    • path/to/trainedClassificationModel
    • path/to/fileContainingListofTestingAudioClips - the path to the list of testing audio clips (see #File Formats below)
    • path/to/testingCacheDir - a directory where the submission can place temporary or scratch files.
    • path/to/output/Results - the file where the output classification results should be placed. (see #File Formats below)

Example:


trainAMC "path/to/fileContainingListOfTrainingAudioClips" "path/to/trainingcacheDir" "path/to/trainedClassificationModel" 
testAMC "path/to/trainedClassificationModel" "path/to/fileContainingListofTestingAudioClips" "path/to/testingCacheDir" "path/to/output/Results"
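
Along the same lines, a sketch of how the two programs might share a trained model through the model file; the pickled "model" here is a trivial placeholder, not a real classifier.

# Sketch of the two-call form: trainAMC persists whatever it learns to the
# given model path, and testAMC reloads it; the model itself is a placeholder.
import pickle

def train_amc(train_list, cache_dir, model_path):
    with open(train_list) as f:
        entries = [line.rstrip("\n").split("\t") for line in f if line.strip()]
    model = {"clusters": sorted({label for _, label in entries})}   # placeholder "model"
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

def test_amc(model_path, test_list, cache_dir, results_path):
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(test_list) as f:
        test_paths = [line.strip() for line in f if line.strip()]
    with open(results_path, "w") as out:
        out.write("My AMC system 0.1\n")
        for path in test_paths:
            out.write("%s\t%s\n" % (path, model["clusters"][0]))    # placeholder prediction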

Matlab format

Matlab will also be supported in the form of functions in the following formats:

Matlab One call format

doMyMatlabAMC('path/to/fileContainingListOfTrainingAudioClips','path/to/fileContainingListOfTestingAudioClips','path/to/cacheDir','path/to/output/Results')


Matlab Two call format

doMyMatlabTrainAMC('path/to/fileContainingListOfTrainingAudioClips','path/to/trainingcacheDir','path/to/trainedClassificationModel')
doMyMatlabTestAMC('path/to/trainedClassificationModel','path/to/fileContainingListofTestingAudioClips','path/to/testingCacheDir','path/to/output/Results')

File Formats

Input Files

The input training list file format will be of the form:

path/to/training/audio/file/000001.wav\tCluster_3
path/to/training/audio/file/000002.wav\tCluster_5
path/to/training/audio/file/000003.wav\tCluster_2
...
path/to/training/audio/file/00000N.wav\tCluster_1

"\t" stands for tab.

The input testing list file format will be of the form:

path/to/testing/audio/file/000010.wav
path/to/testing/audio/file/000020.wav
path/to/testing/audio/file/000030.wav
...
path/to/testing/audio/file/0000N0.wav

"\t" stands for tab.

Output File

The only output will be a file containing classification results in the following format:

Example Classification Results 0.1 (replace this line with your system name)
path/to/testing/audio/file/000010.wav\tCluster_3
path/to/testing/audio/file/000020.wav\tCluster_1
path/to/testing/audio/file/000030.wav\tCluster_5
...
path/to/testing/audio/file/0000N0.wav\tCluster_2

"\t" indicates tab. All audio clips should have one and only one mood cluster label.

Evaluation Scenario 2

Training Set

Under evaluation scenario 2, the training set would be the whole ground truth set in scenario 1 (see #Groundtruth Set).

Unlabeled Song Pool

Under evaluation scenario 2, the pool of testing audio to be classified comes from the same collections as the training set, i.e. USPOP and APM. We will make sure the audio covers a variety of genres in each mood cluster, which will make the contest harder and more interesting.

We will randomly select a certain number (say, 1000) of songs from the collections as the audio pool. This number should make the contest interesting enough, but not too hard. And the songs need to cover all 5 mood clusters.

Classification Results

Each algorithm will return the top X songs in each cluster.

This is a single-label classification contest, and thus each song can only be classified into one mood cluster.

Note: unlike traditional classification problems where all testing samples have ground truth available, this scenario does not have a well labeled testing set. Instead, we use a "pooling" approach as in TREC and last year's audio similarity and retrieval contest. This approach collects the top X results from each algorithm and asks human assessors to make judgments on this set of collected results, while assuming all other samples are irrelevant or incorrect. This approach cannot measure an absolute "recall" metric, but it is valid for comparing relative performances among participating algorithms.

The actual value of X depends on human assessment protocol and number of available human assessors (see next section #Human Assessment).

Human Assessment

Subjective judgments by human assessors will be collected for the pooled results using a web-based system, Evalutron 6000, developed by the IMIRSEL. (An introduction to Evalutron 6000 is given at 2007:Evalutron6000_Walkthrough_For_Audio_Mood_Classification.)

How many judgments and assessors

Each algorithm returns X songs for each of the 5 mood clusters. Suppose there are Y algorithms; in the worst case, there will be 5*X*Y songs to be judged (X*Y per cluster). Suppose each song needs Z sets of ears; there will then be 5*X*Y*Z judgments in total. When making a judgment, a human assessor will listen to the 30 second clip of a song and label it with one of the 5 mood clusters.

Human evaluators will be drawn from the participating labs and from volunteers in IMIRSEL or on the MIREX lists. Suppose we can get W evaluators; each evaluator will then make S = (5*X*Y*Z) / W judgments.

At this moment, there are 10 potential participants on the Wiki, so let's say Y = 6. Suppose each candidate song will be evaluated by 3 judges, Z = 3, and suppose we can get 20 assessors: W = 20:

  • If X = 20, number of judgments for each assessor: S = 90
  • If X = 10, S = 45
  • If X = 30, S = 135
  • If X = 50, S = 225
  • If X = 15, S = 67.5
  • ...
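
The same arithmetic as a tiny helper, with the assumed values above as defaults (illustrative only):

# Judgments per assessor: S = (clusters * X * Y * Z) / W, with the assumed
# Y = 6 algorithms, Z = 3 judges per song and W = 20 assessors as defaults.
def judgments_per_assessor(X, Y=6, Z=3, W=20, clusters=5):
    return clusters * X * Y * Z / W

# e.g. judgments_per_assessor(20) -> 90.0, judgments_per_assessor(15) -> 67.5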

In the audio similarity contest last year, each assessor made 205 judgments on average. As mood judgments are trickier, we may need to give our assessors a lighter load.

To eliminate possible bias, we will try to equally distribute candidates returned by each algorithm among human assessors.

Scoring

Each algorithm is graded by the number of votes its candidate songs win from the judges. For example, if a song A is judged as Cluster_1 by 2 assessors and as Cluster_2 by 1 assessor, then an algorithm classifying A as Cluster_1 will score 2 on this song, while an algorithm classifying A as Cluster_2 will score 1 on this song. An algorithm's final score is the sum of its scores on all the songs it submits. Since each algorithm can only submit 100 songs, the one which wins the most votes from the judges wins the contest.
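
A minimal sketch of this vote counting, assuming the submissions and judge votes are available as plain Python dictionaries (all names are illustrative):

# Sketch of the vote-based scoring: an algorithm earns, for each song it
# submits, the number of judges who assigned that song the same cluster.
from collections import Counter

def algorithm_score(submission, judgments):
    # submission: {song_id: predicted_cluster}
    # judgments:  {song_id: list of cluster labels given by the judges}
    score = 0
    for song, predicted in submission.items():
        votes = Counter(judgments.get(song, []))
        score += votes.get(predicted, 0)
    return score

# e.g. if song A got 2 votes for Cluster_1 and 1 for Cluster_2, an algorithm
# that submitted A as Cluster_1 earns 2 points for it; as Cluster_2, 1 point.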

Evaluation Metrics

The algorithm score described in the last section is a metric that facilitates direct comparison.

In addition, metrics frequently used in classification problems include accuracy, precision, recall and F-measures (combining precision and recall). As mentioned above, the pooling approach only yields a relative recall measure; therefore, the single most important metric is accuracy:

The original definition of accuracy is: Accuracy = # of correctly classified songs / # of all songs.

According to the above human assessment method, "correctly classified songs" in this scenario can be defined as songs classified as the majority vote of the judges and, in the case of ties, songs classified as any of the tie votes. For example, suppose each song has 3 judges. If a song is labeled as Cluster_1 by at least 2 judges, then this song will be counted as correct for algorithms classifying it to Cluster_1; if a song is labeled as Cluster_1, Cluster_2 and Cluster_3 once by each of the judges, then this song will be counted as correct for algorithms classifying it to Cluster_1, Cluster_2 or Cluster_3.
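
A small sketch of this correctness rule; the function name and inputs are illustrative.

# Sketch of the "correctly classified" rule under pooling: a prediction is
# correct if it matches the judges' majority label, or any label tied for
# the most votes when there is no majority.
from collections import Counter

def is_correct(predicted, judge_labels):
    votes = Counter(judge_labels)
    top = max(votes.values())
    winners = {label for label, v in votes.items() if v == top}
    return predicted in winners

# e.g. is_correct("Cluster_1", ["Cluster_1", "Cluster_1", "Cluster_2"]) -> True
# e.g. is_correct("Cluster_3", ["Cluster_1", "Cluster_2", "Cluster_3"]) -> True (three-way tie)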

Accuracy can be calculated for all clusters as a whole (macro average) or for each cluster separately and then averaged over clusters (micro average).

The significance of differences among systems can be tested, possibly using:

  • a) McNemar's test
  • b) Friedman's test

In addition, run time can be recorded and compared.

Challenging Issues

  1. Mood-changing pieces: some pieces may start in one mood but end up in another.

We will use 30 second clips instead of whole songs. The clips will be extracted automatically from the middle of the songs, which is more likely to be representative.

  2. Multiple-label classification: it is possible that one piece can have two or more correct mood labels, but as a start, we strongly suggest holding a less confusing contest and leaving the challenge to future MIREXs. So, for this year, this is a single-label classification problem.

Participants

If you think there is a slight chance that you might consider participating, please add your name and email address here.

  • Kris West (kw at cmp dot uea dot ac dot uk)
  • Cyril Laurier (claurier at iua dot upf dot edu)
  • Elias Pampalk (firstname.lastname@gmail.com)
  • Yuriy Molchanyuk (molchanyuk at onu.edu.ua)
  • Shigeki Sagayama (sagayama at hil dot t.u-tokyo.ac.jp)
  • Guillaume Nargeot (killy971 at gmail dot com)
  • Zhongzhe Xiao (zhongzhe dot xiao at ec-lyon dot fr)
  • Kyogu Lee (kglee at ccrma.stanford.edu)
  • Vitor Soares (firstname.lastname@clustermedialabs.com)
  • Wai Cheung (wlche1@infotech.monash.edu.au)
  • Matt Hoffman (mdhoffma a t cs d o t princeton d o t edu)
  • Yi-Hsuan Yang (affige at gmail dot com)
  • Jose Fornari ( fornari at campus dot jyu dot fi )

Moderators

  • J. Stephen Downie (IMIRSEL, University of Illinois, USA)
  • Xiao Hu (IMIRSEL, University of Illinois, USA)
  • Cyril Laurier (Music Technology Group, Barcelona, Spain)

Related Papers

  1. Dietterich, T. (1997). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895-1924.
  2. Hu, Xiao and J. Stephen Downie (2007). Exploring mood metadata: Relationships with genre, artist and usage metadata. Accepted at the Eighth International Conference on Music Information Retrieval (ISMIR 2007), Vienna, September 23-27, 2007.
  3. Juslin, P. N., Karlsson, J., Lindström, E., Friberg, A. and Schoonderwaldt, E. (2006). Play It Again With Feeling: Computer Feedback in Musical Communication of Emotions. Journal of Experimental Psychology: Applied, Vol. 12, No. 2, 79-95.
  4. Vignoli (ISMIR 2004) Digital Music Interaction Concepts: A User Study
  5. Cunningham, Jones and Jones (ISMIR 2004) Organizing Digital Music For Use: An Examination of Personal Music Collections.
  6. Cunningham, Bainbridge and Falconer (ISMIR 2006) 'More of an Art than a Science': Supporting the Creation of Playlists and Mixes.
  7. Lu, Liu and Zhang (2006), Automatic Mood Detection and Tracking of Music Audio Signals. IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 1, JANUARY 2006
    Part of this paper appeared in ISMIR 2003 http://ismir2003.ismir.net/papers/Liu.PDF
  8. Pohle, Pampalk, and Widmer (CBMI 2005) Evaluation of Frequently Used Audio Features for Classification of Music into Perceptual Categories.
    It separates "mood" and "emotion" as two classification dimensions, which are mostly combined in other studies.
  9. Mandel, Poliner and Ellis (2006) Support vector machine active learning for music retrieval. Multimedia Systems, Vol.12(1). Aug.2006.
  10. Feng, Zhuang and Pan (SIGIR 2003) Popular music retrieval by detecting mood
  11. Li and Ogihara (ISMIR 2003) Detecting emotion in music
  12. Hilliges, Holzer, Klüber and Butz (2006) AudioRadar: A metaphorical visualization for the navigation of large music collections. In Proceedings of the International Symposium on Smart Graphics 2006, Vancouver, Canada.
    It summarized implicit problems in traditional genre/artist based music organization.
  13. Juslin, P. N., & Laukka, P. (2004). Expression, perception, and induction of musical emotions: A review and a questionnaire study of everyday listening. Journal of New Music Research, 33(3), 217-238.
  14. Yang, Liu, and Chen (ACMMM 2006) Music emotion classification: A fuzzy approach