2016:Audio Tag Classification
Description
This task will compare various algorithms' abilities to associate descriptive tags with 10-second audio clips of songs. Two datasets are used to implement a pair of subtasks, based on the MajorMiner and Mood tag datasets. This task is closely related to the other audio classification tasks; however, multiple tags may be applied to each example, rather than a single label.
Algorithms will be evaluated both on their ability to apply binary classifications of tags to examples and on their ability to rank tags for a track, by asking them to return an affinity score for each tag/track pair.
Audio tag classification was first run at MIREX 2008 (2008:Audio_Tag_Classification), then as a special MIREX task in 2009 (2009:SpecialTagatuneEvaluation), and each year during 2010-2014.
Task specific mailing list
In the past we have used a specific mailing list for the discussion of this task. This year, however, we are asking that all discussions take place on the MIREX "EvalFest" list. If you have a question or comment, simply include the task name in the subject heading.
Data
Two datasets will be used to evaluate tagging algorithms: the MajorMiner and Mood tag datasets.
MajorMiner Tag Dataset
The tags come from the MajorMiner game. All of the data is browseable via the MajorMiner search page.
The music consists of 2300 clips selected at random from 3900 tracks. Each clip is 10 seconds long. The 2300 clips represent a total of 1400 different tracks on 800 different albums by 500 different artists. To give a sense for the music collection, the following genre tags have been applied to these artists, albums, and tracks on Last.fm: electronica, rock, indie, alternative, pop, britpop, idm, new wave, hip-hop, singer-songwriter, trip-hop, post-punk, ambient, jazz.
The MajorMiner game has collected a total of about 73000 taggings, 12000 of which have been verified by at least two users. In these verified taggings, there are 43 tags that have been verified at least 35 times, for a total of about 9000 verified uses. These are the tags we will be using in this task.
Note that these data do not include strict negative labels. While many clips are tagged rock, none are tagged not rock. Frequently, however, a clip will be tagged many times without being tagged rock. We take this as an indication that rock does not apply to that clip. More specifically, a negative example of a particular tag is a clip on which another tag has been verified, but the tag in question has not.
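The rule for negative examples can be sketched in a few lines of Python. This is only an illustration: the in-memory representation (a list of `(clip_id, tag)` pairs of verified taggings) and the function name `split_examples` are assumptions, not part of the dataset distribution.

```python
from collections import defaultdict

def split_examples(verified_taggings, tag):
    """Return (positives, negatives) clip sets for one tag.

    verified_taggings: iterable of (clip_id, tag) pairs, each already
    verified by at least two users (hypothetical representation).
    """
    tags_by_clip = defaultdict(set)
    for clip_id, clip_tag in verified_taggings:
        tags_by_clip[clip_id].add(clip_tag)
    positives = {c for c, tags in tags_by_clip.items() if tag in tags}
    # A negative is a clip with at least one verified tag, none of which is `tag`.
    negatives = set(tags_by_clip) - positives
    return positives, negatives

taggings = [("clip1", "rock"), ("clip1", "guitar"),
            ("clip2", "jazz"), ("clip3", "rock")]
pos, neg = split_examples(taggings, "rock")
```

Clips with no verified tag at all appear in neither set, which matches the definition above.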
Here is a list of the top 50 tags along with an approximate number of times each has been verified, how many times it's been used in total, and how many different users have ever used it:
Tag | Verified | Total | Users |
---|---|---|---|
drums | 962 | 3223 | 127 |
guitar | 845 | 3204 | 181 |
male | 724 | 2452 | 95 |
rock | 658 | 2619 | 198 |
synth | 498 | 1889 | 105 |
electronic | 490 | 1878 | 131 |
pop | 479 | 1761 | 151 |
bass | 417 | 1632 | 99 |
vocal | 355 | 1378 | 99 |
female | 342 | 1387 | 100 |
dance | 322 | 1244 | 115 |
techno | 246 | 943 | 104 |
piano | 179 | 826 | 120 |
electronica | 168 | 686 | 67 |
hip hop | 166 | 701 | 126 |
voice | 160 | 790 | 55 |
slow | 157 | 727 | 90 |
beat | 154 | 708 | 90 |
rap | 151 | 723 | 129 |
jazz | 136 | 735 | 154 |
80s | 130 | 601 | 94 |
fast | 109 | 494 | 70 |
instrumental | 103 | 539 | 62 |
drum machine | 89 | 427 | 35 |
british | 81 | 383 | 60 |
country | 74 | 360 | 105 |
distortion | 73 | 366 | 55 |
saxophone | 70 | 316 | 86 |
house | 65 | 298 | 66 |
ambient | 61 | 335 | 78 |
soft | 61 | 351 | 58 |
silence | 57 | 200 | 35 |
r&b | 57 | 242 | 59 |
strings | 55 | 252 | 62 |
quiet | 54 | 261 | 57 |
solo | 53 | 268 | 56 |
keyboard | 53 | 424 | 41 |
punk | 51 | 242 | 76 |
horns | 48 | 204 | 38 |
drum and bass | 48 | 191 | 50 |
noise | 46 | 249 | 61 |
funk | 46 | 266 | 90 |
acoustic | 40 | 193 | 58 |
trumpet | 39 | 174 | 68 |
end | 38 | 178 | 36 |
loud | 37 | 218 | 62 |
organ | 35 | 169 | 46 |
metal | 35 | 178 | 64 |
folk | 33 | 195 | 58 |
trance | 33 | 226 | 49 |
Mood Tag Dataset
The Mood tag dataset is derived from mood-related tags on Last.fm. All tags in this set are identified by a general affect lexicon (WordNet-Affect) and by human experts. Similar tags are grouped together to define a mood tag group, and each song may belong to multiple mood tag groups.
There are 18 mood tag groups containing 135 unique tags. The dataset contains 3,469 unique songs. The following table lists the tag groups, their member tags and number of songs in each group:
Group id | Tags | num. of tags | num. of songs |
---|---|---|---|
G12 | calm, comfort, quiet, serene, mellow, chill out, calm down, calming, chillout, comforting, content, cool down, mellow music, mellow rock, peace of mind, quietness, relaxation, serenity, solace, soothe, soothing, still, tranquil, tranquility, tranquility | 25 | 1,680 |
G15 | sad, sadness, unhappy, melancholic, melancholy, feeling sad, mood: sad - slightly, sad song | 8 | 1,178 |
G5 | happy, happiness, happy songs, happy music, glad, mood: happy | 6 | 749 |
G32 | romantic, romantic music | 2 | 619 |
G2 | upbeat, gleeful, high spirits, zest, enthusiastic, buoyancy, elation, mood: upbeat | 8 | 543 |
G16 | depressed, blue, dark, depressive, dreary, gloom, darkness, depress, depression, depressing, gloomy | 11 | 471 |
G28 | anger, angry, choleric, fury, outraged, rage, angry music | 7 | 254 |
G17 | grief, heartbreak, mournful, sorrow, sorry, doleful, heartache, heartbreaking, heartsick, lachrymose, mourning, plaintive, regret, sorrowful | 14 | 183 |
G14 | dreamy | 1 | 146 |
G6 | cheerful, cheer up, festive, jolly, jovial, merry, cheer, cheering, cheery, get happy, rejoice, songs that are cheerful, sunny | 13 | 142 |
G8 | brooding, contemplative, meditative, reflective, broody, pensive, pondering, wistful | 8 | 116 |
G29 | aggression, aggressive | 2 | 115 |
G25 | angst, anxiety, anxious, jumpy, nervous, angsty | 6 | 80 |
G9 | confident, encouraging, encouragement, optimism, optimistic | 5 | 61 |
G7 | desire, hope, hopeful, mood: hopeful | 4 | 45 |
G11 | earnest, heartfelt | 2 | 40 |
G31 | pessimism, cynical, pessimistic, weltschmerz, cynical/sarcastic | 5 | 38 |
G1 | excitement, exciting, exhilarating, thrill, ardor, stimulating, thrilling, titillating | 8 | 30 |
TOTAL | | 135 | 6,490 |
The songs are mostly from the USPOP collection. A detailed breakdown of the songs is given in the following table:
Collection | num. of songs in the dataset | percentage of songs in the dataset |
---|---|---|
USPOP | 2764 | 80% |
Assorted pop | 366 | 10% |
American music | 145 | 4% |
Beatles | 128 | 4% |
USCRAP | 40 | 1% |
Metal music | 25 | 1% |
Magnatune | 1 | 0% |
TOTAL | 3469 | 100% |
Details on how the mood tag groups were derived are described in: X. Hu, J. S. Downie, A. Ehmann, Lyric Text Mining in Music Mood Classification, In Proceedings of the 10th International Symposium on Music Information Retrieval (ISMIR), Oct. 2009, Kobe, Japan.
Details on how the songs were selected are available in the description.
Evaluation
Participating algorithms will be evaluated with 3-fold artist-filtered cross-validation. An introduction to the evaluation statistics computed is given in the following subsections.
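As an illustration only, the sketch below builds folds under the assumption that "artist-filtered" means all clips by one artist are kept in the same fold, so that no artist appears in both the training and the test set. The clip-to-artist mapping and the greedy balancing strategy are hypothetical; the actual MIREX fold assignment may differ.

```python
from collections import defaultdict

def artist_filtered_folds(clip_artists, n_folds=3):
    """Assign clips to folds so that no artist is split across folds.

    clip_artists: dict mapping clip path -> artist name (hypothetical input).
    Returns a list of n_folds lists of clip paths.
    """
    clips_by_artist = defaultdict(list)
    for clip, artist in sorted(clip_artists.items()):
        clips_by_artist[artist].append(clip)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the artists with the most clips first,
    # each into the currently smallest fold.
    by_size = sorted(clips_by_artist,
                     key=lambda a: (-len(clips_by_artist[a]), a))
    for artist in by_size:
        smallest = min(folds, key=len)
        smallest.extend(clips_by_artist[artist])
    return folds

clip_artists = {"a1.wav": "A", "a2.wav": "A", "b1.wav": "B",
                "c1.wav": "C", "d1.wav": "D"}
folds = artist_filtered_folds(clip_artists)
```

Each fold is then used once as the test set, with the other two folds as training data.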
Binary (Classification) Evaluation
Algorithms are evaluated on their performance at tag classification using F-measure. Results are also reported for simple accuracy; however, because this statistic is dominated by the negative example accuracy, it is not a reliable indicator of performance (a system that returns no tags for any example will achieve a high score on it). Accuracies are also reported separately for positive and negative examples, as these can help elucidate the behaviour of an algorithm (for example, showing whether the system is under- or over-predicting).
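The following sketch shows how these statistics relate for a single tag. It assumes the usual definitions (positive example accuracy as the fraction of true positives recovered, negative example accuracy as the fraction of true negatives recovered); the function name and input layout are illustrative and not taken from the MIREX evaluation code.

```python
def binary_tag_metrics(truth, predicted):
    """truth, predicted: equal-length lists of 0/1 for one tag over all clips."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "f_measure": f_measure,
        "accuracy": (tp + tn) / len(pairs),
        "positive_accuracy": recall,
        "negative_accuracy": tn / (tn + fp) if tn + fp else 0.0,
    }

# A system that never applies the tag: 2 positives, 8 negatives.
truth = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
never_tags = [0] * 10
metrics = binary_tag_metrics(truth, never_tags)
```

In this example the overall accuracy is 0.8 even though the F-measure and the positive example accuracy are both 0, which is the failure mode described above.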
Affinity (Ranking) Evaluation
Algorithms are evaluated on their performance at tag ranking using the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). The affinity scores for each tag to be applied to a track are sorted prior to the computation of the AUC-ROC statistic, which gives higher scores to ranked tag sets where the correct tags appear towards the top of the set.
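One way to compute an AUC-ROC value from a list of affinities and ground-truth relevance flags is the pairwise formulation sketched below: the fraction of (relevant, non-relevant) pairs in which the relevant item receives the higher affinity, with ties counted as half. This is a generic illustration, not the MIREX evaluation code, and the function name is an assumption.

```python
def auc_roc(affinities, relevant):
    """Area under the ROC curve via pairwise comparison.

    affinities: list of scores; relevant: list of 0/1 ground-truth flags.
    """
    pos = [a for a, r in zip(affinities, relevant) if r]
    neg = [a for a, r in zip(affinities, relevant) if not r]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Both relevant items are ranked above both non-relevant items.
well_ranked = auc_roc([0.9, 0.7, 0.3, 0.5], [1, 1, 0, 0])
# One relevant item is ranked last.
half_ranked = auc_roc([0.9, 0.2, 0.3, 0.5], [1, 1, 0, 0])
```

The pairwise loop is quadratic in the list length, which is acceptable for a sketch; a rank-based computation gives the same value more efficiently.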
Ranking and significance testing
Additionally, more standard tests could be performed on the average classification accuracy, although the cross-tag variance tends to increase each algorithm's variance, interfering with significance tests without further handling. One test that can help resolve these issues is Friedman's ANOVA with Tukey-Kramer HSD.
We wish to compare a number of treatments/systems (the submissions) over a number of blocks/rows. We could compute average classification accuracy and/or precision metrics over all the tags and use the cross-validation folds as the blocks/rows, which would handle variance between different folds. However, we are more interested in treating each tag (averaged over all folds) or, perhaps better, each tag on each fold as a separate block.
The Friedman test should handle the variance between tags (caused by different difficulties of modeling each tag and different numbers of positive and negative examples per tag) by replacing the actual scores achieved by each system on each block (tag) with the rank achieved by that system on that tag amongst all the systems. Hence, we make the assumption that each tag (or combination of tag and fold) is of equal importance in the evaluation. This is an often used approach at TREC (Text Retrieval Conference) when considering retrieval results (where each query is of equal importance, but unequal variance/difficulty).
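The rank replacement step can be sketched as follows. The code ranks the systems within each block and computes the standard Friedman chi-square statistic without a tie correction; the scores are made-up numbers for three hypothetical systems on four tags, and the Tukey-Kramer HSD step is not covered.

```python
def ranks_within_block(scores):
    """Rank one block's scores (1 = lowest), averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def friedman_statistic(blocks):
    """blocks: one row per tag (or tag/fold), one score per system."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, rank in enumerate(ranks_within_block(row)):
            rank_sums[j] += rank
    chi2 = (12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
            - 3.0 * n * (k + 1))
    return rank_sums, chi2

# Hypothetical per-tag scores: rows are tags, columns are systems.
blocks = [[0.60, 0.55, 0.40],
          [0.30, 0.25, 0.10],
          [0.80, 0.85, 0.50],
          [0.45, 0.40, 0.20]]
rank_sums, chi2 = friedman_statistic(blocks)
```

Because only the within-block ranks enter the statistic, an easy tag (second-to-last row of scores near 0.8) and a hard tag (scores near 0.3) contribute equally, which is the equal-importance assumption stated above.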
Tukey-Kramer Honestly Significant Difference multiple comparisons are made over the results of Friedman's ANOVA because Friedman's ANOVA (and other tests, such as multiply applied Student's t-tests) can only safely tell you whether one system is statistically significantly different from the rest. If you try to do the full NxN comparisons with such tests, the experiment-wide alpha value accumulates over all the tests. E.g. if we compared 12 systems at an alpha level of 0.05, a total of 66 pairwise comparisons would be made, and the chance of incorrectly rejecting at least one hypothesis of no difference in error rates is: 1 - (0.95^66) = 0.97 = 97%. This explanation is lifted from a paper by Tague-Sutcliffe and Blustein:
@article{taguesutcliffe1995sat,
  title={A Statistical Analysis of the TREC-3 Data},
  author={Tague-Sutcliffe, J. and Blustein, J.},
  journal={Overview of the Third Text Retrieval Conference (Trec-3)},
  year={1995},
  publisher={DIANE Publishing}
}
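The 97% figure quoted above can be reproduced with a short calculation. The sketch assumes, as the formula in the text does, that the pairwise tests are independent; the function name is illustrative.

```python
from math import comb

def familywise_error(n_systems, alpha=0.05):
    """Chance of at least one false rejection over all pairwise tests."""
    n_comparisons = comb(n_systems, 2)
    return n_comparisons, 1.0 - (1.0 - alpha) ** n_comparisons

n_comparisons, error_rate = familywise_error(12)
print(n_comparisons, round(error_rate, 2))  # 66 0.97
```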
For further details on the use of Friedman's ANOVA with Tukey-Kramer HSD in MIR, please see:
@InProceedings{jones2007hsj,
  title={"Human Similarity Judgments: Implications for the Design of Formal Evaluations"},
  author="M.C. Jones and J.S. Downie and A.F. Ehmann",
  BOOKTITLE="Proceedings of ISMIR 2007 International Society of Music Information Retrieval",
  year="2007"
}
Runtime performance
In addition, computation times for feature extraction and training/classification will be measured.
Submission format
Submissions to this task must conform to the format detailed below, which is very similar to that of the audio genre classification task, among others.
Audio formats
Participating algorithms will have to read audio in the following format:
- Sample rate: 44 kHz
- Sample size: 16 bit
- Number of channels: 2 (stereo)
- Encoding: WAV (decoded from MP3 files by IMIRSEL)
- Duration: 10 second clips
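A submission written in Python could sanity-check its input against this format using the standard-library `wave` module, roughly as sketched below. The helper names are hypothetical, and the default expected rate of 44100 Hz is an assumption, since the list above only states "44 KHz".

```python
import wave

def describe_wav(path):
    """Return the header properties of a WAV file that the task constrains."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "sample_size_bits": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / float(rate),
        }

def matches_task_format(info, expected_rate_hz=44100):
    # expected_rate_hz=44100 is an assumption, not stated in the task text.
    return (info["sample_rate_hz"] == expected_rate_hz
            and info["sample_size_bits"] == 16
            and info["channels"] == 2)
```

A typical use would be `matches_task_format(describe_wav("/path/to/track1.wav"))`, logging a warning rather than aborting if a clip deviates.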
Implementation details
Scratch folders will be provided for all submissions for the storage of feature files and any model files to be produced. Executables will have to accept the path to their scratch folder as a command line parameter. Executables will also have to track which feature files correspond to which audio files internally. To facilitate this process, unique filenames will be assigned to each audio track.
The audio files to be used in the task will be specified in a simple ASCII list file. For feature extraction and classification this file will contain one path per line with no header line. For model training this file will contain one path per line, followed by a tab character and the tag label, again with no header line. Executables will have to accept the path to these list files as a command line parameter. The formats for the list files are specified below.
Algorithms should divide their feature extraction and training/classification into separate executables/scripts. This will facilitate a single feature extraction step for the task, while training and classification can be run for each cross-validation fold.
Multi-processor compute nodes (8 cores) will be used to run this task. Hence, participants should attempt to use parallelism wherever possible. Ideally, the number of threads to use should be specified as a command line parameter. Alternatively, implementations may be provided in hard-coded 2, 4 or 8 thread configurations. Single-threaded submissions will, of course, be accepted but may be disadvantaged by time constraints.
I/O formats
This section describes the input and output files used in this task, as well as the command line calling format requirements for submissions.
Feature extraction list file
The list file passed for feature extraction will be a simple ASCII list file. This file will contain one path per line with no header line.
I.e.
<example path and filename>
E.g.
/path/to/track1.wav
/path/to/track2.wav
...
Training list file
The list file passed for model training will be a simple ASCII list file. This file will contain one path per line, followed by a tab character and a tag label, again with no header line.
I.e.
<example path and filename>\t<tag classification>\n
E.g.
/path/to/track1.wav	drum
/path/to/track1.wav	silence
...
In this way, the input file will represent the sparse ground truth matrix. While no line will be duplicated, multiple lines may contain the same path, one for each tag associated with that clip. Any tag that is not specified as applying to a clip does not apply to that clip. The ordering of the lines is arbitrary and should not be depended upon.
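A minimal reader for this format might look like the sketch below, which collects the sparse ground truth into a mapping from clip path to its set of tags. The function name is an assumption, and blank lines are skipped as a convenience that the format description does not mention.

```python
from collections import defaultdict

def read_training_list(path):
    """Parse '<path>\\t<tag>' lines into {clip_path: set_of_tags}."""
    tags_by_clip = defaultdict(set)
    with open(path, encoding="ascii") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue  # tolerate blank lines (an assumption)
            clip, tag = line.split("\t")
            tags_by_clip[clip].add(tag)
    return dict(tags_by_clip)
```

Because the result is keyed by path and stores tags in a set, it does not depend on the ordering of the lines.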
Test (classification) list file
The list file passed for testing classification will be a simple ASCII list file identical in format to the Feature extraction list file. This file will contain one path per line with no header line.
I.e.
<example path and filename>
E.g.
/path/to/track1.wav
/path/to/track2.wav
...
Classification output files
Participating algorithms should produce two simple ASCII list files similar in format to the Training list file. The path to which each list file should be written must be accepted as a parameter on the command line.
Tag Affinity file
The first file will contain one path per line, followed by a tab character and the tag label, followed by another tab character and the affinity of that tag for that file, again with no header line.
I.e.:
<example path and filename>\t<tag classification>\t<affinity>\n
E.g.:
/data/file1.wav	rock	0.9
/data/file1.wav	guitar	0.7
/data/file1.wav	vocal	0.3
/data/file2.wav	rock	0.5
...
In this way, the output file will represent the sparse classification matrix. A path should be repeated on a separate line for each tag that the submission deems applies to it. If a (path, tag) pair is not specified, it will be assumed to have an affinity of 0. The ordering of the lines is not important and can be arbitrary.
The affinity will be used for retrieval evaluation metrics, and its only specification is that for a given tag, larger (closer to +infinity) numbers indicate that the tag is more appropriate to a clip than smaller (closer to -infinity) numbers. As submissions are asked to also return a binary relevance listing, submissions that do not compute an affinity should provide only the binary relevance listing file.
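Writing the affinity file could be as simple as the following sketch. The function name and the dictionary layout are assumptions; the lines are sorted only to make the output reproducible, since the ordering is not significant.

```python
def write_affinity_file(path, affinities):
    """Write {(clip_path, tag): affinity} as tab-separated lines, no header."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        for (clip, tag), score in sorted(affinities.items()):
            handle.write("{}\t{}\t{!r}\n".format(clip, tag, float(score)))

example_affinities = {
    ("/data/file1.wav", "rock"): 0.9,
    ("/data/file1.wav", "guitar"): 0.7,
    ("/data/file2.wav", "rock"): 0.5,
}
```

Since omitted pairs are read as affinity 0 and affinities may be negative, a submission that produces negative scores should probably write every (path, tag) pair explicitly rather than rely on the default.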
Binary relevance file
The second file to be produced is a binary version of the tag classifications, where a tag must be marked as relevant or not relevant to a track. This file will contain one path per line, followed by a tab character and the tag label, followed by another tab character and either a 1 or a 0 indicating the relevance of that tag for that file, again with no header line.
I.e.:
<example path and filename>\t<tag classification>\t<relevant? [0 | 1]>\n
E.g.:
/data/file1.wav	rock	1
/data/file1.wav	guitar	1
/data/file1.wav	vocal	0
/data/file2.wav	rock	1
...
If a (path, tag) pair is not specified, it will be assumed to be non-relevant (0). Any line with a path and tag but no numerical value will be assumed to be relevant (1).
Hence, the following is equivalent to the example above:
/data/file1.wav	rock
/data/file1.wav	guitar
/data/file2.wav	rock
The ordering of the lines is not important and can be arbitrary.
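The two defaulting rules (a missing pair means 0, a missing value means 1) can be captured in a small reader, sketched here with hypothetical function names. It is meant to illustrate the file semantics, not to mirror the actual MIREX evaluation code.

```python
def read_binary_relevance(path):
    """Parse the binary relevance file into {(clip_path, tag): 0 or 1}."""
    relevance = {}
    with open(path, encoding="ascii") as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < 2 or not fields[0]:
                continue  # skip blank or malformed lines (an assumption)
            value = fields[2].strip() if len(fields) > 2 else ""
            # A line without a numerical value counts as relevant (1).
            relevance[(fields[0], fields[1])] = int(value) if value else 1
    return relevance

def relevant_pairs(relevance):
    """Pairs marked relevant; anything absent is treated as non-relevant."""
    return {pair for pair, flag in relevance.items() if flag == 1}
```

Under these rules the explicit four-line example and the shortened three-line example above yield the same set of relevant pairs.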
Example submission calling formats
extractFeatures.sh /path/to/scratch/folder /path/to/featureExtractionListFile.txt
TrainAndClassify.sh /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputAffinityFile.txt /path/to/outputBinaryRelevanceFile.txt

extractFeatures.sh -numThreads 8 /path/to/scratch/folder /path/to/featureExtractionListFile.txt
TrainAndClassify.sh -numThreads 8 /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputAffinityFile.txt /path/to/outputBinaryRelevanceFile.txt

extractFeatures.sh /path/to/scratch/folder /path/to/featureExtractionListFile.txt
Train.sh /path/to/scratch/folder /path/to/trainListFile.txt
Classify.sh /path/to/scratch/folder /path/to/testListFile.txt /path/to/outputAffinityFile.txt /path/to/outputBinaryRelevanceFile.txt

myAlgo.sh -extract -numThreads 8 /path/to/scratch/folder /path/to/featureExtractionListFile.txt
myAlgo.sh -TrainAndClassify -numThreads 8 /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputAffinityFile.txt /path/to/outputBinaryRelevanceFile.txt

myAlgo.sh -extract /path/to/scratch/folder /path/to/featureExtractionListFile.txt
myAlgo.sh -train /path/to/scratch/folder /path/to/trainListFile.txt
myAlgo.sh -classify /path/to/scratch/folder /path/to/testListFile.txt /path/to/outputAffinityFile.txt /path/to/outputBinaryRelevanceFile.txt
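For a submission whose shell wrapper hands its arguments to a Python program, the TrainAndClassify calling format with an optional thread count could be parsed with `argparse`, as in this sketch. The argument names and the default of one thread are assumptions made for illustration.

```python
import argparse

def parse_train_and_classify_args(argv):
    """Parse: [-numThreads N] scratch trainList testList affinityOut binaryOut."""
    parser = argparse.ArgumentParser(prog="TrainAndClassify")
    # Default of 1 thread is an assumption; the task does not specify one.
    parser.add_argument("-numThreads", type=int, default=1)
    parser.add_argument("scratch_folder")
    parser.add_argument("train_list")
    parser.add_argument("test_list")
    parser.add_argument("affinity_out")
    parser.add_argument("binary_out")
    return parser.parse_args(argv)

args = parse_train_and_classify_args([
    "-numThreads", "8",
    "/path/to/scratch/folder",
    "/path/to/trainListFile.txt",
    "/path/to/testListFile.txt",
    "/path/to/outputAffinityFile.txt",
    "/path/to/outputBinaryRelevanceFile.txt",
])
```

In a real entry point the list would come from `sys.argv[1:]`, and the README should state whether the thread count is taken from the command line.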
Packaging submissions
All submissions should be statically linked to all libraries (the presence of dynamically linked libraries cannot be guaranteed).
All submissions should include a README file containing the following information:
- Command line calling format for all executables
- Number of threads/cores used or whether this should be specified on the command line
- Expected memory footprint
- Expected runtime
- Approximate scratch disk space needed to store any feature/cache files
- Any required environments, libraries and architectures (including version information), such as Matlab, Java, Python, Bash, Ruby etc.
- Any special notices regarding running your algorithm
Note that the information that you place in the README file is extremely important in ensuring that your submission is evaluated properly.
Time and hardware limits
Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be specified.
A hard limit of 72 hours will be imposed on the full execution of a submission on each dataset (to include feature extraction time and the 3 training/testing cycles required for the 3-fold cross-validated experiment).