2009:Audio Genre Classification

From MIREX Wiki

Description

The text of this section is copied from the 2008 page. Please add your comments and discussions for 2009. This proposal may be refined based on feedback from the participants.

Note that audio genre classification algorithms have been evaluated at ISMIR 2004, MIREX 2005, MIREX 2007 and MIREX 2008. However, there was no genre classification task in 2006.

Please feel free to edit this page but please conduct discussion of the task format and evaluation on the MRX-COM00 mailing list (List interface).


Discussions for 2009

Your comments here.

1. Jia-Min Ren

Does anyone know what guessed_c and p_c mean in the evaluation metric of the ISMIR2004 Audio Description Contest? How can I calculate the normalized accuracy from the confusion matrix, as mentioned in http://ismir2004.ismir.net/genre_contest/results.htm ?

I believe guessed_c is the number of correctly classified songs for class c and p_c is the probability of class c, i.e. the number of songs available in class c divided by the total number of songs in the collection. You may have a look at the ISMIR 2004 Audio Contest TechReport by MTG and the definition of macro-averaged Recall (Eq. 1) in my ISMIR 2005 follow-up paper. -- Thomas Lidy
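
As a rough illustration of the reply above, the sketch below computes raw accuracy and macro-averaged recall (the mean of the per-class recalls) from a confusion matrix. The function names, the example numbers and the layout (rows are true classes, columns are predicted classes) are assumptions made for this example. It is not necessarily the exact ISMIR 2004 formula, for which the cited report should be consulted.

```python
def raw_accuracy(confusion):
    """Fraction of all examples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def macro_recall(confusion):
    """Mean of per-class recalls, so every class counts equally."""
    recalls = []
    for i, row in enumerate(confusion):
        class_size = sum(row)
        if class_size:  # skip classes with no test examples
            recalls.append(row[i] / class_size)
    return sum(recalls) / len(recalls)


# Hypothetical two-class example with an unbalanced class distribution.
confusion = [
    [90, 10],  # true class 0: 90 correct, 10 confused with class 1
    [5, 5],    # true class 1: 5 confused with class 0, 5 correct
]
print(raw_accuracy(confusion), macro_recall(confusion))
```

With unbalanced classes the two figures can differ noticeably, which is why a class-normalized measure is useful alongside raw accuracy.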

Data

Collections

Systems will be evaluated on two different collections. The first collection may either be the MIREX 2007 genre classification set (details below) or a new dataset drawn from the same distribution of over 22,000 tracks. If a new set is selected it is expected to contain 10-12 genres, with between 700 and 1000 tracks per genre.

MIREX 2007 collection statistics: 7000 30-second audio clips in 22.05kHz mono WAV format drawn from 10 genres (700 clips from each genre). Genres:

  • Blues
  • Jazz
  • Country/Western
  • Baroque
  • Classical
  • Romantic
  • Electronica
  • Hip-Hop
  • Rock
  • HardRock/Metal


Carlos Silla (cns2 (at) kent (dot) ac (dot) uk) has contributed a second dataset of Latin popular and dance music sourced from Brazil and hand labeled by music experts. This collection is likely to contain a greater number of styles of music that will be differentiated by rhythmic characteristics than the MIREX 2007 dataset.

More precisely, the Latin Music Database has 3,227 audio files from 10 Latin music genres:

  • Axé
  • Bachata
  • Bolero
  • Forró
  • Gaúcha
  • Merengue
  • Pagode
  • Sertaneja
  • Tango

Audio formats

Participating algorithms will have to read audio in the following format:

  • Sample rate: 22.05 kHz
  • Sample size: 16 bit
  • Number of channels: 1 (mono)
  • Encoding: WAV

Requests for additional audio formats will be considered, if they are submitted a minimum of three weeks before the submission deadline.

Evaluation

Participating algorithms will be evaluated with 3-fold cross validation. Artist filtering will be applied to the test and training splits, i.e. training and test sets will contain different artists. A hierarchical genre taxonomy will be provided to all participating algorithms. This taxonomy will have at most two or three levels depending on the collection composition.
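
One possible way to build artist-filtered folds is sketched below. It is only an illustration under stated assumptions: the function name, the (path, artist, genre) tuple layout and the greedy balancing rule are hypothetical, and the procedure actually used to build the MIREX splits is not specified on this page. The sketch does not stratify by genre.

```python
from collections import defaultdict


def artist_filtered_folds(tracks, n_folds=3):
    """Split tracks into folds so that no artist appears in two folds.

    tracks: iterable of (path, artist, genre) tuples (assumed layout).
    Returns a list of n_folds lists of paths.
    """
    by_artist = defaultdict(list)
    for path, artist, _genre in tracks:
        by_artist[artist].append(path)

    folds = [[] for _ in range(n_folds)]
    # Place the most prolific artists first, each into the smallest fold so far.
    for artist in sorted(by_artist, key=lambda a: (-len(by_artist[a]), a)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(by_artist[artist])
    return folds


# Hypothetical track list for illustration only.
tracks = [
    ("a1.wav", "A", "Rock"), ("a2.wav", "A", "Rock"),
    ("b1.wav", "B", "Jazz"), ("c1.wav", "C", "Jazz"),
    ("d1.wav", "D", "Blues"), ("d2.wav", "D", "Blues"),
]
folds = artist_filtered_folds(tracks)
for k, test_paths in enumerate(folds):
    train_paths = [p for j, f in enumerate(folds) if j != k for p in f]
    print(k, len(train_paths), len(test_paths))
```

Each fold serves once as the test set while the remaining folds form the training set, so all tracks by one artist always fall on the same side of the split.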

The raw classification accuracy, standard deviation and a confusion matrix for each algorithm will be computed. Additionally, an accuracy statistic will be computed that discounts confusion between similar classes - as was used in the MIREX 2005 audio genre task. This will be defined as follows:

  • 1.0 point will be scored for correctly assigning the genre label, i.e. for a two-level hierarchy, correctly assigning the labels Jazz&Blues and Blues to an example scores 1.0 point.
  • Tracks misclassified as a class on the same branch of the genre hierarchy as the true class will score a number of points equal to the number of nodes in the hierarchy shared with the true class, divided by the length of the correct branch. For example, in a two-level hierarchy containing the following branches:
 JazzBlues, Jazz
 JazzBlues, Blues
 CountryWestern
 GeneralClassical, Baroque
 GeneralClassical, Classical
 GeneralClassical, Romantic
 Electronica
 HipHop
 GeneralRock, Rock
 GeneralRock, HardRockMetal


misclassifying a Jazz example as Blues will score 0.5 points.

  • Tracks misclassified as a completely dissimilar class will score 0.0 points.
  • The significance of differences in the error rates of each system will be tested at each iteration using McNemar's test, with the mean and standard deviation of the P-values computed across iterations.
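
The hierarchical scoring rules above can be sketched as follows. This is a minimal illustration that uses the example branches listed above. The names BRANCHES, PATHS and hierarchical_score are hypothetical and are not part of any MIREX tooling.

```python
# Example branches from the two-level hierarchy above (root excluded).
BRANCHES = [
    ["JazzBlues", "Jazz"],
    ["JazzBlues", "Blues"],
    ["CountryWestern"],
    ["GeneralClassical", "Baroque"],
    ["GeneralClassical", "Classical"],
    ["GeneralClassical", "Romantic"],
    ["Electronica"],
    ["HipHop"],
    ["GeneralRock", "Rock"],
    ["GeneralRock", "HardRockMetal"],
]

# Map each lowest-level label to its full branch.
PATHS = {branch[-1]: branch for branch in BRANCHES}


def hierarchical_score(true_label, predicted_label):
    """Shared leading nodes divided by the length of the correct branch."""
    true_path = PATHS[true_label]
    predicted_path = PATHS[predicted_label]
    shared = 0
    for true_node, predicted_node in zip(true_path, predicted_path):
        if true_node != predicted_node:
            break
        shared += 1
    return shared / len(true_path)


print(hierarchical_score("Jazz", "Jazz"))   # correct label
print(hierarchical_score("Jazz", "Blues"))  # same branch, one of two nodes shared
print(hierarchical_score("Jazz", "Rock"))   # completely dissimilar class
```

Averaging this score over all test tracks gives an accuracy figure that discounts confusion between related genres.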

Ranking and significance testing

Classification accuracies will be tested for statistically significant differences using two techniques:

  • McNemar's test (a significance test matrix will be provided to display significant differences between algorithms at the 0.05 and 0.01 significance levels)
  • Friedman's ANOVA with Tukey-Kramer honestly significant difference (HSD) tests for multiple comparisons. This test will be used to rank the algorithms and to group them into sets of equivalent performance.
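
For readers unfamiliar with McNemar's test, the sketch below shows the exact (binomial) two-sided form for a pair of algorithms evaluated on the same test tracks. It is an illustration only: the function name and inputs are hypothetical, and the evaluation may well use a different variant, such as the chi-square approximation.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar p-value for two classifiers.

    correct_a, correct_b: sequences of booleans, one entry per test track,
    indicating whether each algorithm classified that track correctly.
    """
    # Only the discordant tracks (one algorithm right, the other wrong) matter.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the algorithms never disagree
    # Under the null hypothesis each discordant track is a fair coin flip.
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcome: A is right and B is wrong on all 10 tracks.
print(mcnemar_exact([True] * 10, [False] * 10))
```

A small p-value suggests that the two algorithms differ in error rate by more than chance would explain.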


In addition, computation times for feature extraction and training/classification will be measured.

Submission format

Submission to this task will have to conform to a specified format detailed below.

Implementation details

Scratch folders will be provided for all submissions for the storage of feature files and any model files to be produced. Executables will have to accept the path to their scratch folder as a command line parameter. Executables will also have to track which feature files correspond to which audio files internally. To facilitate this process, unique filenames will be assigned to each audio track.
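
Because each audio track has a unique filename, one simple way to track feature files is to derive the feature filename from the audio filename. The sketch below assumes this approach. The function name, the ".features" suffix and the example paths are hypothetical choices, not requirements of the task.

```python
from pathlib import Path


def feature_path(scratch_dir, audio_path, suffix=".features"):
    """Return the feature file location in the scratch folder for one track."""
    # The stem is the filename without its directory or its extension.
    return Path(scratch_dir) / (Path(audio_path).stem + suffix)


print(feature_path("/path/to/scratch/folder", "/data/audio/track0001.wav"))
```

The same function can then be called during training and classification to locate the features written during extraction.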

The audio files to be used in the task will be specified in a simple ASCII list file. For feature extraction and classification this file will contain one path per line with no header line. For model training this file will contain one path per line, followed by a tab character and the genre label, again with no header line. Executables will have to accept the path to these list files as a command line parameter. The formats for the list files are specified below.

Algorithms should divide their feature extraction and training/classification into separate runs. This will facilitate a single feature extraction step for the task, while training and classification can be run for each cross-validation fold.

Hence, participants should provide two executables or command line parameters for a single executable to run the two separate processes.

Multi-processor compute nodes (2, 4 or 8 cores) will be used to run this task. Hence, participants should attempt to use parallelism wherever possible. Ideally, the number of threads to use should be specified as a command line parameter. Alternatively, implementations may be provided in hard-coded 2, 4 or 8 thread configurations. Single threaded submissions will, of course, be accepted but may be disadvantaged by time constraints.


I/O formats

This section describes the input and output files used in this task, along with the command line calling formats required of submissions.

Genre hierarchy

A genre hierarchy file will be provided to submissions requesting one. There is no guarantee that the tree defined by this file will be balanced (all branches being the same length). Therefore, the tree defined may have branches of length 1, 2 or 3 (excluding the root node).

This file will have a number of lines equal to the number of genres (with no header line). Each line in the file will conform to one of the following formats:

 Highest_level_classification\tMid_level_classification\tLowest_level_classification
 Highest_level_classification\tLowest_level_classification
 Lowest_level_classification

where \t represents a tab character and Lowest_level_classification is the actual genre label applied to files.

E.g. a simple file for a 4 class genre taxonomy might look like:

 Rock&Pop	Rock	Alternative Rock
 Rock&Pop	Rock
 Rock&Pop	Pop
 Classical
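
A hierarchy file in this format could be read as sketched below. This is a minimal illustration: the function name is hypothetical, and the example lines are the four lines shown above, with real tab characters written as \t.

```python
def parse_hierarchy(lines):
    """Map each lowest-level genre label to its full branch (root excluded)."""
    branches = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # tolerate blank lines
        nodes = line.split("\t")
        # The last field on the line is the label actually applied to files.
        branches[nodes[-1]] = nodes
    return branches


example_lines = [
    "Rock&Pop\tRock\tAlternative Rock\n",
    "Rock&Pop\tRock\n",
    "Rock&Pop\tPop\n",
    "Classical\n",
]
hierarchy = parse_hierarchy(example_lines)
for label, branch in hierarchy.items():
    print(label, "->", " / ".join(branch))
```

In practice the lines would come from opening the hierarchy file passed on the command line. A list is used here so the example is self-contained.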

Feature extraction list file

The list file passed for feature extraction will be a simple ASCII list file. This file will contain one path per line with no header line.

Training list file

The list file passed for model training will be a simple ASCII list file. This file will contain one path per line, followed by a tab character and the genre label, again with no header line.

E.g. <example path and filename>\t<genre classification>
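
A training list file of this form could be parsed as in the sketch below. The function name and the example paths are hypothetical, and the sketch assumes that genre labels contain no tab characters.

```python
def read_training_list(lines):
    """Return a list of (audio_path, genre_label) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate a trailing blank line
        # Split on the last tab so the label is always the final field.
        path, label = line.rsplit("\t", 1)
        pairs.append((path, label))
    return pairs


example_lines = [
    "/data/audio/track0001.wav\tBlues\n",
    "/data/audio/track0002.wav\tHardRock/Metal\n",
]
print(read_training_list(example_lines))
```

A feature extraction or test list can be read in the same way by keeping the whole line as the path, since those files carry no label.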

Test (classification) list file

The list file passed for testing classification will be a simple ASCII list file identical in format to the Feature extraction list file. This file will contain one path per line with no header line.

Classification output files

Participating algorithms should produce a simple ASCII list file identical in format to the Training list file. This file will contain one path per line, followed by a tab character and the genre label, again with no header line. E.g.:

<example path and filename>\t<genre classification>

The path to which this list file should be written must be accepted as a parameter on the command line.
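
Writing the classification output could look like the sketch below. The function name and the example predictions are hypothetical. An in-memory buffer stands in for the output file whose path is given on the command line.

```python
import io


def write_classifications(out, predictions):
    """Write one 'path<TAB>label' line per track, with no header line."""
    for path, label in predictions:
        out.write(f"{path}\t{label}\n")


# Hypothetical predictions for two test tracks.
predictions = [
    ("/data/audio/track0101.wav", "Jazz"),
    ("/data/audio/track0102.wav", "Rock"),
]
buffer = io.StringIO()
write_classifications(buffer, predictions)
print(buffer.getvalue(), end="")
```

With a real file, the same function would be called on a handle opened in text mode for writing.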

New Optional Output File

Furthermore, we encourage the participating algorithms to produce an additional output file representing the features extracted from each file in a format of the authors' choice. One possible format would be Weka ARFF, but participants are not limited to it. A simple CSV (Comma Separated Value) list would suffice.

Example submission calling formats

 extractFeatures.sh /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 TrainAndClassify.sh /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputListFile.txt 
 extractFeatures.sh /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 TrainAndClassify.sh /path/to/scratch/folder /path/to/hierarchy/file /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputListFile.txt
 extractFeatures.sh -numThreads 8 /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 TrainAndClassify.sh -numThreads 8 /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputListFile.txt
 extractFeatures.sh /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 Train.sh /path/to/scratch/folder /path/to/trainListFile.txt 
 Classify.sh /path/to/testListFile.txt /path/to/outputListFile.txt
 myAlgo.sh -extract -numThreads 8 /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 myAlgo.sh -TrainAndClassify -numThreads 8 /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputListFile.txt
 myAlgo.sh -extract /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 myAlgo.sh -train /path/to/scratch/folder /path/to/trainListFile.txt 
 myAlgo.sh -classify /path/to/testListFile.txt /path/to/outputListFile.txt
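
For submissions that use a single executable, the myAlgo.sh style of calling format above could be handled with a parser along the following lines. This is a sketch only. The flag names mirror the examples above, while the function name and the handling of the positional paths are assumptions.

```python
import argparse


def build_parser():
    """Parser for a single executable that switches mode via a flag."""
    parser = argparse.ArgumentParser(prog="myAlgo")
    mode = parser.add_mutually_exclusive_group(required=True)
    for flag in ("extract", "train", "classify", "TrainAndClassify"):
        # Each flag stores its own name in args.mode; exactly one is required.
        mode.add_argument(f"-{flag}", dest="mode", action="store_const", const=flag)
    parser.add_argument("-numThreads", type=int, default=1,
                        help="number of threads to use")
    # Scratch folder and list file paths; their meaning depends on the mode.
    parser.add_argument("paths", nargs="+")
    return parser


args = build_parser().parse_args(
    ["-extract", "-numThreads", "8",
     "/path/to/scratch/folder", "/path/to/featureExtractionListFile.txt"]
)
print(args.mode, args.numThreads, args.paths)
```

Whichever calling format is chosen, it should be documented in the README so that it can be run unattended.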

Packaging submissions

All submissions should be statically linked to all libraries (the presence of dynamically linked libraries cannot be guaranteed). IMIRSEL should be notified of any dependencies that you cannot include with your submission at the earliest opportunity (in order to give them time to satisfy the dependency).


All submissions should include a README file containing the following information:

  • Command line calling format for all executables
  • Number of threads/cores used or whether this should be specified on the command line
  • Expected memory footprint
  • Expected runtime
  • Any required environments (and versions) such as Matlab, Java, Python, Bash, Ruby etc.

Pre-trained submissions

Pre-trained submissions to this task will be accepted - however they will have to ensure that they return the correct classification labels (as listed in the hierarchy file).

Time and hardware limits

Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be specified.

A hard limit of 24 hours will be imposed on feature extraction times.

A hard limit of 24 hours will be imposed on each training/classification cycle, leading to a total runtime limit of 72 hours across the three cross-validation folds.

Submission opening date

TBA

Submission closing date

TBA

Potential Participants

If you think there is a slight chance that you would participate, please add your name and email below.

1. Preeti Rao and Sujeet Kini (Indian Institute of Technology, Bombay), prao[at]ee[dot]iitb[dot]ac[dot]in, kinisujeet[at]ee[dot]iitb[dot]ac[dot]in

2. Jia-Min Ren, Zhi-Sheng Chen (Tsing Hua Univ., Taiwan), jmzen0921[at]mirlab[dot]org

3. Chuan Cao, Ming Li (Institute of Acoustics, Chinese Academy of Sciences, China), ccao.hccl[at]gmail[dot]com

4. Thomas Lidy (+ ...), Vienna University of Technology, Austria, lidy[at]ifs[dot]tuwien[dot]ac[dot]at

5. Nicolas Wack et al., MTG Universitat Pompeu Fabra, Spain, nicolas[dot]wack[at]upf[dot]edu (3 algos presented, 1 by me, 1 by Enric Guaus and 1 by Cyril Laurier)

6. Huaxin Wang, Dongdong Mao, Di Sun (Peking University, China), wanghxcis[at]gmail[dot]com

7. Michael Mandel, Columbia University, mim[at]ee[dot]columbia[dot]edu

8. Tao Zheng (School of Information, Renmin University of China), w15784463[at]sina[dot]com

9. Emiru Tsunoo (University of Tokyo), George Tzanetakis (University of Victoria), Nobutaka Ono and Shigeki Sagayama (University of Tokyo), tsunoo[at]hil[dot]t[dot]u-tokyo[dot]ac[dot]jp

10. Juan Jose Burred, Geoffroy Peeters (IRCAM), burred[at]ircam[dot]fr