Difference between revisions of "2010:Audio Classification (Train/Test) Tasks"

Description

Many tasks in music classification can be characterized as a two-stage process: training classification models on labeled data, then testing the models on new, unseen data. We therefore propose this "super" task, which includes various audio classification tasks that follow this Train/Test process. This year, three classification tasks are included:

Audio Artist Identification
Audio Genre Classification
Audio Mood Classification

All three classification tasks were conducted in previous MIREX runs. This page presents the evaluation of these tasks, including the datasets, the submission rules and formats, and links to the wiki pages of previous runs of these tasks. Additional background information, which should help explain some of the reasoning behind the approach taken in the evaluation, can be found here. Please feel free to edit this page and to discuss the task format and evaluation on the MRX-COM00 mailing list (List interface).

Data

Each of the three classification tasks uses its own data.

Audio Artist Identification

There are two datasets for this task:

1) The collection used at MIREX 2009 will be re-used. Collection statistics:

3150 30-second 22.05kHz mono wav audio clips drawn from 105 artists (30 clips per artist drawn from 3 albums).

2) The second collection is composed of classical composers:

2772 30-second 22.05 kHz mono wav clips organised into 11 "classical" composers (252 clips per composer). At present the database contains tracks for:
- Bach
- Beethoven
- Brahms
- Chopin
- Dvorak
- Handel
- Haydn
- Mendelssohn
- Mozart
- Schubert
- Vivaldi

Audio Genre Classification

This task will use two different datasets:

1) The MIREX 2007 Genre Collection: The first collection may either be the MIREX 2007 genre classification set (details below) or a new dataset drawn from the same distribution of over 22,000 tracks. If a new set is selected, it is expected to contain 10-12 genres, with between 700 and 1000 tracks per genre.

MIREX 2007 collection statistics: 7000 30-second audio clips in 22.05kHz mono WAV format drawn from 10 genres (700 clips from each genre). Genres:

Blues
Jazz
Country/Western
Baroque
Classical
Romantic
Electronica
Hip-Hop
Rock
HardRock/Metal

2) Latin Genre Collection: Carlos Silla (cns2 (at) kent (dot) ac (dot) uk) has contributed a second dataset of Latin popular and dance music sourced from Brazil and hand labeled by music experts. This collection is likely to contain a greater number of styles of music that will be differentiated by rhythmic characteristics than the MIREX 2007 dataset.

More precisely, the Latin Music Database has 3,227 audio files from 10 Latin music genres:

Axé
Bachata
Bolero
Forró
Gaúcha
Merengue
Pagode
Sertaneja
Tango

Audio Mood Classification

The MIREX 2007 Mood Classification dataset will be used. The dataset consists of 600 30-second audio clips selected from the APM collection (www.apmmusic.com) and labeled by human judges using the Evalutron6000 system. There are 5 mood categories, each of which contains 120 clips:

Cluster_1: passionate, rousing, confident, boisterous, rowdy
Cluster_2: rollicking, cheerful, fun, sweet, amiable/good natured
Cluster_3: literate, poignant, wistful, bittersweet, autumnal, brooding
Cluster_4: humorous, silly, campy, quirky, whimsical, witty, wry
Cluster_5: aggressive, fiery, tense/anxious, intense, volatile, visceral

Audio Formats

For all three tasks, participating algorithms will have to read audio in the following format:

Sample rate: 22.05 kHz
Sample size: 16 bit
Number of channels: 1 (mono)
Encoding: WAV
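Participants may wish to verify that their reader sees the format listed above. The following is a minimal sketch using Python's standard `wave` module; it assumes the sample rate is exactly 22050 Hz (as stated in the dataset descriptions), and the function name `check_wav_format` is purely illustrative.

```python
import wave

EXPECTED_RATE = 22050      # assumption: 22.05 kHz, as in the dataset descriptions
EXPECTED_SAMPLE_BYTES = 2  # 16 bit
EXPECTED_CHANNELS = 1      # mono

def check_wav_format(path):
    """Return a list of format problems; the list is empty if the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_BYTES:
            problems.append(f"sample size is {8 * wav.getsampwidth()} bit")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"file has {wav.getnchannels()} channels")
    return problems
```

An empty return value means the clip has the expected sample rate, sample size and channel count.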

Evaluation

This section first describes evaluation methods common to all the three tasks, then specifies settings unique to each of the tasks.

For all three tasks, participating algorithms will be evaluated with 3-fold cross-validation. For Artist Identification, album filtering will be applied to the test and training splits, i.e. training and test sets will contain tracks from different albums; for Genre Classification, artist filtering will be applied to the test and training splits, i.e. training and test sets will contain different artists.
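The exact procedure used to build the folds is not specified on this page. As an illustration only, a filtered 3-fold split could be sketched as below, assuming each track is described by a hypothetical `(path, label, group)` tuple in which `group` is the album (Artist Identification) or the artist (Genre Classification).

```python
from collections import defaultdict

def filtered_folds(tracks, n_folds=3):
    """Split tracks into folds so that no group (album or artist) spans two folds.

    tracks: list of (path, label, group) tuples (hypothetical layout).
    Returns a list of n_folds lists of tracks.
    """
    # Collect the groups that occur under each class label.
    groups_by_label = defaultdict(set)
    for _path, label, group in tracks:
        groups_by_label[label].add(group)

    # Deal the groups of each class round-robin over the folds, so a class
    # with at least n_folds groups is represented in every fold.
    fold_of_group = {}
    for label in sorted(groups_by_label):
        new_groups = [g for g in sorted(groups_by_label[label])
                      if g not in fold_of_group]
        for i, group in enumerate(new_groups):
            fold_of_group[group] = i % n_folds

    # Every track follows its group into that group's fold.
    folds = [[] for _ in range(n_folds)]
    for track in tracks:
        folds[fold_of_group[track[2]]].append(track)
    return folds
```

In each cross-validation iteration one fold serves as the test set and the remaining two as the training set, so a given album (or artist) never appears on both sides of a split.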

The raw classification (identification) accuracy, standard deviation and a confusion matrix for each algorithm will be computed.
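For reference, these measures can be sketched as follows. The function names are illustrative, and computing the standard deviation over the per-fold accuracies is an assumption, since the page does not say over what the deviation is taken.

```python
from collections import Counter
from statistics import mean, pstdev

def accuracy(truth, predicted):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

def confusion_matrix(truth, predicted, labels):
    """Rows are true classes, columns are predicted classes, in `labels` order."""
    counts = Counter(zip(truth, predicted))
    return [[counts[(t, p)] for p in labels] for t in labels]

def summarise_folds(fold_accuracies):
    """Mean and (population) standard deviation of per-fold accuracies."""
    return mean(fold_accuracies), pstdev(fold_accuracies)
```

For example, with true labels `["a", "a", "b", "b"]` and predictions `["a", "b", "b", "b"]`, the accuracy is 0.75 and the confusion matrix for `labels=["a", "b"]` is `[[1, 1], [0, 2]]`.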

Classification accuracies will be tested for statistically significant differences using two techniques:

McNemar's test (Dietterich, 1997), a statistical procedure that can validate the significance of differences between two classifiers.

A significance test matrix will be provided to display significant differences between algorithms at p-values of 0.05 and 0.01.

Friedman's ANOVA with Tukey-Kramer honestly significant difference (HSD) tests for multiple comparisons. This test will be used to rank the algorithms and to group them into sets of equivalent performance.
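Both tests can be sketched in a few lines. The code below is an illustration under stated assumptions, not the evaluation code itself: McNemar's test is shown in its usual continuity-corrected chi-square form, and for Friedman's test only the rank-based statistic and mean ranks are computed (no tie correction, no p-value, and no Tukey-Kramer step). What constitutes a "block" for Friedman's test (e.g. a fold) is not stated on this page, so `scores[b][a]` is a hypothetical layout.

```python
import math

def mcnemar(truth, pred_a, pred_b):
    """McNemar's test with continuity correction; returns (chi2, p_value)."""
    only_a = sum(a == t and b != t for t, a, b in zip(truth, pred_a, pred_b))
    only_b = sum(a != t and b == t for t, a, b in zip(truth, pred_a, pred_b))
    n = only_a + only_b
    if n == 0:
        return 0.0, 1.0  # the classifiers never disagree in correctness
    chi2 = (abs(only_a - only_b) - 1) ** 2 / n
    # Survival function of a chi-square variable with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))

def average_ranks(values):
    """Rank values ascending from 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def friedman_statistic(scores):
    """scores[b][a] is the score of algorithm a on block b.

    Returns (chi2, mean_ranks); a higher mean rank means higher scores.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for a, r in enumerate(average_ranks(row)):
            rank_sums[a] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, [r / n for r in rank_sums]
```

A pair of algorithms would be marked as significantly different in the matrix when the McNemar p-value falls below 0.05 or 0.01.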

In addition, computation times for feature extraction and training/classification will be measured.

Audio Genre Classification

A hierarchical genre taxonomy will be provided to all participating algorithms. This taxonomy will have at most two or three levels depending on the collection composition.

In addition to the aforementioned measures, an accuracy statistic will be computed that discounts confusion between similar classes, as was used in the MIREX 2005 audio genre task. This will be defined as follows:

1.0 point will be scored for correctly assigning the genre label, e.g. for a two-level hierarchy, correctly assigning the labels Jazz&Blues and Blues to an example scores 1.0 point.
Tracks misclassified as a class on the same branch of the genre hierarchy as the true class will score a number of points equal to the number of nodes in the hierarchy shared with the true class, divided by the length of the correct branch. E.g. in a two-level hierarchy containing the following branches:

 JazzBlues, Jazz
 JazzBlues, Blues
 CountryWestern
 GeneralClassical, Baroque
 GeneralClassical, Classical
 GeneralClassical, Romantic
 Electronica
 HipHop
 GeneralRock, Rock
 GeneralRock, HardRockMetal

misclassifying a Jazz example as Blues will score 0.5 points.

Tracks misclassified as a completely dissimilar class will score 0.0 points.
The significance of differences in the error rates of each system will be tested at each iteration using McNemar's test, summarised by the mean and standard deviation of the p-values.
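The hierarchical scoring rule described above can be sketched as follows. Each class is represented by its branch (the list of nodes from the top level down to the genre label); the `BRANCHES` table copies the example hierarchy above, and the function name is illustrative.

```python
BRANCHES = {
    "Jazz": ["JazzBlues", "Jazz"],
    "Blues": ["JazzBlues", "Blues"],
    "CountryWestern": ["CountryWestern"],
    "Baroque": ["GeneralClassical", "Baroque"],
    "Classical": ["GeneralClassical", "Classical"],
    "Romantic": ["GeneralClassical", "Romantic"],
    "Electronica": ["Electronica"],
    "HipHop": ["HipHop"],
    "Rock": ["GeneralRock", "Rock"],
    "HardRockMetal": ["GeneralRock", "HardRockMetal"],
}

def hierarchical_score(true_label, predicted_label, branches=BRANCHES):
    """Nodes shared with the true branch, divided by the true branch length."""
    true_branch = branches[true_label]
    predicted_branch = branches[predicted_label]
    shared = 0
    for t, p in zip(true_branch, predicted_branch):
        if t != p:
            break
        shared += 1
    return shared / len(true_branch)
```

With this table, `hierarchical_score("Jazz", "Jazz")` gives 1.0, `hierarchical_score("Jazz", "Blues")` gives 0.5, and `hierarchical_score("Jazz", "Electronica")` gives 0.0. The rule as stated does not say how to treat a prediction that lies further down the same branch as the true label; this sketch simply applies the shared-node count in that case as well.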

Submission

File I/O Format

For all the three tasks, scratch folders will be provided for all submissions for the storage of feature files and any model files to be produced. Executables will have to accept the path to their scratch folder as a command line parameter. Executables will also have to track which feature files correspond to which audio files internally. To facilitate this process, unique file names will be assigned to each audio track.

The audio files to be used in these tasks will be specified in a simple ASCII list file. The formats for the list files are specified below:

Feature extraction list file

The list file passed for feature extraction will be a simple ASCII list file. This file will contain one path per line with no header line.

E.g. <example path and filename>

Training list file

The list file passed for model training will be a simple ASCII list file. This file will contain one path per line, followed by a tab character and the class (artist, genre or mood) label, again with no header line.

 E.g. <example path and filename>\t<class label>

Test (classification) list file

The list file passed for testing classification will be a simple ASCII list file identical in format to the Feature extraction list file. This file will contain one path per line with no header line.

Classification output file

Participating algorithms should produce a simple ASCII list file identical in format to the Training list file. This file will contain one path per line, followed by a tab character and the predicted class (artist, genre or mood) label, again with no header line.

 E.g. <example path and filename>\t<class label>
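As an illustration of the list file formats above, the following sketch reads and writes them through file-like objects. The function names and the example paths are hypothetical; only the line format (path, optional tab and label, no header) comes from this page.

```python
import io

def read_path_list(handle):
    """Feature extraction or test list: one path per line, no header."""
    return [line.rstrip("\r\n") for line in handle if line.strip()]

def read_labeled_list(handle):
    """Training list: path, a tab character, then the class label."""
    pairs = []
    for line in handle:
        if not line.strip():
            continue  # tolerate blank lines, e.g. at the end of the file
        path, label = line.rstrip("\r\n").split("\t", 1)
        pairs.append((path, label))
    return pairs

def write_labeled_list(pairs, handle):
    """Classification output: same format as the training list."""
    for path, label in pairs:
        handle.write(f"{path}\t{label}\n")

# Round trip on an in-memory example (hypothetical paths).
training = read_labeled_list(io.StringIO("/data/a.wav\tBlues\n/data/b.wav\tJazz\n"))
output = io.StringIO()
write_labeled_list(training, output)
```

Splitting on the first tab only keeps labels that contain spaces intact.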

Submission calling formats

Algorithms should divide their feature extraction and training/classification into separate runs. This will facilitate a single feature extraction step for the task, while training and classification can be run for each cross-validation fold.

Hence, participants should provide two executables or command line parameters for a single executable to run the two separate processes.

Also, executables will have to accept the paths to the aforementioned list files as command line parameters.

Example submission calling formats

 extractFeatures.sh /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 TrainAndClassify.sh /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputListFile.txt

 extractFeatures.sh /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 Train.sh /path/to/scratch/folder /path/to/trainListFile.txt 
 Classify.sh /path/to/scratch/folder /path/to/testListFile.txt /path/to/outputListFile.txt

 myAlgo.sh -extract /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 myAlgo.sh -train /path/to/scratch/folder /path/to/trainListFile.txt 
 myAlgo.sh -classify /path/to/scratch/folder /path/to/testListFile.txt /path/to/outputListFile.txt

Multi-processor compute nodes (2, 4 or 8 cores) will be used to run this task. Hence, participants should attempt to use parallelism wherever possible. Ideally, the number of threads to use should be specified as a command line parameter. Alternatively, implementations may be provided in hard-coded 2, 4 or 8 thread configurations. Single-threaded submissions will, of course, be accepted but may be disadvantaged by time constraints.

 extractFeatures.sh -numThreads 8 /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 TrainAndClassify.sh -numThreads 8 /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputListFile.txt

 myAlgo.sh -extract -numThreads 8 /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 myAlgo.sh -TrainAndClassify -numThreads 8 /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputListFile.txt
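For a submission written in Python, a single entry point accepting the flags shown in the examples above could be sketched with `argparse` as follows. This is one possible arrangement, not a required interface; the choice of a mutually exclusive mode group and a default of one thread are assumptions.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="myAlgo.sh")
    # Exactly one processing mode must be chosen per run.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-extract", action="store_true")
    mode.add_argument("-train", action="store_true")
    mode.add_argument("-classify", action="store_true")
    mode.add_argument("-TrainAndClassify", action="store_true")
    # Thread count is optional; 1 is an assumed default.
    parser.add_argument("-numThreads", type=int, default=1)
    # Scratch folder first, then the list file paths for the chosen mode.
    parser.add_argument("paths", nargs="+")
    return parser

args = build_parser().parse_args([
    "-extract", "-numThreads", "8",
    "/path/to/scratch/folder", "/path/to/featureExtractionListFile.txt",
])
scratch_folder, list_files = args.paths[0], args.paths[1:]
```

The wrapper script named in the README would then forward its arguments unchanged to this entry point.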

Packaging submissions

All submissions should be statically linked to all libraries (the presence of dynamically linked libraries cannot be guaranteed). IMIRSEL should be notified at the earliest opportunity of any dependencies that you cannot include with your submission, in order to give them time to satisfy the dependency.
Be sure to follow the Best Coding Practices for MIREX.
Be sure to follow the MIREX 2010 Submission Instructions.

All submissions should include a README file containing the following information:

Command line calling format for all executables
Number of threads/cores used or whether this should be specified on the command line
Expected memory footprint
Expected runtime
Approximate scratch disk space needed to store any feature/cache files
Any required environments (and versions) such as Matlab, Java, Python, Bash, Ruby etc.
Any special notes regarding running your algorithm

Note that the information that you place in the README file is extremely important in ensuring that your submission is evaluated properly.

Time and hardware limits

Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be imposed.

A hard limit of 24 hours will be imposed on feature extraction times.

A hard limit of 24 hours will be imposed on each training/classification cycle, leading to a total training/classification runtime limit of 72 hours over the three cross-validation folds.

Specific to Audio Genre Classification: Genre hierarchy

A genre hierarchy file will be provided to submissions requesting one. There is no guarantee that the tree defined by this file will be balanced (all branches being the same length). Therefore, the tree defined may have branches of length 1, 2 or 3 (excluding the root node).

This file will have a number of lines equal to the number of genres (with no header line). Each line in the file will conform to one of the following formats:

 Highest_level_classification\tMid_level_classification\tLowest_level_classification
 Highest_level_classification\tLowest_level_classification
 Lowest_level_classification

where \t represents a tab character and Lowest_level_classification is the actual genre label applied to files.

E.g. a simple file for a 4 class genre taxonomy might look like:

 Rock&Pop	Rock	Alternative Rock
 Rock&Pop	Rock
 Rock&Pop	Pop
 Classical
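A submission that requests the hierarchy file could load it as sketched below, mapping each genre label (the last field on a line) to its full branch. The function name is illustrative, and the lines used in the example are those of the 4-class file above.

```python
def parse_genre_hierarchy(lines):
    """Map each genre label to its branch, from highest to lowest level."""
    branches = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        nodes = line.split("\t")
        # The lowest-level classification is the label applied to files.
        branches[nodes[-1]] = nodes
    return branches

example = [
    "Rock&Pop\tRock\tAlternative Rock\n",
    "Rock&Pop\tRock\n",
    "Rock&Pop\tPop\n",
    "Classical\n",
]
hierarchy = parse_genre_hierarchy(example)
```

Here `hierarchy["Alternative Rock"]` is `["Rock&Pop", "Rock", "Alternative Rock"]` and `hierarchy["Classical"]` is `["Classical"]`, so branch lengths of 1, 2 and 3 are all handled.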

Submission opening date

TBA

Submission closing date

TBA

Links to Previous MIREX Runs of These Classification Tasks

Audio Artist Identification

Audio Artist Identification in ISMIR2004 Audio Description Contest (http://ismir2004.ismir.net/genre_contest/index.htm)



Revision as of 16:43, 21 May 2010
