Difference between revisions of "2010:Audio Classification (Train/Test) Tasks"

From MIREX Wiki

Revision as of 13:55, 21 May 2010

Description

Many tasks in music classification can be characterized as a two-stage process: train classification models using labeled data, then test the models using new, unseen data. We therefore propose this "super" task, which groups together the audio classification tasks that follow this Train/Test process. This year, three classification tasks are included:

  • Audio Artist Identification
  • Audio Genre Classification
  • Audio Mood Classification

All three classification tasks were conducted in previous MIREX runs. This page presents the evaluation of these tasks, including the datasets, the submission rules and formats, and links to the wiki pages of previous runs of these tasks. Background information is also available that should help explain some of the reasoning behind the approach taken in the evaluation. Please feel free to edit this page and to discuss the task format and evaluation on the MRX-COM00 mailing list (List interface).

Data

The three classification tasks use three different datasets.

Audio Artist Identification

There are two datasets for this task:

1) The collection used at MIREX 2009 will be re-used. Collection statistics:

  • 3150 30-second 22.05kHz mono wav audio clips drawn from 105 artists (30 clips per artist drawn from 3 albums).

2) The second collection is composed of classical composers:

  • 2772 30-second 22.05 kHz mono wav clips organised into 11 "classical" composers (252 clips per composer). At present the database contains tracks for:
    • Bach
    • Beethoven
    • Brahms
    • Chopin
    • Dvorak
    • Handel
    • Haydn
    • Mendelssohn
    • Mozart
    • Schubert
    • Vivaldi

Audio Genre Classification

This task will use two different datasets.

1) The MIREX 2007 Genre Collection: The first collection may either be the MIREX 2007 genre classification set (details below) or a new dataset drawn from the same distribution of over 22,000 tracks. If a new set is selected, it is expected to contain 10-12 genres, with between 700 and 1000 tracks per genre.

MIREX 2007 collection statistics: 7000 30-second audio clips in 22.05kHz mono WAV format drawn from 10 genres (700 clips from each genre). Genres:

  • Blues
  • Jazz
  • Country/Western
  • Baroque
  • Classical
  • Romantic
  • Electronica
  • Hip-Hop
  • Rock
  • HardRock/Metal


2) Latin Genre Collection: Carlos Silla (cns2 (at) kent (dot) ac (dot) uk) has contributed a second dataset of Latin popular and dance music sourced from Brazil and hand labeled by music experts. This collection is likely to contain a greater number of styles of music that will be differentiated by rhythmic characteristics than the MIREX 2007 dataset.

More precisely, the Latin Music Database has 3,227 audio files from 10 Latin music genres:

  • Axé
  • Bachata
  • Bolero
  • Forró
  • Gaúcha
  • Merengue
  • Pagode
  • Sertaneja
  • Tango

Audio Mood Classification

The MIREX 2007 Mood Classification dataset will be used. The dataset consists of 600 30-second audio clips selected from the APM collection (www.apmmusic.com) and labeled by human judges using the Evalutron6000 system. There are 5 mood categories, each of which contains 120 clips:

  • Cluster_1: passionate, rousing, confident, boisterous, rowdy
  • Cluster_2: rollicking, cheerful, fun, sweet, amiable/good natured
  • Cluster_3: literate, poignant, wistful, bittersweet, autumnal, brooding
  • Cluster_4: humorous, silly, campy, quirky, whimsical, witty, wry
  • Cluster_5: aggressive, fiery, tense/anxious, intense, volatile, visceral

Audio Formats

For all three tasks, participating algorithms will have to read audio in the following format:

  • Sample rate: 22.05 kHz
  • Sample size: 16 bit
  • Number of channels: 1 (mono)
  • Encoding: WAV
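
As a rough illustration of the format above, a submission could verify its input files before processing them. The sketch below uses Python's standard `wave` module; the function name `check_wav_format` is hypothetical, and it assumes the listed sample rate corresponds to 22,050 Hz, as in the dataset descriptions.

```python
import wave

# Expected format, taken from the list above.
# Assumption: the sample rate is 22,050 Hz (22.05 kHz), as in the dataset descriptions.
EXPECTED_RATE = 22050
EXPECTED_SAMPLE_WIDTH = 2  # bytes per sample, i.e. 16 bit
EXPECTED_CHANNELS = 1      # mono


def check_wav_format(path):
    """Return a list of mismatches between the WAV file at `path` and the expected format."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            problems.append(f"sample rate {wav.getframerate()} != {EXPECTED_RATE}")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample width {wav.getsampwidth()} bytes != {EXPECTED_SAMPLE_WIDTH}")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channels {wav.getnchannels()} != {EXPECTED_CHANNELS}")
    return problems
```

An empty list means the file matches all three properties; `wave.open` itself raises `wave.Error` if the file is not a readable PCM WAV file.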

Evaluation

This section first describes evaluation methods common to all three tasks, then specifies settings unique to each task.

For all three tasks, participating algorithms will be evaluated with 3-fold cross-validation. For Artist Identification, album filtering will be applied to the test and training splits, i.e. training and test sets will contain tracks from different albums; for Genre Classification, artist filtering will be applied to the test and training splits, i.e. training and test sets will contain different artists.
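
One way to picture album or artist filtering is as a grouped split: every track that shares a group label (an album or an artist) is placed in the same fold, so no group appears on both sides of a train/test split. The sketch below is only an illustration under that assumption; the function names and the greedy balancing strategy are hypothetical, not the procedure used by the MIREX organisers, and the sketch does not stratify by class.

```python
import random
from collections import defaultdict


def filtered_folds(track_groups, n_folds=3, seed=0):
    """Assign tracks to folds so that tracks sharing a group stay together.

    track_groups: dict mapping track id -> group label (artist or album).
    Returns a dict mapping track id -> fold index in range(n_folds).
    """
    groups = defaultdict(list)
    for track, group in track_groups.items():
        groups[group].append(track)

    # Shuffle reproducibly, then place larger groups first into whichever
    # fold currently holds the fewest tracks (a simple greedy balance).
    names = sorted(groups)
    random.Random(seed).shuffle(names)
    names.sort(key=lambda g: len(groups[g]), reverse=True)

    fold_sizes = [0] * n_folds
    assignment = {}
    for name in names:
        fold = fold_sizes.index(min(fold_sizes))
        for track in groups[name]:
            assignment[track] = fold
        fold_sizes[fold] += len(groups[name])
    return assignment


def train_test_splits(assignment, n_folds=3):
    """Yield (train_ids, test_ids) once per fold, using that fold as the test set."""
    for k in range(n_folds):
        test = [t for t, f in assignment.items() if f == k]
        train = [t for t, f in assignment.items() if f != k]
        yield train, test
```

For Genre Classification the group label would be the artist; for Artist Identification it would be the album.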

The raw classification (identification) accuracy, standard deviation and a confusion matrix for each algorithm will be computed.
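
The three quantities named above can be sketched in a few lines of Python. This is an illustration only: it assumes the standard deviation is taken over per-fold accuracies and that the confusion matrix pools all folds, neither of which is stated on this page, and the function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def accuracy(truth, predicted):
    """Fraction of items whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def confusion_matrix(truth, predicted, labels):
    """Rows are true labels, columns are predicted labels, in the order of `labels`."""
    counts = Counter(zip(truth, predicted))
    return [[counts[(t, p)] for p in labels] for t in labels]


def summarize(folds, labels):
    """folds: list of (truth, predicted) pairs, one per cross-validation fold.

    Assumption: the standard deviation is computed across per-fold accuracies
    (population form), and the confusion matrix pools all folds.
    """
    accs = [accuracy(t, p) for t, p in folds]
    all_truth = [x for t, _ in folds for x in t]
    all_pred = [x for _, p in folds for x in p]
    return mean(accs), pstdev(accs), confusion_matrix(all_truth, all_pred, labels)
```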

Classification accuracies will be tested for statistically significant differences using two techniques:

  • McNemar's test (Dietterich, 1997), a statistical procedure for testing whether the difference between two classifiers is significant. A significance test matrix will be provided to display significant differences between algorithms at significance levels of 0.05 and 0.01.

  • Friedman's ANOVA with Tukey-Kramer honestly significant difference (HSD) tests for multiple comparisons. This test will be used to rank the algorithms and to group them into sets of equivalent performance.
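
As an illustration of the first technique only, the sketch below computes McNemar's test for two classifiers evaluated on the same test items. It assumes the chi-square form with continuity correction and one degree of freedom; whether the MIREX evaluation uses this exact variant is not stated here, and the function name `mcnemar` is hypothetical.

```python
import math


def mcnemar(truth, pred_a, pred_b):
    """Return (statistic, p_value) for McNemar's test with continuity correction.

    Only items on which exactly one of the two classifiers is correct
    contribute to the statistic.
    """
    only_a = sum(a == t and b != t for t, a, b in zip(truth, pred_a, pred_b))
    only_b = sum(a != t and b == t for t, a, b in zip(truth, pred_a, pred_b))
    discordant = only_a + only_b
    if discordant == 0:
        return 0.0, 1.0  # the classifiers never disagree in correctness
    # The correction is clamped at zero so equal counts give a statistic of 0.
    stat = max(abs(only_a - only_b) - 1, 0) ** 2 / discordant
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

A pair of algorithms would be marked as significantly different in the matrix when `p_value` falls below the chosen level (0.05 or 0.01). With few discordant items the chi-square approximation is poor, and an exact binomial test is usually preferred.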

In addition, computation times for feature extraction and training/classification will be measured.
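
Participants who want to estimate these times on their own machines could wrap each stage in a small timer. The helper below is a minimal sketch that measures wall-clock time; the name `timed` is hypothetical, and how the official runs measure time is not specified on this page.

```python
import time


def timed(stage, *args, **kwargs):
    """Run `stage(*args, **kwargs)` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = stage(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `features, seconds = timed(extract_features, clip_paths)` would time a hypothetical `extract_features` function supplied by the participant.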