2016:Audio Tempo Estimation

From MIREX Wiki

Description

This task compares current methods for the extraction of tempo from musical audio. We distinguish between notated tempo and perceptual tempo and will test for the extraction of perceptual tempo.

We differentiate between notated tempo and perceived tempo. If you have the notated tempo (e.g., from the score) it is straightforward attach a tempo annotation to an excerpt and run a contest for algorithms to predict the notated tempo. For excerpts for which we have no "official" tempo annotation, we can also annotate the *perceived* tempo. This is not a straightforward task and needs to be done carefully. If you ask a group of listeners (including skilled musicians) to annotate the tempo of music excerpts, they can give you different answers (they tap at different metrical levels) if they are unfamiliar with the piece. For some excerpts the perceived pulse or tempo is less ambiguous and everyone taps at the same metrical level, but for other excerpts the tempo can be quite ambiguous and you get a complete split across listeners.

The annotation of perceptual tempo can take several forms: a probability density function as a function of tempo; a series of tempos, ranked by their respective perceptual salience; etc. These measures of perceptual tempo can be used as a ground truth on which to test algorithms for tempo extraction. The dominant perceived tempo is sometimes the same as the notated tempo but not always. A piece of music can "feel" faster or slower than it's notated tempo in that the dominant perceived pulse can be a metrical level higher or lower than the notated tempo.

There are several reasons to examine the perceptual tempo, either in place of or in addition to the notated tempo. For many applications of automatic tempo extractors, the perceived tempo of the music is more relevant than the notated tempo. An automatic playlist generator or music navigator, for instance, might allow listeners to select or filter music by its (automatically extracted) tempo. In this case, the "feel", or perceptual tempo may be more relevant than the notated tempo. An automatic DJ apparatus might also perform better with a representation of perceived tempo rather than notated tempo.

A more pragmatic reason for using perceptual tempo rather than notated tempo as a ground truth for our contest is that we simply do not have the notated tempo of our test set. If we notate it by having a panel of expert listeners tap along and label the excerpts, we are by default dealing with the perceived tempo. The handling of this data as ground truth must be done with care.


Data

Collections

MIREX 2006 Tempo dataset collected by Martin F. McKinney (Philips) and Dirk Moelants (IPEM, Ghent University). Composed of 160 30-second clips in WAV format with annotated tempos.


Audio Formats

The data are monophonic sound files, with the associated onset times and data about the annotation robustness.

  • CD-quality (PCM, 16-bit, 44100 Hz)
  • single channel (mono)
  • 30 second clips


Submission Format

Submissions to this task will have to conform to a specified format detailed below. Submissions should be packaged and contain at least two files: The algorithm itself and a README containing contact information and detailing, in full, the use of the algorithm.


Input data

Individual audio files in WAV format (30-second clips drawn from the 140 unseen tracks in the dataset). The audio recordings were selected to provide a stable tempo value, a wide distribution of tempi values, and a large variety of instrumentation and musical styles. About 20% of the files contain non-binary meters, and a small number of examples contain changing meters.


Output Data

Submitted programs should output two tempi (a slower tempo, T1, and a faster tempo, T2) as well as the strength of T1 relative to T2 (0-1). The relative strength ST2 (not output) is simply 1 - ST1. The tempo estimates from each algorithm should be written to a text file in the following format:

T1<tab>T2<tab>ST1

E.g.

60	180	0.7


Algorithm Calling Format

The submitted algorithm must take as arguments a SINGLE .wav file to perform the tempo estimation detection on as well as the full output path and filename of the output file. The ability to specify the output path and file name is essential. Denoting the input .wav file path and name as %input and the output file path and name as %output, a program called foobar could be called from the command-line as follows:

foobar %input %output

or

foobar -i %input -o %output

Moreover, if your submission takes additional parameters, foobar could be called like:

foobar .1 %input %output
foobar -param1 .1 -i %input -o %output  

If your submission is in MATLAB, it should be submitted as a function. Once again, the function must contain String inputs for the full path and names of the input and output files. Parameters could also be specified as input arguments of the function. For example:

foobar('%input','%output')
foobar(.1,'%input','%output')


README File

A README file accompanying each submission should contain explicit instructions on how to to run the program (as well as contact information, etc.). In particular, each command line to run should be specified, using %input for the input sound file and %output for the resulting text file.


Evaluation Procedures

This section focuses on the mechanics of the method while we discuss the data (music excerpts and perceptual data) in the next section. There are two general steps to the method: 1) collection of perceptual tempo annotations; and 2) evaluation of tempo extraction algorithms.

Perceptual tempo data collection

The following procedure is described in more detail in McKinney and Moelants (2004) and Moelants and McKinney (2004). Listeners were asked to tap to the beat of a series of musical excerpts. Responses were collected and their perceived tempo was calculated. For each excerpt, a distribution of perceived tempo was generated. A relatively simple form of perceived tempo was proposed for this contest: The two highest peaks in the perceived tempo distribution for each excerpt were taken, along with their respective heights (normalized to sum to 1.0) as the two tempo candidates for that particular excerpt. The height of a peak in the distribution is assumed to represent the perceptual salience of that tempo.

References

Evaluation of tempo extraction algorithms

Algorithms will process musical excerpts and return the following data: Two tempi in BPM (T1 and T2, where T1 is the slower of the two tempi). For a given algorithm, the performance, P, for each audio excerpt will be given by the following equation:

P = ST1 * TT1 + (1 - ST1) * TT2

where ST1 is the relative perceptual strength of T1 (given by groundtruth data, varies from 0 to 1.0), TT1 is the ability of the algorithm to identify T1 to within 8%, and TT2 is the ability of the algorithm to identify T2 to within 8%. No credit will be given for tempi other than T1 and T2.

The algorithm with the best average P-score will achieve the highest rank in the task.


Relevant Test Collections

We will use a collection of 160 musical exerpts for the evaluation procedure. 40 of the excerpts have been taken from one of McKinney/Moelants previous experiments (See McKinney/Moelants ICMPC paper above).

Excerpts were selected to provide:

  • stable tempo within each excerpt
  • a good distribution of tempi across excerpts
  • a large variety of instrumentation and beat strengths (with and without percussion)
  • a variation of musical styles, including many non-western styles
  • the presence of non-binary meters (about 20% have a ternary element and there are a few examples with odd or changing meter).

We will provide 20 excerpts with ground truth data for participants to try/tune their algorithms before submission. The remaining 140 excerpts will be novel to all participants.


Practice Data

You can find it here:

https://www.music-ir.org/evaluation/MIREX/data/2006/beat/

User: beattrack Password: b34trx

https://www.music-ir.org/evaluation/MIREX/data/2006/tempo/

User: tempo Password: t3mp0

Data has been uploaded in both .tgz and .zip format.


Time and hardware limits

Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be imposed.

A hard limit of 8 hours will be imposed on analysis times. Submissions exceeding this limit may not receive a result.


Potential Participants

name / email