2010:Audio Melody Extraction - Revision history

Kriswest at 09:25, 5 June 2010

2010-06-05T09:25:30Z

AndreasEhmann: /* Output File Format (Audio Melody Extraction) */

2010-05-20T20:28:49Z

Output File Format (Audio Melody Extraction)

AndreasEhmann: /* Output File Format (Audio Melody Extraction) */

2010-05-20T20:28:33Z

Output File Format (Audio Melody Extraction)

AndreasEhmann: /* Evaluation procedures */

2010-05-20T20:27:09Z

Evaluation procedures

AndreasEhmann: /* Evaluation procedures */

2010-05-20T20:04:35Z

Evaluation procedures

AndreasEhmann: /* Evaluation procedures */

2010-05-20T19:58:41Z

Evaluation procedures

AndreasEhmann: /* Relevant Development Collections */

2010-05-20T19:55:54Z

Relevant Development Collections

AndreasEhmann: /* Relevant Development Collections */

2010-05-20T19:55:35Z

Relevant Development Collections

AndreasEhmann at 19:53, 20 May 2010

2010-05-20T19:53:55Z

AndreasEhmann at 19:50, 20 May 2010

2010-05-20T19:50:24Z

← Older revision		Revision as of 09:25, 5 June 2010
Line 121:		Line 121:

	* For the ISMIR 2004 Audio Description Contest, the Music Technology Group of the Pompeu Fabra University assembled a diverse set of audio segments and corresponding melody transcriptions including audio excerpts from such genres as Rock, R&B, Pop, Jazz, Opera, and MIDI. http://ismir2004.ismir.net/melody_contest/results.html (full test set with the reference transcriptions (28.6 MB))		* For the ISMIR 2004 Audio Description Contest, the Music Technology Group of the Pompeu Fabra University assembled a diverse set of audio segments and corresponding melody transcriptions including audio excerpts from such genres as Rock, R&B, Pop, Jazz, Opera, and MIDI. http://ismir2004.ismir.net/melody_contest/results.html (full test set with the reference transcriptions (28.6 MB))
		+
		+
		+	== Time and hardware limits ==
		+	Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be imposed.
		+
		+	A hard limit of 12 hours will be imposed on analysis times. Submissions exceeding this limit may not receive a result.
		+
		+	== Submission opening date ==
		+
		+	Friday 4th June 2010
		+
		+	== Submission closing date ==
		+	TBA

@@ Line 42: / Line 42: @@
 === Output File Format (Audio Melody Extraction) ===
-The Audio Melody Extraction output file format is a tab-delimited ASCII text format. Fundamental frequencies (in Hz) of the main melody are reported on a 10ms grid. If an algorithm estimates that there is no melody present within a given time frame it is to report a NEGATIVE frequency estimate. This allows the algorithm to still output a pitch estimate even if its voiced/unvoiced detection mechanism is incorrect. Therefore, pitch accuracy and segmentation performance can be evaluated separately. Estimating ZERO frequency is also acceptable. However, Pitch Accuracy performance will go down if the voiced/unvoiced detection of the algorithm is incorrect. If the algorithm performs no segmentation, it can report all positive fundamental frequencies (and the segmentation aspects of the evaluation ignored). If the time-stamp in the algorithm output is not on a 10ms time-grid, it will be resampled using 0th-order interpolation during evaluation. Therefore, we encourage the use of a 10ms frame hop-size. Each line of the output file should look like:
+The Audio Melody Extraction output file format is a tab-delimited ASCII text format. Fundamental frequencies (in Hz) of the main melody are reported on a 10ms time-grid. If an algorithm estimates that there is no melody present within a given time frame, it is to report a NEGATIVE frequency estimate. This allows the algorithm to still output a pitch estimate even if its voiced/unvoiced detection mechanism is incorrect. Therefore, pitch accuracy and segmentation performance can be evaluated separately. Estimating ZERO frequency is also acceptable. However, Pitch Accuracy performance will go down if the voiced/unvoiced detection of the algorithm is incorrect. If the algorithm performs no segmentation, it can report all positive fundamental frequencies (the segmentation aspects of the evaluation are then ignored). If the time-stamp in the algorithm output is not on a 10ms time-grid, it will be resampled using 0th-order interpolation during evaluation. Therefore, we encourage the use of a 10ms frame hop-size. Each line of the output file should look like:
   <timestamp (seconds)>\t<frequency (Hz)>\n
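
The resampling step mentioned above can be sketched as follows. This is only an illustration, not the evaluation code: it assumes "0th-order interpolation" means a zero-order hold (each grid point takes the most recent estimate at or before it), and the function name `resample_zero_order` and the small time tolerance are hypothetical choices. The actual evaluator may instead pick the nearest estimate.

```python
import bisect


def resample_zero_order(times, freqs, hop=0.01):
    """Resample (time, frequency) pairs onto a regular grid with spacing `hop`.

    Assumption: zero-order hold, i.e. each grid point takes the value of the
    last input frame whose timestamp is at or before the grid time.
    `times` must be sorted in ascending order.
    """
    if not times:
        return [], []
    n_frames = int(round(times[-1] / hop)) + 1
    grid = [round(i * hop, 6) for i in range(n_frames)]
    resampled = []
    for t in grid:
        # Index of the last input frame at or before t; the tiny tolerance
        # guards against floating-point noise in the timestamps.
        idx = bisect.bisect_right(times, t + 1e-9) - 1
        # Grid points before the first input frame reuse the first value.
        resampled.append(freqs[max(idx, 0)])
    return grid, resampled
```

For example, estimates at 0.0 s, 0.025 s and 0.05 s are spread over six 10 ms frames, with the sign (voiced/unvoiced flag) of each estimate carried along unchanged.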

@@ Line 42: / Line 42: @@
 === Output File Format (Audio Melody Extraction) ===
-The Audio Melody Extraction output file format is a tab-delimited ASCII text format. Fundamental frequencies of the main melody are reported on a 10ms grid. If an algorithm estimates that there is no melody present within a given time frame it is to report a NEGATIVE frequency estimate. This allows the algorithm to still output a pitch estimate even if its voiced/unvoiced detection mechanism is incorrect. Therefore, pitch accuracy and segmentation performance can be evaluated separately. Estimating ZERO frequency is also acceptable. However, Pitch Accuracy performance will go down if the voiced/unvoiced detection of the algorithm is incorrect. If the algorithm performs no segmentation, it can report all positive fundamental frequencies (and the segmentation aspects of the evaluation ignored). If the time-stamp in the algorithm output is not on a 10ms time-grid, it will be resampled using 0th-order interpolation during evaluation. Therefore, we encourage the use of a 10ms frame hop-size. Each line of the output file should look like:
+The Audio Melody Extraction output file format is a tab-delimited ASCII text format. Fundamental frequencies (in Hz) of the main melody are reported on a 10ms grid. If an algorithm estimates that there is no melody present within a given time frame it is to report a NEGATIVE frequency estimate. This allows the algorithm to still output a pitch estimate even if its voiced/unvoiced detection mechanism is incorrect. Therefore, pitch accuracy and segmentation performance can be evaluated separately. Estimating ZERO frequency is also acceptable. However, Pitch Accuracy performance will go down if the voiced/unvoiced detection of the algorithm is incorrect. If the algorithm performs no segmentation, it can report all positive fundamental frequencies (and the segmentation aspects of the evaluation ignored). If the time-stamp in the algorithm output is not on a 10ms time-grid, it will be resampled using 0th-order interpolation during evaluation. Therefore, we encourage the use of a 10ms frame hop-size. Each line of the output file should look like:
   <timestamp (seconds)>\t<frequency (Hz)>\n

@@ Line 83: / Line 83: @@
   matlab -r "foobar(.1,'%input','%output');quit;"
-== Evaluation procedures ==
+== Evaluation Procedures ==
 The task consists of two parts: Voicing detection (deciding whether a particular time frame contains a "melody pitch" or not), and pitch detection (deciding the most likely melody pitch for each time frame). We structured the submission to allow these parts to be done independently, i.e. it was possible (via a negative pitch value) to guess a pitch even for frames that were being judged unvoiced.

@@ Line 90: / Line 90: @@
                       unvx    vx    sum
                    ---------------
-Ground  unvoiced  |  TN   |  FP  |  GU
+ Ground unvoiced  |  TN   |  FP  |  GU
   Truth   voiced   |  FN   |  TP  |  GV
                    ---------------
            sum        DU      DV     TO
 TP ("true positives", frames where the voicing was correctly detected) further breaks down into pitch correct and pitch incorrect, say TP = TPC + TPI
 Similarly, the ability to record pitch guesses even for frames judged unvoiced breaks down FN ("false negatives", frames which were actually pitched but detected as unpitched) into pitch correct and pitch incorrect, say FN = FNC + FNI
 In both these cases, we can also count the number of times the chroma was correct, i.e. ignoring octave errors, say TP = TPCch + TPIch and FN = FNCch + FNIch.
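
As a rough illustration of the per-frame "pitch correct" and "chroma correct" decisions, the sketch below compares an estimate with the reference in cents. It assumes the ± ¼ tone tolerance (50 cents) given in this page's Raw Pitch Accuracy definition and uses the magnitude of the estimate, so that negative (unvoiced-flagged) guesses are still scored. The function names are hypothetical and this is not the official evaluation code.

```python
import math


def cents_diff(est_hz, ref_hz):
    """Signed distance from reference to estimate in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(abs(est_hz) / ref_hz)


def pitch_correct(est_hz, ref_hz, tol_cents=50.0):
    """True if the estimate is within a quarter tone of the reference pitch."""
    if est_hz == 0 or ref_hz <= 0:
        return False  # no pitch guess, or reference frame is unvoiced
    return abs(cents_diff(est_hz, ref_hz)) <= tol_cents


def chroma_correct(est_hz, ref_hz, tol_cents=50.0):
    """Like pitch_correct, but errors of whole octaves are ignored."""
    if est_hz == 0 or ref_hz <= 0:
        return False
    # Fold the difference into [-600, 600) cents so octave errors map to 0.
    folded = (cents_diff(est_hz, ref_hz) + 600.0) % 1200.0 - 600.0
    return abs(folded) <= tol_cents
```

An estimate of 880 Hz against a 440 Hz reference is therefore pitch-incorrect but chroma-correct, which is the case the TPCch and FNCch counts are meant to capture.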
-To assess the voicing detection portion, we use the standard tools of detection theory. Statistic A, Voicing Detection is the probability that a frame which is truly voiced is labeled as voiced i.e. TP/GV (also known as "hit rate").
-Statistic B, Voicing False Alarm, is the probability that a frame which is not actually voiced is none the less labeled as voiced i.e. FP/GU.
+To assess the voicing detection portion, we use the standard tools of detection theory.
-Statistic C, Voicing d-prime, is a measure of the sensitivity of the detector that attempts to factor out the overall bias towards labeling any frame as voiced (which can move both hit rate and false alarm rate up and down in tandem). It converts the hit rate and false alarm into standard deviations away from the mean of an equivalent Gaussian distribution, and reports the difference between them. A larger value indicates a detection scheme with better discrimination between the two classes.
-For the voicing detection, we pooled the frames from all excerpts to get an overall frame-level voicing detection performance. Because some excerpts had no unvoiced frames, averaging over the excerpts gave some misleading results.
+*'''Voicing Detection''' is the probability that a frame which is truly voiced is labeled as voiced i.e. TP/GV (also known as "hit rate").
-Now we move on to the actual pitch detection. Statistic D, Raw Pitch Accuracy is the probability of a correct pitch value (to within ± ¼ tone) given that the frame is indeed pitched. This includes the pitch guesses for frames that were judged unvoiced i.e. (TPC + FNC)/GV.
+*'''Voicing False Alarm''' is the probability that a frame which is not actually voiced is nonetheless labeled as voiced, i.e. FP/GU.
-Similarly, Statistic E, Raw Chroma Accuracy, is the probability that the chroma (i.e. the note name) is correct over the voiced frames. This ignores errors where the pitch is wrong by an exact multiple of an octave (octave errors). It is (TPCch + FNCch)/GV.
+*'''Voicing d-prime''' is a measure of the sensitivity of the detector that attempts to factor out the overall bias towards labeling any frame as voiced (which can move both hit rate and false alarm rate up and down in tandem). It converts the hit rate and false alarm rate into standard deviations away from the mean of an equivalent Gaussian distribution, and reports the difference between them. A larger value indicates a detection scheme with better discrimination between the two classes.
-Finally, Statistic F, Overall Accuracy, combines both the voicing detection and the pitch estimation to give the proportion of frames that were correctly labeled with both pitch and voicing, i.e. (TPC + TN)/TO.
-When averaging the pitch statistics, we calculated the performance for each of the excerpts individually, then report the average of these measures. This helps increase the effective weight of some of the minority genres, which had shorter excerpts.
+For the voicing detection, we pool the frames from all excerpts in a dataset to get an overall frame-level voicing detection performance. Because some excerpts had no unvoiced frames, averaging over the excerpts can give some misleading results.
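
The statistics defined in this section can be computed from the frame counts roughly as follows. This is a minimal sketch using the formulas as stated on this page (TP/GV, FP/GU, (TPC + FNC)/GV, (TPCch + FNCch)/GV, (TPC + TN)/TO). The function names are hypothetical, and returning NaN for d-prime when a rate is exactly 0 or 1 is an assumption of this sketch, since the page does not say how those cases are handled.

```python
from statistics import NormalDist


def voicing_stats(tp, fp, fn, tn):
    """Return (hit rate, false alarm rate, d-prime) from pooled frame counts."""
    gv = tp + fn  # ground-truth voiced frames
    gu = tn + fp  # ground-truth unvoiced frames
    hit = tp / gv if gv else float("nan")
    false_alarm = fp / gu if gu else float("nan")
    if 0.0 < hit < 1.0 and 0.0 < false_alarm < 1.0:
        # Convert each rate to a z-score of the standard normal distribution
        # and report the difference between them.
        z = NormalDist().inv_cdf
        d_prime = z(hit) - z(false_alarm)
    else:
        d_prime = float("nan")  # undefined at 0 or 1 (assumption of this sketch)
    return hit, false_alarm, d_prime


def pitch_stats(tpc, fnc, tpc_ch, fnc_ch, tn, gv, to):
    """Return (raw pitch accuracy, raw chroma accuracy, overall accuracy)."""
    raw_pitch = (tpc + fnc) / gv
    raw_chroma = (tpc_ch + fnc_ch) / gv
    overall = (tpc + tn) / to
    return raw_pitch, raw_chroma, overall
```

With 80 true positives, 10 false positives, 20 false negatives and 90 true negatives, the hit rate is 0.8, the false alarm rate is 0.1, and d-prime is about 2.12.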
 == Relevant Development Collections ==

@@ Line 85: / Line 85: @@
 == Evaluation procedures ==
-Describe the measures etc.
+The task consists of two parts: Voicing detection (deciding whether a particular time frame contains a "melody pitch" or not), and pitch detection (deciding the most likely melody pitch for each time frame). We structured the submission to allow these parts to be done independently, i.e. it was possible (via a negative pitch value) to guess a pitch even for frames that were being judged unvoiced.
+So consider a matrix of the per-frame voiced (Ground Truth or Detected values != 0) and unvoiced (GT, Det == 0) results, where the counts are:
+                      Detected
 == Relevant Development Collections ==

← Older revision		Revision as of 19:55, 20 May 2010
Line 94:		Line 94:
	* Graham's collection: you find the test set here and further explanations on the pages http://www.ee.columbia.edu/~graham/mirex_melody/ and http://labrosa.ee.columbia.edu/projects/melody/		* Graham's collection: you find the test set here and further explanations on the pages http://www.ee.columbia.edu/~graham/mirex_melody/ and http://labrosa.ee.columbia.edu/projects/melody/

−	* For the ISMIR 2004 Audio Description Contest, the Music Technology Group of the Pompeu Fabra University assembled a diverse set of audio segments and corresponding melody transcriptions including audio excerpts from such genres as Rock, R&B, Pop, Jazz, Opera, and MIDI.	+	* For the ISMIR 2004 Audio Description Contest, the Music Technology Group of the Pompeu Fabra University assembled a diverse set of audio segments and corresponding melody transcriptions including audio excerpts from such genres as Rock, R&B, Pop, Jazz, Opera, and MIDI. http://ismir2004.ismir.net/melody_contest/results.html (full test set with the reference transcriptions (28.6 MB))
−	http://ismir2004.ismir.net/melody_contest/results.html (full test set with the reference transcriptions (28.6 MB))

← Older revision	Revision as of 19:53, 20 May 2010
Line 1:	Line 1:
	+	== Description ==
	+
	+	The aim of the MIREX audio melody extraction evaluation is to identify the melody pitch contour from polyphonic musical audio. Pitch is expressed as the fundamental frequency of the main melodic voice, and is reported in a frame-based manner on an evenly-spaced time-grid.
	+
	+	The task consists of two parts:
	+	* Voicing detection (deciding whether a particular time frame contains a "melody pitch" or not),
	+	* pitch detection (deciding the most likely melody pitch for each time frame).
	+
	+	We structure the submission to allow these parts to be done independently within a single output file. That is, it is possible (via a negative pitch value) to guess a pitch even for frames that were being judged unvoiced. Algorithms which don't perform a discrimination between melodic and non-melodic parts are also welcome!
	+
	+
	+	== Data ==
	+
	+	=== Collections ===
	+	* MIREX09 database : 374 Karaoke recordings of Chinese songs. Each recording is mixed at three different levels of Signal-to-Accompaniment Ratio {-5dB, 0dB, +5 dB} for a total of 1122 audio clips. Instruments: singing voice (male, female), synthetic accompaniment.
	+	* MIREX08 database : 4 excerpts of 1 min. from "north Indian classical vocal performances", instruments: singing voice (male, female), tanpura (Indian instrument, perpetual background drone), harmonium (secondary melodic instrument) and tablas (pitched percussions). There are two different mixtures of each of the 4 excerpts with differing amounts of accompaniment for a total of 8 audio clips.
	+	* MIREX05 database : 25 phrase excerpts of 10-40 sec from the following genres: Rock, R&B, Pop, Jazz, Solo classical piano.
	+	* ADC04 database : Dataset from the 2004 Audio Description Contest. 20 excerpts of about 20s each.
	+	* manually annotated reference data (10 ms time grid)
	+
	+	=== Audio Formats ===
	+
	+	* CD-quality (PCM, 16-bit, 44100 Hz)
	+	* single channel (mono)
	+
	+	== Submission Format ==
	+
	+	Submissions to this task will have to conform to a specified format detailed below. Submissions should be packaged and contain at least two files: The algorithm itself and a README containing contact information and detailing, in full, the use of the algorithm.
	+
	+	=== Input Data ===
	+	Participating algorithms will have to read audio in the following format:
	+
	+	* Sample rate: 44.1 kHz
	+	* Sample size: 16 bit
	+	* Number of channels: 1 (mono)
	+	* Encoding: WAV
	+
	+	=== Output Data ===
	+
	+	The melody extraction algorithms will return the melody contour in an ASCII text file for each input .wav audio file. The specification of this output file is immediately below.
	+
	+	=== Output File Format (Audio Melody Extraction) ===
	+
	+	The Audio Melody Extraction output file format is a tab-delimited ASCII text format. Fundamental frequencies of the main melody are reported on a 10ms grid. If an algorithm estimates that there is no melody present within a given time frame it is to report a NEGATIVE frequency estimate. This allows the algorithm to still output a pitch estimate even if its voiced/unvoiced detection mechanism is incorrect. Therefore, pitch accuracy and segmentation performance can be evaluated separately. Estimating ZERO frequency is also acceptable. However, Pitch Accuracy performance will go down if the voiced/unvoiced detection of the algorithm is incorrect. If the algorithm performs no segmentation, it can report all positive fundamental frequencies (and the segmentation aspects of the evaluation ignored). If the time-stamp in the algorithm output is not on a 10ms time-grid, it will be resampled using 0th-order interpolation during evaluation. Therefore, we encourage the use of a 10ms frame hop-size. Each line of the output file should look like:
	+
	+	<timestamp (seconds)>\t<frequency (Hz)>\n
	+
	+	where \t denotes a tab, \n denotes the end of line. The < and > characters are not included. An example output file would look something like:
	+
	+	0.00 -439.3
	+	0.01 -439.4
	+	0.02 440.2
	+	0.03 440.3
	+	0.04 440.2
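
A submission might produce a file in this format along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the two-decimal timestamps and one-decimal frequencies simply mirror the example above, and a 10 ms hop is assumed. Frames the algorithm judges unvoiced are written with a negative sign so the pitch guess is still available to the evaluation.

```python
def format_melody_lines(f0_hz, voiced, hop=0.01):
    """Build '<timestamp>\\t<frequency>\\n' lines, one per analysis frame.

    f0_hz  : per-frame pitch estimates in Hz
    voiced : per-frame booleans, True where the algorithm detects melody
    """
    lines = []
    for i, (f0, is_voiced) in enumerate(zip(f0_hz, voiced)):
        # Unvoiced frames keep the pitch guess but carry a negative sign.
        value = abs(f0) if is_voiced else -abs(f0)
        lines.append(f"{i * hop:.2f}\t{value:.1f}\n")
    return lines


def write_melody_file(path, f0_hz, voiced, hop=0.01):
    """Write the formatted lines to `path` as plain ASCII text."""
    with open(path, "w", encoding="ascii", newline="") as fh:
        fh.writelines(format_melody_lines(f0_hz, voiced, hop))
```

Calling `write_melody_file("%output", [439.3, 439.4, 440.2], [False, False, True])` would write the first three lines of the example, with a tab between the two columns. A hop other than 10 ms would need more decimal places in the timestamp.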
	+
	+	=== Algorithm Calling Format ===
	+
	+	The submitted algorithm must take as arguments a SINGLE .wav file to perform the melody extraction on as well as the full output path and filename of the output file. The ability to specify the output path and file name is essential. Denoting the input .wav file path and name as %input and the output file path and name as %output, a program called foobar could be called from the command-line as follows:
	+
	+	foobar %input %output
	+	foobar -i %input -o %output
	+
	+	Moreover, if your submission takes additional parameters, foobar could be called like:
	+
	+	foobar .1 %input %output
	+	foobar -param1 .1 -i %input -o %output
	+
	+	If your submission is in MATLAB, it should be submitted as a function. Once again, the function must contain String inputs for the full path and names of the input and output files. Parameters could also be specified as input arguments of the function. For example:
	+
	+	foobar('%input','%output')
	+	foobar(.1,'%input','%output')
	+
	+	=== README File ===
	+
	+	A README file accompanying each submission should contain explicit instructions on how to run the program (as well as contact information, etc.). In particular, each command line to run should be specified, using %input for the input sound file and %output for the resulting text file.
	+
	+	For instance, to test the program foobar with a specific value for parameter param1, the README file would look like:
	+
	+	foobar -param1 .1 -i %input -o %output
	+
	+	For a submission using MATLAB, the README file could look like:
	+
	+	matlab -r "foobar(.1,'%input','%output');quit;"
	+
	+	== Evaluation procedures ==
	+
	+	Describe the measures etc.
	+

@@ Line 1: / Line 1: @@
-The task consists of two parts:
+== Relevant Development Collections ==
 * [http://unvoicedsoundseparation.googlepages.com/mir-1k MIR-1K]: [http://mirlab.org/dataset/public/MIR-1K_for_MIREX.rar MIR-1K for MIREX](Note that this is not the one used for evaluation. The MIREX 2009 dataset used for evaluation last year was created in the same way but has different content and singers).
@@ Line 112: / Line 7: @@
 * For the ISMIR 2004 Audio Description Contest, the Music Technology Group of the Pompeu Fabra University assembled a diverse set of audio segments and corresponding melody transcriptions including audio excerpts from such genres as Rock, R&B, Pop, Jazz, Opera, and MIDI. (full test set with the reference transcriptions (28.6 MB))