==Description==

Music detection refers to the task of finding music segments in an audio file. The two main applications of music detection algorithms are (1) the automatic indexing and retrieval of auditory information based on its audio content, and (2) the monitoring of music for copyright management. Additionally, music detection can be applied as an intermediate step to improve the performance of algorithms designed for other purposes.

Regarding the application of music detection algorithms to copyright management, the industry has recently become increasingly interested not only in detecting the presence of music, but also in estimating whether it appears in the foreground (as the main focus of attention) or in the background. In this scenario, the music detection task falls short, as we also need to estimate the loudness of music in relation to other simultaneous non-music sounds, i.e., its relative loudness. This is why we propose a second task, which we name Music Relative Loudness Estimation: the task of finding music segments in an audio file and classifying them into foreground or background music.

==Tasks==

===Music Detection===

The music detection sub-task consists of finding segments of music in a signal. This task applies to complete recordings from archives. No assumptions are made about the number of segments present in each recording or about their duration.

Classes: music (and non-music)

===Music Relative Loudness Estimation===

The music relative loudness estimation sub-task consists of finding segments of one of the following two classes: foreground music and background music. This task applies to complete recordings from archives. No assumptions are made about the number of segments present in each recording or about their duration.

Classes: fg-music, bg-music (and non-music)

==Datasets==

===Available Training Datasets===

These resources may be a good starting point for participants:

* GTZAN Speech and Music Dataset: http://opihi.cs.uvic.ca/sound/music_speech.tar.gz
* Scheirer & Slaney Music Speech Corpus: http://www.ee.columbia.edu/~dpwe/sounds/musp/scheislan.html
* MUSAN Corpus: http://www.openslr.org/17/
* Muspeak Speech and Music Detection Dataset: http://mirg.city.ac.uk/datasets/muspeak/muspeak-mirex2015-detection-examples.zip
* Music Detection Dataset: www.seyerlehner.info/download/music_detection_dataset_dafx_07.zip (ask the author for the password)
* Open Broadcast Media Audio from TV (OpenBMAT): https://zenodo.org/record/3381249

===Evaluation Dataset===

====Content====

The evaluation dataset consists of 2987 one-minute stereo excerpts at 22050 Hz, extracted from programs from France (753), Germany (760), Spain (723) and the United States (751).

====Annotation====

The evaluation dataset has been cross-annotated by 3 annotators using a 6-class taxonomy: ''Music'', ''Foreground Music'', ''Similar'', ''Background Music'', ''Low Background Music'' and ''No Music'', as done in the OpenBMAT dataset, which can be used for training.

==Evaluation==

In the literature we find two ways of measuring the performance of an algorithm, depending on how the ground truth is compared with the algorithm's estimation: segment-level evaluation and event-level evaluation. We will report the statistics for both evaluations, per file and for the whole dataset, for each algorithm and dataset.

===Segment-level evaluation===

In the segment-level evaluation, we compare the estimation (est) produced by the algorithms with the reference (ref) in segments of 10 ms. We first compute the intermediate statistics for each class C, which include:

* True Positives (TPc): ref segment’s class = C & est segment’s class = C
* False Positives (FPc): ref segment’s class != C & est segment’s class = C
* True Negatives (TNc): ref segment’s class != C & est segment’s class != C
* False Negatives (FNc): ref segment’s class = C & est segment’s class != C

Then we report class-wise Precision, Recall and F-measure:

* Precision (Pc) = TPc / (TPc + FPc)
* Recall (Rc) = TPc / (TPc + FNc)
* F-measure (Fc) = 2 * Pc * Rc / (Pc + Rc)

As well as the overall Accuracy:

* Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where:

* TP = sum(TPc), for every class c
* FP = sum(FPc), for every class c
* TN = sum(TNc), for every class c
* FN = sum(FNc), for every class c
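
To make the computation concrete, here is a minimal sketch in Python of the segment-level statistics, assuming the reference and the estimation have already been converted into equal-length lists of per-10 ms class labels (the function and variable names are illustrative, not part of the official evaluation code):

<pre>
from collections import Counter

def segment_level_metrics(ref, est, classes):
    """Class-wise Precision/Recall/F-measure and overall Accuracy from two
    equal-length sequences of per-10 ms class labels (illustrative sketch)."""
    assert len(ref) == len(est)
    counts = {c: Counter() for c in classes}
    for r, e in zip(ref, est):
        for c in classes:
            if r == c and e == c:
                counts[c]["TP"] += 1
            elif r != c and e == c:
                counts[c]["FP"] += 1
            elif r == c and e != c:
                counts[c]["FN"] += 1
            else:
                counts[c]["TN"] += 1
    results = {}
    for c in classes:
        tp, fp, fn = counts[c]["TP"], counts[c]["FP"], counts[c]["FN"]
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        results[c] = {"P": p, "R": r, "F": f}
    # Overall Accuracy: sum the intermediate statistics over all classes.
    tp = sum(counts[c]["TP"] for c in classes)
    tn = sum(counts[c]["TN"] for c in classes)
    fp = sum(counts[c]["FP"] for c in classes)
    fn = sum(counts[c]["FN"] for c in classes)
    results["Accuracy"] = (tp + tn) / (tp + tn + fp + fn)
    return results

# Example for the music detection sub-task:
# segment_level_metrics(ref_labels, est_labels, classes=("music", "non-music"))
</pre>
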
===Event-level evaluation===

In the event-level evaluation, we compare the estimation (est) produced by the algorithms with the reference (ref) in terms of events. Each annotated segment of the ground truth is considered an event. We first compute the intermediate statistics for the onsets and offsets of each class C, which include:

* True Positives (TPc): an est event of class = C that starts and ends at the same temporal positions as a ref event of class = C, taking into account a tolerance time-window.
* False Positives (FPc): an est event of class = C that starts and ends at temporal positions where no ref event of class = C does, taking into account a tolerance time-window.
* False Negatives (FNc): a ref event of class = C that starts and ends at temporal positions where no est event of class = C does, taking into account a tolerance time-window.

Then we report class-wise Precision, Recall, F-measure, Deletion Rate, Insertion Rate and Error Rate:

* Precision (Pc) = TPc / (TPc + FPc)
* Recall (Rc) = TPc / (TPc + FNc)
* F-measure (Fc) = 2 * Pc * Rc / (Pc + Rc)
* Deletion Rate (Dc) = FNc / Nc
* Insertion Rate (Ic) = FPc / Nc
* Error Rate (Ec) = Dc + Ic

Where:

* Nc is the number of ref events of class = C.

We also report the overall version of these statistics:

* Precision (P) = TP / (TP + FP)
* Recall (R) = TP / (TP + FN)
* F-measure (F) = 2 * P * R / (P + R)
* Deletion Rate (D) = FN / N
* Insertion Rate (I) = FP / N
* Error Rate (E) = D + I

Where:

* TP = sum(TPc), for every class c
* FP = sum(FPc), for every class c
* FN = sum(FNc), for every class c
* N is the number of ref events.

Different tolerance time-windows will be used: +/- 1000 ms, +/- 500 ms, +/- 200 ms and +/- 100 ms.
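
As an illustration, the class-wise event matching could be sketched as follows, assuming events are (onset, offset, class) tuples in seconds and that each reference event can be matched by at most one estimated event (these assumptions, and all names below, are ours, not part of the official evaluation code):

<pre>
def event_level_metrics(ref_events, est_events, cls, tolerance=0.5):
    """Event-level statistics for one class, with a +/- tolerance window
    (in seconds) applied to both onsets and offsets. Illustrative sketch."""
    ref = [ev for ev in ref_events if ev[2] == cls]
    est = [ev for ev in est_events if ev[2] == cls]
    matched = set()
    tp = 0
    for on, off, _ in est:
        for i, (r_on, r_off, _) in enumerate(ref):
            if i in matched:
                continue
            if abs(on - r_on) <= tolerance and abs(off - r_off) <= tolerance:
                matched.add(i)
                tp += 1
                break
    fp = len(est) - tp          # est events with no matching ref event
    fn = len(ref) - tp          # ref events with no matching est event
    n = len(ref)                # Nc: number of ref events of this class
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    d = fn / n if n else 0.0    # Deletion Rate
    ins = fp / n if n else 0.0  # Insertion Rate
    return {"P": p, "R": r, "F": f, "D": d, "I": ins, "E": d + ins}

# Example: statistics for background music with a +/- 500 ms tolerance:
# event_level_metrics(ref_events, est_events, cls="bg-music", tolerance=0.5)
</pre>
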
===Other evaluated features===

The execution time of each algorithm will also be reported.

==Submission Format==

===Command line calling format===

Submissions must conform to the calling formats specified below:

Music Detection: ''doMusicDetection path/to/file.wav path/to/output/file.mud''

Music Relative Loudness Estimation: ''doMusicRelLoudEstimation path/to/file.wav path/to/output/file.mrle''

where:
* path/to/file.wav: Path to the input audio file.
* path/to/output/file.*: Path to the output file.

Programs can use their working directory if they need to keep temporary cache files or internal debugging info. Stdout and stderr will be logged.

===I/O format===

For each detected segment, the output file should include a row containing the onset (seconds), the offset (seconds) and the class, separated by tabs. Rows should be ordered by onset time:

''onset1    offset1    class1''
''onset2    offset2    class2''
''...    ...    ...''

(note that events can overlap, as in the case of music and speech detection)
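
For illustration, a submission could combine the calling format and the I/O format above roughly as follows; the detector itself is only a placeholder, and everything else (function names, formatting precision) is an assumption rather than a requirement:

<pre>
#!/usr/bin/env python3
# Hypothetical skeleton, invoked e.g. as:
#   doMusicDetection path/to/file.wav path/to/output/file.mud
import sys

def detect_music(audio_path):
    """Placeholder detector: should return a list of (onset, offset, class)
    tuples in seconds, e.g. [(0.0, 12.3, "music"), (12.3, 60.0, "non-music")]."""
    raise NotImplementedError

def write_segments(segments, output_path):
    # One tab-separated row per segment, ordered by onset time.
    with open(output_path, "w") as out:
        for onset, offset, label in sorted(segments):
            out.write("{:.3f}\t{:.3f}\t{}\n".format(onset, offset, label))

if __name__ == "__main__":
    audio_path, output_path = sys.argv[1], sys.argv[2]
    write_segments(detect_music(audio_path), output_path)
</pre>
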

===Packaging submissions===

All submissions should be statically linked to all libraries (the presence of dynamically linked libraries cannot be guaranteed) and include a README file containing the following information:
* Command line calling format for all executables and an example formatted set of commands
* Number of threads/cores used, or whether this should be specified on the command line
* Expected memory footprint
* Expected runtime
* Any required environments (and versions), e.g. python, java, bash, matlab

== Potential Participants ==

name/email

Blai Meléndez-Catalán, bmelendez … bmat.com

----

==Time and hardware limits==

Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions are specified: a hard limit of 72 hours will be imposed on each run. Submissions that exceed this runtime may not receive a result.

==Submission closing date==

September 30th 2019

==Task specific mailing list==

All discussions on this task will take place on the MIREX [https://mail.lis.illinois.edu/mailman/listinfo/evalfest "EvalFest" list]. If you have a question or comment, simply include the task name in the subject heading.
