2018:Music and/or Speech Detection
Description
The need for music and/or speech detection is evident in many audio processing tasks that deal with real-life materials such as archives of field recordings, broadcasts and any other contexts likely to involve speech and music, concurrent or alternating. Segregating the signal into speech and music segments is an obvious first step before applying speech-specific or music-specific algorithms. Lately, the broadcast-monitoring industry for copyright management has become increasingly interested not only in detecting the presence of music but also in estimating whether it appears in the foreground (as the main focus of attention) or in the background.
Indeed, music and/or speech detection has received considerable attention from the research community, but many of the published algorithms are dataset-specific and are not directly comparable due to non-standardized evaluation.
This MIREX task aims to fill that gap and consists of four sub-tasks: Music Detection, Speech Detection, Music and Speech Detection, and Music Relative Loudness Estimation, with submissions welcome to one or more of them.
Tasks
Music Detection
The music detection sub-task consists of finding segments of music in a signal. This task applies to complete recordings from archives. No assumptions are made about the number of segments present in each recording or about their duration.
classes: music
Speech Detection
The speech detection sub-task consists of finding segments of speech in a signal. This task applies to complete recordings from archives. No assumptions are made about the number of segments present in each recording or about their duration.
classes: speech
Music and Speech Detection
The music and speech detection sub-task is a combination of the previous two sub-tasks, i.e., the submitted algorithms have to find segments of both music and speech. No assumptions are made about the number of segments present in each recording or about their duration. Moreover, music and speech segments may overlap in time.
classes: music, speech
Music Relative Loudness Estimation
The music relative loudness estimation sub-task consists of finding segments of one of the following two classes: foreground music and background music. This task applies to complete recordings from archives. No assumptions are made about the number of segments present in each recording or about their duration.
classes: fg-music, bg-music
Datasets
Available Training Datasets
These resources may be a good starting point for participants.
GTZAN Speech and Music Dataset http://opihi.cs.uvic.ca/sound/music_speech.tar.gz
Scheirer & Slaney Music Speech Corpus http://www.ee.columbia.edu/~dpwe/sounds/musp/scheislan.html
MUSAN Corpus http://www.openslr.org/17/
Muspeak Speech and Music Detection Dataset http://mirg.city.ac.uk/datasets/muspeak/muspeak-mirex2015-detection-examples.zip
Music detection dataset: www.seyerlehner.info/download/music_detection_dataset_dafx_07.zip (Ask the author for the password)
Evaluation Dataset
Content
Evaluation dataset 1: it consists of 27 hours of audio from 8 different TV program types from France, Germany, Spain and the United Kingdom. It includes 1647 1-minute files sampled at 22,050 or 48,000 Hz with 16 bits per sample. Around 50% of the audio contains music, of which 70% is background music (35% of the total audio content). This background music can have a very low volume and be covered by any kind of non-musical sounds such as speech, audience noises, sound effects, everyday-life sounds, sounds of the city, etc. The annotation style imposes a minimum event duration: 2 seconds for music and background music, and 1 second for speech.
Evaluation dataset 2: it consists of 10 hours of audio corresponding to French TV and radio programs, provided by INA (French National Institute of Audiovisual). These include archives collected from 1950 to the present. The files were sampled at 16,000 Hz with 16 bits per sample. The whole dataset will be used for evaluation. It is intended to be used with pretrained models only.
Annotation
Evaluation dataset 1: it was manually annotated by a single annotator using BAT. A percentage of the annotations was manually reviewed. The classes included in the ground truth are: foreground music, background music and no music. We define foreground music as music that is louder than all other simultaneous sounds.
Evaluation dataset 2: it was manually annotated using Transcriber. The annotation was done in the framework of the European Funded project MeMAD. It contains two possibly overlapping classes: speech and music.
Evaluation
The literature offers two ways of measuring the performance of an algorithm, depending on how the ground truth is compared with an algorithm's estimation: segment-level evaluation and event-level evaluation. For each algorithm and dataset, we will report the statistics of both evaluations per file and for the whole dataset.
Segment-level evaluation:
In the segment-level evaluation, we compare the estimation (est) produced by the algorithms with the reference (ref) in segments of 10 ms. We first compute the intermediate statistics for each class C (a minimal sketch of this computation follows the definitions below), which include:
- True Positives (TPc): ref segment’s class = C & est segment’s class = C
- False Positives (FPc): ref segment’s class != C & est segment’s class = C
- True Negatives (TNc): ref segment’s class != C & est segment’s class != C
- False Negatives (FNc): ref segment’s class = C & est segment’s class != C
Then we report class-wise Precision, Recall and F-measure.
- Precision (Pc) = TPc / (TPc + FPc)
- Recall (Rc) = TPc / (TPc + FNc)
- F-measure (Fc) = 2 * Pc * Rc / (Pc + Rc)
As well as the overall Accuracy:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where:
- TP = sum(TPc), for every class c
- FP = sum(FPc), for every class c
- TN = sum(TNc), for every class c
- FN = sum(FNc), for every class c
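As an illustration, here is a minimal Python sketch of the segment-level computation. It assumes the reference and the estimation have already been rasterized into one set of class labels per 10 ms segment (set-valued so that overlapping classes such as music and speech can be represented); the function name and data layout are our own illustration, not part of the official evaluation code.

 from collections import defaultdict

 def segment_metrics(ref, est, classes):
     # ref, est: lists of sets of class labels, one set per 10 ms segment.
     stats = {c: defaultdict(int) for c in classes}
     for r, e in zip(ref, est):
         for c in classes:
             if c in r and c in e:
                 stats[c]["TP"] += 1
             elif c in e:              # c not in ref, c in est
                 stats[c]["FP"] += 1
             elif c in r:              # c in ref, c not in est
                 stats[c]["FN"] += 1
             else:                     # c in neither
                 stats[c]["TN"] += 1
     results = {}
     for c in classes:
         s = stats[c]
         p = s["TP"] / (s["TP"] + s["FP"]) if s["TP"] + s["FP"] else 0.0
         r = s["TP"] / (s["TP"] + s["FN"]) if s["TP"] + s["FN"] else 0.0
         f = 2 * p * r / (p + r) if p + r else 0.0
         results[c] = {"P": p, "R": r, "F": f}
     # Overall accuracy from the summed intermediate statistics.
     tp = sum(stats[c]["TP"] for c in classes)
     tn = sum(stats[c]["TN"] for c in classes)
     fp = sum(stats[c]["FP"] for c in classes)
     fn = sum(stats[c]["FN"] for c in classes)
     results["accuracy"] = (tp + tn) / (tp + tn + fp + fn)
     return results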
Event-level evaluation:
In the event-level evaluation, we compare the estimation (est) produced by the algorithms with the reference (ref) in terms of events. Each annotated segment of the ground truth is considered an event. We first compute the intermediate statistics for the onsets and offsets of each class C (a matching sketch follows at the end of this subsection), which include:
- True Positives (TPc): an est event of class = C that starts and ends at the same temporal positions as a ref event of class = C, taking into account a tolerance time-window.
- False Positives (FPc): an est event of class = C that starts and ends at temporal positions where no ref event of class = C does, taking into account a tolerance time-window.
- False Negatives (FNc): a ref event of class = C that starts and ends at temporal positions where no est event of class = C does, taking into account a tolerance time-window.
Then we report class-wise Precision, Recall, F-measure, Deletion Rate, Insertion Rate and Error Rate.
- Precision (Pc) = TPc / (TPc + FPc)
- Recall (Rc) = TPc / (TPc + FNc)
- F-measure (Fc) = 2 * Pc * Rc / (Pc + Rc)
- Deletion Rate (Dc) = FNc / Nc
- Insertion Rate (Ic) = FPc / Nc
- Error Rate (Ec) = Dc + Ic
Where:
- Nc is the number of ref events of class = C.
We also report the overall version of these statistics:
- Precision (P) = TP / (TP + FP)
- Recall (R) = TP / (TP + FN)
- F-measure (F) = 2 * P * R / (P + R)
- Deletion Rate (D) = FN / N
- Insertion Rate (I) = FP / N
- Error Rate (E) = D + I
Where:
- TP = sum(TPc), for every class c
- FP = sum(FPc), for every class c
- TN = sum(TNc), for every class c
- FN = sum(FNc), for every class c
- N is the number of ref events.
Different tolerance time-windows will be used: +/- 1000 ms, +/- 500 ms, +/- 200 ms, +/- 100 ms.
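As an illustration, here is a minimal Python sketch of event-level matching for a single class. It assumes events are (onset, offset) pairs in seconds and uses a greedy one-to-one matching strategy; the exact matching procedure used by the official evaluation may differ.

 def match_events(ref_events, est_events, tol=0.5):
     # ref_events, est_events: lists of (onset, offset) pairs in seconds,
     # all of the same class. tol is the tolerance window in seconds.
     matched = set()
     tp = 0
     for on, off in est_events:
         for i, (r_on, r_off) in enumerate(ref_events):
             if (i not in matched
                     and abs(on - r_on) <= tol and abs(off - r_off) <= tol):
                 matched.add(i)        # each ref event matches at most once
                 tp += 1
                 break
     fp = len(est_events) - tp         # est events with no matching ref event
     fn = len(ref_events) - tp         # ref events with no matching est event
     n = len(ref_events)
     d = fn / n if n else 0.0          # deletion rate
     ins = fp / n if n else 0.0        # insertion rate
     return {"TP": tp, "FP": fp, "FN": fn, "D": d, "I": ins, "E": d + ins}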
Other evaluated features
The execution time of each algorithm will also be reported.
Submission Format
Command line calling format
Submissions must conform to the format specified below:
Music Detection: doMusicDetection path/to/file.wav path/to/output/file.mud
Speech Detection: doSpeechDetection path/to/file.wav path/to/output/file.spd
Music and Speech Detection: doMusicAndSpeechDetection path/to/file.wav path/to/output/file.muspd
Music Relative Loudness Estimation: doMusicRelLoudEstimation path/to/file.wav path/to/output/file.mrle
where:
- path/to/file.wav: Path to the input audio file.
- path/to/output/file.*: The output file.
Programs can use their working directory if they need to keep temporary cache files or internal debugging info. Stdout and stderr will be logged.
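For illustration, here is a minimal Python entry point that conforms to this calling format and writes the I/O format described below; detect() is a hypothetical placeholder for the actual algorithm, not a required API.

 import sys

 def detect(wav_path):
     # Hypothetical placeholder: run the actual detection algorithm here and
     # return a list of (onset_seconds, offset_seconds, class_label) triples.
     return []

 def main():
     in_wav, out_path = sys.argv[1], sys.argv[2]
     with open(out_path, "w") as f:
         for onset, offset, label in sorted(detect(in_wav)):
             f.write("%.3f\t%.3f\t%s\n" % (onset, offset, label))

 if __name__ == "__main__":
     main()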
I/O format
For each detected segment, the output file should include a row containing the onset (seconds), the offset (seconds) and the class, separated by tabs. Rows should be ordered by onset time:
onset1	offset1	class1
onset2	offset2	class2
...	...	...
(note that, in the case of music and speech detection, events can overlap)
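For example, a hypothetical output file for the music and speech detection sub-task could look as follows (all times are illustrative only; note the overlap between the speech segment and the first music segment):

 0.000	15.300	music
 10.250	42.810	speech
 42.810	60.000	music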
Packaging submissions
All submissions should be statically linked to all libraries (the presence of dynamically linked libraries cannot be guaranteed) and include a README file with the following information:
- Command line calling format for all executables and an example formatted set of commands
- Number of threads/cores used or whether this should be specified on the command line
- Expected memory footprint
- Expected runtime
- Any required environments (and versions), e.g. python, java, bash, matlab.
Potential Participants
name/email
Jan Schlüter, jan.schlueter ... ofai.at
David Doukhan, ddoukhan … ina.fr
Blai Meléndez-Catalán, bmelendez … bmat.com
Time and hardware limits
Due to the potentially high number of participants in this and other audio tasks, a hard limit of 72 hours will be imposed on the runtime of each submission. Submissions that exceed this limit may not receive a result.
Submission closing date
The submission deadline for these tasks is the 11th of August.
Task specific mailing list
All discussions on this task will take place on the MIREX "EvalFest" list. If you have a question or comment, simply include the task name in the subject heading.