2018:Music and/or Speech Detection
Description
The need for music and/or speech detection is evident in many audio processing tasks that deal with real-life material, such as archives of field recordings, broadcasts, and any other content likely to contain speech and music, whether concurrent or alternating. Segregating the signal into speech and music segments is an obvious first step before applying speech-specific or music-specific algorithms.
Indeed, music and/or speech detection has received considerable attention from the research community (for a partial list, see the references below), but many of the published algorithms are dataset-specific and not directly comparable due to non-standardised evaluation.
This MIREX task aims to fill that gap. It consists of three sub-tasks: Music Detection, Speech Detection, and Music and Speech Detection, with submissions welcome to one or more of them.
Tasks
Music Detection
The music detection task consists in finding segments of music in a signal. This task applies to complete recordings from archives. No assumptions are made about the number of segments present in each archive or about their duration.
Speech Detection
The speech detection task consists in finding segments of speech in a signal. This task applies to complete recordings from archives. No assumptions are made about the number of segments present in each archive or about their duration.
Music and Speech Detection
The music and speech detection sub-task is a combination of the previous two sub-tasks, i.e., the submitted algorithms will have to find segments of both music and speech. No assumptions are made about the number of segments present in each archive or about their duration. Moreover, music and speech segments may overlap in time.
Datasets
Content
Dataset 1: it consists of 27 hours of audio from 8 different TV program types from France, Germany, Spain and the United Kingdom. It includes 1647 one-minute files at 22050 Hz and 16 bits per sample.
Annotation
Dataset 1: it was annotated by a freelancer using BAT. A percentage of the annotations has been reviewed manually.
Evaluation
In the literature we find two ways of measuring the performance of an algorithm, depending on how the ground truth is compared with the algorithm's estimation: in the frame-level approach the comparison is made over short time segments, while in the event-level approach every segment is treated as an event.
Frame-level evaluation:
Frame-based evaluation will be carried out on 10 ms segments. Precision (the fraction of retrieved frames that are correct), Recall (the fraction of ground-truth frames that are retrieved), and F-measure will be reported.
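As an illustration, frame-level scoring could be sketched as follows in Python; the (onset, duration) segment representation and the rasterisation onto a 10 ms grid are assumptions, and this is not the official MIREX scoring code.

 import numpy as np
 
 FRAME = 0.010  # 10 ms evaluation frames
 
 def to_frames(segments, total_duration):
     """Rasterise (onset, duration) pairs, in seconds, onto a boolean frame grid."""
     n = int(round(total_duration / FRAME))
     active = np.zeros(n, dtype=bool)
     for onset, duration in segments:
         active[int(round(onset / FRAME)):int(round((onset + duration) / FRAME))] = True
     return active
 
 def frame_prf(reference, estimate, total_duration):
     """Frame-level Precision, Recall and F-measure for a single class."""
     ref = to_frames(reference, total_duration)
     est = to_frames(estimate, total_duration)
     tp = float(np.sum(ref & est))
     precision = tp / max(est.sum(), 1)
     recall = tp / max(ref.sum(), 1)
     f_measure = 2 * precision * recall / max(precision + recall, 1e-12)
     return precision, recall, f_measure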
Event-level evaluation:
Events will be evaluated on an onset-only basis as well as on an onset-offset basis, again using Precision, Recall, and F-measure. In the former, a ground-truth segment is considered correctly detected if (1) the system identifies the right class and (2) the detected segment's onset is within a 1000 ms range (+/- 500 ms) of the onset of the ground-truth segment. In the latter (onset-offset), a ground-truth segment is considered correctly detected if (1) the system identifies the right class, (2) the detected segment's onset is within +/- 500 ms of the onset of the ground-truth segment, and (3) the detected segment's offset is either within +/- 500 ms of the offset of the ground-truth segment or within 20% of the ground-truth segment's length. Results will also be included using smaller onset/offset tolerances (+/- 100 ms, +/- 200 ms). Different statistics can also be reported if agreed by the participants.
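A minimal sketch of the event-level matching described above, assuming greedy one-to-one pairing (the pairing strategy is not specified by the task; the tolerances follow the text):

 def event_prf(reference, estimate, tol=0.5, require_offset=False):
     """Event-level Precision/Recall/F over (onset, duration, class) tuples.
 
     A ground-truth event counts as detected if an unused estimate of the
     same class has its onset within +/- tol seconds and, when
     require_offset is True, its offset within +/- tol seconds or within
     20% of the ground-truth segment's length.
     """
     used = set()
     tp = 0
     for r_on, r_dur, r_cls in reference:
         for i, (e_on, e_dur, e_cls) in enumerate(estimate):
             if i in used or e_cls != r_cls or abs(e_on - r_on) > tol:
                 continue
             if require_offset:
                 offset_error = abs((e_on + e_dur) - (r_on + r_dur))
                 if offset_error > max(tol, 0.2 * r_dur):
                     continue
             used.add(i)
             tp += 1
             break
     precision = tp / max(len(estimate), 1)
     recall = tp / max(len(reference), 1)
     f_measure = 2 * precision * recall / max(precision + recall, 1e-12)
     return precision, recall, f_measure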
Submission Format
Command line calling format
Submissions must conform to the command-line formats specified below:
Music Detection: doMusicDetection path/to/file.wav path/to/output/file.mud
Speech Detection: doSpeechDetection path/to/file.wav path/to/output/file.spd
Music and Speech Detection: doMusicAndSpeechDetection path/to/file.wav path/to/output/file.muspd
where:
- path/to/file.wav: Path to the input audio file.
- path/to/output/file.*: The output file.
Programs can use their working directory if they need to keep temporary cache files or internal debugging info. Stdout and stderr will be logged.
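For illustration only, a submission's doMusicDetection entry point might be a thin wrapper along these lines; detect_music() is a hypothetical placeholder, not part of the task specification:

 #!/usr/bin/env python
 # Hypothetical doMusicDetection wrapper: parses the two command-line
 # arguments and writes one tab-separated row per detected music segment.
 import sys
 
 def detect_music(wav_path):
     """Placeholder detector; a real submission would analyse the audio."""
     return []  # list of (onset_seconds, duration_seconds) pairs
 
 def main():
     in_wav, out_path = sys.argv[1], sys.argv[2]
     with open(out_path, "w") as f:
         for onset, duration in sorted(detect_music(in_wav)):
             f.write("%.3f\t%.3f\tm\n" % (onset, duration))
 
 if __name__ == "__main__":
     main()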
I/O format
For each detected segment, the file should include a row containing the onset (seconds), duration (seconds) and the class (represented by lower-case 'm' or 's') separated by a tab and ordered by onset time:
onset1	duration1	class1
onset2	duration2	class2
...	...	...
(note that events in the case of music and speech detection can overlap)
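A small sketch of reading this format back, e.g. to sanity-check a submission's output before packaging (illustrative only):

 def read_segments(path):
     """Parse tab-separated (onset, duration, class) rows from an output file."""
     segments = []
     with open(path) as f:
         for line in f:
             line = line.strip()
             if not line:
                 continue
             onset, duration, cls = line.split("\t")
             segments.append((float(onset), float(duration), cls))
     return segments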
Packaging submissions
All submissions should be statically linked to all libraries (the presence of dynamically linked libraries cannot be guaranteed) and include a README file with the following information:
- Command line calling format for all executables and an example formatted set of commands
- Number of threads/cores used or whether this should be specified on the command line
- Expected memory footprint
- Expected runtime
- Any required environments (and versions), e.g. python, java, bash, matlab.
Potential Participants
name/email
Jan Schlüter, jan.schlueter ... ofai.at
David Doukhan, david.doukhan … gmail.com
Blai Meléndez-Catalán, bmelendez … bmat.com
Time and hardware limits
Due to the potentially high number of participants in this and other audio tasks, a hard limit of 72 hours is imposed on the runtime of each submission. Submissions that exceed this limit may not receive a result.
Submission closing date
The submission deadline for this task is the 11th of August.
Task specific mailing list
All discussions on this task will take place on the MIREX "EvalFest" list. If you have a question or comment, simply include the task name in the subject heading.