Latest revision as of 14:35, 12 March 2022

Description

This page describes the MIREX 2021 Automatic Lyrics Transcription challenge. For the evaluation procedure and the submission format, please scroll down the page.

The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:

 Prediction(w) = argmax P(w|X)

where w is the word sequence and X denotes the acoustic features.

Ideally, the lyrics transcriber should return meaningful word sequences:

 Prediction(w)  = [ <w_1>, <w_2>, ..., <w_N> ]
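As a toy illustration of the argmax above, the sketch below picks the highest-scoring hypothesis from a small set of candidates. The candidate sequences and their scores are made up for illustration; a real transcriber searches a far larger space using acoustic and language models.

```python
# Hypothetical candidate word sequences with made-up scores standing in for P(w|X).
candidates = {
    ("twinkle", "twinkle", "little", "star"): 0.62,
    ("twinkle", "twinkle", "little", "scar"): 0.27,
    ("wrinkle", "twinkle", "little", "star"): 0.11,
}


def predict(scores):
    """Return the word sequence w with the highest score, i.e. argmax P(w|X)."""
    return list(max(scores, key=scores.get))


prediction = predict(candidates)
print(" ".join(prediction))
```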

The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.

Evaluation

Word Error Rate (WER) : the standard metric used in Automatic Speech Recognition.

 WER = (S + I + D) / (C + S + D)

where:

C : correctly predicted words
S : substitution errors
I : insertion errors
D : deletion errors
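The counts above can be obtained from a minimum edit-distance alignment between the reference and the predicted word lists. The following is a minimal sketch of that idea, not the official MIREX scoring code; the function names `wer_counts` and `wer` are hypothetical, and a non-empty reference is assumed.

```python
def wer_counts(reference, hypothesis):
    """Align two word lists and return the counts (C, S, I, D)."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = minimum number of edits turning reference[:i] into hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back through the table to count each kind of edit.
    c = s = ins = d = 0
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and reference[i - 1] == hypothesis[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (not same):
            if same:
                c += 1   # correctly predicted word
            else:
                s += 1   # substitution
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            d += 1       # deletion: reference word missing from the prediction
            i -= 1
        else:
            ins += 1     # insertion: extra word in the prediction
            j -= 1
    return c, s, ins, d


def wer(reference, hypothesis):
    """WER = (S + I + D) / (C + S + D); assumes a non-empty reference."""
    c, s, ins, d = wer_counts(reference, hypothesis)
    return (s + ins + d) / (c + s + d)


ref = "twinkle twinkle little star how i wonder".split()
hyp = "twinkle little star how how i wander".split()
counts = wer_counts(ref, hyp)
score = wer(ref, hyp)
```

In this example one word is deleted, one is inserted and one is substituted, so the WER is 3 divided by the 7 reference words.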

Character Error Rate (CER) : the above computation can also be done at the character level. This metric penalises partially correct or misspelled words less than WER does.
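To make the difference between the two metrics concrete, here is a small sketch that applies the same edit-distance computation once to word lists and once to character strings. It assumes that spaces count as characters, which the task description does not specify; the helper names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j - 1] + (r != h),  # substitution or match
                            prev[j] + 1,             # deletion
                            curr[j - 1] + 1))        # insertion
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """(S + I + D) divided by the reference length (C + S + D)."""
    return edit_distance(ref, hyp) / len(ref)


reference = "how i wonder what you are"
hypothesis = "how i wander what you are"

word_error_rate = error_rate(reference.split(), hypothesis.split())  # 1 of 6 words wrong
char_error_rate = error_rate(reference, hypothesis)                  # 1 of 25 characters wrong
```

A single misspelled word costs a full word error but only one character error, which is why CER is more forgiving of near misses.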

IMPORTANT: Each evaluation sample is a few minutes long. The submission is expected to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.
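One simple way to handle long recordings is to split them into fixed-length, overlapping windows before decoding. The sketch below only computes the window boundaries; the window and overlap lengths are arbitrary assumptions, and a real pipeline might instead segment on silence or voice activity.

```python
def chunk_bounds(total_seconds, window=30.0, overlap=5.0):
    """Return (start, end) pairs in seconds that together cover the whole recording."""
    if window <= overlap:
        raise ValueError("window must be longer than overlap")
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + window, total_seconds)
        bounds.append((start, end))
        if end == total_seconds:
            break
        # Step back by the overlap so words cut at a boundary appear in both windows.
        start = end - overlap
    return bounds


segments = chunk_bounds(280.0)  # e.g. a recording of 4:40 minutes
```

Transcriptions of overlapping windows still need to be merged into one word sequence, which is left out of this sketch.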

Submission Format

Submissions must be made through the MIREX system (info available here: https://www.music-ir.org/mirex/wiki/2021:Main_Page#MIREX_2021_Submission_Instructions) and should be packaged in a compressed file (.zip, .rar, etc.) that contains at least two files:

A) The main transcription script

The main transcription script to execute. This should be a one-line executable in one of the following formats: a bash (.sh) script, a Python (.py) script, or a binary file.

I / O

The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.

Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:

foobar ${input_audio_path}  ${output}

OR with flags:

foobar -i ${input_audio_path}  -o ${output}
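For a Python entry point, both calling conventions can be accepted with the standard `argparse` module. This is a sketch under the assumption that exactly one of the two forms is used per call; `parse_args` is a hypothetical helper, not part of any MIREX tooling.

```python
import argparse


def parse_args(argv=None):
    """Accept either `foobar INPUT OUTPUT` or `foobar -i INPUT -o OUTPUT`."""
    parser = argparse.ArgumentParser(prog="foobar")
    parser.add_argument("positional", nargs="*",
                        help="input audio path and output path (positional form)")
    parser.add_argument("-i", dest="input_audio_path", help="input audio path")
    parser.add_argument("-o", dest="output", help="output path")
    args = parser.parse_args(argv)

    used_flags = args.input_audio_path is not None or args.output is not None
    if len(args.positional) == 2 and not used_flags:
        args.input_audio_path, args.output = args.positional
    elif args.positional or args.input_audio_path is None or args.output is None:
        parser.error("expected INPUT OUTPUT or -i INPUT -o OUTPUT")
    return args.input_audio_path, args.output


# Both call styles yield the same pair of paths.
from_positional = parse_args(["song.wav", "results"])
from_flags = parse_args(["-i", "song.wav", "-o", "results"])
```

In a real script, calling `parse_args()` without arguments reads the paths from the command line.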

Input Audio

Participating algorithms will have to receive the following input format:

Audio format : WAV / MP3
CD-quality (PCM, 16-bit, 44100 Hz)
single channel (mono) for a cappella (Hansen) and two channels for original
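A submission may want to verify these properties before decoding. The sketch below does so for WAV input with Python's standard `wave` module; MP3 input would need a separate decoder and is not covered here. The function names are hypothetical.

```python
import wave


def describe_wav(path):
    """Return (channels, bits per sample, sample rate) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def is_cd_quality(path, expected_channels):
    """Check for 16-bit, 44100 Hz audio with the expected number of channels."""
    channels, bits, rate = describe_wav(path)
    return bits == 16 and rate == 44100 and channels == expected_channels
```

For example, `is_cd_quality(path, 1)` would be the check for an a cappella file and `is_cd_quality(path, 2)` for an original mix.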

Output File Format

A text file (one per song) containing a list of words separated by whitespace:

 <word_1> <word_2> ... <word_N>

Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.

Ideally, the output transcriptions will be saved as:

 ${output}/${input_song_id}.txt
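The output step could look like the sketch below. It assumes that ${output} is a directory and that the song id is the input file name without its extension; the set of non-word tokens is hypothetical and depends on the decoder in use.

```python
import os

# Hypothetical non-word tokens; adapt this set to whatever your decoder emits.
NON_WORD_TOKENS = {"<sil>", "<music>", "<noise>", "<s>", "</s>"}


def save_transcription(tokens, input_audio_path, output_dir):
    """Write the words of one song to <output_dir>/<song_id>.txt and return the path."""
    words = [t for t in tokens if t not in NON_WORD_TOKENS]
    song_id = os.path.splitext(os.path.basename(input_audio_path))[0]
    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, song_id + ".txt")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(" ".join(words) + "\n")
    return out_path
```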

B) The README file

This file must contain detailed installation instructions, usage instructions for the main script, and contact information.

Any submission that fails to meet the above requirements will not be considered in the evaluation!

Training Datasets

Datasets in automatic lyrics transcription research fall into two domains, depending on whether musical instruments accompany the singer: monophonic and polyphonic datasets.

The former contain only a solo singer singing the lyrics, while the latter include musical accompaniment.

In this challenge, the participants are encouraged but not obliged to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:

DAMP dataset

The DAMP - Sing!300x30x2 dataset consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application.

The data is curated to be gender-balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation. See the list of recordings. For more details, see the paper.

The audio can be downloaded from the Smule web site.
Lyrics boundary annotations can be generated from raw annotations using this repository. Paper here (1).
Alternatively, annotations can be retrieved directly in Kaldi format here. Paper here (2).

DALI Dataset

DALI (a large Dataset of synchronised Audio, LyrIcs and notes) (3) is the benchmark dataset for building an acoustic model on polyphonic recordings (4,5,6) and it contains over 5000 songs with semi-automatically aligned lyrics annotations.

The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).

For each song, DALI provides a link to a matched YouTube video for audio retrieval.

For more details, see its full description here. Paper here.

Evaluation Datasets

The following datasets are used for evaluation and therefore must not be used by participants to train their models under any circumstances.

Note that the evaluation sets listed below consist of popular English-language songs and share some songs with DALI.

*** IMPORTANT *** If you use DALI for training, you MUST exclude the songs used for MIREX evaluation from your training data in order to make a scientific evaluation possible.
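A simple way to enforce this is to filter the training list against the evaluation songs before training. The sketch below matches songs by normalised artist and title; the dictionary keys and the placeholder song names are hypothetical and do not reflect the actual DALI metadata format.

```python
def normalise(artist, title):
    """Build a case- and whitespace-insensitive key for matching songs."""
    return (" ".join(artist.lower().split()), " ".join(title.lower().split()))


def exclude_evaluation_songs(training_entries, evaluation_songs):
    """Drop every training entry whose (artist, title) appears in the evaluation sets."""
    blocked = {normalise(artist, title) for artist, title in evaluation_songs}
    return [entry for entry in training_entries
            if normalise(entry["artist"], entry["title"]) not in blocked]


# Placeholder data for illustration only.
evaluation_songs = [("Artist A", "Song One")]
training_entries = [
    {"artist": "artist a", "title": "Song  One"},   # overlaps with the evaluation set
    {"artist": "Artist B", "title": "Song Two"},
]
clean_training = exclude_evaluation_songs(training_entries, evaluation_songs)
```

Exact string matching can miss remixes or alternative spellings, so a manual check of the remaining list is still advisable.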

Hansen's Dataset

The dataset contains 9 pop music songs released in the early 2010s.

The audio has two versions: the original mix with instrumental accompaniment, and an a cappella version with the singing voice only. An example song can be seen here (https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0).

You can read in detail about how the dataset was made here: (7) (http://publica.fraunhofer.de/documents/N-345612.html). The recordings have been provided by Jens Kofod Hansen for public evaluation.

file duration up to 4:40 minutes (total time: 35:33 minutes)
3590 words annotated in total

Mauch's Dataset

The dataset contains 20 pop music songs annotated with the start timestamp of each word. The audio has instrumental accompaniment. An example song can be seen here (https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0).

You can read in detail about how the dataset was used for the first time here: (8) (https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf). The dataset has been provided by Sungkyun Chang.

file duration up to 5:40 minutes (total time: 1h 19m)
5050 words annotated in total

Jamendo Dataset

This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.

It is available online on GitHub (https://github.com/f90/jamendolyrics). Note, however, that we do not allow tuning model parameters on this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to this paper (9) (https://arxiv.org/abs/1902.06797).

file duration up to 4:43 (total time: 1h 12m)
5677 words annotated in total

Time and hardware limits

Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be imposed. A hard limit of 24 hours will be imposed on analysis times. Submissions exceeding this limit may not receive a result. In addition, submissions that cannot run within the RAM and CPU requirements you specify may not receive a result.

Submission closing dates

Closing date: December 9, 2021

Audio-to-Lyrics Alignment

Because there are not enough participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.

However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If enough participants come forward, we may organise the Audio-to-Lyrics Alignment challenge as well.

Questions?

Send us an email: e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)

Potential Participants

Chitralekha Gupta

Emir Demirel

Gerardo Roa Dabike

Bibliography

1 - Roa Dabike, G., & Barker, J. (2019). Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378

2 - Demirel, E., Ahlbäck, S., & Dixon, S. (2020). Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention. In IJCNN 2020, 1-8. IEEE.

3 - Meseguer-Brocal, G., Cohen-Hadria, A., & Peeters, G. (2018). DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm. In ISMIR 2018.

4 - Gupta, C., Yılmaz, E., & Li, H. (2020). Automatic lyrics alignment and transcription in polyphonic music: Does background music help?. In ICASSP 2020, 496-500. IEEE.

5 - Basak, S., Agarwal, S., Ganapathy, S., & Takahashi, N. (2021, June). End-to-End Lyrics Recognition with Voice to Singing Style Transfer. In ICASSP 2021, 266-270. IEEE.

6 - Demirel, E., Ahlbäck, S., & Dixon, S. (2021). MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription. Proc. ISMIR 2021.

7 - Hansen, J. K. (2012). Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients. In 9th Sound and Music Computing Conference (SMC), 494-499.

8 - Mauch, M., Fujihara, H., & Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.

9 - Stoller, D., Durand, S., & Ewert, S. (2019). End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. In ICASSP 2019. IEEE.
