<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://music-ir.org/mirex/w/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Georgi+Dzhambazov</id>
	<title>MIREX Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://music-ir.org/mirex/w/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Georgi+Dzhambazov"/>
	<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/wiki/Special:Contributions/Georgi_Dzhambazov"/>
	<updated>2026-04-15T12:05:38Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.31.1</generator>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13575</id>
		<title>2021: Automatic Lyrics Transcription Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13575"/>
		<updated>2022-03-20T18:14:34Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= General Legend =&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: yellow&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; | Sub code&lt;br /&gt;
    ! width=&amp;quot;200&amp;quot; | Submission name&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; style=&amp;quot;text-align: center;&amp;quot; | Abstract&lt;br /&gt;
    ! width=&amp;quot;540&amp;quot; | Contributors&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://drive.google.com/file/d/1yxwIZcQTvyb_IgN37lxQuy4TFKAJHwrZ/view?usp=sharing PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://drive.google.com/file/d/1yxwIZcQTvyb_IgN37lxQuy4TFKAJHwrZ/view?usp=sharing PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
NOTE: Other submissions did not produce results because their algorithms exceeded the time and hardware limits described in the task rules. &lt;br /&gt;
&lt;br /&gt;
=Results=&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset a cappella===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 11.45&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 12.77&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_solo_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_solo_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 13.54&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 16.88&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_poli_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_poli_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mauch's dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 22.11&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.20&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/mauch_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/mauch_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Jamendo dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 24.34&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.86&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/jamendo_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/jamendo_yhll3.csv&amp;lt;/csv&amp;gt;&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13574</id>
		<title>2021: Automatic Lyrics Transcription Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13574"/>
		<updated>2022-03-20T18:11:23Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Per-track results */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= General Legend =&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: yellow&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; | Sub code&lt;br /&gt;
    ! width=&amp;quot;200&amp;quot; | Submission name&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; style=&amp;quot;text-align: center;&amp;quot; | Abstract&lt;br /&gt;
    ! width=&amp;quot;540&amp;quot; | Contributors&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://drive.google.com/file/d/1yxwIZcQTvyb_IgN37lxQuy4TFKAJHwrZ/view?usp=sharing PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://drive.google.com/file/d/1yxwIZcQTvyb_IgN37lxQuy4TFKAJHwrZ/view?usp=sharing PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
NOTE: Other submissions did not produce results because their algorithms exceeded the time and hardware limits described in the task rules. &lt;br /&gt;
&lt;br /&gt;
=Results=&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset a cappella===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 11.45&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 12.77&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;results/2021/lt/hansen_solo_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_solo_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 13.54&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 16.88&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_poli_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_poli_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mauch's dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 22.11&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.20&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/mauch_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/mauch_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Jamendo dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 24.34&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.86&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/jamendo_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/jamendo_yhll3.csv&amp;lt;/csv&amp;gt;&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13573</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13573"/>
		<updated>2022-03-12T19:35:31Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX 2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' denote the word sequence and the acoustic features, respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER): the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where:&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER): the above computation can also be done at the character level. This metric penalises partially correct or misspelled words less than WER does.&lt;br /&gt;
&lt;br /&gt;
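For illustration only (this is not the official scoring code), a minimal Python sketch of the WER computation via edit distance could look as follows; the function name and the plain dynamic-programming routine are assumptions of this sketch:&lt;br /&gt;
&lt;br /&gt;
  def word_error_rate(reference, hypothesis):&lt;br /&gt;
      # reference and hypothesis are lists of words&lt;br /&gt;
      n, m = len(reference), len(hypothesis)&lt;br /&gt;
      # dp[i][j] holds the edit distance between reference[:i] and hypothesis[:j]&lt;br /&gt;
      dp = [[0] * (m + 1) for _ in range(n + 1)]&lt;br /&gt;
      for i in range(n + 1):&lt;br /&gt;
          dp[i][0] = i&lt;br /&gt;
      for j in range(m + 1):&lt;br /&gt;
          dp[0][j] = j&lt;br /&gt;
      for i in range(1, n + 1):&lt;br /&gt;
          for j in range(1, m + 1):&lt;br /&gt;
              cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1&lt;br /&gt;
              dp[i][j] = min(dp[i - 1][j] + 1,          # deletion&lt;br /&gt;
                             dp[i][j - 1] + 1,          # insertion&lt;br /&gt;
                             dp[i - 1][j - 1] + cost)   # substitution&lt;br /&gt;
      # S + I + D divided by the number of reference words (= C + S + D)&lt;br /&gt;
      return dp[n][m] / float(n)&lt;br /&gt;
&lt;br /&gt;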
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are each a few minutes of audio long. The submission is expected to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this must already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions must be done through the MIREX system (info available [https://www.music-ir.org/mirex/wiki/2021:Main_Page#MIREX_2021_Submission_Instructions here]) and should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called 'foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms must be able to read the following input format (a minimal loading sketch follows the list below):&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for the a cappella version (Hansen) and two channels (stereo) for the original mixes&lt;br /&gt;
&lt;br /&gt;
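For illustration, a WAV input in this format could be loaded, and the two-channel case downmixed to mono, roughly as follows (a sketch assuming scipy is installed; the file name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
  # a possible way to read the WAV input; MP3 decoding would need another library&lt;br /&gt;
  from scipy.io import wavfile&lt;br /&gt;
 &lt;br /&gt;
  sample_rate, samples = wavfile.read('example.wav')   # expects 44100 Hz, 16-bit PCM&lt;br /&gt;
  if samples.ndim == 2:&lt;br /&gt;
      samples = samples.mean(axis=1)   # downmix the two-channel original mix to mono&lt;br /&gt;
&lt;br /&gt;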
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by whitespace:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end-of-sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
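As a rough sketch (not a required template) of a main script that follows the calling convention and the naming scheme above, where transcribe() is a placeholder for the actual system:&lt;br /&gt;
&lt;br /&gt;
  import os&lt;br /&gt;
  import sys&lt;br /&gt;
 &lt;br /&gt;
  def transcribe(audio_path):&lt;br /&gt;
      # placeholder for the actual lyrics transcription system&lt;br /&gt;
      return ['word_1', 'word_2']&lt;br /&gt;
 &lt;br /&gt;
  if __name__ == '__main__':&lt;br /&gt;
      # matches the call: foobar ${input_audio_path} ${output}&lt;br /&gt;
      input_audio_path, output_dir = sys.argv[1], sys.argv[2]&lt;br /&gt;
      words = transcribe(input_audio_path)&lt;br /&gt;
      song_id = os.path.splitext(os.path.basename(input_audio_path))[0]&lt;br /&gt;
      # ${output} is treated as a directory here, following the naming scheme above&lt;br /&gt;
      with open(os.path.join(output_dir, song_id + '.txt'), 'w') as f:&lt;br /&gt;
          f.write(' '.join(words))&lt;br /&gt;
&lt;br /&gt;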
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains with regard to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former contains only a single singer performing the lyrics, while the latter also includes musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-balanced and contains performers from 30 different countries, which provides good variability in accents and pronunciation. &lt;br /&gt;
See the [https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details, see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]. Paper [https://isca-speech.org/archive/Interspeech_2019/pdfs/2378.pdf here (1)].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] Paper [https://arxiv.org/pdf/2007.06486.pdf here (2)].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) (3) is the benchmark dataset for building an acoustic model on polyphonic recordings (4,5,6) and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are full-length commercial recordings, and the lyrics are described at different levels of granularity, including words and notes (and the syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song, DALI provides a link to a matched YouTube video for audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details, see its full description [https://github.com/gabolsgabs/DALI here]. Paper [https://arxiv.org/pdf/1906.10606.pdf here].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in the English language and have samples overlapping with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment and an a cappella version with singing voice only. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (7)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (8)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics GitHub]. Note that we do not allow tuning model parameters using this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper (9)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Time and hardware limits =&lt;br /&gt;
Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be imposed.&lt;br /&gt;
A hard limit of 24 hours will be imposed on analysis times. Submissions exceeding this limit may not receive a result. In addition, submissions that fail to run within the RAM and CPU resources specified in your own instructions may not receive a result.&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you are willing to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - Roa Dabike, G., Barker, J. (2019) Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
2 - Demirel, E., Ahlbäck, S., &amp;amp; Dixon, S. (2020). Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention. In IJCNN 2020, 1-8. IEEE.&lt;br /&gt;
&lt;br /&gt;
3 - Meseguer-Brocal, G., Cohen-Hadria, A., &amp;amp; Peeters, G. (2018). DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm. In ISMIR 2018.&lt;br /&gt;
&lt;br /&gt;
4 - Gupta, C., Yılmaz, E., &amp;amp; Li, H. (2020). Automatic lyrics alignment and transcription in polyphonic music: Does background music help?. In ICASSP 2020, 496-500. IEEE.&lt;br /&gt;
&lt;br /&gt;
5 - Basak, S., Agarwal, S., Ganapathy, S., &amp;amp; Takahashi, N. (2021, June). End-to-End Lyrics Recognition with Voice to Singing Style Transfer. In ICASSP 2021, 266-270. IEEE.&lt;br /&gt;
&lt;br /&gt;
6- Demirel, E., Ahlbäck, S., &amp;amp; Dixon, S. (2021). MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription. Proc. ISMIR 2021.&lt;br /&gt;
&lt;br /&gt;
7 - Hansen, J. K., &amp;amp; Fraunhofer, I. D. M. T. (2012). Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients. In 9th Sound and Music Computing Conference (SMC), 494-499.&lt;br /&gt;
&lt;br /&gt;
8 - Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;br /&gt;
&lt;br /&gt;
9 - Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. In ICASSP 2019, IEEE.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13572</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13572"/>
		<updated>2022-03-12T19:35:03Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX 2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' denote the word sequence and the acoustic features, respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER): the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where:&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER): the above computation can also be done at the character level. This metric penalises partially correct or misspelled words less than WER does.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are each a few minutes of audio long. The submission is expected to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this must already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions must be done through the MIREX system (info available [https://www.music-ir.org/mirex/wiki/2021:Main_Page#MIREX_2021_Submission_Instructions here]) and should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called 'foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms must be able to read the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for the a cappella version (Hansen) and two channels (stereo) for the original mixes&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by whitespace:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end-of-sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains with regard to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former contains only a single singer performing the lyrics, while the latter also includes musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-balanced and contains performers from 30 different countries, which provides good variability in accents and pronunciation. &lt;br /&gt;
See the [https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details, see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]. Paper [https://isca-speech.org/archive/Interspeech_2019/pdfs/2378.pdf here (1)].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] Paper [https://arxiv.org/pdf/2007.06486.pdf here (2)].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) (3) is the benchmark dataset for building an acoustic model on polyphonic recordings (4,5,6) and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are full-length commercial recordings, and the lyrics are described at different levels of granularity, including words and notes (and the syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song, DALI provides a link to a matched YouTube video for audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details, see its full description [https://github.com/gabolsgabs/DALI here]. Paper [https://arxiv.org/pdf/1906.10606.pdf here].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in the English language and have samples overlapping with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment and an a cappella version with singing voice only. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (7)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (8)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics GitHub]. Note that we do not allow tuning model parameters using this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper (9)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
== Time and hardware limits ==&lt;br /&gt;
Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be imposed.&lt;br /&gt;
A hard limit of 24 hours will be imposed on analysis times. Submissions exceeding this limit may not receive a result. In addition, submissions that fail to run within the RAM and CPU resources specified in your own instructions may not receive a result.&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you are willing to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - Roa Dabike, G., Barker, J. (2019) Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
2 - Demirel, E., Ahlbäck, S., &amp;amp; Dixon, S. (2020). Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention. In IJCNN 2020, 1-8. IEEE.&lt;br /&gt;
&lt;br /&gt;
3 - Meseguer-Brocal, G., Cohen-Hadria, A., &amp;amp; Peeters, G. (2018). DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm. In ISMIR 2018.&lt;br /&gt;
&lt;br /&gt;
4 - Gupta, C., Yılmaz, E., &amp;amp; Li, H. (2020). Automatic lyrics alignment and transcription in polyphonic music: Does background music help?. In ICASSP 2020, 496-500. IEEE.&lt;br /&gt;
&lt;br /&gt;
5 - Basak, S., Agarwal, S., Ganapathy, S., &amp;amp; Takahashi, N. (2021, June). End-to-End Lyrics Recognition with Voice to Singing Style Transfer. In ICASSP 2021, 266-270. IEEE.&lt;br /&gt;
&lt;br /&gt;
6- Demirel, E., Ahlbäck, S., &amp;amp; Dixon, S. (2021). MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription. Proc. ISMIR 2021.&lt;br /&gt;
&lt;br /&gt;
7 - Hansen, J. K., &amp;amp; Fraunhofer, I. D. M. T. (2012). Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients. In 9th Sound and Music Computing Conference (SMC), 494-499.&lt;br /&gt;
&lt;br /&gt;
8 - Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;br /&gt;
&lt;br /&gt;
9 - Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. In ICASSP 2019, IEEE.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13571</id>
		<title>2021: Automatic Lyrics Transcription Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13571"/>
		<updated>2022-03-12T19:33:43Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= General Legend =&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: yellow&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; | Sub code&lt;br /&gt;
    ! width=&amp;quot;200&amp;quot; | Submission name&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; style=&amp;quot;text-align: center;&amp;quot; | Abstract&lt;br /&gt;
    ! width=&amp;quot;540&amp;quot; | Contributors&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://drive.google.com/file/d/1yxwIZcQTvyb_IgN37lxQuy4TFKAJHwrZ/view?usp=sharing PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://drive.google.com/file/d/1yxwIZcQTvyb_IgN37lxQuy4TFKAJHwrZ/view?usp=sharing PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
NOTE: Other submissions did not produce results because their algorithms exceeded the time and hardware limits described in the task rules. &lt;br /&gt;
&lt;br /&gt;
=Results=&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset a cappella===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 11.45&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 12.77&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_solo_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_solo_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 13.54&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 16.88&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_poli_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_poli_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mauch's dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 22.11&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.20&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/mauch_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/mauch_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Jamendo dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 24.34&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.86&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/jamendo_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/jamendo_yhll3.csv&amp;lt;/csv&amp;gt;&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2020:Lyrics_Transcription&amp;diff=13570</id>
		<title>2020:Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2020:Lyrics_Transcription&amp;diff=13570"/>
		<updated>2022-03-12T19:32:03Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Description==&lt;br /&gt;
&lt;br /&gt;
This year for the first time we host two tasks simultaneously:&lt;br /&gt;
Lyrics Transcription and Lyrics-to-audio alignment. You are free to participate in one of the tasks or both of them. &lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung music audio, in the same way as in automatic speech recognition. &lt;br /&gt;
The task of Automatic Lyrics-to-Audio Alignment has as its end goal the synchronization between an audio recording of singing and its corresponding written lyrics. The beginning timestamps of lyrics units can be estimated at different granularities: phonemes, words, lyrics lines, phrases. For this task, word-level alignment is required.&lt;br /&gt;
&lt;br /&gt;
   -----------------------    ---------------------------------------------------&lt;br /&gt;
   | Mixed singing audio |    | Lyrics at word-level: no more carefree ... ... |&lt;br /&gt;
   -----------------------    ---------------------------------------------------&lt;br /&gt;
                  |                                            |&lt;br /&gt;
                   --------------------------------------------&lt;br /&gt;
                                      |&lt;br /&gt;
                              --------------------&lt;br /&gt;
                              | Alignment system |&lt;br /&gt;
                              --------------------&lt;br /&gt;
                                      |&lt;br /&gt;
                                      |&lt;br /&gt;
                              --------------------------&lt;br /&gt;
                              | 0.123 	0.798  no     |&lt;br /&gt;
                              | 0.798 	1.123  more   |&lt;br /&gt;
                              | 1.345 	2.176  carefree|&lt;br /&gt;
                              | ... ...                |&lt;br /&gt;
                              --------------------------&lt;br /&gt;
The algorithm receives mixed singing audio (singing voice + musical accompaniment) and, in the case of alignment, its corresponding lyrics at word level. It outputs the recognized words in the case of transcription, and the onset and offset timestamps (in seconds) of each word in the case of alignment.&lt;br /&gt;
&lt;br /&gt;
==Datasets==&lt;br /&gt;
&lt;br /&gt;
===Training Datasets===&lt;br /&gt;
&lt;br /&gt;
==== DAMP dataset ====&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP Multilingual Vocal Performances (MVP) dataset] contains a large number (34 000+) of a cappella recordings from a wide variety of amateur singers, collected with the Sing! Karaoke mobile app in different recording conditions, but generally with good audio quality. A carefully curated subset, DAMPB, containing 20 performances of each of 300 songs, has been created by Kruspe (2016). Here is the [https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details, see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* No lyrics boundary annotations are available, but the textual lyrics are available on the [https://www.smule.com/songs Smule Sing! Karaoke website]&lt;br /&gt;
&lt;br /&gt;
==== DALI Dataset ====&lt;br /&gt;
&lt;br /&gt;
The DALI dataset (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) contains over 5000 songs with semi-automatically aligned lyrics annotations. The songs are full-length commercial recordings, and the lyrics are described at different levels of granularity, including words and notes (and the syllables underlying a given note). For each song, DALI provides a link to a matched YouTube video, from which the audio can be retrieved.&lt;br /&gt;
For more details, see its full description [https://github.com/gabolsgabs/DALI here].&lt;br /&gt;
&lt;br /&gt;
===Evaluation Datasets===&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so cannot be used by participants to train their models under any circumstance.&lt;br /&gt;
&lt;br /&gt;
==== Hansen's Dataset ====&lt;br /&gt;
The dataset contains 9 pop music songs in English with annotations of both the beginning and ending timestamps of each word. The ending timestamps are provided for convenience (copies of the next word's beginning timestamp) and are not used in the evaluation. Sentence-level annotations are also provided.&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment and an a cappella version with singing voice only. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here]&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html Recognition of Phonemes in A-cappella Recordings using Temporal Patterns and Mel Frequency Cepstral Coefficients]. The dataset has been kindly provided by Jens Kofod Hansen.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
==== Mauch's Dataset ====&lt;br /&gt;
The dataset contains 20 pop music songs in English with annotations of the beginning timestamp of each word. Non-vocal sections are not explicitly annotated (but remain included in the last preceding word). We prefer to leave it this way to enable comparison with previous work evaluated on this dataset.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment]. The dataset has been kindly provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Jamendo Dataset ====&lt;br /&gt;
This dataset contains 20 full-duration music pieces spanning 10 different Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment. It is available online on [https://github.com/f90/jamendolyrics GitHub]. Note that we do not allow tuning model parameters using this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Phonetization ===&lt;br /&gt;
A popular choice for phonetization of the words is the [http://www.speech.cs.cmu.edu/cgi-bin/cmudict CMU pronunciation dictionary]. One can phonetize them with the [http://www.speech.cs.cmu.edu/tools/lextool.html online tool]. A list of all words of both datasets, which are outside of the [https://github.com/georgid/AlignmentDuration/blob/noteOnsets/src/for_english/cmudict.0.6d.syll list of CMU words] is given [https://www.dropbox.com/s/flu4cpqff916bas/words_not_in_dict?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
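For illustration, the CMU dictionary can also be queried programmatically, e.g. via NLTK (a sketch assuming the nltk package and its cmudict corpus are installed, e.g. with nltk.download('cmudict')):&lt;br /&gt;
&lt;br /&gt;
  from nltk.corpus import cmudict&lt;br /&gt;
 &lt;br /&gt;
  pronunciations = cmudict.dict()&lt;br /&gt;
  # each entry maps a lowercase word to a list of possible phoneme sequences&lt;br /&gt;
  print(pronunciations.get('carefree', 'not in the CMU dictionary'))&lt;br /&gt;
&lt;br /&gt;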
=== Audio Format ===&lt;br /&gt;
&lt;br /&gt;
The data are sound wav/mp3 files, plus the associated word boundaries (in csv-like .txt/.tsv files)&lt;br /&gt;
&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella and two channels for original&lt;br /&gt;
&lt;br /&gt;
==Evaluation==&lt;br /&gt;
===Transcription===&lt;br /&gt;
Word Error Rate (WER) - the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
===Alignment===&lt;br /&gt;
The submitted algorithms will be evaluated at the word boundaries for the originally mixed songs (a cappella singing + instrumental accompaniment). Evaluation metrics on the a cappella singing can be reported as well on request, for the sake of gaining insight into the impact of instrumental accompaniment on the algorithm, but will not be considered for the ranking.&lt;br /&gt;
&lt;br /&gt;
'''Average absolute error/deviation''' Initially utilized in [http://www.cs.tut.fi/~mesaros/pubs/autalign_cr.pdf Mesaros and Virtanen (2008)], the absolute error measures the time displacement between the actual timestamp and its estimate at the beginning and the end of each lyrical unit. The error is then averaged over all individual errors. An error in absolute terms has the drawback that the perception of an error with the same duration can be different depending on the tempo of the song. &lt;br /&gt;
Here is a [https://github.com/georgid/AlignmentEvaluation/blob/126c3fa5fa1994acdcfbe3ea1344acfe71ae2b8e/test/EvalMetricsTest.py#L117 test] of using this metric. &lt;br /&gt;
&lt;br /&gt;
'''Percentage of correct segments''' The perceptual dependence on tempo is mitigated by measuring the percentage of the total length of the correctly labeled segments relative to the total duration of the song. This metric is suggested by [https://www.researchgate.net/publication/224241940_LyricSynchronizer_Automatic_Synchronization_System_Between_Musical_Audio_Signals_and_Lyrics Fujihara et al. (2011), Figure 9]. &lt;br /&gt;
Here is a [https://github.com/georgid/AlignmentEvaluation/blob/126c3fa5fa1994acdcfbe3ea1344acfe71ae2b8e/test/EvalMetricsTest.py#L76 test] of using this metric.&lt;br /&gt;
&lt;br /&gt;
'''Percentage of correct estimates according to a tolerance window''' A metric that takes into consideration that the onset displacements from ground truth below a certain threshold could be tolerated by human listeners. We use 0.3 seconds as the tolerance window. This metric is suggested in [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment]. &lt;br /&gt;
Here is a [https://github.com/georgid/AlignmentEvaluation/blob/126c3fa5fa1994acdcfbe3ea1344acfe71ae2b8e/test/EvalMetricsTest.py#L151 test] of using this metric.&lt;br /&gt;
&lt;br /&gt;
For more detailed definition and formulas about the metrics, please check the section 2.2.1 of [https://doi.org/10.5281/zenodo.841979 this thesis].&lt;br /&gt;
&lt;br /&gt;
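For an informal illustration of the first and third metrics (the linked eval.py remains the reference implementation), assuming equal-length lists of reference and detected word onset times in seconds:&lt;br /&gt;
&lt;br /&gt;
  def average_absolute_error(reference_onsets, detected_onsets):&lt;br /&gt;
      # mean time displacement between annotated and estimated onsets&lt;br /&gt;
      errors = [abs(r - d) for r, d in zip(reference_onsets, detected_onsets)]&lt;br /&gt;
      return sum(errors) / float(len(errors))&lt;br /&gt;
 &lt;br /&gt;
  def percentage_within_tolerance(reference_onsets, detected_onsets, tolerance=0.3):&lt;br /&gt;
      # share of onsets whose displacement stays inside the 0.3 s tolerance window&lt;br /&gt;
      hits = [1 for r, d in zip(reference_onsets, detected_onsets) if abs(r - d) &amp;lt;= tolerance]&lt;br /&gt;
      return 100.0 * len(hits) / len(reference_onsets)&lt;br /&gt;
&lt;br /&gt;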
'''To obtain all three metrics for one detected output:'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt; python [https://github.com/georgid/AlignmentEvaluation/blob/master/align_eval/eval.py eval.py] &amp;lt;file path of the reference word boundaries&amp;gt; &amp;lt;file path of the detected word boundaries&amp;gt; &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that evaluation scripts depend on [https://github.com/craffel/mir_eval/ mir_eval].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Submission Format ==&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged and contain at least two files: The algorithm itself (as a binary or source code) and a README containing contact information and detailing, in full, the use of the algorithm.&lt;br /&gt;
&lt;br /&gt;
=== Input Data ===&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
====Transcription====&lt;br /&gt;
* Audio in wav, 44.1kHz, stereo.&lt;br /&gt;
&lt;br /&gt;
====Alignment====&lt;br /&gt;
&lt;br /&gt;
* Audio in wav, 44.1kHz, stereo.&lt;br /&gt;
* Lyrics in .txt file where each word is separated by a space, each lyrics phrase is separated by a line break mark (\n).&lt;br /&gt;
&lt;br /&gt;
=== Output File Format ===&lt;br /&gt;
====Transcription====&lt;br /&gt;
A list of words separated by white space&lt;br /&gt;
&amp;lt;word1&amp;gt; &amp;lt;word2&amp;gt; ...&lt;br /&gt;
Any non-word items (e.g. silence or end-of-sentence tokens) should be excluded. &lt;br /&gt;
&lt;br /&gt;
====Alignment====&lt;br /&gt;
&lt;br /&gt;
The alignment output file format is a tab-delimited ASCII text format. &lt;br /&gt;
&lt;br /&gt;
It is a three-column text file of the form:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;onset_time(sec)&amp;gt;\t&amp;lt;offset_time(sec)&amp;gt;\t&amp;lt;label&amp;gt;\n&lt;br /&gt;
 &amp;lt;onset_time(sec)&amp;gt;\t&amp;lt;offset_time(sec)&amp;gt;\t&amp;lt;label&amp;gt;\n&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
where \t denotes a tab and \n denotes the end of the line. The &amp;lt; and &amp;gt; characters are not included. An example output file would look something like this:&lt;br /&gt;
&lt;br /&gt;
 0.000    5.223    word1&lt;br /&gt;
 5.223    15.101   word2&lt;br /&gt;
 15.101   20.334   word3&lt;br /&gt;
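&lt;br /&gt;
A minimal sketch of producing this three-column output from a list of (onset, offset, word) tuples:&lt;br /&gt;
 # Minimal sketch: write word alignments as tab-delimited onset/offset/label rows.&lt;br /&gt;
 def write_alignment(alignments, path):&lt;br /&gt;
     with open(path, 'w', encoding='utf-8') as f:&lt;br /&gt;
         for onset, offset, word in alignments:&lt;br /&gt;
             f.write('%.3f\t%.3f\t%s\n' % (onset, offset, word))&lt;br /&gt;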
&lt;br /&gt;
'''NOTE:''' the offset timestamp column is used only by the percentage-of-correct-segments metric. Therefore, omitting the second column is acceptable and would degrade the performance of that metric only.&lt;br /&gt;
&lt;br /&gt;
=== Command line calling format ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments a .wav file, a .txt file (for the alignment task), and the full output path and filename of the output file. The ability to specify the output path and file name is essential. Denoting the input .wav file path and name as %input_audio, the lyrics .txt file as %input_txt, and the output file path and name as %output, a program called foobar could be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar %input_audio (%input_txt) %output&lt;br /&gt;
 foobar -i %input_audio (-it %input_txt)  -o %output&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== README File ===&lt;br /&gt;
&lt;br /&gt;
A README file accompanying each submission should contain clear instructions on how to run the program (as well as contact information, etc.). In particular, each command line to run should be specified, using %input_audio (and %input_txt, where applicable) for the input files and %output for the resulting text file.&lt;br /&gt;
&lt;br /&gt;
== Time and hardware limits ==&lt;br /&gt;
Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be imposed.&lt;br /&gt;
A hard limit of 24 hours will be imposed on analysis times. Submissions exceeding this limit may not receive a result. In addition, submissions that are not able to run with the available RAM and CPU following the instructions you provide may not receive a result.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Submission closing dates ==&lt;br /&gt;
Closing date: First week of September 2020&lt;br /&gt;
&lt;br /&gt;
== Questions? ==&lt;br /&gt;
&lt;br /&gt;
* send us an email - d.stoller@qmul.ac.uk (Daniel Stoller) or info@voicemagix.com (Georgi Dzhambazov) or chitralekha@nus.edu.sg (Chitralekha Gupta). &lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
== Bibliography ==&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Sharma B, Gupta C. (2019) Automatic Lyrics-to-audio Alignment on Polyphonic Music Using Singing-adapted Acoustic Models. ICASSP 2019&lt;br /&gt;
&lt;br /&gt;
Lee S. W., Scott, J. (2017) Word-level lyrics-audio synchronization using separated vocals&amp;quot;, Acoustics Speech and Signal Processing, ICASSP IEEE International Conference on, pp. 646-650&lt;br /&gt;
&lt;br /&gt;
Chang, S., &amp;amp; Lee, K. (2017). Lyrics-to-Audio Alignment by Unsupervised Discovery of Repetitive Patterns in Vowel Acoustics. arXiv preprint arXiv:1701.06078.&lt;br /&gt;
&lt;br /&gt;
Pons, J. Gong, R. and Serra, X. (2017). Score-informed syllable segmentation for a cappella singing voice with convolutional neural networks. ISMIR 2017&lt;br /&gt;
&lt;br /&gt;
Kruspe, A. (2016). Bootstrapping a System for Phoneme Recognition and Keyword Spotting in Unaccompanied Singing, ISMIR 2016&lt;br /&gt;
&lt;br /&gt;
Dzhambazov, G. and Serra, X. (2015) Modeling of phoneme durations for alignment between polyphonic audio and lyrics, in 12th Sound and Music Computing Conference&lt;br /&gt;
&lt;br /&gt;
Fujihara, H., &amp;amp; Goto, M. (2012). Lyrics-to-audio alignment and its application. In Dagstuhl Follow-Ups (Vol. 3). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;br /&gt;
&lt;br /&gt;
Fujihara, H. Goto, M. Ogata, J. and Okuno, H. G. (2011) Lyricsynchronizer: Automatic synchronization system between musical audio signals and lyrics, IEEE Journal of Selected Topics in Signal Processing&lt;br /&gt;
&lt;br /&gt;
Mesaros, A. and Virtanen, T. (2008), Automatic alignment of music audio and lyrics, in Proceedings of the 11th Int. Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, 2008.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2019:Automatic_Lyrics-to-Audio_Alignment&amp;diff=13569</id>
		<title>2019:Automatic Lyrics-to-Audio Alignment</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2019:Automatic_Lyrics-to-Audio_Alignment&amp;diff=13569"/>
		<updated>2022-03-12T19:29:01Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Description==&lt;br /&gt;
&lt;br /&gt;
The task of automatic lyrics-to-audio alignment has as its end goal the synchronization between an audio recording of singing and its corresponding written lyrics.  The beginning timestamps of lyrics units can be estimated at different granularities: phonemes, words, lyrics lines, phrases.  For this task, word-level alignment is required.&lt;br /&gt;
&lt;br /&gt;
   -----------------------    ---------------------------------------------------&lt;br /&gt;
   | Mixed singing audio |    | Lyrics at word-level: no more carefree ... ... |&lt;br /&gt;
   -----------------------    ---------------------------------------------------&lt;br /&gt;
                  |                                            |&lt;br /&gt;
                   --------------------------------------------&lt;br /&gt;
                                      |&lt;br /&gt;
                              --------------------&lt;br /&gt;
                              | Alignment system |&lt;br /&gt;
                              --------------------&lt;br /&gt;
                                      |&lt;br /&gt;
                                      |&lt;br /&gt;
                              --------------------------&lt;br /&gt;
                              | 0.123 	0.798  no     |&lt;br /&gt;
                              | 0.798 	1.123  more   |&lt;br /&gt;
                              | 1.345 	2.176  carefree|&lt;br /&gt;
                              | ... ...                |&lt;br /&gt;
                              --------------------------&lt;br /&gt;
The algorithm receives two inputs - mixed singing audio (singing voice + musical accompaniment) and its corresponding lyrics at the word level - and outputs the onset and offset timestamps (in seconds) of each word.&lt;br /&gt;
&lt;br /&gt;
==Datasets==&lt;br /&gt;
&lt;br /&gt;
===Training Datasets===&lt;br /&gt;
&lt;br /&gt;
==== DAMP dataset ====&lt;br /&gt;
The DAMP dataset contains a large number (34 000+) of a cappella recordings from a wide variety of amateur singers, collected with the Sing! Karaoke mobile app in different recording conditions, but generally with good audio quality. A carefully curated subset, DAMPB, containing 20 performances of each of 300 songs, has been created by Kruspe (2016). Here is the [https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings].  &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* No lyrics boundary annotations are available; however, the textual lyrics are available on the [https://www.smule.com/songs Smule Sing! Karaoke website]&lt;br /&gt;
&lt;br /&gt;
==== DALI Dataset ====&lt;br /&gt;
&lt;br /&gt;
The DALI dataset (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) contains over 5000 songs with semi-automatically aligned lyrics annotations. The songs are full-duration commercial recordings, and the lyrics are described at different levels of granularity, including words and notes (and the syllables underlying a given note). For each song, DALI provides a link to a matched YouTube video from which the audio can be retrieved.&lt;br /&gt;
For more details, see its full description [https://github.com/gabolsgabs/DALI here].&lt;br /&gt;
&lt;br /&gt;
===Evaluation Datasets===&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so cannot be used by participants to train their models under any circumstances.&lt;br /&gt;
&lt;br /&gt;
==== Hansen's Dataset ====&lt;br /&gt;
The dataset contains 9 pop music songs in English with annotations of both the beginning and ending timestamps of each word. The ending timestamps are provided for convenience (copies of the next word's beginning timestamp) and are not used in the evaluation. Sentence-level annotations are also provided.&lt;br /&gt;
The audio comes in two versions: the original mix with instrumental accompaniment and an a cappella version with the singing voice only. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here]&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://smcnetwork.org/system/files/smc2012-198.pdf Recognition of Phonemes in A-cappella Recordings using Temporal Patterns and Mel Frequency Cepstral Coefficients]. The dataset has been kindly provided by Jens Kofod Hansen.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
==== Mauch's Dataset ====&lt;br /&gt;
The dataset contains 20 pop music songs in English with annotations of the beginning timestamps of each word. Non-vocal sections are not explicitly annotated (but remain included in the last preceding word). We prefer to leave it this way to enable comparison to previous work evaluated on this dataset.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment]. The dataset has been kindly provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
==== Gracenote Dataset ====&lt;br /&gt;
The dataset contains 8 pop music song excerpts with instrumental accompaniment, with annotations of beginning-timestamps of each word. The dataset has been used in the recent [https://ieeexplore.ieee.org/abstract/document/7952235/references paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 1:11 (total time: 11m)&lt;br /&gt;
* 1181 words annotated in total&lt;br /&gt;
&lt;br /&gt;
==== Jamendo Dataset ====&lt;br /&gt;
This dataset contains 20 full-duration music pieces covering 10 different Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment. It is available online on [https://github.com/f90/jamendolyrics Github]; note, however, that we do not allow tuning model parameters using this data, and it can only be used to gain insight into the general structure of the test data. For more information also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Phonetization ===&lt;br /&gt;
A popular choice for phonetizing the words is the [http://www.speech.cs.cmu.edu/cgi-bin/cmudict CMU pronunciation dictionary]. One can phonetize them with the [http://www.speech.cs.cmu.edu/tools/lextool.html online tool]. A list of all words in both datasets that are outside the [https://github.com/georgid/AlignmentDuration/blob/noteOnsets/src/for_english/cmudict.0.6d.syll list of CMU words] is given [https://www.dropbox.com/s/flu4cpqff916bas/words_not_in_dict?dl=0 here].&lt;br /&gt;
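&lt;br /&gt;
As a sketch of one possible programmatic route (the online tool above works just as well), the CMU dictionary can also be queried through its NLTK copy:&lt;br /&gt;
 # Minimal sketch: look up CMU phonetizations of lyrics words via NLTK.&lt;br /&gt;
 # Assumes nltk is installed and nltk.download('cmudict') has been run once.&lt;br /&gt;
 from nltk.corpus import cmudict&lt;br /&gt;
 def phonetize(words):&lt;br /&gt;
     pron = cmudict.dict()  # maps lowercase words to lists of phoneme sequences&lt;br /&gt;
     return {w: pron.get(w.lower(), []) for w in words}&lt;br /&gt;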
&lt;br /&gt;
=== Audio Format ===&lt;br /&gt;
&lt;br /&gt;
The data are sound wav/mp3 files, plus the associated word boundaries (in csv-like .txt/.tsv files)&lt;br /&gt;
&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella and two channels for original&lt;br /&gt;
&lt;br /&gt;
==Evaluation==&lt;br /&gt;
&lt;br /&gt;
The submitted algorithms will be evaluated at the word boundaries for the originally mixed songs (a cappella singing + instrumental accompaniment).  Evaluation metrics on the a cappella singing can also be reported on request, to gain insight into the impact of instrumental accompaniment on the algorithm, but will not be considered for the ranking.&lt;br /&gt;
&lt;br /&gt;
'''Average absolute error/deviation''' Initially utilized in [http://www.cs.tut.fi/~mesaros/pubs/autalign_cr.pdf Mesaros and Virtanen (2008)], the absolute error measures the time displacement between the actual timestamp and its estimate at the beginning and at the end of each lyrical unit. The errors are then averaged over all lyrical units. Reporting the error in absolute terms has the drawback that the perceived severity of an error of the same duration can differ depending on the tempo of the song. &lt;br /&gt;
Here is a [https://github.com/georgid/AlignmentEvaluation/blob/126c3fa5fa1994acdcfbe3ea1344acfe71ae2b8e/test/EvalMetricsTest.py#L117 test] of using this metric. &lt;br /&gt;
&lt;br /&gt;
'''Percentage of correct segments''' The perceptual dependence on tempo is mitigated by measuring the total duration of correctly labeled segments as a percentage of the total duration of the song. This metric is suggested by [https://www.researchgate.net/publication/224241940_LyricSynchronizer_Automatic_Synchronization_System_Between_Musical_Audio_Signals_and_Lyrics Fujihara et al. (2011), Figure 9]. &lt;br /&gt;
Here is a [https://github.com/georgid/AlignmentEvaluation/blob/126c3fa5fa1994acdcfbe3ea1344acfe71ae2b8e/test/EvalMetricsTest.py#L76 test] of using this metric.&lt;br /&gt;
&lt;br /&gt;
'''Percentage of correct estimates according to a tolerance window''' This metric takes into consideration that onset displacements from the ground truth below a certain threshold can be tolerated by human listeners. We use a tolerance window of 0.3 seconds. This metric is suggested in [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment]. &lt;br /&gt;
Here is a [https://github.com/georgid/AlignmentEvaluation/blob/126c3fa5fa1994acdcfbe3ea1344acfe71ae2b8e/test/EvalMetricsTest.py#L151 test] of using this metric.&lt;br /&gt;
&lt;br /&gt;
For more detailed definitions and formulas for the metrics, please check Section 2.2.1 of [https://doi.org/10.5281/zenodo.841979 this thesis].&lt;br /&gt;
&lt;br /&gt;
'''To obtain all three metrics for one detected output:'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt; python [https://github.com/georgid/AlignmentEvaluation/blob/master/align_eval/eval.py eval.py] &amp;lt;file path of the reference word boundaries&amp;gt; &amp;lt;file path of the detected word boundaries&amp;gt; &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that evaluation scripts depend on [https://github.com/craffel/mir_eval/ mir_eval].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Submission Format ==&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged and contain at least two files: The algorithm itself (as a binary or source code) and a README containing contact information and detailing, in full, the use of the algorithm.&lt;br /&gt;
&lt;br /&gt;
=== Input Data ===&lt;br /&gt;
Participating algorithms must accept the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio in wav, 44.1kHz, stereo.&lt;br /&gt;
* Lyrics in a .txt file, where words are separated by spaces and each lyric phrase is separated by a line break (\n).&lt;br /&gt;
&lt;br /&gt;
=== Output File Format ===&lt;br /&gt;
&lt;br /&gt;
The alignment output file format is a tab-delimited ASCII text format. &lt;br /&gt;
&lt;br /&gt;
A three-column text file of the format:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;onset_time(sec)&amp;gt;\t&amp;lt;offset_time(sec)&amp;gt;\t&amp;lt;label&amp;gt;\n&lt;br /&gt;
 &amp;lt;onset_time(sec)&amp;gt;\t&amp;lt;offset_time(sec)&amp;gt;\t&amp;lt;label&amp;gt;\n&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
where \t denotes a tab and \n denotes the end of the line. The &amp;lt; and &amp;gt; characters are not included. An example output file would look something like this:&lt;br /&gt;
&lt;br /&gt;
 0.000    5.223    word1&lt;br /&gt;
 5.223    15.101   word2&lt;br /&gt;
 15.101   20.334   word3&lt;br /&gt;
&lt;br /&gt;
'''NOTE:''' the offset timestamp column is used only by the percentage-of-correct-segments metric. Therefore, omitting the second column is acceptable and would degrade the performance of that metric only.&lt;br /&gt;
&lt;br /&gt;
=== Command line calling format ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments a .wav file, a .txt file, and the full output path and filename of the output file. The ability to specify the output path and file name is essential. Denoting the input .wav file path and name as %input_audio, the lyrics .txt file as %input_txt, and the output file path and name as %output, a program called foobar could be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar %input_audio %input_txt %output&lt;br /&gt;
 foobar -i %input_audio -it %input_txt  -o %output&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== README File ===&lt;br /&gt;
&lt;br /&gt;
A README file accompanying each submission should contain clear instructions on how to run the program (as well as contact information, etc.). In particular, each command line to run should be specified, using %input_audio and %input_txt for the input files and %output for the resulting text file.&lt;br /&gt;
&lt;br /&gt;
== Time and hardware limits ==&lt;br /&gt;
Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be imposed.&lt;br /&gt;
A hard limit of 24 hours will be imposed on analysis times. Submissions exceeding this limit may not receive a result.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Submission closing dates ==&lt;br /&gt;
Closing date: 30 September 2019&lt;br /&gt;
&lt;br /&gt;
== Questions? ==&lt;br /&gt;
&lt;br /&gt;
* send us an email - d.stoller@qmul.ac.uk (Daniel Stoller) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
== Bibliography ==&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Sharma B, Gupta C. (2019) Automatic Lyrics-to-audio Alignment on Polyphonic Music Using Singing-adapted Acoustic Models. ICASSP 2019&lt;br /&gt;
&lt;br /&gt;
Lee S. W., Scott, J. (2017) Word-level lyrics-audio synchronization using separated vocals&amp;quot;, Acoustics Speech and Signal Processing, ICASSP IEEE International Conference on, pp. 646-650&lt;br /&gt;
&lt;br /&gt;
Chang, S., &amp;amp; Lee, K. (2017). Lyrics-to-Audio Alignment by Unsupervised Discovery of Repetitive Patterns in Vowel Acoustics. arXiv preprint arXiv:1701.06078.&lt;br /&gt;
&lt;br /&gt;
Pons, J. Gong, R. and Serra, X. (2017). Score-informed syllable segmentation for a cappella singing voice with convolutional neural networks. ISMIR 2017&lt;br /&gt;
&lt;br /&gt;
Kruspe, A. (2016). Bootstrapping a System for Phoneme Recognition and Keyword Spotting in Unaccompanied Singing, ISMIR 2016&lt;br /&gt;
&lt;br /&gt;
Dzhambazov, G. and Serra, X. (2015) Modeling of phoneme durations for alignment between polyphonic audio and lyrics, in 12th Sound and Music Computing Conference&lt;br /&gt;
&lt;br /&gt;
Fujihara, H., &amp;amp; Goto, M. (2012). Lyrics-to-audio alignment and its application. In Dagstuhl Follow-Ups (Vol. 3). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;br /&gt;
&lt;br /&gt;
Fujihara, H. Goto, M. Ogata, J. and Okuno, H. G. (2011) Lyricsynchronizer: Automatic synchronization system between musical audio signals and lyrics, IEEE Journal of Selected Topics in Signal Processing&lt;br /&gt;
&lt;br /&gt;
Mesaros, A. and Virtanen, T. (2008), Automatic alignment of music audio and lyrics, in Proceedings of the 11th Int. Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, 2008.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13568</id>
		<title>2021: Automatic Lyrics Transcription Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13568"/>
		<updated>2022-03-12T19:06:24Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= General Legend =&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: yellow&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; | Sub code&lt;br /&gt;
    ! width=&amp;quot;200&amp;quot; | Submission name&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; style=&amp;quot;text-align: center;&amp;quot; | Abstract&lt;br /&gt;
    ! width=&amp;quot;540&amp;quot; | Contributors&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://drive.google.com/file/d/1yxwIZcQTvyb_IgN37lxQuy4TFKAJHwrZ/view?usp=sharing PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://drive.google.com/file/d/1yxwIZcQTvyb_IgN37lxQuy4TFKAJHwrZ/view?usp=sharing PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Results=&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset a cappella===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 11.45&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 12.77&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_solo_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_solo_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 13.54&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 16.88&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_poli_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_poli_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mauch's dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 22.11&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.20&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/mauch_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/mauch_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Jamendo dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 24.34&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.86&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/jamendo_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/jamendo_yhll3.csv&amp;lt;/csv&amp;gt;&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13567</id>
		<title>2021: Automatic Lyrics Transcription Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13567"/>
		<updated>2022-03-12T18:59:36Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= General Legend =&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: yellow&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; | Sub code&lt;br /&gt;
    ! width=&amp;quot;200&amp;quot; | Submission name&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; style=&amp;quot;text-align: center;&amp;quot; | Abstract&lt;br /&gt;
    ! width=&amp;quot;540&amp;quot; | Contributors&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/report.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/YYHL3.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Results=&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset a cappella===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 11.45&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 12.77&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_solo_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_solo_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 13.54&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 16.88&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_poli_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_poli_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mauch's dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 22.11&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.20&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/mauch_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/mauch_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Jamendo dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 24.34&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.86&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/jamendo_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/jamendo_yhll3.csv&amp;lt;/csv&amp;gt;&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13566</id>
		<title>2021: Automatic Lyrics Transcription Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13566"/>
		<updated>2022-03-12T18:58:51Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= General Legend =&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: yellow&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; | Sub code&lt;br /&gt;
    ! width=&amp;quot;200&amp;quot; | Submission name&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; style=&amp;quot;text-align: center;&amp;quot; | Abstract&lt;br /&gt;
    ! width=&amp;quot;540&amp;quot; | Contributors&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/report.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/YYHL3.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Results=&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset a cappella===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 11.45&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 12.77&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_solo_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_solo_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 13.54&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 16.88&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_poli_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/hansen_poli_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mauch's dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 22.11&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.20&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/mauch_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/mauch_yhll3.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Jamendo dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 24.34&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.86&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
YYHL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/jamendo_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2021/lt/jamendo_yhll3.csv&amp;lt;/csv&amp;gt;&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13565</id>
		<title>2021: Automatic Lyrics Transcription Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13565"/>
		<updated>2022-03-09T19:33:34Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= General Legend =&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: yellow&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; | Sub code&lt;br /&gt;
    ! width=&amp;quot;200&amp;quot; | Submission name&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; style=&amp;quot;text-align: center;&amp;quot; | Abstract&lt;br /&gt;
    ! width=&amp;quot;540&amp;quot; | Contributors&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/report.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/YYHL3.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Results=&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset a cappella===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 11.45&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 12.77&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
GGL1&lt;br /&gt;
&amp;lt;csv&amp;gt;lt/hansen_solo_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 13.54&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 16.88&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Mauch's dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 22.11&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.20&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
===Jamendo dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 24.34&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.86&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13564</id>
		<title>2021: Automatic Lyrics Transcription Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13564"/>
		<updated>2022-03-09T19:32:19Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Per-track results */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= General Legend =&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: yellow&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; | Sub code&lt;br /&gt;
    ! width=&amp;quot;200&amp;quot; | Submission name&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; style=&amp;quot;text-align: center;&amp;quot; | Abstract&lt;br /&gt;
    ! width=&amp;quot;540&amp;quot; | Contributors&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/report.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/YYHL3.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Results=&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset a cappella===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 11.45&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 12.77&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
GGL1&lt;br /&gt;
&amp;lt;csv&amp;gt;mirex2021/results/lt/hansen_solo_yhll1.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 13.54&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 16.88&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Mauch's dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 22.11&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.20&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
===Jamendo dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 24.34&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.86&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13558</id>
		<title>2021: Automatic Lyrics Transcription Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13558"/>
		<updated>2022-02-13T20:24:44Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= General Legend =&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: yellow&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; | Sub code&lt;br /&gt;
    ! width=&amp;quot;200&amp;quot; | Submission name&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; style=&amp;quot;text-align: center;&amp;quot; | Abstract&lt;br /&gt;
    ! width=&amp;quot;540&amp;quot; | Contributors&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/report.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/YYHL3.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Results=&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset a cappella===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 11.45&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 12.77&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 13.54&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 16.88&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Mauch's dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 22.11&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.20&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
===Jamendo dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 24.34&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.86&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13557</id>
		<title>2021: Automatic Lyrics Transcription Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13557"/>
		<updated>2022-02-13T20:12:43Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= General Legend =&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: yellow&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; | Sub code&lt;br /&gt;
    ! width=&amp;quot;200&amp;quot; | Submission name&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; style=&amp;quot;text-align: center;&amp;quot; | Abstract&lt;br /&gt;
    ! width=&amp;quot;540&amp;quot; | Contributors&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/YYHL1.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/YYHL3.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Results=&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset a cappella===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 11.45&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 12.77&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 13.54&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 16.88&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Mauch's dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 22.11&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.20&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
===Jamendo dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 200px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;50&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | 24.34&lt;br /&gt;
    |-&lt;br /&gt;
    &lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | 26.86&lt;br /&gt;
    |-&lt;br /&gt;
|}&lt;br /&gt;
====Per-track results====&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13556</id>
		<title>2021: Automatic Lyrics Transcription Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13556"/>
		<updated>2022-02-13T19:49:29Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= General Legend =&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: yellow&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; | Sub code&lt;br /&gt;
    ! width=&amp;quot;200&amp;quot; | Submission name&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; style=&amp;quot;text-align: center;&amp;quot; | Abstract&lt;br /&gt;
    ! width=&amp;quot;540&amp;quot; | Contributors&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/YYHL1.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/YYHL3.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Results=&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset a cappella===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
&lt;br /&gt;
| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: white&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;200&amp;quot; | Submission&lt;br /&gt;
    ! width=&amp;quot;100&amp;quot; | Avrg WER&lt;br /&gt;
    |-&lt;br /&gt;
    ! RB1| Sheffield University&lt;br /&gt;
    |-&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
GGL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/HansensDataset/YYHL1/results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
YYHL3&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/HansensDataset/YYHL3/results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
RB1&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/HansensDataset/RB1//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
DDA2&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/HansensDataset/DDA2//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
DDA3&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/HansensDataset/DDA3//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Hansen's dataset===&lt;br /&gt;
====Summary Results====&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/HansensDataset/summary_HansensDataset.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
GGL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/HansensDataset/GGL1//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
GGL2&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/HansensDataset/GGL2//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
RB1&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/HansensDataset/RB1//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
DDA2&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/HansensDataset/DDA2//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
DDA3&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/HansensDataset/DDA3//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Mauch's dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/MauchsDataset/summary_MauchsDataset.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Per-track results====&lt;br /&gt;
GGL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/MauchsDataset/GGL1//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
GGL2&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/MauchsDataset/GGL2//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
RB1&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/MauchsDataset/RB1//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
DDA2&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/MauchsDataset/DDA2//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
DDA3&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/MauchsDataset/DDA3//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Jamendo dataset===&lt;br /&gt;
&lt;br /&gt;
====Summary Results====&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/jamendolyrics/summary_jamendolyrics.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Per-track results====&lt;br /&gt;
&lt;br /&gt;
GGL1&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/jamendolyrics/GGL1//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
GGL2&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/jamendolyrics/GGL2//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
RB1&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/jamendolyrics/RB1//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
DDA2&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/jamendolyrics/DDA2//results.csv&amp;lt;/csv&amp;gt;&lt;br /&gt;
&lt;br /&gt;
DDA3&lt;br /&gt;
&amp;lt;csv&amp;gt;2020/lt/jamendolyrics/DDA3//results.csv&amp;lt;/csv&amp;gt;&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13555</id>
		<title>2021: Automatic Lyrics Transcription Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13555"/>
		<updated>2022-02-13T19:37:07Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* General Legend */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= General Legend =&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: yellow&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; | Sub code&lt;br /&gt;
    ! width=&amp;quot;200&amp;quot; | Submission name&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; style=&amp;quot;text-align: center;&amp;quot; | Abstract&lt;br /&gt;
    ! width=&amp;quot;540&amp;quot; | Contributors&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/YYHL1.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2021/YYHL3.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:MIREX2020_Results&amp;diff=13554</id>
		<title>2021:MIREX2020 Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:MIREX2020_Results&amp;diff=13554"/>
		<updated>2022-02-13T19:35:40Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Results by Task (More results are coming) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Results by Task (More results are coming) ==&lt;br /&gt;
* [[2021: Automatic Lyrics Transcription Results]] &amp;amp;nbsp;&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13553</id>
		<title>2021: Automatic Lyrics Transcription Results</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:_Automatic_Lyrics_Transcription_Results&amp;diff=13553"/>
		<updated>2022-02-01T10:07:32Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: Created page with &amp;quot;= General Legend =  {| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;     |- style=&amp;quot;background: yellow&amp;quot;     ! width=&amp;quot;80&amp;quot; | Sub code     ! width=&amp;quot;200&amp;quot; | Sub...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= General Legend =&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;text-align: left; width: 800px;&amp;quot;&lt;br /&gt;
    |- style=&amp;quot;background: yellow&amp;quot;&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; | Sub code&lt;br /&gt;
    ! width=&amp;quot;200&amp;quot; | Submission name&lt;br /&gt;
    ! width=&amp;quot;80&amp;quot; style=&amp;quot;text-align: center;&amp;quot; | Abstract&lt;br /&gt;
    ! width=&amp;quot;540&amp;quot; | Contributors&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL1&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2020/GL1.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
    ! YYHL3&lt;br /&gt;
    | NetEase ||  style=&amp;quot;text-align: center;&amp;quot; | [https://www.music-ir.org/mirex/abstracts/2020/GL1.pdf PDF] || Zhen Yang, Qichen Han, Xiang Li, Dong Liu, Peng Li&lt;br /&gt;
    |-&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13514</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13514"/>
		<updated>2021-10-29T14:35:15Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Submission Format */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where:&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done at the character level. This metric penalises partially correct or misspelled words less severely than WER.&lt;br /&gt;
&lt;br /&gt;
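Both metrics reduce to an edit-distance (Levenshtein) alignment between the reference and the predicted token sequence. The following is a minimal illustrative sketch in Python, not the official scoring tool; the function name and the example strings are ours:&lt;br /&gt;
&lt;br /&gt;
  # Minimal sketch: (S + I + D) / (C + S + D) via dynamic-programming edit distance.&lt;br /&gt;
  # Pass lists of words for WER, or strings (character sequences) for CER.&lt;br /&gt;
  def error_rate(reference, hypothesis):&lt;br /&gt;
      m, n = len(reference), len(hypothesis)&lt;br /&gt;
      d = [[0] * (n + 1) for _ in range(m + 1)]&lt;br /&gt;
      for i in range(m + 1):&lt;br /&gt;
          d[i][0] = i&lt;br /&gt;
      for j in range(n + 1):&lt;br /&gt;
          d[0][j] = j&lt;br /&gt;
      for i in range(1, m + 1):&lt;br /&gt;
          for j in range(1, n + 1):&lt;br /&gt;
              cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1&lt;br /&gt;
              d[i][j] = min(d[i - 1][j] + 1,         # deletion&lt;br /&gt;
                            d[i][j - 1] + 1,         # insertion&lt;br /&gt;
                            d[i - 1][j - 1] + cost)  # substitution&lt;br /&gt;
      return d[m][n] / max(m, 1)                     # errors / reference length&lt;br /&gt;
 &lt;br /&gt;
  ref = 'the quick brown fox jumps'.split()&lt;br /&gt;
  hyp = 'the quick brown box jumped'.split()&lt;br /&gt;
  print(error_rate(ref, hyp))                        # word-level (WER)&lt;br /&gt;
  print(error_rate(' '.join(ref), ' '.join(hyp)))    # character-level (CER)&lt;br /&gt;
&lt;br /&gt;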
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are each a few minutes of audio in length. The submission is expected to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions must be done through the MIREX system (info available [https://www.music-ir.org/mirex/wiki/2021:Main_Page#MIREX_2021_Submission_Instructions here]) and should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called 'foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
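For illustration only, a hypothetical Python entry point that accepts both calling styles could look like the sketch below; transcribe_file() is a placeholder standing in for the participant's own model code:&lt;br /&gt;
&lt;br /&gt;
  # Hypothetical skeleton supporting both 'foobar IN OUT' and 'foobar -i IN -o OUT'.&lt;br /&gt;
  import argparse, os&lt;br /&gt;
 &lt;br /&gt;
  def transcribe_file(audio_path):&lt;br /&gt;
      # Placeholder: replace with your own acoustic model and decoder.&lt;br /&gt;
      raise NotImplementedError('plug in your own transcription code here')&lt;br /&gt;
 &lt;br /&gt;
  def main():&lt;br /&gt;
      parser = argparse.ArgumentParser(description='MIREX ALT submission entry point (sketch)')&lt;br /&gt;
      parser.add_argument('-i', '--input', dest='input_audio_path')&lt;br /&gt;
      parser.add_argument('-o', '--output', dest='output')&lt;br /&gt;
      parser.add_argument('positional', nargs='*', default=[])&lt;br /&gt;
      args = parser.parse_args()&lt;br /&gt;
      input_audio_path = args.input_audio_path or args.positional[0]&lt;br /&gt;
      output = args.output or args.positional[1]&lt;br /&gt;
      words = transcribe_file(input_audio_path)            # list of predicted words&lt;br /&gt;
      song_id = os.path.splitext(os.path.basename(input_audio_path))[0]&lt;br /&gt;
      os.makedirs(output, exist_ok=True)                   # ${output} treated as a directory&lt;br /&gt;
      with open(os.path.join(output, song_id + '.txt'), 'w') as f:&lt;br /&gt;
          f.write(' '.join(words))                         # see Output File Format below&lt;br /&gt;
 &lt;br /&gt;
  if __name__ == '__main__':&lt;br /&gt;
      main()&lt;br /&gt;
&lt;br /&gt;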
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms must accept input audio in the following format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for the a cappella versions (Hansen) and two channels (stereo) for the original mixes; see the loading sketch below&lt;br /&gt;
&lt;br /&gt;
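A minimal loading sketch, assuming the librosa library is available (this is only one possible choice, not a requirement of the task):&lt;br /&gt;
&lt;br /&gt;
  # Sketch: load a WAV/MP3 input at 44.1 kHz; mono=True downmixes stereo mixes to mono.&lt;br /&gt;
  import librosa&lt;br /&gt;
 &lt;br /&gt;
  def load_audio(path):&lt;br /&gt;
      signal, sample_rate = librosa.load(path, sr=44100, mono=True)&lt;br /&gt;
      return signal, sample_rate&lt;br /&gt;
&lt;br /&gt;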
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing the list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end-of-sentence tokens) should be removed from the final output; see the cleanup sketch below.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
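As a purely illustrative cleanup sketch (the token inventory shown is an assumption; it depends on each participant's decoder output):&lt;br /&gt;
&lt;br /&gt;
  # Drop non-word tokens before writing the transcription file.&lt;br /&gt;
  NON_WORDS = {'[SILENCE]', '[MUSIC]', '[NOISE]', '[EOS]'}&lt;br /&gt;
 &lt;br /&gt;
  def clean_words(tokens):&lt;br /&gt;
      return [t for t in tokens if t not in NON_WORDS]&lt;br /&gt;
 &lt;br /&gt;
  # Example: clean_words(['[MUSIC]', 'hello', 'world', '[EOS]']) returns ['hello', 'world']&lt;br /&gt;
&lt;br /&gt;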
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets within automatic lyrics transcription research can be categorised into two domains with regard to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former contains only a single singer performing the lyrics, whereas the latter also includes musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
A [https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings] is available. For more details, see the papers referenced below. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]. Paper [https://isca-speech.org/archive/Interspeech_2019/pdfs/2378.pdf here (1)].&lt;br /&gt;
* Alternatively, annotations can be retrieved directly in Kaldi format [https://github.com/emirdemirel/ALTA/s5/data here]. Paper [https://arxiv.org/pdf/2007.06486.pdf here (2)].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) (3) is the benchmark dataset for building an acoustic model on polyphonic recordings (4,5,6). It contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are full-length commercial recordings, and the lyrics are annotated at different levels of granularity, including words and notes (and the syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song, DALI provides a link to a matched YouTube video for audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details, see its full description [https://github.com/gabolsgabs/DALI here]. Paper [https://arxiv.org/pdf/1906.10606.pdf here].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of English-language popular songs and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] from your training set in order to make a scientifically valid evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio is available in two versions: the original mix with instrumental accompaniment and an a cappella (singing voice only) version. An example song can be found [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (7)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (8)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings of varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics GitHub]. Note, however, that we do not allow tuning model parameters on this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper (9)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - Roa Dabike, G., &amp;amp; Barker, J. (2019). Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
2 - Demirel, E., Ahlbäck, S., &amp;amp; Dixon, S. (2020). Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention. In IJCNN 2020, 1-8. IEEE.&lt;br /&gt;
&lt;br /&gt;
3 - Meseguer-Brocal, G., Cohen-Hadria, A., &amp;amp; Peeters, G. (2019). DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm. In ISMIR 2018.&lt;br /&gt;
&lt;br /&gt;
4 - Gupta, C., Yılmaz, E., &amp;amp; Li, H. (2020). Automatic lyrics alignment and transcription in polyphonic music: Does background music help?. In ICASSP 2020, 496-500. IEEE.&lt;br /&gt;
&lt;br /&gt;
5 - Basak, S., Agarwal, S., Ganapathy, S., &amp;amp; Takahashi, N. (2021, June). End-to-End Lyrics Recognition with Voice to Singing Style Transfer. In ICASSP 2021, 266-270. IEEE.&lt;br /&gt;
&lt;br /&gt;
6 - Demirel, E., Ahlbäck, S., &amp;amp; Dixon, S. (2021). MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription. Proc. ISMIR 2021.&lt;br /&gt;
&lt;br /&gt;
7 - Hansen, J. K., &amp;amp; Fraunhofer, I. D. M. T. (2012). Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients. In 9th Sound and Music Computing Conference (SMC), 494-499.&lt;br /&gt;
&lt;br /&gt;
8 - Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;br /&gt;
&lt;br /&gt;
9 - Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. In ICASSP 2019, IEEE.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13513</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13513"/>
		<updated>2021-10-29T14:34:49Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Submission Format */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are each a few minutes of audio in length. The submission is expected to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions must be done through the MIREX system (info available [https://www.music-ir.org/mirex/wiki/2021:Main_Page#MIREX_2021_Submission_Instructions here]) and should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called 'foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets within automatic lyrics transcription research can be categorised under two domains in regards to the presence of music instruments accompanying the singer: Monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former is considered to have only one singer singing the lyrics, and the latter is when there is music accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]. Paper [https://isca-speech.org/archive/Interspeech_2019/pdfs/2378.pdf here (1)].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] Paper [https://arxiv.org/pdf/2007.06486.pdf here (2)].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) (3) is the benchmark dataset for building an acoustic model on polyphonic recordings (4,5,6) and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here]. Paper [https://arxiv.org/pdf/1906.10606.pdf here].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    In case using DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] during training your model in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio is available in two versions: the original mix with instrumental accompaniment and an a cappella (singing voice only) version. An example song can be found [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (7)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (8)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github], although note that we do not allow tuning model parameters using this data, it can only be used to gain insight into the general structure of the test data. For more information also refer to [https://arxiv.org/abs/1902.06797 this paper (9)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - Roa Dabike, G., &amp;amp; Barker, J. (2019). Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
2 - Demirel, E., Ahlbäck, S., &amp;amp; Dixon, S. (2020). Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention. In IJCNN 2020, 1-8. IEEE.&lt;br /&gt;
&lt;br /&gt;
3 - Meseguer-Brocal, G., Cohen-Hadria, A., &amp;amp; Peeters, G. (2019). DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm. In ISMIR 2018.&lt;br /&gt;
&lt;br /&gt;
4 - Gupta, C., Yılmaz, E., &amp;amp; Li, H. (2020). Automatic lyrics alignment and transcription in polyphonic music: Does background music help?. In ICASSP 2020, 496-500. IEEE.&lt;br /&gt;
&lt;br /&gt;
5 - Basak, S., Agarwal, S., Ganapathy, S., &amp;amp; Takahashi, N. (2021, June). End-to-End Lyrics Recognition with Voice to Singing Style Transfer. In ICASSP 2021, 266-270. IEEE.&lt;br /&gt;
&lt;br /&gt;
6- Demirel, E., Ahlbäck, S., &amp;amp; Dixon, S. (2021). MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription. Proc. ISMIR 2021.&lt;br /&gt;
&lt;br /&gt;
7 - Hansen, J. K., &amp;amp; Fraunhofer, I. D. M. T. (2012). Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients. In 9th Sound and Music Computing Conference (SMC), 494-499.&lt;br /&gt;
&lt;br /&gt;
8 - Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. ICASSP 2012, 200-210, IEEE.&lt;br /&gt;
&lt;br /&gt;
9 - Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. In ICASSP 2019, IEEE.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13511</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13511"/>
		<updated>2021-10-27T13:23:06Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Bibliography */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are each a few minutes of audio in length. The submission is expected to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called 'foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets within automatic lyrics transcription research can be categorised under two domains in regards to the presence of music instruments accompanying the singer: Monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former is considered to have only one singer singing the lyrics, and the latter is when there is music accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]. Paper [https://isca-speech.org/archive/Interspeech_2019/pdfs/2378.pdf here (1)].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] Paper [https://arxiv.org/pdf/2007.06486.pdf here (2)].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) (3) is the benchmark dataset for building an acoustic model on polyphonic recordings (4,5,6) and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here]. Paper [https://arxiv.org/pdf/1906.10606.pdf here].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    In case using DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] during training your model in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio is available in two versions: the original mix with instrumental accompaniment and an a cappella (singing voice only) version. An example song can be found [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (7)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (8)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github], although note that we do not allow tuning model parameters using this data, it can only be used to gain insight into the general structure of the test data. For more information also refer to [https://arxiv.org/abs/1902.06797 this paper (9)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - Roa Dabike, G., &amp;amp; Barker, J. (2019). Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
2 - Demirel, E., Ahlbäck, S., &amp;amp; Dixon, S. (2020). Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention. In IJCNN 2020, 1-8. IEEE.&lt;br /&gt;
&lt;br /&gt;
3 - Meseguer-Brocal, G., Cohen-Hadria, A., &amp;amp; Peeters, G. (2019). DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm. In ISMIR 2018.&lt;br /&gt;
&lt;br /&gt;
4 - Gupta, C., Yılmaz, E., &amp;amp; Li, H. (2020). Automatic lyrics alignment and transcription in polyphonic music: Does background music help?. In ICASSP 2020, 496-500. IEEE.&lt;br /&gt;
&lt;br /&gt;
5 - Basak, S., Agarwal, S., Ganapathy, S., &amp;amp; Takahashi, N. (2021, June). End-to-End Lyrics Recognition with Voice to Singing Style Transfer. In ICASSP 2021, 266-270. IEEE.&lt;br /&gt;
&lt;br /&gt;
6- Demirel, E., Ahlbäck, S., &amp;amp; Dixon, S. (2021). MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription. Proc. ISMIR 2021.&lt;br /&gt;
&lt;br /&gt;
7 - Hansen, J. K., &amp;amp; Fraunhofer, I. D. M. T. (2012). Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients. In 9th Sound and Music Computing Conference (SMC), 494-499.&lt;br /&gt;
&lt;br /&gt;
8 - Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. ICASSP 2012, 200-210, IEEE.&lt;br /&gt;
&lt;br /&gt;
9 - Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. In ICASSP 2019, IEEE.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13510</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13510"/>
		<updated>2021-10-27T13:16:32Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Evaluation Datasets */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are each a few minutes of audio in length. The submission is expected to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called 'foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets within automatic lyrics transcription research can be categorised under two domains in regards to the presence of music instruments accompanying the singer: Monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former is considered to have only one singer singing the lyrics, and the latter is when there is music accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]. Paper [https://isca-speech.org/archive/Interspeech_2019/pdfs/2378.pdf here (1)].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] Paper [https://arxiv.org/pdf/2007.06486.pdf here (2)].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) (3) is the benchmark dataset for building an acoustic model on polyphonic recordings (4,5,6) and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here]. Paper [https://arxiv.org/pdf/1906.10606.pdf here].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    In case using DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] during training your model in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio is available in two versions: the original mix with instrumental accompaniment and an a cappella (singing voice only) version. An example song can be found [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (7)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (8)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github], although note that we do not allow tuning model parameters using this data, it can only be used to gain insight into the general structure of the test data. For more information also refer to [https://arxiv.org/abs/1902.06797 this paper (9)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - Roa Dabike, G., &amp;amp; Barker, J. (2019). Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
2 - Demirel, E., Ahlbäck, S., &amp;amp; Dixon, S. (2020). Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention. In 2020 International Joint Conference on Neural Networks (IJCNN), 1-8. IEEE.&lt;br /&gt;
&lt;br /&gt;
3 - Meseguer-Brocal, G., Cohen-Hadria, A., &amp;amp; Peeters, G. (2019). DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm.&lt;br /&gt;
&lt;br /&gt;
4 - Hansen, J. K., &amp;amp; Fraunhofer, I. D. M. T. (2012). Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients. In 9th Sound and Music Computing Conference (SMC), 494-499.&lt;br /&gt;
&lt;br /&gt;
5 - Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;br /&gt;
&lt;br /&gt;
6 - Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13509</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13509"/>
		<updated>2021-10-27T13:15:52Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* DALI Dataset */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word sequence and the acoustic features, respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where:&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done at the character level. This metric penalises partially correct or misspelled words less than WER does.&lt;br /&gt;
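&lt;br /&gt;
Below is a minimal, illustrative sketch (not the official MIREX scoring tool) of how WER can be computed from a reference and a hypothesis word sequence via Levenshtein alignment; running the same routine on character lists yields CER.&lt;br /&gt;
&lt;br /&gt;
 # Minimal WER sketch (illustrative only; not the official evaluation code)&lt;br /&gt;
 def word_error_rate(reference, hypothesis):&lt;br /&gt;
     # reference, hypothesis: lists of words; returns (S + I + D) / (C + S + D)&lt;br /&gt;
     n, m = len(reference), len(hypothesis)&lt;br /&gt;
     # d[i][j] = edit distance between reference[:i] and hypothesis[:j]&lt;br /&gt;
     d = [[0] * (m + 1) for _ in range(n + 1)]&lt;br /&gt;
     for i in range(n + 1):&lt;br /&gt;
         d[i][0] = i  # deletions&lt;br /&gt;
     for j in range(m + 1):&lt;br /&gt;
         d[0][j] = j  # insertions&lt;br /&gt;
     for i in range(1, n + 1):&lt;br /&gt;
         for j in range(1, m + 1):&lt;br /&gt;
             cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1&lt;br /&gt;
             d[i][j] = min(d[i - 1][j] + 1,         # deletion&lt;br /&gt;
                           d[i][j - 1] + 1,         # insertion&lt;br /&gt;
                           d[i - 1][j - 1] + cost)  # substitution / match&lt;br /&gt;
     return d[n][m] / float(n)&lt;br /&gt;
 &lt;br /&gt;
 # Example: 1 substitution in a 5-word reference gives WER = 0.2&lt;br /&gt;
 print(word_error_rate("i will always love you".split(), "i will always loved you".split()))&lt;br /&gt;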
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: Each evaluation sample is a few minutes of audio. The submission is expected to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' would be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms must accept input in the following format (a minimal loading sketch follows the list below):&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for the a cappella recordings (Hansen) and two channels (stereo) for the original mixes&lt;br /&gt;
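&lt;br /&gt;
As a minimal loading sketch (assuming the librosa Python package is installed; it is not required by the task, and any equivalent audio loader works), the input can be read into a mono array at the expected sample rate as follows:&lt;br /&gt;
&lt;br /&gt;
 # Minimal sketch of reading an input file into a mono float array at 44100 Hz&lt;br /&gt;
 # (assumes the librosa package is available; illustrative only).&lt;br /&gt;
 import librosa&lt;br /&gt;
 &lt;br /&gt;
 def load_audio(input_audio_path, target_sr=44100):&lt;br /&gt;
     # librosa resamples to target_sr and downmixes stereo to mono when mono=True&lt;br /&gt;
     y, sr = librosa.load(input_audio_path, sr=target_sr, mono=True)&lt;br /&gt;
     return y, sr&lt;br /&gt;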
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end-of-sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
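&lt;br /&gt;
For illustration, a minimal skeleton of such a script in Python could look as follows; the flag names mirror the example call above, ${output} is treated as a directory following the path pattern above, and transcribe_words is a placeholder for the participant's own model:&lt;br /&gt;
&lt;br /&gt;
 # Minimal skeleton of a main transcription script (illustrative sketch only).&lt;br /&gt;
 # "transcribe_words" is a placeholder for the participant's own system.&lt;br /&gt;
 import argparse, os&lt;br /&gt;
 &lt;br /&gt;
 def transcribe_words(audio_path):&lt;br /&gt;
     # placeholder: run your acoustic + language model here and return a list of words&lt;br /&gt;
     return ["example", "lyrics", "go", "here"]&lt;br /&gt;
 &lt;br /&gt;
 if __name__ == "__main__":&lt;br /&gt;
     parser = argparse.ArgumentParser()&lt;br /&gt;
     parser.add_argument("-i", "--input_audio_path", required=True)&lt;br /&gt;
     parser.add_argument("-o", "--output", required=True)&lt;br /&gt;
     args = parser.parse_args()&lt;br /&gt;
 &lt;br /&gt;
     words = transcribe_words(args.input_audio_path)&lt;br /&gt;
     song_id = os.path.splitext(os.path.basename(args.input_audio_path))[0]&lt;br /&gt;
     os.makedirs(args.output, exist_ok=True)&lt;br /&gt;
     with open(os.path.join(args.output, song_id + ".txt"), "w") as f:&lt;br /&gt;
         f.write(" ".join(words))  # words separated by white space, no non-word tokens&lt;br /&gt;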
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, usage instructions for the main script, and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains with regard to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former contains only a single singer performing the lyrics, while the latter also includes musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
A [https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings] is available. For more details, see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]. Paper [https://isca-speech.org/archive/Interspeech_2019/pdfs/2378.pdf here (1)].&lt;br /&gt;
* Alternatively, annotations can be retrieved directly in the Kaldi format [https://github.com/emirdemirel/ALTA/s5/data here]. Paper [https://arxiv.org/pdf/2007.06486.pdf here (2)].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) (3) is the benchmark dataset for building an acoustic model on polyphonic recordings (4,5,6) and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are full-duration commercial recordings, and the lyrics are annotated at different levels of granularity, including words and notes (and the syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song, DALI provides a link to a matched YouTube video for audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details, see the full description [https://github.com/gabolsgabs/DALI here]. Paper [https://arxiv.org/pdf/1906.10606.pdf here].&lt;br /&gt;
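&lt;br /&gt;
As a rough sketch of working with these annotations (assuming the DALI Python package from the repository above; the function and field names below follow its README and should be verified there):&lt;br /&gt;
&lt;br /&gt;
 # Rough sketch of loading DALI annotations with the DALI Python package&lt;br /&gt;
 # (names follow the README of github.com/gabolsgabs/DALI; verify against the repository).&lt;br /&gt;
 import DALI as dali_code&lt;br /&gt;
 &lt;br /&gt;
 dali_data_path = "/path/to/DALI_annotations/"  # hypothetical local path to the downloaded annotations&lt;br /&gt;
 dali_data = dali_code.get_the_DALI_dataset(dali_data_path, skip=[], keep=[])&lt;br /&gt;
 &lt;br /&gt;
 entry = list(dali_data.values())[0]                # one song&lt;br /&gt;
 print(entry.info["title"], entry.info["artist"])   # assumed metadata fields&lt;br /&gt;
 words = entry.annotations["annot"]["words"]        # word-level annotations&lt;br /&gt;
 print(words[0])  # e.g. a dict with the word text and its start/end time&lt;br /&gt;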
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in the English language and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] from your training data in order to make a scientifically sound evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella version with the singing voice only. An example song can be found [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (4)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (5)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics GitHub]. Note, however, that we do not allow tuning model parameters on this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper (6)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - Roa Dabike, G., &amp;amp; Barker, J. (2019). Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
2 - Demirel, E., Ahlbäck, S., &amp;amp; Dixon, S. (2020). Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention. In 2020 International Joint Conference on Neural Networks (IJCNN), 1-8. IEEE.&lt;br /&gt;
&lt;br /&gt;
3 - Meseguer-Brocal, G., Cohen-Hadria, A., &amp;amp; Peeters, G. (2019). DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm.&lt;br /&gt;
&lt;br /&gt;
4 - Hansen, J. K., &amp;amp; Fraunhofer, I. D. M. T. (2012). Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients. In 9th Sound and Music Computing Conference (SMC), 494-499.&lt;br /&gt;
&lt;br /&gt;
5 - Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;br /&gt;
&lt;br /&gt;
6 - Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13508</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13508"/>
		<updated>2021-10-27T13:15:04Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* DALI Dataset */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples have few minutes of audio length. The submission is expected to be able to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' would be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets within automatic lyrics transcription research can be categorised under two domains in regards to the presence of music instruments accompanying the singer: Monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former is considered to have only one singer singing the lyrics, and the latter is when there is music accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]. Paper [https://isca-speech.org/archive/Interspeech_2019/pdfs/2378.pdf here (1)].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] Paper [https://arxiv.org/pdf/2007.06486.pdf here (2)].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) (3) is the benchmark dataset for building an acoustic model on polyphonic recordings (4,5,6) and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here]. Paper [https://arxiv.org/pdf/1906.10606.pdf here].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    In case using DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] during training your model in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella version with the singing voice only. An example song can be found [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (4)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (5)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github], although note that we do not allow tuning model parameters using this data, it can only be used to gain insight into the general structure of the test data. For more information also refer to [https://arxiv.org/abs/1902.06797 this paper (6)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - Roa Dabike, G., &amp;amp; Barker, J. (2019). Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
2 - Demirel, E., Ahlbäck, S., &amp;amp; Dixon, S. (2020). Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention. In 2020 International Joint Conference on Neural Networks (IJCNN), 1-8. IEEE.&lt;br /&gt;
&lt;br /&gt;
3 - Meseguer-Brocal, G., Cohen-Hadria, A., &amp;amp; Peeters, G. (2019). DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm.&lt;br /&gt;
&lt;br /&gt;
4 - Hansen, J. K., &amp;amp; Fraunhofer, I. D. M. T. (2012). Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients. In 9th Sound and Music Computing Conference (SMC), 494-499.&lt;br /&gt;
&lt;br /&gt;
5 - Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;br /&gt;
&lt;br /&gt;
6 - Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13507</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13507"/>
		<updated>2021-10-27T13:13:29Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Bibliography */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples have few minutes of audio length. The submission is expected to be able to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' would be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets within automatic lyrics transcription research can be categorised under two domains in regards to the presence of music instruments accompanying the singer: Monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former is considered to have only one singer singing the lyrics, and the latter is when there is music accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]. Paper [https://isca-speech.org/archive/Interspeech_2019/pdfs/2378.pdf here (1)].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] Paper [https://arxiv.org/pdf/2007.06486.pdf here (2)].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here]. Paper [https://arxiv.org/pdf/1906.10606.pdf here (3)].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    In case using DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] during training your model in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella version with the singing voice only. An example song can be found [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (4)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (5)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github], although note that we do not allow tuning model parameters using this data, it can only be used to gain insight into the general structure of the test data. For more information also refer to [https://arxiv.org/abs/1902.06797 this paper (6)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - Roa Dabike, G., &amp;amp; Barker, J. (2019). Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
2 - Demirel, E., Ahlbäck, S., &amp;amp; Dixon, S. (2020). Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention. In 2020 International Joint Conference on Neural Networks (IJCNN), 1-8. IEEE.&lt;br /&gt;
&lt;br /&gt;
3 - Meseguer-Brocal, G., Cohen-Hadria, A., &amp;amp; Peeters, G. (2019). DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm.&lt;br /&gt;
&lt;br /&gt;
4 - Hansen, J. K., &amp;amp; Fraunhofer, I. D. M. T. (2012). Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients. In 9th Sound and Music Computing Conference (SMC), 494-499.&lt;br /&gt;
&lt;br /&gt;
5 - Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;br /&gt;
&lt;br /&gt;
6 - Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13506</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13506"/>
		<updated>2021-10-27T13:09:24Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* DAMP dataset */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples have few minutes of audio length. The submission is expected to be able to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' would be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets within automatic lyrics transcription research can be categorised under two domains in regards to the presence of music instruments accompanying the singer: Monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former is considered to have only one singer singing the lyrics, and the latter is when there is music accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]. Paper [https://isca-speech.org/archive/Interspeech_2019/pdfs/2378.pdf here (1)].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] Paper [https://arxiv.org/pdf/2007.06486.pdf here (2)].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here]. Paper [https://arxiv.org/pdf/1906.10606.pdf here (3)].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    In case using DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] during training your model in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella version with the singing voice only. An example song can be found [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (4)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (5)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github], although note that we do not allow tuning model parameters using this data, it can only be used to gain insight into the general structure of the test data. For more information also refer to [https://arxiv.org/abs/1902.06797 this paper (6)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - Roa Dabike, G., &amp;amp; Barker, J. (2019). Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13505</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13505"/>
		<updated>2021-10-27T13:08:33Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Training Datasets */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples have few minutes of audio length. The submission is expected to be able to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' would be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end-of-sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
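A minimal, hypothetical Python sketch of this calling convention is given below. It only illustrates the I/O handling described above; transcribe() is a placeholder for your own system, and treating ${output} as a directory that receives ${input_song_id}.txt is our assumption based on the naming scheme above:&lt;br /&gt;
&lt;br /&gt;
  #!/usr/bin/env python3&lt;br /&gt;
  # Hypothetical wrapper: python foobar.py ${input_audio_path} ${output}&lt;br /&gt;
  import os&lt;br /&gt;
  import sys&lt;br /&gt;
  def transcribe(audio_path):&lt;br /&gt;
      # Placeholder: replace with your transcription system; must return a list of words.&lt;br /&gt;
      return []&lt;br /&gt;
  if __name__ == '__main__':&lt;br /&gt;
      input_audio_path, output_dir = sys.argv[1], sys.argv[2]&lt;br /&gt;
      words = transcribe(input_audio_path)&lt;br /&gt;
      song_id = os.path.splitext(os.path.basename(input_audio_path))[0]&lt;br /&gt;
      os.makedirs(output_dir, exist_ok=True)&lt;br /&gt;
      # One text file per song: whitespace-separated words, no non-word tokens.&lt;br /&gt;
      with open(os.path.join(output_dir, song_id + '.txt'), 'w') as out:&lt;br /&gt;
          out.write(' '.join(words))&lt;br /&gt;
&lt;br /&gt;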
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, usage instructions for the main script, and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains according to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former contains only a single singer performing the lyrics, while the latter also includes musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
See the [https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details, see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]. Paper [https://doi.org/10.21437/Interspeech.2019-2378 here (1)].&lt;br /&gt;
* Alternatively, annotations can be retrieved directly in Kaldi format [https://github.com/emirdemirel/ALTA/s5/data here]. Paper [https://arxiv.org/pdf/2007.06486.pdf here (2)].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are full-length commercial recordings, and the lyrics are annotated at several levels of granularity, including words and notes (and the syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song, DALI provides a link to a matched YouTube video for audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details, see its full description [https://github.com/gabolsgabs/DALI here]. Paper [https://arxiv.org/pdf/1906.10606.pdf here (3)].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in the English language and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientifically valid evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio comes in two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be found [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (4)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (5)]. The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics GitHub]. Note, however, that we do not allow tuning model parameters on this data; it can only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper (6)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - Roa Dabike, G., Barker, J. (2019) Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13504</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13504"/>
		<updated>2021-10-27T13:07:39Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Training Datasets */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are each a few minutes of audio long. The submission is expected to transcribe each recording in its entirety. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by whitespace:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains according to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former is considered to have only one singer singing the lyrics, and the latter is when there is music accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]. Paper [doi: 10.21437/Interspeech.2019-2378 here (1)].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] Paper [https://arxiv.org/pdf/2007.06486.pdf here (2)].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here]. Paper [https://arxiv.org/pdf/1906.10606.pdf here (3)].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientifically valid evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio comes in two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be found [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (4)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (5)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github], although note that we do not allow tuning model parameters using this data, it can only be used to gain insight into the general structure of the test data. For more information also refer to [https://arxiv.org/abs/1902.06797 this paper (6)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - G.R., Barker, J. (2019) Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13503</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13503"/>
		<updated>2021-10-27T13:07:19Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* DALI Dataset */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are each a few minutes of audio long. The submission is expected to transcribe each recording in its entirety. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by whitespace:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains according to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former is considered to have only one singer singing the lyrics, and the latter is when there is music accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]. Paper [doi: 10.21437/Interspeech.2019-2378 here (1)].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] Paper [ https://arxiv.org/pdf/2007.06486.pdf here (2)].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here]. Paper [https://arxiv.org/pdf/1906.10606.pdf here (3)].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientifically valid evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio comes in two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be found [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (4)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (5)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github], although note that we do not allow tuning model parameters using this data, it can only be used to gain insight into the general structure of the test data. For more information also refer to [https://arxiv.org/abs/1902.06797 this paper (6)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - G.R., Barker, J. (2019) Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13502</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13502"/>
		<updated>2021-10-27T13:07:06Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Training Datasets */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are each a few minutes of audio long. The submission is expected to transcribe each recording in its entirety. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by whitespace:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains according to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former is considered to have only one singer singing the lyrics, and the latter is when there is music accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]. Paper [doi: 10.21437/Interspeech.2019-2378 here (1)].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] Paper [ https://arxiv.org/pdf/2007.06486.pdf here (2)].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here]. Paper [ https://arxiv.org/pdf/1906.10606.pdf here (3)].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientifically valid evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio comes in two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be found [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (4)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (5)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github], although note that we do not allow tuning model parameters using this data, it can only be used to gain insight into the general structure of the test data. For more information also refer to [https://arxiv.org/abs/1902.06797 this paper (6)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - G.R., Barker, J. (2019) Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13501</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13501"/>
		<updated>2021-10-27T13:05:03Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Bibliography */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are each a few minutes of audio long. The submission is expected to transcribe each recording in its entirety. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by whitespace:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains according to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former is considered to have only one singer singing the lyrics, and the latter is when there is music accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository] Paper [doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
 here (1)].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] (2).&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here] (3).&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientifically valid evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio comes in two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be found [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (4)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (5)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github], although note that we do not allow tuning model parameters using this data, it can only be used to gain insight into the general structure of the test data. For more information also refer to [https://arxiv.org/abs/1902.06797 this paper (6)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
1 - G.R., Barker, J. (2019) Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Proc. Interspeech 2019, 579-583, doi: 10.21437/Interspeech.2019-2378&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13500</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13500"/>
		<updated>2021-10-27T13:04:49Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Training Datasets */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
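Below is a minimal Python sketch of how WER (and CER) can be computed from a reference and a hypothesis via word-level edit distance. It is only illustrative; the official MIREX scoring script may differ in details such as text normalisation.&lt;br /&gt;
&lt;br /&gt;
  # Illustrative sketch only; not the official MIREX evaluation code.&lt;br /&gt;
  def edit_distance(ref, hyp):&lt;br /&gt;
      # dynamic programming over the two token sequences; the distance equals S + I + D&lt;br /&gt;
      d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]&lt;br /&gt;
      for i in range(len(ref) + 1):&lt;br /&gt;
          d[i][0] = i&lt;br /&gt;
      for j in range(len(hyp) + 1):&lt;br /&gt;
          d[0][j] = j&lt;br /&gt;
      for i in range(1, len(ref) + 1):&lt;br /&gt;
          for j in range(1, len(hyp) + 1):&lt;br /&gt;
              cost = 0 if ref[i - 1] == hyp[j - 1] else 1&lt;br /&gt;
              d[i][j] = min(d[i - 1][j] + 1,         # deletion&lt;br /&gt;
                            d[i][j - 1] + 1,         # insertion&lt;br /&gt;
                            d[i - 1][j - 1] + cost)  # substitution or match&lt;br /&gt;
      return d[len(ref)][len(hyp)]&lt;br /&gt;
  &lt;br /&gt;
  def wer(reference, hypothesis):&lt;br /&gt;
      ref, hyp = reference.split(), hypothesis.split()&lt;br /&gt;
      # C + S + D equals the number of reference words&lt;br /&gt;
      return edit_distance(ref, hyp) / max(len(ref), 1)&lt;br /&gt;
  &lt;br /&gt;
  def cer(reference, hypothesis):&lt;br /&gt;
      # same computation on the character level&lt;br /&gt;
      return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)&lt;br /&gt;
&lt;br /&gt;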
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are each a few minutes long. The submission is expected to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
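For illustration, a one-line executable in Python could be a thin wrapper around the participant's own system. The sketch below is not a required template; transcribe() is a hypothetical placeholder for your own model.&lt;br /&gt;
&lt;br /&gt;
  #!/usr/bin/env python&lt;br /&gt;
  # Hypothetical wrapper sketch; transcribe() stands in for the participant's own model.&lt;br /&gt;
  import argparse&lt;br /&gt;
  &lt;br /&gt;
  def transcribe(audio_path):&lt;br /&gt;
      # placeholder: return the predicted word sequence for the whole recording&lt;br /&gt;
      return []&lt;br /&gt;
  &lt;br /&gt;
  if __name__ == "__main__":&lt;br /&gt;
      parser = argparse.ArgumentParser()&lt;br /&gt;
      parser.add_argument("-i", "--input_audio_path", required=True)&lt;br /&gt;
      parser.add_argument("-o", "--output", required=True)&lt;br /&gt;
      args = parser.parse_args()&lt;br /&gt;
      words = transcribe(args.input_audio_path)&lt;br /&gt;
      with open(args.output, "w") as f:&lt;br /&gt;
          f.write(" ".join(words))&lt;br /&gt;
&lt;br /&gt;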
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms must be able to accept input in the following format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for the a cappella recordings (Hansen) and two channels (stereo) for the original mixes&lt;br /&gt;
&lt;br /&gt;
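As a rough illustration (not a requirement), the following sketch loads a CD-quality WAV with the soundfile package and downmixes a stereo file to mono; handling MP3 input would need an additional decoder.&lt;br /&gt;
&lt;br /&gt;
  # Illustrative sketch; "example.wav" is a placeholder file name.&lt;br /&gt;
  import soundfile as sf&lt;br /&gt;
  &lt;br /&gt;
  audio, sr = sf.read("example.wav")   # numpy array with shape (samples,) or (samples, 2)&lt;br /&gt;
  if audio.ndim == 2:&lt;br /&gt;
      audio = audio.mean(axis=1)       # average the two channels to get mono&lt;br /&gt;
&lt;br /&gt;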
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end-of-sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
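A small sketch of writing such a file is shown below; the non-word token list and the song_id naming are illustrative assumptions, not part of the task specification.&lt;br /&gt;
&lt;br /&gt;
  import os&lt;br /&gt;
  &lt;br /&gt;
  NON_WORDS = {"[SILENCE]", "[MUSIC]", "[NOISE]"}   # illustrative non-word tokens&lt;br /&gt;
  &lt;br /&gt;
  def write_transcription(words, output_dir, song_id):&lt;br /&gt;
      # drop non-word items and save the words separated by white space&lt;br /&gt;
      clean = [w for w in words if w not in NON_WORDS]&lt;br /&gt;
      path = os.path.join(output_dir, song_id + ".txt")&lt;br /&gt;
      with open(path, "w") as f:&lt;br /&gt;
          f.write(" ".join(clean))&lt;br /&gt;
      return path&lt;br /&gt;
&lt;br /&gt;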
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, instructions for using the main script, and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains according to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
Monophonic datasets contain a single singer performing the lyrics without accompaniment, while polyphonic datasets contain singing with musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
A [https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings] is available. For more details, see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from the raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository]; see the paper (1), doi: 10.21437/Interspeech.2019-2378.&lt;br /&gt;
* Or annotations can be retrieved directly in Kaldi format [https://github.com/emirdemirel/ALTA/s5/data here] (2).&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are full-duration commercial recordings, and the lyrics are annotated at different levels of granularity, including words and notes (and the syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song, DALI provides a link to a matched YouTube video for audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details, see its full description [https://github.com/gabolsgabs/DALI here] (3).&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and therefore '''cannot''' be used by participants to train their models under any circumstances. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in the English language and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientifically sound evaluation possible. &lt;br /&gt;
&lt;br /&gt;
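As a hedged illustration of that exclusion step, assuming you have the evaluation song titles from the linked page and a mapping from DALI IDs to titles, the filtering could look like this (all names below are placeholders):&lt;br /&gt;
&lt;br /&gt;
  # Placeholder data; fill these from the MIREX results page and the DALI metadata.&lt;br /&gt;
  evaluation_titles = {"example evaluation song"}&lt;br /&gt;
  dali_titles = {"dali_id_0001": "example training song", "dali_id_0002": "example evaluation song"}&lt;br /&gt;
  &lt;br /&gt;
  # Keep only DALI entries whose title is not in the evaluation set.&lt;br /&gt;
  excluded = {t.lower() for t in evaluation_titles}&lt;br /&gt;
  training_ids = [dali_id for dali_id, title in dali_titles.items() if title.lower() not in excluded]&lt;br /&gt;
&lt;br /&gt;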
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (4)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was first used here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (5)]. The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings of varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics GitHub]. Note that we do not allow tuning model parameters using this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper (6)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If enough participants come forward, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13499</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13499"/>
		<updated>2021-10-27T13:02:14Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Jamendo Dataset */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER): the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples have few minutes of audio length. The submission is expected to be able to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains according to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
Monophonic datasets contain a single singer performing the lyrics without accompaniment, while polyphonic datasets contain singing with musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository] (1).&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] (2).&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here] (3).&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    In case using DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] during training your model in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (4)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (5)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics GitHub]. Note that we do not allow tuning model parameters using this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper (6)].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If enough participants come forward, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13498</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13498"/>
		<updated>2021-10-27T13:01:51Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Hansen's Dataset */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER): the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples have few minutes of audio length. The submission is expected to be able to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains according to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
Monophonic datasets contain a single singer performing the lyrics without accompaniment, while polyphonic datasets contain singing with musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository] (1).&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] (2).&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here] (3).&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    In case using DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] during training your model in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html (4)]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (5)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics GitHub]. Note that we do not allow tuning model parameters using this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If enough participants come forward, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13497</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13497"/>
		<updated>2021-10-27T13:01:40Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Mauch's Dataset */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER): the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples have few minutes of audio length. The submission is expected to be able to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains according to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
Monophonic datasets contain a single singer performing the lyrics without accompaniment, while polyphonic datasets contain singing with musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository] (1).&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] (2).&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here] (3).&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    In case using DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] during training your model in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html](4). The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf (5)] . The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics GitHub]. Note that we do not allow tuning model parameters using this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If enough participants come forward, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13496</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13496"/>
		<updated>2021-10-27T13:01:17Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Mauch's Dataset */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER): the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples have few minutes of audio length. The submission is expected to be able to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains according to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
Monophonic datasets contain a single singer performing the lyrics without accompaniment, while polyphonic datasets contain singing with musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository] (1).&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] (2).&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here] (3).&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    In case using DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] during training your model in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html](4). The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf] (2). The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics GitHub]. Note that we do not allow tuning model parameters using this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If enough participants come forward, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13495</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13495"/>
		<updated>2021-10-27T13:01:00Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Hansen's Dataset */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER): the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where;&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done at the character level. This metric penalises partially correct or misspelled words less than WER does.&lt;br /&gt;
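&lt;br /&gt;
For illustration only, here is a minimal sketch of how WER could be computed with a standard edit-distance (Levenshtein) alignment, written in Python. This is not the official scoring script, and the function name is hypothetical.&lt;br /&gt;
&lt;br /&gt;
  # Illustrative WER computation via Levenshtein alignment (not the official MIREX scorer).&lt;br /&gt;
  def word_error_rate(reference, hypothesis):&lt;br /&gt;
      ref, hyp = reference.split(), hypothesis.split()&lt;br /&gt;
      # dp[i][j] = minimum number of edits turning ref[:i] into hyp[:j]&lt;br /&gt;
      dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]&lt;br /&gt;
      for i in range(len(ref) + 1):&lt;br /&gt;
          dp[i][0] = i&lt;br /&gt;
      for j in range(len(hyp) + 1):&lt;br /&gt;
          dp[0][j] = j&lt;br /&gt;
      for i in range(1, len(ref) + 1):&lt;br /&gt;
          for j in range(1, len(hyp) + 1):&lt;br /&gt;
              cost = 0 if ref[i - 1] == hyp[j - 1] else 1&lt;br /&gt;
              dp[i][j] = min(dp[i - 1][j] + 1,           # deletion&lt;br /&gt;
                             dp[i][j - 1] + 1,           # insertion&lt;br /&gt;
                             dp[i - 1][j - 1] + cost)    # substitution or match&lt;br /&gt;
      # S + I + D divided by the reference length (C + S + D)&lt;br /&gt;
      return dp[len(ref)][len(hyp)] / max(len(ref), 1)&lt;br /&gt;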
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are a few minutes of audio each. The submission is expected to be able to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms must accept the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels (stereo) for the original versions&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by whitespace:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end-of-sentence tokens) should be removed from the final output.&lt;br /&gt;
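&lt;br /&gt;
As a sketch of this clean-up step in Python (the token names below are only examples of what a recogniser might emit, not a prescribed inventory):&lt;br /&gt;
&lt;br /&gt;
  # Example post-processing: drop non-word tokens before writing the output file.&lt;br /&gt;
  NON_WORDS = {'[SILENCE]', '[MUSIC]', '[NOISE]', '[SENTENCE-END]'}&lt;br /&gt;
  def keep_words_only(tokens):&lt;br /&gt;
      return [t for t in tokens if t.upper() not in NON_WORDS]&lt;br /&gt;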
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
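&lt;br /&gt;
To make the expected I/O contract concrete, here is a minimal sketch of such a script in Python. The transcribe() function is a placeholder for the participant's own system; everything else only handles the command-line arguments and the output file.&lt;br /&gt;
&lt;br /&gt;
  #!/usr/bin/env python&lt;br /&gt;
  # Hypothetical skeleton of a main transcription script (e.g. foobar.py).&lt;br /&gt;
  import argparse, os&lt;br /&gt;
  &lt;br /&gt;
  def transcribe(audio_path):&lt;br /&gt;
      # Placeholder: run your lyrics transcription system and return a list of words.&lt;br /&gt;
      raise NotImplementedError&lt;br /&gt;
  &lt;br /&gt;
  if __name__ == '__main__':&lt;br /&gt;
      parser = argparse.ArgumentParser()&lt;br /&gt;
      parser.add_argument('-i', '--input', required=True, help='path to the input audio file')&lt;br /&gt;
      parser.add_argument('-o', '--output', required=True, help='output directory for the transcription')&lt;br /&gt;
      args = parser.parse_args()&lt;br /&gt;
      words = transcribe(args.input)&lt;br /&gt;
      song_id = os.path.splitext(os.path.basename(args.input))[0]&lt;br /&gt;
      os.makedirs(args.output, exist_ok=True)&lt;br /&gt;
      with open(os.path.join(args.output, song_id + '.txt'), 'w') as f:&lt;br /&gt;
          f.write(' '.join(words))  # plain words separated by whitespace&lt;br /&gt;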
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, usage of the main script, and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains, according to whether musical instruments accompany the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former contain a single unaccompanied singer performing the lyrics; the latter include musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
A [https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings] is available. For more details, see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository] (1).&lt;br /&gt;
* Alternatively, annotations can be retrieved directly in Kaldi format [https://github.com/emirdemirel/ALTA/s5/data here] (2).&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings; it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are full-length commercial recordings, and the lyrics are annotated at several levels of granularity, including words and notes (and the syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song, DALI provides a link to a matched YouTube video for audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details, see its full description [https://github.com/gabolsgabs/DALI here] (3).&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in the English language and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientific evaluation possible. &lt;br /&gt;
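&lt;br /&gt;
As an illustration of this exclusion step, a hypothetical Python sketch is given below. It assumes you keep DALI annotations keyed by identifier, with artist/title metadata, and that you have the list of artist/title pairs used for MIREX evaluation; none of these names come from the DALI package itself.&lt;br /&gt;
&lt;br /&gt;
  # Hypothetical sketch: drop the MIREX evaluation songs from a DALI-based training set.&lt;br /&gt;
  # dali_entries: dict mapping DALI ids to annotation objects carrying artist/title metadata.&lt;br /&gt;
  # mirex_eval_songs: set of (artist, title) pairs reserved for evaluation.&lt;br /&gt;
  def exclude_evaluation_songs(dali_entries, mirex_eval_songs):&lt;br /&gt;
      kept = {}&lt;br /&gt;
      for dali_id, entry in dali_entries.items():&lt;br /&gt;
          key = (entry.info['artist'].lower(), entry.info['title'].lower())&lt;br /&gt;
          if key not in mirex_eval_songs:&lt;br /&gt;
              kept[dali_id] = entry&lt;br /&gt;
      return kept&lt;br /&gt;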
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html] (4). The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of the beginning timestamp of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was first used here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf]. The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github]. Note, however, that we do not allow tuning model parameters on this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D., Durand, S., &amp;amp; Ewert, S. (2019). End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13494</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13494"/>
		<updated>2021-10-27T13:00:27Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* DALI Dataset */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where:&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are a few minutes of audio each. The submission is expected to be able to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by whitespace:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains, according to whether musical instruments accompany the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former contain a single unaccompanied singer performing the lyrics; the latter include musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository] (1).&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] (2).&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details, see its full description [https://github.com/gabolsgabs/DALI here] (3).&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in the English language and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf]. The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github]. Note, however, that we do not allow tuning model parameters on this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D., Durand, S., &amp;amp; Ewert, S. (2019). End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13493</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13493"/>
		<updated>2021-10-27T13:00:19Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* DAMP dataset */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where:&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are a few minutes of audio each. The submission is expected to be able to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by whitespace:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains, according to whether musical instruments accompany the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former contain a single unaccompanied singer performing the lyrics; the latter include musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository] (1).&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here] (2).&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details, see its full description [https://github.com/gabolsgabs/DALI here] (1).&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in the English language and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf]. The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github]. Note, however, that we do not allow tuning model parameters on this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D., Durand, S., &amp;amp; Ewert, S. (2019). End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13492</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13492"/>
		<updated>2021-10-27T12:59:31Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* DALI Dataset */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where:&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are a few minutes of audio each. The submission is expected to be able to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by whitespace:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains, according to whether musical instruments accompany the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former contain a single unaccompanied singer performing the lyrics; the latter include musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here]&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details, see its full description [https://github.com/gabolsgabs/DALI here] (1).&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in the English language and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf]. The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github]. Note, however, that we do not allow tuning model parameters on this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D., Durand, S., &amp;amp; Ewert, S. (2019). End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13491</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13491"/>
		<updated>2021-10-27T12:58:27Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Potential Participants */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where:&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are a few minutes of audio each. The submission is expected to be able to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called `foobar' will be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by whitespace:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains, according to whether musical instruments accompany the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former contain a single unaccompanied singer performing the lyrics; the latter include musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here]&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details, see its full description [https://github.com/gabolsgabs/DALI here].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in the English language and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientific evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf]. The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github]. Note, however, that we do not allow tuning model parameters on this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D., Durand, S., &amp;amp; Ewert, S. (2019). End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13490</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13490"/>
		<updated>2021-10-27T12:57:48Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Evaluation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where:&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the same computation can also be done at the character level. This metric penalises partially correct or misspelled words less heavily than WER does.&lt;br /&gt;
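&lt;br /&gt;
Below is a minimal illustrative sketch in python (not the official MIREX scoring code) of how WER can be computed from a reference and a hypothesis transcription via word-level edit distance; the function name is chosen for this example only. Applying the same procedure to character sequences instead of word sequences yields CER.&lt;br /&gt;
&lt;br /&gt;
  # Illustrative sketch only: word-level WER via dynamic-programming edit distance.&lt;br /&gt;
  def word_error_rate(reference, hypothesis):&lt;br /&gt;
      ref = reference.split()&lt;br /&gt;
      hyp = hypothesis.split()&lt;br /&gt;
      rows = len(ref) + 1&lt;br /&gt;
      cols = len(hyp) + 1&lt;br /&gt;
      # d[i][j] = edit distance between the first i reference words and the first j hypothesis words&lt;br /&gt;
      d = [ [0] * cols for _ in range(rows) ]&lt;br /&gt;
      for i in range(rows):&lt;br /&gt;
          d[i][0] = i&lt;br /&gt;
      for j in range(cols):&lt;br /&gt;
          d[0][j] = j&lt;br /&gt;
      for i in range(1, rows):&lt;br /&gt;
          for j in range(1, cols):&lt;br /&gt;
              cost = 0 if ref[i - 1] == hyp[j - 1] else 1&lt;br /&gt;
              d[i][j] = min(d[i - 1][j] + 1,          # deletion&lt;br /&gt;
                            d[i][j - 1] + 1,          # insertion&lt;br /&gt;
                            d[i - 1][j - 1] + cost)   # substitution or match&lt;br /&gt;
      # S + I + D equals the final edit distance; C + S + D equals the number of reference words&lt;br /&gt;
      return d[len(ref)][len(hyp)] / max(len(ref), 1)&lt;br /&gt;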
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are each a few minutes of audio long. The submission is expected to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged as a compressed file (.zip, .rar, etc.) which contains at least the following two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called foobar would be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
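&lt;br /&gt;
For illustration only, a minimal python skeleton of such a main script is sketched below; all names here (including the transcribe function) are hypothetical placeholders rather than a required interface, and the flag variant of the call is assumed.&lt;br /&gt;
&lt;br /&gt;
  # Hypothetical skeleton of a main transcription script (illustration only).&lt;br /&gt;
  import argparse&lt;br /&gt;
  &lt;br /&gt;
  def transcribe(input_audio_path):&lt;br /&gt;
      # run your transcription system on the audio file and return the predicted words&lt;br /&gt;
      return []&lt;br /&gt;
  &lt;br /&gt;
  if __name__ == "__main__":&lt;br /&gt;
      parser = argparse.ArgumentParser()&lt;br /&gt;
      parser.add_argument("-i", "--input_audio_path", required=True)&lt;br /&gt;
      parser.add_argument("-o", "--output", required=True)&lt;br /&gt;
      args = parser.parse_args()&lt;br /&gt;
      words = transcribe(args.input_audio_path)&lt;br /&gt;
      with open(args.output, "w") as f:&lt;br /&gt;
          f.write(" ".join(words))&lt;br /&gt;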
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms must accept the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels (stereo) for the original mixes&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end-of-sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
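&lt;br /&gt;
As a sketch only, the per-song output could be produced as follows; the non-word markers shown are hypothetical examples and depend entirely on your own system's symbols.&lt;br /&gt;
&lt;br /&gt;
  # Illustrative sketch: drop non-word tokens and save one whitespace-separated file per song.&lt;br /&gt;
  import os&lt;br /&gt;
  &lt;br /&gt;
  NON_WORDS = {"[silence]", "[music]", "[noise]"}   # hypothetical example markers&lt;br /&gt;
  &lt;br /&gt;
  def save_transcription(output_dir, input_song_id, tokens):&lt;br /&gt;
      words = [t for t in tokens if t not in NON_WORDS]&lt;br /&gt;
      path = os.path.join(output_dir, input_song_id + ".txt")&lt;br /&gt;
      with open(path, "w") as f:&lt;br /&gt;
          f.write(" ".join(words))&lt;br /&gt;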
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, usage of the main script, and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research can be categorised into two domains with regard to the presence of musical instruments accompanying the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former contains only a single singer performing the lyrics, while the latter also includes musical accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
A [https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings] is available. For more details, see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository].&lt;br /&gt;
* Alternatively, annotations can be retrieved directly in Kaldi format [https://github.com/emirdemirel/ALTA/s5/data here].&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building acoustic models on polyphonic recordings; it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are full-length commercial recordings, and the lyrics are annotated at different levels of granularity, including words and notes (and the syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song, DALI provides a link to a matching YouTube video for audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details, see its full description [https://github.com/gabolsgabs/DALI here].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstances. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in the English language and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientifically sound evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs annotated with the beginning timestamp of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf]. The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings of varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github]; note, however, that we do not allow tuning model parameters on this data. It can only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you are willing to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
Jiawen Huang&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13489</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13489"/>
		<updated>2021-10-27T12:57:32Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Evaluation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
'''Word Error Rate''' (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where:&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Character Error Rate''' (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are each a few minutes of audio long. The submission is expected to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called foobar would be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets within automatic lyrics transcription research can be categorised under two domains in regards to the presence of music instruments accompanying the singer: Monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former is considered to have only one singer singing the lyrics, and the latter is when there is music accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here]&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientifically sound evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf]. The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github]; note, however, that we do not allow tuning model parameters on this data. It can only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you are willing to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
Jiawen Huang&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13488</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13488"/>
		<updated>2021-10-27T12:57:01Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
Word Error Rate (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where:&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Character Error Rate (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
IMPORTANT: The evaluation samples are each a few minutes of audio long. The submission is expected to transcribe the entire recording. If your submission requires segmentation as a preprocessing step, this should already be implemented in your pipeline.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called foobar would be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets within automatic lyrics transcription research can be categorised under two domains in regards to the presence of music instruments accompanying the singer: Monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former is considered to have only one singer singing the lyrics, and the latter is when there is music accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here]&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientifically sound evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf]. The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github]; note, however, that we do not allow tuning model parameters on this data. It can only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you are willing to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
Jiawen Huang&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13487</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13487"/>
		<updated>2021-10-27T12:53:51Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Evaluation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called foobar would be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets within automatic lyrics transcription research can be categorised under two domains in regards to the presence of music instruments accompanying the singer: Monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former is considered to have only one singer singing the lyrics, and the latter is when there is music accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here]&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientifically sound evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf]. The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github]; note, however, that we do not allow tuning model parameters on this data. It can only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you are willing to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
Jiawen Huang&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13486</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13486"/>
		<updated>2021-10-27T12:53:05Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This page describes the '''MIREX2021: Automatic Lyrics Transcription''' challenge. For the evaluation procedure and the submission format, please scroll down the page. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called foobar would be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms will have to receive the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels for original&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by white space:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end of the sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, the use of the main script and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets within automatic lyrics transcription research can be categorised under two domains in regards to the presence of music instruments accompanying the singer: Monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
The former is considered to have only one singer singing the lyrics, and the latter is when there is music accompaniment. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-wise balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation.  &lt;br /&gt;
[https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings]. For more details see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository].&lt;br /&gt;
* Or annotations can be directly retrieved in the Kaldi form [https://github.com/emirdemirel/ALTA/s5/data here]&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings and it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are commercial recordings in full-duration, whereas the lyrics are described according to different levels of granularity including words and notes (and syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song DALI provides a link to a matched youtube video for the audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details how, see its full description [https://github.com/gabolsgabs/DALI here].&lt;br /&gt;
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in English language, and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***'''    If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientifically sound evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment, and an a cappella (singing voice only) version. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of beginning-timestamps of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf]. The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings with varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github]; note, however, that we do not allow tuning model parameters on this data. It can only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
Word Error Rate (WER) : the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where:&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Character Error Rate (CER) : the above computation can also be done on the character level. This metric penalises the partially correctly predicted / incorrectly spelled words less than WER.&lt;br /&gt;
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you are willing to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
Jiawen Huang&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D. and Durand, S. and Ewert, S. (2019) End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
	<entry>
		<id>https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13485</id>
		<title>2021:Automatic Lyrics Transcription</title>
		<link rel="alternate" type="text/html" href="https://music-ir.org/mirex/w/index.php?title=2021:Automatic_Lyrics_Transcription&amp;diff=13485"/>
		<updated>2021-10-27T00:15:33Z</updated>

		<summary type="html">&lt;p&gt;Georgi Dzhambazov: /* Evaluation Datasets */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Description =&lt;br /&gt;
&lt;br /&gt;
This year we host the '''MIREX2021: Automatic Lyrics Transcription''' challenge. You are free to participate in one of the tasks or both of them. &lt;br /&gt;
&lt;br /&gt;
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''') = argmax P('''w'''|'''X''')&lt;br /&gt;
&lt;br /&gt;
where '''w''' and '''X''' are the word and acoustic features respectively.&lt;br /&gt;
&lt;br /&gt;
Ideally, the lyrics transcriber should return meaningful word sequences:&lt;br /&gt;
&lt;br /&gt;
  Prediction('''w''')  = [ &amp;lt;w_1&amp;gt;, &amp;lt;w_2&amp;gt;, ..., &amp;lt;w_N&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Submission Format =&lt;br /&gt;
&lt;br /&gt;
Submissions should be packaged in a compressed file (.zip or .rar, etc.) which contains at least two files:&lt;br /&gt;
&lt;br /&gt;
=== A) The main transcription script ===&lt;br /&gt;
&lt;br /&gt;
The main transcription script to execute. This should be a '''one-line executable''' in one of the following formats: a bash (.sh) script, a python (.py) script, or a binary file.&lt;br /&gt;
&lt;br /&gt;
===  I / O ===&lt;br /&gt;
&lt;br /&gt;
The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.&lt;br /&gt;
&lt;br /&gt;
Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called foobar would be called from the command line as follows:&lt;br /&gt;
&lt;br /&gt;
 foobar ${input_audio_path}  ${output}&lt;br /&gt;
&lt;br /&gt;
OR with flags:&lt;br /&gt;
&lt;br /&gt;
 foobar -i ${input_audio_path}  -o ${output}&lt;br /&gt;
&lt;br /&gt;
==== Input Audio ====&lt;br /&gt;
&lt;br /&gt;
Participating algorithms must accept audio in the following input format:&lt;br /&gt;
&lt;br /&gt;
* Audio format : WAV / MP3&lt;br /&gt;
* CD-quality (PCM, 16-bit, 44100 Hz)&lt;br /&gt;
* single channel (mono) for a cappella (Hansen) and two channels (stereo) for the original mixes&lt;br /&gt;
&lt;br /&gt;
==== Output File Format ====&lt;br /&gt;
&lt;br /&gt;
A text file (per song) containing a list of words separated by whitespace:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;word_1&amp;gt; &amp;lt;word_2&amp;gt; ... &amp;lt;word_N&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any non-word items (e.g. silence, music, noise or end-of-sentence tokens) should be removed from the final output.&lt;br /&gt;
&lt;br /&gt;
Ideally, the output transcriptions will be saved as:&lt;br /&gt;
 &lt;br /&gt;
  ${output}/${input_song_id}.txt&lt;br /&gt;
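&lt;br /&gt;
For illustration only, a minimal Python skeleton of such a main script might look as follows. The transcribe() function is a hypothetical placeholder for the participant's own model, and ${output} is treated here as the output directory:&lt;br /&gt;
&lt;br /&gt;
 #!/usr/bin/env python3&lt;br /&gt;
 # Hypothetical skeleton of a main transcription script; transcribe() is a placeholder.&lt;br /&gt;
 import os&lt;br /&gt;
 import sys&lt;br /&gt;
 &lt;br /&gt;
 def transcribe(audio_path):&lt;br /&gt;
     # Replace with the actual model inference; must return a list of words.&lt;br /&gt;
     return []&lt;br /&gt;
 &lt;br /&gt;
 if __name__ == "__main__":&lt;br /&gt;
     input_audio_path, output = sys.argv[1], sys.argv[2]&lt;br /&gt;
     song_id = os.path.splitext(os.path.basename(input_audio_path))[0]&lt;br /&gt;
     words = transcribe(input_audio_path)&lt;br /&gt;
     os.makedirs(output, exist_ok=True)&lt;br /&gt;
     with open(os.path.join(output, song_id + ".txt"), "w") as f:&lt;br /&gt;
         f.write(" ".join(words))&lt;br /&gt;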
&lt;br /&gt;
=== B) The README file ===&lt;br /&gt;
&lt;br /&gt;
This file must contain detailed installation instructions, instructions on how to run the main script, and contact information.&lt;br /&gt;
&lt;br /&gt;
---- &lt;br /&gt;
&lt;br /&gt;
Any submission that fails to meet the above requirements will not be considered in the evaluation!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Training Datasets =&lt;br /&gt;
&lt;br /&gt;
Datasets in automatic lyrics transcription research fall into two domains according to whether musical instruments accompany the singer: monophonic and polyphonic datasets. &lt;br /&gt;
&lt;br /&gt;
In the former, a single singer performs the lyrics without accompaniment; in the latter, musical accompaniment is present. &lt;br /&gt;
&lt;br /&gt;
In this challenge, the participants are encouraged but '''not obliged''' to use the open source datasets below, which are also commonly used in the literature for benchmarking ALT results:&lt;br /&gt;
&lt;br /&gt;
=== DAMP dataset ===&lt;br /&gt;
The [https://zenodo.org/record/2747436#.Xyge4xMzZ0s DAMP - Sing!300x30x2 dataset] consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile Karaoke application. &lt;br /&gt;
&lt;br /&gt;
The data is curated to be gender-balanced and contains performers from 30 different countries, which provides a good amount of variability in terms of accents and pronunciation. &lt;br /&gt;
A [https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings] is available. For more details, see the paper. &lt;br /&gt;
&lt;br /&gt;
* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]&lt;br /&gt;
* Lyrics boundary annotations can be generated from raw annotations using [https://github.com/groadabike/Kaldi-Dsing-task this repository].&lt;br /&gt;
* Alternatively, the annotations can be retrieved directly in Kaldi format [https://github.com/emirdemirel/ALTA/s5/data here]&lt;br /&gt;
&lt;br /&gt;
=== DALI Dataset ===&lt;br /&gt;
&lt;br /&gt;
DALI (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) is the benchmark dataset for building acoustic models on polyphonic recordings; it contains over 5000 songs with semi-automatically aligned lyrics annotations.&lt;br /&gt;
&lt;br /&gt;
The songs are full-duration commercial recordings, and the lyrics are annotated at several levels of granularity, including words and notes (with the syllables underlying a given note).&lt;br /&gt;
&lt;br /&gt;
For each song, DALI provides a link to a matched YouTube video for audio retrieval.&lt;br /&gt;
&lt;br /&gt;
* For more details, see its full description [https://github.com/gabolsgabs/DALI here]; a rough loading sketch follows below.&lt;br /&gt;
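&lt;br /&gt;
The sketch below loads the annotations with the DALI Python package; the function and field names follow the repository README and should be treated as assumptions to be checked against the current version:&lt;br /&gt;
&lt;br /&gt;
 # Hedged sketch: load DALI annotations; names follow the DALI repository README and may change.&lt;br /&gt;
 import DALI as dali_code&lt;br /&gt;
 &lt;br /&gt;
 dali_data_path = "/path/to/DALI/annotations"   # hypothetical local path to the downloaded annotation files&lt;br /&gt;
 dali_data = dali_code.get_the_DALI_dataset(dali_data_path, skip=[], keep=[])&lt;br /&gt;
 &lt;br /&gt;
 entry = list(dali_data.values())[0]&lt;br /&gt;
 words = entry.annotations["annot"]["words"]    # word-level annotations with timings&lt;br /&gt;
 print(entry.info["title"], len(words))&lt;br /&gt;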
&lt;br /&gt;
= Evaluation Datasets =&lt;br /&gt;
&lt;br /&gt;
The following datasets are used for evaluation and so '''cannot''' be used by participants to train their models under any circumstance. &lt;br /&gt;
&lt;br /&gt;
Note that the evaluation sets listed below consist of popular songs in the English language and have overlapping samples with DALI. &lt;br /&gt;
&lt;br /&gt;
'''*** IMPORTANT ***''' If you use DALI for training, you '''MUST''' exclude [https://www.music-ir.org/mirex/wiki/2020:Lyrics_Transcription_Results the songs used for MIREX evaluation] when training your model, in order to make a scientifically sound evaluation possible. &lt;br /&gt;
&lt;br /&gt;
=== Hansen's Dataset ===&lt;br /&gt;
The dataset contains 9 pop music songs released in the early 2010s.&lt;br /&gt;
&lt;br /&gt;
The audio has two versions: the original mix with instrumental accompaniment and an a cappella version containing the singing voice only. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was made here: [http://publica.fraunhofer.de/documents/N-345612.html]. The recordings have been provided by Jens Kofod Hansen for public evaluation.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:40 minutes (total time: 35:33 minutes)&lt;br /&gt;
* 3590 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Mauch's Dataset ===&lt;br /&gt;
&lt;br /&gt;
The dataset contains 20 pop music songs with annotations of the beginning timestamp of each word.&lt;br /&gt;
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].&lt;br /&gt;
&lt;br /&gt;
You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf]. The dataset has been provided by Sungkyun Chang.&lt;br /&gt;
&lt;br /&gt;
* file duration up to 5:40 minutes (total time: 1h 19m)&lt;br /&gt;
* 5050 words annotated in total&lt;br /&gt;
&lt;br /&gt;
=== Jamendo Dataset ===&lt;br /&gt;
&lt;br /&gt;
This dataset contains 20 recordings spanning various Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.&lt;br /&gt;
&lt;br /&gt;
It is available online on [https://github.com/f90/jamendolyrics Github]. Note, however, that we do not allow tuning model parameters on this data; it may only be used to gain insight into the general structure of the test data. For more information, also refer to [https://arxiv.org/abs/1902.06797 this paper].&lt;br /&gt;
&lt;br /&gt;
* file duration up to 4:43 (total time: 1h 12m)&lt;br /&gt;
* 5677 words annotated in total&lt;br /&gt;
&lt;br /&gt;
= Evaluation =&lt;br /&gt;
&lt;br /&gt;
Word Error Rate (WER): the standard metric used in Automatic Speech Recognition.&lt;br /&gt;
&lt;br /&gt;
  WER = (S + I + D) / (C + S + D)&lt;br /&gt;
&lt;br /&gt;
where:&lt;br /&gt;
 C : correctly predicted words&lt;br /&gt;
 S : substitution errors&lt;br /&gt;
 I : insertion errors&lt;br /&gt;
 D : deletion errors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Character Error Rate (CER): the above computation can also be done at the character level. This metric penalises partially correct or misspelled words less heavily than WER does.&lt;br /&gt;
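&lt;br /&gt;
To illustrate how these metrics are computed, the following small Python sketch uses a standard Levenshtein alignment; it is not the official evaluation code, and the example sentences are invented:&lt;br /&gt;
&lt;br /&gt;
 # Illustrative WER/CER computation via Levenshtein alignment (not the official scoring code).&lt;br /&gt;
 def error_rate(ref, hyp):&lt;br /&gt;
     # ref, hyp: sequences of tokens (words for WER, characters for CER).&lt;br /&gt;
     d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]&lt;br /&gt;
     for i in range(len(ref) + 1):&lt;br /&gt;
         d[i][0] = i&lt;br /&gt;
     for j in range(len(hyp) + 1):&lt;br /&gt;
         d[0][j] = j&lt;br /&gt;
     for i in range(1, len(ref) + 1):&lt;br /&gt;
         for j in range(1, len(hyp) + 1):&lt;br /&gt;
             cost = 0 if ref[i - 1] == hyp[j - 1] else 1&lt;br /&gt;
             d[i][j] = min(d[i - 1][j] + 1,         # deletion&lt;br /&gt;
                           d[i][j - 1] + 1,         # insertion&lt;br /&gt;
                           d[i - 1][j - 1] + cost)  # substitution or match&lt;br /&gt;
     # Total edits (S + I + D) divided by the reference length (C + S + D).&lt;br /&gt;
     return d[len(ref)][len(hyp)] / max(len(ref), 1)&lt;br /&gt;
 &lt;br /&gt;
 reference = "you are my fire".split()&lt;br /&gt;
 hypothesis = "you are my desire".split()&lt;br /&gt;
 wer = error_rate(reference, hypothesis)                      # 1 substitution / 4 words = 0.25&lt;br /&gt;
 cer = error_rate(" ".join(reference), " ".join(hypothesis))  # same computation on characters&lt;br /&gt;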
&lt;br /&gt;
= Submission closing dates =&lt;br /&gt;
&lt;br /&gt;
Closing date: '''December 9, 2021'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Audio-to-Lyrics Alignment =&lt;br /&gt;
&lt;br /&gt;
Due to an insufficient number of participants, we are not holding the Audio-to-Lyrics Alignment challenge this year.&lt;br /&gt;
&lt;br /&gt;
However, feel free to contact us if you would like to participate in such a challenge, as in previous years' MIREX challenges. If we reach a sufficient number of participants, we may organise the Audio-to-Lyrics Alignment challenge as well.&lt;br /&gt;
&lt;br /&gt;
= Questions? =&lt;br /&gt;
&lt;br /&gt;
* Send us an email: e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)&lt;br /&gt;
&lt;br /&gt;
== Potential Participants ==&lt;br /&gt;
Chitralekha Gupta&lt;br /&gt;
&lt;br /&gt;
Emir Demirel&lt;br /&gt;
&lt;br /&gt;
Gerardo Roa Dabike&lt;br /&gt;
&lt;br /&gt;
Jiawen Huang&lt;br /&gt;
&lt;br /&gt;
= Bibliography =&lt;br /&gt;
&lt;br /&gt;
Stoller, D., Durand, S., and Ewert, S. (2019). End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model. ICASSP 2019.&lt;br /&gt;
&lt;br /&gt;
Mauch, M., Fujihara, H., &amp;amp; Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.&lt;/div&gt;</summary>
		<author><name>Georgi Dzhambazov</name></author>
		
	</entry>
</feed>