2024:Lyrics-to-Audio Alignment

==Description==

The goal of automatic lyrics-to-audio alignment is to synchronize an audio recording of singing with its corresponding written lyrics. The beginning timestamps of lyrics units can be estimated at different granularities: phonemes, words, lyrics lines, or phrases. For this task, word-level alignment is required.

  -----------------------    --------------------------------------------------
  | Mixed singing audio |    | Lyrics at word-level: no more carefree ... ... |
  -----------------------    --------------------------------------------------
             |                                       |
             -----------------------------------------
                                 |
                        --------------------
                        | Alignment system |
                        --------------------
                                 |
                                 |
                     --------------------------
                     | 0.123  0.798  no       |
                     | 0.798  1.123  more     |
                     | 1.345  2.176  carefree |
                     | ...    ...             |
                     --------------------------

The algorithm receives two inputs: the mixed singing audio (singing voice plus musical accompaniment) and its corresponding word-level lyrics. It outputs the onset and offset timestamps (in seconds) of each word.

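Purely for illustration, this input/output contract can be sketched in Python as below; the type alias and function name are hypothetical and only mirror the diagram above, not a prescribed interface:

<pre>
# Hypothetical sketch of the task's I/O contract: audio plus word-level
# lyrics in, one (onset, offset, word) triple in seconds per word out.
from typing import List, Tuple

WordAlignment = List[Tuple[float, float, str]]  # (onset_sec, offset_sec, word)

def align(audio_path: str, words: List[str]) -> WordAlignment:
    """Placeholder only: the aligner itself is the participant's submission."""
    raise NotImplementedError

# Expected shape of the result, matching the diagram above:
example: WordAlignment = [(0.123, 0.798, "no"),
                          (0.798, 1.123, "more"),
                          (1.345, 2.176, "carefree")]
</pre>
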
== What's New ==

Compared to previous years:

* Submission format: a docker image is required. See the submission format section.

==Datasets==

===Training Datasets===

==== DAMP dataset ====
The DAMP dataset contains a large number (34,000+) of a cappella recordings from a wide variety of amateur singers, collected with the Sing! Karaoke mobile app in different recording conditions, but generally with good audio quality. A carefully curated subset, DAMPB, with 20 performances of each of 300 songs was created by Kruspe (2016). Here is the [https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing list of recordings].

* The audio can be downloaded from the [https://ccrma.stanford.edu/damp/ Smule web site]
* No lyrics boundary annotations are available, but the textual lyrics are on the [https://www.smule.com/songs Smule Sing! Karaoke website]

==== DALI Dataset ====

The DALI dataset (a large '''D'''ataset of synchronised '''A'''udio, '''L'''yr'''I'''cs and notes) contains over 5000 songs with semi-automatically aligned lyrics annotations. The songs are full-duration commercial recordings, and the lyrics are described at different levels of granularity, including words and notes (and the syllables underlying a given note). For each song, DALI provides a link to a matched YouTube video from which the audio can be retrieved.
For more details, see its full description [https://github.com/gabolsgabs/DALI here].

===Evaluation Datasets===

The following datasets are used for evaluation and so cannot be used by participants to train their models under any circumstance.

==== Hansen's Dataset ====
The dataset contains 9 pop music songs in English with annotations of both the beginning and ending timestamps of each word. The ending timestamps are provided for convenience (they are copies of the next word's beginning timestamp) and are not used in the evaluation. Sentence-level annotations are also provided.
The audio has two versions: the original mix with instrumental accompaniment and an a cappella version with singing voice only. An example song can be seen [https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0 here].

You can read in detail about how the dataset was made here: [http://smcnetwork.org/system/files/smc2012-198.pdf Recognition of Phonemes in A-cappella Recordings using Temporal Patterns and Mel Frequency Cepstral Coefficients]. The dataset has been kindly provided by Jens Kofod Hansen.

* file duration up to 4:40 minutes (total time: 35:33 minutes)
* 3590 words annotated in total

==== Mauch's Dataset ====
The dataset contains 20 pop music songs in English with annotations of the beginning timestamps of each word. Non-vocal sections are not explicitly annotated (they remain included in the last preceding word). We prefer to leave it this way to enable comparison to previous work evaluated on this dataset.
The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here].

You can read in detail about how the dataset was first used here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment]. The dataset has been kindly provided by Sungkyun Chang.

* file duration up to 5:40 minutes (total time: 1h 19m)
* 5050 words annotated in total

==== Gracenote Dataset ====
The dataset contains 8 pop music song excerpts with instrumental accompaniment, with annotations of the beginning timestamps of each word. The dataset has been used in this recent [https://ieeexplore.ieee.org/abstract/document/7952235/references paper].

* file duration up to 1:11 (total time: 11m)
* 1181 words annotated in total

==== Jamendo Dataset ====
This dataset contains 20 full-duration music pieces covering 10 different Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment. It is available online on [https://github.com/f90/jamendolyrics Github]. Note, however, that we do not allow tuning model parameters on this data; it can only be used to gain insight into the general structure of the test data. For more information also refer to [https://arxiv.org/abs/1902.06797 this paper].

* file duration up to 4:43 (total time: 1h 12m)
* 5677 words annotated in total

=== Phonetization ===
A popular choice for phonetizing the words is the [http://www.speech.cs.cmu.edu/cgi-bin/cmudict CMU pronunciation dictionary]. One can phonetize them with the [http://www.speech.cs.cmu.edu/tools/lextool.html online tool]. A list of all words from both datasets that are outside the [https://github.com/georgid/AlignmentDuration/blob/noteOnsets/src/for_english/cmudict.0.6d.syll list of CMU words] is given [https://www.dropbox.com/s/flu4cpqff916bas/words_not_in_dict?dl=0 here].

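As a minimal sketch (not part of the task infrastructure), words can also be phonetized programmatically with the CMU dictionary shipped with NLTK; out-of-vocabulary words such as those in the list above still need manual handling or a grapheme-to-phoneme tool:

<pre>
# Minimal sketch: look up CMU pronunciations via NLTK
# (assumes nltk.download('cmudict') has been run once).
from nltk.corpus import cmudict

pron = cmudict.dict()  # maps lowercase word -> list of phoneme sequences

def phonetize(word):
    """Return the first CMU pronunciation, or None for out-of-vocabulary words."""
    entries = pron.get(word.lower())
    return entries[0] if entries else None

print(phonetize("carefree"))  # prints one phoneme sequence, e.g. ['K', 'EH1', 'R', ...]
</pre>
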
=== Audio Format ===

The data are wav/mp3 sound files, plus the associated word boundaries (in csv-like .txt/.tsv files).

* CD-quality (PCM, 16-bit, 44100 Hz)
* single channel (mono) for the a cappella versions and two channels (stereo) for the original versions

==Evaluation==

The submitted algorithms will be evaluated at the word boundaries of the original mixed songs (a cappella singing + instrumental accompaniment). Evaluation metrics on the a cappella singing can be reported as well on request, to gain insight into the impact of instrumental accompaniment on the algorithm, but will not be considered for the ranking.

'''Average absolute error/deviation''' Initially utilized in [http://www.cs.tut.fi/~mesaros/pubs/autalign_cr.pdf Mesaros and Virtanen (2008)], the absolute error measures the time displacement between the actual timestamp and its estimate at the beginning and the end of each lyrical unit. The error is then averaged over all individual errors. An error in absolute terms has the drawback that the perception of an error of the same duration can differ depending on the tempo of the song.
Here is a [https://github.com/georgid/AlignmentEvaluation/blob/126c3fa5fa1994acdcfbe3ea1344acfe71ae2b8e/test/EvalMetricsTest.py#L117 test] of using this metric.

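For orientation only, a minimal Python sketch of this metric over index-matched word onsets (hypothetical data; the linked AlignmentEvaluation code is the reference implementation):

<pre>
# Minimal sketch of average absolute error over word onsets (seconds).
def average_absolute_error(ref_onsets, det_onsets):
    errors = [abs(r - d) for r, d in zip(ref_onsets, det_onsets)]
    return sum(errors) / len(errors)

ref = [0.120, 0.800, 1.350]   # hypothetical reference onsets
det = [0.123, 0.798, 1.345]   # hypothetical detected onsets
print(average_absolute_error(ref, det))  # -> ~0.0033 seconds
</pre>
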
'''Percentage of correct segments''' The perceptual dependence on tempo is mitigated by measuring the ratio of the total duration of correctly labeled segments to the total duration of the song. This metric is suggested by [https://www.researchgate.net/publication/224241940_LyricSynchronizer_Automatic_Synchronization_System_Between_Musical_Audio_Signals_and_Lyrics Fujihara et al. (2011), Figure 9].
Here is a [https://github.com/georgid/AlignmentEvaluation/blob/126c3fa5fa1994acdcfbe3ea1344acfe71ae2b8e/test/EvalMetricsTest.py#L76 test] of using this metric.

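A simplified sketch of the idea, assuming index-matched (onset, offset) segments per word and counting the overlap between each detected segment and its reference segment (the linked test shows the exact behaviour of the official implementation):

<pre>
# Minimal sketch: share of the song's duration covered by correctly
# labeled word segments, approximated as reference/detected overlap.
def percentage_of_correct_segments(ref_segments, det_segments, song_duration):
    correct = 0.0
    for (r_on, r_off), (d_on, d_off) in zip(ref_segments, det_segments):
        correct += max(0.0, min(r_off, d_off) - max(r_on, d_on))
    return correct / song_duration

ref = [(0.12, 0.80), (0.80, 1.12), (1.35, 2.18)]  # hypothetical segments
det = [(0.10, 0.79), (0.79, 1.15), (1.30, 2.20)]
print(percentage_of_correct_segments(ref, det, song_duration=3.0))
</pre>
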
'''Percentage of correct estimates according to a tolerance window''' A metric that takes into consideration that onset displacements from the ground truth below a certain threshold can be tolerated by human listeners. We use 0.3 seconds as the tolerance window. This metric is suggested in [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment].
Here is a [https://github.com/georgid/AlignmentEvaluation/blob/126c3fa5fa1994acdcfbe3ea1344acfe71ae2b8e/test/EvalMetricsTest.py#L151 test] of using this metric.

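Again as a rough sketch over index-matched word onsets, with the 0.3 s tolerance from above:

<pre>
# Minimal sketch: fraction of word onsets within the tolerance window.
def percentage_within_tolerance(ref_onsets, det_onsets, tol=0.3):
    hits = sum(1 for r, d in zip(ref_onsets, det_onsets) if abs(r - d) <= tol)
    return hits / len(ref_onsets)

ref = [0.120, 0.800, 1.350]   # hypothetical reference onsets
det = [0.050, 0.798, 1.900]   # hypothetical detected onsets
print(percentage_within_tolerance(ref, det))  # -> 0.666... (2 of 3 within 0.3 s)
</pre>
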
For more detailed definitions and formulas for the metrics, please check Section 2.2.1 of [https://doi.org/10.5281/zenodo.841979 this thesis].

'''To obtain all three metrics for one detected output:'''

<code>python [https://github.com/georgid/AlignmentEvaluation/blob/master/align_eval/eval.py eval.py] <file path of the reference word boundaries> <file path of the detected word boundaries></code>

Note that the evaluation scripts depend on [https://github.com/craffel/mir_eval/ mir_eval].

== Submission Format ==

Every submission must be packed into a docker image that contains the bash file <code>main.sh</code> in the root folder.

=== Input Data ===
Participating algorithms will receive the following input format:

* Audio in wav format, 44.1 kHz, stereo.
* Lyrics in a .txt file where words are separated by spaces and each lyrics phrase is separated by a line break (\n).

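For illustration, a minimal Python sketch of reading these two inputs with the standard library (assuming 16-bit PCM wav; any audio library can of course be substituted):

<pre>
# Minimal sketch of loading the two inputs described above.
import wave

def load_inputs(audio_path, lyrics_path):
    with wave.open(audio_path, "rb") as wav:
        sample_rate = wav.getframerate()       # expected 44100
        n_channels = wav.getnchannels()        # expected 2 (stereo)
        pcm_bytes = wav.readframes(wav.getnframes())
    with open(lyrics_path, encoding="utf-8") as f:
        phrases = [line.split() for line in f if line.strip()]  # one phrase per line
    words = [word for phrase in phrases for word in phrase]
    return sample_rate, n_channels, pcm_bytes, phrases, words
</pre>
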
=== Output File Format ===

The alignment output file format is a tab-delimited ASCII text format: a three-column text file of the form

 <onset_time(sec)>\t<offset_time(sec)>\t<label>\n
 <onset_time(sec)>\t<offset_time(sec)>\t<label>\n
 ...

where \t denotes a tab and \n denotes the end of the line. The < and > characters are not included. An example output file would look something like:

 0.000    5.223    word1
 5.223    15.101   word2
 15.101   20.334   word3

'''NOTE:''' the offset timestamp column is used only by the percentage of correct segments metric. Therefore skipping the second column is acceptable and may degrade performance on that metric only.

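As a minimal sketch, the file above could be produced as follows, assuming the alignment is held as a list of (onset, offset, word) tuples in seconds (hypothetical variable names):

<pre>
# Minimal sketch: write the tab-delimited alignment output.
def write_alignment(alignment, out_path):
    with open(out_path, "w", encoding="utf-8") as f:
        for onset, offset, word in alignment:
            f.write(f"{onset:.3f}\t{offset:.3f}\t{word}\n")

write_alignment([(0.000, 5.223, "word1"), (5.223, 15.101, "word2"),
                 (15.101, 20.334, "word3")], "alignment.txt")
</pre>
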
=== Command line calling format ===

The submitted algorithm must take as arguments the .wav audio file, the .txt lyrics file, and the full output path and filename of the output file, in the following format:

<pre>main.sh %input_audio %input_txt %output_txt</pre>

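For illustration, <code>main.sh</code> could simply delegate to a Python entry point such as the hypothetical sketch below, where <code>align_words</code> stands in for the participant's own system and is replaced by a dummy here:

<pre>
# Minimal sketch of an entry point matching the calling format above
# (hypothetical file name), invoked as:
#   python align.py <input_audio> <input_txt> <output_txt>
import sys

def align_words(audio_path, lyrics_path):
    """Dummy stand-in: a real submission runs its alignment model here."""
    words = open(lyrics_path, encoding="utf-8").read().split()
    return [(i * 0.5, (i + 1) * 0.5, w) for i, w in enumerate(words)]

def main():
    input_audio, input_txt, output_txt = sys.argv[1], sys.argv[2], sys.argv[3]
    with open(output_txt, "w", encoding="utf-8") as f:
        for onset, offset, word in align_words(input_audio, input_txt):
            f.write(f"{onset:.3f}\t{offset:.3f}\t{word}\n")

if __name__ == "__main__":
    main()
</pre>
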
== Time and hardware limits ==

A Linux server with one Nvidia GeForce RTX 3090 is used for evaluation. CPU, OS, and memory specifications will be announced later.

Time limit: within 5 times the total duration of the test set.

== Bibliography ==

Stoller, D., Durand, S., and Ewert, S. (2019). End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model. ICASSP 2019.

Sharma, B. and Gupta, C. (2019). Automatic Lyrics-to-audio Alignment on Polyphonic Music Using Singing-adapted Acoustic Models. ICASSP 2019.

Lee, S. W. and Scott, J. (2017). Word-level Lyrics-Audio Synchronization Using Separated Vocals. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 646-650.

Chang, S. and Lee, K. (2017). Lyrics-to-Audio Alignment by Unsupervised Discovery of Repetitive Patterns in Vowel Acoustics. arXiv preprint arXiv:1701.06078.

Pons, J., Gong, R., and Serra, X. (2017). Score-informed Syllable Segmentation for A Cappella Singing Voice with Convolutional Neural Networks. ISMIR 2017.

Kruspe, A. (2016). Bootstrapping a System for Phoneme Recognition and Keyword Spotting in Unaccompanied Singing. ISMIR 2016.

Dzhambazov, G. and Serra, X. (2015). Modeling of Phoneme Durations for Alignment Between Polyphonic Audio and Lyrics. 12th Sound and Music Computing Conference.

Fujihara, H. and Goto, M. (2012). Lyrics-to-audio Alignment and its Application. Dagstuhl Follow-Ups (Vol. 3). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

Mauch, M., Fujihara, H., and Goto, M. (2012). Integrating Additional Chord Information into HMM-based Lyrics-to-audio Alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.

Fujihara, H., Goto, M., Ogata, J., and Okuno, H. G. (2011). LyricSynchronizer: Automatic Synchronization System Between Musical Audio Signals and Lyrics. IEEE Journal of Selected Topics in Signal Processing.

Mesaros, A. and Virtanen, T. (2008). Automatic Alignment of Music Audio and Lyrics. Proceedings of the 11th Int. Conference on Digital Audio Effects (DAFx-08), Espoo, Finland.
