2017:Automatic Lyrics-to-Audio Alignment
Latest revision as of 12:11, 8 December 2017
Description
The goal of automatic lyrics-to-audio alignment is to synchronize an audio recording of singing with its corresponding written lyrics. The beginning timestamps of lyrics units can be estimated at different granularities: phonemes, words, lyrics lines, or phrases. For this task, word-level alignment is required.
Data
Training Dataset
The DAMP dataset contains a large number (34 000+) of a cappella recordings from a wide variety of amateur singers, collected with the Sing! Karaoke mobile app in different recording conditions, but generally with good audio quality. A carefully curated subset, DAMPB, with 20 performances of each of the 300 songs, was created by Kruspe (2016). The list of recordings is available at https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing
- The audio can be downloaded from the Smule web site (https://ccrma.stanford.edu/damp/)
- No lyrics boundary annotations are available, but the textual lyrics are on the Smule Sing! Karaoke website (https://www.smule.com/songs)
Evaluation Datasets
Hansen's Dataset
The dataset contains 9 popular music songs in English with annotations of both the beginning and ending timestamps of each word. The ending timestamps are provided for convenience (they are copies of the next word's beginning timestamp) and are not used in the evaluation. Non-vocal segments are assigned a special word BREATH*. Sentence-level annotations are also provided. The audio comes in two versions: the original with instrumental accompaniment, and an a cappella version with the singing voice only. An example song can be seen here: https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0
Half of the dataset (https://www.dropbox.com/sh/evg395yz1ciyy2r/AABwUHXnVlXK_YrN1Rov7iU6a?dl=0) is being released after the competition!
You can read in detail about how the dataset was made here: Recognition of Phonemes in A-cappella Recordings using Temporal Patterns and Mel Frequency Cepstral Coefficients (http://smcnetwork.org/system/files/smc2012-198.pdf). The dataset has been kindly provided by Jens Kofod Hansen.
- file duration up to 4:40 minutes (total time: 35:33 minutes)
- 3590 words annotated in total
Mauch's Dataset
The dataset contains 20 popular music songs in English with annotations of the beginning timestamp of each word. Non-vocal sections are not explicitly annotated (they remain included in the last preceding word). We prefer to leave it this way, in order to enable comparison with previous work evaluated on this dataset. The audio has instrumental accompaniment. An example song can be seen here: https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 Note that "_" is used instead of "'" in the annotation.
Half of the dataset (https://www.dropbox.com/sh/y6kwqdgq8ous12e/AABWrMXOmLOZoNFO06STLQkAa?dl=0) is being released after the competition!
You can read in detail about how the dataset was used for the first time here: Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment (https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf). The dataset has been kindly provided by Sungkyun Chang.
- file duration up to 5:40 minutes (total time: 1:19:12 hours)
- 5050 words annotated in total
Phonetization
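A popular choice for phonetization of the words is the CMU pronunciation dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). Words can be phonetized with the online tool (http://www.speech.cs.cmu.edu/tools/lextool.html). A list of all words from both datasets that are not in the list of CMU words (https://github.com/georgid/AlignmentDuration/blob/noteOnsets/src/for_english/cmudict.0.6d.syll) is given here: https://www.dropbox.com/s/flu4cpqff916bas/words_not_in_dict?dl=0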
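A minimal sketch of a dictionary lookup, assuming the plain-text layout commonly used by cmudict files (one entry per line, the word followed by its phonemes, `;;;` for comment lines, alternate pronunciations marked with a parenthesised number). The sample entries and the function names are illustrative only, and the syllabified .syll variant linked above may be laid out differently.

```python
import re

# Tiny in-memory sample in the assumed cmudict layout (not the real file).
SAMPLE_DICT = """\
;;; comment lines start with three semicolons
HELLO  HH AH0 L OW1
HELLO(1)  HH EH0 L OW1
WORLD  W ER1 L D
"""

def load_pronunciations(text):
    """Map each upper-case word to a list of phoneme sequences."""
    pronunciations = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue
        head, *phonemes = line.split()
        word = re.sub(r"\(\d+\)$", "", head)  # drop an alternate marker such as (1)
        pronunciations.setdefault(word, []).append(phonemes)
    return pronunciations

def phonetize(words, pronunciations):
    """Return (phoneme sequences, out-of-dictionary words), using the first pronunciation."""
    found, missing = [], []
    for word in words:
        entry = pronunciations.get(word.upper())
        if entry is None:
            missing.append(word)
        else:
            found.append(entry[0])
    return found, missing

lexicon = load_pronunciations(SAMPLE_DICT)
phones, oov = phonetize(["hello", "world", "foobarish"], lexicon)
print(phones)
print(oov)
```

Words that end up in the out-of-dictionary list would need a manual or rule-based pronunciation before alignment.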
Audio Format
The data are wav/mp3 sound files, plus the associated word boundaries (in csv-like .txt/.tsv files).
- CD-quality (PCM, 16-bit, 44100 Hz)
- single channel (mono) for the a cappella versions and two channels (stereo) for the originals
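As a quick sanity check of the format listed above, the header of a wav file can be inspected with Python's standard `wave` module. This is only a sketch: the helper names are hypothetical, and the demo writes its own one-second silent file rather than touching the task data.

```python
import os
import tempfile
import wave

def describe_wav(path):
    """Return (channels, bits per sample, sample rate) read from the WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()

def is_cd_quality(path, expected_channels):
    """True if the file is 16-bit, 44100 Hz with the expected channel count."""
    return describe_wav(path) == (expected_channels, 16, 44100)

# Demo: write one second of mono silence, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes = 16 bits per sample
    wav.setframerate(44100)
    wav.writeframes(b"\x00\x00" * 44100)

print(describe_wav(demo_path))
print(is_cd_quality(demo_path, expected_channels=1))
```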
Evaluation
The submitted algorithms will be evaluated at the word boundaries of the original multi-instrumental songs. Evaluation metrics on the a cappella versions will be reported as well, to give insight into the impact of instrumental accompaniment on each algorithm, but they will not be considered for the ranking.
Average absolute error/deviation: Initially used in Mesaros and Virtanen (2008) (http://www.cs.tut.fi/~mesaros/pubs/autalign_cr.pdf), the absolute error measures the time displacement between the actual timestamp and its estimate at the beginning and the end of each lyrical unit. The individual errors are then averaged. The drawback of an absolute measure is that an error of the same duration can be perceived differently depending on the tempo of the song. To evaluate it, call this Python script: https://github.com/georgid/AlignmentEvaluation/blob/master/test/EvalMetricsTest.py#L159
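The averaging described above can be sketched in a few lines. This is an illustration, not the linked evaluation script: it assumes the reference and estimated lists hold one onset per word, in the same order, and the function name is hypothetical.

```python
def average_absolute_error(reference_onsets, estimated_onsets):
    """Mean absolute deviation (in seconds) between paired timestamps."""
    if len(reference_onsets) != len(estimated_onsets):
        raise ValueError("reference and estimate must have the same number of words")
    if not reference_onsets:
        raise ValueError("at least one timestamp is required")
    deviations = [abs(r - e) for r, e in zip(reference_onsets, estimated_onsets)]
    return sum(deviations) / len(deviations)

reference = [0.0, 5.0, 15.0]
estimate = [0.5, 5.0, 14.0]
print(average_absolute_error(reference, estimate))  # (0.5 + 0.0 + 1.0) / 3 = 0.5
```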
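Percentage of correct segments: The perceptual dependence on tempo is mitigated by measuring the total length of the correctly labeled segments as a percentage of the total duration of the song, a metric suggested by Fujihara et al. (2011, Figure 9) (https://www.researchgate.net/publication/224241940_LyricSynchronizer_Automatic_Synchronization_System_Between_Musical_Audio_Signals_and_Lyrics). To evaluate it, call this Python script: https://github.com/georgid/AlignmentEvaluation/blob/master/test/EvalMetricsTest.py#L164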
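One simplified reading of this metric is sketched below: for each word, the overlap between its reference interval and its estimated interval counts as correctly labeled time. This is an assumption made for illustration; the linked script and the original definition may treat non-vocal regions differently. The function name and the `total_duration` parameter are hypothetical.

```python
def percentage_correct_segments(reference, estimated, total_duration):
    """reference, estimated: lists of (onset, offset) per word, in the same order."""
    if len(reference) != len(estimated):
        raise ValueError("reference and estimate must have the same number of words")
    correct = 0.0
    for (r_on, r_off), (e_on, e_off) in zip(reference, estimated):
        overlap = min(r_off, e_off) - max(r_on, e_on)
        correct += max(0.0, overlap)  # disjoint intervals contribute nothing
    return 100.0 * correct / total_duration

reference = [(0.0, 5.0), (5.0, 15.0), (15.0, 20.0)]
estimate = [(0.0, 6.0), (6.0, 14.0), (14.0, 20.0)]
# Overlaps are 5 s, 8 s and 5 s, so 18 of 20 seconds are labeled correctly.
print(percentage_correct_segments(reference, estimate, total_duration=20.0))  # 90.0
```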
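To check both metrics, uncomment this line (https://github.com/georgid/AlignmentEvaluation/blob/master/test/EvalMetricsTest.py#L98) for Hansen's dataset and this line (https://github.com/georgid/AlignmentEvaluation/blob/master/test/EvalMetricsTest.py#L102) for Mauch's dataset. Note that the evaluation scripts depend on mir_eval (https://github.com/craffel/mir_eval/).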
Submission Format
Submissions to this task must conform to the format detailed below. Submissions should be packaged and contain at least two files: the algorithm itself and a README containing contact information and detailing, in full, the use of the algorithm.
Input Data
Participating algorithms will have to read input in the following formats:
- Audio for the original songs in wav (stereo)
- Lyrics in a .txt file where words are separated by spaces and lyrics lines are separated by newlines.
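Reading a lyrics file in this layout could look like the following sketch. The function name is hypothetical, and the demo parses an in-memory string instead of a real file.

```python
def parse_lyrics(text):
    """Return (lines, words): words grouped per lyrics line, and the flat word list."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    words = [word for line in lines for word in line]
    return lines, words

sample = "hello darkness my old friend\nI've come to talk\n"
lines, words = parse_lyrics(sample)
print(len(lines), len(words))
print(words[:3])
```

The flat word list is what needs one output row per word; the per-line grouping is kept in case an algorithm wants to exploit line boundaries.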
Output File Format
The alignment output file format is a tab-delimited ASCII text format.
It is a three-column text file of the format:
 <onset_time(sec)>\t<offset_time(sec)>\t<word>\n
 <onset_time(sec)>\t<offset_time(sec)>\t<word>\n
 ...
where \t denotes a tab, \n denotes the end of line. The < and > characters are not included. An example output file would look something like:
 0.000    5.223    word1
 5.223    15.101   word2
 15.101   20.334   word3
NOTE: the end timestamps column is used only by the percentage of correct segments metric. Skipping the second column is therefore acceptable, but it could degrade the result for that metric (and only that metric).
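A minimal sketch of producing such a file is given below. When an algorithm only estimates onsets, one option (an assumption here, mirroring the convention described for Hansen's annotations) is to use the next word's onset as the current word's offset. The function names and the `final_offset` parameter are hypothetical.

```python
import io

def fill_offsets(onsets, final_offset):
    """Pair each onset with the next onset; the last word ends at final_offset."""
    return list(zip(onsets, onsets[1:] + [final_offset]))

def write_alignment(words, segments, stream):
    """Write one tab-delimited row per word: onset, offset, word."""
    for word, (onset, offset) in zip(words, segments):
        stream.write(f"{onset:.3f}\t{offset:.3f}\t{word}\n")

words = ["word1", "word2", "word3"]
segments = fill_offsets([0.0, 5.223, 15.101], final_offset=20.334)
buffer = io.StringIO()
write_alignment(words, segments, buffer)
print(buffer.getvalue(), end="")
```

In a real submission the stream would be the output file opened at the path given on the command line.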
Command line calling format
The submitted algorithm must take as arguments a .wav file, a .txt file, and the full output path and filename of the output file. The ability to specify the output path and file name is essential. Denoting the input .wav file path and name as %input_audio, the lyrics .txt file as %input_txt, and the output file path and name as %output, a program called foobar could be called from the command line as follows:
 foobar %input_audio %input_txt %output
 foobar -i %input_audio -it %input_txt -o %output
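For submissions written in Python, the flag-based calling form could be handled with the standard `argparse` module, as in the sketch below. The program name foobar comes from the example above; the destination names and help texts are illustrative, and the demo parses a fixed argument list instead of the real command line.

```python
import argparse

def build_parser():
    """Command-line interface matching: foobar -i AUDIO -it LYRICS -o OUTPUT."""
    parser = argparse.ArgumentParser(prog="foobar", description="Lyrics-to-audio aligner")
    parser.add_argument("-i", dest="input_audio", required=True, help="input .wav file")
    parser.add_argument("-it", dest="input_txt", required=True, help="input lyrics .txt file")
    parser.add_argument("-o", dest="output", required=True, help="output alignment file")
    return parser

# Demo with an explicit argument list; a real program would call parse_args() with no arguments.
args = build_parser().parse_args(["-i", "song.wav", "-it", "song.txt", "-o", "out/song.txt"])
print(args.input_audio, args.input_txt, args.output)
```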
README File
A README file accompanying each submission should contain explicit instructions on how to run the program (as well as contact information, etc.). In particular, each command line to run should be specified, using %input_audio for the input sound file, %input_txt for the lyrics file, and %output for the resulting text file.
Packaging submissions
Please provide submissions as a binary or source code.
Time and hardware limits
Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be imposed. A hard limit of 24 hours will be imposed on analysis times. Submissions exceeding this limit may not receive a result.
Submission opening date
21 July
Submission closing date
11 September
Bibliography
Chang, S., & Lee, K. (2017). Lyrics-to-Audio Alignment by Unsupervised Discovery of Repetitive Patterns in Vowel Acoustics. arXiv preprint arXiv:1701.06078.
Dzhambazov, G. (2017). Knowledge-based probabilistic modeling for tracking lyrics in music audio signals. PhD thesis.
Kruspe, A. (2016). Bootstrapping a System for Phoneme Recognition and Keyword Spotting in Unaccompanied Singing. ISMIR 2016.
Mesaros, A. (2013). Singing voice identification and lyrics transcription for music information retrieval (invited paper). 2013 7th Conference on Speech Technology and Human-Computer Dialogue (SpeD), 1-10.
Fujihara, H., & Goto, M. (2012). Lyrics-to-audio alignment and its application. In Dagstuhl Follow-Ups (Vol. 3). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
Mauch, M., Fujihara, H., & Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.
Potential Participants
Nikolaos Tsipas nitsipas [at] auth [dot] gr
Anna Kruspe kpe [at] idmt [dot] fraunhofer [dot] de

