2020:Singing Transcription from Polyphonic Music
Description
The goal of this task is to transcribe the vocal part of polyphonic music into a series of notes, each denoted by three numbers: onset, offset, and score pitch. The input is a music recording (mostly pop music) that contains a vocal part and accompaniment, and the output is a series of notes. The vocal part is monophonic, but the accompaniment is not.
Therefore, algorithms that separate the vocal part from the audio may be used as a preprocessing step. However, this is not a necessary component of a singing transcription algorithm; algorithms that perform singing transcription directly on the mixed audio are also welcome.
This task is different from “audio melody extraction”: melody extraction aims to determine a pitch for each frame, while singing transcription aims to determine the notes of the vocal part.
It is also worth noting that the term “singing transcription” is not used consistently. Some studies [1][2] use it for the task of transcribing polyphonic music into notes, while others [3][4] seem to use it for transcribing monophonic signals without accompaniment into notes, since both [3] and [4] created monophonic datasets (without accompaniment) for “automatic singing transcription”.
Therefore, to make the name more specific, we refer to the task of transcribing polyphonic music that contains only a monophonic vocal part into notes as “singing transcription from polyphonic music”.
Data
Two datasets can be used to construct and evaluate a model for singing transcription:
RWC Music Database: Popular Music (RWC-MDB-P)
We can use the “Popular Music Database” part of the RWC database [5] for this task. RWC-MDB-P consists of 100 songs together with annotations in MIDI format (the AIST annotation). The database is available at https://staff.aist.go.jp/m.goto/RWC-MDB/rwc-mdb-p.html, and the annotations are available at https://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation. After excluding 6 songs (No. 3, 5, 8, 10, 23, and 66) that have multiple singers (i.e., the melody part is not monophonic), the remaining 94 songs can be used for this task.
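The AIST annotation provides the melody notes of each song as a MIDI file. As a rough illustration only, a minimal sketch of reading such a file with the pretty_midi package might look as follows; the file name and the assumption that the melody is stored as the first instrument track are hypothetical and must be checked against the actual annotation files.

# Sketch: read note events (onset, offset, MIDI pitch) from an annotation MIDI file.
# Assumptions: pretty_midi is installed; the melody is one instrument track (verify per file).
import pretty_midi

pm = pretty_midi.PrettyMIDI("RM-P001_melody.mid")   # hypothetical file name
melody = pm.instruments[0]                          # verify which track holds the melody
notes = [(n.start, n.end, n.pitch) for n in melody.notes]
print(notes[:5])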
Cmedia dataset
This dataset consists of 200 YouTube links to pop songs (most of them Chinese songs), together with ground-truth vocal transcription files. We will release 100 of them as the open set for training/validation (the training set), and use the other 100 as the hidden set for testing.
The training set can be downloaded at https://drive.google.com/file/d/15b298vSP9cPP8qARQwa2X_0dbzl6_Eu7/. We strongly suggest that participants use this dataset for training (if the algorithm is data-driven), since the properties of the Cmedia training set are close to those of the Cmedia hidden set.
Evaluation
We will use the Python package “mir_eval” [6] to evaluate the accuracy of a transcription by computing the COnPOff, COnP, and COn metrics described in [4]. These metrics compute the maximum number of ground-truth notes that are correctly transcribed. Each ground-truth note can be matched with at most one transcribed note, and vice versa. Three rules are used to determine whether two notes match:
1. The onset difference is less than the threshold (100 ms in this competition).
2. The pitch difference is less than the threshold (50 cents in this competition).
3. The offset difference is less than a threshold of max(50 ms, 0.2 × duration of the ground-truth note) in this competition.
Two notes must satisfy all three conditions above to be considered “correctly transcribed” under the COnPOff metric. COnP only requires conditions (1) and (2), while COn only requires condition (1). We will compute the F1-score (the harmonic mean of precision and recall) of COnPOff, COnP, and COn on each song. The final reported results are the average F1-scores of the three metrics over all songs.
In fact, the COnPOff metric is the same as the evaluation metric of the “note tracking” subtask in the MIREX “Multiple Fundamental Frequency Estimation & Tracking” task (https://www.music-ir.org/mirex/wiki/2020:Multiple_Fundamental_Frequency_Estimation_%26_Tracking). The only difference is that the onset threshold is set to 100 ms instead of 50 ms, due to the difficulty of labeling (and, perhaps, transcribing) the onsets of the singing voice.
A simple evaluation script can be downloaded at https://drive.google.com/file/d/1Uw-MQA14XGypwXaADdvV9Hg_URsWD4WV.
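For reference, the same three metrics can also be computed locally with mir_eval. The following is only a minimal sketch under the tolerances stated above; the file names are hypothetical, and the official evaluation script above remains authoritative.

# Sketch: compute COnPOff, COnP, and COn F1-scores for one song with mir_eval.
import numpy as np
import mir_eval

def load_notes(path):
    """Load 'onset offset midi_pitch' lines; mir_eval expects pitches in Hz."""
    data = np.loadtxt(path, ndmin=2)
    intervals = data[:, :2]                                # onset/offset in seconds
    pitches = 440.0 * 2.0 ** ((data[:, 2] - 69.0) / 12.0)  # MIDI number -> Hz
    return intervals, pitches

ref_intervals, ref_pitches = load_notes("groundtruth.txt")  # hypothetical file names
est_intervals, est_pitches = load_notes("prediction.txt")

# COnPOff: onset (100 ms), pitch (50 cents), offset (max(50 ms, 20% of duration))
_, _, f_conpoff, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.1, pitch_tolerance=50.0,
    offset_ratio=0.2, offset_min_tolerance=0.05)

# COnP: onset and pitch only (offset_ratio=None disables the offset rule)
_, _, f_conp, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.1, pitch_tolerance=50.0, offset_ratio=None)

# COn: onset only
_, _, f_con = mir_eval.transcription.onset_precision_recall_f1(
    ref_intervals, est_intervals, onset_tolerance=0.1)

print(f"COnPOff={f_conpoff:.3f}  COnP={f_conp:.3f}  COn={f_con:.3f}")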
Submission Format
Input Format
Sample rate: 44.1 kHz
Sample size: 16 bit
Number of channels: 2 (stereo)
Encoding: WAV
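As a hint for handling the input, a minimal sketch of loading such a file and downmixing it to mono is given below; it assumes the soundfile package, although scipy.io.wavfile or librosa would work equally well.

# Sketch: load a 44.1 kHz, 16-bit, stereo WAV file and downmix to mono.
import soundfile as sf

audio, sr = sf.read("input.wav")   # audio has shape (num_samples, 2), sr == 44100
mono = audio.mean(axis=1)          # average the two channels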
Output Format
The algorithm should output a plain text file. Each line represents one note and contains three numbers: onset, offset, and score pitch. The onset and offset are floating-point values in seconds, while the score pitch should be an integer MIDI note number (in semitones). An example output file is shown below:
0.131 0.355 64
0.355 0.896 64
0.896 1.141 62
1.888 2.333 62
Since the vocal part is monophonic, in the ground truth the offset of a note is never greater than the onset of the next note, i.e., notes do not overlap. Also, the duration of a note is always positive; that is, the offset time of a note is always larger than its onset time. We strongly suggest that submitted algorithms follow these rules as well.
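As an illustration only, a minimal sketch that checks an output file against these two rules (the file name is hypothetical) could be:

# Sketch: verify that an output file has positive note durations and no overlapping notes.
def check_output(path):
    with open(path) as f:
        notes = [tuple(map(float, line.split())) for line in f if line.strip()]
    for onset, offset, _pitch in notes:
        assert offset > onset, "note duration must be positive"
    for (_, prev_offset, _), (next_onset, _, _) in zip(notes, notes[1:]):
        assert prev_offset <= next_onset, "notes must not overlap"

check_output("prediction.txt")   # hypothetical file name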
Command line calling format
The submitted algorithm must take as arguments a SINGLE .wav file on which to perform singing transcription, as well as the full path and filename of the output file. The ability to specify the output path and filename is essential. Denoting the input .wav file path and name as %input and the output file path and name as %output, a program called “main” would be called from the command line as follows:
./main %input %output
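For orientation, a minimal sketch of an entry point following this calling convention might look as follows; the transcribe function is only a placeholder for the participant's own algorithm.

# Sketch: command-line wrapper matching "./main %input %output".
import sys

def transcribe(wav_path):
    """Placeholder: return a list of (onset, offset, midi_pitch) tuples."""
    raise NotImplementedError

if __name__ == "__main__":
    input_wav, output_txt = sys.argv[1], sys.argv[2]
    with open(output_txt, "w") as f:
        for onset, offset, pitch in transcribe(input_wav):
            f.write(f"{onset:.3f} {offset:.3f} {int(pitch)}\n")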
Time limits
The time limit is 24 hours. Within 24 hours, the algorithm should transcribe all 200 songs in the Cmedia dataset, whose total duration is about 14 hours.
If the algorithm cannot transcribe all 200 songs in time, that is still acceptable. However, it must transcribe at least the 100 songs of the Cmedia hidden set within the time limit; otherwise no evaluation result can be reported.
The algorithm will be executed on a computer with 64 GB of memory and one NVIDIA GeForce GTX 1080 Ti GPU.
Questions?
If you have any questions about this task or the datasets, please feel free to send us an email at b06902046@ntu.edu.tw (Jun-You Wang) or roger.jang@gmail.com (Jyh-Shing Roger Jang).
Since this is a new MIREX task, we eagerly look forward to your participation and your help in getting everything on track.
Submission deadline
September 13th, 2020.
References
[1] M. Ryynänen and A. Klapuri, “Transcription of the Singing Melody in Polyphonic Music,” in Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR 2006), pp. 222–227, October 2006.
[2] R. Nishikimi, E. Nakamura, S. Fukayama, M. Goto, and K. Yoshii, “Automatic Singing Transcription Based on Encoder-Decoder Recurrent Neural Networks with a Weakly-Supervised Attention Mechanism,” in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[3] E. Gómez and J. Bonada, “Towards Computer-Assisted Flamenco Transcription: An Experimental Comparison of Automatic Transcription Algorithms as Applied to A Cappella Singing,” Computer Music Journal, 37(2):73–90, 2013.
[4] E. Molina, A. M. Barbancho-Perez, L. J. Tardón, and I. Barbancho-Perez, “Evaluation Framework for Automatic Singing Transcription,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), pp. 567–572, October 2014.
[5] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, “RWC Music Database: Popular, Classical, and Jazz Music Databases,” in Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR 2002), pp. 287–288, October 2002.
[6] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis, “mir_eval: A Transparent Implementation of Common MIR Metrics,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), 2014.