Difference between revisions of "2006:Evalutron6000 Issues"
Revision as of 15:33, 22 August 2006
Introduction
This page is intended to provide a detailed explication of the Evalutron 6000 setup for the Audio Music Similarity and Retrieval and Symbolic Melodic Similarity tasks.
Please read it through carefully.
Also, please carefully read through the Evalutron6000 Walkthrough.
Official Description of the Evalutron 6000 System and Protocol
The following text is drawn from the University of Illinois Institutional Review Board documents with some minor edits and abridgements for clarity.
Official Title of Evalutron 6000 Protocol: Music Similarity Grading System: Collecting Ground-Truth Similarity Information for Music Information Retrieval Evaluations
The IRB number for this research protocol is IRB# 07066.
Task Description: For the “similarity-based” class of MIR/MDL tasks, the submitted music retrieval systems are tested on whether (or not) they can find one or more songs “similar” to the standardized “query” songs that were selected by IMIRSEL. For each “query song,” each retrieval system under evaluation gives IMIRSEL a list of songs that it “thinks” are similar to the “query song”. These songs are called “candidate songs” (or simply, “candidates”). IMIRSEL gathers up these lists of candidates and then makes a “master list” of all the candidates for each query song. At this point, we turn to volunteers drawn from the MIR/MDL research community to act as similarity “graders.” The graders’ job is to determine whether each query/candidate pair is similar (or not). Once we have collected and aggregated the similarity judgments for each query/candidate pair, we can then go back and determine, for each system, how many similar (and not similar) songs were retrieved for each query. Once the aggregated ground-truth data has been established, the evaluation procedures concerning the success (or failure) of the systems become automatic (i.e., no more human involvement) as the scoring of each system is done algorithmically by a series of computer programmes.
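The pooling step described above can be illustrated with a minimal Python sketch. It is not the IMIRSEL implementation: the function name, the system names, and the song identifiers are hypothetical, and the removal of duplicate candidates returned by more than one system is an assumption that the protocol text does not state.

```python
from collections import defaultdict


def build_master_lists(system_results):
    """Pool the candidate lists of all systems into one list per query.

    system_results maps a (hypothetical) system name to a dict of
    query id -> list of candidate ids returned by that system.
    Duplicates across systems are merged; this is an assumption.
    """
    pooled = defaultdict(set)
    for per_query in system_results.values():
        for query, candidates in per_query.items():
            pooled[query].update(candidates)
    # Sorting makes the output deterministic and drops any ordering
    # that could hint at which system contributed a candidate.
    return {query: sorted(cands) for query, cands in pooled.items()}


# Hypothetical example data: two systems, two queries.
results = {
    "system_a": {"q1": ["s3", "s7"], "q2": ["s1"]},
    "system_b": {"q1": ["s7", "s9"], "q2": ["s4"]},
}
master = build_master_lists(results)
print(master)
```

Because the master list records only query/candidate pairs and not the contributing system, graders working from it cannot tell which submission produced a given candidate.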
Location: IMIRSEL has developed a web-based similarity judgment collection system currently called the Evalutron 6000. The Evalutron 6000 is located on the servers of the Graduate School of Library and Information Science, UIUC. A "walkthrough" with screen shots of the system prototype is located at the Evalutron6000 Walkthrough page. The Evalutron 6000 is set up to:
1.	Begin with a registration page that is designed to:
- 1.a. Ensure that each grader has read and agreed to the informed consent information before proceeding further
- 1.b. Establish a username/password pair which is needed for:
- 1.b.i. Data security and integrity
- 1.b.ii. Affording the graders the ability to “come and go” from the system as they wish because the system uses the username to record which query/candidate pairs have already been evaluated by each user. This also allows graders to modify earlier judgments if they want.
- 1.b.iii. Affording us the ability to delete specific grader scores should a grader decide to withdraw or a security breach is detected, etc.
- 1.b.iv. Affording the algorithm underlying the system the ability to evenly distribute sub-sets of the query/candidate master lists across graders. This helps us minimize the burden on the graders (i.e., they need not evaluate every possible query/candidate pair).
- 1.c. Provide us with a contact email address to help us communicate with the graders over such issues as:
- 1.c.i. System problems
- 1.c.ii. Verification of membership in the MIR/MDL research community, etc.
- 1.c.iii. Confirmation of withdrawals from participation
2. Provide the graders with a set of web pages, one for each query song, that present:
- 2.a. The ability to hear, stop, start, rewind, etc. the query song
- 2.b. The sub-set listing (~15 to ~20) of the candidates for the query along with:
- 2.b.i. The ability to hear, stop, start, rewind, etc. each candidate song
- 2.b.ii. An input mechanism to indicate and record the grader’s similarity judgment for each candidate
3. Use a robust, password-protected, PHP/MySQL database system to:
- 3.a. Collect and maintain the query/candidate lists
- 3.b. Generate the candidate sub-set lists seen by the graders so that each candidate is graded by roughly equal numbers of graders
- 3.c. Collect and preserve the raw grader scores, response times, etc.
- 3.d. Process the raw grader scores into aggregate similarity ground-truth values
- 3.e. Collect and maintain the confidential grader identification and consent information
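Items 1.b.iv and 3.b call for spreading the query/candidate pairs across graders so that each candidate is graded by roughly equal numbers of graders. The protocol does not say how the system does this. The Python sketch below shows one simple possibility, a round-robin assignment; the function name, grader labels, and pair identifiers are hypothetical, and it assigns individual pairs rather than the per-query pages the graders actually see.

```python
from itertools import cycle


def assign_pairs(pairs, graders, votes_per_pair):
    """Give each pair to `votes_per_pair` distinct graders, round-robin.

    Because consecutive graders in the rotation are all different,
    no grader receives the same pair twice, and grader loads differ
    by at most one pair.
    """
    if votes_per_pair > len(graders):
        raise ValueError("need at least as many graders as votes per pair")
    assignments = {grader: [] for grader in graders}
    rotation = cycle(graders)
    for pair in pairs:
        for _ in range(votes_per_pair):
            assignments[next(rotation)].append(pair)
    return assignments


# Hypothetical example: 10 pairs, 4 graders, 3 votes per pair.
pairs = [("q1", f"c{i}") for i in range(10)]
graders = ["g1", "g2", "g3", "g4"]
assignments = assign_pairs(pairs, graders, votes_per_pair=3)
for grader, assigned in assignments.items():
    print(grader, len(assigned))
```

A production system would also have to cope with graders who register late or withdraw, which a fixed rotation like this one does not handle.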
Time Commitment: Evaluation of a single query against a candidate list takes approximately 8-10 minutes; completing 20 queries will take a grader approximately 2 to 3 hours. Graders can stop at any time and resume grading, allowing them to complete the entire process in stages.
Date and Duration Issues: This protocol is designed to generate similarity ground-truth data for the evaluation of MIR/MDL systems under the conditions that pertain to the MIREX evaluation tasks. These tasks are ever-evolving as new test collections are generated and new algorithms are developed. Because of this constant evolution, every so often, at unpredictable intervals, IMIRSEL will need to collect new ground-truth data sets under this protocol. Thus, in general, this protocol has no fixed end date.
Measures: Currently a query/candidate song pair is deemed to be similar by simple majority vote of the graders (e.g., if two out of three graders say the pair is similar then we will deem that pair to be similar). Future evaluation runs might be set up to include the ability to gather “relative” similarity scores (i.e., on a scale from, say, 1 to 10, etc.) and/or “comparative” scores (i.e., song A is more/less similar to the query than song B, etc.). None of the aforementioned possible scoring variants materially affects the underlying purpose of the protocol under consideration which is, again, the establishment of similarity values between query/candidate pairs.
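The simple-majority rule can be sketched in a few lines of Python. The function name and the data layout are hypothetical. The treatment of ties is also an assumption: the protocol gives only an odd-numbered example, so this sketch counts an evenly split vote as "not similar".

```python
from collections import defaultdict


def aggregate_majority(judgments):
    """Aggregate binary grader votes into one verdict per pair.

    judgments: iterable of (query, candidate, is_similar) tuples,
    one tuple per grader vote. A pair is deemed similar only when
    strictly more than half of its votes say so (ties -> False,
    which is an assumption, not part of the protocol text).
    """
    tallies = defaultdict(lambda: [0, 0])  # [similar votes, total votes]
    for query, candidate, is_similar in judgments:
        tally = tallies[(query, candidate)]
        if is_similar:
            tally[0] += 1
        tally[1] += 1
    return {pair: 2 * yes > total for pair, (yes, total) in tallies.items()}


# Hypothetical votes: three graders per pair.
votes = [
    ("q1", "c1", True), ("q1", "c1", True), ("q1", "c1", False),
    ("q1", "c2", False), ("q1", "c2", True), ("q1", "c2", False),
]
ground_truth = aggregate_majority(votes)
print(ground_truth)
```

Here the pair (q1, c1) is deemed similar with two of three votes, matching the example in the text, while (q1, c2) is not.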
Derivative aggregate data (i.e., not specific to any individual grader) concerning the data collection process will be analyzed for possible system improvements and generalizations. Such aggregate data would include, for example, descriptive measures of number of pairs judged, ratios of similar to not similar judgments, estimates of judgment times, etc.
About Graders and Data Integrity: The volunteer graders will be drawn from the MIR/MDL researcher community and NOT the general public. We want to use only those graders that have a vested interest in making honest similarity assessments. MIR/MDL researchers have this vested interest in honesty as the creation of valid ground-truth data and the subsequent evaluation of their MIR/MDL systems based upon valid similarity ground-truth provide for them a scientifically valid basis for future research and development. To further ensure data integrity, the protocol is set up to be “double-blind” in the same manner that peer-reviewing is double-blind. In a very real sense, this protocol is peer-reviewing. Because IMIRSEL runs the submitted systems in-house against music collections unknown to the submitters, AND our protocol aggregates lists of query/candidate pairs from across all submissions, none of the graders knows from which system a particular query/candidate comes (i.e., it could even come from their own system and they would have no way of knowing!). The submitters of the systems under evaluation also have no way of knowing which graders provided similarity judgments for their systems as all the grading information is stripped of grader identifiers and then aggregated into a collective set of similarity values.
Informed Consent Issues:
- All graders are over the age of 18.
- The system is designed to only allow access to the grading steps if the grader has confirmed that the grader has read and agrees to the informed consent document presented during the registration procedure. Consent is logged in the password-protected database system for each username/password to allow graders to come and go freely from the system.
Dissemination of Results: Research results derived from this protocol will be used in a variety of ways. They will likely be presented in conference presentations and/or other academic presentations, as well as conference proceedings, and/or journal or book articles. Such publication venues may include: the annual MIREX meetings and accompanying web pages; the International Conferences on Music Information Retrieval (ISMIR); the ACM Multimedia conference; the ACM SIG Information Retrieval conference; the ACM/IEEE Joint Conference on Digital Libraries; the Journal of the American Society for Information Science and Technology; the Information Processing and Management Journal; and other similar venues. Copies of such publications may also be deposited into an institutional repository or made available on researchers’ homepages, etc.
Audio Similarity Assumptions
As of 22 August 2006 here is a quick sketch of the parameter space that we are working under (subject to realities):
- # of votes/eyes per candidate: 5
- # of algorithms evaluated: 6
- # of candidates per query per algo: 5
- # of queries: 30
- # of comparisons: 4500
- # of "graders": 20
- # of comparisons per "grader": 225
- # of seconds assumed per comparison: 45
- # of total seconds to grade per grader: 10,125
- # of hours to grade per grader: 2.812
As you can see, change one value, and the rest re-jiggers itself (sometimes in very nasty ways).
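The dependency between these values can be made explicit with a short Python sketch. The function and parameter names are hypothetical; the formulas are inferred from the listed figures, which they reproduce exactly (comparisons = votes x algorithms x candidates per algorithm x queries, divided evenly among graders).

```python
def grader_workload(votes, algorithms, candidates_per_algo, queries,
                    graders, seconds_per_comparison):
    """Derive the per-grader workload from the basic parameters."""
    comparisons = votes * algorithms * candidates_per_algo * queries
    per_grader = comparisons / graders
    seconds = per_grader * seconds_per_comparison
    return {
        "comparisons": comparisons,
        "per_grader": per_grader,
        "seconds": seconds,
        "hours": seconds / 3600,
    }


# Audio similarity figures as of 22 August 2006.
audio = grader_workload(votes=5, algorithms=6, candidates_per_algo=5,
                        queries=30, graders=20, seconds_per_comparison=45)
# Symbolic melodic similarity figures as of 22 August 2006.
symbolic = grader_workload(votes=3, algorithms=10, candidates_per_algo=10,
                           queries=16, graders=15, seconds_per_comparison=45)
print(audio)     # 4500 comparisons, 225 per grader, 10125 s, 2.8125 h
print(symbolic)  # 4800 comparisons, 320 per grader, 14400 s, 4.0 h
```

Changing any single argument and re-running the function shows how the remaining values shift.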
Potential Audio Similarity "Graders"
- J. Stephen Downie, IMIRSEL
- Andreas Ehmann, IMIRSEL
- Kris West, IMIRSEL
- Xiao Hu, IMIRSEL
- Mert Bay, IMIRSEL
- M. Cameron Jones, IMIRSEL
- Paul Lamere, IMIRSEL
- Martin McCrory, IMIRSEL
- Qin Wei, IMIRSEL
- Anatoliy Gruzd, IMIRSEL
- Tamar Berman, IMIRSEL
- Jin Ha Lee, IMIRSEL
- Beth Logan
- Mark Levy, QMUL
Symbolic Melodic Similarity Assumptions
As of 22 August 2006 here is a quick sketch of the parameter space that we are working under (subject to realities):
- # of votes/eyes per candidate: 3
- # of algorithms evaluated: 10
- # of candidates per query per algo: 10
- # of queries: 16
- # of comparisons: 4800
- # of "graders": 15
- # of comparisons per "grader": 320
- # of seconds assumed per comparison: 45
- # of total seconds to grade per grader: 14,400
- # of hours to grade per grader: 4 (Comment from JSD: this value is too high.)
As you can see, change one value, and the rest re-jiggers itself (sometimes in very nasty ways).
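One way to explore the trade-off is to ask how many graders would be needed to keep the per-grader time under a chosen ceiling. The Python sketch below is illustrative only: the function name is hypothetical, and the 3-hour ceiling is an assumption taken from the upper end of the 2-to-3-hour time commitment stated in the protocol, not a decision recorded on this page.

```python
import math


def graders_needed(comparisons, seconds_per_comparison, max_hours):
    """Smallest number of graders that keeps each grader's share of
    the total grading time at or below `max_hours`, assuming the
    comparisons are split evenly."""
    total_seconds = comparisons * seconds_per_comparison
    return math.ceil(total_seconds / (max_hours * 3600))


# Symbolic melodic similarity: 4800 comparisons at 45 seconds each.
print(graders_needed(4800, 45, max_hours=4))  # 15, the current figure
print(graders_needed(4800, 45, max_hours=3))  # 20
```

Under these assumptions, bringing the symbolic task down to 3 hours per grader would take 20 graders instead of 15; the alternative is to reduce one of the other parameters, such as the number of candidates per query.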
Potential Symbolic Melodic Similarity "Graders"
- J. Stephen Downie, IMIRSEL
- Andreas Ehmann, IMIRSEL
- Kris West, IMIRSEL
- Xiao Hu, IMIRSEL
- Mert Bay, IMIRSEL
- M. Cameron Jones, IMIRSEL
- Martin McCrory, IMIRSEL
- Qin Wei, IMIRSEL
- Anatoliy Gruzd, IMIRSEL
- Tamar Berman, IMIRSEL
- Jin Ha Lee, IMIRSEL
Informed Consent Text Template
The following text is the approved Informed Consent Information Form template for the Evalutron 6000 System and Protocol.
Music Similarity Grading System
Investigators:
Responsible Project Investigator: J. Stephen Downie (jdownie@uiuc.edu)
Graduate Assistant: M. Cameron Jones (mjones2@uiuc.edu)
Graduate Assistant: Anatoliy Gruzd (agruzd2@uiuc.edu)
The following music similarity grading system is designed to capture human judgments concerning the similarity of various query songs and those songs deemed to be similar to them (also known as “candidates”) by one or more Music Information Retrieval (MIR) systems. The similarity judgments you provide for each query/candidate pair will be aggregated with the judgments of other graders to form a ground-truth set of similarity judgments. The ground-truth set that you help us create will then be used to evaluate the performance of MIR systems. You have been asked to participate as a similarity grader because of your active involvement in the MIR and/or Music Digital Library research domains.
Your participation as a grader is completely voluntary. If you come to any selections you do not want to grade, please feel free to skip to the next selection. You may also start and stop your grading sessions, and modify your judgments as you see fit, up to [insert date] when we will be closing the collection process. You may discontinue participation at any time, including after the completion of the grading, for any reason. In the event that you choose to stop participation, you may ask us to have your answers deleted by contacting us through email prior to [insert date] when we will be aggregating the collected data.
All personally identifying information of the graders, however obtained, (e.g., name, company of employment, place of residence, names of collaborators, email addresses, website URLs, response times, etc.) will be kept confidential, meaning accessible by only the investigators and not published nor shared with other researchers. The original raw grader scores will not be distributed nor disseminated beyond the investigators and will be kept locked in a University office and on restricted access (i.e., password-protected) areas of the investigators’ computers. Data will be retained until the end of IMIRSEL’s active involvement in MIR/MDL evaluations, for a minimum of three years after its collection and as long as is necessary to complete the necessary analyses of the data.
Benefits of Participation
The sharing of your knowledge of which queries are similar to which candidates will contribute to a fuller understanding of music similarity in general, and also aid in the development of algorithms and systems designed to identify and locate similar musical works.
Risks of Participation
Participation as a grader does not involve risks beyond those encountered in daily life. 
Time Commitment
Evaluation of a single query against a candidate list takes approximately 8-10 minutes; completing 20 queries will take you approximately 2 to 3 hours. You can stop at any time and resume grading, allowing you to complete the entire process in stages.
Contact Information
If you have any questions or concerns about this study, please contact the investigators.
Project contact address: c/o Dr. J. Stephen Downie, Graduate School of Library and Information Science, 501 E. Daniel St., Champaign, IL 61820; phone: 217-649-3839, fax: 217-244-3302.
If you have any general questions about your rights as a participant in this study, please contact the University of Illinois Institutional Review Board at 217-333-2670 (you may call collect if you identify yourself as a research participant) or via email at irb@uiuc.edu.
CONSENT TO GRADING PARTICIPATION
Music Similarity Grading System
YOU MUST BE 18 YEARS OF AGE OR OLDER TO PARTICIPATE!
I certify that I am 18 years of age or older, I can print out a copy of this consent form, I have read the preceding and that I understand its contents. By selecting "I Agree" below I am freely agreeing to participate in this study by filling out the survey.
{Choose one}
( ) Yes, I Agree
( ) No, I Disagree

