2006:Evalutron6000 Issues
Latest revision as of 10:50, 3 June 2010
Introduction
This page is intended to provide a detailed explication of the Evalutron 6000 setup for the 2006:Audio Music Similarity and Retrieval and 2006:Symbolic Melodic Similarity tasks.
Please read it through carefully.
Also, please carefully read through the two task-specific Evalutron Walkthroughs:
- 2006:Evalutron6000_Walkthrough_For_Symbolic_Melodic_Similarity
- 2006:Evalutron6000_Walkthrough_For_Audio_Music_Similarity_and_Retrieval
Official Description of the Evalutron 6000 System and Protocol
The following text is drawn from the University of Illinois Institutional Review Board documents with some minor edits and abridgements for clarity.
Official Title of Evalutron 6000 Protocol: Music Similarity Grading System: Collecting Ground-Truth Similarity Information for Music Information Retrieval Evaluations
The research protocol for this protocol is IRB# 07066.
Task Description: For the "similarity-based" class of MIR/MDL tasks, the submitted music retrieval systems are tested on whether (or not) they can find one or more songs "similar" to the standardized "query" songs that were selected by IMIRSEL. For each "query song," each retrieval system under evaluation gives IMIRSEL a list of songs that it "thinks" are similar to the "query song". These songs are called "candidate songs" (or simply, "candidates"). IMIRSEL gathers up these lists of candidates and then makes a "master list" of all the candidates for each query song. At this point, we turn to volunteers drawn from the MIR/MDL research community to act as similarity "graders." The graders' job is to determine whether each query/candidate pair is similar (or not). Once we have collected and aggregated the similarity judgments for each query/candidate pair, we can then go back and determine, for each system, how many similar (and not similar) songs were retrieved for each query. Once the aggregated ground-truth data has been established, the evaluation procedures concerning the success (or failure) of the systems become automatic (i.e., no more human involvement) as the scoring of each system is done algorithmically by a series of computer programs.
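The "master list" step described above can be sketched in a few lines of Python. This is an illustrative sketch, not IMIRSEL's actual implementation; the function name and data layout (a dict of per-system, per-query candidate lists) are our own assumptions:

```python
def build_master_lists(system_results):
    """Merge every system's candidate lists into one de-duplicated
    "master list" per query, keeping order of first appearance."""
    master = {}
    for system, per_query in system_results.items():
        for query, candidates in per_query.items():
            seen = master.setdefault(query, [])
            for cand in candidates:
                if cand not in seen:
                    seen.append(cand)
    return master

# Two hypothetical systems returning candidates for one query
results = {
    "systemA": {"q1": ["song1", "song2"]},
    "systemB": {"q1": ["song2", "song3"]},
}
print(build_master_lists(results))  # → {'q1': ['song1', 'song2', 'song3']}
```

De-duplication matters here: graders judge each query/candidate pair once, no matter how many systems returned that candidate.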
Location: IMIRSEL has developed a web-based similarity judgment collection system currently called the Evalutron 6000. The Evalutron 6000 is located on the servers of the Graduate School of Library and Information Science, UIUC. A "walkthrough" with screen shots of the system prototype is located at the 2006:Evalutron6000 Walkthrough page. The Evalutron 6000 is set up to:
1. Begin with a registration page that is designed to:
- 1.a. Ensure that each grader has read and agreed to the informed consent information before proceeding further
- 1.b. Establish a username/password pair which is needed for:
- 1.b.i. Data security and integrity
- 1.b.ii. Affording the graders the ability to "come and go" from the system as they wish because the system uses the username to record which query/candidate pairs have already been evaluated by each user. This also allows graders to modify earlier judgments if they want.
- 1.b.iii. Affording us the ability to delete specific grader scores should a grader decide to withdraw or a security breach is detected, etc.
- 1.b.iv. Affording the algorithm underlying the system the ability to evenly distribute sub-sets of the query/candidate master lists across graders. This helps us minimize the burden on the graders (i.e., they need not evaluate every possible query/candidate pair).
- 1.c. Provide us with contact email address to help us communicate with the graders over such issues as:
- 1.c.i. System problems
- 1.c.ii. Verification of membership in the MIR/MDL research community, etc.
- 1.c.iii. Confirmation of withdrawals from participation
2. Provide the graders with a set of web pages, one for each query song, that present:
- 2.a. The ability to hear, stop, start, rewind, etc. the query song
- 2.b. The sub-set listing (~15 to ~20) of the candidates for the query along with:
- 2.b.i. The ability to hear, stop, start, rewind, etc. each candidate song
- 2.b.ii. An input mechanism to indicate and record the grader's similarity judgment for each candidate
3. Use a robust, password-protected, PHP/MySql database system to:
- 3.a. Collect and maintain the query/candidate lists
- 3.b. Generate the candidate sub-set lists seen by the graders so that each candidate is graded by roughly equal numbers of graders
- 3.c. Collect and preserve the raw grader scores, response times, etc.
- 3.d. Process the raw grader scores into aggregate similarity ground-truth values
- 3.e. Collect and maintain the confidential grader identification and consent information
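The even-distribution requirement in 1.b.iv and 3.b can be satisfied with a simple round-robin assignment. The sketch below is our own illustration, not the Evalutron's actual algorithm; function and parameter names are assumptions:

```python
import itertools

def assign_pairs(pairs, graders, votes_per_pair):
    """Round-robin each query/candidate pair to votes_per_pair graders,
    keeping every grader's workload roughly equal.
    Assumes votes_per_pair <= len(graders), so a pair is never
    assigned to the same grader twice."""
    assignments = {g: [] for g in graders}
    cycle = itertools.cycle(graders)
    for pair in pairs:
        for _ in range(votes_per_pair):
            assignments[next(cycle)].append(pair)
    return assignments

pairs = [("q1", "cand1"), ("q1", "cand2"), ("q2", "cand1")]
workload = assign_pairs(pairs, ["grader1", "grader2", "grader3"], 2)
print([len(v) for v in workload.values()])  # → [2, 2, 2]
```

Because assignments advance through one continuous cycle of graders, workloads never differ by more than one pair, which is what lets the protocol cap each grader's burden at a sub-set of the master list.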
Time Commitment: Evaluation of a single query against a candidate list takes approximately 8-10 minutes, so completing 20 queries will take a grader approximately 2 to 3 hours. Graders can stop at any time and resume grading, allowing them to complete the entire process in stages.
Date and Duration Issues: This protocol is designed to generate similarity ground-truth data for the evaluation of MIR/MDL systems under the conditions that pertain to the MIREX evaluation tasks. These tasks are ever-evolving as new test collections are generated and new algorithms are developed. Because of this constant evolution, every so often, at unpredictable intervals, IMIRSEL will need to collect new ground-truth data sets under this protocol. Thus, in general, this protocol has no fixed end date.
Measures: Currently a query/candidate song pair is deemed to be similar by simple majority vote of the graders (e.g., if two out of three graders say the pair is similar then we will deem that pair to be similar). Future evaluation runs might be set up to include the ability to gather "relative" similarity scores (i.e., on a scale from, say, 1 to 10, etc.) and/or "comparative" scores (i.e., song A is more/less similar to the query than song B, etc.). None of the aforementioned possible scoring variants materially affects the underlying purpose of the protocol under consideration which is, again, the establishment of similarity values between query/candidate pairs.
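The simple-majority rule described above is straightforward to express in code. This is a minimal sketch under the stated rule (boolean similar/not-similar votes); the function name and data layout are our own:

```python
def aggregate_judgments(judgments):
    """Deem a query/candidate pair similar when a strict majority of
    its graders judged it similar (e.g., 2 of 3)."""
    return {pair: sum(votes) > len(votes) / 2
            for pair, votes in judgments.items()}

# Hypothetical boolean judgments from three graders per pair
judgments = {
    ("query1", "candA"): [True, True, False],   # 2 of 3 -> similar
    ("query1", "candB"): [True, False, False],  # 1 of 3 -> not similar
}
print(aggregate_judgments(judgments))
```

Note that with an even number of graders a strict majority means a tie counts as "not similar"; the "relative" and "comparative" scoring variants mentioned above would replace the boolean votes but leave this aggregation step structurally the same.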
Derivative aggregate data: (i.e., not specific to any individual grader) concerning the data collection process will be analyzed for possible system improvements and generalizations. Such aggregate data would include, for example, descriptive measures of number of pairs judged, ratios of similar to not similar judgments, estimates of judgment times, etc.
About Graders and Data Integrity: The volunteer graders will be drawn from the MIR/MDL research community and NOT the general public. We want to use only those graders who have a vested interest in making honest similarity assessments. MIR/MDL researchers have this vested interest in honesty because the creation of valid ground-truth data, and the subsequent evaluation of their MIR/MDL systems based upon it, provide a scientifically valid basis for future research and development. To further ensure data integrity, the protocol is set up to be "double-blind" in the same manner that peer-reviewing is double-blind. In a very real sense, this protocol is peer-reviewing. Because IMIRSEL runs the submitted systems in-house against music collections unknown to the submitters, AND our protocol aggregates lists of query/candidate pairs from across all submissions, none of the graders knows from which system a particular query/candidate pair comes (i.e., it could even be from their own system and they would have no way of knowing!). The submitters of the systems under evaluation also have no way of knowing which graders provided similarity judgments for their systems, as all the grading information is stripped of grader identifiers and then aggregated into a collective set of similarity values.
Informed Consent Issues:
- All graders are over the age of 18.
- The system is designed to only allow access to the grading steps if the grader has confirmed that the grader has read and agrees to the informed consent document presented during the registration procedure. Consent is logged in the password-protected database system for each username/password to allow graders to come and go freely from the system.
Dissemination of Results: Research results derived from this protocol will be used in a variety of ways. They will likely be presented in conference presentations and/or other academic presentations, as well as conference proceedings, and/or journal or book articles. Such publication venues may include: the annual MIREX meetings and accompanying web pages; the International Conferences on Music Information Retrieval (ISMIR); the ACM Multimedia conference; the ACM SIGIR conference; the ACM/IEEE Joint Conference on Digital Libraries; the Journal of the American Society for Information Science and Technology; the Information Processing and Management journal; and other similar venues. Copies of such publications may also be deposited into an institutional repository or made available on researchers' homepages, etc.
Audio Similarity Assumptions (MIREX 2006)
As of 22 August 2006 here is a quick sketch of the parameter space that we are working under (subject to realities): UPDATE: 23 August 2006: The basic shape of the following information will be constant but a few of the details are about to change. Will modify this once the AudioSim folks reach consensus. See the << notes for probable changes.
- # of votes/eyes per candidate: <<23 August: This number to decrease
- # of algorithms evaluated: 6
- # of candidates per query per algo: 5
- # of queries: 30 <<23 August: This number to increase
- # of comparisons: 4500
- # of "graders": 20 <<This number to increase
- # of comparisons per "grader": 225 <<23 August: This number to decrease
- # of seconds assumed per comparison: 45
- # of total seconds to grade per grader: 10,125 <<23 August: This number to decrease slightly
- # of hours to grade per grader: 2.812 <<23 August: This number to decrease slightly
As you can see, change one value, and the rest re-jiggers itself (sometimes in very nasty ways).
Potential Audio Similarity "Graders" (MIREX 2006)
22 August 2006: If you believe you might be interested in participating in the evaluation process as a similarity "grader," please read through the documentation thoroughly. At this point we are trying to determine the size of the "grader" pool, as it affects individual grader work loads, etc. I would appreciate it very much if you would add your name to the Potential "Graders" list as found here.
Nota Bene: If you do NOT want your name on the "Graders" lists, feel free to email me at jdownie@uiuc.edu.
Once we know the size of the "grader" pool, we will send out another note with the actual Evalutron 6000 URLs for the respective tasks. With luck, we should have the Evalutron 6000 open on or before 28 August. We expect to have the Evalutron up and running over a ~14 day period.
PLEASE NOTE: Due to the legalities imposed upon us by the University and the US Federal Government, we will be only accepting "graders" who have a stake in MIR research and NOT THE GENERAL PUBLIC.
- J. Stephen Downie, IMIRSEL
- Andreas Ehmann, IMIRSEL
- Kris West, IMIRSEL
- Xiao Hu, IMIRSEL
- Mert Bay, IMIRSEL
- M. Cameron Jones, IMIRSEL
- Paul Lamere, IMIRSEL
- Martin McCrory, IMIRSEL
- Qin Wei, IMIRSEL
- Anatoliy Gruzd, IMIRSEL
- Tamar Berman, IMIRSEL
- Jin Ha Lee, IMIRSEL
- Beth Logan
- Mark Levy, QMUL
- Ichiro Fujinaga, McGill
- Audrey Laplante, McGill
- Wietse Balkema
- Robert Neumayer, TU Vienna
- Stephen Cox, UEA
- Christopher Watkins, UEA
- Anna Pienimäki, Univ. Helsinki
- Sebastian Stober, Univ. Magdeburg
- Elias Pampalk
- Tim Pohle, Johannes Kepler Univ. Linz
- Youngmoo Kim, Drexel
- Donald S. Williamson, Drexel
- Matija Marolt, Uni Ljubljana
Symbolic Melodic Similarity Assumptions (MIREX 2006)
As of 22 August 2006 here is a quick sketch of the parameter space that we are working under (subject to realities):
- # of votes/eyes per candidate: 3
- # of algorithms evaluated: 10
- # of candidates per query per algo: 10
- # of queries: 16
- # of comparisons: 4800
- # of "graders": 15
- # of comparisons per "grader": 320
- # of seconds assumed per comparison: 45
- # of total seconds to grade per grader: 14,400
- # of hours to grade per grader: 4 <<Comment from JSD: this value is too high.
As you can see, change one value, and the rest re-jiggers itself (sometimes in very nasty ways).
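The interdependence of these figures is pure arithmetic: total comparisons are the product of votes, algorithms, candidates per query, and queries, and per-grader load follows by division. A small sketch (function and parameter names are our own) reproduces the symbolic melodic similarity numbers above:

```python
def grading_workload(votes_per_candidate, n_algorithms,
                     candidates_per_query, n_queries,
                     n_graders, seconds_per_comparison):
    """Derive total comparisons, comparisons per grader, and
    hours per grader from the parameter-space figures."""
    comparisons = (votes_per_candidate * n_algorithms
                   * candidates_per_query * n_queries)
    per_grader = comparisons / n_graders
    hours = per_grader * seconds_per_comparison / 3600
    return comparisons, per_grader, hours

# The symbolic melodic similarity figures listed above
print(grading_workload(3, 10, 10, 16, 15, 45))  # → (4800, 320.0, 4.0)
```

Changing any single input, say, adding graders, propagates straight through to the per-grader hours, which is exactly the "re-jiggering" noted above.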
Potential Symbolic Melodic Similarity "Graders" (MIREX 2006)
22 August 2006: If you believe you might be interested in participating in the evaluation process as a similarity "grader," please read through the documentation thoroughly. At this point we are trying to determine the size of the "grader" pool, as it affects individual grader work loads, etc. I would appreciate it very much if you would add your name to the Potential "Graders" list as found here.
Nota Bene: If you do NOT want your name on the "Graders" lists, feel free to email me at jdownie@uiuc.edu.
Once we know the size of the "grader" pool, we will send out another note with the actual Evalutron 6000 URLs for the respective tasks. With luck, we should have the Evalutron 6000 open on or before 28 August. We expect to have the Evalutron up and running over a ~14 day period.
PLEASE NOTE: Due to the legalities imposed upon us by the University and the US Federal Government, we will be only accepting "graders" who have a stake in MIR research and NOT THE GENERAL PUBLIC.
- J. Stephen Downie, IMIRSEL
- Andreas Ehmann, IMIRSEL
- Kris West, IMIRSEL
- Xiao Hu, IMIRSEL
- Mert Bay, IMIRSEL
- M. Cameron Jones, IMIRSEL
- Martin McCrory, IMIRSEL
- Qin Wei, IMIRSEL
- Anatoliy Gruzd, IMIRSEL
- Tamar Berman, IMIRSEL
- Jin Ha Lee, IMIRSEL
- Ichiro Fujinaga, McGill
- Audrey Laplante, McGill
- Alexandra Uitdenbogerd, RMIT
- Anna Pienimäki, Univ. Helsinki
- Niko Mikkilä, Univ. Helsinki
Informed Consent Text Template
The following text is the approved Informed Consent Information Form template for the Evalutron 6000 System and Protocol.
Music Similarity Grading System
Investigators:
Responsible Project Investigator: J. Stephen Downie (jdownie@uiuc.edu)
Graduate Assistant: M. Cameron Jones (mjones2@uiuc.edu)
Graduate Assistant: Anatoliy Gruzd (agruzd2@uiuc.edu)
The following music similarity grading system is designed to capture human judgments concerning the similarity of various query songs and those songs deemed to be similar to them (also known as "candidates") by one or more Music Information Retrieval (MIR) systems. The similarity judgments you provide for each query/candidate pair will be aggregated with the judgments of other graders to form a ground-truth set of similarity judgments. The ground-truth set that you help us create will then be used to evaluate the performance of MIR systems. You have been asked to participate as a similarity grader because of your active involvement in the MIR and/or Music Digital Library research domains.
Your participation as a grader is completely voluntary. If you come to any selections you do not want to grade, please feel free to skip to the next selection. You may also start and stop your grading sessions, and modify your judgments as you see fit, up to [insert date] when we will be closing the collection process. You may discontinue participation at any time, including after the completion of the grading, for any reason. In the event that you choose to stop participation, you may ask us to have your answers deleted by contacting us through email prior to [insert date] when we will be aggregating the collected data.
All personally identifying information of the graders, however obtained (e.g., name, company of employment, place of residence, names of collaborators, email addresses, website URLs, response times, etc.), will be kept confidential, meaning accessible by only the investigators and not published nor shared with other researchers. The original raw grader scores will not be distributed nor disseminated beyond the investigators and will be kept locked in a University office and on restricted-access (i.e., password-protected) areas of the investigators' computers. Data will be retained until the end of IMIRSEL's active involvement in MIR/MDL evaluations, for a minimum of three years after its collection and as long as is necessary to complete the necessary analyses of the data.
Benefits of Participation
The sharing of your knowledge of which queries are similar to which candidates will contribute to a fuller understanding of music similarity in general, and also aid in the development of algorithms and systems designed to identify and locate similar musical works.
Risks of Participation
Participation as a grader does not involve risks beyond those encountered in daily life.
Time Commitment
Evaluation of a single query against a candidate list takes approximately 8-10 minutes, so completing 20 queries will take you approximately 2 to 3 hours. You can stop at any time and resume grading, allowing you to complete the entire process in stages.
Contact Information
If you have any questions or concerns about this study, please contact the investigators.
Project contact address: c/o Dr. J. Stephen Downie, Graduate School of Library and Information Science, 501 E. Daniel St., Champaign, IL 61820; phone: 217-649-3839, fax: 217-244-3302.
If you have any general questions about your rights as a participant in this study, please contact the University of Illinois Institutional Review Board at 217-333-2670 (you may call collect if you identify yourself as a research participant) or via email at irb@uiuc.edu.
CONSENT TO GRADING PARTICIPATION
Music Similarity Grading System
YOU MUST BE 18 YEARS OF AGE OR OLDER TO PARTICIPATE!
I certify that I am 18 years of age or older, that I can print out a copy of this consent form, and that I have read the preceding and understand its contents. By selecting "I Agree" below I am freely agreeing to participate in this study by filling out the survey.
{Choose one}
(x ) Yes, I Agree
( ) No, I Disagree