2006:Evalutron6000 Issues
Latest revision as of 10:50, 3 June 2010
Introduction
This page is intended to provide a detailed explication of the Evalutron 6000 setup for the 2006:Audio Music Similarity and Retrieval and 2006:Symbolic Melodic Similarity tasks.
Please read it through carefully.
Also, please carefully read through the two task-specific Evalutron Walkthroughs:
- 2006:Evalutron6000_Walkthrough_For_Symbolic_Melodic_Similarity
- 2006:Evalutron6000_Walkthrough_For_Audio_Music_Similarity_and_Retrieval
Official Description of the Evalutron 6000 System and Protocol
The following text is drawn from the University of Illinois Institutional Review Board documents with some minor edits and abridgements for clarity.
Official Title of Evalutron 6000 Protocol: Music Similarity Grading System: Collecting Ground-Truth Similarity Information for Music Information Retrieval Evaluations
The research protocol for this protocol is IRB# 07066.
Task Description: For the "similarity-based" class of MIR/MDL tasks, the submitted music retrieval systems are tested on whether (or not) they can find one or more songs "similar" to the standardized "query" songs that were selected by IMIRSEL. For each "query song," each retrieval system under evaluation gives IMIRSEL a list of songs that it "thinks" are similar to the "query song". These songs are called "candidate songs" (or simply, "candidates"). IMIRSEL gathers up these lists of candidates and then makes a "master list" of all the candidates for each query song. At this point, we turn to volunteers drawn from the MIR/MDL research community to act as similarity "graders." The graders' job is to determine whether each query/candidate pair is similar (or not). Once we have collected and aggregated the similarity judgments for each query/candidate pair, we can then go back and determine, for each system, how many similar (and not similar) songs were retrieved for each query. Once the aggregated ground-truth data has been established, the evaluation procedures concerning the success (or failure) of the systems become automatic (i.e., no more human involvement) as the scoring of each system is done algorithmically by a series of computer programs.
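The "master list" step described above can be sketched in a few lines of Python. This is an illustrative sketch, not IMIRSEL's actual implementation; the function name and data layout (a dict of per-system, per-query candidate lists) are our own assumptions:

```python
def build_master_lists(system_results):
    """Merge every system's candidate lists into one de-duplicated
    "master list" per query, keeping order of first appearance."""
    master = {}
    for system, per_query in system_results.items():
        for query, candidates in per_query.items():
            seen = master.setdefault(query, [])
            for cand in candidates:
                if cand not in seen:
                    seen.append(cand)
    return master

# Two hypothetical systems returning candidates for one query
results = {
    "systemA": {"q1": ["song1", "song2"]},
    "systemB": {"q1": ["song2", "song3"]},
}
print(build_master_lists(results))  # → {'q1': ['song1', 'song2', 'song3']}
```

De-duplication matters here: graders judge each query/candidate pair once, no matter how many systems returned that candidate.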
Location: IMIRSEL has developed a web-based similarity judgment collection system currently called the Evalutron 6000. The Evalutron 6000 is located on the servers of the Graduate School of Library and Information Science, UIUC. A "walkthrough" with screen shots of the system prototype is located at the 2006:Evalutron6000 Walkthrough page. The Evalutron 6000 is set up to:
1. Begin with a registration page that is designed to:
- 1.a. Ensure that each grader has read and agreed to the informed consent information before proceeding further
- 1.b. Establish a username/password pair which is needed for:
- 1.b.i. Data security and integrity
- 1.b.ii. Affording the graders the ability to "come and go" from the system as they wish because the system uses the username to record which query/candidate pairs have already been evaluated by each user. This also allows graders to modify earlier judgments if they want.
- 1.b.iii. Affording us the ability to delete specific grader scores should a grader decide to withdraw or a security breach is detected, etc.
- 1.b.iv. Affording the algorithm underlying the system the ability to evenly distribute sub-sets of the query/candidate master lists across graders. This helps us minimize the burden on the graders (i.e., they need not evaluate every possible query/candidate pair).
- 1.c. Provide us with contact email address to help us communicate with the graders over such issues as:
- 1.c.i. System problems
- 1.c.ii. Verification of membership in the MIR/MDL research community, etc.
- 1.c.iii. Confirmation of withdrawals from participation
2. Provide the graders with a set of web pages, one for each query song, that present:
- 2.a. The ability to hear, stop, start, rewind, etc. the query song
- 2.b. The sub-set listing (~15 to ~20) of the candidates for the query along with:
- 2.b.i. The ability to hear, stop, start, rewind, etc. each candidate song
- 2.b.ii. An input mechanism to indicate and record the grader's similarity judgment for each candidate
3. Use a robust, password-protected, PHP/MySql database system to:
- 3.a. Collect and maintain the query/candidate lists
- 3.b. Generate the candidate sub-set lists seen by the graders so that each candidate is graded by roughly equal numbers of graders
- 3.c. Collect and preserve the raw grader scores, response times, etc.
- 3.d. Process the raw grader scores into aggregate similarity ground-truth values
- 3.e. Collect and maintain the confidential grader identification and consent information
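The even-distribution requirement in 1.b.iv and 3.b can be satisfied with a simple round-robin assignment. The sketch below is our own illustration, not the Evalutron's actual algorithm; function and parameter names are assumptions:

```python
import itertools

def assign_pairs(pairs, graders, votes_per_pair):
    """Round-robin each query/candidate pair to votes_per_pair graders,
    keeping every grader's workload roughly equal.
    Assumes votes_per_pair <= len(graders), so a pair is never
    assigned to the same grader twice."""
    assignments = {g: [] for g in graders}
    cycle = itertools.cycle(graders)
    for pair in pairs:
        for _ in range(votes_per_pair):
            assignments[next(cycle)].append(pair)
    return assignments

pairs = [("q1", "cand1"), ("q1", "cand2"), ("q2", "cand1")]
workload = assign_pairs(pairs, ["grader1", "grader2", "grader3"], 2)
print([len(v) for v in workload.values()])  # → [2, 2, 2]
```

Because assignments advance through one continuous cycle of graders, workloads never differ by more than one pair, which is what lets the protocol cap each grader's burden at a sub-set of the master list.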
Time Commitment: Evaluation of a single query against a candidate list takes approximately 8-10 minutes, so completing 20 queries will take a grader approximately 2 to 3 hours. Graders can stop at any time and resume grading, allowing them to complete the entire process in stages.
Date and Duration Issues: This protocol is designed to generate similarity ground-truth data for the evaluation of MIR/MDL systems under the conditions that pertain to the MIREX evaluation tasks. These tasks are ever-evolving as new test collections are generated and new algorithms are developed. Because of this constant evolution, every so often, at unpredictable intervals, IMIRSEL will need to collect new ground-truth data sets under this protocol. Thus, in general, this protocol has no fixed end date.
Measures: Currently a query/candidate song pair is deemed to be similar by simple majority vote of the graders (e.g., if two out of three graders say the pair is similar then we will deem that pair to be similar). Future evaluation runs might be set up to include the ability to gather "relative" similarity scores (i.e., on a scale from, say, 1 to 10, etc.) and/or "comparative" scores (i.e., song A is more/less similar to the query than song B, etc.). None of the aforementioned possible scoring variants materially affects the underlying purpose of the protocol under consideration which is, again, the establishment of similarity values between query/candidate pairs.
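The simple-majority rule described above is straightforward to express in code. This is a minimal sketch under the stated rule (boolean similar/not-similar votes); the function name and data layout are our own:

```python
def aggregate_judgments(judgments):
    """Deem a query/candidate pair similar when a strict majority of
    its graders judged it similar (e.g., 2 of 3)."""
    return {pair: sum(votes) > len(votes) / 2
            for pair, votes in judgments.items()}

# Hypothetical boolean judgments from three graders per pair
judgments = {
    ("query1", "candA"): [True, True, False],   # 2 of 3 -> similar
    ("query1", "candB"): [True, False, False],  # 1 of 3 -> not similar
}
print(aggregate_judgments(judgments))
```

Note that with an even number of graders a strict majority means a tie counts as "not similar"; the "relative" and "comparative" scoring variants mentioned above would replace the boolean votes but leave this aggregation step structurally the same.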
Derivative aggregate data: (i.e., not specific to any individual grader) concerning the data collection process will be analyzed for possible system improvements and generalizations. Such aggregate data would include, for example, descriptive measures of number of pairs judged, ratios of similar to not similar judgments, estimates of judgment times, etc.
About Graders and Data Integrity: The volunteer graders will be drawn from the MIR/MDL research community and NOT the general public. We want to use only those graders who have a vested interest in making honest similarity assessments. MIR/MDL researchers have this vested interest in honesty because the creation of valid ground-truth data, and the subsequent evaluation of their MIR/MDL systems based upon it, provide a scientifically valid basis for future research and development. To further ensure data integrity, the protocol is set up to be "double-blind" in the same manner that peer-reviewing is double-blind. In a very real sense, this protocol is peer-reviewing. Because IMIRSEL runs the submitted systems in-house against music collections unknown to the submitters, AND our protocol aggregates lists of query/candidate pairs from across all submissions, none of the graders knows from which system a particular query/candidate pair comes (i.e., it could even be from their own system and they would have no way of knowing!). The submitters of the systems under evaluation also have no way of knowing which graders provided similarity judgments for their systems, as all the grading information is stripped of grader identifiers and then aggregated into a collective set of similarity values.
Informed Consent Issues:
- All graders are over the age of 18.
- The system is designed to only allow access to the grading steps if the grader has confirmed that the grader has read and agrees to the informed consent document presented during the registration procedure. Consent is logged in the password-protected database system for each username/password to allow graders to come and go freely from the system.
Dissemination of Results: Research results derived from this protocol will be used in a variety of ways. They will likely be presented in conference presentations and/or other academic presentations, as well as conference proceedings, and/or journal or book articles. Such publication venues may include: the annual MIREX meetings and accompanying web pages; the International Conferences on Music Information Retrieval (ISMIR); the ACM Multimedia conference; the ACM SIGIR conference; the ACM/IEEE Joint Conference on Digital Libraries; the Journal of the American Society for Information Science and Technology; the Information Processing and Management journal; and other similar venues. Copies of such publications may also be deposited into an institutional repository or made available on researchers' homepages, etc.
Audio Similarity Assumptions (MIREX 2006)
As of 22 August 2006 here is a quick sketch of the parameter space that we are working under (subject to realities): UPDATE: 23 August 2006: The basic shape of the following information will be constant but a few of the details are about to change. Will modify this once the AudioSim folks reach consensus. See the << notes for probable changes.
- # of votes/eyes per candidate: <<23 August: This number to decrease
- # of algorithms evaluated: 6
- # of candidates per query per algo: 5
- # of queries: 30 <<23 August: This number to increase
- # of comparisons: 4500
- # of "graders": 20 <<This number to increase
- # of comparisons per "grader": 225 <<23 August: This number to decrease
- # of seconds assumed per comparison: 45
- # of total seconds to grade per grader: 10,125 <<23 August: This number to decrease slightly
- # of hours to grade per grader: 2.812 <<23 August: This number to decrease slightly
As you can see, change one value, and the rest re-jiggers itself (sometimes in very nasty ways).
Potential Audio Similarity "Graders" (MIREX 2006)
22 August 2006: If you believe you might be interested in participating in the evaluation process as a similarity "grader," please read through the documentation thoroughly. At this point we are trying to determine the size of the "grader" pool, as it affects individual grader work loads, etc. I would appreciate it very much if you would add your name to the Potential "Graders" list as found here.
Nota Bene: If you do NOT want your name on the "Graders" lists, feel free to email me at jdownie@uiuc.edu.
Once we know the size of the "grader" pool, we will send out another note with the actual Evalutron 6000 URLs for the respective tasks. With luck, we should have the Evalutron 6000 open on or before 28 August. We expect to have the Evalutron up and running over a ~14 day period.
PLEASE NOTE: Due to the legalities imposed upon us by the University and the US Federal Government, we will be only accepting "graders" who have a stake in MIR research and NOT THE GENERAL PUBLIC.
- J. Stephen Downie, IMIRSEL
- Andreas Ehmann, IMIRSEL
- Kris West, IMIRSEL
- Xiao Hu, IMIRSEL
- Mert Bay, IMIRSEL
- M. Cameron Jones, IMIRSEL
- Paul Lamere, IMIRSEL
- Martin McCrory, IMIRSEL
- Qin Wei, IMIRSEL
- Anatoliy Gruzd, IMIRSEL
- Tamar Berman, IMIRSEL
- Jin Ha Lee, IMIRSEL
- Beth Logan
- Mark Levy, QMUL
- Ichiro Fujinaga, McGill
- Audrey Laplante, McGill
- Wietse Balkema
- Robert Neumayer, TU Vienna
- Stephen Cox, UEA
- Christopher Watkins, UEA
- Anna Pienimäki, Univ. Helsinki
- Sebastian Stober, Univ. Magdeburg
- Elias Pampalk
- Tim Pohle, Johannes Kepler Univ. Linz
- Youngmoo Kim, Drexel
- Donald S. Williamson, Drexel
- Matija Marolt, Uni Ljubljana
Symbolic Melodic Similarity Assumptions (MIREX 2006)
As of 22 August 2006 here is a quick sketch of the parameter space that we are working under (subject to realities):
- # of votes/eyes per candidate: 3
- # of algorithms evaluated: 10
- # of candidates per query per algo: 10
- # of queries: 16
- # of comparisons: 4800
- # of "graders": 15
- # of comparisons per "grader": 320
- # of seconds assumed per comparison: 45
- # of total seconds to grade per grader: 14,400
- # of hours to grade per grader: 4 <<Comment from JSD: this value is too high.
As you can see, change one value, and the rest re-jiggers itself (sometimes in very nasty ways).
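The interdependence of these figures is pure arithmetic: total comparisons are the product of votes, algorithms, candidates per query, and queries, and per-grader load follows by division. A small sketch (function and parameter names are our own) reproduces the symbolic melodic similarity numbers above:

```python
def grading_workload(votes_per_candidate, n_algorithms,
                     candidates_per_query, n_queries,
                     n_graders, seconds_per_comparison):
    """Derive total comparisons, comparisons per grader, and
    hours per grader from the parameter-space figures."""
    comparisons = (votes_per_candidate * n_algorithms
                   * candidates_per_query * n_queries)
    per_grader = comparisons / n_graders
    hours = per_grader * seconds_per_comparison / 3600
    return comparisons, per_grader, hours

# The symbolic melodic similarity figures listed above
print(grading_workload(3, 10, 10, 16, 15, 45))  # → (4800, 320.0, 4.0)
```

Changing any single input, say, adding graders, propagates straight through to the per-grader hours, which is exactly the "re-jiggering" noted above.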
Potential Symbolic Melodic Similarity "Graders" (MIREX 2006)
22 August 2006: If you believe you might be interested in participating in the evaluation process as a similarity "grader," please read through the documentation thoroughly. At this point we are trying to determine the size of the "grader" pool, as it affects individual grader work loads, etc. I would appreciate it very much if you would add your name to the Potential "Graders" list as found here.
Nota Bene: If you do NOT want your name on the "Graders" lists, feel free to email me at jdownie@uiuc.edu.
Once we know the size of the "grader" pool, we will send out another note with the actual Evalutron 6000 URLs for the respective tasks. With luck, we should have the Evalutron 6000 open on or before 28 August. We expect to have the Evalutron up and running over a ~14 day period.
PLEASE NOTE: Due to the legalities imposed upon us by the University and the US Federal Government, we will be only accepting "graders" who have a stake in MIR research and NOT THE GENERAL PUBLIC.
- J. Stephen Downie, IMIRSEL
- Andreas Ehmann, IMIRSEL
- Kris West, IMIRSEL
- Xiao Hu, IMIRSEL
- Mert Bay, IMIRSEL
- M. Cameron Jones, IMIRSEL
- Martin McCrory, IMIRSEL
- Qin Wei, IMIRSEL
- Anatoliy Gruzd, IMIRSEL
- Tamar Berman, IMIRSEL
- Jin Ha Lee, IMIRSEL
- Ichiro Fujinaga, McGill
- Audrey Laplante, McGill
- Alexandra Uitdenbogerd, RMIT
- Anna Pienimäki, Univ. Helsinki
- Niko Mikkilä, Univ. Helsinki
Informed Consent Text Template
The following text is the approved Informed Consent Information Form template for the Evalutron 6000 System and Protocol.
Music Similarity Grading System
Investigators:
Responsible Project Investigator: J. Stephen Downie (jdownie@uiuc.edu)
Graduate Assistant: M. Cameron Jones (mjones2@uiuc.edu)
Graduate Assistant: Anatoliy Gruzd (agruzd2@uiuc.edu)
The following music similarity grading system is designed to capture human judgments concerning the similarity of various query songs and those songs deemed to be similar to them (also known as "candidates") by one or more Music Information Retrieval (MIR) systems. The similarity judgments you provide for each query/candidate pair will be aggregated with the judgments of other graders to form a ground-truth set of similarity judgments. The ground-truth set that you help us create will then be used to evaluate the performance of MIR systems. You have been asked to participate as a similarity grader because of your active involvement in the MIR and/or Music Digital Library research domains.
Your participation as a grader is completely voluntary. If you come to any selections you do not want to grade, please feel free to skip to the next selection. You may also start and stop your grading sessions, and modify your judgments as you see fit, up to [insert date] when we will be closing the collection process. You may discontinue participation at any time, including after the completion of the grading, for any reason. In the event that you choose to stop participation, you may ask us to have your answers deleted by contacting us through email prior to [insert date] when we will be aggregating the collected data.
All personally identifying information of the graders, however obtained (e.g., name, company of employment, place of residence, names of collaborators, email addresses, website URLs, response times, etc.), will be kept confidential, meaning accessible by only the investigators and not published nor shared with other researchers. The original raw grader scores will not be distributed nor disseminated beyond the investigators and will be kept locked in a University office and on restricted-access (i.e., password-protected) areas of the investigators' computers. Data will be retained until the end of IMIRSEL's active involvement in MIR/MDL evaluations, for a minimum of three years after its collection and as long as is necessary to complete the necessary analyses of the data.
Benefits of Participation
The sharing of your knowledge of which queries are similar to which candidates will contribute to a fuller understanding of music similarity in general, and also aid in the development of algorithms and systems designed to identify and locate similar musical works.
Risks of Participation
Participation as a grader does not involve risks beyond those encountered in daily life.
Time Commitment
Evaluation of a single query against a candidate list takes approximately 8-10 minutes, so completing 20 queries will take you approximately 2 to 3 hours. You can stop at any time and resume grading, allowing you to complete the entire process in stages.
Contact Information
If you have any questions or concerns about this study, please contact the investigators.
Project contact address: c/o Dr. J. Stephen Downie, Graduate School of Library and Information Science, 501 E. Daniel St., Champaign, IL 61820; phone: 217-649-3839, fax: 217-244-3302.
If you have any general questions about your rights as a participant in this study, please contact the University of Illinois Institutional Review Board at 217-333-2670 (you may call collect if you identify yourself as a research participant) or via email at irb@uiuc.edu.
CONSENT TO GRADING PARTICIPATION
Music Similarity Grading System
YOU MUST BE 18 YEARS OF AGE OR OLDER TO PARTICIPATE!
I certify that I am 18 years of age or older, that I can print out a copy of this consent form, and that I have read the preceding and understand its contents. By selecting "I Agree" below I am freely agreeing to participate in this study by filling out the survey.
{Choose one}
(x ) Yes, I Agree
( ) No, I Disagree