 © 2006 American Dental Education Association
Abstract
A standard correction for random guessing on multiple-choice examinations was examined retrospectively in an oral and maxillofacial pathology course for second-year dental students. The correction was a weighting formula for points awarded for correct answers, incorrect answers, and unanswered questions such that the expected value of the increase in test score due to guessing was zero. We compared uncorrected and corrected scores on examinations using a multiple-choice format with scores on examinations composed of short-answer questions. The short-answer format eliminated or at least greatly reduced the potential for guessing the correct answer. Agreement of corrected multiple-choice scores with short-answer scores (intraclass correlation coefficient 0.78) was significantly (p=0.015) higher than agreement of uncorrected multiple-choice scores with short-answer scores (intraclass correlation coefficient 0.71). The higher agreement indicated increased validity for the corrected multiple-choice examination.
Keywords: validity, formula scoring, correction for guessing, educational methodology, educational measurement, examination performance, evaluation, multiple-choice questions, short-answer questions, dental education
The potential for students to improve their grades by guessing on multiple-choice format examinations is well known. We have examined a method for correcting for random (no-knowledge) guessing on multiple-choice questions^{1–3} by comparing uncorrected and corrected scores on examinations using a multiple-choice format with scores on examinations using short-answer questions in an oral and maxillofacial pathology course for dental students. We take as self-evident that the short-answer format eliminates or at least greatly reduces the potential for guessing the correct answer. The short-answer questions were presented in a clinically relevant context to better simulate the situation students will face in providing patient care. Questions included a clinical history and projected Kodachrome® slides of clinical and microscopic pathology and appropriate radiographs.
This study represented a unique opportunity to compare scores from multiple-choice and short-answer examinations in a setting in which students were given the same number of questions in each of the two format types testing their knowledge over the same subject matter. The results of this study assessing four separate examinations during an oral and maxillofacial pathology course indicated that the corrected multiple-choice scores agreed significantly better with student performance on short-answer examinations than did the uncorrected multiple-choice scores.
Methods
We investigated the standard correction for guessing (formula scoring) using scores on four examinations in the didactic oral and maxillofacial pathology course presented in the 2005 spring semester at the University of Texas Health Science Center at San Antonio. This course, given to all second-year dental students, was fifty-eight hours in length and consisted of fifty hours of lecture and four two-hour examinations. Each of the four examinations was divided into two one-hour examinations. The first hour of each examination consisted of twenty-five cases, each of which had a short clinical history and projected clinical, microscopic, and radiographic Kodachrome® slides. For each of the twenty-five cases, two short-answer questions were asked for a total of fifty questions. When all twenty-five cases had been presented, the students were given five minutes to look over their answers and make any changes or corrections. The short-answer examination was then collected and subsequently graded by the course director (ACJ). The short-answer questions were graded by looking for key words identified at the time of construction of the examination. If a student gave multiple answers, only the first answer was evaluated; no partial credit was awarded. The second hour of each examination consisted of fifty multiple-choice questions, each with one correct answer and four distractors. The multiple-choice questions were a mixture of clinical vignettes and ordinary didactic questions. Students were asked to choose the single correct answer for each question. At the end of the second hour, answer sheets were collected and graded electronically.
Since the multiple-choice examination and the short-answer examination each consisted of fifty questions, they were equally weighted during the calculation of each student’s final grade. Each two-hour examination comprised 25 percent of the final grade. No comprehensive final examination was given. Students received final course grades based on averages calculated from the scores on the four one-hour multiple-choice examinations and the four one-hour short-answer examinations. These averages were used to assign course grades as A (90–100), B (80–89), C (70–79), or F (0–69).
Each of the four examinations was equally spaced during the course and covered between eleven and thirteen hours of lecture material. When the individual short-answer and multiple-choice examinations were constructed, the questions were equally weighted to the topics that were presented prior to each of the four examinations. This was to ensure that a given topic was not stressed more often than another topic. The students were advised to add up the number of topics discussed in a given section and divide that number by fifty to arrive at an approximate number of questions per topic on both the multiple-choice and short-answer examinations.
The effect of correction for guessing was investigated after the course was completed and official grades were awarded. Ninety students initially enrolled in the course during the 2004–05 academic year; two students who were failing the course after the completion of three examinations withdrew from the class before the fourth examination. The analyses presented in this report were based on the eighty-eight students who completed the course and had scores for all four examinations. This study was approved by the Institutional Review Board of the University of Texas Health Science Center at San Antonio.
In making the correction for guessing, we assumed that students were making a truly random choice. The correction that we applied was a modification to the ordinary grading method for multiple-choice examinations (number-correct or number-right scoring) where zero points are assigned for an incorrect answer and full credit is given for a correct answer.^{4} Since each multiple-choice question has five possible answers, the standard correction for guessing consisted of awarding −1/4 for an incorrect answer, 0 for a question not answered, and +1 for a correct answer; these points were added and the sum divided by the number of questions. The probability of guessing a correct answer and being awarded +1 was 0.20, and the probability of guessing an incorrect answer and being awarded −1/4 was 0.80. Thus the expected value of the number of points gained due to guessing was (0.20)(1)+(0.80)(−1/4)=0. In general, for K possible answers per question, −1/(K−1) is awarded for an incorrect answer, 0 for a question not answered, and +1 for a correct answer.^{4} This correction for guessing is generally referred to as formula scoring^{4} or the standard correction for guessing. Formula scoring is a special case^{5} of choice weighting.
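As a minimal sketch of this scoring rule (illustrative code, not from the study; the function name and defaults are our own):

```python
def formula_score(correct, incorrect, unanswered, k=5):
    """Formula scoring: +1 per correct answer, -1/(k-1) per incorrect
    answer, 0 per unanswered question, where k is the number of
    possible answers; the sum is divided by the number of questions."""
    total = correct + incorrect + unanswered
    points = correct * 1.0 + incorrect * (-1.0 / (k - 1))
    return 100.0 * points / total
```

With k=5, a pure guess earns (0.20)(1)+(0.80)(−1/4)=0 points on average, so random guessing has zero expected benefit under this rule.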
If all questions were answered, as was the case with our examinations, the correction for guessing was equivalent to applying a straight-line adjustment such that a grade of 100 percent was unchanged and a grade of 20 percent was adjusted to zero. The equation of this straight line was Corrected Score (%)=1.25 [Uncorrected Score (%)−20].
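A quick check (hypothetical scores, assuming every question is answered) confirms that the straight-line adjustment and the point-weighting formula agree:

```python
def corrected_from_uncorrected(uncorrected_pct):
    """Straight-line form of the correction, valid only when all
    questions are answered: 100% stays 100%, 20% maps to 0%."""
    return 1.25 * (uncorrected_pct - 20.0)

# Example: 38 correct and 12 incorrect out of 50 is an uncorrected
# score of 76%; the weighting formula gives (38 - 12/4)/50 = 70%,
# and the straight line gives 1.25 * (76 - 20) = 70 as well.
```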
An intraclass correlation coefficient^{6} was used as a measure of agreement^{7} between a multiplechoice score and a shortanswer score. The Pearson correlation coefficient was not appropriate because it measures association, not agreement.^{8} Perfect agreement occurs only if data points lie along the line of equality; perfect correlation occurs if data points lie along any straight line.
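The distinction can be illustrated numerically. The sketch below uses Lin's concordance correlation coefficient as the agreement statistic (the paper's exact intraclass correlation formulation is not reproduced here, and the score lists are hypothetical); two sets of scores that differ by a constant shift correlate perfectly yet agree imperfectly:

```python
def mean(v):
    return sum(v) / len(v)

def pearson_r(x, y):
    """Association: perfect (1.0) whenever points lie on any line."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def concordance(x, y):
    """Agreement: penalizes deviation from the line of equality,
    not just deviation from any straight line."""
    n = len(x)
    mx, my = mean(x), mean(y)
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cxy / (vx + vy + (mx - my) ** 2)

short_answer = [70.0, 75.0, 80.0, 85.0, 90.0]       # hypothetical scores
multiple_choice = [s + 10.0 for s in short_answer]  # uniformly inflated
```

Here the correlation is exactly 1.0 even though every multiple-choice score is 10 points too high, while the agreement statistic is well below 1.0.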
Principal component lines^{9} were estimated from the variance-covariance matrix. The first principal component is the linear combination (straight line) of the variables that has the maximum variance among all (normalized) linear combinations; also, the first principal component is the line through the means, (X̄, Ȳ), which minimizes the sum of the squared distances of the data points to the line.^{10} We used principal components analysis because both the X and Y variables were random variables; in ordinary linear regression analysis, only the Y variable is considered to be a random variable, and the estimator of the line is biased if X also is a random variable.^{11} Thus, the first principal component lines more accurately estimated the relation between these X and Y variables.
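For two variables, the first principal component line has a closed form; a sketch (illustrative only, and valid only when the covariance is nonzero):

```python
import math

def principal_axis(x, y):
    """First principal component line y = a + b*x through the point of
    means, computed from the 2x2 variance-covariance matrix. The slope
    is the direction of the covariance matrix's largest eigenvalue."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / (n - 1)
    vy = sum((b - my) ** 2 for b in y) / (n - 1)
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    slope = (vy - vx + math.sqrt((vy - vx) ** 2 + 4 * cxy ** 2)) / (2 * cxy)
    return my - slope * mx, slope  # (intercept, slope)
```

Unlike ordinary least squares, this line minimizes perpendicular rather than vertical distances, which is the appropriate choice when both variables are random.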
Variance components^{12} were calculated for students, examinations, and error to study the sources of variability in scores. Reliability was the ratio of variance between students to total variance.
To compare means on each examination type for different typical grade classifications, we used analysis of variance for repeated measurements^{13} with examination (four levels) as a repeated measures factor. The statistical model used an unstructured variance-covariance matrix.
A bootstrap procedure^{14} with 1,000 samples was used to estimate confidence intervals and to compare agreement of uncorrected and corrected multiplechoice scores with shortanswer scores, to estimate confidence intervals for the slope and intercept of the first principal component lines, to test that the slope was 1.00 and the intercept 0.00, to test that the principal component lines for the uncorrected and corrected scores were the same, and to compare reliability coefficients of multiplechoice examinations and shortanswer examinations. The bootstrap is a nonparametric procedure and thus does not depend on any particular probability distribution. The statistic of interest is calculated in bootstrap samples, of the same size as the original, that are generated by sampling with replacement from the original data. Thus, the bootstrap is a resampling procedure. If the resampling is repeated a large number of times, the empirical distribution of the statistic generated from many bootstrap samples approximates the actual distribution. The empirical distribution may be used to construct confidence intervals (95 percent confidence limits are the 2.5 and 97.5 percentiles of the empirical distribution) or perform hypothesis tests.
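The resampling procedure described above can be sketched as follows (a minimal percentile bootstrap, not the study's implementation; the statistic and seed are illustrative):

```python
import random

def bootstrap_ci(data, statistic, n_boot=1000, seed=1):
    """Percentile bootstrap: resample with replacement, recompute the
    statistic in each bootstrap sample, and take the 2.5 and 97.5
    percentiles of the empirical distribution as a 95 percent CI."""
    rng = random.Random(seed)
    n = len(data)
    boots = sorted(
        statistic([rng.choice(data) for _ in range(n)])
        for _ in range(n_boot)
    )
    return boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot) - 1]
```

Because the interval comes from the empirical distribution of resampled statistics, no particular probability distribution is assumed.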
Results
Descriptive statistics for short-answer and uncorrected and corrected multiple-choice scores are given in Table 1.
As shown in Figure 1, the average scores of four multiple-choice examinations, corrected for guessing, were clearly more in agreement with the short-answer scores than were the uncorrected multiple-choice scores. The agreement statistic (intraclass correlation coefficient) for the corrected scores was greater than the agreement statistic for the uncorrected scores for each of the four examinations and for the average (Table 2). The agreement of the average of the four corrected multiple-choice examinations with the average of the four short-answer examinations (0.78, 95 percent confidence interval 0.69–0.85) was significantly (p=0.015, one-tailed test) greater than agreement of the average of the four uncorrected multiple-choice examinations with the average of the four short-answer examinations (0.71, 95 percent confidence interval 0.59–0.80). These results indicate increased validity due to applying the standard correction for guessing to multiple-choice examinations.
The foregoing intraclass correlation coefficients describe the agreement among the scores for individual students. The agreement for the group also was better for the corrected scores as indicated by the lines of the first principal components (Figure 1 and Table 3). The line for the corrected multiple-choice scores was substantially closer to the line of equality in both slope (p=0.001) and intercept (p=0.001) than was the line for uncorrected scores. However, for both uncorrected and corrected multiple-choice scores, the slope was significantly different from one (p=0.002 for uncorrected and p=0.016 for corrected), and the intercept was significantly different from zero (p=0.002 for uncorrected and p=0.048 for corrected).
We computed reliability separately for short-answer and multiple-choice examinations across the four examinations for the entire course. The reliability of the multiple-choice examinations (48.4 percent, 95 percent confidence interval 37.5–57.2) was not significantly (p=0.1225) different from the reliability of the short-answer examinations (43.1 percent, 95 percent confidence interval 32.5–53.6). Reliability was unaffected by the linear transformation used to correct for guessing if all questions were answered, as was the case in our retrospective study; thus, reliability was the same for the uncorrected and corrected multiple-choice examinations.
The correction for guessing resulted in lower grades for students as indicated graphically in Figure 1. This effect on the overall means is given in Table 1. To further define the effects of correction, we classified students based on our classification of A (90–100), B (80–89), C (70–79), and F (0–69) using the average of the four short-answer examinations (corresponding to the horizontal axis of Figure 1). The multiple-choice grades were lowered an average of 2.1, 3.8, 4.6, and 6.6 points for the A, B, C, and F categories respectively by the correction for guessing. The correction lowered scores more for those students with lower grades, where presumably there was a greater degree of guessing.
The average differences between uncorrected and corrected multiple-choice examinations and the short-answer examinations for each of the grade categories are given in Table 4 (uncorrected multiple-choice F(3,84)=23.6, p<0.0001; corrected multiple-choice F(3,84)=8.44, p<0.0001). The uncorrected multiple-choice examination scores were significantly (p≤0.05) higher than the short-answer scores for the C and F categories; for the A and B categories, the uncorrected multiple-choice scores were not significantly different from the short-answer scores. For the F category, the corrected multiple-choice scores were not significantly different from the short-answer scores. The corrected multiple-choice examination scores were significantly higher than the short-answer examination scores for the C classification. For the A and B categories, the corrected multiple-choice grade was significantly lower than the short-answer grade.
Discussion
Figure 1 and Table 2 show that for individual students the agreement of the corrected multiple-choice scores with short-answer scores was significantly better than the agreement of uncorrected multiple-choice scores with short-answer scores. The principal component lines in Figure 1 (that is, the single dimension that best summarizes the data from both examination formats) show that the corrected multiple-choice scores placed the group of students closer to the line of equality. These results indicate increased validity due to applying the standard correction for guessing to multiple-choice examination scores. While we cannot claim that the short-answer format better evaluates student knowledge based on these data alone, we believe any question format that reduces the influence of guessing will be a better indicator of what students know or do not know on a given subject. In particular, the short-answer format examinations should provide a better measure of a student’s ability to perform in clinical situations in which patients present without a set of possible choices for the diagnosis. Our use of validity refers to performance without guessing, that is, performance without “cuing.” Diamond and Evans^{1} report that many studies have found increased validity measures where formula scoring is used.
For medical students, Norman et al.^{15} demonstrated significantly higher scores on examinations using multiple-choice questions compared to examinations using essay questions, with slightly higher reliability for the multiple-choice examinations and similar measures of validity for the multiple-choice and essay examinations. In third- and fourth-year medical students, Veloski et al.^{16} compared examinations using multiple-choice format questions (cued response) with examinations using uncued format questions. For the uncued questions, the students selected the answer from a numbered list of alphabetized choices so that these examinations could still be graded electronically by checking for the appropriate number. Their results indicated average scores from the cued multiple-choice examinations were 11 percent to 22 percent higher than average scores from uncued examinations. They concluded that the multiple-choice examination scores gave falsely inflated measures of abilities needed for clinical competency. Our results support this notion by showing that scores from multiple-choice format examinations, when corrected for guessing, better reflected the test scores on short-answer questions presented in a more clinically relevant manner. As shown in Figure 2, instructors should realize that, when employing multiple-choice examinations without correcting for guessing, the standard for passing and for all grade levels is inflated, particularly at the lower end of the grading range. For example, if the minimum passing standard nominally is 70 percent on an uncorrected multiple-choice examination, then the correction for guessing shows the actual standard for passing is only 62.5 percent.
The standard correction for guessing adjusts only for truly random guessing among the possible answers. If a student does not attempt an answer, zero points are awarded in the standard correction that we applied. It potentially would benefit a student to guess if he or she could eliminate one, two, or three of the distractors. Table 5 gives the expected gain per question if a student had partial knowledge and could eliminate one or more incorrect answers. Thus, if an instructor wishes to use this standard correction for guessing, it is imperative that all of the distractors be of uniformly high quality and not allow students to easily eliminate one or more irrelevant answers. Moreover, it also will be important to adequately shuffle the possible answers so that no pattern for the position (a, b, c, d, e) of the correct answer is apparent from question to question.
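The expected gains under partial knowledge follow directly from the weighting formula; a sketch (Table 5 itself is not reproduced here, and the function name is illustrative):

```python
def expected_gain(k=5, eliminated=0):
    """Expected points gained per question by guessing after ruling
    out `eliminated` distractors, under formula scoring with k
    answers: a correct guess earns +1, an incorrect one costs
    1/(k-1)."""
    p = 1.0 / (k - eliminated)  # chance the remaining guess is correct
    return p * 1.0 + (1.0 - p) * (-1.0 / (k - 1))
```

With five possible answers, the expected gain is 0 with no elimination, 1/16 after eliminating one distractor, 1/6 after two, and 3/8 after three, so guessing pays off on average only when the student holds partial knowledge.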
Diamond and Evans^{1} reported that students must receive specific instructions about guessing if examinations scored with a correction for guessing are to retain reliability. Students must be informed that a correction for guessing will be applied and must be shown the effect of guessing without knowledge or even with partial knowledge (the ability to eliminate one or more incorrect answers) as well as the potential benefits of partial knowledge. Later in this report, we present an example illustrating that consideration of only the expected gain is inadequate for making a decision regarding guessing. Such an example should be given to students as part of the discussion of correcting for guessing.
The reliability for the short-answer examination and the multiple-choice examination in our study was similar. Lord^{4} argues that formula scoring will always improve reliability provided the student leaves at least one question unanswered. Intuitively, this can be understood as removing some random guessing component from the score and, thus, focusing on the student’s actual knowledge. This advantage of formula scoring has been empirically supported and discussed in several recent studies.^{17–22} Thus the use of formula scoring (corrected multiple-choice examinations) not only results in increased validity but also saves faculty time that would have to be spent grading short-answer examinations. Increasing the number of questions that are included in the multiple-choice examinations potentially would result in greater reliability. This addition would not increase faculty time spent in grading but would require additional time in test preparation.
In striking contrast to those students with lower grades in this course, we observed that those students with high grades performed better on the short-answer examinations than on the multiple-choice examinations. This may reflect deficiencies or confusion in the multiple-choice examinations that are detected by the better-scoring students with substantial knowledge of the subject matter. This interpretation is consistent with results from factor analysis^{23} where an additional small dimension of knowledge was supported with uncued questions in testing students.
Choppin^{3} points out that correction for guessing addresses three concerns: 1) guessing introduces a random factor into test scores that adversely lowers reliability and validity, 2) expected correct guesses inflate estimation of students’ abilities, and 3) the inflation from guessing can be an unfair advantage for students who guess frequently when compared to students with equal ability who do not guess. Applying the correction for guessing reduces the advantage for students who guess frequently. Our study clearly shows the inflated grades on multiple-choice examinations. Thus, the multiple-choice examination scores were brought into better agreement with the short-answer scores by the standard correction for guessing, indicating increased validity of the corrected multiple-choice tests. This increased validity supports the side of the controversy in recent literature^{17–22} that favors the use of formula scoring. We have not performed the exercise of applying a correction for guessing to a multiple-choice examination after thoroughly informing students of the procedure and comparing these grades with a short-answer examination. Nonetheless, the results presented here would predict a positive result for such an undertaking—that is, increased validity.
Example
The following example illustrates the effects of the standard correction for guessing and will provide considerations that must be addressed by students who contemplate guessing. Suppose that on a fifty-question multiple-choice examination with five possible answers per question, a student had the following result: five questions not answered, eight incorrect answers, and thirty-seven correct answers. We assume that the questions answered incorrectly represent misunderstanding of the material; that is, the student thought he or she knew the correct answer but, in fact, did not. On the questions not answered, the student admitted a complete lack of knowledge. The corrected score is computed using the formula previously described in this article as follows:

Corrected Score = [(37)(+1) + (8)(−1/4) + (5)(0)]/50 = 35/50 = 70%
If the student had instead tried to guess the correct answer to the five questions left blank, we would expect the student to answer one correctly, yielding twelve incorrect answers and thirty-eight correct answers with an uncorrected score of 38/50=76%. Applying the correction algorithm to the hypothetical result with twelve wrong answers and thirty-eight correct answers yields:

Corrected Score = [(38)(+1) + (12)(−1/4)]/50 = 35/50 = 70%
Since there is no difference in the outcome regardless of whether a student guessed or left a question unanswered, why should a student not make random guesses? While we may expect students to answer one question correctly, there is a chance that they will guess the correct answer less frequently than expected. Similarly, they might guess better than expected. The probabilities of these different outcomes are given in Table 6. There is about a one-third chance that the student will lower his or her grade and less than a one-third chance (0.262) that this student will improve his or her grade by guessing. Leaving questions unanswered does not expose the student to the risk of lowering the grade by achieving less than the expected success due to guessing. This is perhaps a critical decision for those students at the cutoff point for passing (70 percent) in our oral and maxillofacial pathology course.
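These outcome probabilities follow a binomial distribution; a quick check for five independent pure guesses, each with success probability 1/5:

```python
from math import comb

def binom_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Five pure guesses on five-option questions (p = 0.2 each)
p_down = binom_pmf(5, 0, 0.2)   # zero correct: corrected grade drops
p_same = binom_pmf(5, 1, 0.2)   # one correct: grade unchanged
p_up = 1.0 - p_down - p_same    # two or more correct: grade rises
```

This gives roughly 0.328 for lowering the grade and 0.262 for raising it, matching the approximately one-third chances described above.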
Many instructors would consider the unanswered questions as incorrect and award a score of 37/50 or 74 percent. If students know that no correction for guessing will be applied, they would be foolish not to answer all questions. Computing a raw score ignores the different information that may be contained in the unanswered questions compared to the incorrectly answered questions.
Suppose the student could eliminate one, two, or three possible answers for each of the five questions left blank. The probabilities of a correct guess for each question are 1/4, 1/3, or 1/2, respectively. The probabilities of various numbers of correct answers under these circumstances are given in Table 7. Although students would lessen their chances of a lower (and failing) grade and improve their chances of a higher grade due to guessing, it still would seem prudent for students to avoid the risk of the lower grade even if they can eliminate two incorrect choices. Only if they could eliminate three incorrect answers would the decision to guess be a wise one. Students having greater knowledge and thus higher grades than our example student might well make different decisions. That is, they might be more willing to gamble to achieve a higher grade.
Suppose this student had a better knowledge of what he or she didn’t know and did not give an answer to five of the eight questions answered incorrectly. That is, the student had the same number of correct responses (thirty-seven) with ten unanswered and three incorrect. The corrected score is:

Corrected Score = [(37)(+1) + (3)(−1/4) + (10)(0)]/50 = 36.25/50 = 72.5%
In this case, students would be rewarded for recognizing when they cannot do more than make a random guess.
Conclusion
By comparing uncorrected and guessing-corrected scores on multiple-choice examinations with scores on short-answer examinations, we demonstrated that dental students have been guessing at a level close to that anticipated for random guessing. In this retrospective analysis, applying the standard correction for guessing increased the validity of the multiple-choice examination in that the corrected scores agreed better with the scores on short-answer examinations presented in a more clinically relevant context. This study suggests that instructors using multiple-choice examinations should either correct for guessing or take into account the effect of guessing in setting the standard for minimal passing and, in fact, for all grade levels.
Footnotes
Dr. Prihoda is Associate Professor, Dr. Pinckard is Professor, Dr. McMahan is Professor, and Dr. Jones is Professor—all in the Department of Pathology, University of Texas Health Science Center at San Antonio. Direct correspondence and requests for reprints to Dr. Anne Cale Jones, Department of Pathology, University of Texas Health Science Center at San Antonio, 7703 Floyd Curl Drive, San Antonio, TX 78229-3900; 210-567-4122 phone; 210-567-2303 fax; jonesac@uthscsa.edu.