J Dent Educ. 70(4): 428-433 2006
© 2006 American Dental Education Association

Milieu in Dental School and Practice

Rater Reliability: Short- and Long-Term Effects of Calibration Training

Reem Haj-Ali, B.D.S., D.D.S., M.S.; Philip Feil, Ed.D.

Key words: calibration, measurement, reliability

Submitted for publication 09/13/05; accepted 12/22/05


   Abstract
 
The purpose of this investigation was to evaluate the immediate effects of calibration on inter-rater agreement to a gold standard (GS) and to determine whether the effects can be sustained over a ten-week period. Valid criteria for a Class II amalgam preparation, a three-point rating scale, and a grade form were developed. Three tests were administered: prior to calibration training, immediately following training, and ten weeks later. Each test consisted of faculty independently evaluating ten prepared teeth. Agreement with GS scores for most of the grading criteria improved as a result of training and did not deteriorate over time. The overall percent agreement was 54.5, 66.9, and 64.6 percent across test periods. The most impressive gains in agreement occurred when the criteria evaluated had a GS score of either "standard not met" or "ideal." There was very little gain when the gold standard score was "acceptable." It is concluded that, with training, inter-rater agreement with a gold standard can improve and such improvement is reasonably resistant to deterioration after ten weeks. Nevertheless, future training ought to consider the use of a mastery approach in calibration training to ensure that a satisfactory degree of agreement with the GS is obtained.


Knowledge of results (KR), whether provided by students or instructors, forms the basis for the acquisition of skill in preclinical laboratory courses. KR is information that describes the adequacy of the product both during production and at its completion. This information is used by the student to make appropriate alterations in the next attempt in order to achieve a higher level of performance. Several studies representing different disciplines have shown that when students can accurately self-evaluate their progress and product, the quality of the outcome is significantly improved.1–7 Nevertheless, students will always seek KR from instructors, so it is imperative that the information instructors provide is consistently accurate. This is a very difficult task to achieve because of the many interacting components that play a role; these include the adequacy of the criteria, the grading scale, and rater training. For example, one article identifies sixteen sources contributing to disagreement in the evaluation of students’ products.7

To overcome these weaknesses, dental educators repeatedly have been urged to develop an appropriate evaluation system for both the clinic and preclinic contexts that includes valid criteria, an appropriate rating scale, and a training program that calibrates raters.8–15 In 1997, Knight developed prescriptions for the grading system.16 Drawing on decades of research dating to the 1960s, he specified several characteristics that define acceptable grading criteria. These specifications suggest that criteria should a) have individual and collective validity; b) be independent of each other; c) be sequenced based on the natural order of the procedure itself; d) be capable of being objectively tested; and e) provide clearly defined levels of performance. Most authors also agree that calibration is more successful when the number of points on rating scales is limited and when the determination of a rating for a criterion has been operationalized (e.g., a score of 3 is given when the pulpal floor is 2 mm ± 0.5 mm measured at a defined point using a periodontal probe).17 Unfortunately, many attempts to increase rater reliability through improved criteria, rating scales, and/or training have met with inconsistent results: reliability estimates have ranged from essentially zero to the upper nineties.3,18

Results of rater training have frequently reported a reliability estimate based on the notion that the relative rankings of quality across products by faculty are similar.18 However, a "high" reliability estimate does not necessarily imply that raters agree with each other. Further, it does not imply that agreement with a valid gold standard has been achieved. Thus, on a ten-point scale, two sets of ratings for six products might be 9, 7, 5, 4, 3, 1 for one rater and 7, 5, 4, 3, 2, 0 for a second rater. While the rankings are identical from best to worst, it is clear that inter-rater agreement as well as agreement with a gold standard as to the actual quality of each product is lacking. In this example, an estimate of reliability would be approximately 1.0 (i.e., perfectly reliable). Obviously, when agreement is perfect, so too is the ranking, and a reliability estimate would be 1.0 (perfect).
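The contrast between reliability and agreement can be checked numerically with the six ratings given above. What follows is a minimal Python sketch and not part of the original study. The function names `ranks`, `rank_correlation`, and `exact_agreement` are illustrative, and a rank-based (Spearman-type) correlation is assumed as the reliability estimate, since the text does not name a specific coefficient.

```python
from math import sqrt

def ranks(scores):
    """Rank scores from 1 (lowest) upward; assumes no tied scores."""
    order = sorted(scores)
    return [order.index(s) + 1 for s in scores]

def rank_correlation(a, b):
    """Pearson correlation computed on ranks (a Spearman-type estimate)."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / sqrt(var_a * var_b)

def exact_agreement(a, b):
    """Percent of products on which two raters gave the same score."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

# The two sets of ratings from the example in the text.
rater_1 = [9, 7, 5, 4, 3, 1]
rater_2 = [7, 5, 4, 3, 2, 0]

reliability = rank_correlation(rater_1, rater_2)
agreement = exact_agreement(rater_1, rater_2)
print(round(reliability, 3), agreement)
```

Under these assumptions the rank correlation comes out at 1.0 while exact agreement is 0 percent: the two raters order the products identically yet never assign the same score.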

It seems that emphasis on agreement with a valid gold standard may be a more useful strategy when trying to establish rater calibration. Clearly, if agreement by each rater with a gold standard is achieved, it follows that all raters would agree with each other. Rather than discussing how each rater derived a particular measurement and how the two raters can compromise in an effort to be in agreement, the discussion should revolve instead around how to more closely follow the guidelines set forth by the gold standard. Usually, the course coordinator or some other senior faculty member with established expertise and experience teaching in the discipline would serve as that gold standard, and all faculty members would be expected to agree with that person’s standards.

The "gold standard" would be responsible for generating valid criteria, rating scales, and error-free examples of each criterion. Then, preparations showing errors of varying magnitude for each criterion would be collected or produced and used during calibration training. Normally, these products would be checked against the assessment of at least one other "expert" to ensure accuracy. Calibration training would use these products in an attempt to achieve agreement.

The application of Knight’s guidelines for developing grading criteria and the emphasis on agreement with a gold standard have not been examined. Further, there have been no reports suggesting how long the effects of calibration training persist. This would be important to know because it would provide course coordinators with an estimate of how frequently calibration ought to be updated. Therefore, this investigation, which applied Knight’s guidelines for developing a valid grading system, had two purposes: 1) to evaluate the immediate effects of calibration to a gold standard on inter-rater agreement on subtasks; and 2) to determine whether the effects of initial training could be sustained over a longer period of time.


   Materials and Methods
 
For purposes of this study, the gold standard consisted of the evaluation of test teeth as determined by the lead author and the course coordinator. Their evaluations, in turn, were based on the development of valid criteria and a grade form including its rating scale that were consistent with currently published prescriptions. Each item is described in this section.

A grade form was developed based on recommendations by Knight.16 Thirteen criteria for Class II amalgam preparation were selected based on a widely accepted operative textbook used in our preclinical laboratory course.19 The criteria were divided into three main categories: occlusal extension and outline form, proximal box extension and outline form, and preparation finish. To enhance calibration, the criteria were placed in an order that best represents the procedure’s sequence. For each criterion, three levels of outcome were specifically described: "ideal," "acceptable," and "standard not met." The terminology used was consistent with what was being taught to the undergraduate students. The grade form is provided in Table 1.


Table 1. Grade form: Class II amalgam preparation
 
A random sample of thirty preparations was selected from a much larger collection of student work during previous practical examinations in which students attempted to achieve an ideal preparation. Two instructors independently scored each of the thirty preparations using the criteria and grade form; one of the instructors was the course coordinator and the other was the primary author. The two ratings for each preparation were compared, and discrepancies were discussed and resolved. The resolved score for each criterion represented the gold standard score and served as the keys for the three tests.

Prior to faculty participation, we obtained Institutional Review Board approval. Nine instructors who taught in the operative preclinical lab volunteered to evaluate the Class II amalgam preparations at three different times: prior to obtaining any calibration training, immediately after calibration, and ten weeks after calibration. Of the nine faculty members, four were full-time, two were part-time, and three were dentists enrolled in one of the school’s graduate programs.

Ten of the thirty prepared teeth were randomly selected and served as the pretest. Prior to receiving any calibration training, the instructors independently evaluated these teeth by completing the grade form. Raters were provided with periodontal probes, explorers, and mirrors. The use of loupes was voluntary, and lighting was provided by existing fluorescent ceiling appliances. Instructors required no more than thirty minutes to grade the ten teeth on any test period.

Following the pretest, instructors participated in a PowerPoint presentation (approximately twenty minutes) that explained each criterion and the different levels of acceptability. For each criterion, a photograph representing the ideal was discussed first. The foci of the discussion included the test to use to determine if that criterion had been met (what instrument to use and how to use it, including placement and orientation) as well as explanations of what constituted ideal, acceptable, and standard not met. For example, for the criterion related to isthmus extension and width, one photograph conveyed ideal extension and width; a second photo showed an acceptable preparation with the isthmus width slightly more than 1.5 mm and slightly shifted buccally or lingually, destroying part of the triangular ridge; and a third showed a preparation with the standard not met, where the isthmus width was more than 3 mm, destroying both buccal and lingual triangular ridges. Throughout, there was opportunity for questions.

Raters were then randomly divided into groups of two, and each group was given two of the pretest teeth. First, each rater independently evaluated the preparation, and when done, scores for each criterion were compared with the key (i.e., the gold standard). Any discrepancies between a rater and the gold standard were discussed and a resolution sought. When discrepancies arose, raters were urged to reevaluate their application of the test and their interpretation and/or recall of the criterion. Raters were asked to help each other in their quest to calibrate with the gold standard and not to continue until discrepancies had honestly been resolved. At that point, raters formed a new group of three and later six instructors. At each point a new pair of teeth was scored, and agreement sought with the gold standard. These group discussions required approximately another forty-five minutes.

Ten out of the remaining twenty teeth were randomly selected and served as the post-test. Immediately after finishing the calibration training, the nine instructors independently evaluated these teeth by completing the grade form. The evaluation was carried out in the same manner as for the pretest.

Ten weeks after the calibration training, the nine instructors independently evaluated the remaining ten teeth by completing the grade form. The evaluation was carried out in the same manner as in the pre- and post-test settings. During the ten-week interval, instructors did not receive any additional training and did not use the grade form developed for this study in any evaluation of student performance in the clinic.

The main purpose of this investigation was to evaluate whether raters could be calibrated to agree with a gold standard over time. Toward this end, data analysis consisted of the number of faculty, expressed as a percent, who agreed with the gold standard on each criterion at three test periods (pretest, immediate post-test, and ten week post-test). Agreement is defined as the assignment of the same rating as the gold standard.
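The agreement measure defined above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' analysis code. The function name `percent_agreement` and the sample scores are hypothetical. The coding 0 = "standard not met," 1 = "acceptable," 2 = "ideal" follows the score values used in the Results.

```python
def percent_agreement(rater_scores, gold_scores):
    """Percent of ratings identical to the gold standard.

    rater_scores: one list per rater, holding that rater's score
                  for every tooth on a single criterion.
    gold_scores:  the gold standard score for each tooth on that criterion.
    """
    total = 0
    agree = 0
    for scores in rater_scores:
        if len(scores) != len(gold_scores):
            raise ValueError("each rater must score every tooth")
        for given, gold in zip(scores, gold_scores):
            total += 1
            agree += given == gold  # True counts as 1
    return 100.0 * agree / total

# Hypothetical data: 3 raters scoring 4 teeth on one criterion
# (0 = standard not met, 1 = acceptable, 2 = ideal).
gold = [2, 1, 0, 2]
raters = [
    [2, 1, 1, 2],  # 3 of 4 match the gold standard
    [2, 2, 0, 2],  # 3 of 4 match
    [1, 1, 1, 2],  # 2 of 4 match
]
result = percent_agreement(raters, gold)
print(round(result, 1))
```

If all nine raters scored all ten teeth in a test, each criterion would rest on 90 ratings, so a reported agreement of 53.3 percent would correspond to 48 matching ratings.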


   Results
 
Table 2 provides percent exact agreement with the gold standard for each of the thirteen criteria and an overall rating across all test administrations. For example, for criterion #1 there was 53.3 percent agreement on the pretest and 61.1 and 65.6 percent agreement on the subsequent tests. These data show that there was substantial improvement in agreement for most of the criteria and a decrement in agreement for criterion #8. Agreement was stable from the immediate to the delayed post-test for most criteria; however, a substantial decrement (10 percent or more) in agreement was experienced for criterion #9. The percent "perfect agreement" for the overall rating was 54.5, 66.9, and 64.6 on the pretest, post-test, and delayed test, respectively. Nevertheless, the large standard deviations suggest substantial variability.


Table 2. Percent agreement with the gold standard by criterion and test time
 
Table 3 shows the degree of agreement between raters’ scores and the gold standard score collapsed across criteria. It shows, for example, what score was recorded by raters (expressed as a percent) given a gold standard score of zero, one, and two. For example, when the gold standard was, in fact, a "0," percent agreement with that score on the pretest, immediate post-test, and delayed post-test was 37.8, 52.8, and 56.4, respectively. For that same gold standard score, the percent of raters who assigned a score of "1" was 43.2, 33.0, and 34.7, respectively. When the gold standard was a "1," the percent of raters who also provided a score of 1 across test periods was 52.3, 57.9, and 55.9. When the gold standard was a "2," percent agreement on the three tests was 57.3, 74.7, and 73.9.


Table 3. Percent (SD) agreement with gold standard scores of criteria at three test periods
 
These data suggest that the degree of agreement across tests did improve with training when gold standard scores were either "0" or "2," but did not change for a score of "1." Nevertheless, in spite of that improvement, the data show that a substantial percent of raters could not agree with a score that was considered by the gold standard to be unacceptable (standard not met). In that case, percent disagreement, collapsed for rater scores of 1 and 2, was 47.2 and 43.6 on the immediate and delayed post-tests. Very few raters, when grading a gold standard example of acceptable or ideal, assigned a score of "standard not met" on the immediate and delayed post-tests.
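The breakdown of rater scores by gold standard score, and the collapsed disagreement figure, can be illustrated with a short Python sketch. It is a hypothetical reconstruction of the tabulation, not the authors' procedure. The function name `score_distribution` and the pooled ratings are invented for illustration; only the 52.8 and 56.4 percent agreement values are taken from the text.

```python
from collections import Counter, defaultdict

def score_distribution(pairs):
    """For each gold standard score, percent of rater scores at each level.

    pairs: iterable of (gold_score, rater_score) tuples pooled across criteria.
    Returns {gold_score: {rater_score: percent}} for rater scores 0, 1, and 2.
    """
    counts = defaultdict(Counter)
    for gold, given in pairs:
        counts[gold][given] += 1
    table = {}
    for gold, counter in counts.items():
        n = sum(counter.values())
        table[gold] = {s: 100.0 * counter[s] / n for s in (0, 1, 2)}
    return table

# Hypothetical pooled ratings, for illustration only.
pairs = [(0, 0), (0, 1), (0, 1), (0, 2),
         (1, 1), (1, 1), (1, 2), (1, 0),
         (2, 2), (2, 2), (2, 2), (2, 1)]
table = score_distribution(pairs)

# Disagreement with a gold standard score of 0,
# collapsed over rater scores of 1 and 2.
disagreement_0 = table[0][1] + table[0][2]
print(table[0], disagreement_0)

# The same collapse applied to the agreement figures reported in the text.
reported_agreement_0 = {"immediate": 52.8, "delayed": 56.4}
reported_disagreement_0 = {
    test: round(100.0 - pct, 1) for test, pct in reported_agreement_0.items()
}
print(reported_disagreement_0)
```

The last step reproduces the 47.2 and 43.6 percent disagreement values quoted above as the complement of the reported agreement with a gold standard score of "0."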


   Discussion
 
It is widely agreed that faculty who rate student products in preclinical laboratory or clinical courses should be calibrated so that their ratings are consistent and accurate. This is crucial so that students can obtain accurate knowledge of results and scores that reflect that assessment. This investigation sought to determine if instructors who grade students’ preclinical laboratory products could improve their agreement with a gold standard following calibration on an immediate and delayed (ten weeks) post-test with no intervening training. The results are mixed, and while the trend suggested substantial improvement with virtually no deterioration across tests, there were some disappointments.

Two general trends emerge. First, it is arguable whether the percent agreement achieved on the immediate post-test is satisfactory. For example, for gold standard scores of 0 and 1, agreement on that test period was 52.8 and 57.9 percent. On that test, the most impressive gains in agreement occurred when the gold standard score was either "standard not met" or "ideal." There was very little improvement when the gold standard score was "acceptable." Second, there was virtually no decrement in agreement following the ten-week period.

Nevertheless, there are significant problems in calibrating faculty. What is particularly troublesome is the poor level of agreement with the gold standard score of "standard not met." Even though the level of agreement improved with training, too often instructors deemed a product to be acceptable when in fact it was unacceptable. Agreement with the gold standard score of "0" was only 52.8 and 56.4 percent on the immediate and delayed post-tests. The implication is that, for almost half of all occasions, faculty would not be able to provide consistent feedback to students regarding the acceptability/nonacceptability of their work.

It is difficult to determine why agreement with the gold standard for certain criteria was so resistant to improvement. Criterion #8, axial depth, is a measurement taken from the cavosurface margin of the gingival wall to the axial wall. It is possible that some raters measured from the pulpal floor to the gingival floor instead. As for convergence (criterion #10), it is doubtful that accuracy is achievable by humans who must judge angulation created over a span of approximately 2.5 mm. Finally, criterion #11, which deals with the exit angle at the cavosurface margin, is similarly difficult for the naked eye in the absence of a mechanical measuring aid. Conceivably, additional insight might have been obtained by debriefing several of the raters. Unfortunately, that was not done.

Years ago Mackenzie proposed that instructors should become "competent" evaluators.9 In other words, each instructor should be required to achieve a certain level of agreement with a gold standard in order to evaluate student work. The implication is that a rater who cannot attain that level of competence would be expected to continue working on improving this skill perhaps under the direct mentoring of the individual who represents the gold standard. The cycle of training, testing, mentoring, and further testing would continue until that agreed-upon level of mastery had been reached. This approach would serve to operationalize the phrase "calibration with a gold standard." It would also offer an objective way to estimate teaching competence in the context of providing accurate feedback to students.
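The mastery cycle described above can be expressed as a simple loop. The Python sketch below is a hedged illustration, not a procedure taken from Mackenzie or from this study. The function `calibrate_to_mastery`, its callables, the 80 percent threshold, the cap on rounds, and the sample scores are all hypothetical, since the text does not specify a mastery level.

```python
def calibrate_to_mastery(take_test, train, threshold, max_rounds=5):
    """Repeat training and testing until agreement reaches the threshold.

    take_test: hypothetical callable returning percent agreement
               with the gold standard on a fresh test.
    train:     hypothetical callable running one round of training or mentoring.
    threshold: agreed-upon mastery level in percent (not given in the text).
    Returns (final_agreement, rounds_of_training_used).
    """
    agreement = take_test()
    rounds = 0
    while agreement < threshold and rounds < max_rounds:
        train()
        rounds += 1
        agreement = take_test()
    return agreement, rounds

# Illustration with made-up scores that improve after each round.
scores = iter([55.0, 67.0, 82.0, 91.0])
final, rounds = calibrate_to_mastery(
    take_test=lambda: next(scores),
    train=lambda: None,
    threshold=80.0,  # assumed value, chosen only for the example
)
print(final, rounds)
```

The cap on rounds is a design choice in this sketch so that the loop terminates even if a rater never reaches the threshold; a rater who hits the cap would simply be returned with a final agreement below the threshold.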

As with any investigation, there was an associated limitation to this study. The decision to randomly select prepared teeth from those already completed by students resulted in gold standard scores of "0" being dramatically underrepresented on the three test sessions (7.6 percent, 3.0 percent, and 11.5 percent). Future studies should select teeth using a strategy that ensures more equitable representation across gold standard scores. Selection of the zeros should also ensure that different criteria are responsible for this "standard not met" score.

Although difficult to obtain, faculty calibration is one of the most honorable achievements possible, for it serves to improve learning and results in a fair evaluation of student performances. This study has shown that improvements can be stable across time, but that substantially more effort is required to achieve a satisfactory level of agreement across all gold standard scores.


   Conclusion
 
Within the limitations of this study, it may be tentatively concluded that: a) with calibration, raters can improve agreement with a gold standard; b) the outcomes of calibration are reasonably resistant to deterioration over a ten-week period; and c) future training ought to consider the use of a mastery approach to calibration training.


   Footnotes
 
Dr. Haj-Ali is Assistant Professor, Department of Restorative Dentistry, and Dr. Feil is Professor and Chair, Department of Dental Public Health and Behavioral Science—both at the University of Missouri-Kansas City School of Dentistry. Direct correspondence and requests for reprints to Dr. Reem Haj-Ali, University of Missouri-Kansas City School of Dentistry, 650 E. 25th Street, Kansas City, MO 64108; 816-235-2012 phone; 816-236-2157 fax; haj-alir{at}umkc.edu.


   REFERENCES
 

  1. Feil PH, Reed T, Hart JK. Continuous knowledge of results and psychomotor skill acquisition. J Dent Educ 1986;50:300–3.
  2. Feil PH, Reed T. The effect of knowledge of the desired outcome on dental motor performance. J Dent Educ 1988;52:198–201.
  3. Reed TR, Feil PH, Greer DE. The reliability and agreement of subtask assessments. J Dent Educ 1988;52:554–7.
  4. Feil PH. A theory of motor performance and its applications to preclinical dental skill acquisition. J Dent Educ 1989;53:226–32.
  5. Knight GW, Guenzel PJ. Discrimination training and formative evaluation for remediation in basic waxing skills. J Dent Educ 1990;54:194–8.
  6. Knight GW, Guenzel PJ, Fitzgerald M. Teaching recognition skills to improve products. J Dent Educ 1990;54:739–42.
  7. Feil PH, Gatti JJ. Validation of a motor skills performance theory with applications for dental education. J Dent Educ 1993;57:628–33.
  8. Mackenzie RS, Antonson DE, Weldy PL, Welsch BB, Simpson WJ. Analysis of disagreement in the evaluation of clinical products. J Dent Educ 1982;46:284–9.
  9. Mackenzie RS. Defining clinical competence in terms of quality, quantity, and need for performance criteria. J Dent Educ 1973;37:37–44.
  10. Gaines WG, Bruggers H, Rasmussen RH. Reliability of ratings in preclinical fixed prosthodontics: effect of objective scaling. J Dent Educ 1974;38:672–5.
  11. Schiff AJ, Salvendy G, Root CM, Ferguson GW, Cunningham PR. Objective evaluation of quality in cavity preparations. J Dent Educ 1975;39:92–6.
  12. O’Connor P, Lorey RE. Improving interrater agreement in evaluation in dentistry by the use of comparison stimuli. J Dent Educ 1978;42:174–9.
  13. Goepferd SJ, Kerber PE. A comparison of two methods for evaluating primary Class II cavity preparations. J Dent Educ 1980;44:537–42.
  14. Edwards WS, Morse PK, Mitchell RJ. A practical evaluation system for preclinical restorative dentistry. J Dent Educ 1982;46:693–6.
  15. Vann WF, Machen JB, Hounshell PB. Effects of criteria and checklists on reliability in preclinical evaluation. J Dent Educ 1983;47:671–5.
  16. Knight GW. Toward faculty calibration. J Dent Educ 1997;61:941–6.
  17. Tinsley H, Weiss D. Interrater reliability and agreement of subjective judgments. J Counseling Psych 1975:358–76.
  18. Feil PH. An analysis of the reliability of a laboratory evaluation system. J Dent Educ 1982;46:489–94.
  19. Roberson T, Heymann H, Swift E. Sturdevant’s art and science of operative dentistry. 4th ed. St. Louis: Mosby, Inc., 2002.


