Milieu in Dental School and Practice
Key words: calibration, measurement, reliability
Submitted for publication 09/13/05; accepted 12/22/05
| Abstract |
|---|
To overcome these weaknesses, dental educators have repeatedly been urged to develop an appropriate evaluation system for both the clinic and preclinic contexts that includes valid criteria, an appropriate rating scale, and a training program that calibrates raters.8-15 In 1997, Knight developed prescriptions for the grading system.16 Drawing on decades of research dating to the 1960s, he specified several properties that define acceptable grading criteria. These specifications suggest that criteria should a) have individual and collective validity; b) be independent of each other; c) be sequenced based on the natural order of the procedure itself; d) be capable of being objectively tested; and e) provide clearly defined levels of performance. Most authors also agree that calibration is more successful when the number of points on rating scales is limited and when the determination of a rating for a criterion has been operationalized (e.g., a score of 3 is given when the pulpal floor is 2 mm ± 0.5 mm measured at a defined point using a periodontal probe).17 Unfortunately, many attempts to increase rater reliability through improved criteria, rating scales, and/or training have met with inconsistent results: reliability estimates have ranged from essentially zero to the upper nineties.3,18
Reports of rater training have frequently presented a reliability estimate based on the notion that faculty rank the quality of products similarly.18 However, a "high" reliability estimate does not necessarily imply that raters agree with each other. Further, it does not imply that agreement with a valid gold standard has been achieved. For example, on a ten-point scale, two sets of ratings for six products might be 9, 7, 5, 4, 3, 1 for one rater and 7, 5, 4, 3, 2, 0 for a second rater. While the rankings are identical from best to worst, it is clear that inter-rater agreement, as well as agreement with a gold standard as to the actual quality of each product, is lacking: the two raters never assign the same score. In this example, an estimate of reliability would nevertheless be approximately 1.0 (i.e., perfectly reliable). Conversely, when agreement is perfect, the rankings are necessarily identical as well, and the reliability estimate would be 1.0 (perfect).
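The gap between reliability and agreement can be checked numerically with the six ratings given above. The following is a minimal sketch, assuming the "reliability estimate" is a correlation coefficient (Pearson or Spearman rank); the text does not name the statistic, and the function names are illustrative only.

```python
import math

# Ratings from the example in the text (ten-point scale, six products).
rater_a = [9, 7, 5, 4, 3, 1]
rater_b = [7, 5, 4, 3, 2, 0]

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def ranks(values):
    """Rank 1 = highest score. Assumes no tied scores (true for this example)."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def exact_agreement(x, y):
    """Fraction of products on which both raters gave the same score."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

print("Pearson r:      ", round(pearson(rater_a, rater_b), 3))
print("Spearman rho:   ", round(spearman(rater_a, rater_b), 3))
print("Exact agreement:", exact_agreement(rater_a, rater_b))
```

Under these assumptions both correlations come out at or very near 1.0, while the proportion of identical scores is zero, which is the point the example makes.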
It seems that emphasis on agreement with a valid gold standard may be a more useful strategy when trying to establish rater calibration. Clearly, if each rater agrees with the gold standard, it follows that all raters agree with each other. Rather than discussing how each rater derived a particular measurement and how two raters can compromise in an effort to be in agreement, the discussion should revolve around how to follow more closely the guidelines set forth by the gold standard. Usually, the course coordinator or some other senior faculty member with established expertise and experience teaching in the discipline would serve as that gold standard, and all faculty members would be expected to agree with that person's standards.
The "gold standard" would be responsible for generating valid criteria, rating scales, and error-free examples of each criterion. Then, preparations showing errors of varying magnitude for each criterion would be collected or produced and used during calibration training. Normally, these products would be checked against the assessment of at least one other "expert" to ensure accuracy. Calibration training would use these products in an attempt to achieve agreement.
The application of Knight's guidelines for developing grading criteria, combined with an emphasis on agreement with a gold standard, has not been examined. Further, there have been no reports suggesting how long the effects of calibration training persist. This would be important to know because it would provide course coordinators with an estimate of how frequently calibration ought to be updated. Therefore, this investigation, which applied Knight's guidelines for developing a valid grading system, had two purposes: 1) to evaluate the immediate effects of calibration to a gold standard on inter-rater agreement on subtasks; and 2) to determine whether the effects of initial training could be sustained over a longer period of time.
| Materials and Methods |
|---|
A grade form was developed based on recommendations by Knight.16 Thirteen criteria for Class II amalgam preparation were selected based on a widely accepted operative textbook used in our pre-clinical laboratory course.19 The criteria were divided into three main categories: occlusal extension and outline form, proximal box extension and outline form, and preparation finish. To enhance calibration, the criteria were placed in an order that best represents the procedure's sequence. For each criterion, three levels of outcome were specifically described: "ideal," "acceptable," and "standard not met." The terminology used was consistent with what was being taught to the undergraduate students. The grade form is provided in Table 1.
Prior to faculty participation, we obtained Institutional Review Board approval. Nine instructors who taught in the operative preclinical lab volunteered to evaluate the Class II amalgam preparations at three different times: prior to obtaining any calibration training, immediately after calibration, and ten weeks after calibration. Of the nine faculty members, four were full-time, two were part-time, and three were dentists enrolled in one of the school's graduate programs.
Ten of the thirty prepared teeth were randomly selected and served as the pretest. Prior to receiving any calibration training, the instructors independently evaluated these teeth by completing the grade form. Raters were provided with periodontal probes, explorers, and mirrors. The use of loupes was voluntary, and lighting was provided by the existing fluorescent ceiling fixtures. Instructors required no more than thirty minutes to grade the ten teeth in any test period.
Following the pretest, instructors participated in a PowerPoint presentation (approximately twenty minutes) that explained each criterion and the different levels of acceptability. For each criterion, a photograph representing the ideal was discussed first. The discussion covered the test used to determine whether that criterion had been met (what instrument to use and how to use it, including placement and orientation) as well as explanations of what constituted ideal, acceptable, and standard not met. For example, for the criterion related to isthmus extension and width, one photograph conveyed ideal extension and width; a second showed an acceptable preparation with the isthmus width slightly more than 1.5 mm and slightly shifted buccally or lingually, destroying part of the triangular ridge; and a third showed a preparation with the standard not met, where the isthmus width was more than 3 mm, destroying both buccal and lingual triangular ridges. Throughout, there was opportunity for questions.
Raters were then randomly divided into groups of two, and each group was given two of the pretest teeth. First, each rater independently evaluated the preparation; when done, scores for each criterion were compared with the key (i.e., the gold standard). Any discrepancies between a rater and the gold standard were discussed and a resolution sought. When discrepancies arose, raters were urged to reevaluate their application of the test and their interpretation and/or recall of the criterion. Raters were asked to help each other in their quest to calibrate with the gold standard and not to continue until discrepancies had honestly been resolved. At that point, raters formed new groups of three and, later, six instructors. At each stage a new pair of teeth was scored and agreement sought with the gold standard. These group discussions required approximately another forty-five minutes.
Ten out of the remaining twenty teeth were randomly selected and served as the post-test. Immediately after finishing the calibration training, the nine instructors independently evaluated these teeth by completing the grade form. The evaluation was carried out in the same manner as for the pretest.
Ten weeks after the calibration training, the nine instructors independently evaluated the remaining ten teeth by completing the grade form. The evaluation was carried out in the same manner as in the pre- and post-test settings. During the ten-week interval, instructors did not receive any additional training and did not use the grade form developed for this study in any evaluation of student performance in the clinic.
The main purpose of this investigation was to evaluate whether raters could be calibrated to agree with a gold standard over time. Toward this end, data analysis consisted of the number of faculty, expressed as a percent, who agreed with the gold standard on each criterion at the three test periods (pretest, immediate post-test, and ten-week post-test). Agreement is defined as the assignment of the same rating as the gold standard.
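The agreement measure defined above can be illustrated with a short sketch. The data below are invented for illustration and are not the study's results. The numeric coding is also partly an assumption: the text identifies a score of 0 with "standard not met," while 1 = "acceptable" and 2 = "ideal" are inferred here. The function names are hypothetical.

```python
from collections import defaultdict

def percent_agreement(gold, ratings):
    """Percent of rater judgments on one criterion that equal the gold standard.

    gold    : list of gold standard scores, one per tooth
    ratings : dict mapping rater name -> list of scores, one per tooth
    """
    matches = total = 0
    for rater_scores in ratings.values():
        for g, r in zip(gold, rater_scores):
            total += 1
            matches += (r == g)
    return 100.0 * matches / total

def agreement_by_gold_score(gold, ratings):
    """Same measure, broken down by the gold standard score of each tooth."""
    hits, counts = defaultdict(int), defaultdict(int)
    for rater_scores in ratings.values():
        for g, r in zip(gold, rater_scores):
            counts[g] += 1
            hits[g] += (r == g)
    return {score: 100.0 * hits[score] / counts[score] for score in counts}

# Hypothetical scores for one criterion on four teeth (NOT study data).
gold = [2, 1, 0, 2]
ratings = {
    "rater_1": [2, 1, 1, 2],
    "rater_2": [2, 2, 0, 2],
    "rater_3": [1, 1, 1, 2],
}

print(round(percent_agreement(gold, ratings), 1))
for score, pct in sorted(agreement_by_gold_score(gold, ratings).items()):
    print(score, round(pct, 1))
```

Only an exact match counts as agreement, mirroring the definition in the text; a rating one level away from the gold standard is scored the same as any other disagreement.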
| Results |
|---|
| Discussion |
|---|
Two general trends emerge. First, it is arguable whether the percent agreement achieved on the immediate post-test is satisfactory. For example, for gold standard scores of 0 and 1, agreement in that test period was 52.8 and 57.9 percent, respectively. On that test, the most impressive gains in agreement occurred when the gold standard score was either "standard not met" or "ideal." There was very little improvement when the gold standard score was "acceptable." Second, there was virtually no decrement in agreement following the ten-week period.
Nevertheless, there are significant problems in calibrating faculty. What is particularly troublesome is the poor level of agreement with the gold standard score of "standard not met." Even though the level of agreement improved with training, too often instructors deemed a product to be acceptable when in fact it was unacceptable. Agreement with the gold standard score of "0" was only 52.8 and 56.4 percent on the immediate and delayed post-tests, respectively. The implication is that, on almost half of all occasions, faculty would not be able to provide consistent feedback to students regarding the acceptability or nonacceptability of their work.
It is difficult to determine why agreement with the gold standard for certain criteria was so resistant to improvement. Criterion #8, axial depth, is a measurement taken from the cavosurface margin of the gingival wall to the axial wall. It is possible that some raters measured from the pulpal floor to the gingival floor instead. As for convergence (criterion #10), it is doubtful that accuracy is achievable by humans who must judge angulation created over a span of approximately 2.5 mm. Finally, criterion #11, which deals with the exit angle at the cavosurface margin, is similarly difficult to judge with the naked eye in the absence of a mechanical measuring aid. Conceivably, additional insight might have been obtained by debriefing several of the raters. Unfortunately, that was not done.
Years ago Mackenzie proposed that instructors should become "competent" evaluators.9 In other words, each instructor should be required to achieve a certain level of agreement with a gold standard in order to evaluate student work. The implication is that a rater who cannot attain that level of competence would be expected to continue working on improving this skill perhaps under the direct mentoring of the individual who represents the gold standard. The cycle of training, testing, mentoring, and further testing would continue until that agreed-upon level of mastery had been reached. This approach would serve to operationalize the phrase "calibration with a gold standard." It would also offer an objective way to estimate teaching competence in the context of providing accurate feedback to students.
As with any investigation, there was an associated limitation to this study. The decision to randomly select prepared teeth from those already completed by students resulted in gold standard scores of "0" being dramatically underrepresented on the three test sessions (7.6 percent, 3.0 percent, and 11.5 percent). Future studies should select teeth using a strategy that ensures more equitable representation across gold standard scores. Selection of the zeros should also ensure that different criteria are responsible for this "standard not met" score.
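One way to achieve more equitable representation is stratified random selection. The sketch below is illustrative only and makes a simplifying assumption that is not in the study: each tooth is summarized by a single gold standard score, whereas the study scored thirteen criteria per tooth. The pool of teeth, the function name, and the parameters are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(teeth, per_score, seed=0):
    """Randomly select the same number of teeth from each gold standard score.

    teeth     : list of (tooth_id, gold_score) pairs
    per_score : number of teeth to draw from each score level
    seed      : fixed seed so the selection can be reproduced
    """
    rng = random.Random(seed)
    by_score = defaultdict(list)
    for tooth_id, score in teeth:
        by_score[score].append(tooth_id)

    selected = []
    for score in sorted(by_score):
        pool = by_score[score]
        if len(pool) < per_score:
            raise ValueError(f"only {len(pool)} teeth available with score {score}")
        # random.Random.sample draws without replacement.
        selected.extend((tooth_id, score) for tooth_id in rng.sample(pool, per_score))
    return selected

# Hypothetical pool of 30 teeth with scores 0, 1, and 2 (NOT study data).
pool = [(tooth_id, tooth_id % 3) for tooth_id in range(1, 31)]
test_set = stratified_sample(pool, per_score=3)
print(test_set)
```

Extending this to the second recommendation in the text, that different criteria account for the "standard not met" scores, would require stratifying on the criterion responsible as well as on the score.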
Although difficult to obtain, faculty calibration is one of the most honorable achievements possible, for it serves to improve learning and results in a fair evaluation of student performances. This study has shown that improvements can be stable across time, but that substantially more effort is required to achieve a satisfactory level of agreement across all gold standard scores.
| Conclusion |
|---|
| Footnotes |
|---|
| REFERENCES |
|---|