Educational Methodologies
Key words: evaluation, competency, ratings, clinical, stereotyping, bias
Submitted for publication 08/23/04; accepted 10/13/04
| Abstract |
|---|
Not all generalizations are sound. Some are too ambiguous to be useful: "The patient might be fine." Some are too precise to travel well: "If you really want to know how good the student is, I will photocopy all his patient charts." Perhaps they are based on insufficient evidence: "I only worked with her once." Some generalizations are misleading; they are off center and will lead to surprises if acted upon. We call these biased generalizations: "We never had any really good applicants from that feeder school."
Stereotypes are a special class of generalizations, usually thought of as being negative, biased, and based on insufficient evidence. But they can also be overly positive and well grounded in experience, such as the feelings we usually have for our friends. The distinguishing characteristic of a stereotype is that it distorts new information. Stereotypes are self-protecting in the sense of making confirmatory evidence more readily available and erecting a barrier against disconfirming observations.1
Although generalizations are unavoidable in dental education and are largely useful, we should work to eliminate stereotypes. A faculty member who is slow to understand the subtleties of teaching and has a few rough classes in the first years can be branded as a poor teacher and later commendations get discounted. A school with an established reputation may not be scrutinized carefully by applicants.
The research reported here addresses the possibility of stereotypes in clinical grading. In cases where faculty members rate students repeatedly, the potential exists for initial ratings to dominate the perception of later performance.
Humphris and Kaney2 found that examiners judging medical students' communication ability preserved the same standards across extended periods of time. Previous ratings did not dull their attention to the evaluation task. In a study of family practice residents, Bornsten et al.3 found a willingness to change treatment decisions based on new information, regardless of whether the previous treatment decision was the resident's own or one that had been made by another resident. Ryan et al.4 performed a complex statistical analysis on the ratings of residents in emergency department rotations. They found that leniency (defined as the higher difference between ratings by particular staff compared with the average of all ratings) was positively correlated (r = .52) with overall ratings. They interpreted these findings as being consistent with the existence of a positive stereotype. Interrater agreement in the assessment of neurological signs was found to be higher when background on the case was made available to the evaluator,5 again a case of positive stereotyping. In a generalizability study of judging social dance,6 the dancer-by-judge-by-performance interaction was found to account for 15 percent of the variance. Regrettably, in the study design used, this component is contaminated with error variance, so it is impossible to determine how much smaller this interaction effect really is.
Although a small body of literature suggests the existence of a small, positive stereotyping effect in repeated evaluations of performance, all of these studies are suggestive at best. Each has one or more methodological limitations. There is an inherent difficulty in studying stereotyping in repeated measures. Assume that data show that Rater A judges Student X to be very good on two or three successive occasions. There are at least two plausible explanations: 1) Rater A has stereotyped Student X and discounted information about later performance, and 2) Student X is consistently good and that fact has been correctly registered by Rater A. On the strength of the data themselves, there is no way to distinguish between the two explanations.
In the study reported in this article, this methodological challenge is addressed by using a dataset in which two sets of ratings are compared for students at Time2: those where the same faculty members rated these students at Time1 and those where the same students were rated for the first time by a new set of faculty members. By comparing the ratings from first-time evaluators with ratings of the same students by faculty members who previously evaluated these students, it should be possible to form an estimate and characterization of the carryover of information in subsequent ratings of students (a measure of stereotyping). This approach resembles the theoretical discussion on regressed reading scores found in Wainer.7
Based on the literature and my personal experience with the clinical rating system studied, it is hypothesized that:
| Materials and Methods |
|---|
The datasets for competency ratings for the classes that graduated in 2003 and 2004 were included in this study. Faculty records and clinical evaluation forms were reviewed for each quarter to classify a faculty member in one of three categories: 1) faculty member who had rated the student in the previous quarter, 2) faculty member who was rating the student for the first time, or 3) faculty member who had an inconsistent pattern of rating in some previous quarters and not others. Data from the latter category of faculty members were excluded from the study. Of special interest was the set of ratings made by faculty members who were new to the clinic during students' twelfth and final quarter. These represented ratings of competency immediately prior to graduation made by faculty members who were rating these students for the first time without having formed prior opinions.
In each of quarters 6 through 12, there were four sets of ratings: 1) new evaluators for the quarter in question, 2) raters who had evaluated the students the previous quarter (or more) for the quarter in question, 3) repeating raters for the previous quarter, and 4) new raters in quarter 12. These relationships are shown schematically in Figure 1. PR|N is the partial correlation between previous (P) and repeat (R) ratings with new (N) ratings factored out. There were six such datasets, three each for the two classes. The three different sets per class were ratings of clinical judgment, patient management, and technical skill within disciplines. All scores were normalized to reduce scale variance across quarters and differences caused by non-random sampling of student-evaluator combinations. Preliminary ANOVA of Fisher z-transformed data revealed no differences across years or across the three rating categories. The dataset was combined across years.
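The two preprocessing steps named above can be illustrated with a minimal sketch. It assumes that "normalized" means conversion to z-scores within a quarter, which the article does not state explicitly; the function names are hypothetical and are not taken from the study.

```python
import math
from statistics import mean, pstdev


def standardize(scores):
    """Convert raw ratings to z-scores (mean 0, SD 1).

    Assumption: the article's "normalized" is read here as z-scoring
    within one quarter; the actual procedure is not specified.
    """
    mu = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        raise ValueError("ratings have no variance; cannot standardize")
    return [(s - mu) / sd for s in scores]


def fisher_z(r):
    """Fisher z-transform of a correlation coefficient: z = atanh(r)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return math.atanh(r)


def inverse_fisher_z(z):
    """Map a Fisher z value back to the correlation scale."""
    return math.tanh(z)
```

The Fisher transform is applied before an ANOVA on correlations because the sampling distribution of r is skewed near the ends of its range, while z is approximately normal.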
| Results |
|---|
By contrast, carryover correlations involving different sets of evaluators varied across quarters, but neither increased nor decreased. These two correlations are shown in Table 1 as PN (previous quarter ratings from faculty members who also rated in the current quarter versus evaluators new in the current quarter) and RN (ratings during the current quarter from repeating versus new faculty members). These patterns are graphed in Figure 2 for the case of ratings of discipline competency.
Two approaches were taken to characterizing the magnitude of the carryover of rating information. Partial correlation is a method of determining a correlation coefficient by "holding constant" a common, measured association with other variables.10 In this case the PR correlation for repeated raters across subsequent quarters was calculated partialing out the common associations for new raters, PR|N. This can be interpreted as a measure of carryover of rating information across quarters adjusting for information that only new raters were picking up.
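As a sketch of how PR|N could be computed, the standard first-order partial correlation formula takes the three zero-order correlations PR, PN, and RN as inputs. The article does not show its computation, so the function below and the example values are illustrative assumptions, not the study's code or data.

```python
import math


def partial_correlation(r_pr, r_pn, r_rn):
    """First-order partial correlation of P and R with N held constant.

    Standard formula:
        r_PR|N = (r_PR - r_PN * r_RN) / sqrt((1 - r_PN^2) * (1 - r_RN^2))
    """
    denom = math.sqrt((1.0 - r_pn ** 2) * (1.0 - r_rn ** 2))
    if denom == 0.0:
        raise ValueError("partial correlation undefined when |r_PN| or |r_RN| is 1")
    return (r_pr - r_pn * r_rn) / denom


# Hypothetical values chosen only to show the arithmetic.
example = partial_correlation(r_pr=0.60, r_pn=0.30, r_rn=0.40)
print(round(example, 3))  # prints 0.549
```

With these made-up inputs the partial correlation (about .55) is lower than the zero-order PR correlation (.60), because part of the agreement between previous and repeat ratings is shared with what the new raters also observed.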
Correlation for repeated ratings across quarters, partialing out information identified by new raters, is shown in Table 2 and graphed for ratings of clinical judgment in Figure 3. The data show the same general patterns of increasing r-values across the two years of clinical instruction that were apparent in the unadjusted correlations for repeated raters. The partial correlations are always lower than the zero-order correlations, reflecting the fact that repeating raters used both information carried over from the previous quarter and new information that was also identified by the new raters.
The third hypothesis addressed in this research concerns the quality of the ratings: are competency ratings based on multiple quarters of experience better predictors of graduation competency ratings than are ratings made by fresh sets of faculty members in each quarter? Table 3 shows two sets of correlations with the graduation quarter ratings: those for ratings made by faculty members who had rated students in previous quarters, and those for ratings made by faculty members who were rating students for the first time. In order to reduce bias in this comparison, the Q12 ratings were all made by faculty members who had not rated students previously.
| Discussion |
|---|
The data show that faculty members who are making repeated ratings use both new information and information carried over from previous quarters, and although the carryover information increases as part of the mix, it never achieves a dominant role. By contrast, a succession of new faculty member evaluators seems to rate independently of information available from previous quarters. There is no support in this study for the "hall talk" hypothesis that students' reputations precede them and influence their ratings. The study also shows that raters who have both current and carryover information are more accurate in their predictions of students' graduating quarter competence than are faculty member evaluators who have only current quarter contact. The data confirm all three hypotheses: 1) faculty members carry information over from period to period when evaluating student competence, 2) this carryover is cumulative, and 3) carryover information increases rating predictability.
This study found no clear evidence for a stereotyping effect in repeated ratings of clinical competence. Although faculty members carry information forward and deepen their impressions of student competence, there was no evidence that such carryover information systematically dampens or distorts the use of new information. The data show a convergence between repeated ratings and ratings from new faculty members that is attributable substantially to both sets of raters observing students' current performance. Further, faculty members making repeated ratings were more accurate in predicting the ultimate, graduation quarter competency of students as rated by new evaluators.
Ratings are generally preferred as an evaluation method where complex performance must be assessed in nonstandardized circumstances.12 Objectivity is sacrificed (professionally competent raters must be used), but reliability and validity are not.13 The literature is as filled with failed attempts to validate objective checklists as it is with other types of psychometric reports.14-16 Ratings have the advantage of allowing for adjustments for a range of circumstances and permitting raters to nuance the weighting of various performances. For example, a system of averaging objective single-encounter scores might overweight a trivial slip that was unanticipated and quickly recognized and corrected by the student. It might underweight a gross, negligent, and costly error. Rating systems permit evaluators to review the repertoire of experiences and judge their overall significance. Raters who have a richer range of exposures to students, including those extending back several quarters, could reasonably be expected to provide more accurate ratings, although the literature generally reports a tendency to give lenient (unjustifiably positive) evaluations in rating systems.17,18
It may be impossible to design a study that perfectly isolates and measures potential stereotyping effects. This study does not claim to have done so. Such a project would require measurements that incorporate all that faculty members know, but require that this knowledge be revealed all at the same time. Some counterhypotheses that might partially explain the results reported in this study include the following: 1) students perform more dentistry in later quarters, thus exposing more of their true competency for evaluation later in the program; 2) new faculty members may be inexperienced at ratings generally; and 3) students may benefit from feedback given by faculty in previous quarters and thus improve performance in line with repeating faculty members' expectations.