J Dent Educ. 73(3): 287-302 2009
© 2009 American Dental Education Association

Critical Issues in Dental Education

Issues in the Interpretation and Reporting of Surveys in Dental Education

David W. Chambers, Ed.M., M.B.A., Ph.D.; Frank W. Licari, D.D.S., M.P.H., M.B.A.

Key words: surveys, competency-based education, standard error, precision, accuracy, bias

Submitted for publication 08/01/08; accepted 12/09/08


   Abstract
Surveys are the most common form of data-based article published in the Journal of Dental Education. The apparent ease with which they are conducted and the seeming simplicity of reporting results mask significant issues in sample design and performing maximally useful analyses. Four concerns are discussed here. First, it is demonstrated that results are a function of who, when, and where responses are sampled, each source making independent contributions. Second, absolute sample size is shown to be the most significant factor affecting precision in surveys, and the numbers of schools, respondents, and other sources of variance can be chosen to minimize survey imprecision. Third, response rate typically has negligible effect on precision and an uncertain effect on accuracy (freedom from bias). A technique, sample saturation, is explained that can be used to protect, to some degree, surveys from the effects of bias. Finally, suggestions are offered for reporting survey results in a visually meaningful fashion, and an appeal is made that recommendations associated with surveys not be published unless they are grounded in both data and well-developed theory. This analysis references a previously published survey on competency-based dental education to illustrate methodological points in concrete terms.


Reports on surveys are among the most typical articles in the dental education literature. When well done, they provide a descriptive base to help understand the work we do, to compare individual programs with broader norms, and even to guide decisions. Compared to research that involves experimental intervention, they can be easily conducted, especially with modern computer technology; they can be performed quickly and are relatively non-invasive; and the basic statistics can be easily calculated on an Excel spreadsheet.

This apparent ease can obscure the complexities involved in conducting and reporting survey research of the best quality. Many surveys provide a thin base for developing theory and are low on the hierarchy of "best evidence" developed by the Center for Evidence-Based Medicine (www.cebm.org). Their straightforward structure can create a false impression of transparency. Sometimes survey data are translated into policy recommendations without full appreciation for the limitations posed by sampling problems or without adequate development of theoretical support for the recommendations.

There are useful resources to guide construction of surveys.15 These should be consulted for assistance on framing questions, pilot-testing, contacting representative respondents, and other methodological issues—problems that will not be discussed here. Rather, this article will focus on the difficult issues of analyzing and interpreting survey results. We believe that the use of surveys in dental education can be improved by greater understanding of the factors that influence variation in survey outcomes and by greater attention to the theoretical contexts in which survey results are interpreted. We also believe that competent statistical advice should be sought in both the design and analysis processes by any researchers who are uncertain about the issues raised in this article.


   Basics in Interpreting Survey Results
The results of surveys are reported using descriptive statistics such as means, proportions, and frequency distributions and their graphic counterparts such as column graphs and pie charts. Occasionally, inferential statistical tests are performed to contrast subpopulations. Even when the response rate is very high, the potential for variation embedded in survey responses can be large (including both random variation and bias). Virtually all survey research involves drawing general conclusions based on samples of the potential programs, respondents, and occasions. The reported results can differ based on which programs respond, who completes the survey, and even the timing of the survey. Slightly different wording of survey items has the potential to alter results. It is important to analyze the nature of such variation, both for the sake of fair reporting and in order to design effective future surveys.

An understanding of basic statistical concepts is necessary for both the design and interpretation of survey research that makes general claims. Simple reporting of complex phenomena provides no protection against the dangers of drawing distorted conclusions.

The term "variance" is used in this article in its neutral sense of observed differences, one survey response to another. Variance between program types—if it is in the direction anticipated by the authors of a survey—is usually regarded as variance that is a desirable measure of effect. Differences among surveys that are unanticipated constitute random variance or noise if they have no distorting effect on the hoped-for results. Variance that confounds the expected outcomes is called bias. The intentions of the authors and readers play a role in determining whether variance is regarded as a measure of effect, random error, or bias.

Evidence that is not grounded in theory is just data. There is a natural pull on the authors of surveys to interpret their findings as supporting policies or positions they favor. Making descriptive data do normative service is an acceptable practice only if a theoretical context is provided in which the findings are meaningful. Sometimes the context for interpreting survey findings is tacit and can only be inferred from the recommendations made by the authors at the end of the article. The education [6] and business [7] literatures favor (insist on) theory development as a precondition to data collection. They view data as a guide to creating robust, cumulative theory to advance their disciplines.

Cronbach [8] distinguished D-studies (e.g., admissions test scores or happiness surveys for courses), which are intended to support specific decisions in the schools where the study was conducted, from G-studies (e.g., published articles on admissions predictors across all schools), which provide data permitting generalization to other contexts and theory building. A survey may be a viable decision tool for guiding practice in a specific dental school but fail to contribute to understanding of the underlying phenomenon if conditions at the surveyed school differ from those at other schools. In our discussion of response rates and reporting of results, we will focus on considerations for improving generalizability studies.


   The Original Survey of Competency Practices in Dental Schools
This article is based on a survey regarding competency-based dental education that was published in the January 2008 issue of the Journal of Dental Education [9]. That survey was designed to explore both content issues and the methodological topics presented here. This article is not a comprehensive reanalysis of that research; rather, the original research provides concrete examples for the present comments on survey methodology.

The survey concerning competency-based dental education reported by the authors in January 2008 [9] was undertaken to determine how well competency-based education is understood in U.S. and Canadian dental schools, assess some attitudes toward competency approaches, and gain perspective about how this educational approach has been implemented. Academic and clinical deans and chairs of the departments of restorative dentistry and endodontics at all schools were surveyed. One hundred and fifty-one usable surveys were returned, for a response rate of 62 percent, representing 94 percent of schools. Twenty-six percent of respondents identified themselves as academic deans, 23 percent as clinic deans, 24 percent as chairs of restorative dentistry, and 19 percent as chairs of endodontics. Seven percent said they were "other," despite instructions on the survey specifically stating that only individuals in the four designated categories could participate. "Other" responses were not included in our analysis.

The survey contained thirteen items. There were some demographic questions (e.g., "How long have you been in dental education?") and structured questions about definitions of competency and foundation knowledge and skills, estimated understanding and value of competency approaches in various groups, the impact of competency on respondents’ programs, how competency was assessed, and how competency information was used. Some of the questions (e.g., "What is the definition of foundation knowledge?") were based on general knowledge. Others concerned logistical practices at each school (e.g., "What happens when students fail a clinical test of competency?"). A third category of questions involved an element of self-reporting: for example, "What value do you believe administrators or department chairs place on competencies?" There were also open-ended questions that elicited numerous comments. The types of information gathered from this survey can be discerned from reviewing Table 1.


Table 1. Standard errors of responses in a survey of associate deans and department chairs regarding definition of competencies, details of their implementation at various dental schools, and assessment of their impact
 
Based on the survey, it was found that competency-based education is a feature in virtually all dental school educational programs, but the manner in which it is implemented varies across schools and across types of respondents. Fifty-eight percent of respondents claimed that the use of competencies had stimulated useful curricular and clinical change, and 27 percent said it had improved the quality of their graduates. There were also reports of frustration with implementation (28 percent), cosmetic changes only (23 percent), and extra work (10 percent). Overall, we concluded that competency-based education is a learner-centered approach to dental education that has been widely accepted in name but implemented to varying degrees and in different ways across the United States and Canada. It is not a standardized and prescribed set of practices.


   Issues Concerning Sample Composition
Sometimes we are interested in describing or drawing conclusions about only the items we have surveyed (as in counting the votes for an election to determine who won); sometimes we are interested in drawing conclusions about a larger group of respondents based on information from a representative group of them (as in opinion polls about future elections). When we have information from everyone, we have measured the population; when we have representative but incomplete information, we have measured the sample. Because samples are incomplete, there is always some probability that the conclusions based on them will be mistaken when generalized to populations. It is prudent to respect the likely error in survey data, and much can be learned from analyzing the composition of potential error. We will discuss how to calculate and interpret the mysterious "sampling error of x percent" that is reported with public opinion polls.

General Versus Specific Reports of Variance
There are two rules to keep in mind when sampling: 1) the mix of characteristics among the respondents sampled determines the results; and 2) the larger the absolute size of the sample, the smaller the risk of being mistaken when making generalizations. Intuitively this is reasonable. It matters who completes the survey, and more information increases the confidence we can place in our conclusions. If, however, we use care in designing surveys and report the right kind of information about them, we can address these concerns in a scientific fashion. It follows as a combination of these rules that large sources of error can be "corrected for" by taking larger samples on the important dimensions of the sample.

Multiple sources of variance are always at play in surveys, whether these sources are isolated and analyzed separately or lumped together in an overall analysis. While it is unusual to compare a general and a differentiated analysis of the same data set, this process mimics standard practice in research, where promising initial research is further explored in detail. The article published in January 2008 is an example of undifferentiated analysis; the present article provides an analysis that separates out respondents and dental schools as distinct sources of variance. Table 2, a reproduction of Table 7 from the original publication, is a summary of numerous statistically significant differences in perceptions across various questions, without regard for component sources of variance. The problem is that while these findings were sound based on the assumptions in the first report, each is slightly (but statistically) misleading compared to the case in which the effects of respondent and school are considered. These differences were known at the time the January 2008 article was written, and none of the conclusions presented there are affected by them. Taking the respondents as a single undifferentiated group, it was found, for example, that accreditation site visitors and respondents to the survey understood the competency concept better than faculty members did, and these in turn were thought to be better informed than students or faculty members at other schools. As we will demonstrate here, making a distinction between associate deans and department chairs or across schools produces statistically significant generalizations that contradict those published in Table 7 of the original article. This is an illustration of the first rule mentioned above: the composition of the sample matters, often a lot.


Table 2. Summary of response groupings {each set representing statistically distinct perceptions among respondents} and of significant correlations
 
The statistical procedure known as generalizability analysis or variance analysis (not to be confused with analysis of variance) [8,10] can be applied to partition the total response variation according to its sources. For example, in the survey on competency-based dental education, we could assign variance as coming from schools (a measure of how far competency practices at an individual school might differ from the "average"), respondent (a measure of how competency practices vary from the perspective of an associate dean or department chair), and "error." Other potential sources affecting the responses (male or female, class sizes, time of day, early or late responder, etc.) were not specifically identified, so they are assumed to be an undifferentiated part of the error variance. Variance across questions was considered to be information and not error. Generalizability analysis is a reasonably sophisticated procedure, but one for which free software is available online (www.genova.org).
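The partitioning described above can be sketched in a few lines. The following is only an illustration, not the procedure or software used for the competency survey: it assumes a fully crossed layout with exactly one response for each combination of school and respondent role, and it uses the textbook expected-mean-squares estimators for that layout. The function name and the data layout are hypothetical.

```python
from statistics import mean


def variance_components(scores):
    """Estimate variance components for a schools-by-roles table.

    scores[i][j] is the single response from school i and respondent
    role j (hypothetical layout: fully crossed, one value per cell).
    Returns estimated components for school, role, and residual error.
    """
    n_s = len(scores)        # number of schools
    n_r = len(scores[0])     # number of respondent roles
    grand = mean(x for row in scores for x in row)
    school_means = [mean(row) for row in scores]
    role_means = [mean(scores[i][j] for i in range(n_s)) for j in range(n_r)]

    # Sums of squares for each source
    ss_school = n_r * sum((m - grand) ** 2 for m in school_means)
    ss_role = n_s * sum((m - grand) ** 2 for m in role_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_resid = ss_total - ss_school - ss_role

    # Mean squares
    ms_school = ss_school / (n_s - 1)
    ms_role = ss_role / (n_r - 1)
    ms_resid = ss_resid / ((n_s - 1) * (n_r - 1))

    # Expected-mean-squares estimators; negative estimates are set to zero
    return {
        "school": max((ms_school - ms_resid) / n_r, 0.0),
        "role": max((ms_role - ms_resid) / n_s, 0.0),
        "error": ms_resid,
    }


def shares(components):
    """Express each component as a proportion of the total variance."""
    total = sum(components.values())
    return {k: (v / total if total else 0.0) for k, v in components.items()}
```

Calling `shares(variance_components(table))` on such a table yields proportions of the kind reported in this article (for example, the share of variation attributable to schools). Real survey data are rarely this tidy; unbalanced or nested designs call for dedicated software and statistical advice.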

Sources of Variance in the Competency Survey
Taken across all questions, the proportion of variance attributable to differences across schools was about 30 percent of the total variation among the responses. Almost 18 percent of variation among the answers was attributable to whether the survey was completed by an associate dean or a department chair. Slightly more than half of the variation was idiosyncratic. A detailed partitioning of the variation in responses by survey item appears in Table 1.

Knowing, even in an approximate fashion, which sources contribute most heavily to observed differences in outcomes can make a significant difference, both in interpreting survey results and in designing good studies. For example, readers of the original article that reported these findings may have assumed that the responses were representative of the general opinions in U.S. and Canadian dental schools. That would be a risky conclusion because general faculty and students were not sampled. Further, there were differences of opinion between associate deans and the two types of department chairs surveyed. Respondents felt that their own level of understanding and appreciation of the use of competencies were generally greater than was the level of understanding of others such as faculty members and students. Other readers may have assumed that outcomes were a function of differences across schools. While the data confirmed that there were systematic school effects, these accounted for no more than a third of the variance in results.

The importance of various perspectives may depend on the nature of the questions asked in surveys. Merely as an illustration, the sixty-three survey items listed in Table 1 have been divided into three categories. Those of a general knowledge nature, designated G in Table 1, involve information that a respondent could answer based on reading, meetings, conversations with colleagues, or individual reflection. The definition of "competency" is an example. Seventy percent of the variance among responses on these items is classified as "error," meaning it is not attributable to the particular school environment or to the respondent’s role as an administrator or chair, but to factors such as individual awareness, specific experiences, or personal opinion. By contrast, items marked with a P in Table 1 are opinion items where the role of the respondent in the dental school might reasonably matter. An example would be the item having to do with the degree of understanding and amount of importance administrators or chairs place on competencies. Respondents rated themselves on these items. As might be expected, a third of the variance was attributable to the role of respondents for such questions. The third category included items regarding the logistics of implementation of competency-based education at each school, marked L in Table 1. Here, roughly 38 percent of the variance was associated with individual schools.

It is uncommon to report data showing the distribution of responses across schools. This may be due to viewing schools as contributing to error variance in the same way that subjects in experiments are usually thought of as being noise. However, when there is interest in the extent of adoption of a practice or program characteristics, differences among schools can provide useful information, just as scores for individual candidates on the Dental Admission Test (DAT) may be useful information. Figure 1 is a column graph displaying the range of responses to this question: "What is the weight, considering all criteria for promotion or graduation, that is given to competency test cases?" The number reported for each school is an average across respondents from that school. The grand mean across schools is 37 percent, making it the most commonly reported graduation criterion identified in the survey. But the range is from no use whatsoever to its use to the exclusion of all other criteria. From the perspective of students, it matters greatly which school they attend. From the policy perspective, it is apparent that the implementation of competency testing is highly school-specific and anything but completely adopted. On average, it matters little whether the question is asked of associate deans or department chairs, but there can be great variation within schools. Ranges on the question regarding weight given to competency tests were as great as 5 percent to 80 percent within some schools depending on who responds. Again from the students’ perspective, it matters a great deal whom they ask about graduation requirements in their schools. Insight into sources of variance can be detected by the trained eye from data such as presented in Table 1, but graphs such as Figure 1 are also useful.


 
Figure 1. Frequency distribution of weight given competency test cases in graduation decisions across schools

 
Reporting and Managing Multiple Sources of Variance
It is valuable to report information on sources of variance when publishing survey results. This would be helpful for two reasons. First, such reporting would provide a rich context for interpreting findings and increase the reader’s confidence in the data. Second, knowing, even roughly, the sources of variance for various types of questions would permit intelligent design of future surveys. If schools make only minimal contributions to outcomes, a large number of schools is not required in the sample. If the position a respondent has in a school matters (e.g., faculty member, department chair, administrator), the range and number of respondents should be increased.

Failure to understand sample design is a fatal flaw in one-shot initial licensure examinations. Research by the first author found that examiner calibration cannot be appreciably improved or corrected for by adding more examiners since interrater variability contributes almost nothing to the decision about who is licensed [11]. By contrast, differences in patient selection and idiosyncrasies of specific testing sessions matter dramatically and can only be corrected for by multiple testing. This is consistent with results in the testing literature generally. Chambers and Loos have estimated that, for the task of fixed prosthodontics, an increase of several hundred examiners would be necessary to equal the improvement in precision of evaluation achieved by adding one more test case [12]. The logic of this type of analysis can be applied to laboratory and clinical evaluation in dental schools as well as to survey design.


   Issues Concerning Sample Size
Although average scores and standard deviations are independent of sample size, the precision of any claim based on a survey is strongly affected by sample size. Larger samples increase the confidence in what can be said about the results. Precision is a matter of how much hedging is required for a claim. If the news reports that 60 percent of Americans believe the economy is the greatest issue facing the country, with an error margin of 3 points, that means that the standard error (the common measure of precision) is plus or minus 3 percentage points, and were the study to be repeated many times, two out of three of these repeat surveys would have an average value between 57 and 63 percent. Confidence intervals are standard errors where we have zoomed in or out, making the range of precision larger or smaller by some arbitrary amount. The commonly reported 95 percent confidence interval (CI95) is the standard error multiplied by 1.96. The CI95 identifies a range of "possible imprecision" that is just under twice as wide as the standard error. Smaller standard errors are preferred when making claims.
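The polling example can be worked through numerically. This is a minimal sketch that follows the paragraph's reading of the 3 points as a standard error; the function name is ours and purely illustrative.

```python
def interval(estimate, standard_error, multiplier=1.0):
    """Return (low, high) for estimate +/- multiplier * standard_error."""
    half_width = multiplier * standard_error
    return (estimate - half_width, estimate + half_width)


# One standard error either side of 60 percent: the range expected to
# contain about two out of three repeat surveys.
one_se_range = interval(60.0, 3.0)

# The 95 percent confidence interval widens the same range by 1.96.
ci95_range = interval(60.0, 3.0, multiplier=1.96)

print(one_se_range)
print(ci95_range)
```

The first range is 57 to 63 percent, as in the text; the second runs from roughly 54 to 66 percent, which shows how much more hedging a 95 percent claim requires for the same data.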

Precision in the Competency Survey
The 95 percent confidence intervals for the items in the survey of competency-based dental education are shown in Table 1. The precision of responses is plus or minus 5 percent for most items. If we made the claim that students at three-quarters of dental schools could not graduate without passing competency examinations, we would not be faulted: the reported average is 76 percent, which is within the CI95 range. But we would be stretching it to claim seven in ten. By contrast, claimed number of cases completed by students serving as a barrier to graduation could be between 20 and 40 percent because of the large CI95 associated with this item. The large variation around the average regarding this standard is both a reflection of different policies at each school and of different interpretations of these policies by administrators and chairs, as shown in Table 1.

It is useful to inspect the sizes of standard errors because they reflect, among other things, the ambiguity in the survey questions. Large standard errors signal that the question strikes respondents in different ways. Note in Table 1 that 83 percent of the variance in responses regarding the role of test cases was the result of differences between associate deans and department chairs, who seem to have a different sense about what test cases mean. Although this term may have a specific meaning for the authors of the survey, the large standard error signals that there is difference across dental schools in the way this term is understood.

The CI95 estimates reported in Table 1 were calculated using generalizability analysis, which involves separate estimates for variance attributable to schools and respondents. Reporting separate estimates of precision can be helpful to readers because it allows isolation of factors that are confounded when overall survey results are reported. Consider the survey item "We evaluate fine; the problem is willingness to make hard decisions about incompetent students." Taking the responses as coming from an undifferentiated pool of respondents, the CI95 ranged from 33 to 56 percent agreement, a wide spread. But we know that 81 percent of that standard error came from respondents. If that error were eliminated by taking an extremely large sample of associate deans and chairs in order to wash out that effect, the standard error would be only 2 percent [10.7*(1–.81)], or the much smaller range of 43 to 47 percent. This calculation supports the claim that "cold feet on evaluation" is a common problem in dental schools, but it affects different individuals within schools in very different ways.
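The arithmetic in this example can be checked directly. The sketch below simply reproduces the calculation as stated in the paragraph; the center of 45 percent is an assumption inferred from the reported 43 to 47 percent range, not a value given in the text.

```python
pooled_error = 10.7        # pooled error for the item, in percentage points
respondent_share = 0.81    # proportion of that error traced to respondents

# Error remaining if the respondent contribution were washed out
remaining_error = pooled_error * (1 - respondent_share)

# Assumed center of the interval (midpoint of the reported 43-47 range)
center = 45.0
narrowed_range = (center - remaining_error, center + remaining_error)

print(round(remaining_error, 1))
print(tuple(round(x) for x in narrowed_range))
```

The remaining error comes to about 2 percentage points, and the rounded range is 43 to 47 percent, matching the figures quoted above.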

An example in which schools matter much more than do the individual respondents within them is the item "Are there a minimum number of required procedures for determining course grades?" The pooled CI95 is between 58 and 70 percent. The largest part of this variance, 57.9 to 72.5, came from schools and error (once the effect of respondents had been accounted for). The CI95 for respondents and error (removing schools) is noticeably narrower, roughly 60 percent as wide: 59.5 to 68.5.

Relationship Between Sample Size and Standard Error
There are easy formulas for calculating standard errors when data are treated as though they come from a homogeneous sample. They can be found in most elementary statistical texts. When considering numerical scores, such as the reported average number of years teaching or the common conversion of Likert scales to numerical values, the standard error is the standard deviation divided by the square root of the sample size [SD/SQRT(n)]. For proportions, such as the percentage of respondents agreeing with a statement, the standard error is the square root of the ratio of the proportion multiplied by 1 minus the proportion divided by the sample size [SQRT((p*(1–p))/n)]. The standard error for correlation coefficients is 1 divided by the square root of the sample size less 1 [1/SQRT(n–1)]. It is apparent in each of the formulas that larger sample sizes enhance precision of the estimate. It is also apparent that there are diminishing returns from increasing sample size. This is a function of the square root appearing in each formula. These effects are displayed graphically in Figures 2a through 2c.
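The three formulas translate directly into code. The following sketch uses function names of our own choosing and the homogeneous-sample assumption stated above; it also prints a few values to make the diminishing returns visible.

```python
from math import sqrt


def se_mean(sd, n):
    """Standard error of an average: SD / SQRT(n)."""
    return sd / sqrt(n)


def se_proportion(p, n):
    """Standard error of a proportion: SQRT(p * (1 - p) / n)."""
    return sqrt(p * (1 - p) / n)


def se_correlation(n):
    """Standard error of a correlation coefficient: 1 / SQRT(n - 1)."""
    return 1 / sqrt(n - 1)


if __name__ == "__main__":
    # Each fourfold increase in sample size only halves the standard
    # error of an average or a proportion.
    for n in (25, 100, 400):
        print(n,
              round(se_mean(0.5, n), 3),
              round(se_proportion(0.5, n), 3),
              round(se_correlation(n), 3))
```

Because of the square root, moving from 25 to 100 responses buys as much precision as moving from 100 to 400, although the second step costs three hundred additional responses rather than seventy-five.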


 
Figure 2a. Standard error of correlation coefficient as a function of sample size

 

 
Figure 2b. Standard error for proportions as a function of sample size

 

 
Figure 2c. Standard error for ratios of standard error/average as a function of sample size

 
With the exception of correlation coefficients, standard errors are also influenced by the range of measured scores. If the range of possible scores is "crowded," the standard error will be smaller, regardless of the sample size. For example, standard error is greatest in the mid-range of proportions (near the 50:50 split) and smallest for extreme values such as p=.01 or p=.99. Similarly, as the standard deviation decreases relative to the mean in scores reported as averages, estimates have greater precision. It is intuitive that estimates are more wobbly when there is more room for them to vary. These relationships are also shown graphically in Figures 2a through 2c.

Reporting and Managing the Precision of Survey Results
In the cases of numerical values, proportions, and correlation coefficients discussed so far, it has been assumed that the composition of the responses does not matter: ten individuals from the same school are equivalent to one response from each of ten schools, for example. In many cases, this is a reasonable assumption. But sometimes it is assumed on good grounds that the source of the data matters, even to the extent that data from different sources will have different standard errors. When that is the case, a more powerful and computationally sophisticated procedure is required [8]. Generalizability analysis provides two advances over the simple procedures mentioned above. First, generalizability analysis permits separate estimates of the variance coming from each source. Second, generalizability analysis permits estimates of standard errors for various combinations of sample sizes, which can be helpful in designing surveys.
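The second advantage, projecting standard errors for different combinations of sample sizes, can be illustrated with a simple sketch. It assumes a fully crossed design in which every sampled school contributes the same number of respondents, and it treats the overall variance shares reported earlier for the competency survey (about 30 percent schools, 18 percent respondents, and the remaining 52 percent error) as components of a total variance of 1. The function name is hypothetical, and this is the textbook projection formula for such a design rather than the analysis actually run for Table 1.

```python
from math import sqrt

# Variance shares taken from the competency survey, scaled to sum to 1
VAR_SCHOOL = 0.30
VAR_RESPONDENT = 0.18
VAR_ERROR = 0.52


def projected_se(n_schools, n_respondents):
    """Relative standard error of the overall mean for a crossed design.

    Each source of variance is divided by the number of times it is
    sampled; the error term is divided by the total number of responses.
    """
    error_variance = (VAR_SCHOOL / n_schools
                      + VAR_RESPONDENT / n_respondents
                      + VAR_ERROR / (n_schools * n_respondents))
    return sqrt(error_variance)


if __name__ == "__main__":
    # Same total of 40 responses, arranged two different ways
    print(round(projected_se(20, 2), 3))  # more schools
    print(round(projected_se(10, 4), 3))  # more respondents per school
```

Under these assumed values, spending the same forty responses on more respondents per school gives a smaller projected standard error than spreading them across more schools, because respondents are sampled so thinly in the first arrangement. Different variance shares would reverse the conclusion, which is why knowing the sources of variance matters at the design stage.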

Survey designers can improve their chances of getting useful results when they have some prior knowledge about the phenomena they are measuring. A smaller sample size is needed, for example, when measuring something about which there is reason to believe that there is strong agreement. Smaller samples also suffice when the survey items are clearly worded. Small samples are generally acceptable when making imprecise distinctions. If decisions turn on differences of a few percentage points, many surveys are required; if meaningful differences must be 10, 15, or more percent before a firm commitment to action is triggered, much smaller sample sizes can be used. Figure 2 can be read backwards to estimate the sample size needed for any contemplated decision. Of course, if survey designers hold that they have no preconceived notion of what differences are important to them (they just want to know "what is really the case"), then any sample size will do, and none is the "right" size.
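Reading the relationship "backwards" amounts to solving the standard error formula for the sample size. The sketch below does this for proportions only, under the homogeneous-sample assumption; the function name is ours, and values are expressed in percentage points for convenience.

```python
from math import ceil


def sample_size_for_proportion(p_percent, target_se_points):
    """Smallest n giving the target standard error for a proportion.

    Rearranges SE = SQRT(p * (1 - p) / n) to n = p * (1 - p) / SE**2,
    with p and SE both expressed in percentage points.
    """
    return ceil(p_percent * (100 - p_percent) / target_se_points ** 2)


if __name__ == "__main__":
    # Worst case for precision is an even split (p = 50 percent)
    for target in (2.5, 5, 10, 15):
        print(target, sample_size_for_proportion(50, target))
```

For an even split, a standard error of 2.5 points requires 400 responses, while a standard error of 10 points requires only 25, which is the sense in which coarse distinctions tolerate small samples.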

It matters both how many individuals respond to a survey and who they are. Having a reasoned belief about who will respond and how precise the answers must be to support intended action helps design a survey that is large enough. Anticipated large standard deviations can be corrected with correspondingly large sample sizes on the shaky dimensions. Sometimes a design with few schools and many respondents within the school is appropriate; sometimes a different combination of respondents is preferred. The same logic applies to the balance between number of questions on a topic compared with number of topics covered on written examinations or number of test cases and number of evaluators per test case in competency testing.
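The trade-off between number of schools and number of respondents per school can be illustrated with a minimal sketch. It assumes the simplest nested design (respondents within schools), for which the variance of the overall mean is the school variance divided by the number of schools plus the respondent variance divided by the total number of respondents. The variance components below are hypothetical and are not estimates from the survey of competency-based education.

```python
import math


def nested_standard_error(var_school: float, var_respondent: float,
                          n_schools: int, n_per_school: int) -> float:
    """Standard error of the overall mean, respondents nested within schools."""
    if n_schools < 1 or n_per_school < 1:
        raise ValueError("sample sizes must be at least 1")
    variance_of_mean = (var_school / n_schools
                        + var_respondent / (n_schools * n_per_school))
    return math.sqrt(variance_of_mean)


# Hypothetical variance components (assumptions for illustration only).
VAR_SCHOOL = 0.10
VAR_RESPONDENT = 0.25

# Three designs with the same total of 60 responses.
for n_schools, n_per_school in [(6, 10), (20, 3), (60, 1)]:
    se = nested_standard_error(VAR_SCHOOL, VAR_RESPONDENT, n_schools, n_per_school)
    print(f"{n_schools} schools x {n_per_school} respondents: SE = {se:.3f}")
```

Under these assumed components, spreading the same 60 responses over more schools lowers the standard error. If the school component were near zero, the three designs would be nearly equivalent, which is the case in which ten respondents from one school can stand in for one respondent from each of ten schools.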


   Issues Concerning Response Rate
It has been shown that the precision of the values estimated from surveys is strongly influenced by the composition of the sample (who responds) and by the number of surveys (how many respond). These factors interact with each other, and both can be controlled to some extent by researchers. In contrast, precision of estimates from surveys is not influenced at all by response rate. Because response rate is always reported in dental education publications and is often accompanied by evaluative terms such as "good" or "excellent," we will have to explore the very limited way in which it matters.

The reader can satisfy himself or herself on this point by considering a Likert item with a standard deviation of 0.5. A 40 percent response rate from a survey sample of 125 potential respondents has a smaller (better) standard error and CI95 than the same item from a potential sample of 50 respondents with a 90 percent response rate.
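The arithmetic can be checked directly. The sketch below assumes the standard error of a mean is SD / sqrt(n) and that CI95 is the mean plus or minus 1.96 standard errors; the function name is ours.

```python
import math


def precision(sd: float, n_invited: int, response_rate: float):
    """Return (responses returned, standard error, CI95 half-width)."""
    n_returned = round(n_invited * response_rate)
    se = sd / math.sqrt(n_returned)
    return n_returned, se, 1.96 * se


# The two scenarios described in the text, both with SD = 0.5.
for n_invited, rate in [(125, 0.40), (50, 0.90)]:
    n_returned, se, half_width = precision(0.5, n_invited, rate)
    print(f"{n_invited} invited at {rate:.0%}: n = {n_returned}, "
          f"SE = {se:.4f}, CI95 = +/-{half_width:.4f}")
```

The 40 percent survey returns 50 responses and the 90 percent survey returns 45, so the survey with the lower response rate has the smaller standard error. Only the number returned enters the calculation.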

Response Rate as a Questionable Protection Against Bias
But surely, response rate must stand for something. Conversations with dental educators who conduct surveys often come around to opinions such as "the results are more trustworthy" or "there is less chance for bias" when the response rate is high. This introduces an important new characteristic of survey findings: accuracy. Precision is freedom from random error; accuracy is freedom from bias. Bias is variance that is systematically related to the data collection process. For example, if we had only surveyed dental educators who had published articles on competency-based education for our survey reported earlier in the Journal of Dental Education, our results would have been different from what we found and certainly not representative of the opinions of dental educators generally. Just as response rate has no effect on the precision of estimates from surveys, sample size has no effect on accuracy. Larger samples cannot erase the effects of bias because the relationship is built into the measurement process. Asking five liars for their opinion is no better than asking three—except perhaps in tipping us off that something is strange.

Because bias is a characteristic of data collection, there is no way to detect bias by looking at survey results themselves. Investigations of bias always require a comparison between the data in hand and some external criterion. Response rate (the plain fact that 80 percent responded and 20 percent did not) does not really provide a good criterion, unless there is a sound, independent reason to presume that those who did not respond differ in some material fashion from those who did.

Ways to Address Possible Survey Bias
Is there any way to test the presumption that non-respondents differ from respondents? We will consider three possible ways of addressing this question.

Indexes.
A common approach to probing for bias in surveys is to use an index variable. Imagine that the researcher conducted a survey of dental students’ study habits for National Board preparation. Eighty students returned the survey and signed their names; twenty students did not. There is a concern that students who were ashamed of either their Board scores or their study habits would be less likely to complete the survey, thus calling into question whether the survey is representative of students generally. This hunch can be checked by comparing the average Board scores or GPAs of the respondents with those of the students who did not respond (the signatures on the survey forms make this possible). The actual Board scores are called an index because there is justifiable reason to believe such scores are related to both signing the survey and having good study habits. Sometimes indexes can be naturally occurring, such as age, public or private school status, or region of the country.

When indexing is negative (when there is no difference between respondents and nonrespondents on external factors thought to be mutually important), researchers can feel more confident that no bias is present than when indexing is positive. Of course, an unremarkable index result does not prove lack of bias; it only shows that no suspicious relationships were found in the places searched.
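An index check of this kind amounts to comparing two group means on the external variable. The sketch below uses Welch's t statistic as one reasonable way to make the comparison; the choice of statistic, the rough screening threshold of 2, and the Board scores are all assumptions made for illustration and are not data from any actual survey.

```python
import math
import statistics


def welch_t(group_a, group_b) -> float:
    """Welch's t statistic for the difference between two group means."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    se_diff = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / se_diff


# Hypothetical Board scores (invented for illustration).
respondents = [88, 91, 85, 90, 87, 92, 86, 89]
nonrespondents = [84, 80, 86, 79, 83]

t = welch_t(respondents, nonrespondents)
# Rough screen only: |t| well above 2 suggests the index is "positive".
print(f"t = {t:.2f}; index positive: {abs(t) > 2}")
```

A small value of t is only an unremarkable index result in the sense described above; it does not establish that the two groups would have answered the survey in the same way.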

When a positive index is discovered, it raises concerns over what should be done with the data. Researchers would be understandably reluctant to throw out the survey; readers would be understandably reluctant to give the reports full credit knowing about the documented potential for bias. Regression techniques such as partial correlation or analysis of covariance (8) can be used to make corrections, but these are technically difficult and may engender further suspicions among readers.

Indexing also introduces its own problem: it may reduce sample size and actually create bias. Potential respondents may be reluctant to reveal personal information (name, age, address), and the extent of reluctance to respond may be exactly the source of bias that is suppressed by attempting to measure it.

Late Responders.
A second approach to managing bias in surveys is to control or measure the order of responses received. Frequently, multiple mailings or multiple reminders are part of the survey data collection process. Date of return or whether surveys were received in response to the first, second, or subsequent mailings can be coded as a variable. This information can be analyzed just as indexes are in order to see whether promptness in responding might be associated with differences in responses. If so, reluctance to respond might be associated with some factor that distorts the data. A variation on this approach is to target a pool of nonresponders and use intensive techniques, such as phone interviews or personal contacts, to obtain 100 percent responses from the late responder group. If survey responses show no differences between the regular and the special responder groups, the researcher can be more confident that conspicuous bias does not trouble the study.

Using a "timing of response" strategy to monitor for potential bias is easy, but it is unlikely to detect measurement bias, except for surveys that ask intrusive questions. Both timing and indexing share the problems that failure to detect bias in the places looked does not mean that bias is not hiding someplace else, and if it is discovered, the way to correct it is not always clear.

Sample Saturation.
A third technique that can be used to combat bias in survey research is called sample saturation. (A more apt term might be defeasance, which the dictionary defines as "beyond a shadow of doubt," as used in legal proceedings.) This is a powerful approach that depends on both the sample size and the response rate. The logic involves assuming that all missing surveys would have been damaging to the conclusions of the study. If the sample is large, the effect measured is large, and the proportion of missing surveys is small, it may be the case that even contrary potential results would not be sufficient to change the claim based on data in hand. The claim could be said to be nonresponse "bias-proof" or defeasant.

For example, in the survey of competency-based dental education there was a high correlation between the perception that competency-based education is understood by administrators and that it is valued by them (r=.684). The standard statistical test shows that this association is greater than r=.000 at p<.001. Using a copy of the dataset, we forced values for all missing data, ensuring no correlation existed in those cases. The resulting new correlation coefficient was still significant. In fact, for a survey with this configuration, any correlation coefficient above r=.160 would be significant at p<.05, and any coefficient above r=.200 would be immune from bias due to missing data.
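The logic of this check can be sketched in a few lines. The sketch makes two simplifying assumptions that are ours rather than part of the original analysis: the forced cases are taken to share the respondents' means and standard deviations while carrying zero correlation, so the pooled coefficient shrinks in proportion to the share of real responses; and significance is judged with the Fisher z approximation rather than an exact test. The counts of returned and missing surveys are hypothetical.

```python
import math


def saturated_r(r_observed: float, n_returned: int, n_missing: int) -> float:
    """Pooled correlation after adding zero-correlation cases for nonrespondents.

    Assumes the forced cases have the same means and standard deviations
    as the respondents, so only the covariance is diluted.
    """
    return r_observed * n_returned / (n_returned + n_missing)


def critical_r(n_total: int, z_crit: float = 1.96) -> float:
    """Smallest |r| significant at two-sided p<.05 (Fisher z approximation)."""
    return math.tanh(z_crit / math.sqrt(n_total - 3))


def survives_saturation(r_observed: float, n_returned: int, n_missing: int) -> bool:
    """True if the correlation stays significant after forcing missing cases."""
    n_total = n_returned + n_missing
    return abs(saturated_r(r_observed, n_returned, n_missing)) > critical_r(n_total)


# Hypothetical counts: 120 surveys returned, 30 missing.
for r in (0.684, 0.15):
    print(f"r = {r}: saturated r = {saturated_r(r, 120, 30):.3f}, "
          f"survives = {survives_saturation(r, 120, 30)}")
```

A strong observed correlation survives the forced dilution, while a weak one does not, which is the sense in which a claim can be called immune to nonresponse bias.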

The same procedure can be applied to data that are reported as proportions or averages. The technique can be applied to differences between subgroups (associate deans or department chairs) or to differences between items (the correct order of results attributed to adopting competency-based dental education). In the case of subgroup differences, for example, missing associate deans’ scores would be force coded with the average value for department chairs, and missing department chairs’ surveys would be force coded with the average value for associate deans. In the case of related items reported in order, missing values for the higher item would be coded with the average for the adjacent lower item and vice versa.
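For subgroup differences, the force coding described above can be sketched as follows. The ratings and the numbers of missing surveys are hypothetical, and the function name is ours. A complete analysis would go on to retest the difference on the force-coded data; the sketch stops at the adjusted means.

```python
import statistics


def force_coded_means(group_a, missing_a: int, group_b, missing_b: int):
    """Fill each group's missing surveys with the other group's mean.

    Returns the two group means after force coding.
    """
    mean_a = statistics.mean(group_a)
    mean_b = statistics.mean(group_b)
    filled_a = list(group_a) + [mean_b] * missing_a
    filled_b = list(group_b) + [mean_a] * missing_b
    return statistics.mean(filled_a), statistics.mean(filled_b)


# Hypothetical Likert ratings (invented for illustration).
associate_deans = [4, 5, 4, 4, 5, 4]           # 2 surveys missing
department_chairs = [3, 2, 3, 3, 2, 3, 3, 2]   # 4 surveys missing

observed_gap = statistics.mean(associate_deans) - statistics.mean(department_chairs)
adj_deans, adj_chairs = force_coded_means(associate_deans, 2, department_chairs, 4)
print(f"observed gap = {observed_gap:.3f}, "
      f"force-coded gap = {adj_deans - adj_chairs:.3f}")
```

Force coding pulls the two means toward each other. If the remaining gap is still large enough to support the claim, the claim does not depend on what the nonrespondents might have said.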

A claim that is not statistically significant cannot be immune to reversal based on bias. Those who hear of sample saturation for the first time find its requirements for assuming the most unfavorable results possible among the nonrespondents to be extreme. "Why not," they say, "just assume that the nonrespondents will give pretty much the same sort of answers that the respondents did?" Making such an assumption is exactly the same as assuming that there is no possibility of bias. We are back to questions about precision, not accuracy, at this point, and sample size (not response rate) is all that needs to be considered. Others object to the sample saturation technique on the grounds that it is difficult to specify in advance just how large a difference in outcome would really matter. Is half a point on a Likert scale important, or must it be three-quarters of a point? If issues regarding the size of effect that really matters in practical terms cannot be answered, if the researcher’s entire purpose in conducting a survey is descriptive, bias ceases to be a concern. The description of the survey methods should include the possibility of bias as part of the report, and nothing more is necessary.

Managing Nonresponse Bias in Surveys
The preferred method for addressing bias is pilot-testing the survey instrument. Such screening of the items is best done orally and face-to-face with diverse groups of potential respondents, and each question should be discussed. Potential respondents should be invited to comment on what the item means to them and whether there are any alternative meanings. They should be asked to provide concrete examples that illustrate their interpretations of the items.

Indexing and analysis of response lag should be performed and reported when feasible. Sample saturation statistics might be calculated and reported if there is a concern about response rates. Fortunately, these calculations can also be performed by journal editors and referees and by readers who desire more confidence in the survey results than the authors provide.

There is one more point about response rate that needs to be clarified. Response rate is a ratio of the returned surveys to some conception of total potential sample of meaningful respondents. Usually, the sample value is taken to be the number of surveys sent out. But that is only one arbitrary possible value. If the six dental deans in California are phoned for an interview and all answer, that is a 100 percent response rate for the survey, but it is only about 10 percent of the U.S. and Canadian deans. In our survey on competency-based dental education, we had approximately three times as many respondents as there are schools; but we sampled only something like 20 percent of associate deans and department chairs. There is no reason to believe that the maximum survey size equals the number of dental schools.

An often overlooked dimension of sampling is timing. There are a few examples of dental educational surveys that have been repeated, usually at ten-year or other intervals. We know of no surveys that have been administered to the same dental educators within a short period of time (say a few weeks) to determine how consistent individual respondents are. Temporal events may also have effects: the academic calendar (near graduation, for example), attendance at American Dental Education Association (ADEA) or specialty meetings, state budget decisions, or the publication of major articles. Potentially, meaningful sampling frames can be much larger than the number of schools. Response rate is an arbitrary number unless it can be justified based on intended generalization of the findings.


   Issues Concerning Reporting Results
Publication of survey results is justifiable when such descriptions hold fair promise of helping others make decisions. When surveys are used to make decisions about local programs, what should be reported are the outcomes of such program changes that have actually been implemented—but only when it is reasonable to expect that other schools could benefit from similar experiences. Only G-study results that can be generalized across interested parties should be reported in the literature. Reporting must be as detailed, complete, and free from bias in the selection and presentation of findings as possible. Because authors of published surveys do not know what decisions others face, the reporting should be "neutral."

Descriptive Reporting of Survey Results
Research that tests hypotheses can be focused on the decisions researchers wish to make. Descriptive research, such as most surveys, will, by contrast, be long on presentation and very short on interpretation. Where subgroups exist among the respondents, it is useful to report subgroup data as well as pooled scores. Where Likert items are used, frequencies for each response, rather than the overall average based on a numerical conversion, are best, as done in the January 2008 article on competency-based education. Standard errors or CI95 values should be reported, as done in Table 1 of this article. When the standard error is reported, interested readers can perform their own calculations. It would be a useful policy to report standard errors, using the formulas presented above, for all descriptive studies. (By contrast, standard deviations should be reported in articles involving inferential tests of hypotheses.)

Standard errors or CI95 values can also be displayed graphically, as shown in Table 3, to provide a quick visual reference. The advantage of a presentation such as Table 3 is that it permits immediate visual recognition of differences, both across items and across respondent subgroups. Because an average score that is not included within the CI95 range of another variable or group is statistically significantly different at p<.05, quick inspection of such graphs can be more informative than conventional, all-numerical tables. For example, considering the questions regarding the impact of using competency-based education, the great preponderance of positive responses over negative ones can be seen by scanning vertically across the graph. The huge difference between chairs and administrators regarding the use of competency evaluation data for academic status decisions is clear. (It has been our experience that faculty members are heard to lament that they identify the poor students but the administration fails to act; administrators lament that faculty members give them evaluation information that is not legally defensible.) The difference between administrators and chairs on the role of competencies in accreditation is just exactly significant at p=.05.
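The visual rule just described (a mean falling outside another group's CI95) can also be applied numerically. The sketch below assumes CI95 is the mean plus or minus 1.96 standard errors; the ratings are hypothetical and the function names are ours.

```python
import math
import statistics


def ci95(scores):
    """Return (lower bound, mean, upper bound) of the 95 percent CI."""
    mean = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - 1.96 * se, mean, mean + 1.96 * se


def mean_outside_ci(scores_a, scores_b) -> bool:
    """True if the mean of group A lies outside the CI95 of group B."""
    low_b, _, high_b = ci95(scores_b)
    mean_a = statistics.mean(scores_a)
    return mean_a < low_b or mean_a > high_b


# Hypothetical Likert ratings on one item (invented for illustration).
administrators = [4, 5, 4, 4, 5, 4]
chairs = [3, 2, 3, 3, 2, 3, 3, 2]

low, mean, high = ci95(chairs)
print(f"chairs: mean = {mean:.2f}, CI95 = [{low:.2f}, {high:.2f}]")
print("administrators differ from chairs:", mean_outside_ci(administrators, chairs))
```

The same check can be run across items within one respondent group, which corresponds to reading the table across questions rather than across respondents.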


Table 3. Display of response means and confidence intervals permitting visual comparisons across questions and across respondents
 
Visual comparisons can also be made across questions. For example, in Table 3, "Improved quality of graduates" received statistically higher scores than the "Extra work" item when reporting the effects of adopting a competency approach to teaching (nonoverlap between average score on one item and confidence interval boundary on another). However, stimulating "Useful curricular, clinical change" was rated the same as "Unwilling to manage needed change," but only for deans and not for chairs. Table 2 in this article (Table 7 in the original publication) is a summary of those item comparisons that were significant when combining across respondents but were statistically significantly different when deans and chairs were considered separately.

Unwarranted Recommendations
There is one practice typical of surveys in the dental education literature that we would like to see less of. Too often, surveys are used to argue for a favored cause or to support a series of possibly predetermined recommendations. In most cases, the causes and recommendations are consistent with the survey results; but they are also likely to be consistent with many other interpretations, causes, and recommendations. In making such normative pronouncements, authors are betraying assumptions, stated or otherwise, that must be agreed to in order for the recommendations to follow from the data. Where this is done, a thorough review of the literature and a careful building of theory are necessary. Such theory should be constructed so that survey data can confirm or refute the point of the theory. For example, a survey on clock hours for teaching Favorite Subject X cannot support a conclusion regarding the need for more Favorite Subject X absent an independent case for the number of hours actually needed to teach it.

The astrophysicist Sir Arthur Stanley Eddington, who provided empirical confirmation for much of Einstein’s theoretical work, said that "one should be skeptical about a new piece of data until it has been confirmed by theory." The problem of recommending an action solely based on survey results is so common that the English philosopher G.E. Moore coined the phrase "Naturalistic Fallacy" to describe this kind of reasoning. We commit the Naturalistic Fallacy when we argue from a set of descriptions about how things are to a normative conclusion about how they ought to be. A descriptive claim is one about how things are ("dental schools report an average of ten hours of instruction on Topic X"); a normative claim is about what ought to be the case ("the average hours of instruction in dental schools is inadequate"). A more subtle version of this fallacy is to cherry-pick survey results that support the authors’ predetermined views.

It is appropriate that the literature be used to build theory, including theory about what ought to be done to improve dental education. In fact, we are afraid that the naturalistic movement toward evidence-based dentistry is sucking up all the oxygen in dental scholarship and cutting off the flow of good theory construction. But it is simply wrong to describe facts and then jump to conclusions, stepping over the necessary step of building theory. Normative claims must be grounded in the harmony of theory and data; they do not spring full-blown from survey results alone.


   Conclusion
Surveys appear to be easy to prepare, administer, analyze, and report. These features may account for their popularity in dental education. This facileness, however, tends to mask important issues. Survey results rest firmly on issues of sample design. Although the goal of researchers may have been to draw conclusions about an undifferentiated pool of potential respondents, such as educational programs, the respondents who complete the surveys are subject to multiple classifications. This means that multiple sources of variance may potentially contribute to overall findings. The mix of these underlying sources of variance and the importance of each to the way the data should be used must be identified in the analysis and reported so that readers can make decisions appropriate to their own contexts. Information about these components of the overall results should also be used in designing surveys. Absolute sample size is the strongest determinant of survey precision. Small samples lead to large standard errors and wide 95 percent confidence intervals. Such lack of precision compromises the extent to which survey findings can be generalized. Precision is not a function of response rate. The issue of accuracy (avoiding bias in surveys) represents a stubborn challenge for surveys in dental education. Bias can only be detected by comparison between survey data in hand and data that were not collected in the survey. A new technique, sample saturation, is introduced that shows some promise of helping manage survey bias. The results of surveys should be reported in detail, with minimal interpretation, and in ways that are visually intuitive. Evidence without theory does not support recommendations.


   Author Information
Dr. Chambers is Professor of Dental Education at the Arthur A. Dugoni School of Dentistry, University of the Pacific; Dr. Licari is Executive Associate Dean, College of Dentistry, University of Illinois at Chicago. Direct correspondence to Dr. David W. Chambers, Arthur A. Dugoni School of Dentistry, University of the Pacific, 2155 Webster Street, San Francisco, CA 94115; 415-929-6438; dchambers{at}pacific.edu. No reprints will be available.


   REFERENCES

  1. Dillman D. Mail and Internet surveys. New York: John Wiley & Sons, 2007.
  2. Fink A. How to conduct surveys. Thousand Oaks, CA: Sage, 1998.
  3. Groves RM, Fowler FJ Jr, Couper MP, Lepkowski JM, Singer E, Tourangeau R. Survey methodology. New York: John Wiley & Sons, 2004.
  4. Rea LM, Parker RA. Designing and conducting survey research. New York: John Wiley & Sons, 2003.
  5. Sue VM, Ritter LA. Conducting online surveys. Thousand Oaks, CA: Sage, 2007.
  6. Burch P. Educational policy and practice from the perspective of institutional theory: crafting a wider lens. Educ Researcher 2007; 36(2):84–95.
  7. Colquitt JA, Zapata-Phelan CP. Trends in theory building and theory testing: a five-decade study of the Academy of Management Journal. Acad Management J 2007; 50(6):1281–303.
  8. Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N. The dependability of behavioral measures. New York: John Wiley & Sons, 1972.
  9. Licari FW, Chambers DW. Some paradoxes in competency-based dental education. J Dent Educ 2008; 72(1):8–18.
  10. Wolter KM. Introduction to variance estimation. New York: Springer-Verlag, 1985.
  11. Chambers DW. Portfolios for determining initial licensure competency. J Am Dent Assoc 2004; 135(2):173–84.
  12. Chambers DW, Loos L. Analyzing the sources of unreliability in fixed prosthodontics mock board examinations. J Dent Educ 1997; 61(4):346–53.



