Abstract
In support of actions taken by the Joint Commission on National Dental Examinations, two changes (the inclusion of testlet items and the random presentation of items in an interdisciplinary format) were made to enhance the test validity of the National Board Dental Examination Part I in 2007. As a result, the examination was changed from a conjunctive to a comprehensive format. It was assumed that validity would be enhanced with regard to the examination’s internal structure, while not disturbing item performance and examinee score. This study of the results found that 1) three underlying variables were extracted from the conjunctive Part I but only two underlying variables from the comprehensive Part I and 2) the differences in item performance and examinee score were generally small in effect size across formats. Factor analyses revealed that Part I was more discipline-sensitive for the conjunctive format but more item-format-sensitive for the comprehensive format. The revision of Part I changed the nature of the examination from a discipline-based format to a more clinically relevant, interdisciplinary format, a favorable outcome anticipated by the Joint Commission. The results of this study provide evidence supporting the validity of the revised Part I examination for its intended purpose in the licensure process.
For educational programs, systemic reform is a long-term programmatic effort. Since standardized tests are a common part of educational systems throughout the United States, one approach that has been attempted is to develop a test that emphasizes testing a broad range of cognitive behaviors and abilities such as reading comprehension, writing, mathematical and scientific problem-solving, and critical thinking.^{1} For the standardized high-stakes dental licensure tests, the National Board Dental Examination (NBDE) Parts I and II were developed to assist state boards in determining the qualifications of dentists who seek licensure to practice dentistry in their states.^{2} The examination assesses candidates’ ability to understand important information from the basic biomedical, dental, and clinical dental sciences and their ability to apply such information in a problem-solving context.^{2} Because of the test’s importance for new dental graduates seeking licensure, the NBDE impacts dental education programs and has the potential to stimulate educational reform. In light of this influence, the validation of new or revised examinations is of paramount importance.
In 2007, in accordance with the Standards for Educational and Psychological Testing,^{3} the NBDE Part I was restructured in such a way that its validity would be enhanced by assessing the candidates’ knowledge and problem-solving skills in a more clinically relevant and interdisciplinary assessment context that more closely reflects clinical practice. Two changes were introduced to build an examination that may more closely assess the competencies underlying initial safe patient care. These changes involved the inclusion of testlet-based items as a new item type to better appraise candidates’ problem-solving skills related to dental care and the use of a new item arrangement method to present the test items associated with various disciplines in a random, intermingled fashion.
“Context effects” in testing refers to the interaction of the examinee and the item in a particular environment.^{4} Examples of context effects in relation to testing include the typeface used for items on print and computer-based tests, the layout of an item on the printed page, an item’s temporal position on a test form, the mode effect, and administration issues such as allowing or not allowing access to a calculator on a mathematical test. The use of a new item type and a new item arrangement method in this study are deemed to be two specific context effects implemented in the NBDE Part I in order to enhance test validity. Before the revision, Part I was conjunctive and discipline-based, and knowledge in each of the sciences was assessed in relative isolation with little regard for a clinical context; this separation did not adequately reflect the practice of dentistry. After the revision, Part I became comprehensive, interdisciplinary, and more clinically relevant than the traditional four-part, discipline-based test.
Validity, which refers to the degree to which logic and evidence support the use of test scores for making critical decisions, has been recommended as one of the imperative factors in assessing the instruments used for any well-established examination.^{1,3,5,6} When educational programs change their examination format and/or examination content in an attempt to reflect curriculum changes and/or to meet new educational objectives and policies, the evaluative results derived from a validity study would be deemed an important source of information used to support the changes planned. Only when the validity of an examination is secured would an inference drawn from test scores to explain examinee performance be considered appropriate and meaningful. Therefore, to determine whether the two changes to the NBDE Part I are helping move it toward a more clinically relevant and interdisciplinary assessment context, the first and most important step in our validity study is to determine whether the results of the measurement instrument are constant before and after the test restructuring, so that any differences observed reflect actual change in the test’s internal structure rather than measurement artifacts.^{7,8} After we compared the characteristics of the conjunctive and the comprehensive Part I (Table 1), we concluded that the two formats are fairly compatible.
In addition to the change in the NBDE Part I internal structure, these two specific context effects might result in inconsistent estimates of item performance and examinee score. Item performance with regard to difficulty could be influenced by the characteristics of item types or by the sequencing of items.^{9} With regard to item types, studies that have evaluated the effect of relations among item types on item performance in health care licensure examinations have found that 1) case-based items were significantly more difficult than non-case-based items for the same disciplines,^{10} 2) examinees with higher ability performed significantly better on non-case-based items than on case-based items,^{11} and 3) case-based items were of comparable difficulty to non-case-based items, but non-case-based items were statistically better than case-based items at discriminating high-ability examinees from low-ability examinees.^{12} With regard to item sequence, inconsistent item performance estimates for reading and mathematics tests have been found to be greater on items near the pretest items and/or on items at the end of the test.^{13} When the same set of items was administered in two different contexts due to differing item positions, the calibrations conducted on the different ordering of items yielded different item difficulties, which could render different estimates on examinee ability and varied results on examinee score.^{13–15}
The purpose of this study was thus to explore the impact of the changes in the NBDE Part I and to affirm its validity for its intended purpose. Specifically, does the implementation of the two context effects enhance the test validity of Part I? Since the Standards for Educational and Psychological Testing^{3} underscore the notion that validity refers to inferences about constructs made on the basis of test scores, the answer rests on the following questions relative to the conjunctive Part I and comprehensive Part I:

1) Are the calibrations on item statistics (i.e., item difficulty and item discrimination) comparable?

2) Is examinee performance (i.e., test scores) comparable?

3) Does the dimensionality (i.e., examination internal structure) of Part I change?

4) Does the level of predictive validity (i.e., positive score correlation) between Part I and Part II remain the same or improve with the restructuring?
Answering “yes” to questions 1 and 2 would mean that the restructured Part I is acceptable from a psychometric perspective; answering “yes” to the third question would demonstrate that a different underlying construct is elicited from analysis of the restructured examination. That is to say, the statistical evidence pertaining to these questions will indicate whether the two context effects alter only the construct being measured by Part I and whether the test scores accurately reflect the intended construct rather than extraneous characteristics. Answering “yes” to question 4 would lend support to the enhanced validity of the restructured examination. This study was designed to determine if the following two assumptions were justified: 1) there would be no statistically significant differences in the estimates of item performance and examinee score, as well as no significant change in the strength of score correlation between Part I and Part II; and 2) four distinct, discipline-based factors would emerge from the validation process for the conjunctive Part I, while one dominant, interdisciplinary factor would emerge for the comprehensive Part I.
Methods
Instruments
The NBDE Part I consists of a battery of 100-item tests in four disciplinary areas: Anatomic Sciences, Biochemistry-Physiology, Microbiology-Pathology, and Dental Anatomy and Occlusion.^{16} The conjunctive Part I was delivered as a linear sequential computer-based examination from 2003 to 2007. It consisted of 400 multiple-choice items, with 100 items in each discipline. The standalone items associated with the disciplines were administered in a sequential order. The candidates had to respond to the first 200 items (the first set of 100 items in discipline 1 and then the second set of 100 items in discipline 2) in 3.5 hours in the morning session; then, they responded to the remaining 200 items (the third set of 100 items in discipline 3 and the fourth set of 100 items in discipline 4) in 3.5 hours in the afternoon session. Four independent standard scores were converted from the four sets of examinee responses, and an additional Part I average score was computed.
Beginning in 2007, the comprehensive Part I has been delivered as a linear randomized computer-based examination. In the comprehensive Part I, approximately 80 percent (320 items) of the items are standalone items distributed across four disciplines, and 20 percent (80 items) are grouped in testlets with an interdisciplinary focus and clinical application. Each testlet consists of a clinically relevant scenario and a set of multiple-choice items from the various disciplines associated with the scenario. Usually, each testlet comprises nine to twelve test items related to various aspects of a patient’s dental care. Items associated with a particular testlet are independent of one another. The nature of the testlet itself determines the proportion of items for any particular discipline.
In this format, candidates are asked to respond to the first 200 items (presented in the order of 160 randomized standalone items and forty sequential testlet-based items) in the morning session. They respond to the second 200 items (presented in the order of forty sequential testlet-based items and 160 randomized standalone items) in the afternoon session. The standalone items associated with various disciplines are administered in a randomized order. The testlet-based items are administered in a sequential order, with the items within each testlet presented consecutively. An overall standard score, converted from the total number of correct answers on the entire 400 items, is then computed.
Both Part I formats have been delivered as linear, fixed-time, and fixed-length tests. Examination items for both formats were selected by test construction committees comprised of subject matter experts in accordance with examination specifications approved by the Joint Commission on National Dental Examinations. Both examination formats have shown good reliability: the reliabilities for the conjunctive Part I and comprehensive Part I were found to be 0.97 and 0.96, respectively.
Samples
The basic data for the sample consisted of an item-by-person response matrix gathered from the administration of each examination. Each matrix consisted of entries of “1” and “0” for correct and incorrect responses, respectively. A total raw score for each disciplinary area was computed by summing the scored responses for the items in the area. To control for test-taking experience, only the responses of candidates being examined for the first time were gathered and included in the data analyses. Furthermore, to avoid potential measurement disturbances, examinees with a large number of missing responses were removed from the statistical analyses.
Responses from examinees with less than 30 percent correct on the final fifty items were excluded from the study sample because these examinees may have had insufficient time to complete the test, which could add noise and bias the estimates of item performance and examinee score. As a result, the responses of 2,100 examinees on one conjunctive Part I and the responses of 2,781 examinees on one comprehensive Part I were selected to form the final study samples. None of the test items administered in these two examinations had a correct-response rate of zero or 100 percent.
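The sample selection rules described above can be sketched in code. The following is a minimal illustration, not the procedure actually used in the study: the function name, the treatment of missing responses as incorrect when scoring the final fifty items, and the `max_missing` cutoff are all assumptions, since the text does not state what counted as a large number of missing responses.

```python
from typing import List, Optional

# 1 = correct, 0 = incorrect, None = missing response
Response = Optional[int]


def keep_examinee(
    responses: List[Response],
    first_attempt: bool,
    tail_items: int = 50,            # "final fifty items" in the text
    min_tail_correct: float = 0.30,  # "less than 30 percent correct" is excluded
    max_missing: float = 0.10,       # ASSUMPTION: cutoff is not given in the text
) -> bool:
    """Return True if the examinee's responses stay in the study sample."""
    # Only first-time candidates are included.
    if not first_attempt:
        return False

    # Drop examinees with a large share of missing responses (hypothetical cutoff).
    missing_share = sum(r is None for r in responses) / len(responses)
    if missing_share > max_missing:
        return False

    # Drop examinees who likely ran out of time: low accuracy on the last items.
    # ASSUMPTION: a missing response in the tail is scored as incorrect.
    tail = responses[-tail_items:]
    tail_correct = sum(1 for r in tail if r == 1) / len(tail)
    return tail_correct >= min_tail_correct
```

Applying `keep_examinee` to every row of the item-by-person matrix would yield a filtered sample of the kind described in the text.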
Statistical Analyses
Two parameters, the difficulty of an item and its discrimination index, were used as the estimates of item performance. Item difficulty is defined as the proportion of examinees who answered that item correctly.^{17} The difficulty level of the item is inversely related to the percentage of examinees who answered the item correctly. As the percentage increases, the difficulty decreases. The discrimination index is a point-biserial correlation coefficient. The coefficient of an item represents the correlation between the scores on that item (correct or incorrect) and the total score on that particular test. A high correlation coefficient would indicate an item that contributes greatly to the consistency, precision, or reliability with which the ability is measured. Finally, the estimate of examinee score was based on the total raw score computed on each discipline.
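As an illustration of the two item statistics defined above, the sketch below computes item difficulty as the proportion correct and the discrimination index as the Pearson correlation between a 0/1 item score and the total score (which is what a point-biserial coefficient is). The function names and the tiny response matrix are hypothetical, and the total score here includes the item itself, as in the description above.

```python
from math import sqrt
from typing import List


def item_difficulty(item_scores: List[int]) -> float:
    """Proportion of examinees answering the item correctly (higher = easier)."""
    return sum(item_scores) / len(item_scores)


def point_biserial(item_scores: List[int], total_scores: List[float]) -> float:
    """Correlation between a dichotomous item score and the total test score."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    if var_x == 0 or var_y == 0:
        # Undefined when everyone (or no one) answers correctly, or totals are constant.
        raise ValueError("correlation is undefined for a constant variable")
    return cov / sqrt(var_x * var_y)


# Hypothetical item-by-person matrix: rows are examinees, columns are items.
matrix = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
totals = [sum(row) for row in matrix]
first_item = [row[0] for row in matrix]
difficulty = item_difficulty(first_item)
discrimination = point_biserial(first_item, totals)
```

The guard against a constant variable mirrors the study's note that no item had a zero or 100 percent correct-response rate, in which case the coefficient could not be computed.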
To examine the differences in the estimates of item performance and examinee score, t-tests and effect sizes for the mean difference in each discipline between the two examination formats were computed. The effect size is usually used to determine how much of a difference the intervention made.^{18} In this study, the effect size was a statistic employed to determine how much of an impact the format made in the estimates of item performance and examinee score. The standardized mean difference (i.e., dividing the difference between the first and second raw score means by the average standard deviation) was computed to obtain the effect sizes to determine the intervention’s effects as a result of the change in Part I format. Based on Cohen’s guideline, an effect size of 0.2 to 0.3 might be considered a “small” effect, around 0.5 a “medium” effect, and 0.8 to infinity a “large” effect. A simple correlation analysis was further conducted on each format to compare the magnitude of test score correlation between two dental board examinations, Part I and Part II.
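The effect size computation described above can be sketched as follows. This is an illustrative reading of the text, assuming the sample standard deviation is used for each group and the two are averaged. The labeling cutoffs fill the gaps between Cohen's benchmarks with one common convention and are an assumption, as is the "negligible" label; the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def standardized_mean_difference(first: Sequence[float], second: Sequence[float]) -> float:
    """Difference between the two means divided by the average standard deviation."""
    # ASSUMPTION: sample (n - 1) standard deviations, averaged across the two groups.
    average_sd = (stdev(first) + stdev(second)) / 2
    return (mean(first) - mean(second)) / average_sd


def cohen_label(effect_size: float) -> str:
    """Map an effect size to a rough verbal label based on Cohen's benchmarks."""
    # ASSUMPTION: cutoffs at 0.2, 0.5, and 0.8; the text only gives approximate bands.
    magnitude = abs(effect_size)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


# Hypothetical raw scores for one discipline under the two formats.
conjunctive_scores = [2, 4, 6]
comprehensive_scores = [1, 3, 5]
d = standardized_mean_difference(conjunctive_scores, comprehensive_scores)
label = cohen_label(d)
```

Reporting the effect size alongside the t-test matters here because, with samples of more than 2,000 examinees, even very small mean differences can reach statistical significance.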
To assess the examination’s internal structure, the dimensionality underlying the conjunctive and comprehensive Part I examinations was studied with the use of confirmatory and exploratory factor analyses. Factor analysis has been recognized as an appropriate method to evaluate the presence of viable underlying constructs embedded in an examination.^{19–24} In general, the results of factor analyses indicate whether examination data such as item responses correlate highly with one another or whether there are subgroups of data that correlate highly with each other. If subgroups emerge from the analysis, the examination is considered to be multidimensional or conjunctive in nature.
A factor analysis program, TESTFACT, was used in this study to evaluate the examination’s internal structure.^{25} Confirmatory factor analyses and exploratory factor analyses were conducted sequentially to determine if a relationship exists between the observed variables and their underlying latent constructs, as stated in the study’s assumptions. For exploratory factor analysis, the program can estimate tetrachoric correlations for all distinct n(n−1)/2 pairs of items to determine the viable factors. The factor pattern derived from the confirmatory analysis consists of a general factor on which all items have some loading, plus any number of “group factors” to which nonoverlapping subsets of items are assumed to belong.^{25} The results of a chi-square analysis were generated from both the exploratory and confirmatory analyses to provide a test of the statistical significance of factors added successively to the model that might result in a significant improvement.
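To give a sense of scale for the n(n−1)/2 item pairs mentioned above, and of what a tetrachoric correlation between two dichotomous items looks like, here is a rough sketch. It does not reproduce TESTFACT's estimation method, which the text does not describe; it substitutes the closed-form "cosine-pi" approximation computed from the 2×2 table of joint responses. The function names are hypothetical.

```python
from math import cos, pi, sqrt
from typing import List


def n_item_pairs(n_items: int) -> int:
    """Number of distinct item pairs, n(n - 1) / 2."""
    return n_items * (n_items - 1) // 2


def tetrachoric_cos_pi(x: List[int], y: List[int]) -> float:
    """Cosine-pi approximation to the tetrachoric correlation of two 0/1 items.

    NOTE: a simple approximation used for illustration only, not TESTFACT's estimator.
    """
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # both correct
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # only first correct
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # only second correct
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # both incorrect
    if b * c == 0:
        if a * d == 0:
            raise ValueError("approximation is undefined for this 2x2 table")
        return 1.0  # no discordant pairs: limiting value of the formula
    return cos(pi / (1 + sqrt((a * d) / (b * c))))


pairs_for_400_items = n_item_pairs(400)
```

For a 400-item examination this gives 79,800 pairwise correlations, which helps explain why a dedicated program is used for full-information item factor analysis.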
The scree plots of 400 latent roots were developed first to determine the number of factors or viable dimensions in the data to form the optimal number of factor solutions.^{26} A varimax rotation was then used to obtain a clear pattern of loadings to ascertain the underlying common factor structure for the two examinations. An item was considered significantly related to a certain factor when its loading on that factor was greater than its loadings on the remaining factors. Statistics including the parameter estimate of each factor, the percentage of variance explained by each factor, and chi-square analysis for test of fit of each factor solution were used as fit indexes to assess the underlying construct of the two examinations in the validation process.
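The rule above for relating an item to a factor (the factor on which the item has its highest rotated loading) can be sketched as a simple assignment step. The loading values below are made up for illustration, the function names are hypothetical, and breaking ties in favor of the lower-numbered factor is an assumption the text does not address.

```python
from typing import List


def assign_items_to_factors(loadings: List[List[float]]) -> List[int]:
    """For each item (row), return the index of the factor with the highest loading."""
    # ASSUMPTION: ties go to the lower-numbered factor (max returns the first maximum).
    return [max(range(len(row)), key=lambda j: row[j]) for row in loadings]


def items_per_factor(assignments: List[int], n_factors: int) -> List[int]:
    """Count how many items were assigned to each factor."""
    counts = [0] * n_factors
    for factor in assignments:
        counts[factor] += 1
    return counts


# Hypothetical rotated loadings: four items (rows) on three factors (columns).
rotated = [
    [0.41, 0.10, 0.05],
    [0.12, 0.38, 0.07],
    [0.30, 0.05, 0.33],
    [0.45, 0.20, 0.02],
]
assignments = assign_items_to_factors(rotated)
counts = items_per_factor(assignments, n_factors=3)
```

Tabulating such counts by discipline and by item type gives the kind of summary used to judge whether a factor attracts enough items to be considered viable.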
Results
Item Performance, Examinee Score, and Score Correlation
The item statistics (Table 2) were deemed to be sound and reasonable. According to the Joint Commission, for an item to be considered effective, it must produce a difficulty index between 40 and 89 percent and a corresponding discrimination index of 0.15 or higher.^{2} Of 800 items investigated in this study, 324 standalone items in the conjunctive Part I and 226 standalone items and forty-nine testlet-based items in the comprehensive Part I met the Joint Commission’s criteria. The item difficulties and item discriminations were similar across the four disciplinary areas.
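The Joint Commission's criteria quoted above translate into a simple screening check. In this sketch, difficulty is expressed as a proportion (0.40 to 0.89) rather than a percentage, and the bounds are treated as inclusive, which is an assumption; the function names and example values are hypothetical.

```python
from typing import Iterable, Tuple


def is_effective_item(
    difficulty: float,
    discrimination: float,
    min_difficulty: float = 0.40,      # 40 percent correct
    max_difficulty: float = 0.89,      # 89 percent correct
    min_discrimination: float = 0.15,  # point-biserial of 0.15 or higher
) -> bool:
    """Return True if an item meets the difficulty and discrimination criteria."""
    # ASSUMPTION: the difficulty bounds are inclusive at both ends.
    in_difficulty_range = min_difficulty <= difficulty <= max_difficulty
    return in_difficulty_range and discrimination >= min_discrimination


def count_effective(items: Iterable[Tuple[float, float]]) -> int:
    """Count the (difficulty, discrimination) pairs that meet the criteria."""
    return sum(1 for p, r in items if is_effective_item(p, r))


# Hypothetical item statistics: (difficulty, discrimination).
example_items = [(0.64, 0.27), (0.95, 0.30), (0.70, 0.10), (0.35, 0.30)]
n_effective = count_effective(example_items)
```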
Specifically, the average item difficulty associated with each discipline ranged from 0.64 to 0.74 for both examination formats. Dental Anatomy and Occlusion was the easiest disciplinary area among the four. All of the items had a positive discrimination value, with average item discrimination ranging from 0.27 to 0.30 for the conjunctive Part I and from 0.24 to 0.28 for the comprehensive Part I. Of the four disciplinary areas, Biochemistry-Physiology had the highest average item discrimination for both examinations. Items in the two formats were of comparable difficulty across the four disciplines. Although items associated with Anatomic Sciences and Dental Anatomy and Occlusion were statistically better for the conjunctive Part I than those for the comprehensive Part I at discriminating high-ability from low-ability examinees, the effect sizes for these two disciplines were either small or moderate.
Regarding the examinee score for each discipline in each examination format, examinees scored significantly higher in Anatomic Sciences on the conjunctive Part I and in Biochemistry-Physiology and Microbiology-Pathology on the comprehensive Part I (Table 2). Although the results from the t-tests showed a statistically significant difference in average examinee score for Anatomic Sciences, Biochemistry-Physiology, and Microbiology-Pathology, the corresponding effect sizes in relation to each significant result were either small or medium. Such findings were in line with the conventional assumption that research tends to show smaller effect sizes in this type of study.^{18} Overall, the measures of item performance and examinee score were believed to be rather similar between the two study samples. Regarding predictive validity, the correlation results (Table 3) indicate a consistently significant relationship between the performance of examinees in the two formats and for Part II overall. Regardless of format, it is evident that the Part II standalone items resulted in much higher correlation magnitudes with Part I test items when compared to Part II case-based items.
Examination’s Internal Structure
The results of the first confirmatory analysis showed that the 400 items loaded on one general ability factor for both formats. The second confirmatory analysis showed that items loaded on two factors: the general ability factor and the group factor. The results based on the second confirmatory analysis achieved a significant improvement in model fit in terms of better tapping the underlying structure of Part I over the results derived from the first confirmatory analysis for both examination formats (Table 4). As indicated by the statistical significance of the chi-square difference between the two analyses, the total percentage of variance explained increased by 7.07 and 4.68 percent for the conjunctive and comprehensive Part I examinations, respectively. Such results suggest that items from all four disciplinary areas should be used to form the group factor. However, the increase was relatively larger for Part I administered in the conjunctive format than in the comprehensive format.
Furthermore, after we compared the percentages of variance explained by the group factors, the results showed that items with item factor loadings on the group factor were considerably more noticeable for the conjunctive Part I than for the comprehensive Part I. The percentage of variance explained in each of the four disciplines in the conjunctive Part I was consistently higher than those in the comprehensive Part I. Discipline 4, which explained the highest percentage of variance, was the strongest group factor among the four disciplines for both examinations. The differences between the percentages of variance explained by each discipline were greater for Part I administered in the conjunctive format than for those found in the comprehensive format.
Summary statistics of the item factor loadings were computed based on the general ability factor and the group factor derived from the second confirmatory analysis (Table 5). For the general ability factor, these results showed similar, moderate average factor loadings across the four disciplines ranging from 0.30 to 0.35 for the conjunctive Part I and from 0.28 to 0.33 for the comprehensive Part I. There were no statistically significant differences between the two formats in the average factor loadings across the four disciplines. For the group factor, there were statistically significant differences in the average factor loadings between the two examinations in three disciplinary areas: Anatomic Sciences, Biochemistry-Physiology, and Dental Anatomy and Occlusion. Based on the overall results (Tables 4 and 5), this study suggests that 1) the general ability factor, though moderate, was confirmed as an underlying construct embedded in both examinations, and 2) as expected, several other factors, in addition to the general ability factor, would be elicited to tap the underlying construct for the conjunctive Part I since the magnitudes of the average factor loading in the group factor were stronger in the conjunctive format than those in the comprehensive format.
The exploratory factor analyses on the overall item response data also resulted in one strong factor and a number of minor factors. Such findings were in agreement with the assumption that it is very common for an ability-related examination to obtain one primary factor and several minor ones. The primary factor might best be described as the general ability factor, and the remaining factors might be related to the group factors as manifested from the results of the confirmatory factor analysis. In this study, the scree plots showed an elbow between the fourth and fifth factors for the conjunctive Part I and between the third and the fourth factors for the comprehensive Part I. With the response data gathered from fewer than 3,000 examinees and 400 items being analyzed, the results of power analysis suggested that the four-factor solution was the highest factor number practicable for both examination formats. Then, to obtain a clear pattern to interpret the factor results, the four-factor solution was used to serve as base information to examine the presence of viable dimensions on the examination.
With regard to the percentage of variance explained by each individual factor, the results (Table 6) show that 1) factor 1 was the dominant factor for both examinations; 2) factor 1 derived from the four-factor solution (10.75 percent) explained a greater percentage of variance on the underlying construct than the factor 1 drawn from the three-factor solution (10.11 percent) for the conjunctive Part I; and 3) factor 1 elicited from the two-factor solution (10.75 percent) explained a greater percentage of variance on the underlying construct than the factor 1 drawn from the three-factor (10.04 percent) and four-factor solutions (10.29 percent) for the comprehensive Part I.
With regard to the percentage of total variance explained by the model, the results showed that the four-factor solution (13.25 percent for the conjunctive Part I and 12.68 percent for the comprehensive Part I) resulted in a significant improvement in model fit over the three-factor solution (12.29 percent for the conjunctive Part I and 12.11 percent for the comprehensive Part I) for both examinations. However, the two-factor solution (12.45 percent) resulted in a significant improvement in model fit over the three-factor solution (12.11 percent) for the comprehensive Part I. Although the fit index results were in favor of the four-factor solution, it is still imperative to conduct a factor rotation to help obtain a meaningful and concise solution.
For the conjunctive Part I, the four-factor solution (Table 7) found that factor 1 comprised the items with the highest factor loadings mainly administered in the disciplinary areas of Anatomic Sciences, Biochemistry-Physiology, and Microbiology-Pathology, factor 2 in Dental Anatomy and Occlusion, and factor 3 in Biochemistry-Physiology and Microbiology-Pathology. There were very few items significantly associated with factor 4, which suggests that the four-factor solution was not viable. The three-factor model provided a much clearer factor pattern, with factor 1 as the primary factor and factors 2 and 3 as the secondary factors. These findings suggest that three distinct factors were present in the underlying structure for the conjunctive Part I. After examining the content specifications included in each disciplinary area (Figure 1), factors 1, 2, and 3 were respectively related to the items comprising the basic sciences, dental science, and the basic sciences. Since there were more items associated with factor 1 than factor 3, factor 1 could be considered as a main factor and factor 3 as a minor factor, both tapping the domain of the basic sciences.
The rotated factor solution results (Table 8) are reported for the comprehensive Part I. The results of the four-factor solution showed that there were only relatively few items with higher factor loadings loaded on the third and fourth factors, indicating that the two-factor model was a viable solution. Based on the two-factor solution, approximately 60 percent of the items with highest factor loadings loaded on factor 1 and 40 percent of the items loaded on factor 2 in the disciplinary areas of Anatomic Sciences, Biochemistry-Physiology, and Microbiology-Pathology. In the case of Dental Anatomy and Occlusion, about the same number of items with high factor loadings contributed to both factors. A majority of items contributing to factor 1 were from the standalone items, and a majority of testlet-based items loaded on factor 2. The overall results suggest that a two-factor solution might be the most favorable solution to depict the underlying structure for the comprehensive Part I. All four disciplinary areas had items contributing to factors 1 and 2 (Figure 2).
Discussion
It has been noted that context effects may be found in an almost infinite variety of places and in a variety of sizes.^{4,27,28} In this case, a construct validity study was conducted to investigate whether the two specific context effects helped increase the test validity of the NBDE Part I. Two examination formats with adequate sample sizes were chosen for the analyses. The study results showed that item difficulty and item discrimination indices were comparable between the two formats. The reliability of the overall examination was found to be acceptable, and examinee scores in relation to each discipline were found to be relatively comparable. The effect size used to account for the extent of difference on the average item performance and examinee score was either small or medium in magnitude across formats. Furthermore, the strength of predictive ability between Part I and Part II did not decline as a result of the Part I restructuring. In all, the study results thus supported the first assumption.
The results derived from confirmatory and exploratory item factor analyses were consistent and supportive of each other in determining the favorable factor solution for the two examination formats. In general, the results showed that there was an underlying viable construct embedded in Part I, and the number of factors being drawn to form its internal structure depended on the examination format. The items with the highest factor loadings were clearly associated with three distinct factors in the conjunctive Part I, while the items with the highest factor loadings were associated with two distinct factors for the restructured Part I. Although the examination’s internal structure was practically different between two examinations, with only three factors instead of four factors being drawn for the conjunctive Part I and two factors instead of one factor for the comprehensive Part I, the study results only partially supported the second assumption.
Difference in the examination’s internal structure was a function of different formats. For the conjunctive Part I, factors 1 and 3 were named Basic Sciences 1 and Basic Sciences 2 in this study because the items with higher factor loadings were mostly related to the disciplinary areas of Anatomic Sciences and Microbiology-Pathology for factor 1 and Biochemistry-Physiology for factor 3. Factor 2 was named Dental Science because a majority of items with higher factor loadings were related to the discipline of Dental Anatomy and Occlusion. However, for the comprehensive Part I, factors 1 and 2 were named Basic Sciences and Dental Science because the items contributing to both factors were rather equally related to all four disciplines. The difference between these two factors was mainly related to the percentage of different item types contributing to each factor. The items contributing to factor 1 were mostly from the standalone items, whereas the items contributing to factor 2 were from both standalone and testlet-based items. In all, the study results showed a very interesting and important finding: factors were more discipline-sensitive for the conjunctive Part I, but more item-format-sensitive for the comprehensive Part I.
A study of two specific context effects appears to have resulted in encouraging conclusions for the comprehensive Part I. The restructuring affected the examination’s underlying construct as anticipated, in a way intended to enhance test validity, without at the same time having a meaningful impact on item performance or examinee score. That is, after comparing the nature of the two examinations’ internal structures, it is reasonable to claim that the use of a new item type and a new item presentation method created a more interdisciplinary assessment content than did the traditional format. As observed in an earlier dental licensure validity study, the inclusion of testlet items, which might sample a broader range of cognitive behaviors and problem-solving skills, could plausibly enhance the test’s validity through the assessment of examinees’ knowledge on patient treatment planning and diagnosis from a more clinically relevant content domain.^{19}
There were some limitations in this study. First, it would be essential for content experts to assist in interpreting the factor-solution results. Their input would help create meaningful factor names to support and validate the examination structure. It is believed that the statistical results reported in this study, together with comments gathered from content experts, would provide the Joint Commission with additional valuable validity evidence in determining whether the restructuring of Part I actually improved test validity.
Second, factor analysis might elicit different factor-solution results if more testlet-based items were included on the examination to assess examinees’ decision-making and problem-solving skills. For instance, an item-format-group factor might be more clearly manifested for the comprehensive Part I if a majority of the stand-alone items with higher factor loadings from all disciplines contributed to one factor while a good number of testlet-based items contributed to another factor. Such findings would provide the Joint Commission with empirical evidence of the direct interaction between context effect and test validity.
The third limitation was the sample size used in this study. Since the scree plot showed an elbow between the fourth and fifth factors for the conjunctive Part I, it would be prudent to begin by interpreting the item factor loadings derived from a five-factor solution to determine how many distinct factors would be elicited over the course of the factor analysis. With a larger sample size, four distinct factors might be drawn from the final factor solution, a desirable internal structure for the conjunctive Part I.
The fourth limitation is related to the generalizability of the study results. To deal with this issue, one approach is to conduct a replication study on various forms to revalidate the presence of a discipline-sensitive factor elicited from the conjunctive Part I and the item-format-sensitive factor elicited from the comprehensive Part I. Another approach is to expand the current study sample to include examinees with various backgrounds to reconfirm that the examination’s internal structure is independent of an examinee’s testing history and schooling. Furthermore, since in the past the conjunctive Part I showed good predictive ability relative to dental student performance, a new predictive validity study should be conducted for the comprehensive Part I to examine the correlation of items that have significant factor loadings with a set of external variables such as biomedical and preclinical dental technique grades from the first and second years of dental programs.
Conclusions
Two new context effects were implemented in the test development phase to help enhance the validity of Part I of the National Board Dental Examination. As a result, the examination was changed from a conjunctive to a comprehensive format. Through consistent comparisons of factor loading results between confirmatory and exploratory analyses, the factors that comprise the examination’s internal structure were found to differ between the two formats, as expected. However, the change in examination format had only a slight impact on item performance and examinee score. That is, although a different construct was elicited after the restructuring of Part I, the differences in the magnitude of item difficulty and item discrimination were negligible, and the examinees’ scores were fairly comparable between the two examinations. Moreover, the two context effects (administering the stand-alone items in a randomized order and the testlet-based items in a sequential order) did make the examination measure the candidates’ knowledge and problem-solving skills in a more interdisciplinary context. The disciplinary area was the determining factor in forming the underlying construct for the conjunctive Part I, while the item format was a more crucial factor for the comprehensive Part I. In conclusion, Part I was more discipline-sensitive before its revision and more interdisciplinary after its restructuring, which is the favorable validity study result anticipated by the Joint Commission.
