
A peer-reviewed electronic journal. ISSN 1531-7714

Copyright is retained by the first or sole author, who grants right of first publication to |

Bacon, Donald (2004). The contributions of reliability and pretests to effective assessment. Practical Assessment, Research & Evaluation, 9(3). Retrieved April 23, 2014 from http://PAREonline.net/getvn.asp?v=9&n=3 . This paper has been viewed 27,745 times since 2/26/2004.
Donald R. Bacon
Schools, colleges, and universities are increasingly turning to the assessment of learning outcomes to evaluate the effectiveness of their programs. Unfortunately, for institutions with few students per year, it may take years to accumulate a large enough sample size to form statistically sound conclusions about the effectiveness of an instructional practice. Even for institutions with larger student populations, the collection of a large sample may be cost-prohibitive. This paper shows how improving the reliability of outcome measures and including pretests or covariates in the assessment process increase statistical power and can dramatically reduce the required sample size, thus enabling an organization to collect supportive evidence more quickly and inexpensively. Some background on statistical power in research designs, and the use of pretests in particular, will be offered before presenting tables that describe the interrelationships among assessment reliability, pretests, power, and sample size.
Assessment measures are often used as part of a tracking study of some sort, and often in cohort designs. Assessment measures may include exams, such as a final given to all students who complete an algebra course. The same exam can be given each semester, and the data then combined to enable tracking of improvements in algebra achievement from semester to semester or year to year. In the cohort design, the people who completed the school’s program in the past can be seen as a control group (e.g., last year’s algebra students) and those who completed the program more recently, after some instructional change occurred in the program, can be seen as a treatment group (e.g., this year’s algebra students). These groups may also be represented in an experimental design using concurrent programs, where a control group receives the conventional form of education and the treatment group receives some new form of instruction. One of the keys to achieving meaningful results with experimental designs is statistical power, which is the ability to detect statistically significant differences. The greater the statistical power in an experiment, the greater the chances of finding a statistically significant result. One type of power analysis is the determination of how large a sample must be drawn in order to have a reasonable chance of achieving statistical significance when an effect is present. A commonly used threshold in these analyses is .80 (Cohen, 1977, p. 56), meaning the researcher asks how large a sample is necessary to have an 80% chance of detecting a statistically significant difference, if the expected difference exists. The analysis of power is dependent on how large a
difference the researcher expects to find. In most statistical analyses, it is
assumed that the null hypothesis is true, that is, the control group and the
treatment group have exactly the same outcomes. In power analyses, it is
assumed that the control group and the treatment group will have some specific
difference in outcomes. Assuming some specific difference, or effect size, is
necessary to be able to estimate the probability of detecting a difference of
that magnitude. Many methods are available for specifying the size of the
expected difference in outcomes, but one common measure of effect size is the
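standardized mean difference in outcome measures ($d$):
(1) $d = \dfrac{\mu_T - \mu_C}{\sigma_Y}$
In this equation, $\mu_T$ and $\mu_C$ are the mean outcomes of the treatment and control groups, and $\sigma_Y$ is the within-group standard deviation of the outcome measure. In this paper, the analysis of power is developed from a test for differences across two groups, the $t$ test.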
The effect of a pretest on assessment effectiveness is primarily a function of how well that pretest correlates with the final assessment, or posttest. Higher pretest-posttest correlations indicate that the pretest explains more variance in the posttest, leaving less variance unexplained. Thus, if there is any variance in the final assessment due to the effect of a new instructional form, that variance will be more obvious. Although the use of meaningful pretests is generally possible, they are sometimes impractical. For example, in an educational setting, students may have little incentive to carefully complete a pretest on the first day of class, and so the pretest reliability may be poor. In addition, beginning students may have so little knowledge of a content area that their test responses are not consistent with a meaningful scale, and thus the pretest-posttest correlation may be so low that the pretest is virtually useless for statistical purposes. For example, Bacon (2002) reported pretest-posttest correlations from a junior-level business college course of around .30, even though the final exam had a reliability of .92. At this level, the pretest reduces the total variance in the posttest by only about 10% (see also Equation A6), an amount many would consider to be not worth the trouble (e.g., Reichardt, 1979). When pretests are impractical, or when additional explained variance is desired, covariates may be considered. It is important to note that because the key characteristic of the pretest is its correlation with the posttest, any reasonable covariate or set of covariates may be used instead of or in addition to a pretest, as long as the experimental groups are equivalent. For example, in Bacon’s (2002) study, although the pretest was not highly correlated with the final exam, grade point average was correlated with the final exam at the .63 level. 
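The variance figures in this paragraph follow from squaring the correlation between the covariate and the posttest. A minimal sketch, using only the two correlations quoted above (.30 for the pretest and .63 for grade point average); the function name is illustrative and does not come from the paper:

```python
def variance_explained(r_xy: float) -> float:
    """Share of posttest variance accounted for by a covariate
    that correlates r_xy with the posttest (the squared correlation)."""
    return r_xy ** 2


# Correlations reported for the junior-level business course (Bacon, 2002).
for label, r in [("pretest", 0.30), ("grade point average", 0.63)]:
    share = variance_explained(r)
    print(f"{label}: r = {r:.2f}, posttest variance explained = {share:.0%}")
```

Under these figures the pretest removes roughly a tenth of the posttest variance, while grade point average removes roughly four tenths, which is why the latter is the more useful covariate here.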
Pretests and covariates alike are generally modeled as covariates in subsequent statistical tests (Reichardt, 1979). It should be noted, however, that in the case of non-equivalent group designs, using covariates might lead to bias if there are systematic differences across the groups (see especially Reichardt, 1979, p. 169). A pretest would be preferred under these conditions. For simplicity in describing the tables presented later in this paper, the variable or set of variables that may be used as covariates are lumped together and referred to as “the pretest.” It is also important to note that the observed pretest-posttest correlation ($\rho_{XY}$) is related to the correlation between the true scores ($\rho_{T_X T_Y}$), the reliability of the posttest ($\rho_{YY}$), and the reliability of the pretest ($\rho_{XX}$) through the following relation:
(2) $\rho_{XY} = \rho_{T_X T_Y}\sqrt{\rho_{XX}\,\rho_{YY}}$
Thus, in practice, improvements to the posttest measure reliability will also increase the pretest-posttest correlation.
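The link between posttest reliability and the observed correlation can be illustrated numerically. The sketch below assumes the classical attenuation formula, in which the observed correlation equals the true-score correlation multiplied by the square root of the product of the two reliabilities. The true-score correlation of .80 and the pretest reliability of .80 are hypothetical values chosen only for illustration.

```python
import math


def observed_correlation(true_r: float, rel_x: float, rel_y: float) -> float:
    """Observed pretest-posttest correlation under the classical
    attenuation formula: true_r * sqrt(rel_x * rel_y)."""
    return true_r * math.sqrt(rel_x * rel_y)


TRUE_R = 0.80  # hypothetical true-score correlation
REL_X = 0.80   # hypothetical pretest reliability

for rel_y in (0.50, 0.70, 0.90):
    r_xy = observed_correlation(TRUE_R, REL_X, rel_y)
    ceiling = math.sqrt(rel_y)  # upper bound on r_xy for this posttest
    print(f"posttest reliability {rel_y:.2f}: "
          f"observed r = {r_xy:.2f} (cannot exceed {ceiling:.2f})")
```

With everything else held fixed, raising the posttest reliability raises the observed correlation, and the observed correlation never exceeds the square root of the posttest reliability.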
To understand the effects of assessment reliability, pretests, and effect size on study design, meaningful values for these variables will be inserted in the equations from the Appendix and the required sample sizes tabulated. Cohen (1977, p. 56) makes a compelling case that a meaningful value for the acceptable level of power would be .80, and recognizes .05 as a very commonly used significance level.
In considering meaningful values of the pretest-posttest relationship for the present analysis, one must recognize that in practice many studies use no pretest at all, effectively setting the pretest-posttest correlation ($\rho_{XY}$) to zero. It should also be noted that the correlation with the pretest cannot exceed the square root of the reliability of the posttest (Equation 2). Therefore, the set of pretest-posttest correlations of 0, .3, .5, and .7 will be used here.
Meaningful values for effect sizes can be found in Cohen (1977), although different standards may be appropriate for academic research in educational psychology (see Osborne, 2003). Cohen (1977) describes small, medium, and large standardized mean differences ($d$) as .2, .5, and .8, respectively. These are observed effect sizes, which are attenuated by measurement error. If the outcome measure has a reliability of .70 and the true variance ($\sigma^2_T$) is 1.0 (as standardized here), the total variance would be 1.43 (1.0/.70) and thus the observed standard deviation would be 1.20. Therefore, modeling Cohen’s observed effect sizes of .2, .5, and .8 requires “true effect sizes” of .24, .60, and .96, respectively (.2 × 1.2, .5 × 1.2, and .8 × 1.2). These true effect sizes ($ES_T$) are used in the tables and the figure presented in this paper.
Setting the left side of Equation A1 at .80, and substituting the values mentioned above for $\rho_{XY}$, $\rho_{YY}$, and $ES_T$ into the right side (via Equation A9), the sample size value that balanced the equation was noted. The GoalSeek function in MS Excel was used to solve for sample size values at various levels of the other variables in the model. In the special cases where $\rho_{YY}$ = .70 and $\rho_{XY}$ = 0, the results were identical (within rounding error) to those reported in Table 2.3 in Cohen (1977).
Results and Discussion
The effects of
reliability, the use of pretests, and the effect size on the required sample sizes are shown in Figure 1. (The figure was simplified by setting the control and treatment group sizes equal, but later analyses will allow these to differ.) Figure 1 shows that over the range of values shown here, the effect size ($ES_T$) appears to have the most dramatic impact on the size of the sample required (assuming the pretest-posttest correlation, $\rho_{XY}$, = .50 and the outcome measure reliability, $\rho_{YY}$, = .70). The reliability of the outcome measure and the use of a pretest were similar in their effectiveness over a reasonable range of values. The required group sizes would drop from about 150 to about 75 as the reliability increases from .5 to .95 (assuming $\rho_{XY}$ = .50 and $ES_T$ = .40), while the required group sizes would drop from about 140 to about 60 as the pretest-posttest correlation increases from .2 to .75 (assuming $\rho_{YY}$ = .70 and $ES_T$ = .40).
Although effect size is found to be a major driver of sample size, in practice an assessment professional may not have control over the size of the improvements in outcomes. Also, as noted earlier, in practice improvements in outcome measure reliability will generally lead to improvements in the pretest-posttest correlation. Therefore, we can conclude from this analysis that outcome measure reliability and the effective use of pretests both warrant close attention in the design of assessment plans.
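The group sizes discussed above can be approximated in a few lines. The sketch below is not the Excel GoalSeek procedure used in the paper. It assumes equal group sizes, a two-tailed test at the .05 level, and a normal approximation to the noncentral t distribution, so its results may differ from the published values by a few cases. The function names are illustrative.

```python
import math
from statistics import NormalDist


def adjusted_effect_size(es_true: float, rel_yy: float, r_xy: float = 0.0) -> float:
    """True effect size rescaled to the residual observed standard deviation:
    ES_T * sqrt(rel_yy / (1 - r_xy**2))."""
    return es_true * math.sqrt(rel_yy / (1.0 - r_xy ** 2))


def group_size(es_true: float, rel_yy: float, r_xy: float = 0.0,
               alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate size of EACH group (equal groups) needed for the target
    power. Normal approximation: n = 2 * ((z_alpha/2 + z_power) / d)**2."""
    z = NormalDist()
    z_total = z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)
    d = adjusted_effect_size(es_true, rel_yy, r_xy)
    return math.ceil(2.0 * (z_total / d) ** 2)


# Reliability rising from .5 to .95 with r_xy = .50 and ES_T = .40.
print(group_size(0.40, 0.50, 0.50), group_size(0.40, 0.95, 0.50))

# Pretest-posttest correlation rising from .2 to .75 with rel_yy = .70.
print(group_size(0.40, 0.70, 0.20), group_size(0.40, 0.70, 0.75))
```

Both calls reproduce the pattern described in the text: either a more reliable outcome measure or a more predictive pretest cuts the required group size roughly in half over these ranges.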
Tables 1, 2, and 3 show the sample sizes required for small, medium, and large effect sizes, respectively, given various assessment reliabilities and pretest-posttest correlations. A hypothetical example will be offered to show how these tables could be used to evaluate an assessment system. Suppose a school decided to implement an assessment system wherein the learning outcomes of students would be assessed each year, using tests or similar measures, and these outcomes would be compared to prior years. The system would include statistical analyses to evaluate the effect of changes in the curriculum, and these analyses would form the basis of decisions to retain or reject changes. Thus, the system can be seen as a closed feedback loop. Suppose this system had been in place for one year, and at the end of that year, a new curriculum was to be introduced that was expected to lead to improvements in learning outcomes. The assessment system didn’t use pretests and used a locally developed outcome measure with a reliability of only .50. Such a system might be obtained from portfolio-based assessments with poorly trained graders or weak rubrics, or from tests that have not been subjected to item analysis. At this school, 60 students finish this level each year, so the control sample is 60. Suppose further that the changes to be expected in outcomes corresponded to medium effect sizes (based on studies reported in Bloom, 1976). Table 2 (second column from the left) shows that a sample of 166 more students would be needed to comprise the treatment group in order to have an 80% chance of detecting a statistically significant improvement. At this school, the assessment system would take nearly three more years (nearly four altogether, including the control year) to collect the data before a sufficiently powerful statistical evaluation could be made.
To continue the hypothetical example, now imagine that an assessment professional noticed the low reliability of the measures in a pilot study and, realizing that four years is too long to wait for feedback, immediately made changes to the measures. Suppose the outcome measure’s reliability were improved to .90, and a pretest (perhaps prior grades, or standardized test scores from another source, if the treatment and control groups were fairly equivalent) were added that correlated with the outcome measure at the .70 level. Still looking in Table 2 (the underlying effect size has not changed), the far right column indicates that now only 16 students would be necessary for the treatment sample. Further, if the school were able to run the control and the treatment conditions in the same year, perhaps with different groups of students, the size of each group need only be 26 (from the bottom row of Table 2). Thus, the school’s assessment system could make a sufficiently powerful quantitative evaluation in one year or less. Of course, with smaller sample sizes, the researcher must always watch for outliers, and statistical significance at the .05 level may not always be necessary to make reasonable decisions, but from this example it is clear that by improving the reliability of the measures and using predictive pretests, assessment systems may obtain sound conclusions in a much more timely manner.
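The two scenarios in this example can be checked approximately when the control group is fixed at 60 students. The sketch below is an assumption-laden approximation rather than the paper's procedure: it uses a two-tailed .05 test and a normal approximation to the noncentral t distribution, and the function name is illustrative. The inputs (a true effect size of .60, reliabilities of .50 and .90, and a pretest-posttest correlation of .70) come from the example itself.

```python
import math
from statistics import NormalDist


def treatment_size(n_control: int, es_true: float, rel_yy: float,
                   r_xy: float = 0.0, alpha: float = 0.05,
                   power: float = 0.80):
    """Approximate treatment-group size needed for the target power when the
    control-group size is fixed. Returns None if no treatment size suffices."""
    z = NormalDist()
    z_total = z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)
    # Squared effect size in units of the residual observed variance.
    d_squared = es_true ** 2 * rel_yy / (1.0 - r_xy ** 2)
    # Required value of n_c * n_t / (n_c + n_t).
    h = z_total ** 2 / d_squared
    if n_control <= h:
        return None  # the control group alone is too small
    return math.ceil(h * n_control / (n_control - h))


# Scenario 1: reliability .50, no pretest, 60 control students.
print(treatment_size(60, 0.60, 0.50))

# Scenario 2: reliability .90, pretest correlating .70 with the posttest.
print(treatment_size(60, 0.60, 0.90, 0.70))
```

Because the normal approximation ignores the extra uncertainty in the t distribution, the first result should come out somewhat below the 166 reported from Table 2, while the second should land at or very near the tabled value of 16. The contrast between the two scenarios is the point of the example.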
The analysis and the examples presented here demonstrate how improvements in reliability of outcome measures and the use of predictive pretests (or covariates if appropriate) can lead to striking improvements in assessment systems. By reducing the sample sizes required for sound assessment, assessment systems so improved may provide feedback to program administrators in a much more timely and cost-effective manner. The systematic collection and distribution of measures that may function as useful covariates may thus be seen as an important aspect of building a school’s assessment system. Conversely, if reliability and pretest issues are ignored, assessment systems may amount to little more than bureaucratic overhead with little hope of providing useful information. Increased attention to the measurement issues described here may therefore be essential to the success of assessment programs.
Notes
The author gratefully acknowledges the insightful comments received from Kim A. Stewart, Charles S. Reichardt, Melvin M. Mark, and two anonymous reviewers on earlier drafts. The financial assistance from the Daniels College of Business that was instrumental in completing this research is also gratefully acknowledged.
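This appendix shows how the power of a $t$ test depends on sample size, the reliability of the outcome measure, and the pretest-posttest correlation. Power is defined as
(A1) Power = 1 – prob.$[\,t'(\nu, \delta) < t_{crit}\,]$
where $t'(\nu, \delta)$ follows a noncentral $t$ distribution with $\nu = (n_C + n_T - 2)$ degrees of freedom and noncentrality parameter $\delta$, $t_{crit}$ is the critical value of $t$ at the chosen significance level, and $n_C$ and $n_T$ are the sizes of the control and treatment groups.
The noncentrality parameter in Equation A1 captures the expected difference between the two groups, and is closely related to Cohen’s (1977) $d$:
(A2) est. $\delta = \dfrac{\mu_T - \mu_C}{\sigma_{diff}}$
where $\sigma_{diff}$ is the standard error of the difference between the group means,
(A3) est. $\sigma_{diff} = \sigma_Y \sqrt{\dfrac{1}{n_C} + \dfrac{1}{n_T}}$
and $\sigma_Y$ is the within-group standard deviation of the outcome measure.
This analysis is now extended to accommodate differences in the reliability of the outcome measure, $\rho_{YY}$. Reliability is the ratio of true variance, $\sigma^2_T$, to total variance, $\sigma^2_Y$ (Nunnally, 1978, p. 200), or
(A4) $\rho_{YY} = \dfrac{\sigma^2_T}{\sigma^2_Y}$
This can be re-arranged to show how the total observed variance is a function of the true variance and the reliability of the measure, or
(A5) $\sigma^2_Y = \dfrac{\sigma^2_T}{\rho_{YY}}$
From Equation A5 it is clear that if we assume that the variance in the underlying phenomenon ($\sigma^2_T$) is fixed, the total observed variance decreases as reliability increases.
The total observed variance can also be reduced by the use of a pretest. The critical property of this pretest in this analysis is the correlation between the pretest and the final outcome assessment (the posttest), or $\rho_{XY}$. The variance in the posttest, $\sigma^2_{Y \cdot X}$, after controlling for the pretest would be (Reichardt, 1979, p. 157)
(A6) $\sigma^2_{Y \cdot X} = \sigma^2_Y \,(1 - \rho^2_{XY})$
From Equation A6 it should be clear that as the correlation between the pretest and posttest ($\rho_{XY}$) increases, the unexplained variance in the posttest decreases. Combining Equations A5 and A6 gives
(A7) $\sigma^2_{Y \cdot X} = \dfrac{\sigma^2_T \,(1 - \rho^2_{XY})}{\rho_{YY}}$
Substituting this more detailed analysis of the variance in outcome measures into Equation A3 yields
(A8) est. $\sigma_{diff} = \sqrt{\dfrac{\sigma^2_T \,(1 - \rho^2_{XY})}{\rho_{YY}} \left(\dfrac{1}{n_C} + \dfrac{1}{n_T}\right)}$
The relationships among the quantities of interest can now be generally described by assuming a type of standardization of the measures. The difference in the means, $\mu_T - \mu_C$, expressed in units of the true standard deviation, $\sigma_T$, is the true effect size, $ES_T$, so that
(A9) est. $\delta = ES_T \sqrt{\dfrac{\rho_{YY}}{(1 - \rho^2_{XY})\left(\dfrac{1}{n_C} + \dfrac{1}{n_T}\right)}}$
Substituting Equation A9 into Equation A1 yields a relation describing the power of a $t$ test as a function of the true effect size, the reliability of the outcome measure, the pretest-posttest correlation, and the group sizes.
References
Bacon, D.R. (2002, October).
Bloom, B.S. (1976).
Cohen, J. (1977).
Hays, W.L. (1981).
Lipsey, M.W. (1990).
Nunnally, J.C. (1978).
Osborne, J.E. (2003). Effect sizes and the disattenuation of correlation and regression coefficients: Lessons from educational psychology.
Reichardt, C.S. (1979). The statistical analysis of data from nonequivalent group designs. In Cook, T.D., & Campbell, D.T.,
Sax, G. (1997).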

Descriptors: Reliability; Pretest; Covariate; Power; Sample Size