
A peer-reviewed electronic journal. ISSN 1531-7714

Copyright is retained by the first or sole author, who grants right of first publication to |

MacCann, Robert G. & Gordon Stanley (2004). Estimating the standard error of the judging in a modified-Angoff standards setting procedure. Practical Assessment, Research & Evaluation, 9(5). Retrieved September 21, 2014 from http://PAREonline.net/getvn.asp?v=9&n=5
Robert G. MacCann and Gordon Stanley
Standard setting procedures are now being widely used throughout the world to determine levels of student competence in educational programs. Reviews of such methods have been presented by Hambleton and Eignor (1980), Shepard (1980), Berk (1984), Berk (1986) and Jaeger (1993). One of the most commonly used methods is the Angoff (1971) procedure or its variants, referred to as modified Angoff procedures. This method has been extensively researched and compared to rival standard setting methods (e.g. Harasym, 1981; Livingston and Zieky, 1983; Cross, Impara, Frary and Jaeger, 1984; Hambleton and Plake, 1995; Impara and Plake, 1997; Plake, 1998; Giraud, Impara and Buckendahl, 2000; Buckendahl, Smith, Impara and Plake, 2001). To interpret the results of the Angoff method over time, a knowledge of the standard error of the judges’ decisions is useful. Each year the cut score in raw marks that defines a certain standard of achievement may vary, along with the associated percentages of students above that cut score. An important question when comparing cut scores across years is whether any difference observed may simply be attributable to natural variation in the judging, or whether it is sufficiently large to suggest a difference in the standard reached. Thus, an estimate of the standard error of the judging process is needed. This poses an important practical problem for educational systems, particularly those that are just starting to implement the process and have little data from which to observe the variation. One simple and convenient way to estimate this standard error is through the Central Limit Theorem (e.g. see Hoel, 1961), which estimates the variance of mean cut scores as a function of the variance of the cut scores within a judging group. However, this method requires independence of the judges’ decisions, which may not occur at all stages of a modified Angoff procedure (see the comments of Linn, 2003). 
For example, in the Australian state of New South Wales, the School Certificate (SC), on which the data in this paper are based, is a mandatory examination-based Year 10 certificate which uses a modified Angoff method. In the procedure used, the independence of the judges’ decisions only occurs at Stage 1. Given that in Stages 2 and 3 extra information is considered, it is possible that an estimate based on Stage 1 would over-estimate the error involved. That is, the Central Limit Theorem (CLT) procedure may give a conservative estimate of the standard error. The aim of this paper is to evaluate how well the CLT procedure performs when applied to the modified Angoff method employed in the New South Wales (NSW) programs. This evaluation is obtained by finding a second empirical estimate of the standard error which, it is argued, is approximately correct for the special cases in which it is employed in this paper. If the CLT method gives reasonable agreement with this second method for the special cases here, then it may prove to be a suitable method in general as an approximate estimate of the standard error involved.
The New South Wales educational system holds statewide tests at Year 10 for the School Certificate award. These tests are set and administered by a government body, the Board of Studies, which also sets the curricula for these programs. The Year 10 School Certificate tests to be considered here in English and Mathematics are compulsory and are attempted by approximately 80,000 students. The format of the School Certificate tests is a mixture of multiple choice and extended response items. For English, the test comprises 45 multiple choice items (worth 45 marks), some short answer items where a few lines of response are required (worth 15 marks) and two extended response writing tasks (worth 20 marks each). For Mathematics, the test comprises 25 short items, each requiring the student to supply the answer (worth 25 marks), 55 multiple choice items (worth 55 marks), and four 5-mark items, each requiring a number of steps to be completed for full marks (worth 20 marks). While an attempt is made in the test construction process to ensure that test difficulty does not vary too greatly from year to year, this ideal is difficult to attain in practice and some variation in difficulty inevitably occurs. The internal consistency reliabilities of the scores, as determined by Cronbach’s Alpha, are high and uniform as shown in Table 1.
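For readers unfamiliar with the reliability index mentioned above, the following is a minimal sketch of how Cronbach's Alpha can be computed from a students-by-items score matrix. The function name and the tiny data set are hypothetical and are not taken from the School Certificate data; the sketch uses the usual formula based on the item variances and the variance of the total score.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's Alpha for a (n_students, n_items) matrix of item scores.

    Hypothetical helper for illustration only.
    """
    scores = np.asarray(item_scores, dtype=float)
    n_items = scores.shape[1]
    # Sample variance of each item and of the summed (total) score.
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1.0 - item_variances.sum() / total_variance)

# Made-up scores for five students on three items.
example_scores = [[1, 2, 2], [0, 1, 1], [1, 3, 2], [0, 0, 1], [1, 2, 3]]
print(round(cronbach_alpha(example_scores), 3))
```

Values close to 1 indicate that the items rank students in a highly consistent way.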
The standards setting procedure for the School Certificate uses a modified version of the Angoff procedure. All items in the School Certificate English and Mathematics tests were rated by the judges. For each test, the raw mark scale was obtained by simply summing the scores on each item to get a total score. The Angoff procedure gave cut scores on these total score scales. Rather than having just one cut score to indicate minimal competence, five cut scores are produced to separate the students into six performance bands, Band 6 being the highest level and Band 1 the lowest. These performance bands are defined by written descriptors which indicate the academic characteristics of a typical student in the band for that subject and which help the judges gain a mental image of the type of student they are considering when judging how well such students would score on each item. In addition to these descriptions, actual examples of students’ responses at each band level from previous years have been published and promulgated. These exemplars, known as Standards Packages, have been placed on CD-ROMs and distributed to all schools in NSW. These Standards Packages are useful to teachers in formulating their internal school assessments and in training the teacher judges for the standards-setting operation. For each course, five cut scores that separate the six bands are finally produced from the Angoff standards setting. Each cut score is then mapped to an arbitrary reporting scale that is the same for all School Certificate courses. The Band 6 cut score is mapped to 90, the Band 5 cut score is mapped to 80, Band 4 to 70, Band 3 to 60 and Band 2 to 50. Band 1 has mapped marks below 50. Two other mapping points are used: the maximum possible raw mark score is mapped to 100, and zero is mapped to zero. All scores falling between pairs of anchor points are mapped by linear interpolation. Typically, six to nine experienced teacher judges are used in a subject.
These are assisted by Field Officers, staff hired to observe the operation, to aid the judges in interpreting the output and to ensure that the judges have all the materials they need for the task. After being appointed, the judges and Field Officers undertake an intensive training exercise involving a practical simulation of the standards-setting procedure. During this training they gain additional familiarity with the six levels of performance, as defined by the exemplars in the Standards Packages. The standards-setting procedure takes place during the supervised marking. In Angoff’s original note on his method (Angoff, 1971, pp. 514-5), he discusses judgements about the likely performance of a single hypothetical marginal student. Then, as a footnote, he suggests a slight variation in which judges are asked to think about a group of such students and to estimate the proportion of such students who would answer each item correctly. The latter method is used in the NSW procedures. In Stage 1, each judge works independently. For a given band level, the total cut score for the whole paper is obtained by summing the item cut scores. In Stage 2, the judges are given the results of the Stage 1 judging and are able to freely discuss these. They are also given statistical feedback on how student groups, who gained particular marks on an item, performed on the total paper. They are then able to modify their cut scores if desired. In Stage 3, they are given the results of the Stage 2 judging and given the complete scripts of students who obtained a mark close to the total cut scores. After looking at the quality of this work, the judges are given a final opportunity to modify their cut scores again. When the cut scores are finalised, they are averaged across the judges to get a single cut score for each band. These cut score means are then scaled to the arbitrary reporting scale as described above (a Band 6 cut score scales to 90, a Band 5 cut score scales to 80 and so on).
The resulting final marks are called aligned marks, and are the ones reported to students. The physical arrangements involve the judges working around a large table which has room to accommodate the various recording sheets, Band descriptors, examination papers, marking scales, scripts and other documents involved in the process. In Stage 1, where the judges give independent judgements, the judges are physically close enough for it to be possible for a judge to observe the cut score decisions of a neighbouring judge but the presence of the Field Officer, who sits at the table and observes the process, would inhibit any collaboration at Stage 1. In any case, the judges know that they are free to collaborate at Stages 2 and 3, so there is little incentive for them to collaborate at Stage 1. At all stages of the process, the judges are prevented from obtaining any information about the associated percentages in the bands as a result of their decisions. They only find out these percentages at the same time as the general public, when the results are published on the Board of Studies website.
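The arithmetic of the procedure described above (summing item cut scores, averaging across judges, and mapping raw marks to the reporting scale by linear interpolation between anchor points) can be sketched as follows. This is an illustration under stated assumptions, not the Board of Studies' implementation: the function names, the judges' ratings and the raw cut scores are all hypothetical.

```python
import numpy as np

def total_cut_score(item_ratings):
    """Mean total cut score for one band.

    item_ratings: (n_judges, n_items) array holding each judge's item cut
    score for a marginal student at that band (hypothetical input layout).
    """
    ratings = np.asarray(item_ratings, dtype=float)
    judge_totals = ratings.sum(axis=1)   # sum item cut scores per judge
    return judge_totals.mean()           # average across judges

def align_marks(raw_marks, band_cuts, max_raw):
    """Map raw marks to the reporting scale.

    band_cuts: raw cut scores for Bands 2, 3, 4, 5 and 6, in ascending order.
    Anchor points: 0 -> 0, the five cuts -> 50, 60, 70, 80, 90, max_raw -> 100.
    """
    raw_anchors = np.concatenate(([0.0], np.asarray(band_cuts, dtype=float), [max_raw]))
    aligned_anchors = np.array([0.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0])
    # Linear interpolation between adjacent anchor points.
    return np.interp(raw_marks, raw_anchors, aligned_anchors)

# Made-up example: two judges rating three items for one band.
print(total_cut_score([[0.6, 0.8, 3.0], [0.5, 0.7, 2.0]]))

# Made-up raw cut scores on a 100-mark paper.
hypothetical_cuts = [30, 45, 60, 72, 85]
print(align_marks([30, 37.5, 92.5], hypothetical_cuts, max_raw=100))
```

A raw mark halfway between two adjacent cut scores lands halfway between the corresponding reporting-scale values, which is all that linear interpolation between anchor points means here.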
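One way of estimating the standard error is to use the Central Limit Theorem. For a given cut score, the variance in the independent judgements is noted and divided by the number of judges. The result gives an estimate of the variance of the mean cut score, $\sigma_{\bar{X}}^2 = \sigma^2 / n$, where $\sigma^2$ is the variance of the individual judges’ cut scores and $n$ is the number of judges. As $\sigma^2$ is unknown in practice, it is estimated by the sample variance $s^2$ of the judges’ cut scores, giving the standard error estimate $s / \sqrt{n}$. In other examination systems, the median of the judges’ decisions may be used instead of the mean. The median is more subject to sampling fluctuations than the mean and is therefore less efficient. Its standard error is about 25% larger than the standard error of the mean and is estimated by $1.253 \, s / \sqrt{n}$.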
This formula may be used to give an approximate conversion. Thus the comparative results obtained in this paper can be applied to other systems which use the median.
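As a minimal sketch of the Central Limit Theorem estimate described above, the following computes the standard error of the mean from one set of independent Stage 1 cut scores, together with the approximate conversion to a standard error of the median. The function names and the judges' cut scores are hypothetical, and the factor sqrt(pi/2), roughly 1.253, assumes that the judges' decisions are approximately normally distributed.

```python
import math
import statistics

def se_mean(cut_scores):
    """CLT estimate of the standard error of the judges' mean cut score."""
    n = len(cut_scores)
    s = statistics.stdev(cut_scores)   # sample SD with n - 1 denominator
    return s / math.sqrt(n)

def se_median(cut_scores):
    """Approximate standard error of the median (normality assumed)."""
    return math.sqrt(math.pi / 2.0) * se_mean(cut_scores)

# Made-up Stage 1 cut scores from seven judges for one band.
stage1_cuts = [62, 65, 58, 70, 66, 61, 64]
print(se_mean(stage1_cuts), se_median(stage1_cuts))
```

Because the estimate relies on independent decisions, it would only be meaningful for Stage 1 cut scores, not for those revised after discussion.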
The method described here, referred to below as the equipercentile (EQP) method, attempts to eliminate all sources of variation except for that attributed to the judging. To remove variation associated with changes in candidature selection and with changes in teaching/learning practices, extremely large and stable candidature courses were selected: the external tests in English and Mathematics at the School Certificate. About 80,000 students take each of these courses. Given that the Year 10 School Certificate population has a consistent retention of approximately 97% from Year 7 (the first year of high school), it is extremely unlikely that changes in these aligned distributions in recent years could be attributable to changes in selection. Secondly, while the School Certificate method of [...]
Therefore we are arguing that most of the variation in the aligned distributions would come from two sources. The first is variation in the intrinsic difficulty of the examination papers and their associated marking patterns, as reflected in the raw mark distributions. The second is variation in the judging, which is the variation we are interested in. To remove the first type of variation, a base year was chosen (1998) and all raw mark distributions in a course were converted to have the same mean, standard deviation and distribution shape as the 1998 raw mark distribution. This was effected by applying an equipercentile transformation to each of the 1999-2002 raw distributions to change the shape to that of the 1998 distribution. For a definition of equipercentile equating, see Angoff (1971, p. 563). In practice, anchor points are obtained for score pairs corresponding to the same percentile rank and scores lying between these anchor points are converted by interpolation (see, for example, the curved line of relationship equating two tests on p. 573 of Angoff, 1971). The equipercentile transformations were then applied to the raw mark total paper cut score means to give a set of cut score means that is comparable over time. These cut score means are to be interpreted as being on the same scale as the 1998 raw marks for each course. The standard deviation of these transformed cut score means across the years from 1998 to 2002 is then the estimate of the standard error of the judges’ means.
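The steps just described can be sketched in code. This is a simplified illustration rather than the authors' actual computation: the function names, the choice of 101 percentile anchor points and the toy data are assumptions. Anchor points are formed from scores with the same percentile rank in the two years, scores between anchors are converted by linear interpolation, and the standard deviation of the converted cut score means is taken across years.

```python
import numpy as np

def equipercentile_to_base(scores, year_marks, base_marks, n_anchors=101):
    """Convert scores on one year's raw scale to the base year's raw scale."""
    probs = np.linspace(0.0, 1.0, n_anchors)
    year_anchors = np.quantile(np.asarray(year_marks, dtype=float), probs)
    base_anchors = np.quantile(np.asarray(base_marks, dtype=float), probs)
    # Drop repeated anchors so the interpolation grid is strictly increasing.
    year_anchors, first = np.unique(year_anchors, return_index=True)
    base_anchors = base_anchors[first]
    return np.interp(scores, year_anchors, base_anchors)

def judging_standard_error(cut_means_by_year, marks_by_year, base_year):
    """SD (n - 1 denominator) of cut score means after conversion to the base scale."""
    base = marks_by_year[base_year]
    comparable = [
        float(equipercentile_to_base(cut, marks_by_year[year], base))
        for year, cut in cut_means_by_year.items()
    ]
    return np.std(comparable, ddof=1)

# Toy data: the 1999 paper is uniformly 5 marks easier than the 1998 paper.
toy_base = np.arange(0, 101)
toy_marks = {1998: toy_base, 1999: toy_base + 5}
toy_cuts = {1998: 50.0, 1999: 57.0}
print(judging_standard_error(toy_cuts, toy_marks, base_year=1998))
```

In the toy data the 1999 cut score of 57 converts to 52 on the 1998 scale, so only 2 marks of the 7-mark raw difference are left to be attributed to the judging.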
Table 2 below shows the comparable cut scores in English for each band level for the EQP method and gives their unbiased standard deviation (SD), which is the standard error estimate. When expressed as a fraction of the standard deviation of the total test, this is the Standardised SD, which is useful for making rough comparisons between the English and Mathematics results. Table 3 shows the comparable statistics for Mathematics. The standard errors have certain similarities in English and Mathematics. Firstly, they are small for the very top band, indicating that the judges have similar views as to what constitutes a Band 6 marginal candidate. Secondly, the standard errors are relatively large at 3.56 and 4.55 respectively for a marginal student at Band 3. This implies that the judges have a much less clear view of the type of student at the Band 3 cutoff than they have of the very best students. The patterns, however, differ in their clarity in defining a bare “pass” candidate at the bottom of Band 2. For English, this gave the widest standard error of all the bands, whereas for Mathematics it gave the second smallest standard error. The Mathematics judging groups apparently have similar views as to what constitutes very good performances and what constitutes barely adequate performances. It is in the middle bands that they disagree the most. The English groups, on the other hand, were fairly consistent for Bands 4 – 6 but disagreed more for the marginal students in Bands 2 and 3. Although the raw standard errors were lower for English than for Mathematics, the test standard deviation was also lower for English. When the standard errors are expressed as a proportion of this standard deviation, it can be seen that English has lower standard errors for Bands 4, 5 and 6 and higher standard errors for Bands 2 and 3.
The second method uses the Central Limit Theorem to estimate the variation between cut score means as a function of the variation between the individual cut scores of the judges. As this variation is taken from the Stage 1 judging, it is possible that it will give larger standard error estimates than the EQP method. For each band level, the standard error estimates were averaged over the five years and are compared with the EQP estimates in Table 4.
For English, Table 4 shows that the CLT method adequately approximates the EQP method for Bands 4 – 6, for which it gives a conservative estimate. But for the lowest two bands, EQP gives more variation in the standard error than would be predicted by the individual judges’ variation through the CLT. In Mathematics, the CLT method gives good estimates of the EQP method for the highest and lowest bands, but under-estimates the latter for the middle three bands.
It is not easy to account for the differential agreement between the methods across different bands. In the absence of data, one may have assumed that the effect of employing Stages 2 and 3 in the standards-setting process would have improved the accuracy over Stage 1, so that the CLT method (based solely on Stage 1) would be expected to over-estimate the standard error. As expected, by Stage 3 of the modified Angoff procedure the effect of the judges’ collaboration was to make their judgements more alike and hence the standard deviation of their cut scores was substantially reduced in both English and Mathematics. This information, however, cannot be validly used to construct standard error estimates, as the lack of independence of the decisions makes any estimate based on them virtually meaningless. For example, in our Year 12 award, the Higher School Certificate, we have observed in some courses that the collaborating judges at Stage 3 have all agreed on the same values for their cut scores, giving zero variance, and hence the (invalid) use of the CLT formula would give a standard error estimate of zero. The only way to validly take account of the effects of Stages 2 and 3 would be to repeat the entire process from Stage 1 to Stage 3 with additional independent teams of judges. Each team would collaborate within itself at Stages 2 and 3 but this would be entirely independent of the other teams. Then the observed variation between the team means would give an estimate of the standard error. This could be experimentally established as a “one off” trial but the time and expense involved would preclude it from being a regular part of the standard-setting procedures. It is the impracticality of conducting this type of experiment on a regular basis that provides the motivation to investigate the effectiveness of the CLT estimate based on Stage 1.
This analysis has removed the effects of differential distribution shapes caused by varying examination paper difficulties by converting all distributions to the 1998 raw mark scale. A further question of interest is whether differential measurement error in the test scores could affect the standard error estimates. In theory this could occur if the reliabilities of the scores varied markedly from year to year.
The reliabilities affect the intercorrelations between items and this affects the amount of regression that would occur relative to a total cut score based on the summing of cut scores across individual items. For example, imagine a test giving unreliable scores where the intercorrelations amongst the items are very low. For simplicity, consider a test where the items are extended response and parallel, and where the judges, being perfectly consistent, identify cut scores on each item that mark off the top 5% of students as being in the top ability band. Then the summing of these cut scores may give a total cut score that marks off only 1% of students achieving the top band, due to the low intercorrelations between items. In contrast, consider a similar situation where all factors are identical except that the item intercorrelations are high. Again the judges choose item cut scores that mark off the top 5% of students, but this time the total cut score identifies 4% of students in the top band, due to the higher item intercorrelations. Any equipercentile mapping that converts the distribution of the unreliable test to the same distribution as the reliable test (or vice versa) will not overcome this disparity in percentages. In fact, if the total cut scores on the two tests are exactly the same mark, then the equipercentile mapping, in equating all moments of the two distributions, will then transform these marks to be different. This effect obviously would be greatest for cut scores near the extremities of the distributions. Thus the total cut scores (and hence the standard error estimates) could vary solely as a function of differing reliabilities from year to year. The empirical investigation of this effect is beyond the scope of this paper.
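The regression effect in this thought experiment can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not in the paper: item scores are continuous and normally distributed, every pair of items has the same correlation `rho`, and the function name, the number of items and the number of students are hypothetical. It is not intended to reproduce the illustrative 1% and 4% figures, only the direction of the effect.

```python
import numpy as np

def share_above_summed_cut(rho, n_items=4, n_students=200_000,
                           top_share=0.05, seed=0):
    """Share of students at or above a total cut built by summing item cuts.

    Each item cut marks off the top `top_share` of students on that item.
    """
    rng = np.random.default_rng(seed)
    # Equicorrelated parallel items: a shared factor plus item-specific noise.
    common = rng.standard_normal((n_students, 1))
    specific = rng.standard_normal((n_students, n_items))
    items = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * specific
    item_cuts = np.quantile(items, 1.0 - top_share, axis=0)
    total_cut = item_cuts.sum()
    totals = items.sum(axis=1)
    return float(np.mean(totals >= total_cut))

# Low versus high item intercorrelations (hypothetical values).
print(share_above_summed_cut(rho=0.1))
print(share_above_summed_cut(rho=0.9))
```

With low intercorrelations far fewer than 5% of students reach the summed cut score, whereas with high intercorrelations the share moves back towards 5%.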
However, in practice the high level of reliabilities for SC English and Mathematics and their uniformity across time (as shown in Table 1) imply that the differential effect from year to year would be minimal for these data. Given that the CLT method does not quite account for the variation observed in some of the bands, one should be very cautious in making inferences about possible changes in achievement. One course of action would be to use a stringent alpha level in constructing confidence intervals around cut scores, using a probability of 0.01, say, instead of 0.05. Secondly, it would be advisable to observe the variation over a few years before drawing any inferences about change, to see if any trends are emerging. For example, consider the hypothetical set of standard error bands shown in Figure 1 below.
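The effect of the more stringent alpha level can be sketched as follows. The cut scores and standard errors are made up, the function names are hypothetical, and the intervals use a normal approximation; with only six to nine judges a t-based interval might be preferred, which would widen the bands further.

```python
from statistics import NormalDist

def confidence_interval(cut_score, standard_error, alpha=0.01):
    """Two-sided (1 - alpha) interval around a cut score (normal approximation)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * standard_error
    return (cut_score - half_width, cut_score + half_width)

def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Made-up cut scores for two consecutive years, each with standard error 2.
for alpha in (0.05, 0.01):
    year_a = confidence_interval(70.0, 2.0, alpha)
    year_b = confidence_interval(79.0, 2.0, alpha)
    print(alpha, year_a, year_b, intervals_overlap(year_a, year_b))
```

In this made-up case the two bands are separate at the 0.05 level but overlap at the 0.01 level, which is the kind of extra caution the stricter level is meant to provide.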
In Figure 1, the 2002 error band does not overlap with the 2001 band, but it would be unwise to assume that a significant change in the standard of achievement had been attained. In 2003 the aligned mark drops again and then rises somewhat in 2004. In this case, the error bands (which are certainly fallible estimates themselves) have not been sufficient to account for the observed variation. This type of result should indicate the need to wait for a few years of consistency in the pattern before assuming that there had been a real change in demonstrated achievement. The above method of
reporting may help in enabling changes in the standard of achievement to be
identified. However, the interpretation of why these changes have occurred may
be more difficult. In courses which are not compulsory, a change in
demonstrated achievement may simply be caused by
Angoff, W.H. (1971). Scales, norms and equivalent scores. In R.L. Thorndike (Ed.),
Berk, R.A. (1984).
Berk, R.A. (1986). A consumer’s guide to setting performance standards on criterion-referenced tests.
Buckendahl, C.W., Smith, R.W., Impara, J.C. and Plake, B.S. (2001).
Cross, L.H., Impara, J.C., Frary, R.B. and Jaeger, R.M. (1984). A comparison of three methods for establishing minimum standards on the National Teacher Examinations.
Giraud, G., Impara, J.C. and Buckendahl, C. (2000).
Hambleton, R.K. and Eignor, D.R. (1980). Competency test development, validation and standard setting. In R.M. Jaeger and C.K. Tittle (Eds.),
Hambleton, R.K. and Plake, B.S. (1995). Using an extended Angoff procedure to set standards on complex performance assessments.
Harasym, P.H. (1981). A comparison of the Nedelsky and modified Angoff standard setting procedure on evaluation outcomes.
Hoel, P.G. (1961).
Impara, J.C. and Plake, B.S. (1997). Standard setting: An alternative approach.
Jaeger, R.M. (1993). Certification of student competence. In R.L. Linn (Ed.),
Linn, R.L. (2003). Performance standards: utility for different uses of assessments.
Livingston, S.A. and Zieky, M.J. (1983).
Plake, B.S. (1998). Setting performance standards for professional licensure and certification.
Shepard, L.A. (1980). Technical issues in minimum competency testing. In D.C. Berliner (Ed.),

Descriptors: Standards Setting; Modified Angoff Procedures; Standard Error; Central Limit Theorem; Equipercentile Equating