Permission is granted to distribute this article for nonprofit, educational purposes if it is copied in its entirety and the journal is credited. Please notify the editor if an article is to be used in a newsletter.
Russell, Michael & Wei Tao (2004). The influence of computer-print on rater scores. Practical Assessment, Research & Evaluation, 9(10). Retrieved July 31, 2010 from http://PAREonline.net/getvn.asp?v=9&n=10 . This paper has been viewed 14,162 times since 5/11/2004.
The Influence of Computer-Print on Rater Scores

Michael Russell and Wei Tao

This study partially replicates and
extends the work of Powers, Fowles, Farnum, & Ramsey (1994) and Russell & Tao (2004). As the
Educational Testing Service prepared to offer the Praxis test on computer and
on paper, Powers and his colleagues conducted a small experiment to examine the influence
computer-printed text had on raters’ scores for a composition item. Although
Powers et al. anticipated that responses presented in hand-written form would
receive lower scores, the opposite occurred. Hypothesizing that the perceived
length of passages was one factor that contributed to raters’ lower scores for
computer-printed passages, Powers et al. (1994) conducted a follow-up study in
which all computer-printed responses were presented double-spaced. While this
approach did reduce the size of the effect, computer-printed passages still
received lower scores.

More recently, Russell & Tao
(2004) conducted a similar study in which responses produced by students in
grades four, eight and ten were presented in three forms: handwritten,
single-spaced 12 point computer text, and double-spaced 14 point computer
text. Like Powers et al. (1994), Russell and Tao report that responses
presented in hand-written form received significantly higher scores. However,
unlike Powers et al., Russell and Tao also found that responses presented in
double-spaced form (thus making the passages appear longer) received lower
scores than responses presented in single-spaced form.

Through interviews with the raters,
Russell and Tao (2004) identified three possible reasons why raters tended to
award lower scores to computer-printed responses: 1) Typos, uncapitalized
letters and punctuation errors are easier to overlook when essays are
handwritten than when they are typed; 2) Because readers associate typed text
with final versions, they are more critical of mechanical errors and interpret
these errors as a lack of careful proofreading, whereas they tend to interpret
the same errors in handwriting as something the author would correct if he or
she had more time; and 3) Some readers felt more connected to the writer as
a person as a result of viewing his or her handwriting and thus were more likely
to give the writer the benefit of the doubt.

The study reported here extends the work of Powers et al. (1994)
and Russell and Tao (2004) by conducting a series of experiments that (A)
explore possible causes of the presentation effect and (B) attempt to reduce or
eliminate the presentation effect by formatting text with a scripted font and
through training procedures that familiarize raters with the presentation
effect.

Background

Research on testing via computer goes back
several decades and suggests that for multiple-choice tests, administration via
computer yields about the same results, at least on average, as administering
tests via paper-and-pencil (Bunderson, Inouye, & Olsen, 1989; Mead &
Drasgow, 1993). However, more recent research shows that for young people who
have gone to school with computers, open-ended (that is, not multiple choice)
questions administered via paper-and-pencil yield severe underestimates of
some students’ skills as compared with the same questions administered via computer
(Russell, 1999; Russell & Haney, 1997). In both studies, the effect sizes
(d) for students accustomed to working with a computer ranged from .57 to 1.25,
with students who were accustomed to writing on computers performing
better when they were able to produce responses to an open-ended item using a
computer. Effect sizes of this magnitude imply that
the score for the average student in the experimental group tested on computer
exceeds that of 72 to 89 percent of the students in the control group tested
via paper and pencil.

A more recent study conducted during
the spring of 2000 examined the mode of administration effect in grades four,
eight and ten, and for special education students. Focusing on the extended
composition item used as part of the Massachusetts Comprehensive Assessment
System (MCAS), Russell and Plati (2000) report substantial effects in all grade
levels. Moreover, for eighth grade students receiving special education
services for language arts, the effect size was about 1.5 times larger than for
non-special education students. When combined with the effect found for the
MCAS short-answer items, the mode of administration effect could result in
underestimating student performance by four to eight points on an eighty-point
scale.

In response to these findings, Russell
and Haney (2000) argue that two approaches to improving the quality of
education in U.S. schools, namely standards-based testing and educational
technology, currently work against each other. This conflict results from the
inability of paper-and-pencil tests to provide valid measures of the writing
skills of students accustomed to writing on computers. Anticipating this
conflict, Alberta Learning (2000) began offering students the option of
performing the province’s graduation exams on paper or on computer in 1993.
More recently, Russell and Plati (2000) have advocated that state testing programs
that employ extended open-ended items also allow students the option of
composing responses on paper or on computer. In response to these findings,
ETS recently conducted a study that involved administering the National
Assessment of Educational Progress Writing test on paper and on computer. As
more testing programs offer students the option of producing essay responses on
paper or on computer, the presentation effect reported by Powers et al. (1994)
and Russell and Tao (2004) raises a serious concern about the equivalence of
scores. The experiments presented below explore the factors that contribute to
the presentation effect and explore two approaches that testing programs might
take to reduce the effect.

Methods

As part of a larger study that focused
on the mode of administration effect on the MCAS Composition items, Russell and
Plati (2000) transcribed verbatim (including all
spelling, grammar, and punctuation errors) approximately 240
hand-written responses to computer text. These responses were generated on
paper by students in grades four, eight and ten and were in response to a
separate extended composition item administered within each grade level. In a
prior study (Russell & Tao, 2004), a subset of 60 responses in grade eight
was used to examine the effect of print versus handwriting on raters’ scores.
The experiments presented in this paper use a different sample of 60 responses
from grade eight and a different set of twelve raters to explore methods of
counteracting some of the factors believed to contribute to the presentation
effect described above. In this study, all 60 handwritten responses were
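transcribed into computer text and formatted in 3 different ways: single-spaced 12 point Times New Roman font, single-spaced 14 point Lucida Handwriting font, and single-spaced 12 point Times New Roman font with all spelling corrected.

To ensure the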
precision of the transcriptions, the following procedures were adopted. When
transcribing responses from their original handwritten form to computer text,
responses were first transcribed verbatim into the computer. The transcriber
then printed out the computer version and compared it word by word with the
original, making corrections as needed. A second person then compared these
corrected transcriptions with the originals and made additional changes as
needed. Following this process, a sample of 10 responses was checked a third
time. Out of 3,524 words of text, only three errors were found and in two
cases a word that had been misspelled in the original was spelled correctly in
the transcribed text. Thus, while slight differences may exist between the
original handwritten and the transcribed versions, these differences are likely
to have a very minor effect on raters’ scores.

All 3 computer-text formats,
together with the original hand-written form, were scored by a new set of raters
who had not seen these responses before and who were unaware of the
presentation effect. In total, 12 raters were employed for this study. Eleven
of the twelve raters were middle or high school teachers. Of these teachers,
nine taught English, one taught social studies, and one taught science. The
final rater was an advanced graduate student in education. Note that criteria
used to select raters for the State testing programs were employed when
recruiting raters for this study. Following MCAS scoring procedures, all
responses presented in a given format were double-scored and scores awarded by
each rater were aggregated into a single score.

For all of the
items, the scoring criteria developed for MCAS were used (Massachusetts
Department of Education, 2000a). The MCAS scoring guidelines for the
composition items focused on two areas of writing, namely Topic/Idea
Development and Standard English Conventions. The scale for Topic Development
ranged from 1 to 6 and the scale for English Conventions ranged from 1 to 4,
with one representing the lowest level of performance for both scales. Table 1
presents the category descriptions for each point on the two scales.

| Score | Topic Development | English Standards |
| 1 | Little topic/idea development, organization, and/or details | Errors seriously interfere with communication AND |
| 2 | Limited or weak topic/idea development, organization, and/or details | Errors interfere somewhat with communication and/or |
| 3 | Rudimentary topic/idea development and/or organization | Errors do not interfere with communication and/or |
| 4 | Moderate topic/idea development and organization | Control of sentence structure, grammar and usage, and mechanics (length and complexity of essay provide opportunity for students to show control of standard English conventions) |
| 5 | Full topic/idea development | |
| 6 | Rich topic/idea development | |

To be clear, all 12 raters received the same 3 hours of score training that was based on training software provided by NCS Pearson (2000) for the Massachusetts Department of Education. The whole study can be divided into two experiments. The first experiment involved 8 raters, who were not informed about the purpose of the study or the existence of the presentation effect. The second experiment involved the remaining four raters, who were informed about the existence of the presentation effect.

Experiment 1: Altering Appearance of Responses

In the first experiment, 8 raters (i.e., Raters 1-8) scored the same 60 responses, which were presented in four forms: Handwritten, Single-spaced 12 point Times New Roman font, Single-spaced 14 point Lucida Handwriting font, and Single-spaced 12 point Times New Roman font with all spelling corrected. A spiral design was employed so that all 8 raters scored responses in all four formats but scored each response only once. The essay distribution among the 8 raters is exhibited in Table 2. As shown, the 4 formats of the same essays were scored by 4 different pairs of raters. Note that all pairs of raters scored essays presented in each format, but none of the raters scored the same essay twice.
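The spiral design described above can be illustrated with a simple rotation. The sketch below is only one arrangement that satisfies the stated constraints (each format of each essay is double-scored, every pair of raters scores every format, and no rater reads the same essay twice). It assumes the 60 essays are split into four sets of 15 and that the 8 raters work in four fixed pairs; the actual distribution is the one reported in Table 2, and the names `FORMATS`, `PAIRS`, and `spiral_assignment` are hypothetical.

```python
from collections import defaultdict

# Hypothetical labels for the four presentation formats in Experiment 1.
FORMATS = ["handwritten", "times_12pt", "script_14pt", "times_12pt_spell_corrected"]
# Assumed fixed rater pairs (the article only says that pairs of raters were used).
PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8)]
N_ESSAYS = 60


def spiral_assignment(n_essays=N_ESSAYS, formats=FORMATS, pairs=PAIRS):
    """Map (essay_id, format) -> rater pair using a rotating, Latin-square style layout."""
    n_sets = len(pairs)
    set_size = n_essays // n_sets  # 15 essays per set under the stated assumptions
    assignment = {}
    for essay in range(n_essays):
        essay_set = essay // set_size
        for f_idx, fmt in enumerate(formats):
            # Rotate pairs so each pair meets each essay set in exactly one format.
            assignment[(essay, fmt)] = pairs[(essay_set + f_idx) % n_sets]
    return assignment


assignment = spiral_assignment()

# Collect what each rater ends up reading, to check the design constraints.
essays_by_rater = defaultdict(list)
formats_by_rater = defaultdict(set)
for (essay, fmt), pair in assignment.items():
    for rater in pair:
        essays_by_rater[rater].append(essay)
        formats_by_rater[rater].add(fmt)

for rater in sorted(essays_by_rater):
    n_read = len(essays_by_rater[rater])
    n_unique = len(set(essays_by_rater[rater]))
    print(rater, n_read, n_unique, len(formats_by_rater[rater]))
```

Under these assumptions each rater reads 15 essays in each format, which is consistent with the rater comment about "15 papers of script type" quoted later in the article.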
Based on findings from previous studies, we hypothesized that 1) if the computer-text essays looked more like handwritten responses, the presentation effect would be reduced or eliminated and 2) if the “eye-catching” errors in computer-text essays were corrected, the presentation effect would be reduced or eliminated. To test our hypotheses, two sets of data analyses were conducted:

1. Altering Appearance – Several studies have shown that essays presented with neat handwriting will typically receive higher scores than will essays written with poor penmanship (Chase, 1986; Marshall & Powers, 1969; Markham, 1976; Bull & Stevens, 1979). Similarly, Powers et al. (1994) and Russell and Plati (2000) report that altering the appearance of computer-printed text can affect the scores raters award. While the reader might expect that essays presented with neat computer-print text would in turn receive higher scores than handwritten essays, previous studies suggest that readers have higher standards for computer-printed text and thus tend to award them lower scores (Powers et al., 1994). One approach to reducing the presentation effect may be to format computer-printed text so that it appears less formal. Comparisons on 3 out of the 4 formats were analyzed: Handwritten, Single-spaced 12 point Times New Roman font, and Single-spaced 14 point Lucida Handwriting font.
2. Spelling Errors – Powers et al. (1994) and Russell and Tao (2004) both speculate that computer-printed text makes mechanical errors such as spelling and punctuation more visible and adversely affects rater scores. To examine the influence of spelling errors on rater scores, scores on the following formats were analyzed: Handwritten, Single-spaced 12 point Times New Roman font, and Single-spaced 12 point Times New Roman font with all spelling corrected.
Note that this analysis was intended to provide further insight into the role visibility of errors plays in rater scores. Because this analysis changed the actual text produced by students rather than simply altering the appearance of that text, it clearly is not an appropriate method for reducing the presentation effect.

Experiment 2: Altering the Training of Raters

The second experiment focused on training away the presentation effect. This experiment builds on the work of Powers et al. (1994), who provided some evidence that making raters aware of the presentation effect can reduce the size of the presentation effect. For the experiment here, four raters (i.e., Raters 9-12) participated in the initial training session and then received additional training that focused on the issues believed to influence raters’ scores when reading computer-printed text. This additional training included: describing the presentation effect; discussing its possible causes; providing samples of responses that appear very different when presented in handwritten and computer-printed form; suggesting that raters maintain a mental count of the mechanical errors they observe while carefully reading a response; and encouraging raters to think carefully about the factors that influence the scores they award.
After the four raters received this supplemental training, they scored the same responses formatted in two ways: handwritten and computer text.
Again, a spiral design was employed so that all four raters scored responses in both formats but scored each response only once. The essay distribution among the 4 raters is exhibited in Table 3. Scores awarded by the set of 4 raters who received supplementary training were then compared to the scores awarded to the same essays by the 8 raters who did not receive supplementary training.
Inter-rater reliability for responses presented in each format was generally adequate, but not strong. Table 4 displays the correlation coefficients for responses scored in each format (no “outliers” were removed). Table 5 displays the percent agreement and disagreement for each format. It is interesting to note that the lowest reliability occurred with the computer-text responses in the “Convention” category scored by raters who were provided with additional training that focused on the presentation effect. Also note that the Massachusetts Department of Education (1999; 2000b) reports inter-rater reliability as percent agreement within one point and typically reports agreement to be above 90%. Although the correlation coefficients reported in Table 4 are all less than .8 and many are below .7, the percent agreement within one point was 100% for the Convention category and at least 96% for Topic Development, as shown in Table 5.
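The two reliability indices discussed above (the correlation between two raters' scores, and percent agreement either exactly or within one point) can be computed as in the following sketch. The score vectors are made up purely for illustration and are not data from this study; the function names `pearson_r` and `agreement` are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two raters' scores for the same responses."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def agreement(x, y):
    """Return (share of exact agreement, share of agreement within one point)."""
    n = len(x)
    exact = sum(a == b for a, b in zip(x, y)) / n
    within_one = sum(abs(a - b) <= 1 for a, b in zip(x, y)) / n
    return exact, within_one


# Illustrative scores on a 1-4 scale (NOT study data).
rater_a = [1, 2, 3, 4]
rater_b = [2, 2, 3, 3]

r = pearson_r(rater_a, rater_b)
exact, within_one = agreement(rater_a, rater_b)
print(round(r, 3), exact, within_one)
```

The example also shows why the two indices can diverge: on a short scale, raters can agree within one point on every response while the correlation between their scores remains well below 1.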
To examine the effect altering the appearance of the response had on rater scores, two types of changes were examined. First, responses presented as computer text were formatted using block characters (Times Roman font) or as Script font (which appears similar to handwritten text). Second, responses presented as computer text were presented with and without spelling errors corrected. To examine the effect these changes had on the scores awarded by raters, two repeated measures analyses of variance were performed. The first analysis examined differences in scores awarded to the Handwritten, Single-Spaced Block-text and Scripted font responses. The second analysis examined differences in scores awarded to the Handwritten, Single-spaced Block-text uncorrected, and Single-Spaced Block-text spelling corrected responses.

Altering Font of Responses

As Table 7 displays, the results showed a significant effect of the format in which responses were scored for the total score and both sub-categories. To examine whether the scores differed significantly between the handwritten and single spaced text or between the handwritten and scripted text, Tukey’s method of adjusting for multiple comparisons was employed (Glass & Hopkins, 1984). As Table 7 also indicates, responses presented as computer text received significantly lower scores than did the same responses presented as scripted computer text. Moreover, responses presented as scripted computer text did not differ significantly from the handwritten responses but did receive significantly higher scores than did the regular computer text. Thus, it appears that altering the appearance of computer printed text by using a script font, thus making the response appear more similar to a handwritten response, may eliminate the presentation effect.
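For readers unfamiliar with the repeated measures analysis of variance used above, the following is a minimal sketch of how its F statistic is formed when each response is scored under every format. It treats each response as the repeated "subject" and each format as a condition. The numbers are made up for illustration (not study data), the function name `rm_anova_f` is hypothetical, and the p-value lookup and the Tukey follow-up comparisons are omitted.

```python
def rm_anova_f(scores):
    """One-way repeated measures ANOVA.

    scores: list of rows, one row per response, one score per format.
    Returns (F, df_conditions, df_error).
    """
    n = len(scores)          # number of responses (subjects)
    k = len(scores[0])       # number of formats (conditions)
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    cond_means = [sum(row[j] for row in scores) / n for j in range(k)]
    subj_means = [sum(row) / k for row in scores]

    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_cond = n * sum((m - grand_mean) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand_mean) ** 2 for m in subj_means)
    # Variation left after removing format and response effects.
    ss_error = ss_total - ss_cond - ss_subj

    df_cond = k - 1
    df_error = (n - 1) * (k - 1)
    f_stat = (ss_cond / df_cond) / (ss_error / df_error)
    return f_stat, df_cond, df_error


# Illustrative scores for 3 responses under 3 formats (NOT study data).
demo = [
    [1, 3, 5],
    [2, 2, 5],
    [3, 4, 2],
]
f_stat, df_cond, df_error = rm_anova_f(demo)
print(round(f_stat, 3), df_cond, df_error)
```

Removing the between-response variation from the error term is what makes the design sensitive to format differences: each response serves as its own control.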
Correcting Spelling

To examine the effect spelling errors had on raters’ scores, a repeated measures analysis of variance was performed with the Handwritten, Single-Spaced and Single-Spaced with Spelling Corrected responses. Table 8 indicates that there were significant differences in the scores awarded to responses presented in handwritten form, verbatim computer text, or as computer text with spelling corrected. To examine whether the scores differed significantly between the handwritten and spell-checked text or between the verbatim computer text and spell-checked text, the Tukey method of adjusting for multiple comparisons was again employed. As Table 8 also indicates, responses presented as verbatim computer text received lower scores than did the same responses presented as spell-checked computer text, but the difference was not statistically significant. Conversely, responses presented as handwritten text received higher scores than did the same responses presented as spell-checked computer text, but again the difference was not statistically significant. Thus, it appears that correcting spelling may have a small effect on raters’ scores for responses presented in computer text, but this difference is not statistically significant and accounts for only a portion of the presentation effect, at best.
The second experiment examined whether the presentation effect could be reduced or eliminated through training. Table 9 displays the results of t-tests that compare scores awarded by raters who received additional training. Table 9 also displays the summary statistics for the scores awarded to the same responses by raters who did not receive additional training. To assist in interpreting the mean score difference, Table 9 also displays Glass’s delta effect size (Glass & Hopkins, 1984). As Table 9 shows, for raters who received supplemental training, there were only slight differences between the scores awarded to the responses presented in handwritten and computer text form, and none of these differences was statistically significant. Moreover, the scores awarded by the “trained” raters more closely resembled the scores awarded to the handwritten responses by raters who only received traditional training.
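Glass's delta, mentioned above, expresses a mean difference in units of the comparison group's standard deviation. The sketch below shows the calculation, along with the normal-curve conversion that underlies the Background section's statement that effect sizes of .57 to 1.25 place the average student above 72 to 89 percent of the comparison group. Which group served as the comparison group in Table 9 is not stated here, so it is left as a parameter; the data and the function names are hypothetical, and the percentile conversion assumes normally distributed scores.

```python
from math import erf, sqrt
from statistics import mean, stdev


def glass_delta(treatment, control):
    """Mean difference divided by the standard deviation of the control group."""
    return (mean(treatment) - mean(control)) / stdev(control)


def share_of_control_below(effect_size):
    """Share of the control group scoring below the average treated case.

    Uses the standard normal CDF, so it assumes normally distributed scores.
    """
    return 0.5 * (1 + erf(effect_size / sqrt(2)))


# Illustrative scores (NOT study data).
control = [2, 4, 6]
treated = [4, 5, 6]
delta = glass_delta(treated, control)
print(delta)

# Effect sizes quoted in the Background section.
for d in (0.57, 1.25):
    print(d, round(share_of_control_below(d) * 100))
```

Dividing by the control group's standard deviation alone, rather than a pooled value, keeps the yardstick unaffected by any change in score spread that the treatment itself might cause.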
The experiments presented above were intended to explore possible causes of the presentation effect and to explore approaches that might reduce or eliminate the presentation effect. As occurred in the two previous studies (Powers et al., 1994; Russell & Tao, 2004), a statistically and practically significant presentation effect was found when responses were presented in their original handwritten format and in transcribed computer print. This presentation effect resulted in higher scores awarded to handwritten responses. On average, this difference resulted in computer printed responses receiving scores 1.3 points lower than scores received by the same response presented in handwritten form. The presentation effect, however, seemed to disappear when computer printed responses were formatted with a scripted font that resembled cursive handwriting. On the surface, then, it appears that one approach to eliminating the presentation effect is to simply format computer printed responses with a font that resembles handwriting. By doing so, not only may the response more closely resemble a response produced by hand, but the larger font size also makes the response appear longer, two factors which previous studies suggest may contribute to the presentation effect. However, interviews with raters conducted after they completed scoring provide an alternate explanation. Four of the eight raters who read responses presented in scripted font complained that the passages were “very difficult to read” and “made my [their] eyes tired.” At the bottom of the score sheet, one rater even wrote, “Sorry-my 52-year old eyes can’t read 15 papers of script type. The best I can do is scan for predictable features.” Two other raters also indicated that they had difficulty reading the passages carefully and tended to award scores based on their general impression of the writing.
Thus, while presenting responses in scripted font may eliminate the presentation effect, this improvement comes with an important cost: Raters may read responses less carefully and award scores based on a quick rather than careful read of the response.

In the two previous studies, the authors suggested that the visibility of mechanical errors combined with higher expectations for computer-printed text contribute to lower scores awarded to responses presented in computer print. The second analysis in the first experiment presented here provides some evidence that spelling may have an effect on raters’ scores. The magnitude of this effect was smaller for the Topic Development scores than the English Conventions scores. But, in both cases, the size of the effect was not statistically significant, although it did result in a .8 point increase on average in students’ total score. Clearly, the visibility of spelling errors alone accounts for only a fraction of the effect, at best. Spelling, however, represents only one type of mechanical error. To further explore the influence of the visibility of mechanical errors on raters’ scores, additional studies should be conducted in which a fuller range of mechanical errors such as punctuation, capitalization, and spelling are corrected.

The second experiment presented here provides evidence that the presentation effect can be eliminated through training. By describing the presentation effect to raters, discussing the possible causes of the effect, providing samples of responses that appear very different when presented in handwritten and computer-printed form, by suggesting that raters maintain a mental count of the number of mechanical errors they observe while carefully reading a response, and by encouraging raters to think carefully about the factors that influence the scores they award, it appears that raters award similar scores to responses presented in both formats.
For testing programs concerned about tracking trends over time, it is important to note that training raised scores for computer-printed responses to the same level as scores awarded to handwritten responses. If this finding holds for other composition items administered as part of other testing programs, this finding suggests that efforts to analyze trends may not be interrupted by allowing students to compose responses by hand or with a computer. Both formatting computer-printed responses with scripted font and providing supplemental training eliminated the presentation effect. However, since scores should be based on a careful reading of a response and scripted text appears to make it difficult for raters to read responses carefully, providing supplemental training is a more desirable method for reducing this effect. It is interesting to note that during scoring, one rater who did not receive supplemental training on the presentation effect stated that it would have been helpful to have seen anchor papers presented in both handwritten and computer text form. This rater went on to say that she found herself applying different criteria and standards to the computer printed responses than to the handwritten responses. As this rater stated, Powers et al. (1994) suggest, and this experiment demonstrates, it appears critical that raters be trained with responses presented in different modes when responses are produced and ultimately scored in different modes. Clearly, the effect training has on reducing the presentation effect needs to be replicated with a larger sample of responses and larger groups of raters. In addition, future studies should examine this issue for a wider variety of open-response test items. Nonetheless, this study provides preliminary evidence that the presentation effect can be eliminated through training. 
If generalizable, this finding may clear an important obstacle to providing students with the option of composing responses to open-ended items by hand or on a computer.

References

Alberta Learning. (2000). Directions for Administration, Administrators Manual, Diploma Examination Program.
Bull, R. & Stevens, J. (1979). The effects of attractiveness of writer and penmanship on essay grades. Journal of Occupational Psychology, 52, 53-59.
Bunderson, C. V., Inouye, D. K. & Olsen, J. B. (1989). The four generations of computerized educational measurement. In Linn, R. L., Educational Measurement (3rd ed.), Washington, D.C.: American Council on Education, pp. 367-407.
Chase, C. I. (1986). Essay test scoring: Interaction of relevant variables. Journal of Educational Measurement, 23(1), 33-41.
Glass, G. & Hopkins, K. (1984). Statistical Methods in Education and Psychology. Boston, MA: Allyn and Bacon.
Markham, L. R. (1976). Influences of handwriting quality on teacher evaluation of written work. American Educational Research Journal, 13(4), 277-283.
Marshall, J. C. & Powers, J. C. (1969). Writing neatness, composition errors, and essay grades. Journal of Educational Measurement, 6, 97-101.
Massachusetts Department of Education. (1999). Massachusetts Comprehensive Assessment System: 1998 Technical Report. Malden, MA.
Massachusetts Department of Education. (2000a). 1999 MCAS Sample Student Work and Scoring Guides. http://www.doe.mass.edu/mcas/student/1999/.
Massachusetts Department of Education. (2000b). 1999 MCAS Technical Report. Malden, MA.
Mead, A. D. & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114(3), 449-458.
NCS Pearson. (2000). Scoring MCAS Compositions: NCS Mentor™ for Massachusetts.
Powers, D., Fowles, M., Farnum, M., & Ramsey, P. (1994). Will they think less of my handwritten essay if others word process theirs? Effects on essay scores of intermingling handwritten and word-processed essays. Journal of Educational Measurement, 31(3), 220-233.
Russell, M. & Haney, W. (1997). Testing writing on computers: an experiment comparing student performance on tests conducted via computer and via paper-and-pencil. Education Policy Analysis Archives, 5(3). Available online: http://olam.ed.asu.edu/epaa/v5n3.html.
Russell, M. & Haney, W. (2000). Bridging the Gap Between Testing and Technology in Schools. Education Policy Analysis Archives, 8(19). Available online: http://epaa.asu.edu/epaa/v8n19.html.
Russell, M. & Plati, T. (2000). Mode of Administration Effects on MCAS Composition Performance for Grades Four, Eight and Ten. A report submitted to the Massachusetts Department of Education by the National Board on Educational Testing and Public Policy. Available online: http://nbetpp.bc.edu/reports.html.
Russell, M. & Tao, W. (2004). Effects of handwriting and computer-print on composition scores: A follow-up to Powers, Fowles, Farnum, & Ramsey. Practical Assessment, Research & Evaluation, 9(1). Retrieved February 22, 2004 from http://PAREonline.net/getvn.asp?v=9&n=1.
Russell, M. (1999). Testing Writing on Computers: A Follow-up Study Comparing Performance on Computer and on Paper. Educational Policy Analysis Archives, 7(20). Available online: http://epaa.asu.edu/epaa/v7n20/
Descriptors: Computer Uses in Education; Essays; Fonts; Handwriting; Scoring; Word Processing; Writing Evaluation