Using Generalizability Theory in the Evaluation of L2 Writing: An Examination of High School Students' Free English Compositions

Hiroyuki Yamanishi, Graduate School, Hiroshima University

This paper aims to investigate the characteristics of the evaluation of L2 writing, particularly free English compositions by Japanese high school students, using Generalizability Theory (G theory). Although the evaluation of free compositions is usually considered difficult to examine, it can be investigated thoroughly with G theory, which enables researchers to obtain detailed information about the main effects and interactions of the complicated factors within an evaluation by examining its measurement errors.
I focused on two factors (more precisely, facets) to obtain data on the evaluation of free compositions. These facets were: (a) the raters, comprising 10 high school teachers (expert raters) teaching English at a national high school and two public high schools, and six university students (novice raters) studying English language education at a national university; and (b) the rating scales, namely Jacobs, Zinkgraf, Wormuth, Hartfiel, and Hughey's (1981) ESL Composition Profile and a modified version of the Kantenbetsu Hyoka of the National Institute for Educational Policy Research (2002). Using these scales, the expert and novice raters evaluated free compositions written by 20 high school students studying at a national high school in the Chugoku region of Japan. The G theory design used in this paper is termed a two-facet crossed design: all the raters evaluate all the compositions using all the items of both rating scales.
Studies using G theory usually comprise two substudies: a Generalizability Study (G study) and a Decision Study (D study). A G study investigates how the facets and their interactions (termed sources of variance) affect the evaluation results by estimating the magnitude of the variance components. A D study investigates the reliability of the evaluation by examining generalizability coefficients, which correspond to the reliability coefficients of classical test theory, using simulations that vary the number of raters or rating-scale items. The G study in this paper dealt with seven sources of variance: persons (p), raters (r), rating scale items (i), and their interactions (p × r, p × i, r × i, and p × r × i). The D study focused particularly on varying the number of raters in the simulations.
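The analysis described above can be sketched in code. The following is a minimal illustration, not the author's actual analysis: it estimates the seven variance components for a two-facet fully crossed p × r × i design from expected mean squares, then computes a D-study generalizability coefficient for relative decisions. The function names and the use of NumPy are my own assumptions for illustration.

```python
import numpy as np

def variance_components(X):
    """Estimate variance components for a two-facet fully crossed
    p x r x i design (one observation per cell) via the standard
    expected-mean-square equations for a random-effects ANOVA."""
    n_p, n_r, n_i = X.shape
    m = X.mean()
    mp = X.mean(axis=(1, 2)); mr = X.mean(axis=(0, 2)); mi = X.mean(axis=(0, 1))
    mpr = X.mean(axis=2); mpi = X.mean(axis=1); mri = X.mean(axis=0)

    # Mean squares for each source of variance
    ms_p = n_r * n_i * ((mp - m) ** 2).sum() / (n_p - 1)
    ms_r = n_p * n_i * ((mr - m) ** 2).sum() / (n_r - 1)
    ms_i = n_p * n_r * ((mi - m) ** 2).sum() / (n_i - 1)
    ms_pr = n_i * ((mpr - mp[:, None] - mr[None, :] + m) ** 2).sum() / ((n_p - 1) * (n_r - 1))
    ms_pi = n_r * ((mpi - mp[:, None] - mi[None, :] + m) ** 2).sum() / ((n_p - 1) * (n_i - 1))
    ms_ri = n_p * ((mri - mr[:, None] - mi[None, :] + m) ** 2).sum() / ((n_r - 1) * (n_i - 1))
    resid = (X - mpr[:, :, None] - mpi[:, None, :] - mri[None, :, :]
             + mp[:, None, None] + mr[None, :, None] + mi[None, None, :] - m)
    ms_pri = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1) * (n_i - 1))

    # Solve the EMS equations; negative estimates are set to zero, as is conventional
    v = {'pri': ms_pri}
    v['pr'] = max((ms_pr - ms_pri) / n_i, 0.0)
    v['pi'] = max((ms_pi - ms_pri) / n_r, 0.0)
    v['ri'] = max((ms_ri - ms_pri) / n_p, 0.0)
    v['p'] = max((ms_p - ms_pr - ms_pi + ms_pri) / (n_r * n_i), 0.0)
    v['r'] = max((ms_r - ms_pr - ms_ri + ms_pri) / (n_p * n_i), 0.0)
    v['i'] = max((ms_i - ms_pi - ms_ri + ms_pri) / (n_p * n_r), 0.0)
    return v

def g_coefficient(v, n_r, n_i):
    """D-study generalizability coefficient (relative decisions)
    for a hypothetical design with n_r raters and n_i items."""
    rel_err = v['pr'] / n_r + v['pi'] / n_i + v['pri'] / (n_r * n_i)
    return v['p'] / (v['p'] + rel_err)
```

A D study in this sketch simply re-evaluates `g_coefficient` while varying `n_r`, mirroring the rater-number simulations described above; the coefficient rises as more raters are averaged over.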
The G study and the D study yielded the following observations: (a) the expert raters' evaluations showed a halo effect tendency, because the estimated variance components of the interactions p × r and r × i were large; (b) the novice raters' rating experience was insufficient for reliable evaluations, because the generalizability coefficients for both rating scales were low while the estimated variance component of the interaction p × r × i, which is regarded as unmeasured error, was large; and (c) the D study simulations showed the ESL Composition Profile to be a more reliable rating scale than the Kantenbetsu Hyoka.
This paper presents several pedagogical implications of these results for improving the evaluation of free compositions. In particular, I present possible methods of using G theory results diagnostically to develop and modify rating scales and to train raters.