Evaluating Oral Ability

Josef Messerklinger, Kanto International High School


  • Key Words: Speaking, Evaluation
  • Learner English Level: False beginner to Advanced
  • Learner Maturity Level: Jr. High school to Adult
  • Preparation Time: Several hours
  • Activity Time: Varies

Our primary concern as teachers is with methodology, but testing is also a necessary and valuable part of our work. Without going into the intricacies of validity and reliability, the advantages and disadvantages of indirect versus direct tests, or Item Response Theory (for these, see Hughes, 1989; Weir, 1990; Alderson, Clapham & Wall, 1995), I would like to address an issue that should concern anyone who teaches conversation: testing oral ability.

Although paper tests are quicker and easier to administer than oral tests, which are admittedly difficult with large classes, it seems best to test speaking by asking students to speak. Many rubrics can be used to get students to produce language for testing, such as impromptu speeches, interviews, role play, discussion, and picture elicitation. Students can either be graded while they speak or tape-recorded for later evaluation. When tape recording, however, an accurate sample of student performance cannot always be obtained unless students are comfortable speaking in the presence of a tape recorder. Otherwise, they tend to become self-conscious about their English and therefore produce longer pauses, more repetitions, and less language.

Ready-made tests of oral proficiency created by testing experts are available (for example, the FSI test and the Test in English for Educational Purposes). However, they are not appropriate for every classroom, and their rating scales have been widely criticized. They suffer from a variety of problems, including "squishy prose descriptors" (Hieke, 1985, p. 140), poor reliability, and questionable validity (Fulcher, 1987; Upshur & Turner, 1995).

On the other hand, Hughes (1989) and Weir (1990) argue that teachers should take an active part in making tests for their students rather than rely on possibly inappropriate ready-made tests, because the results of a test often affect students and teachers alike. Upshur and Turner (1995) offer a method for creating simple yet reliable and valid rating scales that any group of teachers can easily produce. The scales "require the rater to make a series of binary choices about features of student performance that define boundaries between score levels" (Upshur & Turner, 1995, p. 6). The scale is developed by using samples of student speech. A team of raters examines the samples and divides them impressionistically into two groups, the better and the worse performances. They then create a scale in the form of a series of ordered yes/no questions based on the differences between the two groups. These questions are then applied to other performances.

This rating system has several advantages. First, "problems of estimation are reduced and so measurement accuracy (and hence reliability) is enhanced" (Upshur & Turner, 1995, p. 10). At our school, 81% agreement between raters was achieved on the first trial with no rater training. In addition, because the scale uses descriptors which are simple and precise and are based on actual performances, the likelihood that raters will misinterpret the scale and make invalid assertions about the performances is reduced (in the example below, ratings correlated highly with quantitative measures of speaking ability). Finally, since the scale makers and users are the teachers themselves, tests using the scale will have a beneficial influence on teaching and curriculum development.
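As an illustration of how an agreement figure such as the one above might be computed, here is a minimal sketch. It assumes that "agreement" means the percentage of performances to which two raters gave exactly the same score; the article does not specify the calculation, and the function name and the scores are hypothetical.

```python
def percent_agreement(rater_a, rater_b):
    """Return the percentage of performances given identical scores by both raters."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same performances")
    if not rater_a:
        raise ValueError("no scores to compare")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100.0 * matches / len(rater_a)


# Hypothetical scores for ten recorded interviews (not data from the article).
scores_a = [4, 3, 2, 4, 1, 0, 3, 2, 4, 1]
scores_b = [4, 3, 2, 3, 1, 0, 3, 2, 4, 2]

agreement = percent_agreement(scores_a, scores_b)
print(f"{agreement:.0f}% exact agreement")
```

Exact agreement is the simplest measure and does not correct for agreement that would occur by chance, so it is best read as a rough check rather than a full reliability estimate.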

Here is an example of fluency testing. At our school, we recently evaluated part of our curriculum by testing student performance with an oral test. To develop the rating scale, students were interviewed and recorded for the scorers to examine. The scorers listened to several examples and divided them into two broad categories: students who spoke well and those who did not. The scorers then discussed the rankings of these performances and created a scale based on their impressions. During the discussion of the tapes, they decided that the most important feature affecting fluency was the completeness of the answer, followed by the number and length of pauses, speaking rate, and finally the number of repetitions. The examples were then examined for these features, which the scorers had identified as influencing their perceptions of fluency. This formed the basis for the rating scale, which the scorers then used to double-mark the other performances. The rating scale designed by this process is shown below.

  • 1 partial answer --Content-- complete answer
  • 2 long and many --Pauses-- short and few
  • 3 slow --Rate-- fast
  • 4 many --Repetitions-- few
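The article does not state how the four judgments are combined into a single score. The sketch below shows one possible reading, in the spirit of the ordered yes/no questions described by Upshur and Turner (1995): the features are checked in order of importance, and the score is the number of "yes" answers before the first "no". The question wording, the stop-at-first-no rule, and all names are assumptions made for illustration.

```python
# Features in the order of importance reported by the scorers.
# The question wording is hypothetical, not taken from the article.
QUESTIONS = [
    ("content", "Was the answer complete?"),
    ("pauses", "Were the pauses short and few?"),
    ("rate", "Was the speaking rate fast?"),
    ("repetitions", "Were there few repetitions?"),
]


def fluency_score(answers):
    """Count consecutive 'yes' answers, stopping at the first 'no' (assumed rule)."""
    score = 0
    for feature, _question in QUESTIONS:
        if not answers[feature]:
            break
        score += 1
    return score


# One rater's yes/no judgments for a single hypothetical performance.
example = {"content": True, "pauses": True, "rate": False, "repetitions": True}
print(fluency_score(example))  # prints 2
```

Under this rule a student who gives only a partial answer scores 0 regardless of the other features, which reflects the scorers' view that completeness mattered most. A team that weights the features differently would need a different rule.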

Oral testing can also be made a regular part of class. For example, at the end of a unit on giving directions, I asked students to role-play giving directions from our school to somewhere in Tokyo and used a modified version of the scale above. The rest of the class was instructed to listen and take notes. The notes can either be collected and checked or be used for a listening quiz at the end of class. For the listening quiz on giving directions, I asked the class to name the station or train line that was used to reach the destination. Using this scale helps students understand what is expected of them and how they are being evaluated. Because students know they will be expected to speak for their grade, they also make a greater effort during in-class pair work and discussion activities.


  • Alderson, J. C., Clapham, C., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.
  • Fulcher, G. (1987). Tests of oral performance: The need for data-based criteria. ELT Journal, 41(4), 287-291.
  • Hieke, A. E. (1985). A componential approach to oral fluency evaluation. The Modern Language Journal, 69(2), 135-142.
  • Hughes, A. (1989). Testing for Language Teachers. Cambridge: Cambridge University Press.
  • Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language tests. ELT Journal, 49(1), 3-12.
  • Weir, C. (1990). Communicative Language Testing. New York: Prentice-Hall International.