Test Purposes

Writer(s): 
Greta J. Gorsuch, Mejiro University

This is the first of a four-part series on testing. This article deals with test purposes; the second will address the use of commercially produced tests, the third will focus on creating classroom tests, and the fourth will be an annotated bibliography of resources in testing.

What are tests used for? They're used to make decisions. Some examples: Should that nursing student be allowed to join the nursing profession? Should student X pass or fail the class? Should we certify so-and-so as a mechanical engineer? What high school should this junior high school student be steered towards? Should this girl be admitted to our university? What class level should that student be placed in? Should we pass this job applicant on to the second round of interviews? The answer to all of these questions in Japan, and indeed in much of the world, is to give a test.

Tests are perhaps best defined as socially accepted devices by which decisions can be made about individuals in relation to their position in education, and in society. Tests are the great gateways and the gatekeepers to social and economic accomplishment. Given this, imagine the effect a single test can have on someone's life. Let's say a nursing student doesn't pass his state-mandated test -- he will not be a nurse, at least not until the test is offered again, and in Japan, that may be only once a year. If a hapless student fails her English class final exam, she will not get credit for taking the class, meaning a whole extra year repeating the class. On the other hand, if a student does very well on a test cobbled together by her junior high school teachers, she'll be steered towards applying for entrance into better high schools, and on to better universities.

Tests are everywhere, and as educators in a society in which tests mean a great deal, we should know as much as possible about them -- their positive and negative uses, and their potentially dynamic relationship to learning. In this article, basic testing and statistical terminology will be introduced, and two types of tests will be described and matched to specific educational decisions. Finally, these two types of tests will be related to classroom learning. By the time you've finished reading this article you should at least sound like a testing expert in collegial discussions, and be able to consider your own testing practices, and those of your school, in a new light.

Basic Testing Terminology (How to Sound Like an Expert)

There are basic terms which you should know to be able to read books and articles on testing, and to sound like an expert. An item is any opportunity for students to give an answer or provide information which will have some effect on their test score (J. D. Brown, personal communication, June, 1994). In many tests, an item would simply be a single test question, but in some tests, say of some demonstrable skill such as using clarification requests during roleplays, an item would be an entry on a teacher's checklist where the teacher could check yes, this student used a clarification request, or no, she didn't.
A discrete point item is a single question on a test which aims to assess a student's ability to provide a correct answer on a single point of knowledge. An ELT example:

1. Nell and Jerry __________ up and down on the bed.
a.  jumping
b.  sat
c.  jumped
d.  sitting

An integrative item (Davies, 1990, also calls this holistic), however, is one which asks the student to make use of several skills at once. Brown (1996) uses the example of a dictation test, where students not only have to use listening comprehension skills, but writing skills as well.
Both examples above are mainly receptive items where the student has to produce relatively little -- she simply has to fill in the correct answer or, depending on the actual dictation, perhaps write a single word. By and large, though, both examples make use of the receptive skills of reading and listening. Many tests are made up of largely receptive items because they're easy to score and are thought to be precise (this is not always true, however, as will be discussed under reliability below).

Productive items in the ELT field ask students to write or speak in some context and their responses are then assessed by some predetermined criteria. A test in which students write a composition on some preset subject in a given amount of time is one example. Another example would be an interview test, where a student is interviewed and her ability to use, say, fluency fillers (um, well, uh-huh) is evaluated. Productive items can be a devil to score, but have obvious advantages if you want to know how well your student can communicate. And, as we know, communication is much more than knowing answers to discrete item grammar points and listening comprehension items.

This brings us to reliability and validity, two concepts which teachers, whether they are interested in testing or not, must know. Concerning reliability, Alderson, Clapham and Wall (1995, p. 6) say the following: "Reliability is the extent to which test scores are consistent: if candidates took the same test again tomorrow after taking it today, would they get the same result?" Let's say the students take a multiple choice, discrete point test on a stiflingly hot day, with the windows open and a lawnmower running. The test they're taking is poorly printed, and in many cases the items are on one page and the places for students to write their answers are on another. Not only that, the students haven't been informed the test is printed on both sides of the pages -- in fact, the directions given by the test proctor at the beginning were vague and hard to hear. The language of the test items themselves is imprecise, so students really aren't sure how to answer, and worst of all, some of the multiple choice items have more than one correct answer (this is very common with test items that are "first drafts"). Finally, the test is simply too difficult for the students -- it has been written well above their level.

The situation described here will certainly, and seriously, decrease the reliability of the test. Students will be distracted, and will be so busy figuring out how to complete the test that they won't be able to demonstrate their knowledge. In the case of items that are simply too difficult or are ambiguously worded, students will just guess. So you can see that if the test were to be given to the same students the next day, they would get very different scores, due to chance variation. The good news is that steps can be taken to increase test reliability -- Brown (1996) has many concrete suggestions, and Alderson, Clapham and Wall (1995), and Bachman (1990) offer helpful general suggestions.
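For readers who want to see what an estimate of reliability actually looks like, here is a minimal sketch in Python. The quiz scores are invented, and Cronbach's alpha (equivalent to the KR-20 coefficient when items are scored right/wrong) is only one of several internal-consistency estimates discussed in testing textbooks such as Brown (1996); the point is simply that reliability is a number you can calculate, not just a feeling about a test.

def cronbach_alpha(score_matrix):
    """score_matrix: one row per student, one column per item (1 = right, 0 = wrong)."""
    k = len(score_matrix[0])                     # number of items
    totals = [sum(row) for row in score_matrix]  # each student's total score

    def variance(values):                        # population variance
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    item_variances = [variance([row[i] for row in score_matrix]) for i in range(k)]
    return (k / (k - 1)) * (1 - sum(item_variances) / variance(totals))

# Invented results for six students on a five-item quiz
scores = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
]
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")  # closer to 1.0 = more consistent scores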

Validity is, in Brown's words, "the degree to which a test measures what it claims, or purports, to be measuring" (1996, p. 231). A test that claims to test students' listening comprehension should test listening, not, say, grammar points. Similarly, a test that claims to test students' overall communicative ability should allow students to speak, write, read, and listen in fairly authentic communicative situations. For a test to be valid, educators must have a thorough understanding of what students are being asked to do on the test, and a thorough understanding of the thing (trait) within the students being measured -- and these two understandings must match.

Validity is a much more slippery concept than reliability. Whereas reliability can be estimated mathematically, validity is really only an argument, or a claim. Validity can be estimated through indirect means, such as asking a panel of "experts" to judge whether your test is valid, or by giving a test to a group of "masters" (people you know who have the trait you're trying to measure) and to a group of "novices" (people who don't have the trait) and comparing their scores ("masters" should score high, "novices" should score low). There are other ways, but it still comes down to a reasoned argument and, hopefully, a preponderance of indirect evidence.
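As a deliberately simple illustration of the "masters versus novices" comparison, the Python sketch below (with invented numbers) just compares the two groups' mean scores. A real validation study would gather far more evidence than a difference between two means, but the logic is the same: if the test really measures the trait, the masters should clearly outscore the novices.

# Invented scores: a "masters" group that has the trait and a "novices" group that does not
masters = [88, 92, 79, 95, 85]
novices = [54, 61, 48, 66, 57]

def mean(group_scores):
    return sum(group_scores) / len(group_scores)

print(f"masters mean: {mean(masters):.1f}")   # should be high
print(f"novices mean: {mean(novices):.1f}")   # should be low
print("Masters outscore novices:", mean(masters) > mean(novices))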

The Great Divide -- NRTs and CRTs

All tests are one of two basic types, a norm-referenced test (NRT) or a criterion-referenced test (CRT). NRTs and CRTs are very different animals indeed, and are used to make different kinds of educational decisions. The differences between NRTs and CRTs can be encapsulated in the following ways (adapted from Brown, 1996):

  • How student scores are interpreted (see the short sketch following this list)
    • With NRTs, a student's performance is compared to the performances of all the other students who took the test.
    • With CRTs, a student's performance is compared to how much of the course material he or she has learned.
  • What is measured
    • With NRTs, general language ability or proficiency is tested.
    • With CRTs, specific course objectives-based language points or skills are tested.
  • Purpose of testing
    • With NRTs, the purpose is to spread students out along a continuum of general abilities.
    • With CRTs, the purpose is to assess the amount of course objectives-based materials the student knows.
  • What students' scores will look like when plotted on a graph
    • With NRTs, students' scores are distributed along a "normal distribution" pattern ("the bell curve" or just "the curve") around a mean (average) score, regardless of when they take the test. (see Figure 1 below)
    • With CRTs, students should get low scores at the beginning of the course or course component and high scores at the end of the course or course component. (see Figure 2 below)
  • Test structure
    • With NRTs, the test is made up of a few long subtests whose questions cover quite different content.
    • With CRTs, the test is made up of a series of short subtests, each corresponding to a course objective, whose questions cover similar content.
  • What students know about the test questions in advance
    • With NRTs, students have no idea what content to expect in the test questions.
    • With CRTs, students should know exactly what content to expect in the test questions. J. D. Brown (personal communication, June, 1994) went so far as to say that the activities students do in their classes should be rehearsals for what they have to do on a CRT.
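The first point in the list above -- how scores are interpreted -- is perhaps the easiest to see in a concrete example. In the Python sketch below, the same raw score is read two ways: as a standing relative to the rest of the group (the NRT reading) and as a percentage of items correct per course objective (the CRT reading). The class results and the objectives are entirely hypothetical.

# Invented class results and one student's score
all_scores = [34, 41, 45, 52, 55, 58, 60, 63, 67, 72, 78, 85]
student_score = 63

# NRT-style reading: where does this student stand relative to the group?
below = sum(1 for s in all_scores if s < student_score)
percentile = 100 * below / len(all_scores)
print(f"NRT reading: higher than {percentile:.0f}% of the group")

# CRT-style reading: how much of each course objective has this student mastered?
items_per_objective = {"past tense": 5, "clarification requests": 5, "listening for gist": 5}
correct_per_objective = {"past tense": 5, "clarification requests": 3, "listening for gist": 4}
for objective, total_items in items_per_objective.items():
    mastered = 100 * correct_per_objective[objective] / total_items
    print(f"CRT reading: {objective}: {mastered:.0f}% of items correct")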

Different Purposes for CRTs and NRTs

CRTs and NRTs have very different appropriate uses and purposes. Bachman (1990) names the following uses of tests, and Brown (1996, p. 9) matches test purposes to test type (CRT or NRT):

  • Selection (entrance to a program; readiness for a program): NRT, because you want to compare individual applicants overall with other applicants taking the same test.
  • Placement: NRT, because you want to find the appropriate level within a program for a student who is entering a program with a number of other students.
  • Diagnosis: CRT, because you want to measure specific points of the students' knowledge in relation to the goals of the program or course.
  • Progress and Grading (achievement): CRT, once again because you want to measure specific points of the students' knowledge in relation to the objectives of the course.

It might be easier to understand why NRTs and CRTs are used for specific purposes if you envision how the scores of the students will appear when plotted on graphs, as in Figures 1 and 2.

[Figure 1: NRT scores spread in a bell curve around the mean. Figure 2: CRT scores clumped at the low end at the beginning of the course and at the high end at the end of the course.]

Note in Figure 1 (the NRT scores) that students' scores are spread out along a continuum of general traits, or abilities. This is exactly what you want when making program admissions or placement decisions -- if students' scores are clumped too closely together, it will be harder to make good (fair) decisions regarding admissions or placement.

But note in Figure 2 (the CRT scores) that students' scores are clumped towards the low end at the beginning of the course or course component and towards the high end at its conclusion, indicating that the students have learned the materials presented in class. This is ideal for measuring achievement (how much students have learned), or locating students' specific problems, both in the context of program or course objectives. For example, if students still score low on a CRT after the material covered by the items has been presented in class, perhaps a rethinking or review of the lessons is in order. Or if only some students do poorly, perhaps they need extra help. CRTs can give students, and teachers, valuable feedback at a time when it is most helpful, in the midst of the program or course.
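If you don't have Figures 1 and 2 in front of you, the short Python sketch below generates made-up score sets with the same shapes: NRT scores spread out around a mean, and CRT scores clumped low before instruction and high after it. None of the numbers are real; only the shapes matter.

import random

random.seed(1)  # so the invented numbers are repeatable

nrt_scores   = [round(random.gauss(60, 12)) for _ in range(30)]            # bell-shaped spread
crt_pretest  = [round(random.gauss(25, 8)) for _ in range(30)]             # clumped at the low end
crt_posttest = [min(100, round(random.gauss(85, 6))) for _ in range(30)]   # clumped at the high end

def summarize(label, scores):
    mean = sum(scores) / len(scores)
    print(f"{label:13} min={min(scores):3} mean={mean:5.1f} max={max(scores):3}")

summarize("NRT", nrt_scores)
summarize("CRT pre-test", crt_pretest)
summarize("CRT post-test", crt_posttest)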

CRTs: A Help to Objectives Driven Learning

When designing courses, most teachers have in mind what they want students to know, or be able to do at the end of the course. These are course objectives. The clearer we can be about these objectives, the easier it will be to create objectives-based items on CRTs for a course. If these CRTs (which can be course pre-tests, post-tests, and quizzes) are administered throughout the course, we can plot the progress of the students and give appropriate feedback. Remember, with CRTs, students are compared to the course material. Students' scores reflect what they've learned. NRTs, however, compare students to each other, and have no particular connection to the course objectives -- they make no comment whatever about where students are with their learning.

Unfortunately, NRTs are often used, inappropriately, at the course level. This problem is not particular to any education system -- it is endemic worldwide. Witness the experience of an American female corporate accountant, who, in her mid-30s, has returned to university in the U.S. to earn a B.S. in computer science (A. Feisley, personal communication, June 1, 1996):

School is so different. The goal is to learn large volumes of material versus demonstrating skills. At times I have had a hard time accepting less than an "A" because it makes me feel I have failed. The major difference between school and work is that in school you are not expected to learn everything, and in fact, courses/exams are frequently designed so that students fall into the bell shaped curve.

Teachers make statements such as "There have been complaints that computer science students do not do well in higher level courses and attribute that to easy grading in the introductory courses. For this reason I grade the intermediate course using a rigorous scale." I found that one interesting. I'd think making sure students had a thorough knowledge of fundamentals would lead to a greater likelihood of success in higher level courses.

So what I'm trying to say is that I find the grading scale as applied does at times take away from the meaning of the grade. Right now I could not say for sure what I received in any of the four courses this quarter since they are all graded on a curve. It's hard to get too bent out of shape about knowing that I might get a "B" because it might only mean that too many others got the A's rather than my having missed some predetermined standard of an "A."

I guess the point I'm trying to make is that the curve concept is somewhat bogus in that it does imply that a group of people will perform in a predetermined manner and deemphasizes the higher goal that everyone should master key concepts.

What Feisley describes are computer courses that operate without clear objectives, and without tests designed to measure students' achievement on those objectives. In using phrases such as "bell shaped curve," "rigorous scale," "a curve," and "the curve concept," she's describing the use of norm-referenced tests (NRTs) to place students in grade categories for the convenience of teachers who simply don't know better -- categories which have nothing to do with what students have learned in the course.

Note also that Feisley instinctively knows that there is another way -- when she uses phrases such as "thorough knowledge of fundamentals," "some predetermined standard of an 'A,'" and "everyone should master key concepts," she's talking about CRTs written in conjunction with course objectives. Only CRTs will allow teachers to set standards, measure achievement, and give students valuable feedback at the course level.

For many teachers in formal education systems, this concept serves up a ripe, steaming quandary. How can we give students course grades based on "mastery" and "non-mastery" of objectives (pass/fail) when school grading policies demand that we grade students on a norm-referenced scale ("the curve" -- A, B, C, D, etc.)? For answers, stay tuned for the third article in this series, on creating CRTs for your courses.

Conclusion

In this article, basic testing terminology was introduced, along with the two major categories comprising all tests, norm-referenced tests (NRTs) and criterion-referenced tests (CRTs). The appropriate uses of NRTs and CRTs in educational decisions were outlined. The dynamic, positive relationship between valid and reliable CRTs and well-defined course objectives was described. And, a final point was made: NRTs should not be used to measure achievement in courses -- even though many, many misguided educators do so.

References

  • Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.
  • Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
  • Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall Regents.
  • Davies, A. (1990). Principles of language testing. Oxford: Basil Blackwell.