Computer-Adaptive Testing of Listening Comprehension: A Blueprint for CAT Development

Patricia A. Dunkel

Date:

October 1997

Issue:

The Language Teacher - Issue 21.10; October 1997

Writer(s):

Patricia A. Dunkel

Georgia State University

The use of computer-adaptive testing (CAT) for placement/achievement testing purposes has become an increasingly appealing and efficient method of assessment. Simply put, CAT saves testing time and decreases examinee frustration since low-ability examinees are not forced to take test items constructed for high-ability testees, and vice versa. As Bergstrom and Gershon (1994, p. 1) note, "When the difficulty of items is targeted to the ability of candidates, maximum information is obtained from each item for each candidate, so test length can be shortened without loss of reliability."

A number of testing programs and licensing agencies in the United States are switching from paper-and-pencil testing to CAT for the sake of efficiency and effectiveness. The American College Testing program offers COMPASS, a computer-adaptive placement test given to students entering college to assess their preparedness to do college work in reading and math. A number of high-stakes tests in the United States are also now administered in CAT form (e.g., Educational Testing Service's Graduate Record Examination), and those in charge of constructing the Test of English as a Foreign Language (TOEFL) are in the process of developing a TOEFL CAT for administration to various applicants taking the TOEFL in the twenty-first century. A number of first- and second-generation second/foreign language CATs have been developed in the United States for a variety of purposes, contents, and contexts, including: French reading proficiency (Kaya-Carton, Carton & Dandonoli, 1991); listening and reading comprehension (Madsen, 1991); English as a second language (ESL)/bilingual entry- and exit-decision making (Stevenson, 1991); ESL reading comprehension (Young, Shermis, Brutten, & Perkins, 1996); and foreign language reading comprehension (Chalhoub-Deville, in press; Chalhoub-Deville, Alcaya & Lozier, 1996).

Since CAT will, undoubtedly, become a more pervasive method of assessment in the coming century, it behooves English-as-a-Foreign-Language (EFL) professionals to become more knowledgeable about the capability and potential of computer-adaptive testing. It also behooves them to explore developing and/or using CATs for particular institutional testing purposes (e.g., for achievement and placement testing). To construct valid and reliable CATs, some EFL professionals will be forced by constraints of time and/or resources to create those CATs using generic, commercial software (e.g., MicroCAT of Assessment Systems Corporation, Inc., or CAT ADMINISTRATOR from Computer Adaptive Technologies, Inc.). Others may choose to develop their own testing shells from scratch (perhaps the best but most costly and time-consuming approach to CAT development) if the commercial CAT software packages prove too costly for large-scale CAT administrations or if the commercial software fails to meet particular testing needs (e.g., to develop a CAT requiring video and speech output, or eventually even speech recognition).

This article has a three-fold purpose: (1) to familiarize EFL teachers with some basic information about what a CAT is and how one operates; (2) to describe the structure, content, and operation of the ESL listening comprehension CAT; and (3) to acquaint those EFL professionals considering CAT development de novo with the genesis, planning, and implementation of the listening CAT software that drives the ESL listening CAT. To achieve this final goal, the author describes both the inception and realization of her CAT development project, which was undertaken with funding from the United States Department of Education and with the support of instructional designers and computer specialists at The Pennsylvania State University. The author hopes to inspire readers to learn more about CAT, and to help them decide whether they should use available commercial CAT software programs (e.g., MicroCAT) for their CAT development and administration, or whether they should undertake creation of their own CAT software from scratch (no pun intended). The information contained in this report may serve as a rough blueprint for those CAT developers.

What is CAT?

Computer-adaptive testing (CAT) is a technologically advanced method of testing which matches the difficulty of test items to the ability of the examinee. Essentially, CAT is tailored testing. That is, when an examinee takes a CAT, the test questions/items received are "tailored" to the listening ability of that particular examinee. This means that responses to an earlier item taken in the CAT determine which ensuing items are presented to the test taker by the computer. If an examinee answers an item correctly, the next item received is more difficult; however, if the examinee gets the item wrong, the next item received is easier. This "adapting" procedure results in examinees taking more individualized tests so that even in a large-scale testing situation in which 50 or more people begin taking a CAT at the same time, it is very unlikely that any of these examinees will take the exact same test.(1)
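The "harder after a right answer, easier after a wrong answer" idea can be illustrated with a very small sketch. It is not the selection algorithm used in the ESL listening CAT, which relies on statistical item calibrations; it simply assumes a hypothetical bank whose items are numbered from easiest (0) to hardest and steps up or down to the nearest item not yet used. The function names are invented for this illustration.

```python
def next_item(bank_size, current, answered_correctly, used):
    """Return the index of the nearest unused harder item after a correct
    answer, or the nearest unused easier item after a wrong answer.
    Items are assumed to be numbered from easiest (0) to hardest.
    Returns None when no item is left in that direction."""
    step = 1 if answered_correctly else -1
    index = current + step
    while 0 <= index < bank_size:
        if index not in used:
            return index
        index += step
    return None


def simulate(bank_size, start, responses):
    """Return the sequence of item indexes one examinee would receive,
    given a list of right (True) / wrong (False) responses."""
    used = {start}
    path = [start]
    current = start
    for answered_correctly in responses:
        following = next_item(bank_size, current, answered_correctly, used)
        if following is None:
            break
        used.add(following)
        path.append(following)
        current = following
    return path


# Two examinees who start on the same item soon receive different tests.
print(simulate(9, 4, [True, True, False, False]))   # [4, 5, 6, 3, 2]
print(simulate(9, 4, [False, True, False]))         # [4, 3, 5, 2]
```

Even with this crude rule, two examinees who begin with the same item diverge after their first differing response, which is why identical tests are unlikely in a large group.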

What is the purpose of the ESL listening CAT?

The purpose of the ESL Listening Comprehension Proficiency CAT is to evaluate the nonparticipatory listening comprehension for general English content of literate ESL examinees. The CAT provides a ranking of the examinee's nonparticipatory listening comprehension in terms of nine levels of ability (novice-low; novice-mid; novice-high; intermediate-low; intermediate-mid; intermediate-high; advanced; advanced plus; superior). These rankings might be used for purposes of placement in (or out of) a variety of adult ESL programs.

What is the structure and content of the ESL listening CAT?

The CAT is designed to evaluate an examinee's ability to understand overheard utterances and discourse ranging from individual words/phrases (e.g., at the novice level of listening ability), to short monologues and dialogues on various topics, and then to longer and more involved dialogues and monologues at the advanced through superior levels. Cultural references are kept to a minimum in the novice-level items but are used more and more in the items at the intermediate and advanced levels of difficulty. For example, authentic text from radio programs is included in the advanced-plus and superior items, which contain large doses of cultural material (e.g., the superior listener must identify the theme of a country-western song).

The items test four listener functions identified by Lund (1990, p. 107) as being "important to second language instruction" and "available to the listener regardless of the text": (1) recognition/identification; (2) orientation; (3) comprehension of main ideas; and (4) understanding and recall of details. It is expected that many additional listener functions will be included in future iterations of the CAT, but the decision was made initially to focus on creating items around these four listening functions or tasks. Each listener function was embedded in a listening stimulus involving a word or phrase (at the novice level) or a monologue or dialogue (at the novice, intermediate, and advanced levels), and was paired with a listener-examinee response involving (1) a text option, which required selection of one of two limited-response options (at the novice level), one of three options (at the intermediate level), or one of four options (at the advanced level); (2) a still-photo option (requiring selection of one of two pictures or graphics); or (3) an element-in-a-still-photo option (requiring selection of the correct response among two or three elements within a unified photograph).

Items involving the four listener functions (identification, orientation, main idea comprehension, and detail comprehension), the two types of language (monologue and dialogue), and the three response formats (text, photo, and element-in-a-photo options) were written for each of the nine levels of listening proficiency articulated in the ACTFL Listening Guidelines (novice-low, novice-mid, novice-high; intermediate-low, intermediate-mid, intermediate-high; advanced, advanced plus, superior). This approach to development of the 144-item bank was taken to ensure that the test developers and potential users would have a clear understanding of which types of language and listening tasks the items in the pool were aiming to assess. The item-writing framework was also used to guide the ESL specialists with item writing since they were not necessarily specialists in testing and measurement theory and practice. The item writers attempted to devise easier items at the novice level and more difficult items at the advanced level.(2) Field-testing of the items (see discussion below) provided evidence that the item writers were not always "on target" when designating items to be "low, mid, or high" levels within each of the categories of proficiency (e.g., novice, intermediate, or advanced). Still, it was thought that asking the item writers to follow a clearly defined framework of content, listener functions, types of language, and examinee response formats as they began construction of the item bank would allow them to use a more consistent and enlightened approach to item writing.

The following discussion elaborates upon the specific listener functions (i.e., the test tasks) contained in the framework that guided construction of the item bank.

Identification. According to Lund (1990), focusing on some aspect of the code itself, rather than on the content of the message, requires identification, which equates with terms such as recognition and discrimination. According to Lund, identification "is particularly associated with the novice level because that is all novices can do with some texts. But identification can be an appropriate function at the highest levels of proficiency if the focus is on form rather than content" (p. 107).

Orientation involves the listener's "tuning in" or ascertaining the "essential facts about the text, including such message-externals as participants, their roles, the situation or context, the general topic, the emotional tone, the genre, perhaps even the speaker function" (p. 108). Determining whether one is hearing a news broadcast and that the news involves sports is an example of an orientation task, according to Lund.

Main Idea Comprehension involves "actual comprehension of the message. Initially understanding main ideas depends heavily on recognition of vocabulary. With live, filmed, or videotaped texts, the visual context may also contribute heavily to understanding" (p. 108). "Deciding if a weather report indicates a nice day for an outing," or "determining from a travelogue what countries someone visited" constitute examples of main idea comprehension, according to Lund (p. 108).

Detail Comprehension items test the listener's ability to focus on understanding specific information. According to Lund (1990), this function "may be performed independently of the main idea function, as when one knows in advance what information one is listening for; or the facts can be details in support of main ideas" (p. 108). Lund's examples of this listener function include: following a series of precise instructions; getting the departure times and the platform numbers for several trains to a certain city, and so on.

In addition to using the Lund taxonomy of listening functions listed above, the item writers also attempted to use the ACTFL Listening Guidelines' generic descriptions for listening in the process of creating the 144 items. For example, the Guidelines describe novice-low listening in the following terms: "Understanding is limited to occasional words, such as cognates, borrowed words, and high frequency social conventions. Essentially no ability to comprehend even short utterances." The item writers attempted to keep this descriptor in mind when creating the initial bank of items. For example, in the novice-low identification item, the listener hears a single word "brother" spoken and is asked to identify the text equivalent on the computer screen by selecting one of the following two text options: (a) "brother"; (b) "sister." (This particular item is a text-response item.) Additional words (and cognates) will be added as the item bank expands in number. The Guidelines suggest that the novice-mid listener is able to understand some short learned utterances, particularly where context strongly supports understanding and speech is clearly audible. The novice listener comprehends some words and phrases for simple questions, statements, high-frequency commands and courtesy formulae about topics that refer to basic personal information or the immediate physical setting. Items created with this particular Guidelines description in mind required listeners to indicate comprehension of main ideas presented in the monologues or dialogues heard.

How does the ESL listening CAT function?

After the examinee has completed an orientation to the test which teaches her how to use the computer to answer sample questions, the CAT operates as follows: The computer screen presents the answer choices when a question is called for by the examinee, who clicks on the "Next Question" icon to receive a test item (or another item). The examinee can take as much time as she likes to read the text choices (or view the photos/graphics) and get ready to call for the listening stimulus. When ready to listen, the examinee clicks on the "Listen" icon, which looks like a loudspeaker. An alert asks the test taker to "listen carefully." The listening stimulus (e.g., the dialogue or monologue) is heard immediately thereafter. (The alert and stimulus are played only when the examinee presses the loudspeaker icon.) The comprehension question follows as soon as the dialogue/monologue ceases, and the question is spoken by the same voice that provided the "listen carefully" cue.

How was the ESL listening CAT created de novo?

It takes expertise, time, money, and persistence to launch and sustain a CAT development project. Above all, it takes a lot of teamwork. The prototype ESL listening CAT was, in fact, the product of team effort on the part of many people with various areas of expertise: (a) ESL language specialists; (b) authorities in the field of testing and measurement; (c) computer programmers; (d) instructional-technology designers; (e) graduate research assistants; and (f) ESL instructors and students. The ESL specialists wrote the test items; the testing and measurement authorities provided guidance on test design and data analysis, in addition to providing critiques of the individual test items; the computer programmers and instructional-technology designers created the computer software to implement the test design in computerized form; the graduate student assistants did a variety of tasks from creating the item graphics to supervising field testing (or trialing) of the questions in the item bank; the ESL students field tested the 144 prototype items in the item pool; and the ESL instructors offered their classes for trialing of the CAT and provided feedback on the strengths and weaknesses of particular items.

The ESL CAT was developed with the support and assistance of the Educational Technologies Service, a unit within The Pennsylvania State University's Center for Academic Computing. The extensive support offered by this organization infused the project with the considerable expertise of Macintosh computer programmers, experienced instructional designers and software developers, as well as savvy graduate students in educational technology. A brief explanation of how the project was initiated, planned and started, together with a brief description of the actual programming environment should illuminate some of the varied and complex aspects of the task.

Upon agreeing to take part in the project, the staff of the Educational Technologies Service decided to use a systems approach to the design, development, evaluation, and implementation of the CAT project. The phases of project development included:

Project definition and planning. This phase included needs and task analysis, goal-setting, defining the instructional solution and strategy, determining evaluation methods, assigning personnel, reviewing the budget, and determining technology tools and environment.

Design of a model section. This phase included planning screen layout, instructional strategy, record keeping and reporting techniques, and student assessment procedures.

Identification of all sections and/or modules. During this phase the full scope of the project was planned.

Development and evaluation of model or prototype selected. It was decided that the prototype would have full functionality and would contain a bank of 144 items for initial trialing. Evaluation included several types of formative evaluation including a questionnaire soliciting the sentiments of a subsample of examinees who took the computer-assisted version of the test concerning the design of the screens, the ease of use, the identification of operating problems, and so forth.

Product design. This included content specification, content gathering, analysis and sequencing of learning tasks, and storyboarding. The Educational Technologies Service staff worked closely with the author of this report in designing the testing software that would run the CATs.

Product development and evaluation. This included computer code development, graphics and video/audio development, integration of all content, review, revision, evaluation, optimization, and documentation.

CAT project management, carried out with the staff of the Educational Technologies Service, included the following activities:

Specifying the time line for development
Selecting graduate student interns to work on the project (four graduate students in Penn State's Department of Instructional Systems assisted with the project)
Supervising part-time assistants
Securing copyright releases for the scanning of textbook photos for ESL CAT and permission to photograph subjects for digitized representations of various activities used in the ESL CAT
Scheduling and leading meetings
Coordinating the integration of the test components
Documenting development processes

Early in the project, the author of this article (and the Lead Faculty Member on the CAT development team) established the framework for test development and item writing, identified the levels of listening comprehension and the listener functions targeted, and decided upon the formats of the questions that would be included in the initial item banks (see discussion below). Some of the ensuing development and implementation procedures included the following tasks:

Developing prototype screen layouts which would be suitable for the various question types;
Maintaining clarity, conciseness, and consistency among the screens and the test taking procedures;
Deciding how students should "navigate" through the test;
Developing appropriate graphics for the ESL CAT by scanning or by creating graphics with the MacDraw program;
Taking photographs with the Canon XAP-SHOT image-digitizing camera for inclusion in the ESL test;
Touching up photos and graphics using graphics packages;
Designing the title screen, introductory (orientation) screens, and end of the test screen which reports the level of achievement;
Designing and developing procedures to show the student how to take the test;
Developing the computer programs to create and run the test (described elsewhere);
Assembling the test (this included using the test editor Educational Technologies Service created to bring together all elements, including the audio files, the graphics and photos, the text, and the item format);
Supervising the field tests of the computer-assisted and the computer-adaptive versions of the test in the Educational Technologies Service lab;
Maintaining the quality, accuracy, and consistency of the test items;
Debugging technical problems related to the test creation and/or administration;
Implementing data gathering for research (questionnaires were developed to solicit test takers' reactions to the test design and item displays);
Revising screen design as a result of formative evaluations;
Revising items as necessary (i.e., re-inputting new text, audio, or graphic and photos, as needed);
Implementing and field testing the adaptive testing algorithm.

What is the programming environment of the ESL listening CAT like?

A brief discussion of the computer and the programming environment, as well as the major components of the computer-adaptive test, follows:

1. The Hardware and Programming Environments. All parts of the test were designed to run on an Apple Macintosh IIsi or other Macintosh computers, running System 7.0 or later, with a minimum of 5 megabytes of RAM and access to a large amount of mass storage (either a local hard drive or access to an AppleShare server over Ethernet).

The hardware configuration used to create and deliver the test consisted of a Macintosh IIfx with 20 megabytes of RAM (used for programming) and a Macintosh IIci with 8 megabytes of RAM (used for constructing, demonstrating, and field testing of the CAT).

The following software was used to create and run the test: C++, from the Macintosh Programmer's Workshop (which contains the normal set of programming tools [compilers, linkers, assemblers, etc.], including the C++ compiler); MacApp, an object-oriented application framework designed by Apple; and InsideOut by Sierra Software, a database engine library.

2. The Penn State Computer-Adaptive Test Shell. The computer-adaptive test is comprised of several components. This division allows the code to be broken up into functional units, which makes debugging and extending the code easier. The units are the Front End, the Test Manager, and the Question Manager.

The Front End includes the title screens, and some student information screens which accept demographic input from the students. The student information is stored so that it can be output with the test results, if it seems desirable to do so.
The Test Manager controls the administration of the test. Its major job is to actually present the questions to the student and accept the student's responses. Essentially, the Test Manager asks the Question Manager for a question, then displays it according to the type of question. When the student has given an answer, it will inform the Question Manager of the student answer and ask for another question.
The Question Manager handles the selection of questions and scoring. A sub-unit of this component also handles the storage of questions and their components (text, pictures, sound names, etc.) using the InsideOut database library. The Question Manager determines the question selection algorithm. The current CAT algorithm used is one suggested by Henning (1987). When the Question Manager determines that the test is over (when the estimate of the student ability is accurate enough using the CAT algorithm), it returns a NULL question to the Test Manager.
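
The exchange between the Test Manager and the Question Manager described above can be sketched as follows. The original shell was written in C++ with MacApp; this Python sketch only mirrors the division of labor, and every class, method, and field name in it is hypothetical. The stopping rule here is a simple question limit standing in for the real rule (stopping when the ability estimate is accurate enough), and `None` plays the role of the NULL question.

```python
class QuestionManager:
    """Selects questions and scores answers (hypothetical sketch)."""

    def __init__(self, questions, max_questions):
        self._pending = list(questions)
        # Placeholder stopping rule: a fixed limit, NOT the accuracy-based
        # rule used by the actual CAT algorithm.
        self._max_questions = max_questions
        self.responses = []  # (question id, answered correctly?)

    def next_question(self):
        """Return the next question, or None when the test is over."""
        if len(self.responses) >= self._max_questions or not self._pending:
            return None
        return self._pending.pop(0)

    def record_answer(self, question, answer):
        self.responses.append((question["id"], answer == question["correct"]))


class TestManager:
    """Presents questions and passes the student's answers back."""

    def __init__(self, question_manager, get_answer):
        self._question_manager = question_manager
        # get_answer stands in for displaying the item and waiting for a click.
        self._get_answer = get_answer

    def run(self):
        question = self._question_manager.next_question()
        while question is not None:
            answer = self._get_answer(question)
            self._question_manager.record_answer(question, answer)
            question = self._question_manager.next_question()
        return self._question_manager.responses


# Invented items; the simulated student always chooses option 2.
items = [
    {"id": "q1", "correct": 2},
    {"id": "q2", "correct": 1},
    {"id": "q3", "correct": 3},
]
manager = TestManager(QuestionManager(items, max_questions=2), lambda q: 2)
print(manager.run())  # [('q1', True), ('q2', False)]
```

Keeping selection and scoring inside the Question Manager means the selection algorithm can be replaced without touching the code that draws screens and collects responses.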
The Test Editor. To edit and create the tests, the Test Editor application (called TestEdit) is used. It allows an individual who knows nothing about the technical details of file formats and/or programming to create and edit the CAT, if it is appropriate for the individual to do so (e.g., a designated test supervisor or someone who wishes to add institution-specific items to the bank). A great deal of code is shared between the Test Editor and the computer-adaptive test. This sharing not only saves programming time but it also helps guarantee that the code is bug-free.
The Test Results Files. At the end of a test (if the test creator so desires), a results file is written to the hard disk. The file is given a unique name so that students (or supervisors) may use it on a shared network, if it is appropriate for them to do so. The file is a standard ASCII text file and can be read by any Macintosh word processor, or transferred easily to other computers. The first part of the file contains the student information, as entered by the student, including the following: number of lines in the Information section (including this one); student name; student ID; student birthdate; starting date; starting time; total duration of the test (hh:mm:ss); examination center; how long the student has studied English; how many years the student has lived in an English-speaking country; the student's estimate of how well they speak English; the student's estimate of how well they understand English; the student's estimate of how well they read English; and the student's estimate of how well they write English. The second part contains the information for each question (one question per line, with the fields separated by commas).

The following information is provided, in addition: 1) number of questions (administered for this test); 2) question information, one per line; each question has the following fields, separated by commas: question order (the index of the question); question ID (assigned by the editor); question type; number of choices in question; the correct answer (1-4); the student's answer; the number of times the student played the sound; question duration.
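Because the question section of the results file is plain comma-separated text, it is easy to read with standard tools. The sketch below parses question lines with the eight fields listed above, in the order given. The sample lines, the question IDs, the question-type labels, and the duration format are all invented for illustration; the article does not show an actual file.

```python
import csv
from io import StringIO
from typing import NamedTuple


class QuestionRecord(NamedTuple):
    order: int            # index of the question in this administration
    question_id: str      # ID assigned by the editor
    question_type: str
    num_choices: int
    correct_answer: int   # 1-4
    student_answer: int
    sound_plays: int      # number of times the student played the sound
    duration: str         # kept as text; the exact format is not specified


def parse_question_lines(text):
    """Parse comma-separated question lines into QuestionRecord tuples."""
    records = []
    for row in csv.reader(StringIO(text)):
        if not row:
            continue  # skip blank lines
        f = [field.strip() for field in row]
        records.append(QuestionRecord(
            int(f[0]), f[1], f[2], int(f[3]),
            int(f[4]), int(f[5]), int(f[6]), f[7],
        ))
    return records


def proportion_correct(records):
    """Share of questions on which the student chose the keyed answer."""
    return sum(r.correct_answer == r.student_answer for r in records) / len(records)


# Invented sample lines in the assumed field order.
sample = """1,Q017,text,2,1,1,1,00:00:21
2,Q052,photo,2,2,1,2,00:00:34
"""
records = parse_question_lines(sample)
print(records[0].question_id, proportion_correct(records))  # Q017 0.5
```

A summary such as the proportion correct, or the average number of times the sound was replayed, can then be computed per examinee or per item when reviewing field-test data.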

How was the initial item bank field tested (trialed)?

The ESL test was field tested using overhead transparencies of the CAT screen displays and an audiotape of the dialogue and monologue stimuli and the test questions. Two hundred fifty-five subjects took part in the initial field testing at two testing sites: Georgia State University and The Pennsylvania State University. The field testing provided the necessary statistics (or Item-Response Theory calibrations) for each test item. These statistics (or item calibrations) are associated with each test item and reside in the computer as part of the test-item information.

All 144 items in the bank were administered in linear fashion; all 72 items designated by the item writers to be novice-level items were administered first, then the 47 intermediate-level items and finally the 23 advanced-level items were administered to intact groups of examinees at Georgia State and Penn State. The field-test administrator displayed each test item's options, which consisted of text or visuals, on an overhead transparency. Students viewed the answer choices, heard the audio stimuli and the test question, and then registered their responses on a computer answer sheet. Each transparency was placed on the overhead projector while the examinees were registering their responses to a previous test item on the computer answer sheet; this procedure allowed the subjects the chance to view the options on the overhead transparency before they listened to the stimuli. The reading of the test directions and the administration of the item bank took approximately 90 minutes. The test directions were not presented via audiotape but were read by the test administrator, who had the opportunity to answer any questions examinees had about the task, the types of items, and the testing procedures.

Statistical analysis of the responses to the paper-and-pencil version of the test yielded the IRT (Rasch) ability parameters (θs) that are being used by the algorithm to drive the computer-adaptive version of the test. (Unfortunately, a full discussion of the statistical analysis is not possible within the scope of this paper.) The fact that the estimates were gathered, to date, on a small sample of only 255 subjects presents a distinct problem, since Rasch parameters can prove to be relatively unstable when sample sizes are small. Therefore, additional field-testing is needed (and will be done) to establish more valid and stable parameters if the test is to be implemented for placement or achievement testing purposes in language programs.
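For readers unfamiliar with the Rasch model, the following sketch shows the standard formula for the probability of a correct response given an examinee's ability and an item's difficulty, together with a textbook maximum-likelihood ability estimate and its standard error. It is offered as a generic illustration under stated assumptions, not as the specific procedure implemented in the ESL listening CAT; the function names and the item difficulties in the example are invented.

```python
import math


def rasch_probability(theta, difficulty):
    """Rasch model: probability that an examinee of ability theta answers
    an item of the given difficulty correctly (both on the logit scale)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def estimate_ability(difficulties, responses, iterations=25):
    """Maximum-likelihood ability estimate by Newton-Raphson.

    responses holds 1 (right) or 0 (wrong) for each administered item.
    The estimate is undefined if all answers are right or all are wrong.
    Returns (theta, standard_error)."""
    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(theta, b) for b in difficulties]
        gradient = sum(x - p for x, p in zip(responses, probs))
        information = sum(p * (1.0 - p) for p in probs)
        theta += gradient / information
    probs = [rasch_probability(theta, b) for b in difficulties]
    information = sum(p * (1.0 - p) for p in probs)
    return theta, 1.0 / math.sqrt(information)


# An item contributes the most information when its difficulty equals theta.
print(rasch_probability(0.0, 0.0))  # 0.5

# Invented difficulties: two of three items answered correctly.
theta, standard_error = estimate_ability([-1.0, 0.0, 1.0], [1, 1, 0])
print(round(theta, 2), round(standard_error, 2))
```

The standard error shrinks as more well-targeted items are answered, which is what allows an adaptive test to stop once the ability estimate is judged accurate enough. The stability of the item difficulties themselves is a separate matter and depends on the size of the calibration sample.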

What is the next step?

Field-testing of the paper-and-pencil form of the original 144 items must continue until a substantially large sample of examinees (500 to 1,000 subjects) is tested, and revision of the items in the bank by ESL and testing/measurement specialists must also continue. In addition, the item bank needs to be expanded in number and variety of test items. Finally, the test must be subjected to further reliability and validity studies in its adaptive form, as well as its paper-and-pencil form. An English-for-academic-purposes (EAP) item bank should also be constructed so that the CAT can be used for EAP testing purposes. The testing shell will permit creation of various kinds of listening CATs for particular purposes (e.g., for helping select those admissible to universities in Japan). Above all, the ESL CAT development project should inspire more teachers and researchers to begin thinking about using and developing CATs for their own institution's assessment purposes. However, we must be sure to recognize and acknowledge that CAT, in and of itself, is no panacea or philosopher's stone of assessment. In addition to being concerned about all the difficulties involved in learning about computer-adaptive testing, and finding out how (and whether) to use/create a CAT, we must, above all, be concerned about developing valid, reliable, and useful instruments, be they listening tests or others. We must, therefore, be sure to recognize and agree that core principles of good test development remain in full force, whether we are developing a CAT or a classroom exam (see, for example, Bachman & Palmer, 1996; Buck, 1991; Dunkel, Henning & Chaudron, 1993; Gorsuch & Griffee, 1997).
Computerization is beginning to open a whole new world of testing, but the world of the CAT developers (and users) differs little from the world of the paper-and-pencil test developers (and users) when it comes to their abiding by the core principles of competent language testing set forth by Bachman and Palmer (1996, p. 9), including having:

An understanding of the fundamental considerations that must be addressed at the start of any language testing effort, whether this involves the development of new tests or the selection of existing language tests;
An understanding of the fundamental issues and concerns in the appropriate use of language tests;
An understanding of the fundamental issues, approaches, and methods used in measurement and evaluation;
The ability to design, develop, evaluate and use language tests in ways that are appropriate for a given purpose, context, and group of test takers;
The ability to critically read published research in language testing and information about published tests in order to make informed decisions.

The author hopes that she has given some insight into the item mentioned first in this list of requirements. The rest is up to you, the readers, and to those who will be the developers and users of CATs in the coming century.

Notes

1. Such was the case when the author of this report began in 1988-89 to think about creating a listening comprehension CAT. At that time, commercial software CAT packages (e.g., MicroCAT) were not equipped with an audio interface to provide a listening CAT so the author had to begin a software development project that created a CAT able to interface text and graphics/photographs with digitized speech in the test items.

2. Work is presently underway at Georgia State to create a bank of CAT items that assess students' preparedness to be effective nonparticipatory listeners of English lectures at a university. The English-for-academic-purposes (EAP) listening CAT will evaluate examinees' listening comprehension in terms of these same nine levels of achievement (novice-low through superior).

3. The item writers' intuition and pedagogical experience guided the construction of the initial pool of items. Item Response Theory (IRT) statistical analysis was then used to check the level of difficulty (or easiness) associated with each item. Since the sample size providing the IRT item parameters was quite small (n=255), caution needs to be exercised when interpreting the initial set of IRT parameters that drive selection of test items for examinees. Continued field testing should help determine whether the parameters are accurate or not.

References

American Council on the Teaching of Foreign Languages. (1986). ACTFL Proficiency Guidelines. Hastings-on-Hudson, NY: ACTFL.

Bachman, L., & Palmer, A. (1996). Language Testing in Practice. New York: Oxford University Press.

Bergstrom, B., & Gershon, R. (Winter 1994). Computerized adaptive testing for licensure and certification. CLEAR Exam Review, 25-27.

Buck, G. (1991). The testing of listening comprehension: An introspective study. Language Testing, 8, 67-91.

Chalhoub-Deville, M. (in press). Important considerations in constructing second language computer adaptive tests. Shiken, Japan: The Japanese Association for Language Teaching.

Chalhoub-Deville, M., Alcaya, C., & Lozier, V. (1996). An operational framework for constructing a computer-adaptive test of L2 reading ability: Theoretical and practical issues. CARLA Working Paper Series #1. Minneapolis, MN: University of Minnesota.

Dunkel, P., Henning, G., & Chaudron, C. (1993). The assessment of a listening comprehension construct: A tentative model for test specification and development. Modern Language Journal, 77, 180-191.

Dunkel, P. (1991). Computerized Testing of Nonparticipatory L2 Listening Comprehension Proficiency: An ESL Prototype Development Effort. Modern Language Journal, 75(1), 64-73.

Gorsuch, G., & Griffee, D. (1997). Creating and using classroom tests. Language, 21, 27-31.