Dimensions in the Diversity of Language: A Language Testing Perspective

Writer:
Don Porter, Centre for Applied Language Studies, University of Reading, England


Diversity in language performance

Bachman (1990) forcefully draws attention to the instability of learner behaviour across different tests of the same skill, e.g.:

Some test takers, for example, may perform better in the context of an oral interview than they would sitting in a language laboratory speaking into a microphone in response to statements and questions presented through a pair of earphones. And individuals who generally perform well in oral interviews may find it difficult to speak if the interviewer is someone they do not know... 'Live' versus recorded presentation of aural material, personality of examiner, filling in the blanks of isolated sentences...are but a few examples of the ways in which the methods we employ in language tests can vary. (p.111)


The result, it is implied, is that two tests which purportedly measure the same linguistic ability, but by different methods, may differ in the account they give of an individual test taker's linguistic ability -- a situation which ought to give rise to concern (see also Negishi, 1996, with respect to reading). Until relatively recently, surprisingly little attention was paid to this phenomenon, although it is widely attested in the informal comments of teachers and testers alike, and has the potential to seriously distort attempts at interpreting test results. As is so often the case in language testing, the explanation may lie in the fact that the domain to be tested is frequently not defined with any precision. Teachers, consumers of test results, and often testers themselves rest content with such vague and general concepts as 'reading comprehension', 'oral proficiency', etc., while at least some of the features noted by Bachman as varying from test method to test method, and as eliciting diverse performances from a test taker, are part of the normal conditions of natural language use. Such features -- it would seem reasonable to suggest -- need to be systematically built into test specifications. On the other hand, features of test tasks which affect learner performance but which are not normal conditions of natural language use need to be controlled for or eliminated. Following Guttman (1970), Bachman refers to such characterising features of test tasks as 'test method facets'.

Sources of diversity in performance on language tests

Of course, no test has perfect reliability, so even if a single very good test is given to two language learners of equal ability, simple measurement error will ensure that their results are unlikely to be absolutely identical. Likewise, when one learner takes the same language test twice, the measurement errors of the two administrations combine, and are likely to produce even greater differences between the two assessments. Measurement error is known to arise from misleading prompts, errors in the test key, ambiguities in instructions, etc., as well as from both predictable and unpredictable features of the test taker (Kunnan, 1995). It is obvious that every effort should be made to eliminate potential test-based sources of unreliability in test-taker performance, as these will lead to inaccuracy in assessment.
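The point can be sketched in classical test theory terms. The notation below is ours, offered as a minimal illustration rather than as part of Kunnan's or Bachman's account, and it assumes independent errors of equal variance on the two administrations. An observed score X is modelled as a stable true score T plus a random error E:

    X_i = T + E_i, \quad i = 1, 2
    X_1 - X_2 = E_1 - E_2
    \operatorname{Var}(X_1 - X_2) = \sigma_E^2 + \sigma_E^2 = 2\sigma_E^2

The true score cancels out of the difference between the two administrations, leaving pure error with twice the variance of a single score's error: this is why retesting the same learner can be expected to magnify, not reduce, the apparent disagreement.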

Similarly, lack of validity in one or both tests, or different interpretations of validity in the sense that competing 'models' of the ability in question form the bases for the tests, may produce markedly different assessments of a learner's ability. To avoid lack of validity, every effort must be made in the process of test-construction and development to ensure both that the test is based on an adequate theoretical model of linguistic ability, and that the test is itself an adequate embodiment of that model. Differences in assessment of language ability which stem from inadequacies in the underlying linguistic model, or in the incorporation of that model in a test, must be regarded as error.

However, in the case of tests satisfactorily based on competing but reasonable linguistic models, perhaps capturing different insights into the nature of the ability being measured, and thus having what we might call competing validities, some differences in the eventual assessment are to be expected, and should not be ascribed to measurement error. Users of a test should be made aware that tests may differ in their approach to the assessment of linguistic ability, and that the test they are using has its own special focuses and characteristics. We have to accept, however, that the finer points of the theoretical bases of a test will often be beyond most test users.

As mentioned at the beginning of this paper, in recent years attention has increasingly been paid to the effects on learner performance of 'test method facets', as discussed in Bachman (1990). Attention was drawn to the fact that some of these facets are peculiar to language tests (e.g. speaking to a microphone and responding to pre-recorded utterances presented over headphones; filling in blanks in isolated sentences), while others are a natural part of normal everyday language use (e.g. speaking to a 'live' person and responding to spontaneous utterances; speaking to both known and unknown people; speaking to people with evidently different personalities). What Bachman and others do not make clear is that (a) facets peculiar to language tests which affect test performance are undesirable, and their effects should be minimised if outright elimination is not possible, while (b) facets which affect test performance and which are a natural part of normal language use are desirable in test methods -- indeed requirements, if methods are to be fully valid.

Implications for testing

In this section we consider the general implications for testing of the diversity of method facets found in normal language use. We then turn to some specific implications of the gender of the interlocutor in interview tests, and of mutual acquaintanceship of participants in pair-tasks. Finally, we consider the implications of addressee age in letter-writing tasks. The intention of the discussion is less to focus on the specific facets involved than to consider the issues raised when these facets are built into the test design.

General implications: Research into the effects of test method facets is still in its infancy. Research into those facets which (a) significantly affect foreign language performance, and (b) are a natural part of normal language use, is embryonic. The candidates so far identified for this latter category do, however, suggest that while the systematic inclusion of such facets would substantially enrich the validity of tests, it could simultaneously make them more complex and more time-consuming to administer or to take, and more difficult to report and interpret.

Let us take as an example Bachman's entirely plausible suggestion that the personality of the interlocutor in an interview test might affect the performance of the test taker. It is possible that where the interlocutor and the test taker have similar personalities, the test taker's performance will be enhanced, but where the personality types differ, performance will be weakened. To be fair to all, then, the personalities of all concerned would need to be assessed, and each test taker would need to be interviewed twice -- in two comparable but not identical interviews, of course -- once by a similar-personality interlocutor, and once by a different-personality interlocutor. The question would then arise: Should each of the two performance types be reported separately, as representing two separate sub-types of oral proficiency, or should oral ability be represented by the average of the two performances? The latter would be more practical, of course -- but which would be the more valid?
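A small worked illustration of what averaging conceals (the figures and symbols here are ours, and purely hypothetical): suppose a candidate scores S_{sim} = 70 when interviewed by a similar-personality interlocutor and S_{diff} = 58 by a different-personality one. Then

    \bar{S} = \frac{S_{sim} + S_{diff}}{2} = \frac{70 + 58}{2} = 64, \qquad \Delta = S_{sim} - S_{diff} = 12

The averaged report of 64 is certainly practical, but it discards the spread \Delta = 12: a second candidate scoring 64 in both conditions would receive an identical report despite a very different profile. Reporting the two scores separately, or the mean together with the spread, preserves exactly the information that averaging loses.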

As so often in life, the solution would doubtless need to be some form of compromise, in which as much of the diversity implied by the test method facet would be included as was compatible with a practical test.

The gender of the interlocutor: Research with learners from many different cultural backgrounds (O'Sullivan & Porter, 1996; Porter & Shen Shu-Hung, 1991) indicates that the gender of the interlocutor is almost always a significant method facet. While learners from some cultures perform better when the interlocutor is a man, it is usually the case that learners of either gender perform better when their interlocutor is a woman: It seems that women from many cultures tend to use language in interaction in a more facilitative way. The issues here, then, are directly comparable to those described in relation to the hypothetical case of interlocutor personality. It would seem that, where possible, students should interact with both a male and a female interlocutor.

Mutual acquaintanceship: In a study of the effect of learner acquaintanceship on pair-task performance among Japanese students, O'Sullivan and Porter (1997) found some indication that mutual acquaintanceship might have a beneficial effect on the performance of students at higher levels of proficiency. A reasonably practical implication might be that in pair-tasks students should always be placed in acquaintance pairs, since even at lower proficiency levels no actual impairment of performance would result.

Addressee age: O'Sullivan and Porter (1995) found that Japanese learner-writers consistently produced better quality writing when writing to someone identified as being older than themselves. This clearly implies the importance for the learner of having a specified reader: A generalised writing task may well not elicit the student's best performance.

Conclusion

The incorporation in test tasks of a degree of naturalness in the form of facets from normal language use would seem to be both desirable and feasible.

Bibliography

  • Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
  • Guttman, L. (1970). Integration of test design and analysis. In Proceedings of the 1969 Invitational Conference on Testing Problems. Princeton, NJ: Educational Testing Service.
  • Kunnan, A. J. (1995). Test taker characteristics and test performance. Cambridge: Cambridge University Press.
  • Negishi, M. (1996). Unpublished PhD thesis, University of Reading.
  • O'Sullivan, B., & Porter, D. (1995). The importance of audience age for learner-speakers and learner-writers from different cultural backgrounds. Paper presented at the RELC conference, Singapore.
  • O'Sullivan, B., & Porter, D. (1996). Speech style, gender and oral proficiency interview performance. Paper presented at the RELC conference, Singapore.
  • O'Sullivan, B., & Porter, D. (1997). The effect of learner acquaintanceship on pair-task performance. Paper presented at the RELC conference, Singapore.
  • Porter, D., & Shen Shu-Hung. (1991). Gender, status and style in the interview. The Dolphin, 21. Aarhus: Aarhus University Press.