Chapter Two
Basic Considerations in Test Design

The concepts of validity, reliability and efficiency affect all aspects of test design, irrespective of the prevailing linguistic paradigm. In Chapter One the relative importance of these concepts in recent approaches to language testing was reviewed. In this chapter the nature of these key concepts is examined in more detail. The status of the various types of validity, and how the concept of validity relates to those of efficiency and reliability, are examined. Chapters Three and Four are more practically oriented; they examine the specification and realisation of the theoretical foundations discussed in the first two chapters.

2.1 The concept of validity

2.1.1 Construct validity

The concept of validity (does the test measure what it is intended to measure?) can be approached from a number of perspectives; the relationship between these is interpreted in a number of ways in the literature. The most helpful exegesis regards construct validity as the superordinate concept embracing all other forms of validity. Anastasi (1982, p. 153) is of the opinion that: 'content, criterion-related and construct validation do not correspond to distinct or logically co-ordinate categories. On the contrary, construct validity is a comprehensive concept which includes the other types.' Cronbach (1971, p. 463) commented that: 'Every time an educator asks "but what does the instrument really measure?" he is calling for information on construct validity.' Anastasi (1982, p. 144) defined it as:

the extent to which the test may be said to measure a theoretical construct or trait ... Each construct is developed to explain and organise observed response consistencies. It derives from established inter-relationships among behavioral measures ... Focusing on a broader, more enduring and more abstract kind of behavioral description ... construct validation requires the gradual accumulation of information from a variety of sources. Any data throwing light on the nature of the trait under consideration and the conditions affecting its development and manifestations are grist for this validity mill.

She argued that the theoretical construct, trait or behaviour domain measured by any test can be defined in terms of the operations performed in establishing the validity of the test. She was careful to emphasise that the construct measured by a particular test (1982, p. 155): 'can be adequately defined only in the light of data gathered in the process of validating that test ... It is only through the empirical investigation of the relationship of test scores to other external data that we can discover what a test measures.' The view expressed below differs only insofar as external empirical data are seen as a necessary but not a sufficient condition for establishing the adequacy of a test for the purpose for which it was intended.

Though there is a lack of an adequate theoretical framework for the construction of communicative tests, this does not absolve test constructors from trying to establish a priori construct validity for a test conceived within the communicative paradigm. A test should always be designed on a principled basis, however limited the underlying theory, and, wherever possible after its administration, statistical validation procedures should be applied to the results to determine how successful the test has been in measuring what it intended to measure.
In the past little attention has been accorded to the non-statistical aspects of construct validity. In the earlier psychometric-structuralist approach to language testing (see Section 1.2) the prevailing theoretical paradigm lent itself easily to testing discrete elements of the target language, and little need was seen for much a priori deliberation on the match between theory and test. Additionally, the empiricism and operationalism of those working in educational measurement made the idea of working with non-objective criteria unattractive. The notions of concurrent and predictive validity, more consistent with the principles of operationalism and the desire for an objective external criterion, took precedence.

Construct validity is viewed from a purely statistical perspective in much of the recent American literature (see Palmer et al., 1981; Bachman and Palmer, 1981a). It is seen principally as a matter of the a posteriori statistical validation of whether a test has measured a construct which has a reality independent of other constructs. The concern is much more with the a posteriori relationship between a test and the psychological abilities, traits and constructs it has measured than with what should have been elicited in the first place. To establish the construct validity of a test statistically, it is necessary to show that it correlates highly with indices of behaviour that one might theoretically expect it to correlate with, and also that it does not correlate significantly with variables that one would not expect it to correlate with. An interesting procedure for investigating this is the convergent and discriminant validation process first outlined by Campbell and Fiske (1959) and later used by Bachman and Palmer (1981b). The latter argue that the strong effect of test method that they discovered points to the necessity of employing a multi-trait, multi-method matrix as a research paradigm in construct validation studies. They found that the application of confirmatory factor analysis to these data enabled them to quantify the effects of trait and method on the measurements of proficiency employed and provided a clearer picture of this proficiency than was available through other methods.

The experimental design of the multi-trait, multi-method matrix has been criticised (see Low, 1985), especially in relation to more direct tests of language proficiency, but it is nevertheless deserving of further empirical investigation as so few studies have been reported, particularly from this side of the Atlantic. It is a potentially useful, additional measure for clarifying what it is that we have measured in a particular application of a test. The only difficulty in employing this technique is that, to be effective, a high degree of test reliability is essential, as error variance is likely to confound the results.

In contrast to this emphasis on a posteriori statistical validation there is a body of opinion which holds that there is an equally important need for construct validation at the a priori stage of test design and implementation. Cronbach (1971, p. 443) believes that: 'Construction of a test itself starts from a theory about behaviour or mental organisation derived from prior research that suggests the ground plan for the test.' Davies (1977, p.
63) argued in a similar vein: 'it is, after all, the theory on which all else rests; it is from there that the construct is set up and it is on the construct that validity, of the content and predictive kinds, is based.' Kelly (1978, p. 8) supported this view, commenting that: 'the systematic development of tests requires some theory, even an informal, inexplicit one, to guide the initial selection of item content and the division of the domain of interest into appropriate sub-areas.'

It would seem self-evident that the more fully we are able to describe the theoretical construct we are attempting to measure, at the a priori stage, the more meaningful might be the statistical procedures contributing to construct validation that can subsequently be applied to the results of the test. Statistical data do not in themselves generate conceptual labels. We can never escape from the need to provide clear statements concerning what is being measured, just as we are obliged to investigate how adequate a test is in operation, through available statistical procedures.

2.1.2 Content validity

Because we lack an adequate theory of language in use, a priori attempts to determine the construct validity of proficiency tests involve us in matters which relate more evidently to content validity. The more a test simulates the dimensions of observable performance and accords with what is known about that performance, the more likely it is to have content and construct validity. We can often only talk about the communicative construct in descriptive terms and, as a result, we become involved in questions of content relevance and content coverage. Thus, for Kelly (1978, p. 8) content validity seems 'an almost completely overlapping concept' with construct validity, and for Moller (1982b, p. 68): 'the distinction between construct and content validity in language testing is not always very marked, particularly for tests of general language proficiency.'

Given the restrictions on the time and resources available to those involved in test construction, especially for use in the classroom, it is often only feasible to focus on the a priori validation of test tasks. In these cases, particular attention must be paid to content validity in an attempt to ensure that the sample of activities to be included in a test is as representative of the target domain as is possible. A primary purpose of many communicative tests is to provide a profile of the student's proficiency, indicating in broad terms the particular modes where deficiencies lie. Content validity is considered especially important for achieving this purpose as it is principally concerned with the extent to which the selection of test tasks is representative of the larger universe of tasks of which the test is assumed to be a sample (see Bachman and Palmer, 1981a).

Anastasi (1982, p. 131) defined content validity as: 'essentially the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured.' She (p. 132) provided a set of useful guidelines for establishing content validity: 1. 'the behaviour domain to be tested must be systematically analysed to make certain that all major aspects are covered by the test items, and in the correct proportions'; 2. 'the domain under consideration should be fully described in advance, rather than being defined after the test has been prepared'; 3.
'content validity depends on the relevance of the individual's test responses to the behaviour area under consideration, rather than on the apparent relevance of item content.'

The directness of fit and the adequacy of the test sample are thus dependent on the quality of the description of the target language behaviour being tested. J.B. Carroll (1961) pointed to the importance of and the difficulties involved in defining the area of language from which the sample is to be taken, and the resultant problems this creates for sampling. Moller (1982b, p. 37) also referred to the problems involved: 'In the case of a proficiency test, however, the test constructors themselves decide the "syllabus" and the universe of discourse to be sampled. The sampling becomes less satisfactory because of the extent and indeterminate nature of that universe.' Establishing content validity is problematic, given the difficulty in characterising language proficiency with sufficient precision to ensure the representativeness of the sample of tasks included in a test. Additional threats to validity may arise out of attempts to operationalise real-life behaviour in a test, especially where some sort of quantification is necessary either in the task or in the method of assessment. These difficulties do not, however, absolve the test constructor from attempting to make tests as relevant in terms of content as is possible.

The procedure of designing a test from a skills specification (see the list drawn up by Munby, 1978 and its attempted implementation by B.J. Carroll, 1981b) may lead to variability in opinions as to what is being tested by specific items. There is a need to establish clear procedures that might reduce this variability. Further, there is a need to look closely at test specifications to make sure that they describe adequately what ought to be tested. A close scrutiny of the specification for a proficiency test by experts in the field (or colleagues in the case of classroom achievement tests) and the relating of the specification to the test as it appears in its final form is essential (see Weir, 1983a). This would provide useful information as to what the test designer was intending to test and how successful the item writers had been in implementing the specification in the test realisation. Mere inspection of the modules in the test, even by language and subject experts, does not necessarily guarantee the identification of the processes actually used by candidates in taking them.

In addition, it would be valuable to employ ethnographic procedures to establish the validity of items in a test. A useful procedure is to have a small sample of the test population introspect on the internal processes that are taking place in their completion of the test items (see Aslanian, 1985; Cohen, 1985). This would provide a valuable check on experts' surface-level judgements on what was being tested and would contribute to the establishment of guidelines for the conduct of this type of methodological procedure in future investigations of test validity. It is crucial for a test supposedly based on specified enabling skills to establish that it conforms to the specifications, especially if claims are made for these being representative of the domain in question. To the extent that the content is made explicit, the concern also becomes one of face validity, which Porter (1983) describes as perhaps the most contentious validity that might be invoked.

2.1.3 Face validity

Anastasi (1982, p.
136) pointed out that face validity:

is not validity in the technical sense; it refers, not to what the test actually measures, but to what it appears superficially to measure. Face validity pertains to whether the test 'looks valid' to the examinees who take it, the administrative personnel who decide on its use, and other technically untrained observers. Fundamentally, the question of face validity concerns rapport and public relations.

Lado (1961), Davies (1965), E. Ingram (1977), Palmer (1981) and Bachman and Palmer (1981a) have all discounted the value of face validity. Bachman and Palmer (1981a, p. 55) argue as follows:

Since there is no generally accepted procedure for determining whether or not a test demonstrates this characteristic, and since 'it is not an acceptable basis for interpretative inferences from test scores', we feel it has no place in the discussion of test validity.

If a test does not have face validity, though, it may not be acceptable to the students taking it, or to the teachers and receiving institutions who may make use of it. If the students do not accept it as valid, their adverse reaction to it may mean that they do not perform in a way which truly reflects their ability. Anastasi (1982, p. 136) took a similar line:

Certainly if test content appears irrelevant, inappropriate, silly or childish, the result will be poor co-operation, regardless of the actual validity of the test. Especially in adult testing, it is not sufficient for a test to be objectively valid. It also needs face validity to function effectively in practical situations.

The usual empirical caveat of course applies (Anastasi, 1982, p. 136): 'To be sure, face validity should never be regarded as a substitute for objectively determined validity ... The validity of the test in its final form should always be directly checked.' Stevenson (1985b) expresses a similar concern that construct and content validities should not be sacrificed at the altar of an increased lay acceptance of non-technical face validity.

2.1.4 Washback validity

The difficulties of precisely determining what it is that needs to be measured perhaps argue for a greater concern with what has recently been termed 'washback validity' (Morrow, 1986), or more commonly (Porter, 1983; Weir, 1983a) the washback of the test on the teaching and learning that precedes it. Given that language teachers operating in a communicative framework normally attempt to equip students with skills that are judged to be relevant to present or future needs, and to the extent that tests are designed to reflect these, the closer the relationship between the test and the teaching that precedes it, the more the test is likely to have construct validity. In other circumstances the tail may wag the dog, in that a communicative approach to language teaching is more likely to be adopted when the test at the end of a course of instruction is itself communicative. A test can be a very powerful instrument for effecting change in the language curriculum, as recent developments in language tests in Sri Lanka have shown. A suitable criterion for judging communicative tests in the future might well be the degree to which they satisfy students, teachers and future users of test results, as judged by some systematic attempt to gather quantifiable data on the perceived validity of the test.
If the test passes the first a priori validity hurdle it is then worthwhile establishing its validity against external criteria, through confirmatory a posteriori statistical analysis. If the first stage, with its emphasis on construct, content, face and washback validities, is bypassed, then we should not be too surprised if the type of test available for external validation procedures does not suit the purpose for which it was intended. For construct, content, face and washback validity, knowing what the test is measuring is crucial. There is a further type of validity, which we might term criterion-related validity, where knowing exactly what a test measures is not so crucial.

2.1.5 Criterion-related validity

This is a predominantly quantitative and a posteriori concept, concerned with the extent to which test scores correlate with a suitable external criterion of performance: what Ingram (1977, p. 18) termed 'pragmatic validity'. Criterion-related validity divides into two types (see Davies, 1977): concurrent validity, where the test scores are correlated with another measure of performance, usually an older established test, taken at the same time (see Kelly, 1978; Davies, 1983), and predictive validity, where test scores are correlated with some future criterion of performance (see Bachman and Palmer, 1981a).

For many authorities, external validation based on data is always superior to the 'armchair speculation of content validity'. Davies (1983, p. 1) has argued forcefully that external validation based on data is always to be preferred: 'The external criterion, however hard to find and however difficult to operationalise and quantify remains the best evidence of a test's validity. All other evidence, including reliability and the internal validities is essentially circular.' And he quotes Anastasi on the need for independently gathered external data: 'Internal analysis of the test, through item-test correlations, factorial analysis of test items, etc., is never an adequate substitute for external validation.'

Though this concept of criterion-related validity is more in keeping with the demands of an empiricist-operationalist approach, the problem remains that a test can be valid in this way without our necessarily knowing what the test is measuring, i.e., it tells us nothing directly about its construct validity. Morrow (1979, p. 147) drew attention to the essential circularity of employing these types of validity in support of a test:

Starting from a certain set of assumptions about the nature of language and language learning will lead to language tests which are perfectly valid in terms of these assumptions but whose value must inevitably be called into question if the basic assumptions themselves are challenged.

For Jakobovits (1970, p. 75) the very possibility of being able to construct even one communicative test appeared problematic: 'the question of what it is to know a language is not well understood and, consequently, the language proficiency tests now available and universally used are inadequate because they attempt to measure something that has not been well-defined.' Even if it were possible to construct a valid communicative test there would still be problems in establishing sufficiently valid criterion measures against which to correlate it. Hawkey (1982, p.
153) felt this to be particularly problematic for tests conceived within a communicative paradigm: 'At this developmental stage in communicative testing, other tests available as criteria for concurrent validation are likely to be less integrative/communicative in construct and format and thus not valid as references for direct comparison.' There is a distinct danger that one might be forced to place one's faith in a criterion measure which may in itself not be a valid measure of the construct in question. One cannot claim that a test has criterion-related validity because it correlates highly with another test, if the other test itself does not measure the criterion in question. It seems pointless to validate a test conceived within the communicative paradigm against tests resulting from earlier paradigms if they are premised on such different constructs. Similarly, if equivalent but less efficient tests are not available against which to correlate, other external non-test criteria might need to be established. Establishing these non-test criteria for validating communicative tests could well be problematic. Even if one had faith in the external criterion selected, for example a sample of real-life behaviour, the quantification process to which this behaviour might have to be subjected in order to make it operational might negate its earlier direct validity.

Though caution is advocated in the interpretation of these criterion-related validity measures, they are still considered to be potentially useful concepts. For example, one might be very wary of tests that produced results seriously at variance with those of other tests measuring the same trait, especially if the latter had been found to have construct validity. It is particularly important to try to establish criterion-related validity for a test through empirical monitoring whenever the candidates' futures may be affected by its results. For example, given the variety of language qualifications currently acceptable as evidence of language proficiency for entry into tertiary-level study in Britain (see Weir, 1983a), there is some cause for concern about the equivalence of such a broad spectrum of tests. Where is the empirical evidence for the equivalence of one entry qualification with another?

In the case of predictive validity, it may be that in certain circumstances the predictive power of the test is all that is of interest. If all one wants is to make certain predictions about future performance on the basis of the test results, this might entail a radically different test from one where the interest is in providing information to allow effective remedial action to be taken. If predictions made on the basis of the test are reasonably accurate, then the nature of the test items and their content might not be so important.

Incidentally, both validity and reliability estimates based on correlational data must be treated with caution. A high correlation may indicate the measurement of two different attributes which are themselves quite highly correlated among the population of examinees. On the other hand, a low correlation may indicate that two quite different attributes are being measured, or may merely reflect a high level of error variance in one or both of the tests.
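The point about error variance can be made more concrete with the classical attenuation relationship, under which the correlation observed between two fallible measures is the correlation between the underlying attributes scaled down by the unreliability of each measure. The sketch below is purely illustrative: the reliability and correlation figures are invented for the example and are not drawn from any study cited in this chapter.

```python
import math

def attenuated_correlation(true_correlation, reliability_x, reliability_y):
    """Classical attenuation: the correlation observed between two fallible
    measures is the 'true' correlation between the underlying attributes,
    scaled down by the square root of the product of their reliabilities."""
    return true_correlation * math.sqrt(reliability_x * reliability_y)

# Hypothetical figures: two tests of the same construct (true correlation 0.90),
# one of which is marked rather unreliably.
print(attenuated_correlation(0.90, reliability_x=0.95, reliability_y=0.55))  # about 0.65

# Conversely, two highly reliable tests of different but related constructs
# can still correlate substantially, so a high coefficient is not in itself
# proof that the two tests measure the same thing.
print(attenuated_correlation(0.75, reliability_x=0.95, reliability_y=0.95))  # about 0.71
```

On these invented figures, an observed coefficient of about 0.65 could equally well reflect a strong underlying relationship obscured by unreliable marking, which is precisely the ambiguity the caveat above warns against.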
2.1.6 How should a test be known?

Most GCSE examinations and existing language proficiency examinations, e.g., the University of Cambridge Local Examinations Syndicate (UCLES) Certificate of Proficiency in English (CPE) and the Joint Matriculation Board (JMB) Test in English (Overseas), because of their public, operational nature, are not overly interested in concurrent or predictive validity, whereas, as Davies (1982) has pointed out, these are matters of major concern for most standardised, closed EFL tests. Correlating the results of one year's examination with other examinations or against some future criterion is perhaps viewed as a pointless exercise when a new set of examinations is already in preparation for the following year and results have already been issued for current candidates. Only closed tests, such as the Associated Examining Board's Test in English for Educational Purposes, the TEEP test (see Appendix I), or the UCLES and British Council English Language Testing Service (ELTS) battery (see Appendix V), feel obliged to concern themselves with a posteriori validation procedures. Open examinations which are held annually tend to rely more heavily on construct (non-statistical), content and face validity.

In situations where the test is to have a diagnostic function, a high degree of explicitness at the a priori stage of test construction is felt to be necessary. This is particularly so where the aim is to provide meaningful statements on a candidate's performance which would be of use to those providing remedial support for candidates with known difficulties. If the concern is to collect appropriate information on a candidate's performance for the purposes of profile reporting rather than to establish a test's predictive validity, then there is more obligation to improve the content/construct validity of the test by identifying, prior to test construction, appropriate communicative tasks which it should include. This a priori validation is essentially a first, though crucial, step in the total validation process of a test. Having made rigorous attempts at an a priori stage to make the test as valid as possible, there is then a need to establish the validity of the test against external criteria. If the first stage, with its emphasis on content validity, is bypassed, then the type of test available for external validation procedures would not, in all likelihood, suit the purpose for which the test is intended.

To illustrate the recent awakening of interest in a priori validation of tests it might be useful to take a concrete example of the construction of a test for a particular purpose. Let us assume the task is to construct a proficiency test in English for Academic Purposes (EAP) which can also provide, through profiling, some diagnostic information on the language-related study skills candidates are weak in. A test of discrete grammatical items constructed for this purpose might be found to correlate highly with an external criterion, e.g., another established test concurrently administered or a measure taken at a later date, such as final academic grades. That is, it might possess criterion-related validity. It would, however, be of less value to those providing remedial English language support, who, rather than a single score, require information about the particular study modes in which a student has difficulty operating, i.e., they might be better served by a test exhibiting construct, content and face validity.
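To make concrete what establishing criterion-related validity typically involves, the sketch below correlates a set of entry test scores with a later external criterion, here final grades. The figures and variable names are hypothetical, invented only to illustrate the calculation; a real study would use scores from an actual administration of the test.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: proficiency test scores at entry and final academic grades
# for the same eight candidates.
test_scores = [42, 55, 61, 48, 70, 66, 39, 58]
final_grades = [51, 60, 68, 50, 75, 70, 45, 62]

print(f"predictive validity coefficient: {pearson(test_scores, final_grades):.2f}")
```

A high coefficient of this kind would support the predictive use of the test, but it would say nothing about the particular study modes in which a weak candidate needs remedial help.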
One would not be able to allocate students effectively to remedial language programmes on the basis of performance in a discrete-point structuralist test lacking these validities. The a priori validation of an EAP proficiency test with diagnostic potential would seem to demand that we test integrated macro-skills rather than micro-elements in isolation. If the aim is to test the communicative competence of overseas students in an EAP setting, it is doubtful whether tests of linguistic competence alone are appropriate, because the constructs for such tests are necessarily based on discrete linguistic levels, not on integrative work samples. Since the essence of communication is an ability to combine discrete linguistic elements in a particular context, it seems essential that this ability should be assessed by tests of integrated skills rather than by tests of discrete linguistic levels in isolation.

The content of an EAP proficiency test based on work samples from the target situation would be qualitatively different from the content of a test of linguistic competence based upon discrete linguistic items. In the case of the EAP proficiency test which aims at assessing communicative competence, the main justification for item selection would be a careful sampling of the communicative tasks required of students in English-medium study. In the case of a test of linguistic competence, a test may be considered valid if its content is based on an adequate sample of 'typical' discrete linguistic elements. According to Canale and Swain (1980, p. 34) communicative testing: 'must be devoted not only to what the learner knows about the second language and about how to use it (competence) but also to what extent the learner is able to actually demonstrate this knowledge in a meaningful communicative situation.'

The proficiency tester today is influenced by what Moller (1981b) has described as the sociolinguistic-communicative paradigm. The nature of communicative testing was discussed in Section 1.4 above. Briefly, a test within this communicative paradigm might be expected to exhibit the following features:

• There would be an emphasis on interaction between participants, and the resultant intersubjectivity would determine how the encounter evolves and ends.
• The form and content of the language produced would be to some extent unpredictable.
• It would be purposive in the sense of fulfilling some communicative function.
• It would employ domain-relevant texts and authentic tasks.
• Abilities would be assessed within meaningful and developing contexts and a profile of performance on these made available.
• Where deemed appropriate and feasible, there might be an integration of the four skills of reading, listening, speaking and writing.
• The appropriateness of language used for the expression of functional meaning would have high importance.
• It would use direct testing methods, with tasks reflecting realistic discourse processing.
• The assessment of productive abilities would most probably be qualitative rather than quantitative, involving the use of rating scales relating to categories of performance.

Thus a good deal more attention will have to be paid to content and face validity than was the case within previous orthodoxies.
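Where, as suggested above, a profile of performance is to be reported, the outcome of the test is not a single score but a set of ratings across skills. A minimal sketch of how such a profile might be represented is given below; the skill labels, the four-band scale and the descriptors are hypothetical and are not taken from any particular test battery.

```python
from dataclasses import dataclass, field

# Hypothetical band descriptors for a single productive skill (writing).
WRITING_BANDS = {
    1: "intrusive errors; the message largely fails to communicate",
    2: "message communicated, but with noticeable effort from the reader",
    3: "adequate performance of the task; occasional strain",
    4: "fully adequate, appropriate performance of the task",
}

@dataclass
class Profile:
    """A candidate's profile: one band rating per skill rather than one total."""
    candidate: str
    bands: dict = field(default_factory=dict)  # e.g. {"reading": 3, "writing": 2}

    def weakest_skills(self):
        """Skills in which remedial support is most needed."""
        lowest = min(self.bands.values())
        return [skill for skill, band in self.bands.items() if band == lowest]

profile = Profile("candidate 017",
                  {"reading": 3, "listening": 4, "writing": 2, "speaking": 3})
print(profile.weakest_skills())                # ['writing']
print(WRITING_BANDS[profile.bands["writing"]])
```

Reporting in this form preserves the diagnostic information that a single aggregated score would throw away.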
However, given the rudimentary state of the art in communicative approaches to language testing, some authorities still feel it would be prudent to retain a number of components which sample major linguistic categories, for, as Moller (1981b, p. 44) argued:

It is clear that communicative testing does test certain aspects of proficiency. But it is important to be aware that testing language proficiency does not amount just to communicative testing. Communicative language performance is clearly an element in, or a dimension of, language proficiency. But language competence is also an important dimension of language proficiency and cannot be ignored. It will also have to be tested in one or more of the many ways that have been researched during the past thirty years. Ignoring this dimension is as serious an omission as ignoring the re-awakening of traditional language testing in a communicative setting.

The revision of the British Council/UCLES ELTS Test (1986-89), which has resulted in the new IELTS battery (see Appendix V), originally included plans for a test of lexis and grammar in the General Component. In the earlier trials of the TEEP test (1979-82) a discrete-item multiple-choice test of grammar had been included in the trial battery and proved to be a robust and valid indicator of General Language Proficiency. The TEEP research (Weir, 1983a) indicated clearly, however, that the grammar component added no additional information to the picture of a candidate's language ability already available from the more communicative, use-based components. For this reason it was dropped from the battery. For similar reasons the tests of lexis and grammar have been dropped from the IELTS battery.

So far we have concentrated on examining ways of improving the validity of tests and have neglected the crucial fact that unless a test is reliable it cannot be valid. The need for reliability in order to guarantee the validity of our tests is the next issue we address.

2.2 The concept of reliability

A fundamental criterion against which any language test has to be judged is its reliability (see Anastasi, 1982; Guilford, 1965). The concern here is with how far we can depend on the results that a test produces or, in other words, whether the results could be produced consistently.

Three aspects of reliability are usually taken into account. The first concerns the consistency of scoring among different markers, e.g., when marking a test of written expression. The degree of inter-marker reliability is established by correlating the scores obtained by candidates from marker A with those from marker B. The consistency of each individual marker (intra-marker reliability) is established by getting them to remark a selection of scripts at a later date and correlating the marks given on the two occasions. (See Anastasi, 1982 for a clear and accessible introduction to the necessary statistics. Also of use are Crocker, 1981 and, more recently, Woods et al., 1986.) The concern of the tester is how to enhance the agreement between markers by establishing, and maintaining adherence to, explicit guidelines for the conduct of this marking. The criteria of assessment need to be established and agreed upon, and markers then need to be trained in the application of these criteria through rigorous standardisation procedures (see Murphy, 1979). During the marking of scripts there needs to be a degree of cross-checking to ensure that agreed standards are being maintained.
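In practice the inter-marker and intra-marker checks described above both reduce to correlating two sets of marks awarded to the same scripts. A minimal sketch, using the correlation function in Python's standard statistics module (available from Python 3.10) and marks invented for the purpose of illustration:

```python
from statistics import correlation

# Hypothetical marks (out of 20) awarded to the same ten scripts.
marker_a          = [14, 11, 17,  9, 15, 12, 18, 10, 13, 16]
marker_b          = [13, 12, 16, 10, 15, 11, 17,  9, 14, 15]   # second marker
marker_a_remarked = [15, 11, 16,  9, 14, 12, 18, 11, 13, 16]   # marker A remarking later

print(f"inter-marker reliability: {correlation(marker_a, marker_b):.2f}")
print(f"intra-marker reliability: {correlation(marker_a, marker_a_remarked):.2f}")
```

Coefficients falling well below about 0.8, a conventional rule of thumb rather than a fixed standard, would normally prompt further standardisation of the markers before scores were reported.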
It is also considered necessary to try to ensure that relevant sub-tests are internally consistent, in the sense that all items in a sub-test are judged to be measuring the same attribute. The Kuder-Richardson formulae for estimating this internal consistency are readily available in most statistics manuals (see Anastasi, 1982, pp. 114-16). The third aspect of reliability is that of parallel-forms reliability, the requirements of which have to be borne in mind when future alternative forms of a test have to be devised. This is often very difficult to achieve, for both theoretical and practical reasons. To achieve it, two alternative versions of a test need to be produced which are in effect clones of each other. The parallel-forms reliability of the two versions is judged by how closely their results agree when both are administered to the same test population. Less frequently, reliability is checked by the test-retest method, where the same test is readministered to the same sample population after a short intervening period of time.

The concept of reliability is particularly important when considering language tests within the communicative paradigm (see Porter, 1983). For as Davies (1965, p. 14) stressed: 'reliability is the first essential for any test; but for certain kinds of language test may be very difficult to achieve.'

2.3 Validity and reliability: an inevitable tension?

Given the normal limitations affecting test development (especially of achievement tests in the classroom), concern usually centres on validation at the test construction stage and only to a lesser extent on a posteriori validation at the performance stage. The resources to do large-scale concurrent and predictive validity studies, such as those conducted by Moller (1982b) and by the Institute of Applied Language Studies at the University of Edinburgh on the ELTS battery, are not normally available. The concern is often by necessity with content, construct and face validity, though the predictive and concurrent validity of tests should always be examined as far as circumstances allow. Validation might prove to be a sterile endeavour, however, unless care has also been taken over test reliability.

The problem is that while one can have test reliability without test validity, a test can only be valid if it is also reliable. There is thus sometimes said to be a reliability-validity tension (see Guilford, 1965 and Davies, 1978). This tension exists in the sense that it is sometimes essential to sacrifice a degree of reliability in order to enhance validity. If, however, validity is lost to increase reliability, we finish up with a test which is a reliable measure of something other than what we wish to measure. The two concepts are, in certain circumstances, mutually exclusive, but if a choice has to be made, validity 'after all, is the more important' (see Guilford, 1965, p. 481).

Rea (1978) argued that simply because tests which assess language as communication cannot automatically claim high standards of reliability in the same way that discrete-item tests are able to, this should not be accepted as a justification for continued reliance on highly reliable measures having very suspect validity. Rather, we should first be attempting to obtain more reliable measures of communicative abilities. This seems a less extreme and more sensible position than that adopted by Morrow (1979, p.
151), who argued polemically that: 'Reliability, while clearly important, will be subordinate to face validity. Spurious objectivity will no longer be a prime consideration.' Rea's viewpoint was shared by Read (1981a, pp. x-xi), who reported that a recurring theme at the April 1980 RELC Seminar on 'Evaluation and Measurement of Language Competence and Performance' was that: 'subjective judgements are indispensable if we are to develop testing procedures that validly reflect our current understanding of the nature of language proficiency and our contemporary goals in language teaching.' Read went on to emphasise that: 'this does not mean a return to the old pre-scientific approach. It is generally accepted that a substantial, verifiable level of reliability must also be attained, if test results are to have any meaning.' Moller adopted a similar approach (1981a, p. 67):

While it is understood that a valid test must be reliable, it would seem that in such a highly complex and personal behaviour as using a language other than one's mother tongue, validity could be claimed for measures that might have a lower than normally acceptable level of reliability.

He argued that, although reliability is something we should always try to achieve in our tests, 'it may not always be the prime consideration', and offers a possible compromise position (p. 67):

In constructing test batteries that contain different types of task, for example, certain of the sub-tests may be required to exhibit a high degree of reliability. Other sub-tests, particularly tests of communicative use, may quite properly exhibit lower reliability without adversely affecting the overall validity of the battery.

Hawkey (1982, p. 149) commented in a similar vein:

the reliability of a test cannot be ignored without a harmful effect on the validity of the instrument. But it is likely that, if the construct validity of communicative tests is to be ensured, the reliability question is going to have to be accepted as subordinate, though worked at fairly hard by item analysis and correlational operations.

Validity is important also because it is related to the way in which test performance levels are defined. Houston (1983) describes the difference between norm- and criterion-referenced methods of defining levels and discusses some of the difficulties of specifying appropriate performance criteria when the latter method is chosen. Popham (1978, p. 2) provided the following functional definitions of these approaches:

a criterion-referenced test is designed to produce a clear description of what an examinee's performance on the test actually means. Rather than interpreting an examinee's test performance in relationship to the performance of others as is the case with many traditional tests, a good criterion-referenced test yields a better picture of just what it is that the examinee can or cannot do.

Davies (1978, p. 158) made the connection with language testing and expressed certain reservations about criterion-referenced tests:

there are difficulties in using criterion-referenced tests for language: there is no finite inventory of learning points or items; there are very many behavioural objectives; there are variable (or no) external criteria of success, fluency, intelligibility, etc.; there is no obvious way of establishing adequate knowledge, of saying how much of a language is enough.

Thus some of the difficulties referred to later by Houston (1983) are put in a language testing context.
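The distinction can be made concrete by interpreting the same raw score in both ways: once against the performance of the other examinees (norm-referencing) and once against a pre-specified standard of adequate performance (criterion-referencing). The cohort scores and the cut-off in the sketch below are invented for illustration.

```python
def percentile_rank(score, all_scores):
    """Norm-referenced interpretation: where does this score stand relative
    to the rest of the examinee population?"""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

def mastery_decision(score, cut_score):
    """Criterion-referenced interpretation: has the examinee reached the
    pre-specified standard of adequate performance, whatever others scored?"""
    return "meets criterion" if score >= cut_score else "does not meet criterion"

cohort = [34, 41, 45, 47, 52, 55, 58, 61, 66, 72]   # hypothetical raw scores
candidate = 55

print(f"norm-referenced: {percentile_rank(candidate, cohort):.0f}th percentile")
print(f"criterion-referenced: {mastery_decision(candidate, cut_score=60)}")
```

On these invented figures the same candidate appears average under the first reading and inadequate under the second, which is why the defensibility of the criterion itself matters so much.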
Clearly, criterion-referencing of performance levels is possible only to the extent that the test has a high degree of content validity.

2.4 Test efficiency

A valid and reliable test is of little use if it does not prove to be a practical one. This involves questions of economy, ease of administration, scoring, and interpretation of results. The longer it takes to construct, administer and score, and the more skilled personnel and equipment that are involved, the higher the costs are likely to be. The duration of the test may affect its successful operation in other ways: a fatigue effect on the candidates, and administrative factors such as the staff needed to invigilate and the availability of rooms in which to sit the examination, all have to be taken into consideration. It is thus highly desirable to make the test as short as possible, consistent with the need to meet the validity and reliability criteria referred to above. If the aim is to provide as full a profile of the student's abilities as is possible then there is obviously a danger of conflict, for although hard-pressed administrators seem to want a single overall grade, remedial language teachers would prefer as much information as possible (see Moller, 1977; Alderson and Hughes, 1981; and Porter, 1983).

To provide profiles rather than standard scores, each part of the profile will need to reach an acceptable degree of reliability. To achieve satisfactory reliability, communicative tests may have to be longer and have multiple scoring. The difficulties in ensuring that the test contains a representative sample of tasks may also serve to lengthen it. To enhance validity by catering for specific needs and profiling, more tests will be needed, thus further raising the per-capita costs as compared to those of single general tests available for large populations.

Efficiency, in the sense of financial viability, may prove the real stumbling block in the way of the development of communicative tests. Tests of this type are difficult and time-consuming to construct, require more resources to administer, demand careful training and standardisation of examiners, and are more complex and costly to mark and report results on. The increased per-capita cost of using communicative tests in large-scale testing operations may severely restrict their use. However problematic, there is clearly an imperative need to try to develop test formats and evaluation criteria that provide the best overall balance among reliability, validity and efficiency in the assessment of communicative skills.

In the survey of developments in language testing in Chapter One we noted that the pendulum of change had swung several times and that differing emphases had been given in test design and implementation to the demands of reliability, validity and efficiency. In this chapter these concepts have been examined from a deeper, theoretical perspective. In Chapter Three we return to more practical concerns, and the stages in the development of a test are briefly outlined to give an idea of the processes that are normally followed in the design and implementation of a language test. This is followed in Chapter Four by an examination of a range of formats available for testing language skills within the communicative paradigm, and an assessment is made of their advantages and disadvantages in the light of the discussion in Chapters One and Two.
Chapter Four is intended to be of immediate practical use to the reader faced with the problem of selecting appropriate test formats. It outlines what might be the best choices given the uncertain state of the art in communicative testing and a desire, nevertheless, to make a test as communicative as possible within the constraints imposed by considerations of practicality and reliability.